Machine Learning Performance Engineer

3 Months ago • All levels • Research Development

Job Summary

Job Description

The job involves optimizing the performance of machine learning models in both training and inference, with a focus on efficient large-scale training, low-latency inference in real-time systems, and high-throughput inference in research. This includes improving CUDA code and taking a whole-systems approach that covers storage systems, networking, and host- and GPU-level considerations. The role requires debugging training run performance end to end and understanding GPU hardware and networking technologies in depth.
Must have:
  • Understanding of modern ML techniques and toolsets.
  • Experience debugging a training run's performance end to end.
  • Low-level GPU knowledge of PTX, SASS, warps, etc.
  • Debugging and optimization experience using tools like CUDA GDB.
Good to have:
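  • Library knowledge of Triton, CUTLASS, CUB, Thrust, cuDNN and cuBLAS.
  • Intuition about the latency and throughput characteristics of CUDA graph launch, tensor core arithmetic, warp-level synchronization and asynchronous memory loads.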
  • Background in Infiniband, RoCE, GPUDirect, PXN, rail optimization.
  • An understanding of the collective algorithms supporting distributed GPU training.

Job Details

We are looking for an engineer with experience in low-level systems programming and optimisation to join our growing ML team. 

Machine learning is a critical pillar of Jane Street's global business. Our ever-evolving trading environment serves as a unique, rapid-feedback platform for ML experimentation, allowing us to incorporate new ideas with relatively little friction.

Your part here is optimising the performance of our models – both training and inference. We care about efficient large-scale training, low-latency inference in real-time systems and high-throughput inference in research. Part of this is improving straightforward CUDA, but the interesting part needs a whole-systems approach, including storage systems, networking and host- and GPU-level considerations. Zooming in, we also want to ensure our platform makes sense even at the lowest level – is all that throughput actually goodput? Does loading that vector from the L2 cache really take that long?

If you’ve never thought about a career in finance, you’re in good company. Many of us were in the same position before working here. If you have a curious mind and a passion for solving interesting problems, we have a feeling you’ll fit right in. 

There’s no fixed set of skills, but here are some of the things we’re looking for:

  • An understanding of modern ML techniques and toolsets
  • The experience and systems knowledge required to debug a training run’s performance end to end
  • Low-level GPU knowledge of PTX, SASS, warps, cooperative groups, Tensor Cores and the memory hierarchy
  • Debugging and optimisation experience using tools like CUDA GDB, Nsight Systems and Nsight Compute
  • Library knowledge of Triton, CUTLASS, CUB, Thrust, cuDNN and cuBLAS
  • Intuition about the latency and throughput characteristics of CUDA graph launch, Tensor Core arithmetic, warp-level synchronisation and asynchronous memory loads
  • Background in Infiniband, RoCE, GPUDirect, PXN, rail optimisation and NVLink, and how to use these networking technologies to link up GPU clusters
  • An understanding of the collective algorithms supporting distributed GPU training in NCCL or MPI
  • An inventive approach and the willingness to ask hard questions about whether we're taking the right approaches and using the right tools
  • Fluency in English

 

If you're a recruiting agency and want to partner with us, please reach out to agency-partnerships@janestreet.com.


About The Company

Jane Street is a quantitative trading firm with offices in New York, London, Hong Kong, Singapore, and Amsterdam. We are always recruiting top candidates and we invest heavily in teaching and training. The environment at Jane Street is open, informal, intellectual, and fun. People grow into long careers here because there are always new and interesting problems to solve, systems to build, and theories to test.


