Machine Learning Engineer (Training Infrastructure)

1 month ago • 3+ years • DevOps

Job Summary

Job Description

We are seeking a Machine Learning Engineer to improve the training infrastructure for large transformer-based models. The role centers on distributed training systems in multi-GPU and multi-node environments: performance engineering for large language model training, implementing parallelization strategies, profiling and optimizing training runs, and building fault-tolerant training systems with checkpointing and recovery.
Must have:
  • 3+ years training large neural networks in production
  • Expert-level PyTorch or JAX for training code
  • Multi-node, multi-GPU training experience
  • Experience with distributed training frameworks
  • GPU memory management and optimization skills
Good to have:
  • Experience training large multi-modal models
  • Deep knowledge of NCCL
  • Experience with mixture of experts architectures
  • Strong NVIDIA GPU programming experience
  • Custom CUDA kernel development
  • Debugging training instability and numerical issues
  • Designing test runs for optimizations
  • Hands-on experience with FP8 or FP4 training
  • Open-source contributions

Job Details

We’re looking for an MLE to scale training of large transformer-based models. You’ll work on distributed training infrastructure, focusing on performance optimization, parallelization, and fault tolerance for multi-GPU and multi-node training environments.

Responsibilities:

  • Performance engineering of training infrastructure for large language models
  • Implementing parallelization strategies across data, tensor, pipeline, and context dimensions
  • Profiling distributed training runs and optimizing performance bottlenecks
  • Building fault-tolerant training systems with checkpointing and recovery mechanisms

Qualifications:

  • 3+ years training large neural networks in production
  • Expert-level PyTorch or JAX for performant and fault-tolerant training code
  • Multi-node, multi-GPU training experience with debugging skills
  • Experience with distributed training frameworks and cluster management
  • Deep understanding of GPU memory management and optimization techniques

Preferred:

  • Experience with distributed training of large multi-modal models, including those with separate vision encoders
  • Deep knowledge of NCCL (e.g. symmetric memory)
  • Experience with mixture of experts architectures and expert parallelism
  • Strong NVIDIA GPU programming experience (Triton, CUTLASS, or similar)
  • Custom CUDA kernel development for training operations
  • Proven ability to debug training instability and numerical issues
  • Experience designing test runs to de-risk large-scale optimizations
  • Hands-on experience with FP8 or FP4 training
  • Track record of open-source contributions (e.g. DeepSpeed, TorchTitan, NeMo)
