Machine Learning Engineer (Training Infrastructure)

Job Summary

We are seeking a Machine Learning Engineer to enhance the training infrastructure for large transformer-based models. You will work on distributed training systems, focusing on performance optimization, parallelization strategies, and fault tolerance in multi-GPU and multi-node environments. Key responsibilities include performance engineering for large language model training, implementing parallelization techniques, profiling and optimizing training runs, and building robust, fault-tolerant training systems with checkpointing and recovery.
Must have:
  • 3+ years training large neural networks in production
  • Expert-level PyTorch or JAX for training code
  • Multi-node, multi-GPU training experience
  • Experience with distributed training frameworks
  • GPU memory management and optimization skills
Good to have:
  • Experience training large multi-modal models
  • Deep knowledge of NCCL
  • Experience with mixture of experts architectures
  • Strong NVIDIA GPU programming experience
  • Custom CUDA kernel development
  • Debugging training instability and numerical issues
  • Designing test runs for optimizations
  • Hands-on experience with FP8 or FP4 training
  • Open-source contributions

Job Details

We’re looking for a Machine Learning Engineer to scale training of large transformer-based models. You’ll work on distributed training infrastructure, focusing on performance optimization, parallelization, and fault tolerance in multi-GPU and multi-node environments.

Responsibilities:

  • Performance engineering of training infrastructure for large language models
  • Implementing parallelization strategies across data, tensor, pipeline, and context dimensions
  • Profiling distributed training runs and optimizing performance bottlenecks
  • Building fault-tolerant training systems with checkpointing and recovery mechanisms (see the sketch after this list)
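
To give a concrete flavor of the parallelization and fault-tolerance work above, here is a minimal sketch of a data-parallel training step with periodic checkpointing and resume, using PyTorch DistributedDataParallel. The model, optimizer, checkpoint path, and loop structure are illustrative assumptions, not details from this posting.

# Minimal sketch: data parallelism (DDP) plus checkpoint/resume.
# Assumes launch via torchrun; model, optimizer, and CKPT_PATH are hypothetical.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

CKPT_PATH = "/tmp/ckpt.pt"  # hypothetical checkpoint location

def save_checkpoint(step, model, optimizer):
    # Rank 0 writes the checkpoint; the barrier keeps all ranks in sync.
    if dist.get_rank() == 0:
        torch.save({"step": step,
                    "model": model.module.state_dict(),   # unwrap DDP
                    "optimizer": optimizer.state_dict()}, CKPT_PATH)
    dist.barrier()

def load_checkpoint(model, optimizer):
    # Resume from the last checkpoint if present; otherwise start at step 0.
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.module.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1

def main():
    dist.init_process_group("nccl")
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))
    torch.cuda.set_device(device)

    model = DDP(nn.Linear(1024, 1024).to(device), device_ids=[device.index])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    start = load_checkpoint(model, optimizer)

    for step in range(start, 1000):
        x = torch.randn(32, 1024, device=device)  # stand-in for real batches
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()        # DDP all-reduces gradients across ranks here
        optimizer.step()
        if step % 100 == 0:
            save_checkpoint(step, model, optimizer)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched with, e.g., torchrun --nproc_per_node=8 train.py, this runs one process per GPU; after a crash or preemption, relaunching resumes from the last saved step.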

Qualifications:

  • 3+ years training large neural networks in production
  • Expert-level PyTorch or JAX for performant and fault-tolerant training code
  • Multi-node, multi-GPU training experience with debugging skills
  • Experience with distributed training frameworks and cluster management
  • Deep understanding of GPU memory management and optimization techniques (see the sketch after this list)
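
As one concrete example of the memory-management techniques listed above, the following sketch applies activation (gradient) checkpointing and bf16 autocast to a toy residual block. The module, shapes, and layer count are assumptions for illustration; a CUDA device is assumed.

# Minimal sketch: activation checkpointing + bf16 autocast to cut memory.
# The Block/Net modules and tensor shapes are hypothetical.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.ff(x)

class Net(nn.Module):
    def __init__(self, d=1024, n_layers=8):
        super().__init__()
        self.blocks = nn.ModuleList(Block(d) for _ in range(n_layers))

    def forward(self, x):
        for blk in self.blocks:
            # Recompute this block's activations during backward instead of
            # storing them: trades extra compute for a smaller memory footprint.
            x = checkpoint(blk, x, use_reentrant=False)
        return x

model = Net().cuda()
x = torch.randn(8, 512, 1024, device="cuda", requires_grad=True)
with torch.autocast("cuda", dtype=torch.bfloat16):  # bf16 activations
    y = model(x)
y.sum().backward()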

Preferred:

  • Experience with distributed training of large multi-modal models, including those with separate vision encoders
  • Deep knowledge of NCCL (e.g. symmetric memory)
  • Experience with mixture of experts architectures and expert parallelism
  • Strong NVIDIA GPU programming experience (Triton, CUTLASS, or similar; a toy example follows this list)
  • Custom CUDA kernel development for training operations
  • Proven ability to debug training instability and numerical issues
  • Experience designing test runs to de-risk large-scale optimizations
  • Hands-on experience with FP8 or FP4 training
  • Track record of open-source contributions (e.g. DeepSpeed, TorchTitan, NeMo)
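
For flavor on the GPU-programming items above (Triton, custom kernels), here is the classic Triton vector-add kernel. It is a generic illustration of the toolchain, not a kernel from any specific training codebase.

# Minimal sketch: an elementwise vector-add kernel in Triton.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)              # one program per block of elements
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                          # guard the ragged final block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

def add(x, y):
    # Expects CUDA tensors of equal shape; launches a 1D grid over elements.
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

Calling add(torch.randn(4096, device="cuda"), torch.randn(4096, device="cuda")) compiles the kernel on first use and matches x + y.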
