Machine Learning Engineer (Training Infrastructure)

Nous Research

Job Summary

We are seeking a Machine Learning Engineer to enhance the training infrastructure for large transformer-based models. The role centers on distributed training systems: optimizing performance, implementing parallelization strategies, and ensuring fault tolerance in multi-GPU and multi-node environments. Day to day, that means performance engineering for large language model training, profiling and optimizing training runs, and building robust, fault-tolerant training systems with checkpointing and recovery.

Job Description

We’re looking for an MLE to scale training of large transformer-based models. You’ll work on distributed training infrastructure, focusing on performance optimization, parallelization, and fault tolerance for multi-GPU and multi-node training environments.

Responsibilities:

  • Performance engineering of training infrastructure for large language models
  • Implementing parallelization strategies across data, tensor, pipeline, and context dimensions
  • Profiling distributed training runs and optimizing performance bottlenecks
  • Building fault-tolerant training systems with checkpointing and recovery mechanisms (a minimal sketch follows this list)
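
To make the checkpointing item concrete, here is a minimal sketch of an atomic save/resume loop under torch.distributed. The path and helper names (CKPT_PATH, save_checkpoint, load_checkpoint) are illustrative, not an existing codebase, and it assumes the process group is already initialized (e.g. via torchrun):

    import os
    import torch
    import torch.distributed as dist

    CKPT_PATH = "checkpoint.pt"  # illustrative; real systems shard per rank

    def save_checkpoint(step, model, optimizer):
        # Rank 0 writes to a temp file, then renames: a crash mid-write
        # never leaves a corrupt checkpoint behind.
        if dist.get_rank() == 0:
            state = {
                "step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
            }
            torch.save(state, CKPT_PATH + ".tmp")
            os.replace(CKPT_PATH + ".tmp", CKPT_PATH)  # atomic rename
        dist.barrier()  # no rank trains past a checkpoint others may resume from

    def load_checkpoint(model, optimizer):
        # Called on every rank at startup; returns the step to resume from,
        # or 0 for a fresh run.
        if not os.path.exists(CKPT_PATH):
            return 0
        state = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]

Production systems extend this pattern with asynchronous writes and sharded, per-rank checkpoints so that saving a very large optimizer state does not stall training.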

Qualifications:

  • 3+ years training large neural networks in production
  • Expert-level PyTorch or JAX for performant and fault-tolerant training code
  • Multi-node, multi-GPU training experience, including debugging distributed failures
  • Experience with distributed training frameworks and cluster management
  • Deep understanding of GPU memory management and optimization techniques (see the sketch after this list)
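
As a concrete taste of the memory-management work, the sketch below measures peak GPU memory while trading compute for memory with PyTorch activation checkpointing. The Linear stack is a placeholder stand-in for a real transformer, and the sizes are arbitrary:

    import torch
    from torch.utils.checkpoint import checkpoint_sequential

    model = torch.nn.Sequential(  # placeholder for a transformer block stack
        *[torch.nn.Linear(4096, 4096) for _ in range(8)]
    ).cuda()
    x = torch.randn(32, 4096, device="cuda", requires_grad=True)

    torch.cuda.reset_peak_memory_stats()
    # Recompute activations in 4 segments during backward instead of storing
    # them all, cutting activation memory at the cost of extra forward compute.
    out = checkpoint_sequential(model, 4, x, use_reentrant=False)
    out.sum().backward()
    print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")

The same measurement, taken per rank across a real run, is how you choose between recomputation, ZeRO-style sharding, and offloading.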

Preferred:

  • Experience with distributed training of large multi-modal models, including those with separate vision encoders
  • Deep knowledge of NCCL (e.g. symmetric memory)
  • Experience with mixture of experts architectures and expert parallelism
  • Strong NVIDIA GPU programming experience (Triton, CUTLASS, or similar; illustrated below)
  • Custom CUDA kernel development for training operations
  • Proven ability to debug training instability and numerical issues
  • Experience designing test runs to de-risk large-scale optimizations
  • Hands-on experience with FP8 or FP4 training
  • Track record of open-source contributions (e.g. DeepSpeed, TorchTitan, NeMo)
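
For the GPU programming items, the classic Triton elementwise kernel below shows the block/mask/pointer structure such work builds on. It is a standard tutorial-style example, not code from this team:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard the ragged final block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        grid = (triton.cdiv(n, 1024),)
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

Real training kernels fuse far more (dequantization, activations, reductions), but the structure of program IDs, masks, and pointer arithmetic is the same.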
