We are seeking a Machine Learning Engineer to enhance the training infrastructure for large transformer-based models. The role centers on distributed training systems, with a focus on optimizing performance, implementing parallelization strategies, and ensuring fault tolerance in multi-GPU and multi-node environments.
Key Responsibilities:
- Performance engineering for large language model training
- Implementing parallelization techniques such as data, tensor, and pipeline parallelism
- Profiling and optimizing training runs
- Building robust, fault-tolerant training systems with checkpointing and recovery
Good To Have:
- Experience training large multi-modal models
- Deep knowledge of NCCL (NVIDIA Collective Communications Library)
- Experience with mixture-of-experts (MoE) architectures
- Strong NVIDIA GPU programming experience
- Custom CUDA kernel development
- Debugging training instability and numerical issues
- Designing controlled test runs to validate optimizations
- Hands-on experience with FP8 or FP4 training
- Open-source contributions
Must Have:
- 3+ years training large neural networks in production
- Expert-level proficiency in PyTorch or JAX for writing training code
- Multi-node, multi-GPU training experience
- Experience with distributed training frameworks
- GPU memory management and optimization skills