We are seeking a Machine Learning Engineer to improve the training infrastructure for large transformer-based models. The role centers on distributed training systems in multi-GPU and multi-node environments. Key responsibilities include performance engineering for large language model training, implementing parallelization strategies, profiling and optimizing training runs, and building fault-tolerant training systems with checkpointing and recovery.