MLE (General Training Infrastructure)

Nous Research

Job Summary

Nous Research is seeking a Machine Learning Engineer to enhance the training infrastructure for large transformer-based models. The role involves scaling distributed training environments, optimizing performance, implementing parallelization strategies across multiple dimensions, and building fault-tolerant systems with robust checkpointing and recovery mechanisms. The ideal candidate has experience with multi-GPU and multi-node setups, along with a strong grasp of performance bottlenecks and GPU memory management.

Job Description

We’re looking for an MLE to scale training of large transformer-based models. You’ll work on distributed training infrastructure, focusing on performance optimization, parallelization, and fault tolerance for multi-GPU and multi-node training environments.

Responsibilities:

  • Performance engineering of training infrastructure for large language models
  • Implementing parallelization strategies across data, tensor, pipeline, and context dimensions
  • Profiling distributed training runs and optimizing performance bottlenecks
  • Building fault-tolerant training systems with checkpointing and recovery mechanisms (see the sketch after this list)
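
To give a concrete flavor of the parallelization and fault-tolerance work above, here is a minimal, illustrative sketch, not Nous Research code: plain PyTorch data-parallel training (DDP) with periodic checkpointing and resume. The model, checkpoint path, and hyperparameters are placeholders chosen for brevity.

```python
# Illustrative sketch: data-parallel training with checkpoint/resume.
# Assumes one process per GPU, launched with torchrun.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for a transformer
    model = DDP(model, device_ids=[torch.cuda.current_device()])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    ckpt_path = "ckpt.pt"                         # hypothetical path
    start_step = 0
    if os.path.exists(ckpt_path):                 # resume after a failure
        ckpt = torch.load(ckpt_path, map_location="cuda")
        model.module.load_state_dict(ckpt["model"])
        opt.load_state_dict(ckpt["opt"])
        start_step = ckpt["step"] + 1

    for step in range(start_step, 1000):
        x = torch.randn(8, 4096, device="cuda")  # placeholder batch
        loss = model(x).pow(2).mean()
        opt.zero_grad(set_to_none=True)
        loss.backward()                            # DDP all-reduces gradients here
        opt.step()
        if step % 100 == 0 and rank == 0:          # periodic checkpoint from rank 0
            torch.save({"model": model.module.state_dict(),
                        "opt": opt.state_dict(),
                        "step": step}, ckpt_path)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

In practice something like this would be launched with torchrun --nproc_per_node=<gpus>; tensor, pipeline, and context parallelism layer additional process groups and sharded checkpoints on top of this data-parallel skeleton.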

Qualifications:

  • 3+ years training large neural networks in production
  • Expert-level PyTorch or JAX for performant and fault-tolerant training code
  • Multi-node, multi-GPU training experience with debugging skills
  • Experience with distributed training frameworks and cluster management
  • Deep understanding of GPU memory management and optimization techniques (a profiling sketch follows this list)
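
As an illustration of the profiling and GPU-memory analysis these qualifications point at, here is a minimal sketch using the built-in PyTorch profiler and CUDA memory statistics; the model and tensor shapes are arbitrary placeholders.

```python
# Illustrative sketch: profile one training step and report peak GPU memory.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(32, 4096, device="cuda")

torch.cuda.reset_peak_memory_stats()
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    loss = model(x).pow(2).mean()
    opt.zero_grad(set_to_none=True)
    loss.backward()
    opt.step()

# Top kernels by GPU time, plus the allocator's peak usage for the step.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**20:.1f} MiB")
```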

Preferred:

  • Experience with distributed training of large multi-modal models, including those with separate vision encoders
  • Deep knowledge of NCCL (e.g. symmetric memory)
  • Experience with mixture of experts architectures and expert parallelism
  • Strong NVIDIA GPU programming experience (Triton, CUTLASS, or similar; a minimal kernel sketch follows this list)
  • Custom CUDA kernel development for training operations
  • Proven ability to debug training instability and numerical issues
  • Experience designing test runs to de-risk large-scale optimizations
  • Hands-on experience with FP8 or FP4 training
  • Track record of open-source contributions (e.g. DeepSpeed, TorchTitan, NeMo)
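
For the GPU-programming items above, here is a minimal Triton sketch of a custom elementwise kernel. It is purely illustrative; real training kernels (fused optimizers, attention variants, MoE routing) are considerably more involved.

```python
# Illustrative sketch: a custom elementwise-add kernel written in Triton.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements               # guard the tail of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)             # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
assert torch.allclose(add(x, y), x + y)
```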

Skills Required For This Role

Problem Solving, CUDA, PyTorch, Neural Networks
