Distributed Training Engineer


Job Description

Periodic Labs is an AI + physical sciences lab focused on novel scientific discoveries, building state-of-the-art models. The Distributed Training Engineer will optimize, operate, and develop large-scale distributed LLM training systems for AI scientific research. This role involves collaborating with researchers on mid-training and reinforcement learning workflows, building tools, and supporting frontier-scale experiments to advance the lab's mission. The engineer will also contribute to open-source large-scale LLM training frameworks.
Must Have:
  • Experience training on clusters with 5,000+ GPUs
  • Experience with 5D parallel LLM training (see the device-mesh sketch after this list)
  • Proficiency in distributed training frameworks like Megatron-LM, FSDP, DeepSpeed, TorchTitan
  • Ability to optimize training throughput for large-scale Mixture-of-Experts models
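
The "5D" in 5D parallelism typically means combining data, pipeline, tensor, context, and expert parallelism. As a rough illustration only (the posting names no specific stack), the sketch below shows how PyTorch's DeviceMesh can factor a cluster into those five axes; the 32-rank shape, dimension names, and launch details are all hypothetical:

```python
# Hypothetical 5D device-mesh sketch (PyTorch >= 2.2). Requires exactly
# 2*2*2*2*2 = 32 ranks, e.g.:
#   torchrun --nnodes=4 --nproc_per_node=8 mesh_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Factor the cluster into data, pipeline, tensor, context, and expert axes.
mesh = init_device_mesh(
    "cuda",
    mesh_shape=(2, 2, 2, 2, 2),
    mesh_dim_names=("dp", "pp", "tp", "cp", "ep"),
)
print(mesh["dp"])  # 1-D sub-mesh for the data-parallel axis

dist.destroy_process_group()
```

Frameworks such as Megatron-LM and TorchTitan build their parallel layouts on meshes of this kind, with each axis mapped to a different sharding or scheduling strategy.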

About Periodic Labs

We are an AI + physical sciences lab building state-of-the-art models to make novel scientific discoveries. We are well funded and growing rapidly. Team members are owners who identify and solve problems without boundaries or bureaucracy. We eagerly learn new tools and new science to push forward our mission.

About the role

You will optimize, operate, and develop large-scale distributed LLM training systems that power AI scientific research. You will work closely with researchers to bring up, debug, and maintain mid-training and reinforcement learning workflows. You will build tools and directly support frontier-scale experiments to make Periodic Labs the world’s best AI + science lab for physicists, computational materials scientists, AI researchers, and engineers. You will contribute to open-source large-scale LLM training frameworks.

You might thrive in this role if you have experience with:

  • Training on clusters with 5,000+ GPUs
  • 5D parallel LLM training
  • Distributed training frameworks such as Megatron-LM, FSDP, DeepSpeed, TorchTitan
  • Optimizing training throughput for large-scale Mixture-of-Experts models (a toy routing layer is sketched below)
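
That last item is the heart of the performance work. As a toy illustration (not Periodic Labs' code; every name and size here is invented), the snippet below shows a minimal top-2-routed Mixture-of-Experts layer in plain PyTorch. At cluster scale, the per-expert Python loop is replaced by expert-parallel all-to-all token dispatch, which is where most of the throughput tuning happens:

```python
# Toy Mixture-of-Experts layer with top-2 routing (single device,
# illustrative only). Runs on CPU as written.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, n_experts)
        top_logits, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(top_logits, dim=-1)  # renormalize over top-k
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                sel = idx[:, k] == e             # tokens routed to expert e
                if sel.any():
                    out[sel] += weights[sel, k, None] * expert(x[sel])
        return out

x = torch.randn(64, 256)
print(ToyMoE()(x).shape)  # torch.Size([64, 256])
```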
