Distributed Systems Engineer

8 Months ago • 5 Years + • $150,000 PA - $300,000 PA
System Design

Job Description

We are seeking individuals with robust Machine Learning (ML) and Distributed Systems expertise. This role involves working within our Research team, collaborating closely with researchers to develop platforms for training our next-generation foundation models. Responsibilities include scaling systems for training on multi-thousand GPU clusters, profiling and optimizing model training code for hardware efficiency, building systems for efficient work distribution across large GPU clusters, designing methods for robust training amidst hardware failures, and creating tooling to analyze issues in large training jobs. The ideal candidate will have experience with multi-modal ML pipelines, high-performance computing, and low-level systems, with a passion for deep systems implementation and performance improvement.
Good To Have:
  • Experience with C++ or CUDA
Must Have:
  • Work with researchers to scale systems
  • Optimize model training code
  • Build distributed systems for GPUs
  • Design for hardware failure tolerance
  • Build tooling for training jobs
  • Experience with multi-modal ML pipelines
  • Experience with HPC/low-level systems
  • Strong Python and software skills
  • Significant experience with PyTorch
Perks:
  • Offers Equity

We are looking for people with strong ML and distributed systems backgrounds. This role will work within our Research team, closely collaborating with researchers to build the platforms for training our next generation of foundation models.

Responsibilities

  • Work with researchers to scale up the systems required for our next generation of models trained on multi-thousand GPU clusters.

  • Profile and optimize our model training codebase to achieve best-in-class hardware efficiency.

  • Build systems to distribute work across massive GPU clusters efficiently.

  • Design and implement methods to robustly train models in the presence of hardware failures.

  • Build tooling to help us better understand problems in our largest training jobs.

Experience

  • 5+ years of work experience.

  • Experience working with multi-modal ML pipelines, high-performance computing, and/or low-level systems.

  • Passion for diving deep into systems implementations and understanding their fundamentals in order to improve their performance and maintainability.

  • Experience building stable and highly efficient distributed systems.

  • Strong generalist Python and software skills, including significant experience with PyTorch.

  • Good to have: experience working with high-performance C++ or CUDA.

Your application is reviewed by real people.

