Distributed Systems Engineer

6 Months ago • 5 Years + • System Design • $150,000 PA - $300,000 PA

Job Summary

Job Description

We are seeking individuals with robust Machine Learning (ML) and Distributed Systems expertise. This role involves working within our Research team, collaborating closely with researchers to develop platforms for training our next-generation foundation models. Responsibilities include scaling systems for training on multi-thousand GPU clusters, profiling and optimizing model training code for hardware efficiency, building systems for efficient work distribution across large GPU clusters, designing methods for robust training amidst hardware failures, and creating tooling to analyze issues in large training jobs. The ideal candidate will have experience with multi-modal ML pipelines, high-performance computing, and low-level systems, with a passion for deep systems implementation and performance improvement.
Must have:
  • Work with researchers to scale systems
  • Optimize model training code
  • Build distributed systems for GPUs
  • Design for hardware failure tolerance
  • Build tooling for training jobs
  • Experience with multi-modal ML pipelines
  • Experience with HPC/low-level systems
  • Strong Python and software skills
  • Significant experience with PyTorch
Good to have:
  • Experience with C++ or CUDA
Perks:
  • Offers Equity

Job Details

We are looking for people with strong ML and distributed systems backgrounds. This role sits within our Research team and collaborates closely with researchers to build the platforms for training our next generation of foundation models.

Responsibilities

  • Work with researchers to scale up the systems required for our next generation of models trained on multi-thousand GPU clusters.

  • Profile and optimize our model training codebase to achieve best-in-class hardware efficiency.

  • Build systems to distribute work across massive GPU clusters efficiently.

  • Design and implement methods to robustly train models in the presence of hardware failures.

  • Build tooling to help us better understand problems in our largest training jobs.

Experience

  • 5+ years of work experience.

  • Experience working with multi-modal ML pipelines, high-performance computing, and/or low-level systems.

  • Passion for diving deep into systems implementations and understanding their fundamentals in order to improve their performance and maintainability.

  • Experience building stable and highly efficient distributed systems.

  • Strong generalist Python and software skills, including significant experience with PyTorch.

  • Nice to have: experience working with high-performance C++ or CUDA.

About The Company

Luma
Palo Alto, California, United States (Hybrid)
