Distributed Systems Engineer

5 Months ago • 5 Years + • System Design • $150,000 PA - $300,000 PA

Job Description

We are seeking individuals with robust Machine Learning (ML) and Distributed Systems expertise. This role involves working within our Research team, collaborating closely with researchers to develop platforms for training our next-generation foundation models. Responsibilities include scaling systems for training on multi-thousand GPU clusters, profiling and optimizing model training code for hardware efficiency, building systems for efficient work distribution across large GPU clusters, designing methods for robust training amidst hardware failures, and creating tooling to analyze issues in large training jobs. The ideal candidate will have experience with multi-modal ML pipelines, high-performance computing, and low-level systems, with a passion for deep systems implementation and performance improvement.
Must have:
  • Work with researchers to scale systems
  • Optimize model training code
  • Build distributed systems for GPUs
  • Design for hardware failure tolerance
  • Build tooling for training jobs
  • Experience with multi-modal ML pipelines
  • Experience with HPC/low-level systems
  • Strong Python and software skills
  • Significant experience with PyTorch
Good to have:
  • Experience with C++ or CUDA
Perks:
  • Offers Equity

Job Details

We are looking for people with strong ML and distributed systems backgrounds. This role sits within our Research team and collaborates closely with researchers to build the platforms for training our next generation of foundation models.

Responsibilities

  • Work with researchers to scale up the systems required for our next generation of models trained on multi-thousand GPU clusters.

  • Profile and optimize our model training codebase to achieve best-in-class hardware efficiency.

  • Build systems to distribute work across massive GPU clusters efficiently.

  • Design and implement methods to robustly train models in the presence of hardware failures.

  • Build tooling to help us better understand problems in our largest training jobs.

Experience

  • 5+ years of work experience.

  • Experience working with multi-modal ML pipelines, high-performance computing, and/or low-level systems.

  • Passion for diving deep into systems implementations and understanding their fundamentals in order to improve their performance and maintainability.

  • Experience building stable and highly efficient distributed systems.

  • Strong generalist Python and software skills, including significant experience with PyTorch.

  • Nice to have: experience working with high-performance C++ or CUDA.

Your application is reviewed by real people.

About The Company

Palo Alto, California, United States (On-Site)

Get notified when new jobs are added by Luma
