Senior Distributed Systems Engineer

2 Months ago • 3 Years + • Research & Development • $175,000 PA - $250,000 PA

Job Summary

Job Description

This Senior Distributed Systems Engineer role involves collaborating with researchers to build and optimize platforms for training next-generation foundation models on massive GPU clusters. Key responsibilities include scaling and optimizing systems for training large-scale models across thousands of GPUs, profiling and enhancing training code performance, developing efficient workload distribution systems, designing robust solutions for handling hardware failures, building diagnostic tools, optimizing inference workloads, implementing high-performance CUDA, Triton, and PyTorch code, and collaborating with researchers on system design. The ideal candidate will have extensive experience in ML pipelines, distributed systems, or high-performance computing, along with proficiency in Python and PyTorch, and expertise in CUDA/Triton programming and optimization techniques. Experience with generative models and prototype development is a plus.
Must have:
  • 3+ years experience in ML pipelines, distributed systems, or HPC
  • Experience training large models using Python and PyTorch
  • Expertise in optimizing and deploying inference workloads
  • Understanding of distributed systems and frameworks (DDP, FSDP, tensor parallelism)
  • High-performance parallel C++ and custom PyTorch kernels
  • CUDA and Triton optimization techniques
Good to have:
  • Experience with generative models (Transformers, Diffusion Models, GANs)
  • Prototype development (Gradio, Docker)
Perks:
  • Competitive equity packages (stock options)
  • Comprehensive benefits plan

Job Details

We are seeking highly skilled engineers with expertise in machine learning, distributed systems, and high-performance computing to join our Research team. In this role, you will collaborate closely with researchers to build and optimize platforms that train next-generation foundation models on massive GPU clusters. Your work will play a critical role in advancing the efficiency and scalability of cutting-edge generative AI technologies.

Key Responsibilities

  • Scale and optimize systems for training large-scale models across multi-thousand GPU clusters.
  • Profile and enhance the performance of training codebases to achieve best-in-class hardware efficiency.
  • Develop systems to distribute workloads efficiently across massive GPU clusters.
  • Design and implement robust solutions to enable model training in the presence of hardware failures.
  • Build tools to diagnose issues, visualize processes, and evaluate datasets at scale.
  • Optimize and deploy inference workloads for throughput and latency across the entire stack, including data processing, model inference, and parallel processing.
  • Implement and improve high-performance CUDA, Triton, and PyTorch code to address efficiency bottlenecks in memory, speed, and utilization.
  • Collaborate with researchers to ensure systems are designed with optimal efficiency from the ground up.
  • Prototype cutting-edge applications using multimodal generative AI.

Qualifications

  • Experience:
    • 3+ years of professional experience in ML pipelines, distributed systems, or high-performance computing.
    • Hands-on experience training large models using Python and PyTorch, with familiarity across the full pipeline: data processing, loading, training, and inference.
    • Proven expertise in optimizing and deploying inference workloads, with experience in profiling GPU/CPU code (e.g., NVIDIA Nsight).
    • Deep understanding of distributed systems and frameworks, such as DDP, FSDP, and tensor parallelism.
    • Strong experience writing high-performance parallel C++ and custom PyTorch kernels, with knowledge of CUDA and Triton optimization techniques.
    • Bonus: Experience with generative models (e.g., Transformers, Diffusion Models, GANs) and prototype development (e.g., Gradio, Docker).
  • Technical Skills:
    • Proficiency in Python, with significant experience using PyTorch.
    • Advanced skills in CUDA/Triton programming, including custom kernel development and tensor core optimization.
    • Strong generalist software engineering skills and familiarity with distributed and parallel computing systems.

Note: This position is not intended for recent graduates.

Compensation

The salary range for this role in California is $175,000–$250,000 per year. Actual compensation will depend on job-related knowledge, skills, experience, and candidate location. We also offer competitive equity packages in the form of stock options and a comprehensive benefits plan.

About The Company

An idea-to-video platform that brings your creativity to motion.

Palo Alto, California, United States (On-Site)
