Senior Distributed Systems Engineer

2 Months ago • 3 Years + • Research & Development • $175,000 PA - $250,000 PA

Job Summary

Job Description

This Senior Distributed Systems Engineer role involves collaborating with researchers to build and optimize platforms for training next-generation foundation models on massive GPU clusters. Key responsibilities include scaling and optimizing training systems across thousands of GPUs, profiling and improving training code performance, developing efficient workload distribution systems, designing robust handling of hardware failures, building diagnostic tools, optimizing inference workloads, and implementing high-performance CUDA, Triton, and PyTorch code. The ideal candidate has 3+ years of experience in ML pipelines, distributed systems, or high-performance computing, is proficient in Python and PyTorch, and has expertise in CUDA/Triton programming and optimization techniques. Experience with generative models and prototype development is a plus.
Must have:
  • 3+ years experience in ML pipelines, distributed systems, or HPC
  • Experience training large models using Python and PyTorch
  • Expertise in optimizing and deploying inference workloads
  • Understanding of distributed systems and frameworks (DDP, FSDP, tensor parallelism)
  • High-performance parallel C++ and custom PyTorch kernels
  • CUDA and Triton optimization techniques
Good to have:
  • Experience with generative models (Transformers, Diffusion Models, GANs)
  • Prototype development (Gradio, Docker)
Perks:
  • Competitive equity packages (stock options)
  • Comprehensive benefits plan

Job Details

We are seeking highly skilled engineers with expertise in machine learning, distributed systems, and high-performance computing to join our Research team. In this role, you will collaborate closely with researchers to build and optimize platforms that train next-generation foundation models on massive GPU clusters. Your work will play a critical role in advancing the efficiency and scalability of cutting-edge generative AI technologies.

Key Responsibilities

  • Scale and optimize systems for training large-scale models across multi-thousand GPU clusters.
  • Profile and enhance the performance of training codebases to achieve best-in-class hardware efficiency.
  • Develop systems to distribute workloads efficiently across massive GPU clusters.
  • Design and implement robust solutions to enable model training in the presence of hardware failures.
  • Build tools to diagnose issues, visualize processes, and evaluate datasets at scale.
  • Optimize and deploy inference workloads for throughput and latency across the entire stack, including data processing, model inference, and parallel processing.
  • Implement and improve high-performance CUDA, Triton, and PyTorch code to address efficiency bottlenecks in memory, speed, and utilization.
  • Collaborate with researchers to ensure systems are designed with optimal efficiency from the ground up.
  • Prototype cutting-edge applications using multimodal generative AI.

Qualifications

  • Experience:
    • 3+ years of professional experience in ML pipelines, distributed systems, or high-performance computing.
    • Hands-on experience training large models using Python and PyTorch, with familiarity with the full pipeline: data processing, loading, training, and inference.
    • Proven expertise in optimizing and deploying inference workloads, with experience in profiling GPU/CPU code (e.g., NVIDIA Nsight).
    • Deep understanding of distributed systems and frameworks, such as DDP, FSDP, and tensor parallelism.
    • Strong experience writing high-performance parallel C++ and custom PyTorch kernels, with knowledge of CUDA and Triton optimization techniques.
    • Bonus: Experience with generative models (e.g., Transformers, Diffusion Models, GANs) and prototype development (e.g., Gradio, Docker).
  • Technical Skills:
    • Proficiency in Python, with significant experience using PyTorch.
    • Advanced skills in CUDA/Triton programming, including custom kernel development and tensor core optimization.
    • Strong generalist software engineering skills and familiarity with distributed and parallel computing systems.

Note: This position is not intended for recent graduates.

Compensation

The salary range for this role in California is $175,000–$250,000 per year. Actual compensation will depend on job-related knowledge, skills, experience, and candidate location. We also offer competitive equity packages in the form of stock options and a comprehensive benefits plan.

About The Company

Pika is an idea-to-video platform that brings your creativity to motion.

Palo Alto, California, United States (On-Site)