Software Engineer, ML Infrastructure - Training Platform

1 Day ago • 4 Years + • $160,000 PA - $225,600 PA

Job Summary

Job Description

Scale is looking for an AI/ML Infrastructure Engineer to join their Machine Learning Infrastructure team to build out the Training Platform. This role involves close collaboration with Machine Learning researchers to understand their requirements and leverage domain expertise and compute resources to accelerate experimentation. The engineer will be responsible for building highly available, observable, performant, and cost-effective APIs for model training, participating in the team’s on-call process, and owning projects end-to-end from requirements to implementation. The ideal candidate will have experience with machine learning, backend system design, and prior ML Infrastructure experience, including experience with distributed training techniques such as DeepSpeed and FSDP.
Must have:
  • Experience building machine learning training pipelines in a production setting.
  • Experience with distributed training techniques.
  • Experience building and monitoring microservice architectures.
  • Experience with Python, Docker, Kubernetes, and Infrastructure as code.
Good to have:
  • Experience with LLM inference latency optimization techniques.
  • Experience working with a cloud technology stack (e.g., AWS or GCP).

Job Details

Scale is looking for an AI/ML Infrastructure Engineer to join our Machine Learning Infrastructure team to build out our Training Platform. You will partner closely with Machine Learning researchers to understand their requirements and apply your own domain expertise and our compute resources to accelerate experimentation throughput.

The ideal candidate has strong fundamentals in machine learning and backend system design, along with prior ML Infrastructure experience. You should also be comfortable with infrastructure and large-scale system design, as well as diagnosing both model performance issues and system failures.

You will:

  • Build highly available, observable, performant, and cost-effective APIs for model training.
  • Participate in our team’s on-call process to ensure the availability of our services.
  • Own projects end-to-end, from requirements and scoping through design and implementation, in a highly collaborative and cross-functional environment.
  • Exercise good taste in building systems and tools and know when to make build vs. buy tradeoffs, with an eye for cost efficiency.

Ideally you'd have:

  • 4+ years of experience building machine learning training pipelines or inference services in a production setting.
  • Experience with distributed training techniques such as DeepSpeed and FSDP.
  • Experience building, deploying, and monitoring complex microservice architectures.
  • Experience with Python, Docker, Kubernetes, and Infrastructure as code (e.g., Terraform).

Nice to haves:

  • Experience with LLM inference latency optimization techniques such as kernel fusion, quantization, and dynamic batching.
  • Experience working with a cloud technology stack (e.g., AWS or GCP).
