Software Engineer, ML Infrastructure - Training Platform

2 Months ago • 4 Years + • Devops • $160,000 PA - $225,600 PA

Job Summary

Job Description

Scale is looking for an AI/ML Infrastructure Engineer to join their Machine Learning Infrastructure team to build out the Training Platform. This role involves close collaboration with Machine Learning researchers to understand their requirements and leverage domain expertise and compute resources to accelerate experimentation. The engineer will be responsible for building highly available, observable, performant, and cost-effective APIs for model training, participating in the team’s on-call process, and owning projects end-to-end from requirements to implementation. The ideal candidate will have experience with machine learning, backend system design, and prior ML Infrastructure experience, including experience with distributed training techniques such as DeepSpeed and FSDP.
Must have:
  • Experience building machine learning training pipelines in a production setting.
  • Experience with distributed training techniques.
  • Experience building and monitoring microservice architectures.
  • Experience with Python, Docker, Kubernetes, and Infrastructure as code.
Good to have:
  • Experience with LLM inference latency optimization techniques.
  • Experience working with a cloud technology stack (e.g., AWS or GCP).

Job Details

Scale is looking for an AI/ML Infrastructure Engineer to join our Machine Learning Infrastructure team to build out our Training Platform. You will partner closely with Machine Learning researchers to understand their requirements and apply your own domain expertise and our compute resources to accelerate experimentation throughput.

The ideal candidate has strong fundamentals in machine learning and backend system design, along with prior ML Infrastructure experience. You should also be comfortable with infrastructure and large-scale system design, as well as with diagnosing both model performance issues and system failures.

You will:

  • Build highly available, observable, performant, and cost-effective APIs for model training.
  • Participate in our team’s on-call process to ensure the availability of our services.
  • Own projects end-to-end, from requirements, scoping, and design through implementation, in a highly collaborative and cross-functional environment.
  • Exercise good taste in building systems and tools and know when to make build vs. buy tradeoffs, with an eye for cost efficiency.

Ideally you'd have:

  • 4+ years of experience building machine learning training pipelines or inference services in a production setting.
  • Experience with distributed training techniques such as DeepSpeed and FSDP.
  • Experience building, deploying, and monitoring complex microservice architectures.
  • Experience with Python, Docker, Kubernetes, and Infrastructure as code (e.g., Terraform).

Nice to haves:

  • Experience with LLM inference latency optimization techniques such as kernel fusion, quantization, and dynamic batching.
  • Experience working with a cloud technology stack (e.g., AWS or GCP).
