AI Infrastructure Engineer, Model Serving Platform

2 Months ago • 4+ Years • DevOps • $175,000 PA - $220,000 PA

Job Summary

Job Description

As a Software Engineer on the ML Infrastructure team, you will design and build platforms for scalable, reliable, and efficient serving of LLMs and AI agents. The ideal candidate combines strong ML fundamentals with deep expertise in backend system design. You’ll work in a highly collaborative environment, bridging research and engineering to deliver seamless experiences to our customers and accelerate innovation across the company. You will build and maintain fault-tolerant, high-performance systems, collaborate with researchers and engineers, conduct architecture and design reviews, develop monitoring and observability solutions, and lead projects end-to-end.
Must have:
  • 4+ years of experience building large-scale backend systems.
  • Strong programming skills in one or more languages.
  • Deep understanding of concurrency and distributed systems.
  • Experience with containers and orchestration tools.
  • Familiarity with cloud infrastructure and infrastructure as code.
  • Proven ability to solve complex problems independently.
Good to have:
  • Experience with modern LLM serving frameworks.
  • Knowledge of ML frameworks and optimization.
  • Experience with model inference optimizations.
  • Familiarity with emerging agent frameworks.

Job Details

As a Software Engineer on the ML Infrastructure team, you will design and build platforms for scalable, reliable, and efficient serving of LLMs and AI agents. Our platform powers cutting-edge research and production systems, supporting both internal and external use cases across various environments.

The ideal candidate combines strong ML fundamentals with deep expertise in backend system design. You’ll work in a highly collaborative environment, bridging research and engineering to deliver seamless experiences to our customers and accelerate innovation across the company.

You will:

  • Build and maintain fault-tolerant, high-performance systems for serving LLMs and agent-based workloads at scale.
  • Collaborate with researchers and engineers to integrate and optimize models for production and research use cases.
  • Conduct architecture and design reviews to uphold best practices in system design and scalability.
  • Develop monitoring and observability solutions to ensure system health and performance.
  • Lead projects end-to-end, from requirements gathering to implementation, in a cross-functional environment.

Ideally you'd have:

  • 4+ years of experience building large-scale, high-performance backend systems.
  • Strong programming skills in one or more languages (e.g., Python, Go, Rust, C++).
  • Deep understanding of concurrency, memory management, networking, and distributed systems.
  • Experience with containers, virtualization, and orchestration tools (e.g., Docker, Kubernetes).
  • Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform).
  • Proven ability to solve complex problems and work independently in fast-moving environments.

Nice to haves:

  • Experience with modern LLM serving frameworks such as vLLM, SGLang, TensorRT-LLM, or text-generation-inference.
  • Knowledge of ML frameworks (e.g., PyTorch or TensorFlow) and how to optimize them for production serving.
  • Experience with model inference optimizations such as quantization, distillation, and speculative decoding.
  • Familiarity with emerging agent frameworks such as OpenHands, Agent2Agent, or MCP.

About The Company

Get notified when new jobs are added by Scale AI
