AI Infrastructure Engineer, Model Serving Platform

3 Months ago • 4+ Years • DevOps • $175,000 - $220,000 PA

Job Summary

Job Description

As a Software Engineer on the ML Infrastructure team, you will design and build platforms for scalable, reliable, and efficient serving of LLMs and AI agents. The ideal candidate combines strong ML fundamentals with deep expertise in backend system design. You’ll work in a highly collaborative environment, bridging research and engineering to deliver seamless experiences to our customers and accelerate innovation across the company. You will build and maintain fault-tolerant, high-performance systems, collaborate with researchers and engineers, conduct architecture and design reviews, develop monitoring and observability solutions, and lead projects end-to-end.
Must have:
  • 4+ years of experience building large-scale backend systems.
  • Strong programming skills in one or more languages.
  • Deep understanding of concurrency and distributed systems.
  • Experience with containers and orchestration tools.
  • Familiarity with cloud infrastructure and infrastructure as code.
  • Proven ability to solve complex problems independently.
Good to have:
  • Experience with modern LLM serving frameworks.
  • Knowledge of ML frameworks and optimization.
  • Experience with model inference optimizations.
  • Familiarity with emerging agent frameworks.

Job Details

As a Software Engineer on the ML Infrastructure team, you will design and build platforms for scalable, reliable, and efficient serving of LLMs and AI agents. Our platform powers cutting-edge research and production systems, supporting both internal and external use cases across various environments.

The ideal candidate combines strong ML fundamentals with deep expertise in backend system design. You’ll work in a highly collaborative environment, bridging research and engineering to deliver seamless experiences to our customers and accelerate innovation across the company.

You will:

  • Build and maintain fault-tolerant, high-performance systems for serving LLMs and agent-based workloads at scale.
  • Collaborate with researchers and engineers to integrate and optimize models for production and research use cases.
  • Conduct architecture and design reviews to uphold best practices in system design and scalability.
  • Develop monitoring and observability solutions to ensure system health and performance.
  • Lead projects end-to-end, from requirements gathering to implementation, in a cross-functional environment.

Ideally you'd have:

  • 4+ years of experience building large-scale, high-performance backend systems.
  • Strong programming skills in one or more languages (e.g., Python, Go, Rust, C++).
  • Deep understanding of concurrency, memory management, networking, and distributed systems.
  • Experience with containers, virtualization, and orchestration tools (e.g., Docker, Kubernetes).
  • Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform).
  • Proven ability to solve complex problems and work independently in fast-moving environments.

Nice to have:

  • Experience with modern LLM serving frameworks such as vLLM, SGLang, TensorRT-LLM, or text-generation-inference.
  • Knowledge of ML frameworks (e.g., PyTorch or TensorFlow) and how to optimize them for production serving.
  • Experience with model inference optimizations such as quantization, distillation, and speculative decoding.
  • Familiarity with emerging agent frameworks and protocols such as OpenHands, Agent2Agent, and MCP.

About The Company

Scale AI