Senior Site Reliability Engineer - Fleet Reliability

4 Months ago • 7 Years + • Devops • $255,000 PA - $405,000 PA

Job Summary

Job Description

Lambda is seeking a Senior Site Reliability Engineer to join its Fleet Reliability team. The role involves defining and measuring fleet health metrics to improve system availability, collaborating with the observability team on monitoring and alerting systems, creating runbooks and automated remediations for common failures, and building automation and auditing for compliance and efficiency. The engineer will also participate in on-call rotations and integrate logging and metrics across platforms like Datadog, Prometheus, and Grafana. Lambda offers a GPU Cloud for ML/AI teams and is experiencing high demand for its systems.
Must have:
  • 7+ years of experience in SRE/DevOps
  • Understanding of AI infrastructure and GPU architectures
  • Strong Linux-based systems knowledge
  • Proficiency in Python and Go
  • Experience with monitoring tools (Prometheus, Grafana)
  • Proficiency in automation tools (Ansible, Terraform)
  • Excellent problem-solving skills
  • Strong communication and collaboration skills
Good to have:
  • Experience in machine learning or computer hardware
  • Knowledge of Docker and Kubernetes
  • Experience with HPC resources
  • Background in chaos engineering
  • Understanding of compliance frameworks
Perks:
  • Generous cash & equity compensation
  • Health, dental, and vision coverage
  • Wellness and Commuter stipends
  • 401k Plan with company match
  • Flexible Paid Time Off Plan

Job Details

Lambda is the #1 GPU Cloud for ML/AI teams training, fine-tuning and inferencing AI models, where engineers can easily, securely and affordably build, test and deploy AI products at scale. Lambda’s product portfolio includes on-prem GPU systems, hosted GPUs across public & private clouds and managed inference services, serving government, researchers, startups and enterprises worldwide.


If you'd like to build the world's best deep learning cloud, join us. 


*Note: This position requires presence in our San Francisco office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance.

What You’ll Do

  • Define Fleet Health metrics and indicators to objectively measure and improve system availability

  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies

  • Create runbooks and automated remediations for common failure scenarios

  • Build in automation and auditing to ensure compliance and improve efficiency and productivity

  • Participate in on-call rotations and provide support for incident response and resolution

  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic

You

  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role

  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization

  • Strong understanding of Linux-based systems in a distributed environment

  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling

  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)

  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)

  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)

  • Excellent problem-solving and troubleshooting skills

  • Strong communication and collaboration skills

  • Passion for continuous improvement and innovation

Nice to Have

  • Experience in the machine learning or computer hardware industry

  • Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes)

  • Experience building and/or operating HPC resources

  • Background in chaos engineering or similar reliability testing methodologies

  • Understanding of compliance frameworks (SOC 2, ISO 27001, etc.)

Salary Range Information

Based on market data and other factors, the annual salary range for this position is $255,000-$405,000. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda

  • Founded in 2012, ~350 employees (2024) and growing fast

  • We offer generous cash & equity compensation

  • Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove

  • We are experiencing extremely high demand for our systems, with quarter-over-quarter and year-over-year profitability

  • Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG

  • Health, dental, and vision coverage for you and your dependents

  • Wellness and Commuter stipends for select roles

  • 401k Plan with 2% company match (USA employees)

  • Flexible Paid Time Off Plan that we all actually use

A Final Note:

You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer

Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.




