Senior Site Reliability Engineer - GPU Clusters

1 Month ago • 7-10 Years • DevOps • $184,000 PA - $356,500 PA

Job Summary

Job Description

NVIDIA seeks a Senior Site Reliability Engineer to lead the design, deployment, and management of large-scale GPU clusters for AI workloads. Responsibilities include infrastructure provisioning, automation, ensuring high uptime, defining SLOs/SLIs, conducting RCAs, participating in on-call rotations, and integrating new GPU and cloud technologies. The ideal candidate possesses strong expertise in cloud services, Kubernetes, Docker, scripting languages (Python, Go, Ruby), Linux, CI/CD, and IaC tools (Terraform, Ansible). Experience with Slurm and BCM is a plus.
Must have:
  • 7+ years of software engineering experience (3+ years managing GPU clusters)
  • Expertise in designing, deploying, and running production-level cloud services
  • Proficiency with Kubernetes, Docker
  • Proficiency in Linux OS and TCP/IP fundamentals
  • Proficient in CI/CD, GitOps, and IaC (Terraform or Ansible)
Good to have:
  • Experience managing large-scale Slurm and/or BCM deployments
  • Expertise in modern container networking and storage architectures
Perks:
  • Equity
  • Benefits

Job Details

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

We are seeking a highly skilled and experienced Staff Software Engineer to lead the design, deployment, and management of our large-scale GPU clusters. These clusters will power AI workloads across multiple teams and projects, making a significant impact on the future of machine learning and artificial intelligence at NVIDIA. Join our engineering team and collaborate with researchers, AI engineers, and infrastructure teams to ensure our GPU clusters perform efficiently, scale well, and remain reliable.

The ideal candidate has a passion for operational excellence, automation, and working in a multi-cloud environment. You will collaborate with a diverse and experienced team, constantly improving infrastructure provisioning and resiliency to ensure a high level of service availability.

What you will be doing:

  • Design, deploy and support large-scale, distributed GPU clusters to run high-performance AI and machine learning workloads.

  • Continuously improve infrastructure provisioning, management, and monitoring through automation.

  • Ensure the highest level of uptime and quality of service (QoS) through operational excellence, proactive monitoring, and incident resolution.

  • Support globally distributed cloud environments such as AWS, GCP, Azure, or OCI, as well as on-premises infrastructure.

  • Define and implement service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure infrastructure quality.

  • Write high-quality Root Cause Analysis (RCA) reports for production-level incidents and work towards preventing future occurrences.

  • Participate in the team's on-call rotation to support critical infrastructure.

  • Drive the evaluation and integration of new GPU technologies, such as GB200, and new cloud technologies to improve system performance.

What we need to see:

  • Minimum BS degree in Computer Science (or equivalent experience), with 7+ years of software engineering experience, including at least 3 years managing GPU clusters or similar high-performance computing environments.

  • Expertise in designing, deploying, and running production-level cloud services.

  • Proficiency with orchestration and containerization tools like Kubernetes, Docker, or similar.

  • Experience coding/scripting in at least two high-level programming languages (e.g., Python, Go, Ruby).

  • Strong proficiency with Linux operating systems and TCP/IP fundamentals.

  • Proficient in modern CI/CD techniques, GitOps, and Infrastructure as Code (IaC) using tools such as Terraform or Ansible.

  • Diligent with strong communication and documentation skills.

Ways to stand out from the crowd:

  • Experience managing large-scale Slurm and/or BCM deployments in production environments.

  • Expertise in modern container networking and storage architectures.

  • Proven track record of defining and driving operational excellence in highly distributed, high-performance environments.

The base salary range is 184,000 USD - 356,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.





About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.


