Senior Site Reliability Engineer - GPU Clusters

3 Months ago • 7-10 Years • DevOps • $184,000 PA - $356,500 PA

Job Summary

Job Description

NVIDIA seeks a Senior Site Reliability Engineer to lead the design, deployment, and management of large-scale GPU clusters for AI workloads. Responsibilities include infrastructure provisioning, monitoring, and automation; ensuring high uptime and QoS; supporting a multi-cloud environment; defining SLOs/SLIs; conducting RCA; and participating in on-call rotations. The ideal candidate possesses strong expertise in Kubernetes, Docker, Linux, CI/CD, and IaC, and experience managing GPU clusters in production environments. They will collaborate with researchers, engineers, and infrastructure teams to optimize cluster performance and reliability.
Must have:
  • 7+ years of software engineering experience, including 3+ years managing GPU clusters
  • Expertise in designing, deploying, and running cloud services
  • Proficiency with Kubernetes, Docker, Python, Go, or similar
  • Strong Linux OS and TCP/IP knowledge
  • Proficient in CI/CD, GitOps, and IaC (Terraform or Ansible)
Good to have:
  • Experience with Slurm and/or BCM deployments
  • Expertise in container networking and storage
  • Proven track record of operational excellence in high-performance environments
Perks:
  • Equity
  • Benefits

Job Details

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence. We are seeking a highly skilled and experienced Senior Site Reliability Engineer to lead the design, deployment, and management of our large-scale GPU clusters. These clusters will power AI workloads across multiple teams and projects, making a significant impact on the future of machine learning and artificial intelligence at NVIDIA. Join our engineering team and collaborate with researchers, AI engineers, and infrastructure teams to ensure our GPU clusters perform efficiently, scale well, and remain reliable.

The ideal candidate has a passion for operational excellence, automation, and working in a multi-cloud environment. You will collaborate with a diverse and experienced team, constantly improving infrastructure provisioning and resiliency to ensure a high level of service availability.

What you will be doing:

  • Design, deploy and support large-scale, distributed GPU clusters to run high-performance AI and machine learning workloads.

  • Continuously improve infrastructure provisioning, management, and monitoring through automation.

  • Ensure the highest level of uptime and quality of service (QoS) through operational excellence, proactive monitoring, and incident resolution.

  • Support globally distributed cloud environments such as AWS, GCP, Azure, or OCI, as well as on-premises infrastructure.

  • Define and implement service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure infrastructure quality.

  • Write high-quality Root Cause Analysis (RCA) reports for production-level incidents and work towards preventing future occurrences.

  • Participate in the team's on-call rotation to support critical infrastructure.

  • Drive the evaluation and integration of new GPU technologies (such as GB200) and cloud technologies to improve system performance.

What we need to see:

  • Minimum BS degree in Computer Science (or equivalent experience), with 7+ years of software engineering experience, including at least 3 years managing GPU clusters or similar high-performance computing environments.

  • Expertise in designing, deploying, and running production-level cloud services.

  • Proficiency with orchestration and containerization tools like Kubernetes, Docker, or similar.

  • Experience coding/scripting in at least two high-level programming languages (e.g., Python, Go, Ruby).

  • Strong proficiency with Linux operating systems and TCP/IP fundamentals.

  • Proficient in modern CI/CD techniques, GitOps, and Infrastructure as Code (IaC) using tools such as Terraform or Ansible.

  • Diligence, along with strong communication and documentation skills.

Ways to stand out from the crowd:

  • Experience managing large-scale Slurm and/or BCM deployments in production environments.

  • Expertise in modern container networking and storage architectures.

  • Proven track record of defining and driving operational excellence in highly distributed, high-performance environments.

The base salary range is 184,000 USD - 356,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Similar Jobs

Larian Studios - DevOps Full-Stack Engineer

Larian Studios

Warsaw, Masovian Voivodeship, Poland (On-Site)
2 Months ago
ByteDance - Backend Software Engineer

ByteDance

San Jose, California, United States (On-Site)
4 Weeks ago
The Walt Disney Company - Sr Streaming Media Engineer

The Walt Disney Company

Glendale, California, United States (On-Site)
1 Month ago
The Walt Disney Company - Senior Security Engineer - Security Operations

The Walt Disney Company

Burbank, California, United States (Remote)
4 Days ago
NVIDIA - Senior Full-Stack Software Engineer

NVIDIA

Shanghai, Shanghai, China (On-Site)
3 Months ago
Microsoft - Principal Software Engineer

Microsoft

Warsaw, Masovian Voivodeship, Poland (On-Site)
1 Week ago
The Mill Adventure - Senior DevOps Engineer

The Mill Adventure

St. Julian's, Malta (Remote)
4 Weeks ago
Next Level Business Services - Cloud Architect

Next Level Business Services

Jersey City, New Jersey, United States (On-Site)
6 Months ago
Google - Technical Solutions Engineer, Apigee

Google

Maharashtra, India (On-Site)
1 Week ago
Moon Active - Site Reliability Engineer

Moon Active

Warsaw, Masovian Voivodeship, Poland (On-Site)
1 Week ago

Similar Skill Jobs

The Walt Disney Company - Senior Systems Reliability Operations Engineer

The Walt Disney Company

Mumbai, Maharashtra, India (On-Site)
5 Months ago
Sinch - System Engineer

Sinch

Uttar Pradesh, India (On-Site)
1 Month ago
GoMotive - Senior Software Engineer, Backend

GoMotive

Pakistan (Remote)
1 Month ago
Enphase Energy - Sr. Staff Engineer Cloud

Enphase Energy

Bengaluru, Karnataka, India (On-Site)
3 Months ago
Twitch - Software Engineer - Payments

Twitch

San Francisco, California, United States (On-Site)
1 Month ago
Microsoft - Member of Technical Staff - Backend Engineer, Product

Microsoft

Mountain View, California, United States (Hybrid)
1 Week ago
Velotio Technologies - Software Engineer (ROR)

Velotio Technologies

Maharashtra, India (Remote)
3 Weeks ago
Egnyte - Database Administrator

Egnyte

India (Remote)
2 Months ago
Bounteous - Product Manager, B2B

Bounteous

Bernards, New Jersey, United States (Hybrid)
6 Months ago
Enphase Energy - Sr. Software Engineer - Enlighten Cloud Backend

Enphase Energy

Bengaluru, Karnataka, India (On-Site)
3 Months ago

Jobs in Santa Clara, California, United States

Google - Software Developer III, Front End, Google Cloud AI

Google

Sunnyvale, California, United States (On-Site)
6 Days ago
Google - Data Transformation Lead, Media and Entertainment

Google

New York, New York, United States (On-Site)
1 Week ago
Beghou Consulting - Sr. Consultant

Beghou Consulting

Boston, Massachusetts, United States (Hybrid)
6 Months ago
Google - Senior Software Engineer, Machine Learning, Core

Google

Sunnyvale, California, United States (On-Site)
1 Week ago
NVIDIA - Senior Data Engineer, Cloud Operations Engineering

NVIDIA

California, United States (Remote)
3 Days ago
Evolution - Human Resources Generalist

Evolution

Philadelphia, Pennsylvania, United States (On-Site)
2 Days ago
ByteDance - Machine Learning Research Scientist, AI for Science

ByteDance

Seattle, Washington, United States (On-Site)
4 Months ago
Rockstar Games - Senior Graphic Designer

Rockstar Games

New York, New York, United States (On-Site)
7 Months ago
ByteDance - Tech Lead - Architect / Researcher - DPU

ByteDance

San Jose, California, United States (On-Site)
2 Months ago
Google - Senior Test Engineer

Google

San Bruno, California, United States (On-Site)
4 Days ago

DevOps Jobs

Saviynt - Sr. Principal Software Engineer - Privileged Access Management (PAM)

Saviynt

El Segundo, California, United States (Hybrid)
6 Months ago
Google - Technical Solutions Engineer, Infrastructure, Serverless

Google

Warsaw, Masovian Voivodeship, Poland (On-Site)
1 Week ago
Electronic Arts - DevOps Engineer II

Electronic Arts

Kuala Lumpur, Wilayah Persekutuan Kuala Lumpur, Malaysia (On-Site)
3 Weeks ago
Luxoft - Orchestrade - Azure infrastructure cloud Regular engineer

Luxoft

Poland, Ohio, United States (Remote)
5 Months ago
Gunzilla - DevOps/Build Engineer

Gunzilla

Kyiv, Kyiv City, Ukraine (On-Site)
1 Month ago
Luxoft - Principal/Lead GCP Cloud Modernization Engineer

Luxoft

New Delhi, Delhi, India (Remote)
4 Months ago
Tesla - Sr. Software Developer (PowerShell)

Tesla

North Holland, Netherlands (On-Site)
2 Months ago
Egnyte - Senior Build Engineer - Python - Jenkins

Egnyte

India (Remote)
3 Months ago
Trend Micro - (Sr.) Software Engineer in Linux

Trend Micro

Taipei City, Taiwan (On-Site)
6 Months ago
SmileGate - [AI Center] DevOps, Infrastructure Engineer

SmileGate

Seongnam-si, Gyeonggi-do, South Korea (On-Site)
4 Months ago

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

Santa Clara, California, United States (On-Site)

Yokne'am Illit, North District, Israel (On-Site)
