Senior Site Reliability Engineer - GPU Clusters

NVIDIA

Texas, United States (On-site)

6 months ago • 7-10 years • DevOps • $184,000 - $356,500 per annum

Job Summary

Job Description

NVIDIA seeks a Senior Site Reliability Engineer to lead the design, deployment, and management of large-scale GPU clusters for AI workloads. Responsibilities include infrastructure provisioning, automation, ensuring high uptime, defining SLOs/SLIs, conducting RCAs, participating in on-call rotations, and integrating new GPU and cloud technologies. The ideal candidate possesses strong expertise in cloud services, Kubernetes, Docker, scripting languages (Python, Go, Ruby), Linux, CI/CD, and IaC tools (Terraform, Ansible). Experience with Slurm and BCM is a plus.

Must have:

7+ years of software engineering experience (including 3+ years managing GPU clusters)
Expertise in designing, deploying, and running production-level cloud services
Proficiency with Kubernetes, Docker
Proficiency in Linux OS and TCP/IP fundamentals
Proficient in CI/CD, GitOps, and IaC (Terraform or Ansible)

Good to have:

Experience managing large-scale Slurm and/or BCM deployments
Expertise in modern container networking and storage architectures

Perks:

Equity
Benefits

11 skills required for this role

ruby

ci-cd

kubernetes

azure

aws

python

docker

terraform

linux

ansible

networking

Job Details

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

We are seeking a highly skilled and experienced Staff Software Engineer to lead the design, deployment, and management of our large-scale GPU clusters. These clusters will power AI workloads across multiple teams and projects, making a significant impact on the future of machine learning and artificial intelligence at NVIDIA. Join our engineering team and collaborate with researchers, AI engineers, and infrastructure teams to ensure our GPU clusters perform efficiently, scale well, and remain reliable.

The ideal candidate has a passion for operational excellence, automation, and working in a multi-cloud environment. You will collaborate with a diverse and experienced team, constantly improving infrastructure provisioning and resiliency to ensure a high level of service availability.

What you will be doing:

Design, deploy and support large-scale, distributed GPU clusters to run high-performance AI and machine learning workloads.
Continuously improve infrastructure provisioning, management, and monitoring through automation.
Ensure the highest level of uptime and quality of service (QoS) through operational excellence, proactive monitoring, and incident resolution.
Support globally distributed cloud environments such as AWS, GCP, Azure, or OCI, as well as on-premises infrastructure.
Define and implement service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure infrastructure quality.
Write high-quality Root Cause Analysis (RCA) reports for production-level incidents and work towards preventing future occurrences.
Participate in the team's on-call rotation to support critical infrastructure.
Drive the evaluation and integration of new GPU technologies, such as GB200, and new cloud technologies to improve system performance.

What we need to see:

Minimum BS degree in Computer Science (or equivalent experience), with 7+ years of software engineering experience, including at least 3 years managing GPU clusters or similar high-performance computing environments.
Expertise in designing, deploying, and running production-level cloud services.
Proficiency with orchestration and containerization tools like Kubernetes, Docker, or similar.
Experience coding/scripting in at least two high-level programming languages (e.g., Python, Go, Ruby).
Strong proficiency with Linux operating systems and TCP/IP fundamentals.
Proficient in modern CI/CD techniques, GitOps, and Infrastructure as Code (IaC) using tools such as Terraform or Ansible.
Diligent with strong communication and documentation skills.

Ways to stand out from the crowd:

Experience managing large-scale Slurm and/or BCM deployments in production environments.
Expertise in modern container networking and storage architectures.
Proven track record of defining and driving operational excellence in highly distributed, high-performance environments.

The base salary range is 184,000 USD - 356,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

NVIDIA

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
