Senior Site Reliability Engineer - GPU Cloud

15 Minutes ago • 8 Years + • DevOps

Job Summary

Job Description

NVIDIA seeks a Senior Site Reliability Engineer to manage and enhance the reliability of its GPU cloud platform. This role involves automating infrastructure, implementing monitoring and analytics solutions, and providing customer support. Responsibilities include owning the lifecycle of new tools and services, from requirements to deployment. The ideal candidate has 8+ years of experience in large-scale distributed systems, proficiency in languages like Go/Python, and expertise with Terraform, Kubernetes, and cloud infrastructure. The role requires strong debugging, troubleshooting, and communication skills, along with a collaborative mindset.
Must have:
  • 8+ years experience in large-scale distributed systems
  • Proficiency in Go/Python/Perl/C++/Java/C
  • Expertise in Terraform, Kubernetes, cloud infra
  • Excellent debugging and troubleshooting skills
  • Experience in automating infrastructure
Good to have:
  • Ability to decompose complex requirements
  • Proven record of maintaining platform SLAs
  • Experience with unit testing and benchmarking
  • Ability to choose optimal algorithms for scaling

Job Details

NVIDIA has been a pioneer in Accelerated Computing and has been paving the way with innovations in Generative AI, Large Language Models (LLMs), Autonomous Vehicles, Robotics, High-Performance Computing (HPC), Gaming/Visualization, and Edge/Data Center/Cloud Computing. NVIDIA provides automakers, research institutions, cloud providers, large companies, and start-ups the power and flexibility to develop and deploy breakthrough artificial intelligence systems.

We are a fast-paced, dynamic, and dedicated Site Reliability Engineering (SRE) team working at the forefront of the latest science and technology trends in cloud and on-prem infrastructure management for High-Performance & Distributed Computing. Working closely with the development teams, we provide hosted solutions for our internal and external customers. Are you passionate about infrastructure, and do you enjoy working on and resolving intricate, multifaceted issues? Are you eager to have your hands on the engines of the next generation of cloud services? Do you get a buzz from identifying and eliminating toil, and from designing and coding innovative solutions that address the needs of a whole organization? If so, read on and give us a shout.

What you'll be doing:

The NVIDIA GPU cloud is a hosted platform for internal R&D teams and external AI/ML stack customers. This SRE team is accountable for the setup, management, reliability, and availability of this infrastructure, which spans thousands of GPU nodes.

As a senior SRE, you are responsible for:

  • Providing scalable and robust service-oriented infrastructure automation, monitoring, and analytics solutions for NVIDIA's on-prem and cloud-based GPU infrastructure.

  • Owning the whole life cycle of new tools and services, from requirements gathering through design documentation, validation, and deployment.

  • Providing customer support on a rotation basis.

What we need to see:

  • Minimum of 8 years of experience in automating and handling large-scale distributed system software deployments in on-prem/cloud environments.

  • Proficiency in any language - Go/Python/Perl/C++/Java/C.

  • Strong command of Terraform, Kubernetes, and cloud infrastructure administration.

  • Excellent debugging and troubleshooting skills.

  • Ability to design simple and reliable systems that can work without much support.

  • Outstanding teammate who can collaborate and influence in a multifaceted environment.

  • Excellent interpersonal and written communication skills.

  • M.Sc. or B.E. in Computer Science or a related technical field involving coding (e.g., physics or mathematics).

Ways to stand out from the crowd:

  • Ability to decompose complex requirements into simple tasks and reuse available solutions to implement most of those.

  • Proven record of maintaining platform SLAs through accurate resolutions.

  • Unit testing and benchmarking are an integral part of your code.

  • Ability to reason and choose the best possible algorithm to meet scaling and availability challenges.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.


