Site Reliability Engineer - GPU Cloud

1 Month ago • 3 Years + • DevOps

Job Summary

Job Description

NVIDIA seeks a Site Reliability Engineer (SRE) for its GPU cloud platform, supporting internal R&D and external AI/ML customers. The SRE will manage and improve the reliability of a large-scale GPU infrastructure (1000+ nodes), automating deployments, monitoring, and analytics. Responsibilities include designing, building, and maintaining tools and services, contributing to infrastructure automation, and providing customer support on a rotation. This role demands expertise in large-scale distributed systems, cloud infrastructure (Kubernetes, Terraform), and strong programming skills (Go/Python/Perl/C++/Java/C).
Must have:
  • 3+ years experience in large-scale distributed systems
  • Proficiency in Go/Python/Perl/C++/Java/C
  • Strong command of Terraform, Kubernetes, cloud infra
  • Excellent debugging and troubleshooting skills
Good to have:
  • Ability to decompose complex requirements
  • Unit testing and benchmarking experience
  • Algorithm design for scaling and availability

Job Details

NVIDIA has been a pioneer in Accelerated Computing and has been paving the way with innovations in Generative AI, Large Language Models (LLMs), Autonomous Vehicles, Robotics, High-Performance Computing (HPC), Gaming/Visualization, and Edge/Data Center/Cloud Computing. NVIDIA provides automakers, research institutions, cloud providers, large companies and start-ups the power and flexibility to develop and deploy breakthrough artificial intelligence systems.

We are a fast-paced, dynamic and dedicated Site Reliability Engineering (SRE) team working at the forefront of the latest science and technology trends in cloud and on-prem infrastructure management for High-Performance & Distributed Computing. Working closely with the development teams, we provide hosted solutions for our internal and external customers. Are you passionate about infrastructure, and do you enjoy working on and resolving intricate, multi-faceted issues? Are you eager to have your hands on the engines of the next generation of cloud services? Do you get a buzz from identifying and eliminating toil, and from designing and coding innovative solutions that address the needs of a whole organization? If so, read on and give us a shout.

What you’ll be doing:

The NVIDIA GPU cloud is a hosted platform for internal R&D teams and external AI/ML stack customers. This SRE team is accountable for the setup, management, reliability and availability of this infrastructure, which spans thousands of GPU nodes.

As an SRE, you are responsible for:

  • Providing scalable and robust service-oriented infrastructure automation, monitoring and analytics solutions for NVIDIA's on-prem and cloud-based GPU infrastructure.

  • Owning the whole life cycle of new tools and services - from requirements gathering to design documentation, validation and deployment.

  • Providing customer support on a rotation basis.

What we need to see:

  • A minimum of 3 years of experience in automating and handling large-scale distributed system software deployments in on-prem/cloud environments.

  • Proficiency in any language - Go/Python/Perl/C++/Java/C.

  • Strong command of Terraform, Kubernetes and cloud infrastructure administration.

  • Excellent debugging and troubleshooting skills.

  • Excellent interpersonal and written communication skills.

  • B.E. in Computer Science or a related technical field involving coding (e.g., physics or mathematics).

Ways to stand out from the crowd:

  • Ability to decompose complex requirements into simple tasks and reuse existing solutions to implement most of them.

  • Unit testing and benchmarking are an integral part of your code.

  • Ability to reason and choose the best possible algorithm to meet scaling and availability challenges.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.


