Site Reliability Engineer - GPU Cloud

3 Months ago • 3 Years + • DevOps

Job Summary

Job Description

NVIDIA seeks a Site Reliability Engineer (SRE) for its GPU cloud platform, which supports internal R&D teams and external AI/ML customers. The SRE will manage and improve the reliability of a large-scale GPU infrastructure (thousands of nodes) by automating deployments, monitoring, and analytics. Responsibilities include designing, building, and maintaining tools and services, contributing to infrastructure automation, and providing customer support on a rotation. The role demands expertise in large-scale distributed systems and cloud infrastructure (Kubernetes, Terraform), along with strong programming skills (Go/Python/Perl/C++/Java/C).
Must have:
  • 3+ years experience in large-scale distributed systems
  • Proficiency in Go/Python/Perl/C++/Java/C
  • Strong command of Terraform, Kubernetes, cloud infra
  • Excellent debugging and troubleshooting skills
Good to have:
  • Ability to decompose complex requirements
  • Unit testing and benchmarking experience
  • Algorithm design for scaling and availability

Job Details

NVIDIA has been a pioneer in Accelerated Computing and has been paving the way with innovations in Generative AI, Large Language Models (LLMs), Autonomous Vehicles, Robotics, High-Performance Computing (HPC), Gaming/Visualization, and Edge/Data Center/Cloud Computing. NVIDIA provides automakers, research institutions, cloud providers, large companies, and start-ups with the power and flexibility to develop and deploy breakthrough artificial intelligence systems.

We are a fast-paced, dynamic, and dedicated Site Reliability Engineering (SRE) team working at the forefront of the latest science and technology trends in cloud and on-prem infrastructure management for High-Performance and Distributed Computing. Working closely with the development teams, we provide hosted solutions for our internal and external customers. Are you passionate about infrastructure, and do you enjoy working on and resolving intricate, multi-faceted issues? Are you eager to have your hands on the engines of the next generation of cloud services? Do you get a buzz from identifying and eliminating toil, and from designing and coding innovative solutions that address the needs of a whole organization? If so, read on and give us a shout.

What you’ll be doing:

The NVIDIA GPU cloud is a hosted platform for internal R&D teams and external AI/ML stack customers. This SRE team is accountable for the setup, management, reliability, and availability of this infrastructure, which spans thousands of GPU nodes.

As an SRE, you are responsible for:

  • Providing scalable and robust service-oriented infrastructure automation, monitoring, and analytics solutions for NVIDIA's on-prem and cloud-based GPU infrastructure.

  • Owning the whole life cycle of new tools and services, from requirements gathering and design documentation to validation and deployment.

  • Providing customer support on a rotation basis.

What we need to see:

  • A minimum of 3 years of experience in automating and managing large-scale distributed system software deployments in on-prem/cloud environments.

  • Proficiency in at least one of the following languages: Go, Python, Perl, C++, Java, or C.

  • Strong command of Terraform, Kubernetes, and cloud infrastructure administration.

  • Excellent debugging and troubleshooting skills.

  • Excellent interpersonal and written communication skills.

  • B.E. in Computer Science or a related technical field involving coding (e.g., physics or mathematics).

Ways to stand out from the crowd:

  • Ability to decompose complex requirements into simple tasks and to reuse available solutions to implement most of those tasks.

  • Unit testing and benchmarking are an integral part of your code.

  • Ability to reason about and choose the most suitable algorithm to meet scaling and availability challenges.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
