Senior Site Reliability Engineer - DGX Cloud

1 Week ago • 5 Years + • DevOps • $144,000 PA - $333,500 PA

Job Summary

Job Description

NVIDIA seeks a Senior Site Reliability Engineer to design, implement, and support operational aspects of large-scale Kubernetes clusters, focusing on performance, monitoring, and alerting. Responsibilities span the entire service lifecycle, from inception to refinement, ensuring high availability and uptime. This role involves system design consulting, developing software tools, capacity management, and incident response. The ideal candidate possesses 5+ years of experience in infrastructure automation, distributed systems, and cloud systems (private/public). Proficiency in languages like Python, Go, Perl, or Ruby, along with in-depth Linux, networking, and container knowledge, is essential. The role involves an on-call rotation and a systematic problem-solving approach.
Must have:
  • 5+ years experience
  • Infrastructure automation expertise
  • Distributed systems design
  • Kubernetes, OpenStack experience
  • Python, Go, Perl, or Ruby proficiency
  • Linux, Networking, Container knowledge
Good to have:
  • Large-scale distributed system experience
  • Public/Private cloud experience
  • Debugging and optimization skills
  • Automation of routine tasks
Perks:
  • Equity
  • Benefits

Job Details

Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline that designs, builds and maintains large-scale production systems with high efficiency and availability, using a combination of software and systems engineering practices. This highly specialized discipline demands knowledge across systems, networking, coding, databases, capacity management, continuous delivery and deployment, and open source cloud-enabling technologies such as Kubernetes and OpenStack. SRE at NVIDIA ensures that our internal and external-facing GPU cloud services run with the maximum reliability and uptime promised to users, while enabling developers to change existing systems through careful preparation and planning, keeping an eye on capacity, latency and performance. SRE is also a mindset and a set of engineering approaches to running better production systems and optimizations. Much of our software development focuses on eliminating manual work through automation, performance tuning and growing the efficiency of production systems. As SREs are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems and proactive identification of potential outages factor into the iterative improvement that is key to both product quality and interesting, dynamic day-to-day work.

SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What you'll be doing:

  • Design, implement and support operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging and alerting.

  • Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement.

  • Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews.

  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health.

  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.

  • Practice sustainable incident response and blameless postmortems.

  • Be part of an on-call rotation to support production systems.

What we need to see:

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.

  • 5+ years of experience.

  • Experience with infrastructure automation and distributed systems design, and with designing and developing tools for running large-scale private or public cloud systems in production.

  • Experience in one or more of the following: Python, Go, Perl or Ruby.

  • In-depth knowledge of Linux, networking and containers.

Ways to stand out from the crowd:

  • Interest in crafting, analyzing and fixing large-scale distributed systems.

  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.

  • Ability to debug and optimize code and automate routine tasks.

  • Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hard-working people in the world working for us. Are you creative and autonomous? Do you love a challenge? If so, we want to hear from you.

The base salary range is 144,000 USD - 333,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
