Senior Site Reliability Engineer

2 Months ago • 10 Years + • DevOps • $168,000 PA - $322,000 PA

Job Summary

Job Description

NVIDIA seeks a Senior Site Reliability Engineer to guarantee the smooth operation of its cutting-edge technologies. Responsibilities include owning solution implementation, collaborating with cross-functional teams, automating provisioning and management, improving service resiliency, detecting and resolving performance issues, conducting capacity planning, participating in incident reviews, and delivering SRE solutions in a multi-cloud environment (AWS, GCP, and on-prem). The role requires maintaining high uptime and QoS for internal customers and participating in the on-call rotation.
Must have:
  • 10+ years of experience building and supporting critical services
  • Kubernetes administration, CI/CD, IaC proficiency
  • Linux OS and TCP/IP expertise
  • Experience with at least one major cloud provider (AWS, GCP, Azure)
  • 5+ years of coding/scripting in at least two languages (Python, Go, Ruby, or Groovy)
  • Excellent debugging and communication skills
Good to have:
  • Linux certification
  • Large-scale Kubernetes deployment experience
  • Modern container networking and storage architecture skills
  • Cloud certifications
  • Slurm/LSF environment experience
Perks:
  • Equity
  • Benefits

Job Details

Join our team in Santa Clara, CA, USA as a Senior Site Reliability Engineer. At NVIDIA, you'll be part of the team shaping the future of computing and guaranteeing the smooth operation of our brand-new technologies. Our mission is to leverage AI's power to build outstanding and pioneering solutions that have a significant impact on the world.

What you'll be doing:

  • Own the solutions you build, collaborating with cross-functional teams to successfully implement them.

  • Collaborate with various teams in a fast-paced environment to ensure seamless project completion.

  • Continuously improve solution provisioning and management through automation.

  • Identify areas to improve service resiliency using industry-standard practices.

  • Detect performance issues and recommend solutions to maintain world-class service quality.

  • Conduct capacity management and planning to meet ongoing operational needs.

  • Participate in incident reviews, assist in root cause identification, and write RCA reports.

  • Deliver SRE solutions in a globally distributed, multi-cloud hybrid environment (AWS, GCP, and on-prem).

  • Ensure the highest level of uptime and Quality of Service (QoS) for internal customers through operational excellence.

  • Participate in the team's on-call rotation.

What we need to see:

  • B.S. degree in Computer Science or related technical field (or equivalent experience) with over 10 years in building and supporting critical services.

  • Proficiency in Kubernetes administration, modern CI/CD techniques and Infrastructure as Code (IaC).

  • Deep understanding of Linux operating systems and TCP/IP fundamentals.

  • Expertise with at least one major cloud service provider (AWS, GCP, or Azure).

  • Demonstrated proficiency with end-to-end SRE capabilities and observability.

  • Proficient in monitoring, metrics gathering, APM, container management, and log collection tools.

  • 5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Ruby, or Groovy.

  • Creative problem solver with excellent debugging skills and great communication and documentation abilities.

Ways to stand out from the crowd:

  • Linux certification from a well-known vendor such as Red Hat or Oracle.

  • Prior experience managing large-scale Kubernetes deployment in production.

  • Strong skills in modern container networking and storage architecture.

  • Well-known Cloud Certification(s).

  • Hands-on experience working with Slurm/LSF environments.

The base salary range is 168,000 USD - 322,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.





About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
