Senior Site Reliability Engineer

3 Months ago • 10+ Years • DevOps • $168,000 PA - $322,000 PA

Job Summary

Job Description

As a Senior Site Reliability Engineer at NVIDIA, you will be responsible for ensuring the smooth operation of brand-new technologies. This involves owning solutions, collaborating with cross-functional teams, improving solution provisioning through automation, identifying areas for service resiliency improvements, detecting and resolving performance issues, conducting capacity planning, participating in incident reviews, and delivering SRE solutions in a multi-cloud environment (AWS, GCP, On-prem). You will ensure high uptime and QoS for internal customers and participate in on-call rotations. The role demands expertise in Kubernetes, CI/CD, IaC, Linux, and cloud services, along with strong coding skills (Python, Go, Ruby, or Groovy).
Must have:
  • 10+ years of experience building and supporting critical services
  • Kubernetes administration proficiency
  • CI/CD and IaC expertise
  • Deep Linux OS and TCP/IP understanding
  • Proficiency in at least one major cloud provider (AWS, GCP, Azure)
  • 5+ years coding experience in at least two languages (Python, Go, Ruby, or Groovy)
Good to have:
  • Linux certification (RedHat, Oracle)
  • Large-scale Kubernetes deployment experience
  • Strong modern container networking and storage architecture skills
  • Well-known Cloud Certifications
  • Slurm/LSF experience
Perks:
  • Equity
  • Benefits

Job Details

Join our team in Santa Clara, CA, USA as a Senior Site Reliability Engineer. At NVIDIA, you'll be part of the team shaping the future of computing and guaranteeing the smooth operation of our brand-new technologies. Our mission is to leverage AI's power to build outstanding and pioneering solutions that have a significant impact on the world.

What you'll be doing:

  • Own the solutions you build, collaborating with cross-functional teams to successfully implement them.

  • Collaborate with various teams in a fast-paced environment to ensure seamless project completion.

  • Continuously improve solution provisioning and management through automation.

  • Identify areas to improve service resiliency using industry-standard practices.

  • Detect performance issues and recommend solutions to maintain world-class service quality.

  • Conduct capacity management and planning to meet ongoing operational needs.

  • Participate in incident reviews, assist in root cause identification, and write RCA reports.

  • Deliver SRE solutions in a globally distributed, multi-cloud hybrid environment - AWS, GCP, and On-prem.

  • Ensure the highest level of uptime and Quality of Service (QoS) for internal customers through operational excellence.

  • Participate in the team's on-call rotation.

What we need to see:

  • B.S. degree in Computer Science or a related technical field (or equivalent experience), with over 10 years of experience building and supporting critical services.

  • Proficiency in Kubernetes administration, modern CI/CD techniques, and Infrastructure as Code (IaC).

  • Deep understanding of Linux operating systems and TCP/IP fundamentals.

  • Expertise with at least one major cloud service provider - AWS, GCP, Azure.

  • Demonstrated proficiency with end-to-end SRE capabilities and observability.

  • Proficient in monitoring, metrics gathering, APM, container management, and log collection tools.

  • 5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Ruby, or Groovy.

  • Creative problem solver with excellent debugging skills and great communication and documentation abilities.

Ways to stand out from the crowd:

  • Linux certification from a well-known vendor - RedHat, Oracle, etc.

  • Prior experience managing large-scale Kubernetes deployment in production.

  • Strong skills in modern container networking and storage architecture.

  • Well-known Cloud Certification(s).

  • Hands-on experience working with Slurm/LSF environments.

The base salary range is 168,000 USD - 322,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
