Senior Site Reliability Engineer

1 Month ago • 10 Years + • DevOps • $168,000 PA - $322,000 PA

Job Summary

Job Description

As a Senior Site Reliability Engineer at NVIDIA, you will be responsible for ensuring the smooth operation of brand-new technologies. This involves owning solutions, collaborating with cross-functional teams, improving solution provisioning through automation, identifying areas for service resiliency improvements, detecting and resolving performance issues, conducting capacity planning, participating in incident reviews, and delivering SRE solutions in a multi-cloud environment (AWS, GCP, On-prem). You will ensure high uptime and QoS for internal customers and participate in on-call rotations. The role demands expertise in Kubernetes, CI/CD, IaC, Linux, and cloud services, along with strong coding skills (Python, Go, Ruby, or Groovy).
Must have:
  • 10+ years of experience building and supporting critical services
  • Kubernetes administration proficiency
  • CI/CD and IaC expertise
  • Deep Linux OS and TCP/IP understanding
  • Proficiency in at least one major cloud provider (AWS, GCP, Azure)
  • 5+ years of coding/scripting experience in at least two languages (Python, Go, Ruby, or Groovy)
Good to have:
  • Linux certification (RedHat, Oracle)
  • Large-scale Kubernetes deployment experience
  • Strong modern container networking and storage architecture skills
  • Well-known Cloud Certifications
  • Slurm/LSF experience
Perks:
  • Equity
  • Benefits

Job Details

Join our team in Santa Clara, CA, USA as a Senior Site Reliability Engineer. At NVIDIA, you'll be part of the team shaping the future of computing and guaranteeing the smooth operation of our brand-new technologies. Our mission is to leverage AI's power to build outstanding and pioneering solutions that have a significant impact on the world.

What you'll be doing:

  • Own the solutions you build, collaborating with cross-functional teams to successfully implement them.

  • Collaborate with various teams in a fast-paced environment to ensure seamless project completion.

  • Continuously improve solution provisioning and management through automation.

  • Identify areas to improve service resiliency using industry-standard practices.

  • Detect performance issues and recommend solutions to maintain world-class service quality.

  • Conduct capacity management and planning to meet ongoing operational needs.

  • Participate in incident reviews, assist in root cause identification, and write RCA reports.

  • Deliver SRE solutions in a globally distributed, multi-cloud hybrid environment - AWS, GCP, and On-prem.

  • Ensure the highest level of uptime and Quality of Service (QoS) for internal customers through operational excellence.

  • Participate in the team's on-call rotation.

What we need to see:

  • B.S. degree in Computer Science or a related technical field (or equivalent experience), with over 10 years of experience building and supporting critical services.

  • Proficiency in Kubernetes administration, modern CI/CD techniques and Infrastructure as Code (IaC).

  • Deep understanding of Linux operating systems and TCP/IP fundamentals.

  • Expertise with at least one major cloud service provider - AWS, GCP, Azure.

  • Demonstrated proficiency with end-to-end SRE capabilities and observability.

  • Proficient in monitoring, metrics gathering, APM, container management, and log collection tools.

  • 5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Ruby, or Groovy.

  • Creative problem solver with excellent debugging skills and great communication and documentation abilities.

Ways to stand out from the crowd:

  • Linux certification from a well-known vendor - RedHat, Oracle, etc.

  • Prior experience managing large-scale Kubernetes deployments in production.

  • Strong skills in modern container networking and storage architecture.

  • Well-known Cloud Certification(s).

  • Hands-on experience working with Slurm/LSF environments.

The base salary range is 168,000 USD - 322,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
