Senior Site Reliability Engineer

4 Months ago • 10 Years + • Devops • $168,000 PA - $322,000 PA

Job Summary

Job Description

NVIDIA seeks a Senior Site Reliability Engineer to guarantee the smooth operation of its cutting-edge technologies. Responsibilities include owning solution implementation, collaborating with cross-functional teams, automating provisioning and management, improving service resiliency, detecting and resolving performance issues, conducting capacity planning, participating in incident reviews, and delivering SRE solutions in a multi-cloud environment (AWS, GCP, on-prem). The role also requires ensuring high uptime and QoS for internal customers and participating in on-call rotations.
Must have:
  • 10+ years of experience building and supporting critical services
  • Kubernetes administration, CI/CD, IaC proficiency
  • Linux OS and TCP/IP expertise
  • Experience with at least one major cloud provider (AWS, GCP, Azure)
  • 5+ years of coding/scripting in at least two languages (Python, Go, Ruby, or Groovy)
  • Excellent debugging and communication skills
Good to have:
  • Linux certification
  • Large-scale Kubernetes deployment experience
  • Modern container networking and storage architecture skills
  • Cloud certifications
  • Slurm/LSF environment experience
Perks:
  • Equity
  • Benefits

Job Details

Join our team in Santa Clara, CA, USA as a Senior Site Reliability Engineer. At NVIDIA, you'll be part of the team shaping the future of computing and guaranteeing the smooth operation of our brand-new technologies. Our mission is to leverage AI's power to build outstanding and pioneering solutions that have a significant impact on the world.

What you'll be doing:

  • Own the solutions you build, collaborating with cross-functional teams to successfully implement them.

  • Collaborate with various teams in a fast-paced environment to ensure seamless project completion.

  • Continuously improve solution provisioning and management through automation.

  • Identify areas to improve service resiliency using industry-standard practices.

  • Detect performance issues and recommend solutions to maintain world-class service quality.

  • Conduct capacity management and planning to meet ongoing operational needs.

  • Participate in incident reviews, assist in root cause identification, and write RCA reports.

  • Deliver SRE solutions in a globally distributed, multi-cloud hybrid environment - AWS, GCP, and On-prem.

  • Ensure the highest level of uptime and Quality of Service (QoS) for internal customers through operational excellence.

  • Participate in the team's on-call rotation.

What we need to see:

  • B.S. degree in Computer Science or a related technical field (or equivalent experience), with over 10 years of experience building and supporting critical services.

  • Proficiency in Kubernetes administration, modern CI/CD techniques and Infrastructure as Code (IaC).

  • Deep understanding of Linux operating systems and TCP/IP fundamentals.

  • Expertise with at least one major cloud service provider - AWS, GCP, or Azure.

  • Demonstrated proficiency with end-to-end SRE capabilities and observability.

  • Proficient in monitoring, metrics gathering, APM, container management, and log collection tools.

  • 5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Ruby, or Groovy.

  • Creative problem solver with excellent debugging skills and great communication and documentation abilities.

Ways to stand out from the crowd:

  • Linux certification from a well-known vendor - RedHat, Oracle, etc.

  • Prior experience managing large-scale Kubernetes deployments in production.

  • Strong skills in modern container networking and storage architecture.

  • Well-known Cloud Certification(s).

  • Hands-on experience working with Slurm/LSF environments.

The base salary range is 168,000 USD - 322,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
