Senior SRE Software Engineer, Storage and Data

2 Months ago • 5 Years + • DevOps

Job Summary

Job Description

As a Senior SRE Software Engineer, Storage and Data at NVIDIA, you'll ensure the reliability and performance of storage infrastructures for the DGX Cloud platform. Responsibilities include developing strategies for system reliability and availability, analyzing and optimizing storage systems for performance, developing automation scripts, implementing monitoring and alerting systems, participating in on-call rotations, collaborating with cross-functional teams, and working with AI/ML workloads. This role demands expertise in storage systems, reliability engineering, and automation. You'll be involved in troubleshooting, root cause analysis, and implementing preventive measures to minimize downtime and enhance user experience.
Must have:
  • 5+ years experience
  • Storage system administration
  • Site reliability engineering
  • Automation scripting
  • Monitoring and alerting
  • Collaboration skills
  • Problem-solving skills
Good to have:
  • OpenStack Swift/AWS S3 experience
  • DDN or Lustre experience
  • Strong Linux & network troubleshooting
  • Kubernetes/OpenStack/Docker experience

Job Details

SRE at NVIDIA ensures that our DGX Cloud platform continues to be reliable and performant to meet the needs of our users. You will play a critical role in ensuring the reliability, availability, and performance of storage infrastructure for NVIDIA DGX GPU cloud platforms. You will collaborate with cross-functional teams to design, build, and maintain scalable and fault-tolerant storage solutions that support our mission-critical applications and services. Your expertise in storage systems and reliability engineering will be instrumental in minimizing downtime, improving system efficiency, and enhancing the overall user experience.

SRE is also a mindset and a set of engineering approaches to running efficient production systems, with a focus on eliminating manual work through modern automation practices and performance tuning. We promote self-direction to work on meaningful projects while striving to build an environment that provides the support and mentorship needed to learn and grow.

What You Will Be Doing:

  • Develop strategies to ensure the reliability and availability of storage systems, including redundancy, failover, and disaster recovery plans.
  • Continuously analyze and fine-tune storage systems for optimal performance, including throughput optimization, caching, and latency reduction. Identify and resolve performance bottlenecks to enhance overall system efficiency.
  • Develop and maintain automation scripts and tools to streamline storage provisioning, configuration, and maintenance tasks.
  • Implement monitoring and alerting systems to proactively identify and address issues.
  • Participate in the on-call rotation to respond promptly to storage-related incidents, conduct root cause analysis of outages, and implement preventive measures.
  • Collaborate with cross-functional teams, including Compute SRE, development, and networking, to ensure seamless integration of large-scale storage solutions.
  • Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.

What We Need To See:

  • BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), with 5+ years equivalent practical experience.
  • Proven experience in storage system administration and site reliability engineering.
  • Experience with Git, RESTful APIs, Linux service operation, networking, complexity analysis, AWS S3, software design, and maintaining large-scale Linux-based systems.
  • Experience in one or more of the following languages and tools: Ansible, Bash, Python, Go, YAML, Java.
  • Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.
  • Experience using observability and tracing tools such as InfluxDB, Prometheus, Grafana, and the Elastic (OpenSearch) stack.

Ways to stand out from the crowd:

  • Experience with storage solutions such as OpenStack Swift (object), AWS S3 (object), DDN, and Lustre.
  • Strong Linux and network troubleshooting skills, including hands-on use of diagnostic commands and tools.
  • A demonstrated SRE mindset and customer-first approach, with a focus on customer satisfaction and a passion for ensuring customer success.
  • Interest in crafting, analyzing, and fixing large-scale distributed systems. Strong debugging skills with a systematic problem-solving approach to identify complex problems.
  • Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

