Senior SRE Software Engineer, Storage and Data

3 Months ago • 5 Years + • DevOps

Job Summary

Job Description

As a Senior SRE Software Engineer, Storage and Data at NVIDIA, you'll ensure the reliability and performance of storage infrastructures for the DGX Cloud platform. Responsibilities include developing strategies for system reliability and availability, analyzing and optimizing storage systems for performance, developing automation scripts, implementing monitoring and alerting systems, participating in on-call rotations, collaborating with cross-functional teams, and working with AI/ML workloads. This role demands expertise in storage systems, reliability engineering, and automation. You'll be involved in troubleshooting, root cause analysis, and implementing preventive measures to minimize downtime and enhance user experience.
Must have:
  • 5+ years experience
  • Storage system administration
  • Site reliability engineering
  • Automation scripting
  • Monitoring and alerting
  • Collaboration skills
  • Problem-solving skills
Good to have:
  • OpenStack Swift/AWS S3 experience
  • DDN or Lustre experience
  • Strong Linux & network troubleshooting
  • Kubernetes/OpenStack/Docker experience

Job Details

SRE at NVIDIA ensures that our DGX Cloud platform continues to be reliable and performant to meet the needs of our users. You will play a critical role in ensuring the reliability, availability, and performance of storage infrastructures for NVIDIA DGX GPU cloud platforms. You will collaborate with cross-functional teams to design, build, and maintain scalable and fault-tolerant storage solutions that support our mission-critical applications and services. Your expertise in storage systems and reliability engineering will be instrumental in minimizing downtime, improving system efficiency, and enhancing the overall user experience.

SRE is also a mindset and a set of engineering approaches to running efficient production systems, with a focus on eliminating manual work through modern automation practices and performance tuning. We promote self-direction to work on meaningful projects while striving to build an environment that provides the support and mentorship needed to learn and grow.

What You Will Be Doing:

  • Develop strategies to ensure the reliability and availability of storage systems, including redundancy, failover, and disaster recovery plans.
  • Continuously analyze and fine-tune storage systems for optimal performance, including throughput optimization, caching, and latency reduction. Identify and resolve performance bottlenecks to enhance overall system efficiency.
  • Develop and maintain automation scripts and tools to streamline storage provisioning, configuration, and maintenance tasks.
  • Implement monitoring and alerting systems to proactively identify and address issues.
  • Participate in the on-call rotation to respond promptly to storage-related incidents, conduct root cause analysis of outages, and implement preventive measures.
  • Collaborate with cross-functional teams, including Compute SRE, development, and networking, to ensure seamless integration of large-scale storage solutions.
  • Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.

What We Need To See:

  • BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), with 5+ years equivalent practical experience.
  • Proven experience in storage system administration and site reliability engineering.
  • Experience with Git, RESTful APIs, Linux service operation, networking, complexity analysis, AWS S3, software design, and maintaining large-scale Linux-based systems.
  • Experience in one or more of the following: Ansible, Bash, Python, Go, YAML, Java.
  • Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.
  • Experience using observability and tracing tools such as InfluxDB, Prometheus, Grafana, and the Elastic (OpenSearch) stack.

Ways to stand out from the crowd:

  • Experience with storage solutions such as OpenStack Swift (object), AWS S3 (object), DDN, and Lustre.
  • Strong Linux and network troubleshooting skills using standard command-line tools.
  • A demonstrated SRE mindset: a customer-first approach, a focus on customer satisfaction, and a passion for ensuring customer success.
  • Interest in crafting, analyzing, and fixing large-scale distributed systems. Strong debugging skills with a systematic problem-solving approach to identify complex problems.
  • Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.