Senior SRE Software Engineer, Storage and Data

NVIDIA

Shanghai, China (On-site)

7 Months ago • 5 Years + • Devops

Job Summary

Job Description

As a Senior SRE Software Engineer, Storage and Data at NVIDIA, you'll ensure the reliability and performance of storage infrastructures for the DGX Cloud platform. Responsibilities include developing strategies for system reliability and availability, analyzing and optimizing storage systems for performance, developing automation scripts, implementing monitoring and alerting systems, participating in on-call rotations, collaborating with cross-functional teams, and working with AI/ML workloads. This role demands expertise in storage systems, reliability engineering, and automation. You'll be involved in troubleshooting, root cause analysis, and implementing preventive measures to minimize downtime and enhance user experience.

Must have:

5+ years experience
Storage system administration
Site reliability engineering
Automation scripting
Monitoring and alerting
Collaboration skills
Problem-solving skills

Good to have:

OpenStack Swift/AWS S3 experience
DDN or Lustre experience
Strong Linux & network troubleshooting
Kubernetes/OpenStack/Docker experience

15 skills required for this role

linux
aws
ansible
openstack
networking
cross-functional
problem-solving
java
bash
unity
github
kubernetes
puppet
restful-api
grafana

Job Details

SRE at NVIDIA ensures that our DGX Cloud platform continues to be reliable and performant to meet the needs of our users. You will play a critical role in ensuring the reliability, availability, and performance of storage infrastructure for NVIDIA DGX GPU cloud platforms. You will collaborate with cross-functional teams to design, build, and maintain scalable and fault-tolerant storage solutions that support our mission-critical applications and services. Your expertise in storage systems and reliability engineering will be instrumental in minimizing downtime, improving system efficiency, and enhancing the overall user experience.

SRE is also a mindset and a set of engineering approaches to running efficient production systems, with a focus on eliminating manual work through modern automation practices and performance tuning. We promote self-direction to work on meaningful projects while striving to build an environment that provides the support and mentorship needed to learn and grow.

What You Will Be Doing:

Develop strategies to ensure the reliability and availability of storage systems, including redundancy, failover, and disaster recovery plans.
Continuously analyze and fine-tune storage systems for optimal performance, including throughput optimization, caching, and latency reduction. Identify and resolve performance bottlenecks to enhance overall system efficiency.
Develop and maintain automation scripts and tools to streamline storage provisioning, configuration, and maintenance tasks.
Implement monitoring and alerting systems to proactively identify and address issues.
Participate in the on-call rotation to respond promptly to storage-related incidents, conduct root cause analysis of outages, and implement preventive measures.
Collaborate with cross-functional teams, including Compute SRE, development, and networking, to ensure seamless integration of large-scale storage solutions.
Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.

What We Need To See:

BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), with 5+ years equivalent practical experience.
Proven experience in storage system administration and site reliability engineering.
Experience with Git, RESTful APIs, Linux service operation, networking, complexity analysis, AWS S3, software design, and maintaining large-scale Linux-based systems.
Experience in one or more of the following languages: Ansible, Bash, Python, Go, YAML, Java.
Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.
Experience using observability and tracing tools such as InfluxDB, Prometheus, the Elastic (OpenSearch) stack, and Grafana.

Ways to stand out from the crowd:

Experience with storage solutions such as OpenStack Swift (object), AWS S3 (object), DDN, and Lustre.
Strong Linux and network troubleshooting skills, using a variety of commands and tools.
A demonstrated SRE mindset, a customer-first approach, and a passion for ensuring customer satisfaction and success.
Interest in crafting, analyzing, and fixing large-scale distributed systems. Strong debugging skills with a systematic problem-solving approach to identify complex problems.
Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

About The Company

NVIDIA

74 Active Jobs

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
