Senior SRE Software Engineer, Storage and Data

2 Months ago • 5 Years + • DevOps

Job Summary

Job Description

As a Senior SRE Software Engineer at NVIDIA, you'll be responsible for ensuring the reliability, availability, and performance of storage infrastructures for the DGX Cloud platform. This involves developing strategies for redundancy and disaster recovery, continuously analyzing and optimizing storage systems, developing automation scripts, implementing monitoring and alerting systems, and participating in on-call rotations. You'll collaborate with cross-functional teams, troubleshoot issues, conduct root cause analysis, and work with AI/ML workloads. The role requires expertise in storage systems, SRE principles, and automation, along with experience with various tools and technologies like Ansible, Python, AWS S3, and monitoring stacks.
Must have:
  • 5+ years experience
  • Storage system administration
  • SRE experience
  • Automation scripting
  • Linux system administration
  • Problem-solving skills
  • Collaboration skills
Good to have:
  • Experience with OpenStack Swift, AWS S3, DDN, Lustre
  • Strong Linux and network troubleshooting skills
  • Experience with Kubernetes, OpenStack, Docker
  • Experience with Ansible, Chef, Puppet, Terraform
Perks:
  • Competitive salary
  • Generous benefits package

Job Details

SRE at NVIDIA ensures that our DGX Cloud platform continues to be reliable and performant to meet the needs of our users. You will play a critical role in ensuring the reliability, availability, and performance of storage infrastructures for NVIDIA DGX GPU cloud platforms. You will collaborate with cross-functional teams to design, build, and maintain scalable and fault-tolerant storage solutions that support our mission-critical applications and services. Your expertise in storage systems and reliability engineering will be instrumental in minimizing downtime, improving system efficiency, and enhancing the overall user experience.

SRE is also a mindset and a set of engineering approaches to running efficient production systems, with a focus on eliminating manual work through modern automation practices and performance tuning. We promote self-direction to work on meaningful projects while striving to build an environment that provides the support and mentorship needed to learn and grow.

What You Will Be Doing:

  • Develop strategies to ensure the reliability and availability of storage systems, including redundancy, failover, and disaster recovery plans.

  • Continuously analyze and fine-tune storage systems for optimal performance, including throughput optimization, caching, and latency reduction. Identify and resolve performance bottlenecks to enhance overall system efficiency.

  • Develop and maintain automation scripts and tools to streamline storage provisioning, configuration, and maintenance tasks.

  • Implement monitoring and alerting systems to proactively identify and address issues.

  • Participate in the on-call rotation to respond promptly to storage-related incidents, conduct root cause analysis of outages, and implement preventive measures.

  • Collaborate with cross-functional teams, including Compute SRE, development, and networking, to ensure seamless integration of large-scale storage solutions.

  • Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.

What We Need To See:

  • BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), with 5+ years equivalent practical experience.

  • Proven experience in storage system administration and site reliability engineering.

  • Experience with Git, RESTful APIs, Linux service operation, networking, complexity analysis, AWS S3, software design, and maintaining large-scale Linux-based systems.

  • Experience with one or more of the following: Ansible, Bash, Python, Go, YAML, Java.

  • Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.

  • Experience using observability and tracing tools like InfluxDB, Prometheus, Grafana, and the Elastic (OpenSearch) stack.

Ways to stand out from the crowd:

  • Experience with storage solutions like OpenStack Swift (object), AWS S3 (object), DDN, and Lustre.

  • Strong Linux and network troubleshooting skills, including hands-on use of command-line diagnostic tools.

  • Demonstrated SRE mindset and customer-first approach, with a focus on customer satisfaction and a passion for ensuring customer success.

  • Interest in crafting, analyzing, and fixing large-scale distributed systems. Strong debugging skills with a systematic problem-solving approach to identify complex problems.

  • Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

With competitive salaries and a generous benefits package, NVIDIA is widely considered to be one of the most desirable employers in the world. We have some of the most brilliant and talented people in the world working for us. If you are creative, autonomous and love a challenge, we want to hear from you. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.