Senior SRE Software Engineer, Storage and Data

1 Month ago • 5 Years + • DevOps

Job Summary

Job Description

As a Senior SRE Software Engineer at NVIDIA, you'll be responsible for ensuring the reliability, availability, and performance of storage infrastructures for the DGX Cloud platform. This involves developing strategies for redundancy and disaster recovery, continuously analyzing and optimizing storage systems, developing automation scripts, implementing monitoring and alerting systems, and participating in on-call rotations. You'll collaborate with cross-functional teams, troubleshoot issues, conduct root cause analysis, and work with AI/ML workloads. The role requires expertise in storage systems, SRE principles, and automation, along with experience with various tools and technologies like Ansible, Python, AWS S3, and monitoring stacks.
Must have:
  • 5+ years experience
  • Storage system administration
  • SRE experience
  • Automation scripting
  • Linux system administration
  • Problem-solving skills
  • Collaboration skills
Good to have:
  • Experience with OpenStack Swift, AWS S3, DDN, Lustre
  • Strong Linux and network troubleshooting skills
  • Experience with Kubernetes, OpenStack, Docker
  • Experience with Ansible, Chef, Puppet, Terraform
Perks:
  • Competitive salary
  • Generous benefits package

Job Details

SRE at NVIDIA ensures that our DGX Cloud platform continues to be reliable and performant to meet the needs of our users. You will play a critical role in ensuring the reliability, availability, and performance of storage infrastructures for NVIDIA DGX GPU cloud platforms. You will collaborate with cross-functional teams to design, build, and maintain scalable and fault-tolerant storage solutions that support our mission-critical applications and services. Your expertise in storage systems and reliability engineering will be instrumental in minimizing downtime, improving system efficiency, and enhancing the overall user experience.

SRE is also a mindset and a set of engineering approaches to running efficient production systems, with a focus on eliminating manual work through modern automation practices and performance tuning. We promote self-direction to work on meaningful projects while striving to build an environment that provides the support and mentorship needed to learn and grow.

What You Will Be Doing:

  • Develop strategies to ensure the reliability and availability of storage systems, including redundancy, failover, and disaster recovery plans.

  • Continuously analyze and fine-tune storage systems for optimal performance, including throughput optimization, caching, and latency reduction. Identify and resolve performance bottlenecks to enhance overall system efficiency.

  • Develop and maintain automation scripts and tools to streamline storage provisioning, configuration, and maintenance tasks.

  • Implement monitoring and alerting systems to proactively identify and address issues.

  • Participate in the on-call rotation to respond promptly to storage-related incidents, conduct root cause analysis of outages, and implement preventive measures.

  • Collaborate with cross-functional teams, including Compute SRE, development, and networking, to ensure seamless integration of large-scale storage solutions.

  • Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.

What We Need To See:

  • BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), with 5+ years equivalent practical experience.

  • Proven experience in storage system administration and site reliability engineering.

  • Experience with Git, RESTful APIs, Linux service operation, networking, complexity analysis, AWS S3, software design, and maintaining large-scale Linux-based systems.

  • Experience in one or more of the following languages and tools: Ansible, Bash, Python, Go, YAML, Java.

  • Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.

  • Experience using observability and tracing tools such as InfluxDB, Prometheus, Grafana, and the Elastic (OpenSearch) stack.

Ways to stand out from the crowd:

  • Experience with storage solutions such as OpenStack Swift (object), AWS S3 (object), DDN, and Lustre.

  • Strong Linux and network troubleshooting skills, using standard command-line tools.

  • A demonstrated SRE mindset and customer-first approach, with a focus on customer satisfaction and a passion for ensuring customer success.

  • Interest in crafting, analyzing, and fixing large-scale distributed systems. Strong debugging skills with a systematic problem-solving approach to identify complex problems.

  • Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

With competitive salaries and a generous benefits package, NVIDIA is widely considered to be one of the most desirable employers in the world. We have some of the most brilliant and talented people in the world working for us. If you are creative, autonomous and love a challenge, we want to hear from you. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company's invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

