Senior SRE Software Engineer, Storage and Data

3 Months ago • 5 Years + • DevOps

Job Summary

Job Description

As a Senior SRE Software Engineer at NVIDIA, you'll be responsible for ensuring the reliability, availability, and performance of storage infrastructures for the DGX Cloud platform. This involves developing strategies for redundancy and disaster recovery, continuously analyzing and optimizing storage systems, developing automation scripts, implementing monitoring and alerting systems, and participating in on-call rotations. You'll collaborate with cross-functional teams, troubleshoot issues, conduct root cause analysis, and work with AI/ML workloads. The role requires expertise in storage systems, SRE principles, and automation, along with experience with various tools and technologies like Ansible, Python, AWS S3, and monitoring stacks.
Must have:
  • 5+ years experience
  • Storage system administration
  • SRE experience
  • Automation scripting
  • Linux system administration
  • Problem-solving skills
  • Collaboration skills
Good to have:
  • Experience with OpenStack Swift, AWS S3, DDN, Lustre
  • Strong Linux and network troubleshooting skills
  • Experience with Kubernetes, OpenStack, Docker
  • Experience with Ansible, Chef, Puppet, Terraform
Perks:
  • Competitive salary
  • Generous benefits package

Job Details

SRE at NVIDIA ensures that our DGX Cloud platform continues to be reliable and performant to meet the needs of our users. You will play a critical role in ensuring the reliability, availability, and performance of storage infrastructures for NVIDIA DGX GPU cloud platforms. You will collaborate with cross-functional teams to design, build, and maintain scalable and fault-tolerant storage solutions that support our mission-critical applications and services. Your expertise in storage systems and reliability engineering will be instrumental in minimizing downtime, improving system efficiency, and enhancing the overall user experience.

SRE is also a mindset and a set of engineering approaches to running efficient production systems, with a focus on eliminating manual work through modern automation practices and performance tuning. We promote self-direction to work on meaningful projects while striving to build an environment that provides the support and mentorship needed to learn and grow.

What You Will Be Doing:

  • Develop strategies to ensure the reliability and availability of storage systems, including redundancy, failover, and disaster recovery plans.

  • Continuously analyze and fine-tune storage systems for optimal performance, including throughput optimization, caching, and latency reduction. Identify and resolve performance bottlenecks to enhance overall system efficiency.

  • Develop and maintain automation scripts and tools to streamline storage provisioning, configuration, and maintenance tasks.

  • Implement monitoring and alerting systems to proactively identify and address issues.

  • Participate in on-call rotation to respond promptly to storage-related incidents, conduct root cause analysis of outages, and implement preventive measures.

  • Collaborate with cross-functional teams, including Compute SRE, development, and networking, to ensure seamless integration of large-scale storage solutions.

  • Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.

What We Need To See:

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), with 5+ years of equivalent practical experience.

  • Proven experience in storage system administration and site reliability engineering.

  • Experience with Git, RESTful APIs, Linux service operation, networking, complexity analysis, AWS S3, software design, and maintaining large-scale Linux-based systems.

  • Experience in one or more of the following languages and tools: Ansible, Bash, Python, Go, YAML, Java.

  • Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.

  • Experience in using observability and tracing-related tools like InfluxDB, Prometheus, the Elastic (OpenSearch) stack, and Grafana.

Ways to stand out from the crowd:

  • Experience with storage solutions like OpenStack Swift (object), AWS S3 (object), DDN, and Lustre.

  • Strong Linux and network troubleshooting skills, demonstrated through hands-on use of diagnostic commands and tools.

  • A demonstrated SRE mindset and customer-first approach, with a focus on customer satisfaction and a passion for ensuring customer success.

  • Interest in crafting, analyzing, and fixing large-scale distributed systems. Strong debugging skills with a systematic problem-solving approach to identify complex problems.

  • Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

With competitive salaries and a generous benefits package, NVIDIA is widely considered to be one of the most desirable employers in the world. We have some of the most brilliant and talented people in the world working for us. If you are creative, autonomous and love a challenge, we want to hear from you. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
