Senior Software Engineer, AI Resiliency

2 Weeks ago • 6 Years + • Artificial Intelligence • Game AI • $184,000 PA - $287,500 PA

Job Summary

Job Description

NVIDIA seeks a Senior Software Engineer for AI Resiliency to lead development of AI software resiliency for its powerful AI supercomputers (100,000+ GPUs). Responsibilities include implementing and optimizing software features for enhanced reliability (checkpoint-recovery, error detection/isolation, straggler/hang detection); contributing to large-scale distributed systems using C++ and Python; working on AI system error handling and silent data corruption (SDC); collaborating with engineers and researchers; developing and implementing robust tests; and supporting production deployments in cloud and HPC environments. This role demands expertise in distributed systems, parallel programming, fault tolerance, and AI frameworks like PyTorch and JAX/XLA.
Must have:
  • Proficiency in C++ and Python
  • 6+ years of relevant experience
  • Understanding of distributed systems
  • Familiarity with AI frameworks (PyTorch, JAX/XLA)
  • Experience with debugging and profiling tools
Good to have:
  • Experience training models
  • Experience with CUDA, NCCL, or MPI
  • Knowledge of checkpointing strategies
  • Experience with large-scale AI clusters
  • Strong systems programming skills
Perks:
  • Equity
  • Benefits

Job Details

We are now looking for a Senior Software Engineer for AI Resiliency.

At NVIDIA, we are pushing the boundaries of what’s possible in AI. We are currently seeking a Senior Software Engineer to lead the development of AI software resiliency for the most powerful AI supercomputers in the world. As a member of our AI Software Resiliency team, you will play a pivotal role in defining and implementing critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs. Your expertise will be crucial in driving down cluster downtime towards zero, ensuring that our AI systems remain robust and reliable at all times.

What You’ll Be Doing:

  • Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at a massive scale, such as fast checkpoint-recovery, error detection, error isolation, and straggler/hang detection.

  • Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code. Enhance performance for AI workloads running on thousands of GPUs.

  • Fault Tolerance & Debugging: Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios. Assist in developing monitoring tools for proactive failure mitigation.

  • Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA.

  • Testing & Automation: Develop and implement tests to ensure robustness, scalability, and efficiency of resiliency mechanisms. Contribute to CI/CD pipelines to automate validation of AI workloads.

  • Support Production Deployments: Assist in debugging and performance tuning large-scale AI workloads in cloud and HPC environments, ensuring seamless operation of AI training and inference workloads.

What We Need to See:

  • You've achieved a Bachelor’s, Master’s or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience.

  • Proficiency in C++ and Python, with experience in writing efficient, high-performance code.

  • 6+ years of relevant experience.

  • Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments.

  • Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar.

  • Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight).

  • Excellent problem-solving skills and ability to work in a fast-paced, highly collaborative environment.

Ways to Stand Out From the Crowd:

  • Hands-on experience in training models or working with model training teams.

  • Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing, especially at extreme scale.

  • Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training.

  • Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads.

  • Strong systems programming skills and experience with low-level performance tuning.

As part of the AI Resiliency team at NVIDIA, you’ll work alongside world-class engineers solving some of the hardest challenges in AI infrastructure. You’ll have the opportunity to contribute directly to making AI training and inference more reliable, scalable, and efficient. If you're passionate about AI, distributed systems, and high-performance computing, we want to hear from you!

The base salary range is 184,000 USD - 287,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

