Senior Software Engineer, AI Resiliency

1 Month ago • 6 Years + • Artificial Intelligence • Game AI • $184,000 PA - $287,500 PA

Job Summary

NVIDIA seeks a Senior Software Engineer for AI Resiliency to lead development of AI software resiliency for powerful AI supercomputers. Responsibilities include implementing and optimizing software features (checkpoint-recovery, error detection, isolation, straggler/hang detection) for 100,000+ GPUs, contributing to large-scale distributed systems with C++ and Python, working on AI system error handling, and collaborating with various teams. The role involves developing and implementing robust tests, supporting production deployments, and ensuring seamless operation of AI training and inference workloads. The ideal candidate possesses strong experience in distributed systems, parallel programming, and fault tolerance.
Must have:
  • Proficiency in C++ and Python
  • 6+ years relevant experience
  • Understanding of distributed systems
  • AI framework familiarity (PyTorch, JAX/XLA)
  • Debugging and profiling tool experience
Good to have:
  • Experience with model training teams
  • CUDA, NCCL, or MPI experience
  • Checkpointing strategies knowledge
  • Large-scale AI cluster experience
  • Systems programming skills
Perks:
  • Equity
  • Benefits

Job Details

We are now looking for a Senior Software Engineer for AI Resiliency.

At NVIDIA, we are pushing the boundaries of what’s possible in AI. We are currently seeking a Senior Software Engineer to lead the development of AI software resiliency for the most powerful AI supercomputers in the world. As a member of our AI Software Resiliency team, you will play a pivotal role in defining and implementing critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs. Your expertise will be crucial in driving down cluster downtime towards zero, ensuring that our AI systems remain robust and reliable at all times.

What You’ll Be Doing:

  • Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at a massive scale, such as fast checkpoint-recovery, error detection, error isolation, and straggler/hang detection.

  • Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code. Enhance performance for AI workloads running on thousands of GPUs.

  • Fault Tolerance & Debugging: Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios. Assist in developing monitoring tools for proactive failure mitigation.

  • Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA.

  • Testing & Automation: Develop and implement tests to ensure robustness, scalability, and efficiency of resiliency mechanisms. Contribute to CI/CD pipelines to automate validation of AI workloads.

  • Support Production Deployments: Assist in debugging and tuning the performance of large-scale AI workloads in cloud and HPC environments, ensuring seamless operation of AI training and inference workloads.

What We Need to See:

  • A Bachelor’s, Master’s, or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience.

  • Proficiency in C++ and Python, with experience in writing efficient, high-performance code.

  • 6+ years of relevant experience.

  • Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments.

  • Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar.

  • Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight).

  • Excellent problem-solving skills and ability to work in a fast-paced, highly collaborative environment.

Ways to Stand Out From the Crowd:

  • Hands-on experience in training models or working with model training teams.

  • Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing, especially at extreme scale.

  • Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training.

  • Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads.

  • Strong systems programming skills and experience with low-level performance tuning.

As part of the AI Resiliency team at NVIDIA, you’ll work alongside world-class engineers solving some of the hardest challenges in AI infrastructure. You’ll have the opportunity to contribute directly to making AI training and inference more reliable, scalable, and efficient. If you're passionate about AI, distributed systems, and high-performance computing, we want to hear from you!

The base salary range is 184,000 USD - 287,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
