Senior Software Engineer, AI Resiliency

NVIDIA

California, United States (On-site)

3 months ago • 6+ years • Game AI • $184,000 - $287,500 per year

Job Summary

Job Description

NVIDIA seeks a Senior Software Engineer for AI Resiliency to lead development of AI software resiliency for powerful AI supercomputers. Responsibilities include implementing and optimizing software features (checkpoint-recovery, error detection, isolation, straggler/hang detection) for 100,000+ GPUs, contributing to large-scale distributed systems with C++ and Python, working on AI system error handling, and collaborating with various teams. The role involves developing and implementing robust tests, supporting production deployments, and ensuring seamless operation of AI training and inference workloads. The ideal candidate possesses strong experience in distributed systems, parallel programming, and fault tolerance.

Must have:

Proficiency in C++ and Python
6+ years relevant experience
Understanding of distributed systems
AI framework familiarity (PyTorch, JAX/XLA)
Debugging and profiling tool experience

Good to have:

Experience with model training teams
CUDA, NCCL, or MPI experience
Checkpointing strategies knowledge
Large-scale AI cluster experience
Systems programming skills

Perks:

Equity
Benefits

8 skills required

tensorflow

ci-cd

nvidia-nsight

python

pytorch

cuda

scalability

problem-solving

Job Details

We are now looking for a Senior Software Engineer for AI Resiliency.

At NVIDIA, we are pushing the boundaries of what’s possible in AI. We are currently seeking a Senior Software Engineer to lead the development of AI software resiliency for the most powerful AI supercomputers in the world. As a member of our AI Software Resiliency team, you will play a pivotal role in defining and implementing critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs. Your expertise will be crucial in driving down cluster downtime towards zero, ensuring that our AI systems remain robust and reliable at all times.

What You’ll Be Doing:

Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at a massive scale, such as fast checkpoint-recovery, error detection, error isolation, and straggler/hang detection.
Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code. Enhance performance for AI workloads running on thousands of GPUs.
Fault Tolerance & Debugging: Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios. Assist in developing monitoring tools for proactive failure mitigation.
Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA.
Testing & Automation: Develop and implement tests to ensure robustness, scalability, and efficiency of resiliency mechanisms. Contribute to CI/CD pipelines to automate validation of AI workloads.
Support Production Deployments: Assist in debugging and performance tuning large-scale AI workloads in cloud and HPC environments, ensuring seamless operation of AI training and inference workloads.

What We Need to See:

You've achieved a Bachelor’s, Master’s or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
Proficiency in C++ and Python, with experience in writing efficient, high-performance code.
6+ years of relevant experience.
Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments.
Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar.
Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight).
Excellent problem-solving skills and ability to work in a fast-paced, highly collaborative environment.

Ways to Stand Out From the Crowd:

Hands-on experience in training models or working with model training teams.
Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing, especially at extreme scale.
Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training.
Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads.
Strong systems programming skills and experience with low-level performance tuning.

As part of the AI Resiliency team at NVIDIA, you’ll work alongside world-class engineers solving some of the hardest challenges in AI infrastructure. You’ll have the opportunity to contribute directly to making AI training and inference more reliable, scalable, and efficient. If you're passionate about AI, distributed systems, and high-performance computing, we want to hear from you!

The base salary range is 184,000 USD - 287,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

NVIDIA

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
