Senior RAS Architect - Datacenter CPU and SOC

1 Month ago • 12+ Years • Research & Development • $224,000 PA - $425,500 PA

Job Summary

Job Description

NVIDIA seeks a Senior RAS Architect to drive key RAS/resilience features for next-generation data center CPUs and SOCs. Responsibilities include defining memory RAS requirements, owning system RAS models and risk assessments, leading data analysis of RAS logs to refine testing, and collaborating with cross-functional teams throughout the product lifecycle. The ideal candidate will have 12+ years of hands-on experience in designing, testing, and benchmarking system RAS/resilience features in large compute or AI systems, expertise in HPC/AI architecture and cluster interconnect technologies, and strong problem-solving skills. The role involves architecting manufacturing RAS tests and methodologies and fostering a deep understanding of NVIDIA's AI hardware and software architecture.
Must have:
  • 12+ years of experience in system RAS/Resiliency
  • Proficient in compute system RAS/Resilience model theory and methodology
  • Proficient in HPC or AI system architecture
  • Strong problem-solving and troubleshooting expertise
Good to have:
  • HPC or MLPerf benchmarking experience
Perks:
  • Equity
  • Benefits

Job Details

For two decades, we have pioneered visual computing, the art and science of computer graphics. With our invention of the GPU, the engine of modern AI technologies, the field has expanded to encompass AI-powered video games, social networking and web search, IC and other product design, medical diagnosis, and scientific research. Today, visual computing is the critical computing engine for deep learning-based AI, including ChatGPT, and is becoming increasingly central to how people entertain themselves and interact. There has never been a more exciting time to join us and help take visual computing and AI into the next chapter.

We are looking for an Architect to drive key aspects of RAS/Resilience features for our next-generation products for AI applications. We expect you to bring deep knowledge and experience in RAS/Resilience testing, characterization, analysis, benchmarking, and risk assessment of CPU, DRAM, and AI cluster systems.

What you’ll be doing:

  • Drive memory RAS requirements for data-center CPUs.

  • Own the system RAS/Resilience models, benchmarking, and risk assessment.

  • Lead the data analysis of RAS/Resilience logs to refine, revise and overhaul test methodology and manufacturing flows; influence and drive software tools/infrastructure required for new product development, validation, and productization.

  • Work closely with architecture, hardware, software, and product engineering teams throughout the product development lifecycle.

  • Assess new hardware features and architect manufacturing RAS tests, flows, and methodologies.

  • Develop a deep understanding of NVIDIA's AI hardware and software architecture.

What we need to see:

  • BS or higher in EE, CE, CS, Mathematics, or equivalent experience.

  • 12+ years of proven hands-on experience in the design, testing, benchmarking, and risk assessment of system RAS/Resiliency features for large compute, AI, or HPC systems.

  • Proficient in Compute System RAS/Resilience model theory and methodology.

  • Proficient in HPC or AI system architecture and Cluster Interconnect technologies.

  • Proficient in using test equipment, Linux commands, and benchmark utilities to test and troubleshoot compute system RAS and Resiliency features.

  • Strong problem-solving and troubleshooting expertise, and experience institutionalizing root-cause analysis.

  • Self-initiative, strong interpersonal skills, and flexibility to adapt to new technologies.

  • Solid knowledge of or experience in HPC or MLPerf benchmarking is a plus.

NVIDIA is widely considered to be one of the technology world’s most desirable employers! We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!

The base salary range is 224,000 USD - 425,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

