Senior RAS Architect - Datacenter CPU and SOC

1 Day ago • 12+ Years • Research & Development • $224,000 PA - $425,500 PA

Job Summary

Job Description

NVIDIA seeks a Senior RAS Architect to drive key RAS/resilience features for next-generation data center CPUs and SOCs. Responsibilities include defining memory RAS requirements, owning system RAS models and risk assessments, leading data analysis of RAS logs to refine testing, and collaborating with cross-functional teams throughout the product lifecycle. The ideal candidate will have 12+ years of hands-on experience in designing, testing, and benchmarking system RAS/resilience features in large compute or AI systems, expertise in HPC/AI architecture and cluster interconnect technologies, and strong problem-solving skills. The role involves architecting manufacturing RAS tests and methodologies and fostering a deep understanding of NVIDIA's AI hardware and software architecture.
Must have:
  • 12+ years of experience in system RAS/Resiliency
  • Proficient in Compute System RAS/Resilience models
  • Proficient in HPC or AI system architecture
  • Strong problem-solving and troubleshooting expertise
Good to have:
  • HPC or MLPerf benchmarking experience
Perks:
  • Equity
  • Benefits

Job Details

For two decades, we have pioneered visual computing, the art and science of computer graphics. With our invention of the GPU, the engine of modern AI technologies, the field has expanded to encompass AI-powered video games, social networking and web search, IC and other product design, medical diagnosis, and scientific research. Today, visual computing is the critical computing engine for deep learning-based AI, including ChatGPT, and is becoming increasingly central to how people entertain and interact. There has never been a more exciting time to join us and help take visual computing and AI into the next chapter. We are looking for an Architect to drive key aspects of RAS/Resilience features for our next-generation products for AI applications. We expect you to bring deep knowledge and experience in RAS/Resilience testing, characterization, analysis, benchmarking, and risk assessment of CPU, DRAM, and AI cluster systems.

What you’ll be doing:

  • Drive memory RAS requirements for data-center CPUs.

  • Own the system RAS/Resilience models, benchmarking, and risk assessment.

  • Lead the data analysis of RAS/Resilience logs to refine, revise, and overhaul test methodology and manufacturing flows; influence and drive the software tools and infrastructure required for new product development, validation, and productization.

  • Work closely with architecture, hardware, software, and product engineering teams throughout the product development lifecycle.

  • Be ready to be challenged to assess new hardware features and to architect manufacturing RAS tests, flows, and methodologies.

  • Nurture a deep understanding of NVIDIA's AI hardware and software architecture.

What we need to see:

  • BS or higher in EE, CE, CS, Mathematics, or equivalent experience.

  • 12+ years of proven hands-on experience in the design, testing, benchmarking, and risk assessment of system RAS/Resiliency features of large compute, AI, or HPC systems.

  • Proficient in Compute System RAS/Resilience model theory and methodology.

  • Proficient in HPC or AI system architecture and cluster interconnect technologies.

  • Proficient in using test equipment, Linux commands, and benchmark utilities to test and troubleshoot compute system RAS and Resiliency features.

  • Strong problem-solving and troubleshooting expertise, including institutionalizing root-cause analysis.

  • Self-motivated, with strong interpersonal skills and the flexibility to adapt to new technologies.

  • Solid knowledge of or experience in HPC or MLPerf benchmarking is a plus.

NVIDIA is widely considered to be one of the technology world’s most desirable employers! We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!

The base salary range is 224,000 USD - 425,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.