Senior System Software Engineer, Distributed Systems - DGX Cloud

3 Months ago • 6 Years + • DevOps • $148,000 PA - $356,500 PA

Job Summary

Job Description

NVIDIA seeks a Senior System Software Engineer specializing in distributed systems for its DGX Cloud platform. The role involves designing, developing, and optimizing solutions for datacenter firmware and collaborating with hardware and software teams to ensure seamless integration across the system. Responsibilities include automating GPU asset provisioning, configuration, and lifecycle management across cloud providers, defining reliability and availability requirements, and driving failure analysis. The ideal candidate has strong Python programming skills on Linux, system-level expertise, familiarity with industry standards (SPI, I2C, PCIe, UEFI, PLDM), and experience with distributed systems. This is a full-time position with remote options.
Must have:
  • 6+ years experience with Python & Linux
  • Distributed systems understanding
  • System programming (Go/Python)
  • Familiarity with SPI, I2C, PCIe, UEFI, PLDM
  • Data structures & algorithms expertise
Good to have:
  • Machine check architecture knowledge
  • Linux server design, x86/ARM architecture
  • Experience with large-scale distributed systems
  • Cloud AI infrastructure operational excellence
Perks:
  • Equity
  • Benefits

Job Details

NVIDIA is hiring engineers to scale up its AI infrastructure. We expect you to have a strong programming background, a deep understanding of distributed systems, familiarity with software testing and deployment, and excellent communication and planning abilities. We also welcome out-of-the-box thinkers who can provide new ideas with a strong bias toward execution. Expect to be constantly challenged, improving, and evolving for the better. You and other engineers on this team will help advance NVIDIA's capacity to build and deploy leading infrastructure solutions for a broad range of AI-based applications that affect core data science. If you're creative, passionate about what you do, and love having fun, what are you waiting for? Apply today!

What you’ll be doing:

  • Design and architect a comprehensive platform that automates GPU asset provisioning, configuration, and lifecycle management across cloud providers.

  • Design, develop, test, debug, and optimize creative solutions for datacenter firmware throughout its lifecycle.

  • Work closely with hardware, software, infrastructure, and business teams to transform new firmware features from idea to reality.

  • Define server-level reliability, availability, and serviceability requirements in collaboration with customers such as CSPs, and deliver fault-resilient solutions at scale that meet customer expectations.

  • Collaborate with hardware, software, and firmware teams to drive failure analysis and large-scale solution deployment.

  • Work with engineering teams across NVIDIA to ensure your software integrates seamlessly from the hardware all the way up to the AI training applications.

What we need to see:

  • BS, MS, or PhD in EE/CS or a related field (or equivalent experience), with 6+ years of active development experience using Python as the primary programming language and Linux as the OS.

  • Highly motivated, with strong communication skills and the ability to work successfully with multi-functional teams, principals, and architects, and to coordinate effectively across organizational boundaries and geographies.

  • Familiarity with industry standards and specifications such as SPI, I2C, PCIe, UEFI and PLDM.

  • System knowledge of how platform management works, in areas such as BMC-BIOS communication, thermal management, power management, firmware update, device monitoring, and firmware security.

  • Expert-level knowledge of a systems programming language (Go, Python) and a solid understanding of data structures and algorithms.

  • Understanding of performance, security, and reliability in complex distributed systems. Familiarity with system-level architecture, data synchronization, fault tolerance, and state management.

Ways to stand out from the crowd:

  • In-depth understanding of how machine check architecture and error flows interact with system firmware/software.

  • Familiarity with Linux server design, x86/ARM system architecture, interconnects like PCI, and other I/O buses.

  • Proven operational excellence in designing and maintaining cloud AI infrastructure. Proficiency in architecting and running large-scale distributed systems, independent of cloud providers.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hard-working people in the world working for us. Are you creative and autonomous? Do you love a challenge? If so, we want to hear from you.

The base salary range is 148,000 USD - 356,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
