Senior System Software Engineer, Distributed Systems - DGX Cloud

1 Month ago • 6 Years + • DevOps • $148,000 PA - $356,500 PA

Job Summary

Job Description

NVIDIA seeks a Senior System Software Engineer specializing in distributed systems for its DGX Cloud platform. The role involves designing, developing, and optimizing solutions for datacenter firmware, collaborating with hardware and software teams, and ensuring seamless integration across the system. Responsibilities include automating GPU asset provisioning, configuration, and lifecycle management across cloud providers, defining reliability and availability requirements, and driving failure analysis. The ideal candidate has strong Python programming skills on Linux, system-level expertise, familiarity with industry standards (SPI, I2C, PCIe, UEFI, PLDM), and experience with distributed systems. This is a full-time position with remote options.
Must have:
  • 6+ years experience with Python & Linux
  • Distributed systems understanding
  • System programming (Go/Python)
  • Familiarity with SPI, I2C, PCIe, UEFI, PLDM
  • Data structures & algorithms expertise
Good to have:
  • Machine check architecture knowledge
  • Linux server design, x86/ARM architecture
  • Experience with large-scale distributed systems
  • Cloud AI infrastructure operational excellence
Perks:
  • Equity
  • Benefits

Job Details

NVIDIA is hiring engineers to scale up its AI Infrastructure. We expect you to have a strong programming background, a deep understanding of distributed systems, familiarity with software testing and deployment, and excellent communication and planning abilities. We also welcome out-of-the-box thinkers who can provide new ideas with a strong bias toward execution. Expect to be constantly challenged, improving, and evolving for the better. You and other engineers on this team will help advance NVIDIA's capacity to build and deploy leading infrastructure solutions for a broad range of AI-based applications that affect core data science. If you're creative, passionate about what you do, and love having fun, what are you waiting for? Apply today!

What you’ll be doing:

  • Design and architect a comprehensive platform that automates GPU asset provisioning, configuration, and lifecycle management across cloud providers.

  • Design, develop, test, debug, and optimize creative solutions for datacenter firmware throughout its lifecycle.

  • Work closely with hardware, software, infrastructure, and business teams to transform new firmware features from idea to reality.

  • Define server-level reliability, availability, and serviceability requirements in collaboration with customers such as CSPs, and deliver fault-resilient solutions at scale that meet customer expectations.

  • Collaborate with hardware, software, and firmware teams to drive failure analysis and large-scale solution deployment.

  • Work with engineering teams across NVIDIA to ensure your software integrates seamlessly from the hardware all the way up to the AI training applications.

What we need to see:

  • BS, MS, or PhD in EE/CS or a related field (or equivalent experience), with 6+ years of active development experience using Python as the primary programming language on Linux.

  • Highly motivated, with strong communication skills and the ability to work successfully with multi-functional teams, principals, and architects, and to coordinate effectively across organizational boundaries and geographies.

  • Familiarity with industry standards and specifications such as SPI, I2C, PCIe, UEFI and PLDM.

  • System knowledge of how platform management works, in areas such as BMC-BIOS communication, thermal management, power management, firmware update, device monitoring, and firmware security.

  • Expert-level knowledge of a systems programming language (Go, Python) and a solid understanding of data structures and algorithms.

  • Understanding of performance, security, and reliability in complex distributed systems. Familiarity with system-level architecture, data synchronization, fault tolerance, and state management.

Ways to stand out from the crowd:

  • In-depth understanding of how machine check architecture and error flows interact with system firmware and software.

  • Familiarity with Linux server design, x86/ARM system architecture, interconnects like PCI, and other I/O buses.

  • Proven operational excellence in designing and maintaining cloud AI infrastructure. Proficiency in architecting and running large-scale distributed systems, independent of cloud providers.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hard-working people in the world working for us. Are you creative and autonomous? Do you love a challenge? If so, we want to hear from you.

The base salary range is 148,000 USD - 356,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

