Senior GPU Cluster Software Engineer

1 Month ago • 5 Years + • Research & Development

Job Summary

Job Description

Senior GPU Cluster Software Engineer responsible for building profiling solutions for large-scale real-world applications running on GPU compute clusters. The role involves architecting, designing, implementing, testing, deploying, and supporting large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities. A key focus is building internal profiling tools for ML/DL applications and analyzing failures and inefficiencies to improve GPU clusters and hardware. The engineer will also need to understand state-of-the-art improvements in ML/DL and work with application owners and research teams to add or improve profiling support for current and future features. The work environment is agile and fast-paced, requiring collaboration in a global setting.
Must have:
  • 5+ years software development (Python)
  • GitLab and CI/CD experience
  • Understanding of algorithms & data structures
  • Distributed system architecture knowledge
  • HPC GPU cluster & Slurm basics
  • ML concepts & terminology
  • SQL & NoSQL database experience
  • Distributed data pipeline, telemetry, visualization & alerting
Good to have:
  • Debugging HPC GPU clusters
  • Distributed LLM training experience
  • LLM training features & libraries (PyTorch, Megatron-LM, NCCL)
  • HPC schedulers (Slurm)
  • OpenTelemetry

Job Details

As a member of the System Software team, you'll be responsible for building profiling solutions for large-scale real-world applications running on GPU compute clusters, so that they run efficiently and the user experience improves for customers as well as for the engineers supporting the cluster. Much of our software development focuses on profiling a varied set of applications running on different GPU clusters, and on accurately measuring and displaying the user experience on these clusters with actionable inputs for customers and the engineers supporting them. Creating a fault-tolerant distributed system while minimizing data loss and limiting time spent on reactive operational work is key to product quality and dynamic day-to-day work. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What you'll be doing:

  • Work in an agile and fast-paced global environment to gather requirements, architect, design, implement, test, deploy, release, and support large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities, while meeting promised uptime

  • Build internal profiling tools for real-world ML/DL applications running on HPC GPU clusters, supporting failure and efficiency analysis that helps improve current and future generations of GPU clusters and associated hardware

  • Understand state-of-the-art improvements in the ML/DL domain, and work with various application owners and research teams to add or improve profiling support for current and potential future features

What we need to see:

  • BS or higher in Computer Science or a related field (or equivalent experience) and 5+ years of software development in Python

  • Experience with GitLab (or another source code management system): branching and release workflows, CI/CD pipelines, etc.

  • Solid understanding of algorithms, data structures, and runtime/space complexity

  • Experience working with distributed system software architecture

  • Basic understanding of HPC GPU clusters and Slurm

  • Basic understanding of machine learning concepts and terminology

  • Background with databases - SQL and NoSQL (Prometheus, Elasticsearch, OpenSearch, Redis, etc.)

  • Experience with distributed data pipelines, telemetry, visualization (Kibana, Grafana, etc.), and alerting (PagerDuty, etc.)

Ways to stand out from the crowd:

  • Experience debugging functional and performance issues in HPC GPU clusters

  • Background in running and instrumenting distributed LLM training on a multi-GPU HPC cluster

  • Knowledge of LLM training features and libraries - checkpointing, parallelism, PyTorch, Megatron-LM, NCCL

  • Experience with HPC schedulers such as Slurm

  • Background with OpenTelemetry


About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

