Senior GPU Cluster Software Engineer

2 Months ago • 5 Years + • Research & Development

Job Summary

Job Description

Senior GPU Cluster Software Engineer responsible for building profiling solutions for large-scale real-world applications running on GPU compute clusters. The role involves architecting, designing, implementing, testing, deploying, and supporting large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities. A key focus is building internal profiling tools for ML/DL applications and analyzing failures and inefficiencies to improve GPU clusters and hardware. The engineer will also need to follow state-of-the-art improvements in ML/DL and work with application owners and research teams to address profiling needs for current and future features. The work environment is agile and fast-paced, requiring collaboration in a global setting.
Must have:
  • 5+ years software development (Python)
  • GitLab and CI/CD experience
  • Understanding of algorithms & data structures
  • Distributed system architecture knowledge
  • HPC GPU cluster & Slurm basics
  • ML concepts & terminology
  • SQL & NoSQL database experience
  • Distributed data pipeline, telemetry, visualization & alerting
Good to have:
  • Debugging HPC GPU clusters
  • Distributed LLM training experience
  • LLM training features & libraries (PyTorch, Megatron-LM, NCCL)
  • HPC schedulers (Slurm)
  • OpenTelemetry

Job Details

As a member of the System Software team, you'll be responsible for building profiling solutions for large-scale real-world applications running on GPU compute clusters, making them run efficiently and improving the experience for customers as well as the engineers supporting the cluster. Much of our software development focuses on profiling a varied set of applications running on different GPU clusters, and on accurately measuring and displaying the user experience on these clusters with actionable inputs for customers and support engineers. Creating a fault-tolerant distributed system while minimizing data loss and limiting time spent on reactive operational work is key to product quality and dynamic day-to-day work. We promote self-direction to work on meaningful projects, while also striving to build an environment that provides the support and mentorship needed to learn and grow.

What you'll be doing:

  • Work in an agile and fast-paced global environment to gather requirements, architect, design, implement, test, deploy, release, and support large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities, all with promised uptime

  • Build internal profiling tools for real-world ML/DL applications running on HPC GPU clusters, supporting failure and efficiency analysis to help improve current and future generations of GPU clusters and associated hardware

  • Understand state-of-the-art improvements in the ML/DL domain, and work with application owners and research teams to add or improve profiling support for current and potential future features

What we need to see:

  • BS+ in Computer Science or a related field (or equivalent experience) and 5+ years of software development (in Python)

  • Experience with GitLab (or another source code management system), including branch/release workflows, CI/CD pipelines, etc.

  • Solid understanding of algorithms, data structures, and runtime/space complexity

  • Experience working with distributed system software architecture

  • Basic understanding of HPC GPU clusters and Slurm

  • Basic understanding of machine learning concepts and terminology

  • Background with databases, both SQL and NoSQL (Prometheus, Elasticsearch, OpenSearch, Redis, etc.)

  • Experience with distributed data pipelines, telemetry, visualization (Kibana, Grafana, etc.), and alerting (PagerDuty, etc.)

Ways to stand out from the crowd:

  • Experience debugging functional and performance issues in HPC GPU clusters

  • Background in running and instrumenting distributed LLM training on a multi-GPU HPC cluster

  • Knowledge of LLM training features and libraries: checkpointing, parallelism, PyTorch, Megatron-LM, NCCL

  • Experience with HPC schedulers such as Slurm

  • Background with OpenTelemetry

#LI-Hybrid

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

