Senior GPU Cluster Software Engineer

3 Months ago • 5 Years + • Research & Development

Job Summary

Job Description

Senior GPU Cluster Software Engineer responsible for building profiling solutions for large-scale real-world applications running on GPU compute clusters. The role involves architecting, designing, implementing, testing, deploying, and supporting large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities. A key focus is building internal profiling tools for ML/DL applications and analyzing failures and inefficiencies to improve GPU clusters and hardware. The engineer will also need to understand state-of-the-art improvements in ML/DL and work with application owners and research teams to address profiling needs for current and future features. The work environment is agile and fast-paced, requiring collaboration in a global setting.
Must have:
  • 5+ years software development (Python)
  • GitLab/CI/CD experience
  • Understanding of algorithms & data structures
  • Distributed system architecture knowledge
  • HPC GPU cluster & Slurm basics
  • ML concepts & terminology
  • SQL & NoSQL database experience
  • Distributed data pipeline, telemetry, visualization & alerting
Good to have:
  • Debugging HPC GPU clusters
  • Distributed LLM training experience
  • LLM training features & libraries (PyTorch, Megatron-LM, NCCL)
  • HPC schedulers (Slurm)
  • OpenTelemetry

Job Details

As a member of the System Software team, you'll be responsible for building profiling solutions for large-scale real-world applications running on GPU compute clusters, making them work efficiently and improving the user experience for customers as well as the engineers supporting the cluster. Much of our software development focuses on profiling a varied set of applications running on different GPU clusters, and on accurately measuring and displaying the user experience on these clusters with actionable inputs for customers and the engineers supporting the cluster. Creating a fault-tolerant distributed system while minimizing data loss and limiting time spent on reactive operational work is key to product quality and dynamic day-to-day work. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What you'll be doing:

  • Work in an agile and fast-paced global environment to gather requirements, architect, design, implement, test, deploy, release, and support large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities and a promised uptime

  • Build internal profiling tools for real-world ML/DL applications running on HPC GPU clusters for failure and efficiency analysis to help improve current and future generations of GPU clusters and associated hardware

  • Understand state-of-the-art improvements in the ML/DL domain, and work with various application owners and research teams to address new and evolving profiling needs for currently supported and potential future features

What we need to see:

  • BS+ in Computer Science or a related field (or equivalent experience) and 5+ years of software development (in Python)

  • Experience with GitLab (or another source code management system): branch/release workflows, CI/CD pipelines, etc.

  • Solid understanding of algorithms, data structures, and runtime/space complexity

  • Experience working with distributed system software architecture

  • Basic understanding of HPC GPU clusters and Slurm

  • Basic understanding of machine learning concepts and terminology

  • Background with databases - SQL and NoSQL (Prometheus, Elasticsearch, OpenSearch, Redis, etc.)

  • Experience with distributed data pipelines, telemetry, visualization (Kibana, Grafana, etc.), and alerting (PagerDuty, etc.)

Ways to stand out from the crowd:

  • Experience debugging functional and performance issues in HPC GPU clusters

  • Background in running and instrumenting distributed LLM training on a multi-GPU HPC cluster

  • Knowledge of LLM training features and libraries - checkpointing, parallelism, PyTorch, Megatron-LM, NCCL

  • Experience with HPC schedulers such as Slurm

  • Background with OpenTelemetry

#LI-Hybrid

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
