Senior GPU Cluster Software Engineer

4 Months ago • 5 Years + • Research & Development

Job Summary

Job Description

Senior GPU Cluster Software Engineer responsible for building profiling solutions for large-scale real-world applications running on GPU compute clusters. The role involves architecting, designing, implementing, testing, deploying, and supporting large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities. A key focus is building internal profiling tools for ML/DL applications, analyzing failures and inefficiencies to improve GPU clusters and hardware. The engineer will also need to understand state-of-the-art improvements in ML/DL and work with application owners and research teams to enhance profiling needs for current and future features. The work environment is agile and fast-paced, requiring collaboration in a global setting.
Must have:
  • 5+ years software development (Python)
  • GitLab and CI/CD experience
  • Understanding of algorithms & data structures
  • Distributed system architecture knowledge
  • HPC GPU cluster & Slurm basics
  • ML concepts & terminology
  • SQL & NoSQL database experience
  • Distributed data pipeline, telemetry, visualization & alerting
Good to have:
  • Debugging HPC GPU clusters
  • Distributed LLM training experience
  • LLM training features & libraries (PyTorch, Megatron-LM, NCCL)
  • HPC schedulers (Slurm)
  • OpenTelemetry

Job Details

As a member of the System Software team, you'll be responsible for building profiling solutions for large-scale real-world applications running on GPU compute clusters, so that they run efficiently and the user experience improves for customers as well as for the engineers supporting the cluster. Much of our software development focuses on profiling a varied set of applications running on different GPU clusters, and on accurately measuring and displaying the user experience on these clusters with actionable inputs for customers and support engineers. Creating a fault-tolerant distributed system while minimizing data loss and limiting time spent on reactive operational work is key to product quality and dynamic day-to-day work. We promote self-direction to work on meaningful projects, and we strive to build an environment that provides the support and mentorship needed to learn and grow.

What you'll be doing:

  • Work in an agile and fast-paced global environment to gather requirements, architect, design, implement, test, deploy, release, and support large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities with promised uptime

  • Build internal profiling tools for real-world ML/DL applications running on HPC GPU clusters for failure and efficiency analysis to help improve current and future generations of GPU clusters and associated hardware

  • Understand state-of-the-art improvements in the ML/DL domain, and work with various application owners and research teams to add or improve profiling for current and potential future supported features

What we need to see:

  • BS or higher in Computer Science or a related field (or equivalent experience) and 5+ years of software development experience in Python

  • Experience with GitLab (or another source code management system): branching and release workflows, CI/CD pipelines, etc.

  • Solid understanding of algorithms, data structures, and runtime/space complexity

  • Experience working with distributed system software architecture

  • Basic understanding of HPC GPU clusters and Slurm

  • Basic understanding of machine learning concepts and terminology

  • Background with databases, both SQL and NoSQL (Prometheus, Elasticsearch, OpenSearch, Redis, etc.)

  • Experience with distributed data pipelines, telemetry, visualization (Kibana, Grafana, etc.), and alerting (PagerDuty, etc.)

Ways to stand out from the crowd:

  • Experience debugging functional and performance issues in HPC GPU clusters

  • Background in running and instrumenting distributed LLM training on a multi-GPU HPC cluster

  • Knowledge of LLM training features and libraries: checkpointing, parallelism, PyTorch, Megatron-LM, NCCL

  • Experience with HPC schedulers such as Slurm

  • Background with OpenTelemetry

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
