Senior GPU Cluster Software Engineer

5 Months ago • 5 Years + • Software Development & Engineering

Job Summary

Job Description

The Senior GPU Cluster Software Engineer is responsible for building profiling solutions for large-scale real-world applications running on GPU compute clusters. The role involves architecting, designing, implementing, testing, deploying, and supporting large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities. A key focus is building internal profiling tools for ML/DL applications and analyzing failures and inefficiencies to improve GPU clusters and hardware. The engineer will also need to follow state-of-the-art improvements in ML/DL and work with application owners and research teams to meet profiling needs for current and future features. The work environment is agile and fast-paced, requiring collaboration in a global setting.
Must have:
  • 5+ years software development (Python)
  • GitLab and CI/CD experience
  • Understanding of algorithms & data structures
  • Distributed system architecture knowledge
  • HPC GPU cluster & Slurm basics
  • ML concepts & terminology
  • SQL & NoSQL database experience
  • Distributed data pipeline, telemetry, visualization & alerting
Good to have:
  • Debugging HPC GPU clusters
  • Distributed LLM training experience
  • LLM training features & libraries (PyTorch, Megatron-LM, NCCL)
  • HPC schedulers (Slurm)
  • OpenTelemetry

Job Details

As a member of the System Software team, you'll be responsible for building profiling solutions for large-scale real-world applications running on GPU compute clusters, making them work efficiently and improving the user experience for customers as well as the engineers supporting the cluster. Much of our software development focuses on profiling a varied set of applications running on different GPU clusters, and on accurately measuring and displaying the user experience on these clusters with actionable inputs for customers and support engineers. Creating a fault-tolerant distributed system while minimizing data loss and limiting time spent on reactive operational work is key to product quality and dynamic day-to-day work. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What you'll be doing:

  • Work in an agile, fast-paced global environment to gather requirements and to architect, design, implement, test, deploy, release, and support large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities and a promised uptime

  • Build internal profiling tools for real-world ML/DL applications running on HPC GPU clusters, supporting failure and efficiency analysis that helps improve current and future generations of GPU clusters and associated hardware

  • Understand state-of-the-art improvements in the ML/DL domain, and work with application owners and research teams to add or improve profiling for currently supported features and potential future ones

What we need to see:

  • BS+ in Computer Science or a related field (or equivalent experience) and 5+ years of software development in Python

  • Experience with GitLab (or another source code management system), including branch/release workflows and CI/CD pipelines

  • Solid understanding of algorithms, data structures, and runtime/space complexity

  • Experience working with distributed system software architecture

  • Basic understanding of HPC GPU clusters and Slurm

  • Basic understanding of machine learning concepts and terminology

  • Background with SQL and NoSQL databases and data stores (Prometheus, Elasticsearch, OpenSearch, Redis, etc.)

  • Experience with distributed data pipelines, telemetry, visualization (Kibana, Grafana, etc.), and alerting (PagerDuty, etc.)

Ways to stand out from the crowd:

  • Experience debugging functional and performance issues in HPC GPU clusters

  • Background in running and instrumenting distributed LLM training on a multi-GPU HPC cluster

  • Knowledge of LLM training features and libraries: checkpointing, parallelism, PyTorch, Megatron-LM, NCCL

  • Experience with HPC schedulers such as Slurm

  • Background with OpenTelemetry



About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
