Senior GPU Cluster Software Engineer

6 Months ago • 5 Years + • Software Development & Engineering

Job Summary

Job Description

Senior GPU Cluster Software Engineer responsible for building profiling solutions for large-scale real-world applications running on GPU compute clusters. The role involves architecting, designing, implementing, testing, deploying, and supporting large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities. A key focus is building internal profiling tools for ML/DL applications and analyzing failures and inefficiencies to improve GPU clusters and hardware. The engineer will also need to understand state-of-the-art improvements in ML/DL and work with application owners and research teams to address profiling needs for current and future features. The work environment is agile and fast-paced, requiring collaboration in a global setting.
Must have:
  • 5+ years software development (Python)
  • GitLab and CI/CD experience
  • Understanding of algorithms & data structures
  • Distributed system architecture knowledge
  • HPC GPU cluster & Slurm basics
  • ML concepts & terminology
  • SQL & NoSQL database experience
  • Distributed data pipeline, telemetry, visualization & alerting
Good to have:
  • Debugging HPC GPU clusters
  • Distributed LLM training experience
  • LLM training features & libraries (PyTorch, Megatron-LM, NCCL)
  • HPC schedulers (Slurm)
  • OpenTelemetry

Job Details

As a member of the System Software team, you'll be responsible for building profiling solutions for large-scale, real-world applications running on GPU compute clusters, making them run efficiently and improving the experience for customers as well as for the engineers supporting the cluster. Much of our software development focuses on profiling a varied set of applications running on different GPU clusters, and on accurately measuring and displaying the user experience on these clusters with actionable inputs for customers and the engineers supporting them. Creating a fault-tolerant distributed system while minimizing data loss and limiting time spent on reactive operational work is key to product quality and dynamic day-to-day work. We promote self-direction to work on meaningful projects, while also striving to build an environment that provides the support and mentorship needed to learn and grow.

What you'll be doing:

  • Work in an agile and fast-paced global environment to gather requirements, architect, design, implement, test, deploy, release, and support large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities that meets promised uptime

  • Build internal profiling tools for real-world ML/DL applications running on HPC GPU clusters, supporting failure and efficiency analysis that helps improve current and future generations of GPU clusters and associated hardware

  • Understand state-of-the-art improvements in the ML/DL domain, and work with various application owners and research teams to add or improve profiling support for current and potential future features

What we need to see:

  • BS+ in Computer Science or a related field (or equivalent experience) and 5+ years of software development (in Python)

  • Experience with GitLab (or another source code management system), including branching/release workflows and CI/CD pipelines

  • Solid understanding of algorithms, data structures, and runtime/space complexity

  • Experience working with distributed system software architecture

  • Basic understanding of HPC GPU clusters and Slurm

  • Basic understanding of machine learning concepts and terminology

  • Background with databases, both SQL and NoSQL (Prometheus, Elasticsearch, OpenSearch, Redis, etc.)

  • Experience with distributed data pipelines, telemetry, visualization (Kibana, Grafana, etc.), and alerting (PagerDuty, etc.)

Ways to stand out from the crowd:

  • Experience debugging functional and performance issues in HPC GPU clusters

  • Background in running and instrumenting distributed LLM training on a multi-GPU HPC cluster

  • Knowledge of LLM training features and libraries: checkpointing, parallelism, PyTorch, Megatron-LM, NCCL

  • Experience with HPC schedulers such as Slurm

  • Background with OpenTelemetry

#LI-Hybrid

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

