Senior GPU Cluster Software Engineer

NVIDIA

Shanghai, China (On-site)

7 Months ago • 5 Years + • Software Development & Engineering

Job Summary

Job Description

The Senior GPU Cluster Software Engineer is responsible for building profiling solutions for large-scale real-world applications running on GPU compute clusters. The role involves architecting, designing, implementing, testing, deploying, and supporting large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities. A key focus is building internal profiling tools for ML/DL applications and analyzing failures and inefficiencies to improve GPU clusters and hardware. The engineer will also need to understand state-of-the-art improvements in ML/DL and work with application owners and research teams to add or improve profiling for current and future features. The work environment is agile and fast-paced, requiring collaboration in a global setting.

Must have:

5+ years software development (Python)
GitLab and CI/CD experience
Understanding of algorithms & data structures
Distributed system architecture knowledge
HPC GPU cluster & Slurm basics
ML concepts & terminology
SQL & NoSQL database experience
Distributed data pipeline, telemetry, visualization & alerting

Good to have:

Debugging HPC GPU clusters
Distributed LLM training experience
LLM training features & libraries (PyTorch, Megatron-LM, NCCL)
HPC schedulers (Slurm)
OpenTelemetry

19 skills required

problem-solving

data-structures

game-texts

agile-development

gitlab

user-experience-ux

nosql

kibana

prometheus

grafana

elasticsearch

pytorch

redis

ci-cd

python

sql

algorithms

css

machine-learning

Job Details

As a member of the System Software team, you'll be responsible for building profiling solutions for large-scale real-world applications running on GPU compute clusters, making them work efficiently and improving the user experience for customers as well as the engineers supporting the cluster. Much of our software development focuses on profiling a varied set of applications running on different GPU clusters, and on accurately measuring and displaying the user experience on these clusters with actionable inputs for customers and the engineers supporting them. Creating a fault-tolerant distributed system while minimizing data loss and limiting time spent on reactive operational work is key to product quality and dynamic day-to-day work. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What you'll be doing:

Work in an agile, fast-paced global environment to gather requirements, architect, design, implement, test, deploy, release, and support large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities and promised uptime
Build internal profiling tools for real-world ML/DL applications running on HPC GPU clusters for failure and efficiency analysis, to help improve current and future generations of GPU clusters and associated hardware
Understand state-of-the-art improvements in the ML/DL domain, and work with various application owners and research teams to add or improve profiling for current and potential future supported features

What we need to see:

BS+ in Computer Science or a related field (or equivalent experience) and 5+ years of software development (in Python)
Experience with GitLab (or another source code management tool): branching and releases, CI/CD pipelines, etc.
Solid understanding of algorithms, data structures, and runtime/space complexity
Experience working with distributed system software architecture
Basic understanding of HPC GPU clusters and Slurm
Basic understanding of machine learning concepts and terminology
Background with databases - SQL and NoSQL (Prometheus, Elasticsearch, OpenSearch, Redis, etc.)
Experience with distributed data pipelines, telemetry, visualization (Kibana, Grafana, etc.), and alerting (PagerDuty, etc.)

Ways to stand out from the crowd:

Experience debugging functional and performance issues in HPC GPU clusters
Background in running and instrumenting distributed LLM training on a multi-GPU HPC cluster
Knowledge of LLM training features and libraries - checkpointing, parallelism, PyTorch, Megatron-LM, NCCL
Experience with HPC schedulers such as Slurm
Background with OpenTelemetry

About The Company

NVIDIA

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
