Senior HPC Cluster Engineer

3 Months ago • 5 Years + • Research & Development

Job Summary

Job Description

As a Senior HPC Cluster Engineer at NVIDIA, you'll lead the design and implementation of cutting-edge GPU compute clusters for deep learning, HPC, and computationally intensive workloads. Responsibilities include building and improving the GPU-accelerated computing ecosystem, developing large-scale automation solutions, maintaining and building deep learning clusters, supporting researchers, performing performance analysis and optimization, and conducting root cause analysis. You'll also be involved in strategic challenges related to compute, networking, storage, resource utilization, cloud strategy, capacity modeling, and growth planning.
Must have:
  • 5+ years experience designing/operating large-scale compute infrastructure
  • Experience analyzing and tuning HPC workload performance
  • Knowledge of cluster management tools (Ansible, Puppet, Salt)
  • Experience with HPC schedulers (SLURM, LSF)
  • Understanding of container technologies (Docker, Singularity)
  • Proficient in Linux (CentOS/RHEL or Ubuntu), Python, bash scripting
  • Experience with MPI-based HPC workflows
Good to have:
  • Understanding of MLPerf benchmarking
  • Familiarity with InfiniBand, IBOP, RDMA
  • Understanding of Lustre/GPFS for HPC
  • Background in SDN and HPC cluster networking
  • Familiarity with PyTorch and TensorFlow
Perks:
  • Highly competitive salaries
  • Comprehensive benefits package

Job Details

NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our life’s work, to amplify human imagination and intelligence. Make the choice to join us today!

As a member of the GPU/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that run demanding deep learning, high performance computing, and computationally intensive workloads. We seek an expert to identify architectural changes and/or completely new approaches for our GPU compute clusters. As an expert, you will help us with the strategic challenges we encounter, including compute, networking, and storage design for large-scale, high-performance workloads; effective resource utilization in a heterogeneous compute environment; evolving our private/public cloud strategy; capacity modeling; and growth planning across our global computing environment.

What you'll be doing:

  • Building and improving our ecosystem around GPU-accelerated computing including developing large scale automation solutions

  • Maintaining and building deep learning clusters at scale

  • Supporting our researchers to run their flows on our clusters including performance analysis and optimizations of deep learning workflows

  • Performing root cause analysis and suggesting corrective action for problems at large and small scales

  • Finding and fixing problems before they occur

What we need to see:

  • Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience.

  • Minimum 5 years of experience designing and operating large scale compute infrastructure.

  • Experience analyzing and tuning performance for a variety of HPC workloads.

  • Working knowledge of cluster configuration management tools such as Ansible, Puppet, Salt.

  • Experience with HPC cluster job schedulers such as SLURM, LSF

  • In-depth understanding of container technologies like Docker, Singularity, Shifter, Charliecloud

  • Proficiency in CentOS/RHEL and/or Ubuntu Linux distributions, including Python programming and bash scripting

  • Experience with HPC workflows that use MPI

Ways to stand out from the crowd:

  • Understanding of MLPerf benchmarking

  • Familiarity with InfiniBand with IBOP and RDMA

  • Understanding of fast, distributed storage systems like Lustre and GPFS for HPC workloads.

  • Background with Software Defined Networking and HPC cluster networking

  • Familiarity with deep learning frameworks like PyTorch and TensorFlow

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have some of the most brilliant and talented people in the world working for us and, due to unprecedented growth, our world-class engineering teams are growing fast. If you're a creative and autonomous engineer with real passion for technology, we want to hear from you.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
