Senior HPC Cluster Engineer

1 Month ago • 5 Years + • Research & Development

Job Summary

Job Description

As a Senior HPC Cluster Engineer at NVIDIA, you'll lead the design and implementation of cutting-edge GPU compute clusters for deep learning, HPC, and computationally intensive workloads. Responsibilities include building and improving the GPU-accelerated computing ecosystem, developing large-scale automation solutions, maintaining and building deep learning clusters, supporting researchers, performing performance analysis and optimization, and conducting root cause analysis. You'll also be involved in strategic challenges related to compute, networking, storage, resource utilization, cloud strategy, capacity modeling, and growth planning.
Must have:
  • 5+ years experience designing/operating large-scale compute infrastructure
  • Experience analyzing and tuning HPC workload performance
  • Knowledge of cluster management tools (Ansible, Puppet, Salt)
  • Experience with HPC schedulers (SLURM, LSF)
  • Understanding of container technologies (Docker, Singularity)
  • Proficient in Linux (CentOS/RHEL or Ubuntu), Python, bash scripting
  • Experience with MPI-based HPC workflows
Good to have:
  • Understanding of MLPerf benchmarking
  • Familiarity with InfiniBand, IBOP, RDMA
  • Understanding of Lustre/GPFS for HPC
  • Background in SDN and HPC cluster networking
  • Familiarity with PyTorch and TensorFlow
Perks:
  • Highly competitive salaries
  • Comprehensive benefits package

Job Details

NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our life’s work, to amplify human imagination and intelligence. Make the choice to join us today!

As a member of the GPU/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that run demanding deep learning, high performance computing, and computationally intensive workloads. We seek an expert to identify architectural changes and/or completely new approaches for our GPU compute clusters. As an expert, you will help us with the strategic challenges we encounter, including compute, networking, and storage design for large-scale, high-performance workloads; effective resource utilization in a heterogeneous compute environment; evolving our private/public cloud strategy; capacity modeling; and growth planning across our global computing environment.

What you'll be doing:

  • Building and improving our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions

  • Maintaining and building deep learning clusters at scale

  • Supporting our researchers in running their flows on our clusters, including performance analysis and optimization of deep learning workflows

  • Performing root cause analysis and suggesting corrective action for problems at both large and small scale

  • Finding and fixing problems before they occur

What we need to see:

  • Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience.

  • Minimum 5 years of experience designing and operating large scale compute infrastructure.

  • Experience analyzing and tuning performance for a variety of HPC workloads.

  • Working knowledge of cluster configuration management tools such as Ansible, Puppet, Salt.

  • Experience with HPC cluster job schedulers such as SLURM, LSF

  • In-depth understanding of container technologies like Docker, Singularity, Shifter, Charliecloud

  • Proficient in CentOS/RHEL and/or Ubuntu Linux distributions, including Python programming and bash scripting

  • Experience with HPC workflows that use MPI

Ways to stand out from the crowd:

  • Understanding of MLPerf benchmarking

  • Familiarity with InfiniBand with IBOP and RDMA

  • Understanding of fast, distributed storage systems like Lustre and GPFS for HPC workloads.

  • Background with Software Defined Networking and HPC cluster networking

  • Familiarity with deep learning frameworks like PyTorch and TensorFlow

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have some of the most brilliant and talented people in the world working for us and, due to unprecedented growth, our world-class engineering teams are growing fast. If you're a creative and autonomous engineer with real passion for technology, we want to hear from you.

#LI-Hybrid

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

