Senior HPC Cluster Engineer

1 Month ago • 5 Years + • Research & Development

Job Summary

Job Description

As a Senior HPC Cluster Engineer at NVIDIA, you'll lead the design and implementation of cutting-edge GPU compute clusters for deep learning, HPC, and computationally intensive workloads. Responsibilities include building and improving the GPU-accelerated computing ecosystem, developing large-scale automation solutions, maintaining and building deep learning clusters, supporting researchers, performing performance analysis and optimization, and conducting root cause analysis. You'll also be involved in strategic challenges related to compute, networking, storage, resource utilization, cloud strategy, capacity modeling, and growth planning.
Must have:
  • 5+ years experience designing/operating large-scale compute infrastructure
  • Experience analyzing and tuning HPC workload performance
  • Knowledge of cluster management tools (Ansible, Puppet, Salt)
  • Experience with HPC schedulers (SLURM, LSF)
  • Understanding of container technologies (Docker, Singularity)
  • Proficient in Linux (CentOS/RHEL or Ubuntu), Python, bash scripting
  • Experience with MPI-based HPC workflows
Good to have:
  • Understanding of MLPerf benchmarking
  • Familiarity with InfiniBand, IBOP, RDMA
  • Understanding of Lustre/GPFS for HPC
  • Background in SDN and HPC cluster networking
  • Familiarity with PyTorch and TensorFlow
Perks:
  • Highly competitive salaries
  • Comprehensive benefits package

Job Details

NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our life’s work, to amplify human imagination and intelligence. Make the choice to join us today!

As a member of the GPU/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that run demanding deep learning, high performance computing, and computationally intensive workloads. We seek an expert to identify architectural changes and/or completely new approaches for our GPU Compute Clusters. As an expert, you will help us with the strategic challenges we encounter, including compute, networking, and storage design for large-scale, high-performance workloads; effective resource utilization in a heterogeneous compute environment; evolving our private/public cloud strategy; capacity modeling; and growth planning across our global computing environment.

What you'll be doing:

  • Building and improving our ecosystem around GPU-accelerated computing including developing large scale automation solutions

  • Maintaining and building deep learning clusters at scale

  • Supporting our researchers to run their flows on our clusters including performance analysis and optimizations of deep learning workflows

  • Performing root cause analysis and suggesting corrective action for problems at both large and small scales

  • Finding and fixing problems before they occur

What we need to see:

  • Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience.

  • Minimum 5 years of experience designing and operating large scale compute infrastructure.

  • Experience analyzing and tuning performance for a variety of HPC workloads.

  • Working knowledge of cluster configuration management tools such as Ansible, Puppet, Salt.

  • Experience with HPC cluster job schedulers such as SLURM, LSF

  • In-depth understanding of container technologies like Docker, Singularity, Shifter, Charliecloud

  • Proficient in CentOS/RHEL and/or Ubuntu Linux distributions, including Python programming and bash scripting

  • Experience with HPC workflows that use MPI

Ways to stand out from the crowd:

  • Understanding of MLPerf benchmarking

  • Familiarity with InfiniBand with IBOP and RDMA

  • Understanding of fast, distributed storage systems like Lustre and GPFS for HPC workloads.

  • Background with Software Defined Networking and HPC cluster networking

  • Familiarity with deep learning frameworks like PyTorch and TensorFlow

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have some of the most brilliant and talented people in the world working for us and, due to unprecedented growth, our world-class engineering teams are growing fast. If you're a creative and autonomous engineer with real passion for technology, we want to hear from you.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.


Yokne'am Illit, North District, Israel (On-Site)
