Senior HPC AI Cluster Engineer

3 Months ago • 5 Years + • DevOps • $144,000 PA - $270,250 PA

Job Summary

Job Description

NVIDIA seeks a Senior HPC AI Cluster Engineer to design, implement, and maintain large-scale HPC/AI clusters. Responsibilities include managing workloads, developing CI/CD pipelines, automating infrastructure deployment, troubleshooting systems (from bare metal to applications), and developing standard methodologies. The role requires expertise in HPC/AI technologies, Linux, networking, storage solutions (Lustre, GPFS, etc.), automation tools (Ansible, Jenkins), and scripting (Python, Bash). Collaboration with researchers, developers, and customers to optimize workflows and build differentiated solutions is key. This position involves supporting R&D and POCs for future improvements.
Must have:
  • 5+ years experience
  • HPC/AI solution knowledge
  • Job scheduling (Slurm, K8s)
  • Linux/Windows expertise
  • Storage solutions (Lustre, GPFS)
  • Python/Bash scripting
  • Automation tools (Ansible, Jenkins)
Good to have:
  • CPU/GPU architecture knowledge
  • Kubernetes experience
  • GPU hardware/software (DGX, CUDA)
  • RDMA experience
  • Cloud computing familiarity
Perks:
  • Equity
  • Benefits

Job Details

NVIDIA is looking for an experienced HPC Engineer to join the E2E software verification HPC/AI Infrastructure team. We are building supercomputers and HPC clusters based on groundbreaking technologies. We are looking for an outstanding senior HPC architect to be a key player working with the most exciting computing hardware and software, and to contribute to the latest breakthroughs in artificial intelligence and GPU computing. You will provide insights on at-scale system design and tuning mechanisms for large-scale compute runs.

You will work with the latest accelerated computing and deep learning software and hardware platforms, and with many scientific researchers, developers, and customers to craft improved workflows and develop new, leading differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop and bring up large-scale performance platforms. Does this sound like you? If so, we would love to hear from you!

What you will be doing:

  • Designing, implementing and maintaining large-scale HPC/AI clusters with monitoring, logging and alerting

  • Managing Linux job/workload schedulers and orchestration tools

  • Developing and maintaining continuous integration and delivery pipelines

  • Developing tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources

  • Deploying monitoring solutions for the servers, network and storage

  • Troubleshooting and fixing issues from the bottom up, across the bare metal, operating system, software stack and application levels

  • Being a technical resource, developing, re-defining and documenting standard methodologies to share with internal teams

  • Supporting Research & Development activities and engaging in POCs/POVs for future improvements

What we need to see:

  • Bachelor's Degree in Computer Science, Engineering, or a related field; or equivalent experience

  • 5+ years of experience

  • Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software

  • Experience with job scheduling and workload orchestration tools such as Slurm and K8s

  • Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking (sockets, firewalls, iptables, Wireshark, etc.) and internals, ACLs and OS-level security protection, and common protocols such as TCP, DHCP and DNS

  • Experience with multiple storage solutions such as Lustre, GPFS, ZFS and XFS, and familiarity with newer and emerging storage technologies

  • Python programming and Bash scripting experience

  • Comfortable with automation and configuration management tools such as Jenkins, Ansible and Puppet/Chef

  • Deep knowledge of networking technologies such as InfiniBand and Ethernet

  • Deep understanding and experience with virtual systems (for example VMware, Hyper-V, KVM, or Citrix)

  • Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud)

Ways to stand out from the crowd:

  • Knowledge of CPU and/or GPU architecture

  • Knowledge of Kubernetes and container-related microservice technologies

  • Experience with GPU-focused hardware/software (DGX, CUDA)

  • Background with RDMA (InfiniBand or RoCE) fabrics

NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. We have a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent.

The base salary range is 144,000 USD - 270,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
