
HPC Operations Manager – Hardware Engineering

1 Month ago • 15 Years + • Research & Development • $272,000 PA - $425,500 PA

Job Summary

Job Description

NVIDIA seeks a highly motivated HPC Operations Manager to lead and mentor a multinational team of sysadmins and DevOps engineers supporting chip design teams. Responsibilities include ensuring high reliability of HPC clusters, developing critical metrics, identifying failures, and implementing improvement plans. The role requires evaluating new technologies, planning hardware deployments, collaborating with hardware engineering leaders, managing the HPC scheduler (LSF), tracking software licenses, and communicating program status to senior management. The ideal candidate will have extensive experience in IT infrastructure management, Linux server administration, HPC schedulers, and hardware design workflows.
Must have:
  • 15+ years overall experience
  • 5+ years managing IT infrastructure teams
  • 10+ years experience running Linux servers, NFS storage, and Ethernet networks
  • Knowledge of HPC schedulers (IBM LSF preferred)
  • Knowledge of hardware design workflows
  • Data center operations
Good to have:
  • HPC storage (NetApp, Pure Storage, Lustre, ZFS, Isilon)
  • InfiniBand (operations, debugging, performance tuning)
  • Software development (DevOps context)
  • Relational databases, data lakes, metrics/visualization/analytic platforms
  • Deploying and maintaining FlexLM-based software license servers
  • Established relationships with enterprise-level equipment suppliers
Perks:
  • Equity
  • Benefits

Job Details

Widely considered to be one of the technology world’s most desirable employers, NVIDIA is an industry leader with groundbreaking developments in High-Performance Computing, Artificial Intelligence and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables outstanding creativity and discovery and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. We are now looking for a highly motivated HPC Operations Manager to join this multifaceted and innovative infrastructure team to craft global and dynamic HPC clusters used by NVIDIA’s hardware design teams. We are looking for leaders to help us grow and evolve a reliable computing environment to enable our hardware designers to build the next generation of GPUs and SoCs.

What You'll be Doing:

  • A huge part of the day-to-day job is collaborating with partners to develop programs around storage, networking, and compute in our growing fleet of data centers.

  • Lead, cultivate, and mentor a multinational team of sysadmins and DevOps engineers in support of the chip design teams

  • Ensure the highest reliability of HPC clusters. Develop critical metrics and program schedules to measure program health, predictability, and achievements

  • Identify failures, lead retrospective analysis, and help to develop improvement action plans. Build standard methodologies that cut through complexity, can be used across NVIDIA, and influence other partners toward continuous improvement

  • Evaluate the latest technologies (hardware and cloud computing) and recommend future evolution of the infrastructure. Plan deployments and refreshes of hardware (compute, storage, network equipment) and the associated software stack (e.g. OS)

  • Work multi-functionally with hardware engineering leaders to support their future chip design needs, understand their workflow characteristics, and engineer an efficient HPC environment. Work with IT and engineering infrastructure teams on the different subsystems that comprise the computing environment.

  • Lead all aspects of the HPC scheduler (LSF), set/adjust policy, ensure delivery of forecasted compute demand to each hardware division, and drive high utilization.

  • Track software licensing servers and drive efficient license utilization

  • Develop and manage program schedules, milestones and deliverables. Adjust in the face of a highly fluid customer product roadmap.

  • Regularly communicate program status and key issues to senior management at NVIDIA’s headquarters. Accurately represent the importance of issues and call them out appropriately. Be the evangelist of data-driven project management

What We Need to See:

  • B.S. or M.S. in Computer Science, Computer Engineering, Information Science (or equivalent experience)

  • 15+ years overall experience

  • 5+ years managing IT infrastructure teams of 10+ people

  • 10+ years experience running Linux servers, NFS storage, and Ethernet networks

  • Knowledge of HPC schedulers (IBM LSF preferred)

  • Knowledge of hardware design workflows (EDA tools and methodology)

  • Experience using project management and capacity planning software

  • Datacenter operations (rack and stack, maintenance)

Ways to stand out from the crowd:

  • HPC storage (e.g. NetApp, Pure Storage, Lustre, ZFS, Isilon)

  • InfiniBand (operations, debugging, performance tuning)

  • Software development, especially in a DevOps context

  • Knowledge of relational databases, data lakes, metrics/visualization/analytics platforms

  • Deploying and maintaining FlexLM-based software license servers

  • Established relationships with enterprise-level equipment suppliers

The base salary range is 272,000 USD - 425,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

