HPC Operations Manager - Hardware Engineering

4 Months ago • 15 Years + • Software Development & Engineering • $272,000 PA - $425,500 PA

Job Summary

Job Description

NVIDIA seeks a highly motivated HPC Operations Manager to lead and mentor a multinational team in managing and evolving its global HPC clusters. Responsibilities include ensuring high reliability, developing critical metrics, identifying and resolving failures, evaluating new technologies, planning hardware deployments, collaborating with hardware engineering teams, managing the HPC scheduler (LSF), tracking software licenses, and communicating program status to senior management. The ideal candidate will have extensive experience in IT infrastructure management, Linux server administration, HPC schedulers, and hardware design workflows.
Must have:
  • 15+ years overall experience
  • 5+ years managing IT infra teams
  • 10+ years running Linux servers
  • HPC schedulers (IBM LSF preferred)
  • Knowledge of hardware design workflows
Good to have:
  • HPC storage (NetApp, Pure Storage, etc.)
  • InfiniBand (operations, debugging)
  • Software development (DevOps)
  • Relational databases, data lakes
  • FlexLM-based software license servers
Perks:
  • Equity
  • Benefits

Job Details

Widely considered to be one of the technology world’s most desirable employers, NVIDIA is an industry leader with groundbreaking developments in High-Performance Computing, Artificial Intelligence and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables outstanding creativity and discovery and powers what were once science fiction inventions from artificial intelligence to autonomous cars. We are now looking for a highly motivated HPC Operations Manager to join this multifaceted and innovative infrastructure team to craft global and dynamic HPC clusters used by NVIDIA’s hardware design teams. We are looking for leaders to help us grow and evolve a reliable computing environment to enable our hardware designers to build the next generation of GPUs and SoCs.

What You'll be Doing:

  • A huge part of the day-to-day job is collaborating with partners to develop programs around storage, networking, and compute in our growing fleet of data centers.

  • Lead, cultivate, and mentor a multinational team of sysadmins and DevOps engineers in support of the chip design teams

  • Ensure the highest reliability of HPC clusters. Develop critical metrics and program schedules to measure program health, predictability, and achievements

  • Identify failures, lead retrospective analysis, and help to develop improvement action plans. Build standard methodologies that cut through complexity, can be used across NVIDIA, and influence other partners for continuous improvement

  • Evaluate the latest technologies (hardware and cloud computing) and recommend the future evolution of the infrastructure. Plan deployments and refreshes of hardware (compute, storage, network equipment) and the associated software stack (e.g. OS)

  • Work multi-functionally with hardware engineering leaders to support their future chip design needs, understand their workflow characteristics, and engineer an efficient HPC environment. Work with IT and engineering infrastructure teams on the different subsystems that comprise the computing environment.

  • Lead all aspects of the HPC scheduler (LSF), set/adjust policy, ensure delivery of forecasted compute demand to each hardware division, and drive high utilization.

  • Track software licensing servers and drive efficient license utilization

  • Develop and manage program schedules, milestones and deliverables. Adjust in the face of a highly fluid customer product roadmap.

  • Regularly communicate program status and key issues to senior management at NVIDIA’s headquarters. Accurately represent the importance of issues and call out issues appropriately. Be the evangelist of data-driven project management

What We Need to See:

  • B.S. or M.S. in Computer Science, Computer Engineering, Information Science (or equivalent experience)

  • 15+ years overall experience

  • 5+ years managing IT infrastructure teams of 10+ people

  • 10+ years of experience running Linux servers, NFS storage, and Ethernet networks

  • Knowledge of HPC schedulers (IBM LSF preferred)

  • Knowledge of hardware design workflows (EDA tools and methodology)

  • Experience using project management and capacity planning software

  • Datacenter operations (rack and stack, maintenance)

Ways to stand out from the crowd:

  • HPC storage (e.g. NetApp, Pure Storage, Lustre, ZFS, Isilon)

  • InfiniBand (operations, debugging, performance tuning)

  • Software development, especially in a DevOps context

  • Knowledge of relational databases, data lakes, metrics/visualization/analytics platforms

  • Deploying and maintaining FlexLM-based software license servers

  • Established relationships with enterprise-level equipment suppliers

The base salary range is 272,000 USD - 425,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
