HPC Operations Manager – Hardware Engineering

6 Months ago • 15 Years + • Software Development & Engineering • $272,000 PA - $425,500 PA

Job Summary

Job Description

NVIDIA seeks a highly motivated HPC Operations Manager to lead and mentor a multinational team in managing global HPC clusters used by hardware design teams. Responsibilities include ensuring cluster reliability, developing key metrics, identifying and resolving failures, evaluating new technologies, planning hardware deployments, collaborating with engineering leaders, managing the HPC scheduler (LSF), and communicating program status to senior management. The role requires expertise in Linux servers, NFS storage, Ethernet networks, HPC schedulers, and hardware design workflows.
Must have:
  • 15+ years of experience
  • 5+ years managing IT teams
  • 10+ years running Linux servers
  • HPC schedulers (LSF preferred)
  • Hardware design workflows knowledge
  • Data center operations
Good to have:
  • HPC storage expertise
  • InfiniBand knowledge
  • Software development skills
  • Relational database knowledge
  • Experience with enterprise equipment suppliers
Perks:
  • Equity
  • Benefits

Job Details

Widely considered to be one of the technology world’s most desirable employers, NVIDIA is an industry leader with groundbreaking developments in High-Performance Computing, Artificial Intelligence and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables outstanding creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. We are now looking for a highly motivated HPC Operations Manager to join this multifaceted and innovative infrastructure team to craft global and dynamic HPC clusters used by NVIDIA’s hardware design teams. We are looking for leaders to help us grow and evolve a reliable computing environment to enable our hardware designers to build the next generation of GPUs and SoCs.

What You'll be Doing:

  • A huge part of the day-to-day job is collaborating with partners to develop programs around storage, networking, and compute in our growing fleet of data centers.

  • Lead, cultivate, and mentor a multinational team of sysadmins and DevOps engineers in support of the chip design teams.

  • Ensure the highest reliability of HPC clusters. Develop critical metrics and program schedules to measure program health, predictability, and achievements.

  • Identify failures, lead retrospective analysis, and help to develop improvement action plans. Build standard methodologies that cut through complexity, can be used across NVIDIA, and influence other partners toward continuous improvement.

  • Evaluate the latest technologies (hardware and cloud computing) and recommend the future evolution of the infrastructure. Plan deployments and refreshes of hardware (compute, storage, network equipment) and the associated software stack (e.g. OS).

  • Work multi-functionally with hardware engineering leaders to support their future chip design needs, understand their workflow characteristics, and engineer an efficient HPC environment. Work with IT and engineering infrastructure teams on the different subsystems that comprise the computing environment.

  • Lead all aspects of the HPC scheduler (LSF), set/adjust policy, ensure delivery of forecasted compute demand to each hardware division, and drive high utilization.

  • Track software licensing servers and drive efficient license utilization

  • Develop and manage program schedules, milestones and deliverables. Adjust in the face of a highly fluid customer product roadmap.

  • Regularly communicate program status and key issues to senior management at NVIDIA’s headquarters. Accurately represent the importance of issues and call out issues appropriately. Be the evangelist of data-driven project management.

What We Need to See:

  • B.S. or M.S. in Computer Science, Computer Engineering, Information Science (or equivalent experience)

  • 15+ years of overall experience

  • 5+ years managing IT infrastructure teams of 10+ people

  • 10+ years of experience running Linux servers, NFS storage, and Ethernet networks

  • Knowledge of HPC schedulers (IBM LSF preferred)

  • Knowledge of hardware design workflows (EDA tools and methodology)

  • Experience using project management and capacity planning software

  • Datacenter operations (rack and stack, maintenance)

Ways to stand out from the crowd:

  • HPC storage (e.g. NetApp, Pure Storage, Lustre, ZFS, Isilon)

  • InfiniBand (operations, debugging, performance tuning)

  • Software development, especially in a DevOps context

  • Knowledge of relational databases, data lakes, metrics/visualization/analytics platforms

  • Deploying and maintaining FlexLM-based software license servers

  • Established relationships with enterprise-level equipment suppliers

The base salary range is 272,000 USD - 425,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
