Senior AI-HPC Storage Engineer

4 Months ago • 8 Years + • Research Development • $184,000 PA - $356,500 PA

Job Summary

Job Description

As a Senior AI-HPC Storage Engineer at NVIDIA, you'll lead the design and implementation of cutting-edge storage solutions for demanding AI/HPC workloads. Responsibilities include researching and implementing distributed storage services, designing on-prem and cloud-based AI/HPC infrastructure, developing automation tools, and collaborating with teams to optimize workflows. You'll perform performance analysis and root cause analysis, and contribute to the evolution of the storage strategy for NVIDIA's global computing environment. The role requires expertise in parallel file systems (Lustre, GPFS), cloud environments (AWS, Azure, GCP), and AI/HPC cluster management.
Must have:
  • 8+ years large-scale storage infrastructure experience
  • AI/HPC workload performance analysis & tuning
  • Lustre/GPFS experience
  • Proficient in Linux, Python, Bash scripting
  • Cloud storage experience (AWS, Azure, GCP)
  • Experience with SLURM/LSF
  • Docker, Kubernetes experience
Good to have:
  • NVIDIA GPU, CUDA, NCCL, MLPerf experience
  • Machine learning/deep learning knowledge
  • InfiniBand, IB/RDMA experience
  • SDN and AI/HPC cluster networking
  • PyTorch/TensorFlow familiarity
Perks:
  • Highly competitive salary
  • Comprehensive benefits package
  • Equity

Job Details

NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can address, and that matter to the world. This is our life’s work, to amplify human creativity and intelligence. Make the choice to join us today!

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking fast storage solutions to enable runs of demanding deep learning, high-performance computing, and computationally intensive workloads. We seek an expert to identify architectural changes encompassing file, block, and object storage, to cater to the requirements of an expanding cloud infrastructure. As an expert, you will help us address the strategic next-gen storage challenges we encounter, including storage design for large-scale, high-performance workloads, evolving our private/public cloud strategy, capacity modelling, and growth planning across our global computing environment.

What you'll be doing:

  • Research and implement distributed storage services.

  • Design and implement an on-prem AI/HPC infrastructure supplemented with cloud computing to support the growing needs of NVIDIA.

  • Design and implement scalable and efficient next-gen storage solutions tailored for data-intensive applications, optimizing performance and cost-effectiveness.

  • Develop tooling to automate management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.

  • Document general procedures and practices, and perform technology evaluations, related to distributed file systems.

  • Collaborate across teams to better understand developers' workflows and gather their infrastructure requirements.

  • Influence and guide methodologies for building, testing, and deploying applications to ensure optimal performance and resource utilization.

  • Support our researchers in running their flows on our clusters, including performance analysis and optimization of deep learning workflows.

  • Perform root cause analysis and suggest corrective actions for problems at both large and small scales.

What we need to see:

  • Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience.

  • 8+ years of experience designing and operating large scale storage infrastructure.

  • Experience analyzing and tuning performance for a variety of AI/HPC workloads.

  • Experience with one or more parallel or distributed filesystems, such as Lustre or GPFS, is a must.

  • Proficient in CentOS/RHEL and/or Ubuntu Linux distributions, including Python programming and Bash scripting.

  • Experience with the architecture, design, and operation of storage solutions on any of the leading cloud environments (AWS, Azure, or GCP).

  • Experience with AI/HPC cluster job schedulers such as SLURM or LSF.

  • In-depth understanding of container technologies like Docker and Enroot.

  • Experience with AI/HPC workflows that use MPI

Ways to stand out from the crowd:

  • Experience with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking

  • Experience with Machine Learning and Deep Learning concepts, algorithms and models

  • Familiarity with InfiniBand, including IPoIB and RDMA

  • Background with Software Defined Networking and AI/HPC cluster networking

  • Familiarity with deep learning frameworks like PyTorch and TensorFlow

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have some of the most resourceful and talented people in the world working for us and, due to unprecedented growth, our extraordinary engineering teams are growing fast. If you're a creative and autonomous engineer with real passion for technology, we want to hear from you.

The base salary range is 184,000 USD - 356,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.