Senior HPC Performance Engineer

1 Day ago • 3 Years + • Research & Development • $148,000 PA - $287,500 PA

Job Summary

Job Description

NVIDIA's GPU Communications Libraries and Networking team seeks a Senior HPC Performance Engineer to contribute to the development of innovative communication libraries. Responsibilities include in-depth performance analysis on large multi-GPU/multi-node clusters, evaluating proof-of-concepts, troubleshooting performance issues, building data visualization tools, and collaborating with a global team. The ideal candidate possesses strong HPC and parallel programming experience, expertise in performance benchmarking, and a solid understanding of computer architecture and systems software. Experience with high-speed interconnects and networking technologies is highly desirable.
Must have:
  • M.S./Ph.D. in CS or related field
  • 3+ years HPC experience
  • Parallel programming expertise
  • Performance benchmarking experience
  • Strong system architecture understanding
  • C/C++ micro-benchmarking
  • Debugging across HW/SW stack
  • Python scripting proficiency
Good to have:
  • InfiniBand/Ethernet network experience
  • Experience debugging network issues
  • CUDA programming experience
  • Deep Learning framework familiarity (PyTorch, TensorFlow)
  • Experience with Kubernetes, SLURM, Ansible, Docker
Perks:
  • Equity
  • Benefits

Job Details

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars.

We are the GPU Communications Libraries and Networking team at NVIDIA. We deliver libraries like NCCL, NVSHMEM, and UCX for Deep Learning and HPC. We are looking for a motivated Performance Engineer to influence the roadmap of our communication libraries. Today's DL and HPC applications have huge compute demands and run at scales of up to tens of thousands of GPUs. The GPUs are connected with high-speed interconnects (e.g. NVLink, PCIe) within a node and with high-speed networking (e.g. InfiniBand, Ethernet) across nodes. Communication performance between the GPUs has a direct impact on end-to-end application performance, and the stakes are even higher at huge scales! This is an outstanding opportunity for someone with an HPC and performance background to advance the state of the art in this space. Are you ready to contribute to the development of innovative technologies and help realize NVIDIA's vision?

What you will be doing:

  • Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.

  • Study the interaction of our libraries with all HW (GPU, CPU, Networking) and SW components in the stack

  • Evaluate proofs of concept and conduct trade-off analysis when multiple solutions are available

  • Triage and root-cause performance issues reported by our customers

  • Collect a lot of performance data; build tools and infrastructure to visualize and analyze the information

  • Collaborate with a very dynamic team across multiple time zones

What we need to see:

  • M.S. (or equivalent experience) or Ph.D. in Computer Science or a related field, with relevant performance engineering and HPC experience

  • 3+ years of experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)

  • Experience conducting performance benchmarking and triage on large scale HPC clusters

  • Good understanding of computer system architecture, HW-SW interactions and operating systems principles (aka systems software fundamentals)

  • Ability to implement micro-benchmarks in C/C++ and to read and modify the code base when required

  • Ability to debug performance issues across the entire HW/SW stack; proficiency in a scripting language, preferably Python

  • Familiar with containers, cloud provisioning and scheduling tools (Kubernetes, SLURM, Ansible, Docker)

  • Adaptability and passion to learn new areas and tools. Flexibility to work and communicate effectively across different teams and time zones

Ways to stand out from the crowd:

  • Practical experience with InfiniBand/Ethernet networks in areas like RDMA, topologies, congestion control

  • Experience debugging network issues in large scale deployments

  • Familiarity with CUDA programming and/or GPUs

  • Experience with Deep Learning frameworks such as PyTorch, TensorFlow

The base salary range is 148,000 USD - 287,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

