Senior System Software Engineer, NCCL - Partner Enablement

1 Day ago • 5 Years + • DevOps • Research & Development • $148,000 PA - $287,500 PA

Job Summary

Job Description

NVIDIA's GPU Communications Libraries and Networking team seeks a Senior System Software Engineer to focus on NCCL (NVIDIA Collective Communications Library) partner enablement. Responsibilities include troubleshooting functional and performance issues with NCCL, conducting performance analysis on GPU clusters, developing diagnostic tools and automation, providing HPC expertise to customers and support teams, creating training materials and webinars, and collaborating with internal teams across different time zones. The role requires deep expertise in parallel programming, high-performance networking (Infiniband, RoCE, Ethernet), Linux, and scripting languages (Python).
Must have:
  • 5+ years relevant experience
  • Parallel programming & communication runtime experience
  • Excellent C/C++ programming skills
  • HPC or AI community support experience
  • High-performance networking expertise (Infiniband/RoCE/Ethernet)
  • Linux fundamentals & Python scripting
Good to have:
  • HPC cluster infrastructure experience
  • System administration experience (large clusters)
  • Network configuration debugging in large deployments
  • CUDA programming and/or GPU familiarity
  • Deep Learning framework experience (PyTorch, TensorFlow)
Perks:
  • Equity
  • Benefits

Job Details

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars.

We are the GPU Communications Libraries and Networking team at NVIDIA. We deliver communication runtimes like NCCL and NVSHMEM for Deep Learning and HPC applications. We are looking for a motivated Partner Enablement Engineer to guide our key partners and customers with NCCL. Most DL/HPC applications run on large clusters with high-speed networking (Infiniband, RoCE, Ethernet). This is an outstanding opportunity to get an end-to-end understanding of the AI networking stack. Are you ready to contribute to the development of innovative technologies and help realize NVIDIA's vision?

What you will be doing:

  • Engage with our partners and customers to root cause functional and performance issues reported with NCCL

  • Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters

  • Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP, etc.)

  • Guide our customers and support teams on HPC knowledge and standard methodologies for running applications on multi-node clusters

  • Document and conduct trainings/webinars for NCCL

  • Engage with internal teams in different time zones on networking, GPUs, storage, infrastructure and support.

What we need to see:

  • B.S./M.S. degree in CS/CE or equivalent experience with 5+ years of relevant experience. Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)

  • Excellent C/C++ programming skills, including debugging, profiling, code optimization, performance analysis, and test design

  • Experience working with the engineering or academic research community in support of HPC or AI

  • Practical experience with high performance networking: Infiniband/RoCE/Ethernet networks, RDMA, topologies, congestion control

  • Expert in Linux fundamentals and a scripting language, preferably Python

  • Familiar with containers, cloud provisioning and scheduling tools (Docker, Docker Swarm, Kubernetes, SLURM, Ansible)

  • Adaptability and passion to learn new areas and tools

  • Flexibility to work and communicate effectively across different teams and timezones

Ways to stand out from the crowd:

  • Experience conducting performance benchmarking and developing infrastructure on HPC clusters. Prior system administration experience, especially for large clusters. Experience debugging network configuration issues in large-scale deployments

  • Familiarity with CUDA programming and/or GPUs. Good understanding of Machine Learning concepts and experience with Deep Learning frameworks such as PyTorch and TensorFlow

  • Deep understanding of technology and a passion for what you do

The base salary range is 148,000 USD - 287,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.


