Senior System Software Engineer, NCCL - Partner Enablement

1 Month ago • 5 Years + • DevOps • Research & Development • $148,000 PA - $287,500 PA

Job Summary

Job Description

NVIDIA's GPU Communications Libraries and Networking team seeks a Senior System Software Engineer to focus on NCCL (NVIDIA Collective Communications Library) partner enablement. Responsibilities include troubleshooting functional and performance issues with NCCL, conducting performance analysis on GPU clusters, developing diagnostic tools and automation, providing HPC expertise to customers and support teams, creating training materials and webinars, and collaborating with internal teams across different time zones. The role requires deep expertise in parallel programming, high-performance networking (Infiniband, RoCE, Ethernet), Linux, and scripting languages (Python).
Must have:
  • 5+ years relevant experience
  • Parallel programming & communication runtime experience
  • Excellent C/C++ programming skills
  • HPC or AI community support experience
  • High-performance networking expertise (Infiniband/RoCE/Ethernet)
  • Linux fundamentals & Python scripting
Good to have:
  • HPC cluster infrastructure experience
  • System administration experience (large clusters)
  • Network configuration debugging in large deployments
  • CUDA programming and/or GPU familiarity
  • Deep Learning framework experience (PyTorch, TensorFlow)
Perks:
  • Equity
  • Benefits

Job Details

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars.

We are the GPU Communications Libraries and Networking team at NVIDIA. We deliver communication runtimes like NCCL and NVSHMEM for Deep Learning and HPC applications. We are looking for a motivated Partner Enablement Engineer to guide our key partners and customers with NCCL. Most DL/HPC applications run on large clusters with high-speed networking (Infiniband, RoCE, Ethernet). This is an outstanding opportunity to get an end-to-end understanding of the AI networking stack. Are you ready to contribute to the development of innovative technologies and help realize NVIDIA's vision?

What you will be doing:

  • Engage with our partners and customers to root cause functional and performance issues reported with NCCL

  • Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters

  • Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP, etc.)

  • Guide our customers and support teams on HPC knowledge and standard methodologies for running applications on multi-node clusters

  • Document and conduct trainings/webinars for NCCL

  • Engage with internal teams in different time zones on networking, GPUs, storage, infrastructure and support

What we need to see:

  • B.S./M.S. degree in CS/CE or equivalent experience with 5+ years of relevant experience. Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)

  • Excellent C/C++ programming skills, including debugging, profiling, code optimization, performance analysis, and test design

  • Experience working with engineering or academic research community supporting HPC or AI

  • Practical experience with high performance networking: Infiniband/RoCE/Ethernet networks, RDMA, topologies, congestion control

  • Expert in Linux fundamentals and a scripting language, preferably Python

  • Familiar with containers, cloud provisioning and scheduling tools (Docker, Docker Swarm, Kubernetes, SLURM, Ansible)

  • Adaptability and passion to learn new areas and tools

  • Flexibility to work and communicate effectively across different teams and time zones

Ways to stand out from the crowd:

  • Experience conducting performance benchmarking and developing infrastructure on HPC clusters. Prior system administration experience, especially for large clusters. Experience debugging network configuration issues in large-scale deployments

  • Familiarity with CUDA programming and/or GPUs. Good understanding of Machine Learning concepts and experience with Deep Learning frameworks such as PyTorch and TensorFlow

  • Deep understanding of technology and passion for what you do

The base salary range is 148,000 USD - 287,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

