Senior Manager, CSP Engagements – System Software SWAT Team

12+ years of experience • $272,000 - $425,500 per year
Software Development & Engineering

Job Description

NVIDIA is seeking a Senior Manager to lead the System Software SWAT Team within CSP Engagements, focusing on data center platforms like GB200/GB300. This elite, cross-functional group acts as a rapid-response hub for hyperscaler customers, triaging and resolving complex system software issues across firmware, Linux kernel/device drivers, networking, and virtualization. The role involves building and mentoring the team, partnering with CSP technical leaders, and transforming high-visibility escalations into predictable, customer-validated solutions to enhance NVIDIA’s quality at hyperscale.
Perks:
  • Equity
  • Benefits

Company

Job Requisition ID: JR2007221
Job Category: Engineering
Time Type: Full time

NVIDIA is seeking a Senior Manager to lead our System Software SWAT Team within CSP Engagements, focusing on data center platforms such as GB200/GB300 and next‑generation systems. This elite, cross‑functional group is the rapid‑response hub for hyperscaler customers—running triage and war‑rooms, operating customer‑like labs to deliver golden repros, and driving issues from first signal to validated fix across firmware, Linux kernel / device drivers, networking, and virtualization. You will build and mentor the team, partner closely with CSP technical leaders and TPMs, and turn complex, high‑visibility escalations into predictable, customer‑validated outcomes that raise NVIDIA’s quality bar at hyperscale.

What you’ll be doing:

  • Lead a cross-functional SWAT team focused on rapid triage, debugging, and resolution of complex system software issues for hyperscaler customers.
  • Drive technical incident response, war-room operations, and escalation management across firmware, Linux kernel, drivers, networking, virtualization, and observability layers.
  • Build and mentor a high-performing team of senior engineers; set operational standards for incident response, on-call rotations, and continuous improvement.
  • Serve as a primary technical and operational focal point for hyperscaler customers, managing expectations, communications, and stakeholder relationships.
  • Collaborate with CSP technical leads, TPMs, and internal engineering teams to deliver customer-validated solutions and influence product quality and release criteria.
  • Operate customer-like labs to reproduce issues, validate fixes, and ensure robust telemetry and observability.
  • Provide executive-level status updates, risk assessments, and recommendations for critical customer issues.

What we need to see:

  • 12+ years of overall experience in system software (firmware, Linux kernel, drivers, networking, virtualization), with at least 5 years in data center or HPC software environments.
  • Bachelor's degree or equivalent experience.
  • At least 3 years of direct experience working with hyperscalers in production environments.
  • 6+ years of management experience.
  • Proven leadership in managing customer escalations, technical incident response, and cross-functional teams.
  • Deep technical expertise in Linux kernel, device drivers, ARM (aarch64) & x86, OpenBMC/SBIOS, out-of-band/in-band management, DMTF protocols (Redfish, PLDM, MCTP, SPDM), and networking (TCP/IP, Ethernet, InfiniBand).
  • Strong customer management and team member engagement skills; ability to communicate complex technical issues to executive and engineering audiences.
  • Demonstrated success in reducing time-to-mitigation, improving release predictability, and driving continuous improvement in technical operations.

Ways to stand out from the crowd:

  • Experience building and operating customer-like labs, automation, and telemetry frameworks.
  • Familiarity with GPU computing (CUDA), large-scale AI/HPC workloads, NVLink, Grace, and cluster-level deployment/management.
  • Knowledge of CXL/memory fabric fundamentals and contributions to industry standards (OCP, DMTF).

NVIDIA is widely considered one of the world's most desirable employers in technology. We have some of the world's most forward-thinking and hard-working people working for us. If you're creative and autonomous, we want to hear from you!

NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company.” We're looking to grow our company and establish teams with the most thoughtful people in the world.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 425,500 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until November 15, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
