Senior Solutions Architect, Infiniband and Networking Ethernet

1 Month ago • 8 Years + • Network Engineering • DevOps

Job Summary

Job Description

NVIDIA seeks a Senior Networking (ETH/IB) Solutions Architect to design and implement large-scale networking projects for AI/HPC infrastructure. Responsibilities include supporting operational reliability, focusing on performance, monitoring, and alerting of AI clusters. The role involves the entire service lifecycle, from design and deployment to operation and refinement, and requires excellent customer interaction skills. This includes working with customers, partners, and internal teams to analyze, define, and implement solutions. Strong automation skills using tools like Ansible, Salt, and Python are essential.
Must have:
  • 8+ years networking experience (LAN, InfiniBand)
  • Linux system administration/DevOps expertise
  • Automation skills (Ansible, Salt, Python)
  • Customer-focused leadership
  • Strong communication skills
Good to have:
  • Linux or Networking Certifications
  • HPC architecture knowledge
  • Experience with Slurm/PBS
  • Python or Bash scripting
  • GPU/MPI experience
  • BCM (Base Command Manager) knowledge

Job Details

NVIDIA is the world leader in computer graphics, artificial intelligence, and accelerated computing. For over 25 years, we have been at the forefront of research and engineering around the greatest advances in technology. Our history of innovation drives us to solve the world's hardest problems.

NVIDIA is looking for a Senior Networking (ETH/IB) Solutions Architect to join its NVIDIA Infrastructure Specialist Team. Academic and commercial groups around the world are using NVIDIA products to revolutionize deep learning and data analytics, and to power data centers. Join the team building many of the largest and fastest AI/HPC systems in the world! We are looking for someone who can work on a dynamic, customer-focused team that requires excellent interpersonal skills. This role interacts with customers, partners, and internal teams to analyze, define, and implement large-scale networking projects. The scope of these efforts includes a combination of networking, system design, and automation, as well as serving as the face to the customer!

What you'll be doing:

  • Primary responsibilities will include building AI/HPC infrastructure for new and existing customers.

  • Support operational and reliability aspects of large-scale AI clusters, focusing on performance at scale, real-time monitoring, logging, and alerting.

  • Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation, and refinement.

  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.

  • Provide feedback to internal teams such as opening bugs, documenting workarounds, and suggesting improvements.

What we need to see:

  • BS/MS/PhD or equivalent experience in Computer Science, Data Science, Electrical/Computer Engineering, Physics, Mathematics, or other Engineering fields, with at least 8 years of work or research experience in networking fundamentals, the TCP/IP stack, and data center architecture.

  • 8+ years of experience configuring, testing, validating, and resolving issues in LAN and InfiniBand networking, including use of validation tools for InfiniBand health and performance, in medium- to large-scale HPC/AI network environments.

  • Knowledge and experience with Linux system administration/DevOps, process management, package management, task scheduling, kernel management, boot procedures, troubleshooting, performance reporting/optimization/logging, and network routing/advanced networking (tuning and monitoring).

  • Strong focus on customer needs and satisfaction. Self-motivated, with excellent leadership skills, including in working with customers.

  • Extensive knowledge of automation, including delivering fully automated network provisioning solutions using Ansible, Salt, and Python.

  • Strong written, verbal, and listening skills in English are essential.

Ways to stand out from the crowd:

  • Linux or Networking Certifications.

  • Experience with high-performance computing architectures. Understanding of how job schedulers (Slurm, PBS) work.

  • Proven knowledge of Python or Bash. Infrastructure Specialist delivery experience.

  • Knowledge of cluster management technologies (bonus credit for BCM, Base Command Manager).

  • Experience with GPU (Graphics Processing Unit) focused hardware/software.

  • Experience with MPI (Message Passing Interface).

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking individuals in the world working for us. If you're creative and autonomous, we want to hear from you.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
