Senior Solutions Architect, Cloud Infrastructure and DevOps

1 Day ago • 8 Years + • DevOps

Job Summary

Job Description

NVIDIA seeks a Senior Cloud Infrastructure/DevOps Solutions Architect to join its Infrastructure Specialist Team. Responsibilities include managing large-scale HPC/AI clusters, developing CI/CD pipelines, automating infrastructure deployment and management, deploying monitoring solutions, troubleshooting across various layers (bare metal to application), developing standard methodologies, supporting R&D, and engaging in POCs/POVs. The role involves interaction with customers, partners, and internal teams, requiring strong interpersonal and communication skills, particularly with Mandarin-speaking customers.
Must have:
  • 8+ years of experience in networking
  • HPC and AI solution knowledge
  • Kubernetes expertise for AI/ML
  • HPC cluster management experience
  • Job scheduling (Slurm, Kubernetes)
  • Windows/Linux systems expertise
  • Experience with various storage solutions (Lustre, GPFS, etc.)
  • Python programming and bash scripting
  • CI/CD pipeline knowledge
  • Automation tools (Ansible, Puppet/Chef)
Good to have:
  • CPU/GPU architecture knowledge
  • Kubernetes microservice technologies
  • GPU-focused hardware/software experience (DGX, CUDA)
  • RDMA (InfiniBand or RoCE) fabric experience

Job Details

NVIDIA is the world leader in computer graphics, artificial intelligence, and accelerated computing. For over 25 years, we have been at the forefront of research and engineering around the greatest advances in technology. Our history of innovation drives us to solve the world's hardest problems.

NVIDIA is looking for a Senior Cloud Infrastructure/DevOps Solutions Architect to join its NVIDIA Infrastructure Specialist Team. Academic and commercial groups around the world are using NVIDIA products to revolutionize deep learning and data analytics, and to power data centers. Join the team building many of the largest and fastest AI/HPC systems in the world! We are looking for someone who can work on a dynamic, customer-focused team that requires excellent interpersonal skills. This role interacts with customers, partners, and internal teams to analyze, define, and implement large-scale networking projects. The scope of these efforts includes a combination of networking, system design, and automation, as well as acting as the team's face to the customer!

What you'll be doing:

  • Maintain large-scale HPC/AI clusters with monitoring, logging, and alerting.

  • Manage Linux job/workload schedulers and orchestration tools.

  • Develop and maintain continuous integration and delivery pipelines.

  • Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.

  • Deploy monitoring solutions for the servers, network and storage.

  • Troubleshoot from the bottom up, covering bare metal, operating system, software stack, and application level.

  • As a technical resource, develop, refine, and document standard methodologies to share with internal teams.

  • Support Research & Development activities and engage in POCs/POVs for future improvements.

What we need to see:

  • BS/MS/PhD or equivalent experience in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields.

  • At least 8 years of professional experience in networking fundamentals, TCP/IP stack, and data center architecture.

  • Knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.

  • Extensive knowledge and hands-on experience with Kubernetes, including container orchestration for AI/ML workloads, resource scheduling, scaling, and integration with HPC environments.

  • Experience in managing and installing HPC clusters, including deployment, optimization, and troubleshooting.

  • Experience with job scheduling workloads and orchestration technologies such as Slurm, Kubernetes, and Singularity.

  • Excellent knowledge of Windows and Linux systems (Red Hat/CentOS and Ubuntu), including internals, ACLs, OS-level security protections, and common protocols such as TCP, DHCP, and DNS.

  • Experience with multiple storage solutions, including Lustre, GPFS, ZFS, and XFS. Familiarity with newer and emerging storage technologies is a plus.

  • Proficiency in Python programming and bash scripting.

  • Knowledge of CI/CD pipelines for software deployment and automation.

  • Comfortable with automation and configuration management tools, including Jenkins, Ansible, Puppet/Chef, etc.

  • Ability to communicate technical concepts and collaborate effectively with Mandarin-speaking customers.

Ways to stand out from the crowd:

  • Knowledge of CPU and/or GPU architecture.

  • Knowledge of Kubernetes and container-related microservice technologies.

  • Experience with GPU-focused hardware/software (DGX, CUDA).

  • Background with RDMA (InfiniBand or RoCE) fabrics.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking individuals in the world working for us. If you're creative and autonomous, we want to hear from you.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
