Senior Solutions Architect, Cloud Infrastructure and DevOps

1 Month ago • 8 Years + • DevOps

Job Summary

Job Description

NVIDIA seeks a Senior Cloud Infrastructure/DevOps Solutions Architect to design, implement, and maintain large-scale HPC/AI clusters. Responsibilities include managing workload schedulers, developing CI/CD pipelines, automating infrastructure deployment and monitoring, and troubleshooting across various layers. The role involves customer interaction, developing standard methodologies, supporting R&D, and regional travel. Experience with cloud platforms (AWS, Azure, GCP), job scheduling (Slurm, Kubernetes), and automation tools (Ansible, Puppet) is crucial.
Must have:
  • Design and implement large-scale HPC/AI clusters
  • Manage Linux job schedulers & orchestration tools
  • Develop CI/CD pipelines and automation tools
  • Experience with cloud platforms (AWS, Azure, GCP)
  • Troubleshooting from bare metal to application level
  • Knowledge of HPC and AI solution technologies
  • Experience with multiple storage solutions
Good to have:
  • Knowledge of CPU and/or GPU architecture
  • Knowledge of Kubernetes and container technologies
  • Experience with GPU-focused hardware/software
  • Background with RDMA (InfiniBand or RoCE) fabrics

Job Details

NVIDIA is the world leader in computer graphics, artificial intelligence, and accelerated computing. For over 25 years, we have been at the forefront of research and engineering around the greatest advances in technology. Our history of innovation drives us to solve the world's hardest problems. NVIDIA is looking for a Senior Cloud Infrastructure/DevOps Solutions Architect to join its NVIDIA Infrastructure Specialist Team. Academic and commercial groups around the world are using NVIDIA products to revolutionize deep learning and data analytics, and to power data centers. Join the team building many of the largest and fastest AI/HPC systems in the world! We are looking for someone who can work on a dynamic, customer-focused team that requires excellent interpersonal skills. This role interacts with customers, partners, and internal teams to analyze, define, and implement large-scale Networking projects. The scope of these efforts includes a combination of Networking, System Design, and Automation, as well as being the face to the customer!

What you'll be doing:

  • Design, implement and maintain large-scale HPC/AI clusters with monitoring, logging and alerting.

  • Manage Linux job/workload schedulers and orchestration tools.

  • Develop and maintain continuous integration and delivery pipelines.

  • Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.

  • Deploy monitoring solutions for the servers, network and storage.

  • Troubleshoot bottom-up, from bare metal through the operating system and software stack to the application level.

  • As a technical resource, develop, refine and document standard methodologies to share with internal teams.

  • Support Research & Development activities and engage in POCs/POVs for future improvements.

  • Regional travel is required for on-site visits with customers.

What we need to see:

  • BS/MS/PhD or equivalent experience in Computer Science, Data Science, Electrical/Computer Engineering, Physics, Mathematics, or other Engineering fields, with at least 8 years of work or research experience in networking fundamentals, the TCP/IP stack, and data center architecture.

  • Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software.

  • Direct design, implementation and management experience with cloud computing platforms (e.g. AWS, Azure, Google Cloud).

  • Experience with job scheduling workloads and orchestration technologies such as Slurm, Kubernetes and Singularity.

  • Hands-on, adaptable problem-solver with a collaborative approach and strong ability to thrive in fast-paced, dynamic environments, working effectively with cross-functional teams to deliver innovative solutions.

  • Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking (sockets, firewalld, iptables, Wireshark, etc.) and internals, ACLs and OS-level security protection, and common protocols such as TCP, DHCP and DNS.

  • Experience with multiple storage solutions such as Lustre, GPFS, ZFS and XFS. Familiarity with newer and emerging storage technologies.

  • Python programming and bash scripting experience.

  • Comfortable with automation and configuration management tools including Jenkins, Ansible, Puppet/Chef, etc.

  • Deep knowledge of Networking Protocols like InfiniBand and Ethernet.

  • Deep understanding of and experience with virtual systems (for example VMware, Hyper-V, KVM, or Citrix).

  • Strong written, verbal, and listening skills in English are critical.

Ways to stand out from the crowd:

  • Knowledge of CPU and/or GPU architecture.

  • Knowledge of Kubernetes and container-related microservice technologies.

  • Experience with GPU-focused hardware/software (DGX, CUDA).

  • Background with RDMA (InfiniBand or RoCE) fabrics.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking individuals in the world working for us. If you're creative and autonomous, we want to hear from you.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
