AI and ML Infra Software Engineer, GPU Clusters

3 Weeks ago • 5 Years + • Artificial Intelligence • $148,000 PA - $356,500 PA

Job Summary

Job Description

NVIDIA seeks an AI/ML Infrastructure Software Engineer to enhance productivity for researchers by improving GPU cluster infrastructure. Responsibilities include collaborating with research teams to understand infrastructure needs, optimizing performance for high availability and scalability, defining key performance indicators for researcher efficiency, and collaborating with diverse teams to build a seamless AI/ML ecosystem. The role requires staying updated on advancements in AI/ML technologies and implementing them within the company. This involves working with HPC infrastructure, accelerated computing, storage systems, scheduling tools, high-speed networking, containers, and large-scale distributed training workloads using frameworks like PyTorch, NeMo, or JAX.
Must have:
  • 5+ years experience in AI/ML and HPC
  • HPC infrastructure expertise
  • Accelerated computing knowledge
  • Experience with PyTorch, NeMo, or JAX
  • Proficiency in Python, Go, Bash
  • Strong communication & collaboration skills
Perks:
  • Competitive salary
  • Comprehensive benefits package
  • Equity

Job Details

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

We are currently hiring an AI/ML Infrastructure Software Engineer at NVIDIA to join our Hardware Infrastructure team. As an Engineer, you will play a crucial role in boosting productivity for our researchers through implementing advancements across the entire stack. Your primary responsibility will involve working closely with customers to identify and resolve infrastructure gaps, enabling innovative AI and ML research on GPU Clusters. Together, we can create powerful, efficient, and scalable solutions as we shape the future of AI/ML technology!

What you will be doing:

  • Collaborate closely with our AI and ML research teams to understand their infrastructure needs and obstacles, translating those observations into actionable improvements.

  • Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization.

  • Help define and improve important measures of AI researcher efficiency, ensuring that our actions are in line with measurable results.

  • Collaborate with diverse teams, including researchers, data engineers, and DevOps professionals, to build a seamless and coordinated AI/ML infrastructure ecosystem.

  • Stay on top of the latest advancements in AI/ML technologies, frameworks, and effective strategies, and promote their implementation within the company.

What we need to see:

  • BS or equivalent experience in Computer Science or related field, with 5+ years of proven experience in AI/ML and HPC workloads and infrastructure.

  • Hands-on experience using or operating High Performance Computing (HPC) grade infrastructure, as well as in-depth knowledge of accelerated computing (e.g., GPU, custom silicon), storage (e.g., Lustre, GPFS, BeeGFS), scheduling & orchestration (e.g., Slurm, Kubernetes, LSF), high-speed networking (e.g., InfiniBand, RoCE, Amazon EFA), and container technologies (e.g., Docker, Enroot).

  • Expertise in running and optimizing large-scale distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX, along with a deep understanding of AI/ML workflows, encompassing data processing, model training, and inference pipelines.

  • Proficiency in programming and scripting languages such as Python, Go, and Bash; familiarity with cloud computing platforms (e.g., AWS, GCP, Azure); and experience with parallel computing frameworks and paradigms.

  • Passion for continual learning and keeping abreast of new technologies and effective approaches in the AI/ML infrastructure field.

  • Excellent communication and collaboration skills, with the ability to work effectively with teams and individuals of different backgrounds.

NVIDIA provides competitive salaries and a comprehensive benefits package. Our engineering teams are expanding rapidly due to exceptional growth. If you're a passionate and independent engineer with a love for technology, we want to hear from you.

The base salary range is 148,000 USD - 356,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

Santa Clara, California, United States (On-Site)
