Senior Observability Engineer, AI and HPC

2 Months ago • 8 Years + • Research & Development • $184,000 PA - $356,500 PA

Job Summary

Job Description

NVIDIA seeks a Senior Observability Engineer to architect and implement distributed observability systems for AI and HPC clusters. Collaborate with AI, HW, and SW teams to build solutions for data collection, aggregation, enrichment, storage, retrieval, and visualization. Develop, test, and deploy data collectors, pipelines, and visualization services. Define data collection and retention policies. Work in a diverse team to provide operational and strategic data to improve performance, productivity, and efficiency. Continuously improve quality, workloads, and processes through better observability. This role requires extensive experience with large-scale, distributed observability systems and proficiency in tools like Apache Spark, Elastic/OpenSearch, Grafana, and Prometheus.
Must have:
  • Large-scale observability system experience
  • Collaboration with data scientists & engineers
  • Raw data to actionable reports
  • Experience with observability platforms (Spark, Elastic/OpenSearch, Grafana, Prometheus)
  • Python programming and API calls
  • Improve others' productivity
  • Planning and interpersonal skills
Good to have:
  • Background in computer science, ML, DL, open-source, infrastructure, GPU technology
  • Experience in infrastructure software, production app development, DevOps
  • Datacenter and large-scale distributed computing management
  • Experience with AI researchers/EDA developers
  • Process improvement and efficiency measurement
Perks:
  • Equity
  • Benefits

Job Details

NVIDIA’s Hardware Infrastructure organization is seeking a Senior Observability Engineer to help architect and implement our distributed observability systems for AI and HPC clusters. We serve and collaborate directly with NVIDIA’s rapidly growing AI, HW, and SW engineering and research teams across the company. You will be working with a team of dedicated engineers on systems for data collection, aggregation, enrichment, storage, retrieval, and visualization to spectacularly improve efficiency, performance, and productivity of AI and HPC workloads. You will develop, deploy, and operate observability solutions for multiple compute clusters around the world.
 

What You’ll Be Doing:

  • Collaborate with AI, HW, and SW engineering and research teams to deliver observability solutions that meet their needs in AI/HPC clusters.

  • Develop, test, and deploy data collectors, pipelines, and visualization and retrieval services.

  • Define data collection and retention policies to balance network bandwidth, system load, and storage capacity costs with data analysis requirements.

  • Work in a diverse team to provide operational and strategic data to empower our engineers and researchers to improve performance, productivity, and efficiency.

  • Continuously improve quality, workloads, and processes through better observability.

What We Need to See:

  • Experience developing large-scale, distributed observability systems.

  • Ability to collaborate with data scientists, researchers, and engineering teams to identify high-value data for collection and analysis.

  • Experience turning raw data into actionable reports.

  • Experience with observability platforms such as Apache Spark, Elastic/OpenSearch, Grafana, Prometheus, and similar open-source tools.

  • Python programming experience, including use of API calls.

  • Passion for improving the productivity of others.

  • Excellent planning and interpersonal skills.

  • Flexibility and adaptability in a dynamic environment with changing requirements.

  • MS (preferred) or BS in Computer Science, Electrical Engineering, or a related field (or equivalent experience).

  • 8+ years of proven experience.

Ways to Stand Out from the Crowd:

  • Background in computer science, machine learning, deep learning, open-source software, infrastructure technologies, and GPU technology.

  • Prior experience with infrastructure software, production application development, software release and support methodology, and DevOps.

  • Experience managing datacenters and large-scale distributed computing.

  • Experience working with AI researchers and/or EDA developers.

  • Consistent track record of driving process improvements and measuring efficiency, plus a passion for sharing knowledge and experience in driving complex projects end-to-end.

The base salary range is 184,000 USD - 356,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
