Senior Observability Engineer, AI and HPC

1 Month ago • 8+ Years • Research & Development • $184,000 - $356,500 per annum

Job Summary

Job Description

NVIDIA seeks a Senior Observability Engineer to architect and implement distributed observability systems for AI and HPC clusters. Collaborate with AI, HW, and SW teams to build solutions for data collection, aggregation, enrichment, storage, retrieval, and visualization. Develop, test, and deploy data collectors, pipelines, and visualization services. Define data collection and retention policies. Work in a diverse team to provide operational and strategic data to improve performance, productivity, and efficiency. Continuously improve quality, workloads, and processes through better observability. This role requires extensive experience with large-scale, distributed observability systems and proficiency in tools like Apache Spark, Elastic/OpenSearch, Grafana, and Prometheus.
Must have:
  • Large-scale observability system experience
  • Collaboration with data scientists & engineers
  • Experience turning raw data into actionable reports
  • Experience with observability platforms (Spark, Elastic/OpenSearch, Grafana, Prometheus)
  • Python programming and API calls
  • Improve others' productivity
  • Planning and interpersonal skills
Good to have:
  • Background in computer science, ML, DL, open-source, infrastructure, GPU technology
  • Experience in infrastructure software, production app development, DevOps
  • Datacenter and large-scale distributed computing management
  • Experience with AI researchers/EDA developers
  • Process improvement and efficiency measurement
Perks:
  • Equity
  • Benefits

Job Details

NVIDIA’s Hardware Infrastructure organization is seeking a Senior Observability Engineer to help architect and implement our distributed observability systems for AI and HPC clusters. We serve and collaborate directly with NVIDIA’s rapidly growing AI, HW, and SW engineering and research teams across the company. You will be working with a team of dedicated engineers on systems for data collection, aggregation, enrichment, storage, retrieval, and visualization to spectacularly improve efficiency, performance, and productivity of AI and HPC workloads. You will develop, deploy, and operate observability solutions for multiple compute clusters around the world.
 

What You’ll Be Doing:

  • Collaborate with AI, HW, SW engineering and research teams to deliver observability solutions that meet their needs in AI/HPC clusters.

  • Develop, test, and deploy data collectors, pipelines, visualization and retrieval services.

  • Define data collection and retention policies to balance network bandwidth, system load, and storage capacity costs with data analysis requirements.

  • Work in a diverse team to provide operational and strategic data to empower our engineers and researchers to improve performance, productivity, and efficiency.

  • Continuously improve quality, workloads, and processes through better observability.

What We Need to See:

  • Experience developing large scale, distributed observability systems.

  • Ability to collaborate with data scientists, researchers, and engineering teams to identify high value data for collection and analysis.

  • Experience turning raw data into actionable reports.

  • Experience with observability platforms such as Apache Spark, Elastic/OpenSearch, Grafana, Prometheus, and other similar open-source tools.

  • Python programming experience and familiarity with making API calls.

  • Passion for improving the productivity of others.

  • Excellent planning and interpersonal skills.

  • Flexibility and adaptability when working in a dynamic environment with changing requirements.

  • MS (preferred) or BS in Computer Science, Electrical Engineering, or a related field (or equivalent experience).

  • 8+ years of proven experience.

Ways To Stand Out from The Crowd:

  • Background in computer science, machine learning, deep learning, open-source software, infrastructure technologies, and GPU technology.

  • Prior experience in infrastructure software, production application development, release and support methodology, and DevOps.

  • Experience in the management of datacenters and large-scale distributed computing.

  • Experience working with AI researchers and/or EDA developers.

  • Consistent track record of driving process improvements and measuring efficiency, a passion for sharing knowledge, and experience driving complex projects end-to-end.

The base salary range is 184,000 USD - 356,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

