HPC Engineer - Research Infrastructure

6 Months ago • 8 Years + • Devops • $150,000 PA - $300,000 PA

Job Summary

Job Description

Help Luma build some of the biggest and fastest AI supercomputing clusters in the world! As a High-Performance Computing (HPC) engineer, you will work at the intersection of hardware and software, designing systems that deliver maximum performance for large-scale AI models. This role combines HPC traditions with a modern cloud environment. You will optimize CPU, GPU, and network devices for peak efficiency in large-scale systems and manage the lowest levels of software platforms, including the Linux kernel and user-space code. You will also write code to automate system monitoring and healing for numerous servers.
Must have:
  • 8+ years as an infrastructure/DevOps engineer in complex distributed systems
  • Deep understanding of networking
  • Ability to develop high-quality software in a general-purpose language (preferably Python)
  • Excellent problem-solving skills
  • Strong knowledge of observability/monitoring in distributed systems
  • Tenacious at troubleshooting hardware/network failures
  • Independently driven and able to own problems end-to-end
Good to have:
  • Experience in HPC networking
  • Experience with GPUs in large-scale clusters
  • Experience with large-scale data center operations
  • Proficiency in cloud orchestration and system tools
Perks:
  • Equity

Job Details

Help Luma build some of the biggest and fastest AI supercomputing clusters in the world! As a High-Performance Computing engineer, you’ll work at the intersection of hardware and software, designing systems that deliver the maximum possible performance for running large-scale AI models. We work at the very cutting edge of speed and scale, combining the traditions of High-Performance Computing (HPC) with a modern cloud environment.


For this role, it’s important that you understand how to combine CPUs, GPUs, and network devices into systems that are then deployed at large scale and run at peak efficiency. You understand the lowest levels of the software platforms that sit on top of this hardware, including how best to optimize the Linux kernel and user-space code. You are capable of writing code to automate the monitoring and healing of these systems, so that a small team can manage a large number of servers.

Responsibilities

  • In this role, you will work closely with machine learning researchers and directly accelerate their work, but you don't need to be a machine learning expert yourself.

  • We value people who can quickly obtain a deep technical understanding of new domains and enjoy being self-directed and identifying the most important problems to solve. 

  • You’ll manage HPC training clusters at Luma, from provisioning to performance tuning.

  • Areas of work will include observability, distributed job tracing, GPU diagnostics, software environment management, and additional tooling, plus work on the actual code to enable necessary features.

  • We believe that increasing compute is a huge lever to AI progress. You will have a direct impact on our ability to grow to an unprecedented scale and likewise produce unprecedented results.

Experience

  • 8+ years of experience as an infrastructure or DevOps engineer in large and complex distributed systems.

  • Deep understanding of networking; experience in HPC networking is a bonus.

  • Experience developing high-quality software in a general-purpose programming language, preferably including Python.

  • Excellent problem-solving skills and attention to detail.

  • Experience with GPUs in large-scale clusters is strongly preferred.

  • Strong knowledge of observability and monitoring in distributed systems.

  • Tenacious at troubleshooting hardware and network topology failures in distributed systems.

  • Independently driven and able to own problems and build solutions end to end.

  • Experience with large-scale data center operations and proficiency in cloud orchestration and system tools.

Your application is reviewed by real people.

About The Company

Palo Alto, California, United States (Hybrid)
