Site Reliability Engineer | AI Supercomputing
Luma
Job Summary
Luma AI is seeking a Site Reliability Engineer to build and maintain massive-scale GPU clusters for multimodal general intelligence. This role involves architecting the physical and digital foundation of AGI, operating at the frontier of computing power. The engineer will be a technical authority on systems, working on bare-metal infrastructure rather than managed services, and will optimize supercomputing architecture, low-level networking, and hardware-software integration to maximize training efficiency for foundational models.
Must Have
- Elite knowledge of high-performance computing (HPC), job schedulers, and GPU architecture.
- Deep systems fluency, comfortable navigating the Linux terminal to solve performance issues.
- Ability to use tools like perf and strace for OS-level optimization.
- History of building infrastructure from the ground up.
- Ability to design systems where no playbook currently exists.
Job Description
The Opportunity
Luma AI is building the engine for multimodal general intelligence. To teach models to understand the world through video, audio, and images, we operate at the absolute frontier of computing power. We have secured the capital to deploy massive-scale GPU clusters that rival the world's largest supercomputers, while maintaining the agility of a focused engineering lab. This role places you at the intersection of hardware and software, where you architect the physical and digital foundation of AGI.
Where You Come In
You will serve as a technical authority on the systems that power our research and product velocity. This is a role for a builder who prefers bare metal to managed services and understands that at our scale, standard cloud abstractions break down. You will architect, optimize, and maintain the massive, multi-vendor GPU supercomputers required to train our foundational models.
What You Will Build
- Supercomputing Architecture: Design and deploy high-performance clusters combining thousands of GPUs, CPUs, and high-throughput networking to maximize training efficiency.
- The Network Layer: Optimize low-level networking (InfiniBand, RDMA) to ensure seamless communication between accelerators, eliminating bottlenecks in distributed training jobs.
- Hardware-Software Synthesis: Collaborate with hardware partners to push the boundaries of what is possible, debugging failures at the intersection of the kernel, driver, and silicon.
The Profile We Are Looking For
- HPC Authority: You possess elite knowledge of high-performance computing (HPC), including job schedulers and the nuances of GPU architecture.
- Deep Systems Fluency: You are comfortable navigating the Linux terminal to solve complex performance issues, utilizing tools like perf and strace to optimize at the OS level.
- First-Principles Engineering: You have a history of building infrastructure from the ground up, demonstrating the ability to design systems where no playbook currently exists.
Compensation
The base pay range for this role is $170,000 – $360,000 per year.