HPC Engineer - Research Infrastructure

5 Months ago • 8 Years + • Devops • $150,000 PA - $300,000 PA

Job Summary

Job Description

Help Luma build some of the biggest and fastest AI supercomputing clusters in the world! As a High-Performance Computing (HPC) engineer, you will work at the intersection of hardware and software, designing systems that deliver maximum performance for large-scale AI models. This role combines HPC traditions with a modern cloud environment. You will optimize CPU, GPU, and network devices for peak efficiency in large-scale systems and manage the lowest levels of software platforms, including the Linux kernel and user-space code. You will also write code to automate system monitoring and healing for numerous servers.
Must have:
  • 8+ years as an infrastructure/DevOps engineer in complex distributed systems
  • Deep understanding of networking
  • Experience developing high-quality software in a general-purpose language (preferably Python)
  • Excellent problem-solving skills
  • Strong knowledge of observability/monitoring in distributed systems
  • Tenacious at troubleshooting hardware/network failures
  • Independently driven and able to own problems end-to-end
Good to have:
  • Experience in HPC networking
  • Experience with GPUs in large-scale clusters
  • Experience with large-scale data center operations
  • Proficiency in cloud orchestration and system tools
Perks:
  • Equity

Job Details

Help Luma build some of the biggest and fastest AI supercomputing clusters in the world! As a High-Performance Computing engineer, you’ll work at the intersection of hardware and software, designing systems that deliver the maximum possible performance for running large-scale AI models. We work at the very cutting edge of speed and scale, combining the traditions of High-Performance Computing (HPC) with a modern cloud environment.

For this role, it’s important that you understand how to combine CPUs, GPUs, and network devices into systems that are then deployed at large scale for peak efficiency. You understand the lowest levels of the software platforms that sit on top of this hardware, including how best to optimize the Linux kernel and user-space code. You are capable of writing code to automate the monitoring and healing of these systems, commanding a large number of servers with few people.

Responsibilities

  • In this role, you will work closely with machine learning researchers and directly accelerate their work, but you don't need to be a machine learning expert yourself.

  • We value people who can quickly obtain a deep technical understanding of new domains and enjoy being self-directed and identifying the most important problems to solve. 

  • You’ll manage HPC training clusters at Luma, from provisioning to performance tuning.

  • Areas of work will include observability, distributed job tracing, GPU diagnostics, software environment management, and additional tooling, plus work on the actual code to enable necessary features.

  • We believe that increasing compute is a huge lever to AI progress. You will have a direct impact on our ability to grow to an unprecedented scale and likewise produce unprecedented results.

Experience

  • 8+ years of experience as an infrastructure or DevOps engineer in large, complex distributed systems.

  • Deep understanding of networking; bonus points for experience in HPC networking.

  • Experience developing high-quality software in a general-purpose programming language, preferably including Python.

  • Excellent problem-solving skills and attention to detail.

  • Experience with GPUs in large-scale clusters is strongly preferred.

  • Strong knowledge of observability and monitoring in distributed systems.

  • Tenacious at troubleshooting hardware and network topology failures in distributed systems.

  • Independently driven and able to own problems and build solutions end to end.

  • Experience with large-scale data center operations and proficiency in cloud orchestration and system tools.

Your application is reviewed by real people.

About The Company

Palo Alto, California, United States (Hybrid)
