Supercompute Infrastructure Engineer

4 Hours ago • All levels
DevOps

Job Description

Periodic Labs is a well-funded, rapidly growing AI + physical sciences lab building state-of-the-art models for scientific discovery. The Supercompute Infrastructure Engineer will lead, design, build, and operate large-scale compute clusters to power AI scientific research. This role involves writing software for orchestration, resource management, and cluster lifecycle automation, as well as working on the bring-up, operations, and maintenance of GPU/CPU clusters. The engineer will also build tools and get directly involved in large-scale frontier research experiments.
Good To Have:
  • Experience with clusters of 5,000 or more GPUs.
  • Experience with cluster scheduling and orchestration tools like Kubernetes (k8s) and Slurm.
  • Experience with cloud environments such as GCP, AWS, or Azure.
  • Experience with observability and monitoring tools like Datadog, Prometheus, Grafana, or VictoriaMetrics.
  • Experience with IaC tools like Terraform and Ansible.
  • Experience with GitOps tools like GitHub CI and ArgoCD.
Must Have:
  • Lead, design, build, and operate large-scale compute clusters to power AI scientific research.
  • Write software that orchestrates large GPU and CPU clusters, manages resource allocation, and automates cluster lifecycle operations.
  • Work on the bring-up, operations, and maintenance of all aspects of these clusters.
  • Build tools and get directly involved in large-scale frontier research experiments.
  • Experience as a distributed systems engineer managing large-scale compute environments, high-performance clusters, or similar hyperscale infrastructure.

About Periodic Labs

We are an AI + physical sciences lab building state-of-the-art models to make novel scientific discoveries. We are well funded and growing rapidly. Team members are owners who identify and solve problems without boundaries or bureaucracy. We eagerly learn new tools and new science to push forward our mission.

About the Role

You will lead, design, build, and operate large-scale compute clusters to power AI scientific research.

You will write software that orchestrates large GPU and CPU clusters, manages resource allocation, and automates cluster lifecycle operations. You will work on the bring-up, operations, and maintenance of all aspects of these clusters.

You will build tools and get directly involved in large-scale frontier research experiments to make Periodic Labs the world's best AI + science lab for physicists, computational materials scientists, AI researchers, and engineers.

We’re looking for distributed systems engineers with experience in managing large-scale compute environments, high-performance clusters, or similar hyperscale infrastructure.

You might thrive in this role if you have experience with:

  • Clusters of 5,000 or more GPUs
  • Cluster scheduling and orchestration tools like Kubernetes (k8s) and Slurm
  • Cloud environments such as GCP, AWS, or Azure
  • Observability and monitoring tools like Datadog, Prometheus, Grafana, or VictoriaMetrics
  • IaC tools like Terraform and Ansible
  • GitOps tools like GitHub CI and ArgoCD
