ML Cluster Operations Engineer

TensorWave

Job Description

At TensorWave, we’re leading the charge in AI compute, building a versatile cloud platform that’s driving the next generation of AI innovation. We’re focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what’s possible in the AI landscape.

About the Role:

We are seeking an exceptional Machine Learning Engineer who specializes in training and AI workload scheduling. This is a senior-level role for someone with significant experience managing distributed machine learning workloads at scale using Slurm and/or Kubernetes.

As a technical visionary and hands-on expert, you will lead the evolution of our managed Slurm and Kubernetes offerings, as well as internal health checking and cluster automation.

Key Responsibilities:

  • Manage and iterate on our containerized Slurm (Slurm-in-Kubernetes) solution, including customer configuration and deployment.
  • Work closely with our engineering team to develop and maintain CI and automation for managed offerings.
  • Ensure healthy cluster operations and uptime by implementing active and passive health checks, including automated node draining and triage.
  • Help profile and debug distributed workloads, from small inference jobs to cluster-wide training.
  • Establish best practices for running jobs at scale, including monitoring, checkpointing, etc.
  • Mentor and upskill ML engineers in best practices.
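
To give a flavor of the health-check and draining work described above, here is a minimal sketch of the drain path. The metric names, thresholds, and node name are hypothetical illustrations, not TensorWave's actual tooling; the `scontrol update ... State=DRAIN` command is the standard Slurm mechanism for taking a node out of scheduling while running jobs finish.

```python
import subprocess

# Hypothetical thresholds -- a real deployment would source these from
# DCGM/nvidia-smi probes and site policy.
MAX_ECC_ERRORS = 0
MAX_LINK_FLAPS = 3

def should_drain(ecc_errors: int, link_flaps: int) -> bool:
    """Decide whether collected node metrics warrant a drain."""
    return ecc_errors > MAX_ECC_ERRORS or link_flaps > MAX_LINK_FLAPS

def drain_node(node: str, reason: str, dry_run: bool = True) -> list[str]:
    """Build (and optionally run) the scontrol command that marks a node
    DRAIN, so Slurm stops placing new jobs on it."""
    cmd = ["scontrol", "update", f"NodeName={node}",
           "State=DRAIN", f"Reason={reason}"]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd

if __name__ == "__main__":
    # Fabricated sample metrics for one node.
    metrics = {"ecc_errors": 2, "link_flaps": 0}
    if should_drain(**metrics):
        print(" ".join(drain_node("gpu-node-07", "ecc_errors_detected")))
```

In practice the triage step (ticketing, automated remediation, return-to-service checks) hangs off the same decision point.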

Qualifications:

Must-Have:

  • 5+ years of experience in cloud infrastructure, HPC, or machine learning roles.
  • Significant hands-on experience with Slurm in production HPC/ML environments, including setup/configuration, Enroot (with the Pyxis plugin), environment modules, and MPI.
  • Strong knowledge of distributed ML languages and frameworks, such as Python, PyTorch, Megatron, c10d, MPI, etc.
  • Understanding of node lifecycle, including health checks, prolog / epilog scripts, and draining.
  • Deep understanding of security, compliance, and resilience in containerized workloads.
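
As one illustration of the Slurm-plus-PyTorch stack the qualifications above describe, a minimal multi-node sbatch file for a torchrun launch might be generated like this. Node counts, the partition name, the rendezvous port, and the training script path are placeholders, not a real cluster configuration:

```python
def sbatch_script(nodes: int, gpus_per_node: int,
                  partition: str, train_script: str) -> str:
    """Render a minimal sbatch file that launches a distributed PyTorch
    job with torchrun, one launcher task per node."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        f"#SBATCH --partition={partition}",
        "#SBATCH --ntasks-per-node=1",
        "",
        # torchrun derives the c10d rendezvous endpoint from the first
        # node in the Slurm-provided node list.
        'head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)',
        f"srun torchrun --nnodes={nodes} --nproc-per-node={gpus_per_node} "
        f'--rdzv-backend=c10d --rdzv-endpoint="$head_node:29500" {train_script}',
    ])

if __name__ == "__main__":
    print(sbatch_script(4, 8, "gpu", "train.py"))
```

The same pattern extends to prolog/epilog hooks and checkpoint-aware requeueing, which is where the node-lifecycle experience above comes in.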

Nice-to-Have:

  • 3+ years of hands-on Kubernetes experience, including deep knowledge of the Kubernetes API, internals, networking, and storage.
  • Proficiency in writing Kubernetes manifests, Helm charts, and managing releases.
  • Experience with DAGs using K8s native tools such as Argo Workflows.
  • Foundation in networking, especially as it pertains to RDMA, RoCE, and InfiniBand.
  • Experience with low-level kernel libraries, such as CUDA and Composable Kernel.
  • Contributions to open-source projects or ML/AI tooling.

What Success Looks Like

  • A production-grade integrated Slurm platform that can support thousands of GPUs, with self-healing, scaling, and strong observability.
  • Infrastructure is resilient, secure, resource-optimized, and compliant.
  • Best practices and tooling are well-documented, standardized, and continuously improved across the company.
  • Make GPUs go Brrrrrrr

What We Bring:

  • Stock Options
  • 100% paid Medical, Dental, and Vision insurance
  • Life and Voluntary Supplemental Insurance
  • Short Term Disability Insurance
  • Flexible Spending Account
  • 401(k)
  • Flexible PTO
  • Paid Holidays
  • Parental Leave
  • Mental Health Benefits through Spring Health
