ML Cluster Operations Engineer
TensorWave
Job Description
At TensorWave, we’re leading the charge in AI compute, building a versatile cloud platform that’s driving the next generation of AI innovation. We’re focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what’s possible in the AI landscape.
About the Role:
We are seeking an exceptional Machine Learning Engineer who specializes in training and AI workload scheduling. This is a senior-level role for someone with significant experience managing distributed machine learning workloads at scale using Slurm and/or Kubernetes.
As a technical visionary and hands-on expert, you will lead the evolution of our managed Slurm and Kubernetes offerings, as well as internal health checking and cluster automation.
Key Responsibilities:
- Manage and iterate on our containerized Slurm (Slurm-in-Kubernetes) solution, including customer configuration and deployment.
- Work closely with our engineering team to develop and maintain CI and automation for managed offerings.
- Ensure healthy cluster operations and uptime by implementing active and passive health checks, including automated node draining and triage.
- Help profile and debug distributed workloads, from small inference jobs to cluster-wide training.
- Establish best practices for running jobs at scale, including monitoring and checkpointing.
- Mentor and upskill ML engineers in best practices.
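To make the health-checking responsibility above concrete, here is a minimal sketch of an automated check-and-drain loop. All names are hypothetical illustrations, not TensorWave internals; it assumes a Slurm cluster where `scontrol` is on PATH, and the drain call is guarded behind a dry-run flag:

```python
"""Sketch: passive health checks feeding automated node draining.
Hypothetical example code; node names, check names, and thresholds
are invented for illustration."""

import subprocess


def unhealthy_nodes(health_report: dict[str, dict]) -> list[str]:
    """Pick nodes whose passive checks failed, e.g. a GPU fell off
    the bus or the ECC error count exceeded a (made-up) threshold."""
    bad = []
    for node, checks in health_report.items():
        if checks.get("gpu_count_ok") is False or checks.get("ecc_errors", 0) > 100:
            bad.append(node)
    return sorted(bad)


def drain(node: str, reason: str, dry_run: bool = True) -> list[str]:
    """Ask Slurm to stop scheduling new work on `node`; running jobs
    are allowed to finish. Returns the command so callers can log it."""
    cmd = ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    report = {
        "gpu-001": {"gpu_count_ok": True, "ecc_errors": 0},
        "gpu-002": {"gpu_count_ok": False, "ecc_errors": 0},  # lost a GPU
    }
    for n in unhealthy_nodes(report):
        print(drain(n, reason="auto-healthcheck", dry_run=True))
```

In a real deployment the drain decision would typically also trigger triage (ticketing, hardware diagnostics) rather than just marking the node, as the responsibility above notes.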
Qualifications:
Must-Have:
- 5+ years of experience in cloud infrastructure, HPC, or machine learning roles.
- Significant hands-on experience with Slurm in production HPC/ML environments, including setup and configuration, container integration via Enroot/Pyxis, environment modules, and MPI.
- Strong knowledge of languages and frameworks for distributed ML, such as Python, PyTorch, Megatron, c10d, and MPI.
- Understanding of the node lifecycle, including health checks, prolog/epilog scripts, and draining.
- Deep understanding of security, compliance, and resilience in containerized workloads.
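The node lifecycle mentioned above can be reduced to a small state machine: prolog-time checks gate whether a node accepts a job, and epilog-time checks decide whether it returns to the pool or gets drained. A toy model for illustration only; real Slurm prolog/epilog hooks are executables configured in slurm.conf, and drain state is managed by slurmctld, not by Python objects like these:

```python
"""Toy model of the Slurm node lifecycle (hypothetical sketch)."""

from enum import Enum


class NodeState(Enum):
    IDLE = "idle"            # healthy and available for scheduling
    ALLOCATED = "allocated"  # currently running a job
    DRAIN = "drain"          # failed a check; no new jobs until repaired


def prolog_gate(state: NodeState) -> bool:
    """Prolog-time decision: only idle, healthy nodes accept new work."""
    return state is NodeState.IDLE


def epilog_transition(state: NodeState, checks_passed: bool) -> NodeState:
    """Epilog-time decision: after a job, a healthy node returns to the
    pool; a node failing its post-job checks is drained."""
    if state is not NodeState.ALLOCATED:
        raise ValueError("epilog runs only on nodes that just finished a job")
    return NodeState.IDLE if checks_passed else NodeState.DRAIN
```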
Nice-to-Have:
- 3+ years of hands-on Kubernetes experience, including deep knowledge of the Kubernetes API, internals, networking, and storage.
- Proficiency in writing Kubernetes manifests, Helm charts, and managing releases.
- Experience with DAGs using Kubernetes-native tools such as Argo Workflows.
- Foundation in networking, especially RDMA, RoCE, and InfiniBand.
- Experience with low-level kernel libraries, such as CUDA and Composable Kernel.
- Contributions to open-source projects or ML/AI tooling.
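The DAG-workflow pattern that tools like Argo Workflows implement comes down to one idea: run each step only after its dependencies finish. A purely illustrative sketch using the Python standard library; Argo itself expresses the same structure declaratively in YAML Workflow manifests, and the pipeline steps here are invented:

```python
"""Sketch: dependency-ordered execution, the core of a workflow DAG."""

from graphlib import TopologicalSorter  # stdlib, Python 3.9+


def run_order(dag: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order for a step -> dependencies DAG."""
    return list(TopologicalSorter(dag).static_order())


# A typical ML pipeline shape: preprocess -> train -> evaluate -> publish.
pipeline = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train"},
    "publish": {"evaluate"},
}
```

A workflow engine adds retries, parallel fan-out, and artifact passing on top of this ordering, but the scheduling core is the same topological sort.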
What Success Looks Like
- A production-grade integrated Slurm platform that can support thousands of GPUs, with self-healing, scaling, and strong observability.
- Infrastructure is resilient, secure, resource-optimized, and compliant.
- Best practices and tooling are well-documented, standardized, and continuously improved across the company.
- Make GPUs go Brrrrrrr
What We Bring:
- Stock Options
- 100% paid Medical, Dental, and Vision insurance
- Life and Voluntary Supplemental Insurance
- Short Term Disability Insurance
- Flexible Spending Account
- 401(k)
- Flexible PTO
- Paid Holidays
- Parental Leave
- Mental Health Benefits through Spring Health