ML Cluster Operations Engineer
TensorWave
Job Description
At TensorWave, we’re leading the charge in AI compute, building a versatile cloud platform that’s driving the next generation of AI innovation. We’re focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what’s possible in the AI landscape.
About the Role:
We are seeking an exceptional Machine Learning Engineer who specializes in training and AI workload scheduling. This is a senior-level role for someone with significant experience managing distributed machine learning workloads at scale using Slurm and/or Kubernetes.
As a technical visionary and hands-on expert, you will lead the evolution of our managed Slurm and Kubernetes offerings, as well as internal health checking and cluster automation.
Key Responsibilities:
- Manage and iterate on our containerized Slurm (Slurm-in-Kubernetes) solution, including customer configuration and deployment.
- Work closely with our engineering team to develop and maintain CI and automation for managed offerings.
- Ensure healthy cluster operations and uptime by implementing active and passive health checks, including automated node draining and triage.
- Help profile and debug distributed workloads, from small inference jobs to cluster-wide training.
- Establish best practices for running jobs at scale, including monitoring and checkpointing.
- Mentor and upskill ML engineers in best practices.
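To make the health-checking responsibility above concrete, here is a minimal sketch of an automated check-and-drain loop. All names are hypothetical illustrations, not TensorWave internals; it assumes a Slurm cluster where `scontrol` is on PATH, and the drain call is guarded behind a dry-run flag:

```python
"""Sketch: passive health checks feeding automated node draining.
Hypothetical example code; node names, check names, and thresholds
are invented for illustration."""

import subprocess


def unhealthy_nodes(health_report: dict[str, dict]) -> list[str]:
    """Pick nodes whose passive checks failed, e.g. a GPU fell off
    the bus or the ECC error count exceeded a (made-up) threshold."""
    bad = []
    for node, checks in health_report.items():
        if checks.get("gpu_count_ok") is False or checks.get("ecc_errors", 0) > 100:
            bad.append(node)
    return sorted(bad)


def drain(node: str, reason: str, dry_run: bool = True) -> list[str]:
    """Ask Slurm to stop scheduling new work on `node`; running jobs
    are allowed to finish. Returns the command so callers can log it."""
    cmd = ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    report = {
        "gpu-001": {"gpu_count_ok": True, "ecc_errors": 0},
        "gpu-002": {"gpu_count_ok": False, "ecc_errors": 0},  # lost a GPU
    }
    for n in unhealthy_nodes(report):
        print(drain(n, reason="auto-healthcheck", dry_run=True))
```

In a real deployment the drain decision would typically also trigger triage (ticketing, hardware diagnostics) rather than just marking the node, as the responsibility above notes.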
Qualifications:
Must-Have:
- 5+ years of experience in cloud infrastructure, HPC, or machine learning roles.
- Significant hands-on experience with Slurm in production HPC/ML environments, including setup and configuration, container integration via Enroot/Pyxis, environment modules, and MPI.
- Strong knowledge of languages and frameworks for distributed ML, such as Python, PyTorch, Megatron, c10d, and MPI.
- Understanding of the node lifecycle, including health checks, prolog/epilog scripts, and draining.
- Deep understanding of security, compliance, and resilience in containerized workloads.
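The node lifecycle mentioned above can be reduced to a small state machine: prolog-time checks gate whether a node accepts a job, and epilog-time checks decide whether it returns to the pool or gets drained. A toy model for illustration only; real Slurm prolog/epilog hooks are executables configured in slurm.conf, and drain state is managed by slurmctld, not by Python objects like these:

```python
"""Toy model of the Slurm node lifecycle (hypothetical sketch)."""

from enum import Enum


class NodeState(Enum):
    IDLE = "idle"            # healthy and available for scheduling
    ALLOCATED = "allocated"  # currently running a job
    DRAIN = "drain"          # failed a check; no new jobs until repaired


def prolog_gate(state: NodeState) -> bool:
    """Prolog-time decision: only idle, healthy nodes accept new work."""
    return state is NodeState.IDLE


def epilog_transition(state: NodeState, checks_passed: bool) -> NodeState:
    """Epilog-time decision: after a job, a healthy node returns to the
    pool; a node failing its post-job checks is drained."""
    if state is not NodeState.ALLOCATED:
        raise ValueError("epilog runs only on nodes that just finished a job")
    return NodeState.IDLE if checks_passed else NodeState.DRAIN
```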
Nice-to-Have:
- 3+ years of hands-on Kubernetes experience, including deep knowledge of the Kubernetes API, internals, networking, and storage.
- Proficiency in writing Kubernetes manifests, Helm charts, and managing releases.
- Experience with DAGs using Kubernetes-native tools such as Argo Workflows.
- Foundation in networking, especially RDMA, RoCE, and InfiniBand.
- Experience with low-level kernel libraries, such as CUDA and Composable Kernel.
- Contributions to open-source projects or ML/AI tooling.
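The DAG-workflow pattern that tools like Argo Workflows implement comes down to one idea: run each step only after its dependencies finish. A purely illustrative sketch using the Python standard library; Argo itself expresses the same structure declaratively in YAML Workflow manifests, and the pipeline steps here are invented:

```python
"""Sketch: dependency-ordered execution, the core of a workflow DAG."""

from graphlib import TopologicalSorter  # stdlib, Python 3.9+


def run_order(dag: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order for a step -> dependencies DAG."""
    return list(TopologicalSorter(dag).static_order())


# A typical ML pipeline shape: preprocess -> train -> evaluate -> publish.
pipeline = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train"},
    "publish": {"evaluate"},
}
```

A workflow engine adds retries, parallel fan-out, and artifact passing on top of this ordering, but the scheduling core is the same topological sort.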
What Success Looks Like
- A production-grade integrated Slurm platform that can support thousands of GPUs, with self-healing, scaling, and strong observability.
- Infrastructure is resilient, secure, resource-optimized, and compliant.
- Best practices and tooling are well-documented, standardized, and continuously improved across the company.
- Make GPUs go Brrrrrrr
What We Bring:
- Stock Options
- 100% paid Medical, Dental, and Vision insurance
- Life and Voluntary Supplemental Insurance
- Short Term Disability Insurance
- Flexible Spending Account
- 401(k)
- Flexible PTO
- Paid Holidays
- Parental Leave
- Mental Health Benefits through Spring Health