Kubernetes Platform Engineer

TensorWave

Job Summary

As a Kubernetes Platform Engineer at TensorWave, you will maintain the stability and reliability of our bare-metal Kubernetes infrastructure. This role involves troubleshooting, incident response, and day-to-day cluster operations across multi-tenant workloads, supporting cutting-edge AI environments. You will work closely with senior engineers to deepen your Kubernetes expertise and contribute to the next generation of AI innovation.

Job Description

At TensorWave, we're leading the charge in AI compute, building a versatile cloud platform that's driving the next generation of AI innovation. We're focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what's possible in the AI landscape.

About the Role:

As a Kubernetes Platform Engineer focused on support and operations, you’ll play a critical role in maintaining the stability and reliability of our bare-metal Kubernetes infrastructure. You will work closely with senior engineers, taking point on troubleshooting, incident response, and day-to-day cluster operations across multi-tenant workloads.

This is a great opportunity for engineers ready to deepen their Kubernetes expertise while supporting cutting-edge AI environments in real time.

Responsibilities:

  • Own and troubleshoot operational issues within Kubernetes environments
  • Maintain and monitor core services (e.g., Cilium, HAProxy, Prometheus)
  • Ensure uptime, performance, and reliability of multi-tenant clusters
  • Assist with Ingress/Egress connectivity and network debugging
  • Support internal and customer teams in secure, isolated VPC environments
  • Collaborate with senior engineers on automation and cluster lifecycle improvements

Required Skills & Experience:

  • 2–4 years of experience in DevOps, SRE, or Linux infrastructure roles
  • 1+ years of hands-on experience with Kubernetes in production
  • Familiarity with networking, CNI plugins, and core Linux troubleshooting
  • Strong infrastructure-as-code mindset using tools like Helm, Terraform, or Ansible
  • Solid experience with monitoring and logging tools (e.g., Prometheus, Grafana, Loki)
  • Understanding of secure infrastructure design principles and least-privilege access
  • Comfortable working in a team-oriented, fast-paced operational environment

Nice to Have:

  • Experience with RKE2, Rancher, or similar platforms
  • Experience troubleshooting or supporting AI or GPU-based workloads
  • Familiarity with HAProxy, Cilium, or other Kubernetes ingress/networking tools

What We Bring:

In addition to a competitive salary, we offer a variety of benefits to support your needs, including:

  • Stock Options
  • 100% paid Medical, Dental, and Vision insurance
  • Life and Voluntary Supplemental Insurance
  • Short Term Disability Insurance
  • Flexible Spending Account
  • 401(k)
  • Flexible PTO
  • Paid Holidays
  • Parental Leave
  • Mental Health Benefits through Spring Health
