This role involves designing and building Kubernetes-native AI cloud systems for deploying and managing large-scale AI services. Key responsibilities include implementing core cloud features like dynamic workload scheduling, logging, monitoring, authentication, high availability, and resolving performance bottlenecks. The engineer will also collaborate with internal and external teams to deliver and maintain cloud-native solutions and contribute to future product enhancements.
Good To Have:
Exceptional problem-solving skills, with a proactive and analytical approach.
3-5 years of direct experience building commercial services and infrastructure, including creating Kubernetes operators or custom controllers.
Familiarity with AI/ML-specific orchestration tools built on top of Kubernetes (e.g., Kubeflow, Ray, Argo).
Responsibilities and Opportunities
Design and build a Kubernetes-native AI cloud system, specifically tailored for deploying and managing massive-scale, high-performance AI services
Implement core cloud features – dynamic workload scheduling, logging/monitoring/metering, authentication/authorization, high availability, QoS, and failover – for our internal platform or customer-facing solutions
Identify and resolve performance bottlenecks and operational issues that affect cluster stability and availability
Work closely with customers and internal teams to deliver and maintain cloud-native systems and help shape future product enhancements and capabilities
Key Qualifications
Bachelor’s or higher degree in Computer Science, Electrical Engineering, or a related technical field
Proven, hands-on experience designing and operating large-scale Kubernetes clusters in a production environment
Strong proficiency in writing production-quality systems code in Python, Go, or C++
Experience in full-stack development
Ideal Qualifications
Exceptional problem-solving skills, with a proactive and analytical approach to technical challenges