AI Cloud Engineer
Rebellions
Job Summary
This role involves designing and building Kubernetes-native AI cloud systems for deploying and managing large-scale AI services. Key responsibilities include implementing core cloud features such as dynamic workload scheduling, logging, monitoring, authentication, and high availability, as well as resolving performance bottlenecks. The engineer will also collaborate with internal and external teams to deliver and maintain cloud-native solutions and contribute to future product enhancements.
Must Have
- Design and build Kubernetes-native AI cloud systems tailored for massive-scale AI services.
- Implement core cloud features: dynamic workload scheduling, logging/monitoring/metering, authentication/authorization, high availability, QoS, and failover.
- Identify and resolve performance bottlenecks and operational issues affecting cluster stability and availability.
- Work closely with customers and internal teams to deliver and maintain cloud-native systems.
- Bachelor’s or higher degree in Computer Science, Electrical Engineering, or a related technical field.
- Proven, hands-on experience designing and operating large-scale Kubernetes clusters in a production environment.
- Strong proficiency in writing production-quality systems code in Python, Go, or C++.
- Experience in full-stack development.
Good to Have
- Exceptional problem-solving skills, with a proactive and analytical approach.
- Certified Kubernetes Administrator (CKA) certification.
- 3-5 years of direct experience building commercial services and infrastructure, including creating Kubernetes operators or custom controllers.
- Familiarity with AI/ML-specific orchestration tools built atop Kubernetes (e.g., Kubeflow, Ray, Argo).
Job Description
Responsibilities and Opportunities
- Design and build a Kubernetes-native AI cloud system, specifically tailored for deploying and managing massive-scale, high-performance AI services
- Implement core cloud features – dynamic workload scheduling, logging/monitoring/metering, authentication/authorization, high availability, QoS, and failover – for our internal platform or customer-facing solutions
- Identify and resolve performance bottlenecks and operational issues that affect cluster stability and availability
- Work closely with customers and internal teams to deliver and maintain cloud-native systems and help shape future product enhancements and capabilities
Key Qualifications
- Bachelor’s or higher degree in Computer Science, Electrical Engineering, or a related technical field
- Proven, hands-on experience designing and operating large-scale Kubernetes clusters in a production environment
- Strong proficiency in writing production-quality systems code in Python, Go, or C++
- Experience in full-stack development
Ideal Qualifications
- Exceptional problem-solving skills, with a proactive and analytical approach to technical challenges
- Certified Kubernetes Administrator (CKA) certification
- 3-5 years of direct experience building commercial services and infrastructure, including creating Kubernetes operators or custom controllers
- Familiarity with AI/ML-specific orchestration tools built atop Kubernetes (e.g., Kubeflow, Ray, Argo)