ML Platform Engineer
eBay
Job Description
At eBay, we’re more than a global ecommerce leader — we’re changing the way the world shops and sells. Our platform empowers millions of buyers and sellers in more than 190 markets around the world. We’re committed to pushing boundaries and leaving our mark as we reinvent the future of ecommerce for enthusiasts.
Our customers are our compass, authenticity thrives, bold ideas are welcome, and everyone can bring their unique selves to work — every day. We’re in this together, sustaining the future of our customers, our company, and our planet.
Join a team of passionate thinkers, innovators, and dreamers — and help us connect people and build communities to create economic opportunity for all.
At eBay, we are building the next-generation AI platform to power intelligent experiences for millions of users worldwide. Our AI Platform (AIP) provides the scalable, secure, and efficient foundation for deploying and optimizing advanced machine learning and large language model (LLM) workloads at production scale. We enable teams across eBay to move from experimentation to global deployment with speed, reliability, and efficiency.
We are seeking an experienced Machine Learning Platform Support Engineer to join our AI Platform team. In this role, you will be the first line of support (L1) for ML workloads running on Kubernetes and Ray.io clusters. You will be responsible for triaging, monitoring, and resolving platform-related issues across ML training, inference, model deployment, and GPU resource allocation.
This position includes participation in on-call rotations (PagerDuty) and requires close collaboration with ML Platform engineers, researchers, and platform teams to ensure the reliability, scalability, and usability of the AI Platform. You will play a critical role in ensuring operational excellence and maintaining the uptime of the core infrastructure that powers eBay’s global AI and ML systems.
What you will accomplish
- Serve as the first point of contact (L1) for all support requests related to the AI/ML Platform, including ML training, inference, model deployment, and GPU allocation.
- Provide operational and on-call (PagerDuty) support for Ray.io and Kubernetes clusters running distributed ML workloads across cloud and on-prem environments.
- Monitor, triage, and resolve platform incidents involving job failures, scaling errors, cluster instability, or GPU resource contention.
- Manage GPU quota allocation and scheduling across multiple user teams, ensuring compliance with approved quotas and optimal resource utilization.
- Support Ray Train/Tune for large-scale distributed training and Ray Serve for autoscaled inference, maintaining performance and service reliability.
- Troubleshoot Kubernetes workloads, including pod scheduling, networking, image issues, and resource exhaustion in multi-tenant namespaces.
- Collaborate with platform engineers, SREs, and ML practitioners to resolve infrastructure, orchestration, and dependency issues impacting ML workloads.
- Improve observability, monitoring, and alerting for Ray and Kubernetes clusters using Prometheus, Grafana, and OpenTelemetry to enable proactive issue detection.
- Maintain and enhance runbooks, automation scripts, and knowledge base documentation to accelerate incident resolution and reduce recurring support requests.
- Participate in root cause analysis (RCA) and post-incident reviews, contributing to platform improvements and automation initiatives to minimize downtime.
What you will bring
- Bachelor’s or Master’s degree in Computer Science, Engineering, or related technical discipline (or equivalent experience).
- 5+ years of experience in ML operations, DevOps, or platform support for distributed AI/ML systems.
- Proven experience providing L1/L2 and on-call support for Ray.io and Kubernetes-based clusters supporting ML training and inference workloads.
- Strong understanding of Ray cluster operations, including autoscaling, job scheduling, and workload orchestration across heterogeneous compute (CPU/GPU/accelerators).
- Hands-on experience managing Kubernetes control plane and data plane components, multi-tenant namespaces, RBAC, ingress, and resource isolation.
- Expertise in GPU scheduling, allocation, and monitoring (NVIDIA device plugin, MIG configuration, CUDA/NCCL optimization).
- Proficiency in Python and/or Go for automation, diagnostics, and operational tooling in distributed environments.
- Working knowledge of cloud-native environments (AWS, GCP, Azure) and CI/CD pipelines.
- Experience with observability stacks (Prometheus, Grafana, OpenTelemetry) and incident management tools (PagerDuty, ServiceNow).
- Familiarity with ML frameworks such as TensorFlow and PyTorch, and their integration within distributed Ray/Kubernetes clusters.
- Strong debugging, analytical, and communication skills to collaborate effectively with cross-functional engineering and research teams.
- A customer-centric, operationally disciplined mindset focused on maintaining platform reliability, performance, and user satisfaction.