About the team & role:
At eBay, we are building the next-generation AI platform to power experiences for millions of users worldwide. Our AI Platform (AIP) provides the scalable, secure, and efficient foundation for deploying and optimizing advanced machine learning and large language model (LLM) workloads at production scale. We enable teams across eBay to move from experimentation to global deployment with speed, reliability, and efficiency.
We are seeking an experienced AI Platform Systems Software Engineer (Infrastructure) to join our AI Platform team. In this role, you will design, implement, and optimize the core infrastructure that powers AI/ML workloads across eBay. You will work on highly distributed systems, cloud-native services, and performance-critical components that make large-scale inference and training possible.
You will be part of the team responsible for both the control plane (cluster management, scheduling, user access) and the data plane (execution, resource allocation, accelerator integration). Your work will directly impact the scalability, performance, and reliability of AI applications that serve eBay’s global marketplace.
What you will accomplish:
- Design and scale services to orchestrate AI/ML clusters across cloud and on-prem environments, supporting VM- and Kubernetes-based deployments, including Ray (ray.io) clusters for distributed training and online inference.
- Develop and optimize intelligent scheduling and resource management systems for heterogeneous compute clusters (CPU, GPU, accelerators).
- Integrate Ray Train/Tune for large-scale distributed training workflows and Ray Serve for low-latency, autoscaled inference; build platform hooks for observability, canary and A/B rollouts, and fault tolerance.
- Build features to improve reliability, performance, observability, and cost-efficiency of AI workloads at scale.
- Enhance the control plane to support secure multi-tenancy and enterprise-grade governance.
- Implement systems for container management, dependency resolution, and large-scale model distribution.
- Collaborate with ML researchers, applied scientists, and distributed systems engineers to drive platform innovation.
- Provide production support and work closely with field teams to resolve infrastructure issues.
What you will bring:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
- 8–10 years of experience building and maintaining infrastructure for highly available, scalable, and performant distributed systems.
- Proven expertise with public cloud platforms (AWS, GCP, Azure) and cloud-native, Kubernetes-based deployments.
- Hands-on experience running ML training and inference with Ray (ray.io), e.g., Ray Train/Tune for distributed training and Ray Serve for production inference, covering autoscaling, fault tolerance, observability, and multi-tenant operations.
- Deep understanding of networking, security, authentication, and identity management in distributed/cloud environments.
- Hands-on experience with observability stacks (Prometheus, Grafana, OpenTelemetry, etc.).
- Strong coding skills in Go and/or Python; familiarity with other systems-level languages is a plus.
- Knowledge of Linux internals, containers, and storage systems.
- Experience integrating and optimizing GPUs and other accelerators (NVIDIA, AMD, TPU, etc.) is highly desirable.
#LI-Hybrid