About the team and the role:
The eBay Cloud team powers the foundational cloud infrastructure that supports thousands of eBay applications. As one of the largest private cloud platforms in the industry, we operate and manage hundreds of Kubernetes clusters across diverse environments, comprising millions of compute instances.
Our team is responsible for the full lifecycle management of these clusters—including provisioning, OS and Kubernetes upgrades, technical refreshes, and decommissioning. We also customize the Linux operating system for our Kubernetes platform, enhancing the kernel to meet eBay’s rigorous scalability, reliability, and security requirements.
The ideal candidate will have at least 5 years of experience in the field, focusing on kernel development and cluster automation (build, OS/Kubernetes upgrades, and decommissioning). You will also drive the implementation of observability practices to monitor, troubleshoot, and ensure the reliability of our infrastructure at scale.
What you will accomplish:
- Design, develop, and maintain a robust, high-performance Kubernetes fleet management system encompassing cluster, availability zone (AZ), and node lifecycle operations, with rapid adoption of the latest Kubernetes releases.
- Contribute to kernel development and performance tuning to enhance system scalability, reliability, and efficiency; stay up to date with the latest advancements in kernel and security technologies.
- Build high-performance tools and services using Go and Python to support infrastructure automation and diagnostics.
- Collaborate with cross-functional teams to validate, adopt, and integrate optimized Linux OS distributions across diverse infrastructure environments.
- Implement robust observability frameworks to monitor system health, ensure performance, and support proactive issue resolution at scale.
What you will bring:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- Minimum of 5 years of hands-on experience with Linux systems, including a strong understanding of Linux kernel development and OS internals—such as process scheduling, memory management, file systems, and networking.
- Proficiency in programming with C++, Go, or Python.
- Deep expertise in orchestrating containerized applications and building scalable cluster management systems.
- Skilled at identifying system-level gaps and cross-functional issues, proposing effective solutions, and driving end-to-end resolution.
- Demonstrated ability to lead and mentor team members, manage small projects, and collaborate effectively across teams to drive impactful change.