Principal Software Engineer, ML Infrastructure
Zoox
Job Summary
Zoox is seeking a Principal Software Engineer, ML Infrastructure, to shape and build next-generation ML infrastructure that accelerates the development and deployment of large-scale ML and Foundation models for its autonomous robotaxis. The role leads the design and development of data, compute, model training, and serving infrastructure in collaboration with AI teams across Perception, Prediction, Planner, and Simulation. The engineer will build and operate data infrastructure for petabytes of sensor data, compute infrastructure for model training and validation across thousands of GPUs, and the base layer of ML tools and frameworks. Key responsibilities include developing a strategic vision for ML Infrastructure, leading the design and implementation of cutting-edge infrastructure across the ML lifecycle, collaborating with cross-functional teams, and mentoring engineers.
Must Have
- Experience building and managing large-scale ML infrastructure
- Excellent leadership skills and ability to lead teams
- Strong experience with training frameworks (PyTorch, JAX)
- Experience with GPU-accelerated inference (TensorRT, Ray Serve)
- Proficient in Python and/or C++
Good to Have
- Experience enabling development/deployment of large-scale Foundation models
- Experience with large-scale data infrastructure (Apache Spark)
- Experience in the AV domain (Perception, Prediction, Planner)
Perks & Benefits
- Paid time off (sick leave, vacation, bereavement)
- Unpaid time off
- Zoox Stock Appreciation Rights
- Amazon Restricted Stock Units
- Health insurance
- Long-term care insurance
- Long-term and short-term disability insurance
- Life insurance
Job Description
- Vision: Develop and execute a strategic vision for ML Infrastructure that will unlock innovation in autonomous driving and enhance our rider experience.
- Technical acumen: Lead the design and implementation of cutting-edge infrastructure spanning every stage of the ML lifecycle, from data preparation through training, evaluation, deployment, and serving.
- Partnership: Collaborate closely with cross-functional teams, including ML researchers, software engineers, data engineers, and hardware engineers, to define requirements and align on architectural decisions.
- Mentorship: Help engineers on the team grow their careers through technical guidance and mentorship.
Qualifications
- Experience building and managing large-scale ML infrastructure that powers the development of large-scale ML models.
- Excellent leadership skills with a demonstrated ability to lead high-performing engineering teams.
- Strong experience with training frameworks such as PyTorch and JAX, including efficient use of GPUs for distributed model training.
- Experience with GPU-accelerated inference using TensorRT, Ray Serve, or similar frameworks.
- Proficient in Python and/or C++.
Bonus Qualifications
- Experience enabling the development and deployment of large-scale Foundation models.
- Experience working on large-scale data infrastructure and big data processing frameworks like Apache Spark.
- Experience working in the AV domain, supporting Perception, Prediction, Planner, and related teams.