Engineering Manager, ML Training Platform
Zoox
Job Summary
Zoox is seeking an Engineering Manager for its ML Training Platform to support autonomous driving innovations. This role involves managing a team of software engineers to build and operate the core ML platform for model training at scale, including deep learning frameworks and distributed infrastructure. Responsibilities include developing and executing a strategic vision for the platform, ensuring scalability and performance for Foundation and RL models, and leading the design, implementation, and operation of the platform. The manager will also hire and mentor a diverse engineering team, foster innovation and collaboration, and partner with cross-functional teams to define requirements and align on architectural decisions.
Must Have
- 8+ years of total experience
- 3+ years of engineering management experience
- Excellent leadership skills
- Experience enabling large-scale distributed model training
- Experience with training frameworks (PyTorch, Hugging Face, Ray, DeepSpeed, JAX)
- Experience building model lifecycle management tools
Good to Have
- Experience with cost-efficient ML compute infrastructure
- Experience leveraging GPUs, TPUs, or Trainium
- Experience managing AWS costs for ML needs
Perks & Benefits
- Salary range: $230,000 to $315,000
- Sign-on bonus may be offered
- Amazon Restricted Stock Units (RSUs)
- Zoox Stock Appreciation Rights
- Comprehensive benefits package (paid time off, health insurance, long-term care insurance, disability insurance, life insurance)
Job Description
In this role, you will:
- Vision: Develop and execute a strategic vision for our ML training platform, ensuring scalability, reliability, and performance to support large-scale Foundation and RL models.
- Technical acumen: Lead the design, implementation, and operation of a robust and efficient ML training platform to enable the training, experimentation, validation, and monitoring of ML models.
- Hiring: Attract, hire, and inspire a diverse world-class engineering team, fostering a culture of innovation, collaboration, and excellence.
- Partnership: Collaborate closely with cross-functional teams, including ML researchers, software engineers, data engineers, and hardware engineers, to define requirements and align on architectural decisions.
- Mentorship: Enable the engineers on the team to grow their careers by providing the right opportunities along with clear and timely feedback.
Qualifications
- 8+ years of total experience, including 3+ years of engineering management experience.
- Excellent leadership skills with a demonstrated ability to build and manage high-performing engineering teams.
- Experience enabling large-scale, cost-efficient distributed model training and ML compute infrastructure.
- Experience with training frameworks such as PyTorch, Hugging Face, Ray, DeepSpeed, and JAX, leveraging GPUs, TPUs, or Trainium.
- Experience building model lifecycle management tools and managing AWS costs for our ML needs.