Staff ML Ops Engineer
Calix
Job Summary
Calix is seeking a Staff ML Ops Engineer with strong GCP experience to join its cutting-edge AI/ML team. The role involves designing, implementing, and maintaining scalable infrastructure for machine learning and generative AI applications. Key responsibilities include deploying and troubleshooting production ML pipelines, building CI/CD for model deployment, scaling compute resources, and optimizing cloud resources on GCP. The engineer will also establish monitoring, logging, and alerting for ML systems and enforce MLOps best practices.
Must Have
- Design, implement, and maintain scalable infrastructure for ML and GenAI applications.
- Deploy, operate, and troubleshoot production ML pipelines and generative AI services.
- Build and optimize CI/CD pipelines for ML model deployment and serving.
- Scale compute resources across CPU/GPU/TPU/NPU architectures to meet performance requirements.
- Implement container orchestration with Kubernetes for ML workloads.
- Architect and optimize cloud resources on GCP for ML training and inference.
- Set up and maintain runtime frameworks and job management systems (Airflow, Kubeflow, MLflow).
- Establish monitoring, logging, and alerting for ML system observability.
- Collaborate with data scientists and ML engineers to translate models into production systems.
- Optimize system performance and resource utilization for cost efficiency.
- Develop and enforce MLOps best practices across the organization.
- 8+ years of overall software engineering experience.
- 3+ years of focused experience in MLOps or similar ML infrastructure roles.
- Strong experience with Docker container services and Kubernetes orchestration.
- Demonstrated expertise in cloud infrastructure management, preferably on GCP.
- Proficiency with workflow management and ML runtime frameworks such as Airflow, Kubeflow, and MLflow.
- Strong CI/CD expertise with experience implementing automated testing and deployment pipelines.
- Experience with scaling distributed compute architectures utilizing various accelerators (CPU/GPU/TPU/NPU).
- Solid understanding of system performance optimization techniques.
- Experience implementing comprehensive observability solutions for complex systems.
- Knowledge of monitoring and logging tools (Prometheus, Grafana, ELK stack).
- Proficient in at least two of the following: Shell Scripting, Python, Go, C/C++.
Good to Have
- Familiarity with ML frameworks such as PyTorch and ML platforms like SageMaker or Vertex AI.
- Excellent problem-solving skills and ability to work independently.
- Strong communication skills and ability to work effectively in cross-functional teams.
Perks & Benefits
- Eligible for a bonus
- Comprehensive benefits package
Job Description
Calix provides the cloud, software platforms, systems and services required for communications service providers to simplify their businesses, excite their subscribers and grow their value.
Calix is where passionate innovators come together with a shared mission: to reimagine broadband experiences and empower communities like never before. As a true pioneer in broadband technology, we ignite transformation by equipping service providers of all sizes with an unrivaled platform, state-of-the-art cloud technologies, and AI-driven solutions that redefine what’s possible. Every tool and breakthrough we offer is designed to simplify operations and unlock extraordinary subscriber experiences through innovation.
Calix is seeking a highly skilled ML Ops Engineer with hands-on GCP experience to join our cutting-edge AI/ML team. In this role, you will be responsible for building, scaling, and maintaining the infrastructure that powers our machine learning and generative AI applications. You will work closely with data scientists, ML engineers, and software developers to ensure our ML/AI systems are robust, efficient, and production-ready.
This is a remote-based position that can be located anywhere in the United States or Canada.
Key Responsibilities:
- Design, implement, and maintain scalable infrastructure for ML and GenAI applications.
- Deploy, operate, and troubleshoot production ML pipelines and generative AI services.
- Build and optimize CI/CD pipelines for ML model deployment and serving.
- Scale compute resources across CPU/GPU/TPU/NPU architectures to meet performance requirements.
- Implement container orchestration with Kubernetes for ML workloads.
- Architect and optimize cloud resources on GCP for ML training and inference.
- Set up and maintain runtime frameworks and job management systems (Airflow, Kubeflow, MLflow).
- Establish monitoring, logging, and alerting for ML system observability.
- Collaborate with data scientists and ML engineers to translate models into production systems.
- Optimize system performance and resource utilization for cost efficiency.
- Develop and enforce MLOps best practices across the organization.
Qualifications:
- Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).
- 8+ years of overall software engineering experience.
- 3+ years of focused experience in MLOps or similar ML infrastructure roles.
- Strong experience with Docker container services and Kubernetes orchestration.
- Demonstrated expertise in cloud infrastructure management, preferably on GCP (AWS or Azure experience also valued).
- Proficiency with workflow management and ML runtime frameworks such as Airflow, Kubeflow, and MLflow.
- Strong CI/CD expertise with experience implementing automated testing and deployment pipelines.
- Experience with scaling distributed compute architectures utilizing various accelerators (CPU/GPU/TPU/NPU).
- Solid understanding of system performance optimization techniques.
- Experience implementing comprehensive observability solutions for complex systems.
- Knowledge of monitoring and logging tools (Prometheus, Grafana, ELK stack).
- Proficient in at least two of the following: Shell Scripting, Python, Go, C/C++.
- Familiarity with ML frameworks such as PyTorch and ML platforms like SageMaker or Vertex AI.
- Excellent problem-solving skills and ability to work independently.
- Strong communication skills and ability to work effectively in cross-functional teams.
The base pay range for this position varies by geographic location. More information about the pay range specific to the candidate's location and other factors will be shared during the recruitment process. Individual pay is determined by location of residence and multiple factors, including job-related knowledge, skills, and experience.
As part of the total compensation package, this role may be eligible for a bonus. For information on our benefits, click here.
About Us
PLEASE NOTE: All emails from Calix will come from a '@calix.com' email address. Please verify and confirm any communication from Calix prior to disclosing any personal or financial information. If you receive a communication that you think may not be from Calix, please report it to us at talentandculture@calix.com.
Calix delivers a broadband platform and managed services that enable our customers to improve life one community at a time. We’re at the forefront of a once-in-a-generation change in the broadband industry. Join us as we innovate, help our customers reach their potential, and connect underserved communities with unrivaled digital experiences.
This is the Calix mission: to enable broadband service providers (BSPs) of all sizes to Simplify. Innovate. Grow.
To learn more, visit the Calix website at www.calix.com.
To learn more about our international job opportunities, please visit our International Careers Page.
If you are a person with a disability needing assistance with the application process, please:
- Email us at calix.interview@calix.com; or
- Call us at +1 (408) 514-3000.
Calix is a Drug Free Workplace.
You may access a copy of the Calix Candidate Privacy Policy here and other Calix Privacy Policies here.