Sr ML Ops Engineer

1 Day ago • 5 Years + • $152,100 PA - $203,900 PA
Research Development

Job Description

The Skywalker Sound Development Group is seeking a highly skilled Sr ML Ops Engineer to build and maintain the infrastructure powering machine learning and AI frameworks. This role is crucial for seamless workflows in model training, retraining, and deployment, ensuring cutting-edge AI solutions operate reliably at scale. As a Sr ML Ops Engineer, you will bridge the gap between data science, research, and production engineering, supporting the development of transformative audio solutions for speech processing, style transfer, and source separation in media production workflows. This is a Hybrid role, requiring 2-3 days onsite in Nicasio, CA.
Good To Have:
  • Experience with data orchestration tools such as DataChain and Weights & Biases.
  • Hands-on experience with automated hyperparameter tuning and optimization frameworks.
  • Familiarity with model monitoring tools such as Prometheus and Grafana for model drift and data quality checks.
  • Experience integrating pre-trained foundational models and managing their deployment at scale.
  • Contributions to open-source ML Ops projects or relevant research publications.
Must Have:
  • Develop, deploy, and maintain scalable infrastructure for ML model training, retraining, and inference.
  • Design and optimize CI/CD pipelines for machine learning workflows.
  • Implement robust monitoring and logging systems for model performance.
  • Collaborate with AI researchers and data scientists.
  • Manage compute resources (cloud and on-premises) for large-scale distributed training and inference.
  • Containerize ML models using Docker and deploy via Kubernetes.
  • Automate deployment workflows for serving ML models using TorchServe, TensorFlow Serving, and FastAPI.
  • Implement model versioning, rollback strategies, and governance.
  • Optimize cost efficiency and performance of ML workflows in cloud environments (AWS, GCP, Azure).
  • Stay updated with emerging ML Ops tools and practices.
  • Bachelor’s degree in Computer Science, Engineering, or a related field.
  • 5+ years in DevOps, SRE, or a related role, with 2+ years in ML Ops.
  • Expertise in building and maintaining CI/CD pipelines for ML applications.
  • Strong proficiency with Docker and Kubernetes.
  • Proficiency in deploying ML models using TensorFlow Serving, TorchServe, or custom APIs.
  • Deep understanding of cloud infrastructure (AWS, GCP, Azure) for ML workloads, including GPUs and TPUs.
  • Experience managing large-scale distributed training workflows.
  • Familiarity with MLflow, DVC, and Weights & Biases for data and model tracking.
  • Solid understanding of security best practices for ML systems and sensitive data handling.
  • Strong scripting and programming skills in Python, Bash, or Go.
Perks:
  • Bonus and/or long-term incentive units
  • Full range of medical benefits
  • Full range of financial benefits
  • Other benefits dependent on level and position offered

Job Summary:

The Skywalker Sound Development Group is seeking a highly skilled Sr ML Ops Engineer to build and maintain the infrastructure powering our machine learning and AI frameworks. This position is crucial in enabling seamless workflows for model training, retraining, and deployment, ensuring that cutting-edge AI solutions operate reliably at scale.

As a Sr ML Ops Engineer, you will act as the backbone of our AI/ML efforts, bridging the gap between data science, research, and production engineering. Your expertise in DevOps principles, model deployment strategies, and scalable infrastructure will support the development of transformative audio solutions for speech processing, style transfer, and source separation in media production workflows.

This role is considered Hybrid, which means the employee will work 2-3 days onsite at our Nicasio, CA office and occasionally from home.

What You'll Do:

  • Develop, deploy, and maintain scalable infrastructure for machine learning model training, retraining, and inference.
  • Design and optimize CI/CD pipelines specifically tailored for machine learning workflows, ensuring efficient delivery from research to production.
  • Implement robust monitoring and logging systems to track model performance and identify potential issues in production environments.
  • Collaborate with AI researchers and data scientists to ensure infrastructure aligns with project requirements and supports iterative experimentation.
  • Manage compute resources (cloud and on-premises) to enable large-scale distributed training and inference tasks.
  • Containerize machine learning models and applications using Docker and deploy them via Kubernetes or equivalent orchestration systems.
  • Automate deployment workflows for serving ML models using frameworks such as TorchServe, TensorFlow Serving and FastAPI.
  • Implement model versioning, rollback strategies, and governance for maintaining production stability.
  • Optimize cost efficiency and performance of machine learning workflows in cloud environments such as AWS, GCP, or Azure.
  • Stay updated with emerging ML Ops tools and practices, integrating them into existing workflows to improve performance and reliability.

What We’re Looking For:

  • Bachelor’s degree in Computer Science, Engineering, or a related field. A Master’s degree is preferred.
  • 5+ years of experience in DevOps, Site Reliability Engineering, or a related role, with at least 2 years focusing on ML Ops.
  • Expertise in building and maintaining CI/CD pipelines for machine learning applications.
  • Strong proficiency with containerization (Docker) and orchestration tools (Kubernetes).
  • Proficiency in deploying machine learning models using frameworks such as TensorFlow Serving, TorchServe, or custom APIs.
  • Deep understanding of cloud infrastructure and services (AWS, GCP, or Azure) for ML workloads, including GPU and TPU utilization.
  • Experience managing large-scale distributed training workflows and optimizing resource allocation.
  • Familiarity with tools like MLflow, DVC, Weights & Biases, or similar for data and model tracking and versioning.
  • Solid understanding of security best practices for machine learning systems and sensitive data handling.
  • Strong scripting and programming skills in Python, Bash, or Go.

Preferred Qualifications:

  • Experience with data orchestration tools such as DataChain or Weights & Biases for managing ML workflows.
  • Hands-on experience with automated hyperparameter tuning and optimization frameworks.
  • Familiarity with model monitoring tools like Prometheus, Grafana, or custom solutions for model drift and data quality checks.
  • Experience integrating pre-trained foundational models and managing their deployment at scale.
  • Contributions to open-source ML Ops projects or relevant research publications.

The hiring range for this position in San Francisco, CA is $152,100 to $203,900 per year. The base pay actually offered will take into account internal equity and also may vary depending on the candidate’s geographic region, job-related knowledge, skills, and experience among other factors. A bonus and/or long-term incentive units may be provided as part of the compensation package, in addition to the full range of medical, financial, and/or other benefits, dependent on the level and position offered.

About Lucasfilm:

Lucasfilm is a global leader in film, television and digital entertainment production. In addition to its motion-picture and television production, the company's activities include visual effects, audio post-production and cutting-edge digital animation, interactive entertainment software, and the management of the global merchandising activities for its entertainment properties including the legendary STAR WARS and INDIANA JONES franchises. Lucasfilm Ltd. is headquartered in northern California.

About The Walt Disney Company:

The Walt Disney Company, together with its subsidiaries and affiliates, is a leading diversified international family entertainment and media enterprise that includes three core business segments: Disney Entertainment, ESPN, and Disney Experiences. From humble beginnings as a cartoon studio in the 1920s to its preeminent name in the entertainment industry today, Disney proudly continues its legacy of creating world-class stories and experiences for every member of the family. Disney’s stories, characters and experiences reach consumers and guests from every corner of the globe. With operations in more than 40 countries, our employees and cast members work together to create entertainment experiences that are both universally and locally cherished.

This position is with Lucasfilm Ent Co Ltd, LLC Payroll Svc, which is part of a business we call Lucasfilm.

Lucasfilm Ent Co Ltd, LLC Payroll Svc is an equal opportunity employer. Applicants will receive consideration for employment without regard to race, religion, color, sex, sexual orientation, gender, gender identity, gender expression, national origin, ancestry, age, marital status, military or veteran status, medical condition, genetic information or disability, or any other basis prohibited by federal, state or local law. Disney champions a business environment where ideas and decisions from all people help us grow, innovate, create the best stories and be relevant in a constantly evolving world.
