Senior ML Platform Engineer

5 Years + • Research Development

Job Summary

Job Description

Mistplay is seeking a Senior ML Platform Engineer to join its Data Team. This role involves researching and developing machine learning solutions for complex business problems, working with cross-functional teams to design and implement scalable solutions. The engineer will focus on building and operating production-grade ML/data platforms, including designing training-to-serving pipelines, managing real-time inference on SageMaker, implementing low-latency serving patterns with Redis/Valkey, and provisioning infrastructure with Terraform. The role requires strong software engineering skills and expertise in ML lifecycle governance and observability.
Key responsibilities and requirements:
  • Design, build, and operate standardized training-to-serving pipelines with Airflow.
  • Own real-time and batch inference on SageMaker.
  • Implement ultra-low-latency serving patterns with Redis/Valkey.
  • Provision and manage ML/data infrastructure with Terraform.
  • Build platform abstractions and golden paths.
  • Establish and run model lifecycle governance.
  • Implement end-to-end observability for ML workflows.
  • Partner with Security, SRE, and Data Engineering on infrastructure and policies.
  • Evaluate, integrate, and rationalize platform tooling.
  • 5+ years building and operating production-grade ML/data platforms.
  • Strong software engineering in Python, Go, or Java.
  • Deep experience with AWS SageMaker inference.
  • Expertise with online feature stores like Redis/Valkey.
  • Proven Terraform experience managing ML and data infrastructure.
  • Airflow orchestration at scale.
  • Familiarity with ML frameworks (scikit-learn, XGBoost, PyTorch, TensorFlow) for platform integration.
  • Observability for ML workflows.
  • Excellent communication and cross-functional collaboration.
Perks:
  • Team Lunches
  • Game nights
  • Company-wide events
  • Culture rooted in growth
  • Utilize data to constantly learn, improve, and adapt
  • Environment where everyone is encouraged to share ideas, push boundaries, take calculated risks

Job Details

Reporting to the Director of Data and Machine Learning Platform, the Senior ML Platform Engineer within the Data Team will play a key role in researching and developing machine learning solutions to solve complex business problems. The Senior ML Platform Engineer will work closely with a cross-functional team to identify areas for improvement and to design and implement scalable solutions. Relevant experience may include a wide variety of optimization and classification problems, such as collaborative filtering/recommendation, fraud detection, segmentation, propensity modeling, and text/sentiment classification.

What you’ll do

  • Design, build, and operate standardized training-to-serving pipelines with Airflow, covering artifact management, environment provisioning, packaging, deployment, and rollback for SageMaker endpoints.
  • Own real-time and batch inference on SageMaker: multi-model endpoints, serverless inference where appropriate, blue/green and canary strategies, autoscaling policies, and cost controls (spot strategies, instance right-sizing).
  • Implement ultra-low-latency serving patterns with Redis/Valkey: feature caching, online feature retrieval, request-scoped state, model response caching, and rate limiting/backpressure for bursty traffic.
  • Provision and manage ML/data infrastructure with Terraform: SageMaker endpoints/configs, ECR/ECS/EKS resources, networking/VPC endpoints, ElastiCache/Valkey clusters, observability stacks, secrets, and IAM.
  • Build platform abstractions and golden paths: Airflow DAG templates, CLI/SDKs, cookie-cutter repos, and CI/CD pipelines that take models from notebooks to production predictably.
  • Establish and run model lifecycle governance: model/feature registries, approval workflows, promotion policies, lineage, and audit trails integrated with Airflow runs and Terraform state.
  • Implement end-to-end observability: data/feature freshness checks, drift/quality gates, model performance/latency SLOs, infra health dashboards, tracing, and alerting—plus incident response and postmortems.
  • Partner with Security, SRE, and Data Engineering on private networking, policy-as-code, PII handling, least-privilege IAM, and cost-efficient architectures across environments.
  • Evaluate, integrate, and rationalize platform tooling (e.g., MLflow registry, feature stores, serving gateways); lead migrations with clear change management and minimal downtime.

What you’ll bring

  • 5+ years building and operating production-grade ML/data platforms with a focus on serving, reliability, and developer experience.
  • Strong software engineering in Python, Go, or Java; experience building resilient services, APIs, and automation tooling with high test coverage.
  • Deep experience with AWS SageMaker inference: endpoint configuration, containerization, model packaging, autoscaling, serverless vs. real-time trade-offs, multi-model endpoints (MME), and A/B and canary releases.
  • Expertise with online feature stores like Redis/Valkey in ML serving contexts.
  • Proven Terraform experience managing ML and data infra end-to-end: modules, workspaces, drift detection, change reviews, and safe rollbacks; familiarity with GitOps patterns.
  • Airflow orchestration at scale: dependency modeling, sensors, retries, SLAs, backfills, DAG factories, and integrations with registries, artifact stores, and Terraform pipelines.
  • Familiarity with ML frameworks (scikit-learn, XGBoost, PyTorch, TensorFlow) from a platform-integration perspective to support diverse runtimes and containers.
  • Observability for ML workflows: metrics/logs/traces, performance profiling, capacity planning, cost monitoring, and runbooks.
  • Excellent communication and cross-functional collaboration with Data Science, Data Engineering, DevOps and Backend.

Why Mistplay?

We strive to make our work environment as inviting and fun as possible! Working at Mistplay comes with a whole array of perks that we've adopted both virtually and in person: team lunches, game nights, company-wide events, and much more. Our culture is deeply rooted in growth and upheld by a team of smart, dynamic, and enthusiastic people. We use data to constantly learn, improve, and adapt. We foster an environment where everyone is encouraged to share their ideas, push boundaries, take calculated risks, and see their visions come to life.
