Machine Learning Engineer II
Condé Nast
Job Description
About The Role:
Condé Nast is seeking a motivated and skilled Machine Learning Engineer II to support the productionization of machine learning projects in Databricks or AWS environments for the Data Science team.
This role is ideal for an engineer with a strong foundation in software development, data engineering, and machine learning, who enjoys transforming data science prototypes into scalable, reliable production pipelines.
Note: This role focuses on deploying, optimizing, and operating ML models rather than on researching or building new ones.
Primary Responsibilities
- Design, build, and operate scalable, highly available ML systems for batch and real-time inference.
- Own the end-to-end production lifecycle of ML services, including deployment, monitoring, incident response, and performance optimization.
- Build and maintain AWS-native ML architectures using services such as EKS, SageMaker, API Gateway, Lambda, and DynamoDB.
- Develop and deploy low-latency ML inference services using FastAPI/Flask or gRPC, running on Kubernetes (EKS); a minimal serving sketch follows this list.
- Design autoscaling strategies (HPA/Karpenter), rollout mechanisms, traffic routing, and resource tuning for ML workloads (see the autoscaling sketch after this list).
- Engineer near-real-time data and inference pipelines processing large volumes of events and requests (a streaming-consumer sketch follows this list).
- Collaborate closely with Data Scientists to translate prototypes into robust, production-ready systems.
- Implement and maintain CI/CD pipelines for ML services and workflows using GitHub Actions and Infrastructure-as-Code.
- Improve observability, logging, alerting, and SLA/SLO adherence for critical ML systems.
- Follow agile engineering practices with a strong focus on code quality, testing, and incremental delivery.
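To make the serving expectation concrete, here is a minimal sketch of the kind of low-latency inference service described above, using FastAPI with a model loaded once at startup. The model path, feature shape, and endpoint names are illustrative assumptions, not Condé Nast specifics.

```python
# Minimal FastAPI inference service sketch. MODEL_PATH, the feature
# vector shape, and the endpoint names are hypothetical placeholders.
from contextlib import asynccontextmanager

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

MODEL_PATH = "model.joblib"  # assumed artifact, e.g. synced from S3 at deploy
model = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    global model
    model = joblib.load(MODEL_PATH)  # load once at startup, not per request
    yield


app = FastAPI(lifespan=lifespan)


class PredictRequest(BaseModel):
    features: list[float]


class PredictResponse(BaseModel):
    score: float


@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    # Single-row prediction; batching/async would matter at higher QPS.
    return PredictResponse(score=float(model.predict([req.features])[0]))


@app.get("/healthz")
def healthz() -> dict:
    # Target for Kubernetes liveness/readiness probes on EKS.
    return {"status": "ok"}
```

Run locally with `uvicorn main:app`; on EKS, the same container would sit behind a Service and Ingress, typically fronted by API Gateway or a load balancer.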
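For the autoscaling bullet, the sketch below creates a CPU-based HorizontalPodAutoscaler with the official `kubernetes` Python client. The deployment name, namespace, and utilization target are assumptions for illustration; Karpenter handles node-level scaling and would be configured separately.

```python
# Sketch: create an HPA for a hypothetical "inference-svc" Deployment.
from kubernetes import client, config

config.load_kube_config()  # in-cluster code would use load_incluster_config()

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="inference-hpa"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="inference-svc"
        ),
        min_replicas=2,
        max_replicas=20,
        metrics=[
            client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    target=client.V2MetricTarget(
                        type="Utilization", average_utilization=70
                    ),
                ),
            )
        ],
    ),
)
client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="ml-serving", body=hpa  # namespace is an assumption
)
```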
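And for the near-real-time pipeline bullet, a hedged single-shard consumer sketch using `boto3` and Kinesis; the stream name and scoring stub are hypothetical, and a production consumer would typically use the KCL, Flink, or Spark Structured Streaming with checkpointing instead.

```python
# Illustrative single-shard Kinesis consumer; STREAM and score() are
# placeholders. No checkpointing or multi-shard handling is shown.
import json
import time

import boto3

kinesis = boto3.client("kinesis")
STREAM = "events-stream"  # hypothetical stream name


def score(event: dict) -> float:
    # Stand-in for a model call or a request to the inference service.
    return float(len(event))


shard = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM,
    ShardId=shard["ShardId"],
    ShardIteratorType="LATEST",
)["ShardIterator"]

while True:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:
        score(json.loads(record["Data"]))
    iterator = resp["NextShardIterator"]
    time.sleep(0.2)  # stay under the per-shard read throughput limit
```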
Desired Skills & Qualifications
- 4-7+ years of experience in Machine Learning Engineering, MLOps, or Backend Engineering.
- Strong foundation in system design, distributed systems, and API-based service architectures.
- Proven experience deploying and operating production-grade ML systems on AWS.
- Strong proficiency in Python, with experience integrating ML frameworks such as PyTorch, TensorFlow, and scikit-learn, and working with data processing libraries such as Pandas, NumPy, and PySpark.
- Solid experience with AWS services, including (but not limited to): EC2, S3, API Gateway, Lambda, IAM, VPC Networking, DynamoDB.
- Hands-on experience building and operating containerized microservices using Docker and Kubernetes (preferably EKS).
- Experience building and deploying ML inference services using:
  - FastAPI / Flask / gRPC
  - TorchServe, TensorFlow Serving, Triton, vLLM, or custom inference services
- Strong understanding of data structures, data modeling, and software architecture.
- Experience designing and managing CI/CD pipelines and Infrastructure-as-Code (Terraform) for ML systems.
- Strong debugging, performance optimization, and production troubleshooting skills.
- Excellent communication skills and ability to collaborate effectively across teams.
- Outstanding analytical and problem-solving skills.
- Undergraduate or Postgraduate degree in Computer Science or a related discipline.
Preferred Qualifications
- Experience with workflow orchestration and ML lifecycle tools such as Airflow, Astronomer, MLflow, or Kubeflow.
- Experience working with Databricks, Amazon SageMaker, or Spark-based ML pipelines in production environments.
- Familiarity with ML observability, monitoring, or feature management (e.g., model performance tracking, drift detection, feature stores); a drift-check sketch follows this list.
- Experience designing or integrating vector search, embedding-based retrieval, or RAG-style systems in production (see the retrieval sketch after this list).
- Prior experience operating low-latency or high-throughput services in a production environment.
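As an illustration of the drift-detection item above, a minimal sketch using a two-sample Kolmogorov–Smirnov test from `scipy`; the p-value threshold and the synthetic data are assumptions, not a specific vendor tool.

```python
# Toy drift check: flag when live feature values diverge from training data.
import numpy as np
from scipy.stats import ks_2samp


def drift_alert(
    train_sample: np.ndarray, live_sample: np.ndarray, p_threshold: float = 0.01
) -> bool:
    """Return True when the KS test rejects 'same distribution'."""
    stat, p_value = ks_2samp(train_sample, live_sample)
    return p_value < p_threshold


rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5_000)  # stand-in for training distribution
live = rng.normal(0.4, 1.0, 5_000)      # shifted production traffic
print(drift_alert(baseline, live))      # True: distributions diverge
```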
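And for the vector-search item, a minimal embedding-retrieval sketch using brute-force cosine similarity over an in-memory NumPy matrix; the embedding dimension and corpus size are made up, and a production system would use a vector store or ANN index (e.g., FAISS or OpenSearch) instead.

```python
# Toy embedding retrieval: top-k rows of `index` most similar to `query`.
import numpy as np


def top_k(query: np.ndarray, index: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k nearest rows by cosine similarity."""
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    return np.argsort(m @ q)[::-1][:k]


rng = np.random.default_rng(1)
docs = rng.normal(size=(1_000, 384))  # 384-dim embeddings (assumed size)
query = rng.normal(size=384)
print(top_k(query, docs))
```

In a RAG-style system, the returned indices would map back to document chunks that are injected into a model prompt; the retrieval step itself is exactly this similarity lookup, just backed by an approximate index at scale.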