This Machine Learning Engineer (Operations) role covers the end-to-end lifecycle of ML models and Large Language Models (LLMs) within an AWS ecosystem. Key responsibilities include designing, deploying, and maintaining ML pipelines using services such as SageMaker, Glue, and Step Functions. The role requires strong proficiency in Python, Docker, and Kubernetes for containerized deployments, along with experience optimizing AWS resource usage for cost and performance and implementing monitoring and alerting. Candidates will also build generative AI applications with Amazon Bedrock and apply prompt engineering strategies for LLMs, ensuring scalable and reliable production deployments.
Good To Have:
- Relevant AWS certifications (e.g., AWS Certified Machine Learning - Specialty, AWS Certified DevOps Engineer).
- Experience with Infrastructure as Code (IaC) tools like AWS CloudFormation or Terraform.
- Familiarity with CI/CD pipelines and tools for automating ML workflows.
- Understanding of data governance and security best practices in the context of ML.
Must Have:
- Understand machine learning concepts, algorithms, and best practices.
- Create, manage, and deploy ML models using core AWS services (e.g., Amazon SageMaker).
- Extract document data using Amazon Textract.
- Design, develop, and maintain automated data processing and ML training pipelines using AWS Glue and AWS Step Functions.
- Ensure seamless data ingestion, transformation, and storage strategies within AWS.
- Optimize AWS resource usage for cost-effectiveness and efficiency in ML operations.
- Leverage and manage foundation models in generative AI applications with Amazon Bedrock.
- Utilize database services such as Amazon RDS or DynamoDB to store metadata and model predictions.
- Implement monitoring, logging, and alerting mechanisms using Amazon CloudWatch.
- Manage container orchestration with AWS container services like EKS or ECS.
- Implement scalable and reliable ML model deployments in production.
- Implement, deploy, and optimize Large Language Models (LLMs) for production use cases.
- Monitor LLM performance, fine-tune parameters, and continuously update/refine models.
- Create and experiment with effective prompt engineering strategies for LLMs.
- Package ML models and applications into containers using Docker.
- Manage deployments, scaling, and networking with Kubernetes.
- Apply best practices for container security, performance optimization, and resource utilization.
- Proficiently use Python for data processing, model training, deployment automation, and scripting.
- Implement robust testing and debugging practices for Python code.
- Adhere to best practices and coding standards in Python development.
- Integrate external systems such as Veeva PromoMats with ML workflows.
- Possess strong analytical and problem-solving skills for ML systems and data pipelines.
- Maintain a proactive, results-oriented mindset focused on continuous improvement in MLOps.
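As a concrete illustration of the document-extraction responsibility above, here is a minimal sketch of pulling text lines out of a Textract response. The bucket name, object key, and helper names are placeholders, and a configured boto3 client with AWS credentials is assumed:

```python
def extract_lines(response: dict) -> list[str]:
    """Collect the text of every LINE block from a Textract-style response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]


def detect_text_from_s3(bucket: str, key: str) -> list[str]:
    """Run synchronous text detection on a PNG/JPEG stored in S3 (not invoked here)."""
    import boto3  # AWS SDK for Python; requires configured credentials

    textract = boto3.client("textract")
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return extract_lines(response)


# The parsing step works on any response-shaped dict, so it can be unit
# tested offline without calling AWS:
sample = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice #123"},
        {"BlockType": "WORD", "Text": "Invoice"},
    ]
}
print(extract_lines(sample))  # ['Invoice #123']
```

Keeping the response parsing in a pure function like `extract_lines` is one way to satisfy the testing and debugging expectations above: the pipeline logic stays verifiable without live AWS calls.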
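Similarly, the prompt engineering responsibility can be sketched against the shape of Bedrock's Converse API. The model ID, system prompt, and inference settings below are illustrative assumptions, not requirements of the role:

```python
def build_converse_request(model_id: str, system_prompt: str, user_text: str) -> dict:
    """Assemble a request body in the shape of the Bedrock Converse API."""
    return {
        "modelId": model_id,
        "system": [{"text": system_prompt}],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {"temperature": 0.2, "maxTokens": 512},
    }


def ask(client, request: dict) -> str:
    """Send the request via a bedrock-runtime client and return the reply text (not invoked here)."""
    response = client.converse(**request)
    return response["output"]["message"]["content"][0]["text"]


request = build_converse_request(
    "anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
    "You are a concise assistant. Answer in one sentence.",
    "Summarize: the pipeline retrains nightly and deploys on approval.",
)
print(request["messages"][0]["role"])  # user
```

Separating prompt construction from the API call makes it straightforward to version, A/B test, and unit test prompt strategies before they reach production.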