This Machine Learning Engineer (Operations) role covers the end-to-end lifecycle of ML models and Large Language Models (LLMs) within an AWS ecosystem. Key responsibilities include designing, deploying, and maintaining ML pipelines using services such as SageMaker, Glue, and Step Functions. The role requires strong proficiency in Python, Docker, and Kubernetes for containerized deployments, along with experience optimizing AWS resource usage for cost and performance and implementing monitoring and alerting. Candidates will also build generative AI applications with Amazon Bedrock and apply prompt engineering strategies for LLMs, ensuring scalable and reliable production deployments.
Good To Have:
- Relevant AWS certifications (e.g., AWS Certified Machine Learning - Specialty, AWS Certified DevOps Engineer).
- Experience with Infrastructure as Code (IaC) tools like AWS CloudFormation or Terraform.
- Familiarity with CI/CD pipelines and tools for automating ML workflows.
- Understanding of data governance and security best practices in the context of ML.
Must Have:
- Understand machine learning concepts, algorithms, and best practices.
- Create, manage, and deploy ML models using core AWS services (e.g., Amazon SageMaker).
- Extract document data using Amazon Textract.
- Design, develop, and maintain automated data processing and ML training pipelines using AWS Glue and AWS Step Functions.
- Ensure seamless data ingestion, transformation, and storage strategies within AWS.
- Optimize AWS resource usage for cost-effectiveness and efficiency in ML operations.
- Leverage and manage foundation models in generative AI applications with Amazon Bedrock.
- Utilize database services such as Amazon RDS or DynamoDB to store metadata and model predictions.
- Implement monitoring, logging, and alerting mechanisms using Amazon CloudWatch.
- Manage container orchestration with AWS container services like EKS or ECS.
- Implement scalable and reliable ML model deployments in production.
- Implement, deploy, and optimize Large Language Models (LLMs) for production use cases.
- Monitor LLM performance, fine-tune parameters, and continuously update/refine models.
- Create and experiment with effective prompt engineering strategies for LLMs.
- Package ML models and applications into containers using Docker.
- Manage deployments, scaling, and networking with Kubernetes.
- Apply best practices for container security, performance optimization, and resource utilization.
- Proficiently use Python for data processing, model training, deployment automation, and scripting.
- Implement robust testing and debugging practices for Python code.
- Adhere to best practices and coding standards in Python development.
- Integrate external systems such as Veeva PromoMats with ML workflows.
- Possess strong analytical and problem-solving skills for ML systems and data pipelines.
- Maintain a proactive, results-oriented mindset focused on continuous improvement in MLOps.
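As a concrete illustration of the document-extraction responsibility above, here is a minimal sketch of pulling text lines out of a Textract response. The bucket name, object key, and helper names are placeholders, and a configured boto3 client with AWS credentials is assumed:

```python
def extract_lines(response: dict) -> list[str]:
    """Collect the text of every LINE block from a Textract-style response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]


def detect_text_from_s3(bucket: str, key: str) -> list[str]:
    """Run synchronous text detection on a PNG/JPEG stored in S3 (not invoked here)."""
    import boto3  # AWS SDK for Python; requires configured credentials

    textract = boto3.client("textract")
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return extract_lines(response)


# The parsing step works on any response-shaped dict, so it can be unit
# tested offline without calling AWS:
sample = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice #123"},
        {"BlockType": "WORD", "Text": "Invoice"},
    ]
}
print(extract_lines(sample))  # ['Invoice #123']
```

Keeping the response parsing in a pure function like `extract_lines` is one way to satisfy the testing and debugging expectations above: the pipeline logic stays verifiable without live AWS calls.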
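Similarly, the prompt engineering responsibility can be sketched against the shape of Bedrock's Converse API. The model ID, system prompt, and inference settings below are illustrative assumptions, not requirements of the role:

```python
def build_converse_request(model_id: str, system_prompt: str, user_text: str) -> dict:
    """Assemble a request body in the shape of the Bedrock Converse API."""
    return {
        "modelId": model_id,
        "system": [{"text": system_prompt}],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {"temperature": 0.2, "maxTokens": 512},
    }


def ask(client, request: dict) -> str:
    """Send the request via a bedrock-runtime client and return the reply text (not invoked here)."""
    response = client.converse(**request)
    return response["output"]["message"]["content"][0]["text"]


request = build_converse_request(
    "anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
    "You are a concise assistant. Answer in one sentence.",
    "Summarize: the pipeline retrains nightly and deploys on approval.",
)
print(request["messages"][0]["role"])  # user
```

Separating prompt construction from the API call makes it straightforward to version, A/B test, and unit test prompt strategies before they reach production.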