SRE / DevOps engineer (with Python and ML framework)

Job Description

The AI Platform Team provides highly available, scalable, and automated machine learning infrastructure for researchers and data scientists globally. This SRE / DevOps engineer role focuses on maintaining, deploying, and improving AI/ML platform services with a strong emphasis on DevOps, SRE practices, and automation. The engineer will collaborate closely with developers, researchers, and infrastructure teams to ensure robust, scalable, and highly available ML systems, driving operational excellence and platform reliability.

N-iX is a global software development service company that helps businesses across the world develop successful software products. Founded in 2002, N-iX has come a long way, expanding its presence across Europe, the US, and Latin America. Today, we are a strong community of 2,000+ professionals and a reliable partner for global industry leaders and Fortune 500 companies.

Our client is a global commerce leader where you can influence how the world buys, sells, and gives. You’ll be part of a work culture that’s been genuinely committed to diversity and inclusion since its founding over twenty-five years ago. Here, you can be yourself, do your best work along with a team of professionals, and have a meaningful impact on people across the globe. We seek people with drive, ideas, and a passion for helping small businesses succeed.

About the team: We are the AI Platform Team, providing highly available, scalable, and automated machine learning infrastructure for researchers and data scientists globally. We are looking for a motivated, self-reliant SRE / DevOps engineer with Python and ML framework experience to drive operational excellence, automation, and platform reliability.

Role Overview: This role focuses on maintaining, deploying, and improving AI/ML platform services with a strong emphasis on DevOps, SRE practices, and automation. You will collaborate closely with developers, researchers, and infrastructure teams to ensure robust, scalable, and highly available ML systems.

Responsibilities:

DevOps (~60%):

Design, implement, and maintain CI/CD pipelines for AI/ML platform services.
Manage and troubleshoot Kubernetes clusters, Docker containers, and cloud infrastructure.
Ensure high availability (99.999%), system reliability, and security across platforms.
Automate operational tasks, monitoring, and deployment workflows.
Collaborate with AI platform developers to deploy and scale ML frameworks efficiently.
Analyze and resolve production issues, performance bottlenecks, and functional problems.
Define operational standards and versioning practices, and advise teams on DevOps best practices.
Prepare documentation and training materials, and provide technical support to platform users.

Development (~40%):

Design, build, and refactor Python services and ML framework integrations.
Work with ML frameworks such as PyTorch, TensorFlow, and Triton.
Handle framework-related issues, version upgrades, and environment compatibility.
Support AI/ML model training, inferencing platforms, and LLM fine-tuning systems.
Collaborate with developers to integrate ML pipelines into automated CI/CD workflows.

Requirements:

Strong Python development experience (2–4 years).
Overall 3–5 years of relevant DevOps / SRE experience.
Hands-on experience with ML frameworks (PyTorch, TensorFlow, Triton).
Experience with AI/ML model training and inferencing platforms is a plus.
Familiarity with LLM fine-tuning systems is a plus.
Solid understanding of Kubernetes, Docker, Linux fundamentals, and DevOps practices.
Experience with CI/CD pipelines (Jenkins or similar), test automation, and monitoring.
Strong debugging and triaging skills.
Excellent communication and collaboration skills with cross-functional teams.
Strong organizational skills to manage multiple projects in a fast-paced environment.
Fluent in English (spoken and written).

We offer\*:

Flexible working format - remote, office-based or flexible
A competitive salary and good compensation package
Personalized career growth
Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
Active tech communities with regular knowledge sharing
Education reimbursement
Memorable anniversary presents
Corporate events and team buildings
Other location-specific benefits

\*not applicable for freelancers
