Principal Site Reliability Engineer

Autodesk

8+ Years | Bangalore, Karnataka, India (Hybrid) | Full Time | 1 day ago

Job Summary

Join the CloudOS team at Autodesk as a Principal Site Reliability Engineer, leading the development and operation of a cutting-edge Continuous Deployment platform. This role involves streamlining provisioning and management of cloud components across global regions, utilizing open-source tools and public cloud services like AWS, Azure, and GCP. You will enhance platform functionality, ensure reliability, automate complex workflows, and promote adoption, contributing to resilient developer platforms and scalable containerized environments.

Must Have

Lead design, development, deployment, testing, maintenance, and enhancement of CloudOS platform features.
Define architectural roadmap and future vision for the CloudOS platform.
Drive strategic initiatives to improve platform reliability, performance, and scalability.
Oversee and manage infrastructure supporting CloudOS using Kubernetes, AWS, Azure, and GCP.
Develop and enforce policies, standards, and procedures for cloud infrastructure management.
Configure, manage, and upgrade Jenkins and Spinnaker.
Lead development of complex automation scripts using Python, Go, or Java and Terraform.
Collaborate with internal engineering teams to understand CI/CD needs and provide support.
Implement, manage, and optimize monitoring, logging, and alerting systems.
Lead initiatives to identify and resolve performance bottlenecks.
Create and maintain comprehensive technical documentation and runbooks.
Stay informed with the latest industry trends in CI/CD, cloud-native technologies, DevOps, and platform engineering.
Oversee on-call rotation for incident response and support.
Bachelor’s degree in computer science, computer engineering, or a related field.
Minimum of 8 years of experience in a Platform Engineering, DevOps, SRE, or related role.
Extensive understanding of CI/CD principles and hands-on experience with relevant tools.
Experience working with major public cloud providers (AWS, Azure, GCP).
Proficiency in scripting and/or programming languages (Python, Go, Java).
Hands-on experience with containerisation (Docker) and container orchestration (Kubernetes).
Experience with Infrastructure-as-Code (IaC) tools like Terraform or CloudFormation.
In-depth knowledge of monitoring and observability tools.
Strong troubleshooting, problem-solving skills, and a strategic mindset.
Excellent communication and collaboration skills.

Good to Have

Direct, hands-on experience managing, configuring, or extending CI/CD Pipelines.
Experience contributing to open-source projects, particularly related to cloud-native or CI/CD tooling.
Strong AWS expertise.
Deep expertise in Kubernetes operations and management.
Experience building and supporting internal developer platforms or tools.
Familiarity with multi-region cloud application architectures and deployment strategies.
Experience with GitOps workflows.

Perks & Benefits

Annual cash bonuses
Commissions for sales roles
Stock grants
Comprehensive benefits package
Meaningful work that helps build a better world designed and made for all

Job Description

Position Overview

Become part of the team responsible for developing the core infrastructure that drives Autodesk's cloud services. You will be working on CloudOS, our cutting-edge Continuous Deployment (CD) platform, which is widely utilized by our internal cloud engineering teams. As a crucial element of the Platform Services and Emerging Technologies team, CloudOS streamlines and standardises the provisioning and management of front-end, back-end, and infrastructure components across multiple global regions. CloudOS harnesses the power and flexibility of industry-standard open-source tools, enabling the rapid delivery of new platform capabilities while benefiting from the contributions of the global developer community. This modern approach allows for increased agility and the replacement of outdated, less flexible internal deployment systems.

We are looking for a skilled and passionate Principal Site Reliability Engineer (SRE) to join our CloudOS team. As a Principal SRE, you will be a strategic leader and a technical authority in developing, operating, and evolving the CloudOS platform. You will help streamline and standardise the provisioning and management of front-end, back-end, and infrastructure components across multiple global regions. You will collaborate closely with internal engineering teams to understand their deployment needs, enhance the platform’s functionality, ensure its reliability, and promote its adoption across Autodesk. If you are enthusiastic about building resilient developer platforms, automating complex workflows, and working at scale on public cloud services with container platforms like ECS and EKS, this position is ideal for you. While we primarily utilise AWS, we also have workloads on Azure and GCP.

This is a hybrid role that requires working from the Bengaluru office a few days per week.

Responsibilities:

Lead the design, development, deployment, testing, maintenance, and enhancement of features and functionality within the CloudOS platform.
Define the architectural roadmap and future vision for the CloudOS platform, ensuring alignment with Autodesk's business goals.
Drive strategic initiatives to improve the platform's reliability, performance, and scalability.
Oversee and manage the infrastructure supporting CloudOS, ensuring reliability, scalability, security, and cost optimization using technologies like Kubernetes, AWS, Azure, and GCP.
Develop and enforce policies, standards, and procedures for cloud infrastructure management.
Configure, manage, and upgrade Jenkins and Spinnaker; contribute to upstream improvements or develop custom extensions as needed.
Lead the development of complex automation scripts for provisioning, deployment, monitoring, and operational tasks using languages such as Python, Go, or Java and Infrastructure-as-Code tools like Terraform.
Collaborate closely with internal engineering teams to understand their CI/CD needs, provide proactive support, troubleshoot complex issues, and promote best practices for utilizing CloudOS.
Implement, manage, and optimize monitoring, logging, and alerting systems to ensure the health and performance of the CloudOS platform.
Lead initiatives to identify and resolve performance bottlenecks, ensuring optimal system performance.
Create and maintain comprehensive technical documentation and runbooks, and facilitate knowledge transfer within the team and among platform users.
Stay informed with the latest industry trends and advancements in CI/CD, cloud-native technologies, DevOps, and platform engineering.
Oversee the on-call rotation to provide incident response and support for the CloudOS platform, ensuring timely resolution of issues and continuous improvement from incident learnings.

Minimum Qualifications:

Bachelor’s degree in Computer Science, Computer Engineering, or a related field, with at least 8 years of experience in a Platform Engineering, DevOps, SRE, or related role.
Extensive understanding of CI/CD principles and hands-on experience with relevant tools (e.g., Spinnaker, Jenkins, GitLab CI, Argo CD, Argo Workflows).
Experience working with major public cloud providers (AWS, Azure, GCP).
Proficiency in scripting and/or programming languages (e.g., Python, Go, Java).
Hands-on experience with containerisation (Docker) and container orchestration (Kubernetes).
Experience with Infrastructure-as-Code (IaC) tools like Terraform or CloudFormation.
In-depth knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, Dynatrace, Datadog, ELK Stack).
Strong troubleshooting, problem-solving skills, and a strategic mindset.
Excellent communication and collaboration skills.

Preferred Qualifications:

Direct, hands-on experience managing, configuring, or extending CI/CD pipelines.
Experience contributing to open-source projects, particularly related to cloud-native or CI/CD tooling.
Strong AWS expertise.
Deep expertise in Kubernetes operations and management.
Experience building and supporting internal developer platforms or tools.
Familiarity with multi-region cloud application architectures and deployment strategies.
Experience with GitOps workflows.

22 Skills Required For This Role

Problem Solving, Communication, Game Texts, GitLab, Incident Response, AWS, Azure, Argo CD, Prometheus, Grafana, Terraform, ELK, Spinnaker, CI/CD, Docker, Front End, Kubernetes, Back End, Python, Autodesk, Jenkins, Java