Principal Site Reliability Engineer

Cubic Corporation

8+ Years | Hyderabad, Telangana, India (On Site) | Full Time | 1 week ago

Job Summary

The Senior Site Reliability Engineer is a leader within the team, responsible for designing, building, and owning the complex infrastructure and deployment systems that underpin our live environments. This role is both hands-on and strategic, requiring deep technical expertise and strong collaboration skills. You will mentor junior engineers and work closely with development teams to architect and implement systems that are reliable, scalable, and highly automated. Senior SREs are expected to drive the adoption of robust, automated solutions and ensure those solutions are well-documented and understood across engineering.

Must Have

Experience with AWS cloud
SRE role experience
Understanding of programming (preferably Java EE)
Security (IaC, cloud, and web apps)
IaC architecture
8+ years in a senior SRE, DevOps, or related infrastructure role
Minimum 2 years of hands-on programming experience in Java
Deep, hands-on expertise with AWS (ECS, EKS, Aurora (Postgres), EC2, S3, VPC)
Strong, production-level proficiency with Kubernetes and Helm
Extensive experience designing, building, and managing complex CI/CD pipelines (GitHub Actions, Argo CD, GHCR)
Expertise in Infrastructure as Code (Terraform or CloudFormation)
Proven experience with observability stacks (Kube-Prometheus-Grafana stack)
Ability to perform basic performance analysis and debugging of applications
Experience leading incident response and conducting blameless post-mortems

Good to Have

Experience in Java or another programming language

Job Description

Job Summary:

Required: Experience with AWS cloud, experience in an SRE role, an understanding of programming (preferably Java EE), security (IaC, cloud, and web apps), and IaC architecture.

Good to have: Experience in Java or another programming language.

Core Responsibilities:

Infrastructure Design & Maintenance

Lead the design, build, and maintenance of our core infrastructure using infrastructure-as-code (IaC) tools (e.g., Terraform, CloudFormation).
Own the provisioning and lifecycle management of production, staging, and other critical environments.
Architect and implement shared infrastructure components (e.g., logging, metrics, service mesh, load balancing).
Drive continuous improvements to infrastructure scalability, availability, and performance.
Act as a key partner to development teams, providing infrastructure primitives and strategic guidance on deployment needs.

Deployment Systems & CI/CD

Design, own, and enhance our CI/CD pipelines (GitHub Actions, Argo CD) to maximize reliability, velocity, and automation.
Establish and enforce best practices across all environments for deployment, rollback, and observability.
Partner with developers to architect and streamline the testing and delivery of code to production.
Champion the elimination of manual steps in deployment and operations workflows.

Reliability, Observability & Tooling

Architect and manage our monitoring, alerting, and logging infrastructure (Kube-Prometheus-Grafana stack).
Define, implement, and track SLOs/SLIs for core services, holding service owners accountable.
Proactively identify and eliminate single points of failure, performance bottlenecks, and sources of instability.
Lead reliability reviews, blameless post-incident analyses, and capacity planning initiatives.
Perform basic debugging of Java applications to assist development teams in troubleshooting.

Documentation & Knowledge Sharing

Ensure all systems and processes built or maintained by the SRE team are accompanied by thorough, up-to-date documentation.
Mentor other engineers and contribute to shared knowledge bases, runbooks, and developer-facing materials.
Lead internal training sessions, walkthroughs, and pairings to cross-train teammates and reduce knowledge silos.

Collaboration & Culture

Work closely with the SRE Lead to define team strategy, prioritize work, and execute on team goals.
Mentor junior team members and act as a technical leader across engineering.
Participate in on-call rotations, acting as an escalation point for complex issues.
Champion a culture of blameless learning, transparency, and continuous improvement.

Qualifications and Skills:

Experience: 8+ years in a senior SRE, DevOps, or related infrastructure role.
Programming: Minimum 2 years of hands-on programming experience in Java (preferred), with a strong ability to develop automation and tooling for reliability and scalability.
Cloud: Deep, hands-on expertise with AWS, including services like ECS, EKS, Aurora (Postgres), EC2, S3, and VPC.
Containers & Orchestration: Strong, production-level proficiency with Kubernetes and Helm. Deep understanding of container runtimes and networking.
CI/CD: Extensive experience designing, building, and managing complex CI/CD pipelines using tools like GitHub Actions and Argo CD. Experience with container registries like GHCR.
IaC: Expertise in Infrastructure as Code, with strong proficiency in Terraform or CloudFormation.
Observability: Proven experience with observability stacks, particularly the Kube-Prometheus-Grafana stack, including custom metric instrumentation and advanced dashboarding.
Debugging: Ability to perform basic performance analysis and debugging of applications (Java experience is a strong plus).
Leadership: Demonstrated ability to mentor junior engineers, lead technical projects, and drive architectural decisions.
Incident Management: Experience leading incident response, conducting blameless post-mortems, and driving resulting action items to completion.

18 Skills Required For This Role

Problem Solving, Performance Analysis, GitHub, Game Texts, Networking, Incident Response, AWS, Service Mesh, Load Balancing, Argo CD, Prometheus, Terraform, Grafana, Helm, CI/CD, Kubernetes, GitHub Actions, Java
