Infrastructure Automation Site Reliability Engineer (SRE)

IBKR External

3-6 Years | Hyderabad, Telangana, India (On Site) | Full Time | 2 months ago

Job Summary

The Infrastructure Automation Site Reliability Engineer (SRE) bridges the gap between development and operations by applying software engineering principles to infrastructure and operational challenges. This role supports existing Infrastructure Developers by taking ownership of application support and process work required to manage applications at scale in a 24x7 environment, allowing developers to focus on new features. Key functions include application/tool support, service introduction, infrastructure automation, monitoring & observability, and operational excellence.

Must Have

Support existing applications and services hosted by the Infrastructure Automation (InfAuto) team
Develop runbooks for application support and maintenance
Create detailed alerts for incident management and monitoring tools
Implement and manage an updated operations platform for the Technical Operations team
Develop communication plans for service and tool launches
Improve messaging around service interruptions and maintenance
Expand use of cloud development pipelines for new observability capabilities
Support cloud infrastructure integration
Use scripts to perform maintenance tasks
Define KPIs and SLAs for managed services
Assist with dashboard development and management
Integrate cloud infrastructure with monitoring and reporting tools
Conduct capacity planning to support proactive scaling
Design and execute high availability (HA) and disaster recovery (DR) infrastructure testing
Partner with operations teams to expedite issue analysis
Coordinate change management activities with application users
3–6 years of experience with Infrastructure as Code tools such as Terraform or CloudFormation
3–6 years of experience with Configuration Management tools such as Ansible, Puppet, or Chef
3–6 years of experience with Container Technologies such as Docker and Podman, plus basic Kubernetes concepts
3–6 years of experience with Observability Platforms such as Grafana, Elastic (ELK), DataDog, or Splunk
3–6 years of experience with Issue / Project Tracking tools such as JIRA, ServiceNow, or Trello
3–6 years of experience with CI/CD Pipeline tools such as Jenkins, GitLab CI, or GitHub Actions
3–6 years of experience with Documentation Tools such as SharePoint or Confluence
3–6 years of experience with Linux Operating Systems such as Red Hat Enterprise Linux or a similar distribution
3–6 years of experience with Database Operations using SQL and PostgreSQL
3–6 years of experience with IDEs such as Visual Studio Code (VS Code) or JetBrains IntelliJ IDEA

Good to Have

2–4 years in an L1 SRE or DevOps role
Experience as a Systems Engineer (infrastructure design and implementation)
Experience as a Platform Engineer (internal tooling and platform development)
Experience as a Cloud Engineer (multi-cloud experience and migration projects)
Experience in Application Support (production troubleshooting)
Experience as a Release Engineer (software deployment and release management)
Experience in Incident Response (on-call experience and production issue resolution)

Perks & Benefits

Competitive salary package
Performance-based annual bonus (cash and stocks)
Hybrid working model (3 days office/week)
Group Medical & Life Insurance
Modern offices with free amenities & fully stocked cafeterias
Monthly food card & company-paid snacks
Hardship/shift allowance with company-provided pickup & drop facility
Attractive employee referral bonus
Frequent company-sponsored team-building events and outings

Job Description

About the Role:

The Infrastructure Automation Site Reliability Engineer (SRE) bridges the gap between development and operations by applying software engineering principles to infrastructure and operational challenges. Responsibilities include creating support documentation, developing key metrics for tracking and reporting, managing monitoring services, using automation tools, and coordinating cross-team communications related to releases and maintenance.

Automation SREs support existing Infrastructure Developers by taking ownership of application support and process work required to manage these applications at scale in a 24×7 environment. This allows developers to focus on building new features and functionality.

Key Functions:

Application / Tool Support

Support existing applications and services hosted by the Infrastructure Automation (InfAuto) team
Develop runbooks for application support and maintenance
Create detailed alerts for incident management and monitoring tools
Implement and manage an updated operations platform for the Technical Operations team

Service Introduction & Communications

Develop communication plans for service and tool launches
Improve messaging around service interruptions and maintenance

Infrastructure & Automation

Expand use of cloud development pipelines for new observability capabilities
Support cloud infrastructure integration
Use scripts to perform maintenance tasks

Monitoring & Observability

Define KPIs and SLAs for managed services
Assist with dashboard development and management
Integrate cloud infrastructure with monitoring and reporting tools
Conduct capacity planning to support proactive scaling

Operational Excellence

Design and execute high availability (HA) and disaster recovery (DR) infrastructure testing
Partner with operations teams to expedite issue analysis
Coordinate change management activities with application users

Required Skills and Tools Experience:

Experience Range: 3–6 years using tools in the following categories:

Infrastructure as Code: Terraform, CloudFormation, or similar
Configuration Management: Ansible, Puppet, or Chef
Container Technologies: Docker, Podman, basic Kubernetes concepts
Observability Platforms: Grafana, Elastic (ELK), DataDog, Splunk
Issue / Project Tracking: JIRA, ServiceNow, Trello, or similar
CI/CD Pipelines: Jenkins, GitLab CI, GitHub Actions
Documentation Tools: SharePoint, Confluence (for user guides, runbooks, etc.)
Linux Operating Systems: Red Hat Enterprise Linux or similar (CentOS, Rocky, Fedora)
Database Operations: SQL, PostgreSQL
IDEs: Visual Studio Code (VS Code), JetBrains IntelliJ IDEA

Desired Skills

2–4 years in an L1 SRE or DevOps role
Experience as a Systems Engineer (infrastructure design and implementation)
Experience as a Platform Engineer (internal tooling and platform development)
Experience as a Cloud Engineer (multi-cloud experience and migration projects)
Experience in Application Support (production troubleshooting)
Experience as a Release Engineer (software deployment and release management)
Experience in Incident Response (on-call experience and production issue resolution)

Company Benefits & Perks:

Competitive salary package.
Performance-based annual bonus (cash and stocks).
Hybrid working model (3 days office/week).
Group Medical & Life Insurance.
Modern offices with free amenities & fully stocked cafeterias.
Monthly food card & company-paid snacks.
Hardship/shift allowance with company-provided pickup & drop facility.*
Attractive employee referral bonus.
Frequent company-sponsored team-building events and outings.

*Depending upon the shifts.

**The benefits package is subject to change at management's discretion.
