Site Reliability Engineer II

NCR Atleos

2-5 Years | Hyderabad, Telangana, India (Hybrid) | Full Time | 2 months ago

Job Summary

As a Site Reliability Engineer II at NCR Atleos, you will ensure the reliability, performance, and observability of transaction processing and settlement platforms. This role involves providing hands-on support for critical business applications across hybrid environments (on-prem and cloud), contributing to automation and monitoring enhancements, and supporting incident response and platform stability. You will maintain and enhance monitoring systems, provide L1/L2 application support, lead incident management, and develop automation scripts. The position requires 24x7 availability, including rotational shifts and on-call duties.

Must Have

Maintain and enhance monitoring systems using Prometheus, Grafana, Splunk, and SolarWinds.
Provide L1/L2 support for business-critical applications.
Lead the response to moderate-to-complex incidents and perform root cause analysis.
Develop and maintain automation scripts using Python, PowerShell, or Bash.
Monitor and support infrastructure health across on-prem and cloud platforms (GCP, Azure).
Support containerized workloads and microservices running on Kubernetes clusters.
Participate in ITIL-aligned processes for incident, change, and problem management.
Document SOPs, recurring issues, and resolutions.
Provide on-call support for critical issues in a 24/7 support model.
Bachelor's or master's degree in Computer Science, Engineering, or a related field.
2–5 years of experience in SRE, infrastructure operations, system administration, or application support.
Proficiency in monitoring tools (Prometheus, Grafana, Splunk, SolarWinds).
Strong scripting skills (Python, Bash, PowerShell).
Experience with cloud platforms (GCP, Azure, AWS).
Hands-on experience with Kubernetes in production environments.
Solid understanding of ITIL practices and enterprise support workflows.
Hands-on experience with automation tools (ActiveBatch or similar).
Strong analytical, communication, and problem-solving skills.
Willingness to work in a 24×7 support environment and take ownership of reliability outcomes.

Job Description

As a Site Reliability Engineer, you will be responsible for ensuring the reliability, performance, and observability of our transaction processing and settlement platforms, while also providing hands-on support for critical business applications. You will work across hybrid environments (on-prem and cloud), contribute to automation and monitoring enhancements, and support incident response and platform stability. This position requires availability to work in a 24×7 support model, including rotational shifts and on-call duties.

Key Responsibilities:

Monitoring & Observability: Maintain and enhance monitoring systems using Prometheus, Grafana, Splunk, and SolarWinds. Ensure timely detection and resolution of issues through effective alerting and dashboards.
Application Support: Provide L1/L2 support for business-critical applications, including incident triage, health checks, deployment validation, and coordination with development and product teams.
Incident Management: Lead the response to moderate-to-complex incidents, perform root cause analysis, and contribute to post-incident reviews and documentation.
Automation & Scripting: Develop and maintain automation scripts using Python, PowerShell, or Bash to streamline operational tasks and reduce manual effort.
Infrastructure Support: Monitor and support infrastructure health across on-prem and cloud platforms (GCP, Azure), including performance tuning and capacity planning.
Kubernetes Operations: Support containerized workloads and microservices running on Kubernetes clusters. Perform health checks, troubleshoot deployments, and optimize resource usage.
Process Adherence: Participate in ITIL-aligned processes for incident, change, and problem management. Ensure compliance with operational standards and audit requirements.
Knowledge Sharing: Document SOPs, recurring issues, and resolutions. Mentor junior engineers and contribute to the team knowledge base.
Collaboration: Work closely with development, QA, and platform teams to support deployments, platform transitions, and reliability improvements.
Continuous Improvement: Proactively identify areas for improvement in system reliability, alerting, and operational workflows.
24/7 Support: Provide on-call support for critical issues.

Qualifications:

Bachelor's or master's degree in Computer Science, Engineering, or a related field.
2–5 years of experience in SRE, infrastructure operations, system administration, or application support.
Proficiency in monitoring tools (Prometheus, Grafana, Splunk, SolarWinds).
Strong scripting skills (Python, Bash, PowerShell).
Experience with cloud platforms (GCP, Azure, AWS).
Hands-on experience with Kubernetes in production environments.
Solid understanding of ITIL practices and enterprise support workflows.
Hands-on experience with automation tools (ActiveBatch or similar).
Strong analytical, communication, and problem-solving skills.
Willingness to work in a 24×7 support environment and take ownership of reliability outcomes.

13 Skills Required For This Role

Game Texts, Quality Control, Incident Response, AWS, Azure, Prometheus, Grafana, PowerShell, Microservices, Kubernetes, Python, Splunk, Bash
