Site Reliability Engineer II
NCR Atleos
Job Summary
As a Site Reliability Engineer II at NCR Atleos, you will ensure the reliability, performance, and observability of transaction processing and settlement platforms. This role involves providing hands-on support for critical business applications across hybrid environments (on-prem and cloud), contributing to automation and monitoring enhancements, and supporting incident response and platform stability. You will maintain and enhance monitoring systems, provide L1/L2 application support, lead incident management, and develop automation scripts. The position requires 24×7 availability, including rotational shifts and on-call duties.
Must Have
- Maintain and enhance monitoring systems using Prometheus, Grafana, Splunk, and SolarWinds.
- Provide L1/L2 support for business-critical applications.
- Lead response for moderate to complex incidents and perform root cause analysis.
- Develop and maintain automation scripts using Python, PowerShell, or Bash.
- Monitor and support infrastructure health across on-prem and cloud platforms (GCP, Azure).
- Support containerized workloads and microservices running on Kubernetes clusters.
- Participate in ITIL-aligned processes for incident, change, and problem management.
- Document SOPs, recurring issues, and resolutions.
- Provide on-call support for critical issues in a 24/7 support model.
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- 2–5 years of experience in SRE, infrastructure operations, system administration, or application support.
- Proficiency in monitoring tools (Prometheus, Grafana, Splunk, SolarWinds).
- Strong scripting skills (Python, Bash, PowerShell).
- Experience with cloud platforms (GCP, Azure, AWS).
- Hands-on experience with Kubernetes in production environments.
- Solid understanding of ITIL practices and enterprise support workflows.
- Hands-on experience with automation tools (ActiveBatch or similar).
- Strong analytical, communication, and problem-solving skills.
- Willingness to work in a 24×7 support environment and take ownership of reliability outcomes.
Job Description
As a Site Reliability Engineer, you will be responsible for ensuring the reliability, performance, and observability of our transaction processing and settlement platforms, while also providing hands-on support for critical business applications. You will work across hybrid environments (on-prem and cloud), contribute to automation and monitoring enhancements, and support incident response and platform stability. This position requires availability to work in a 24×7 support model, including rotational shifts and on-call duties.
Key Responsibilities:
- Monitoring & Observability: Maintain and enhance monitoring systems using Prometheus, Grafana, Splunk, and SolarWinds. Ensure timely detection and resolution of issues through effective alerting and dashboards.
- Application Support: Provide L1/L2 support for business-critical applications, including incident triage, health checks, deployment validation, and coordination with development and product teams.
- Incident Management: Lead response for moderate to complex incidents, perform root cause analysis, and contribute to post-incident reviews and documentation.
- Automation & Scripting: Develop and maintain automation scripts using Python, PowerShell, or Bash to streamline operational tasks and reduce manual effort.
- Infrastructure Support: Monitor and support infrastructure health across on-prem and cloud platforms (GCP, Azure), including performance tuning and capacity planning.
- Kubernetes Operations: Support containerized workloads and microservices running on Kubernetes clusters. Perform health checks, troubleshoot deployments, and optimize resource usage.
- Process Adherence: Participate in ITIL-aligned processes for incident, change, and problem management. Ensure compliance with operational standards and audit requirements.
- Knowledge Sharing: Document SOPs, recurring issues, and resolutions. Mentor junior engineers and contribute to the team knowledge base.
- Collaboration: Work closely with development, QA, and platform teams to support deployments, platform transitions, and reliability improvements.
- Continuous Improvement: Proactively identify areas for improvement in system reliability, alerting, and operational workflows.
- 24/7 Support: Provide on-call support for critical issues.
Qualifications:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- 2–5 years of experience in SRE, infrastructure operations, system administration, or application support.
- Proficiency in monitoring tools (Prometheus, Grafana, Splunk, SolarWinds).
- Strong scripting skills (Python, Bash, PowerShell).
- Experience with cloud platforms (GCP, Azure, AWS).
- Hands-on experience with Kubernetes in production environments.
- Solid understanding of ITIL practices and enterprise support workflows.
- Hands-on experience with automation tools (ActiveBatch or similar).
- Strong analytical, communication, and problem-solving skills.
- Willingness to work in a 24×7 support environment and take ownership of reliability outcomes.