Job Description:
As a Site Reliability Engineer, you will be responsible for ensuring the reliability, performance, and observability of our transaction processing and settlement platforms, while also providing hands-on support for critical business applications. You will work across hybrid environments (on-prem and cloud), contribute to automation and monitoring enhancements, and support incident response and platform stability. This position requires availability to work in a 24×7 support model, including rotational shifts and on-call duties.
Key Responsibilities:
- Monitoring & Observability: Maintain and enhance monitoring systems using Prometheus, Grafana, Splunk, and SolarWinds. Ensure timely detection and resolution of issues through effective alerting and dashboards.
- Application Support: Provide L1/L2 support for business-critical applications, including incident triage, health checks, deployment validation, and coordination with development and product teams.
- Incident Management: Lead the response to moderate-to-complex incidents, perform root cause analysis, and contribute to post-incident reviews and documentation.
- Automation & Scripting: Develop and maintain automation scripts using Python, PowerShell, or Bash to streamline operational tasks and reduce manual effort.
- Infrastructure Support: Monitor and support infrastructure health across on-prem and cloud platforms (GCP, Azure), including performance tuning and capacity planning.
- Kubernetes Operations: Support containerized workloads and microservices running on Kubernetes clusters. Perform health checks, troubleshoot deployments, and optimize resource usage.
- Process Adherence: Participate in ITIL-aligned processes for incident, change, and problem management. Ensure compliance with operational standards and audit requirements.
- Knowledge Sharing: Document SOPs, recurring issues, and resolutions. Mentor junior engineers and contribute to the team knowledge base.
- Collaboration: Work closely with development, QA, and platform teams to support deployments, platform transitions, and reliability improvements.
- Continuous Improvement: Proactively identify areas for improvement in system reliability, alerting, and operational workflows.
- 24×7 Support: Provide on-call support for critical issues as part of the rotational shift model.
Qualifications:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- 2–5 years of experience in SRE, infrastructure operations, system administration, or application support.
- Proficiency in monitoring tools (Prometheus, Grafana, Splunk, SolarWinds).
- Strong scripting skills (Python, Bash, PowerShell).
- Experience with cloud platforms (GCP, Azure, AWS).
- Hands-on experience with Kubernetes in production environments.
- Solid understanding of ITIL practices and enterprise support workflows.
- Hands-on experience with automation tools (ActiveBatch or similar).
- Strong analytical, communication, and problem-solving skills.
- Willingness to work in a 24×7 support environment and take ownership of reliability outcomes.