About the role
We are seeking a talented Site Reliability Engineer (SRE) to join our team. The ideal candidate will have a strong background in software engineering and systems administration, with a passion for building scalable and reliable systems. As an SRE, you will collaborate with development and operations teams to ensure our services are reliable, performant, and highly available.
Key Responsibilities
- Experience maintaining and supporting solutions in a cloud-based environment (GCP or AWS)
- Experience working with various monitoring tools (e.g., ELK, Dynatrace, CloudWatch, Cloud Logging, Cloud Monitoring, BMC Surveyor, BMC Patrol, Grafana, Prometheus)
- Ensure monitoring and self-healing strategies are implemented and maintained to proactively prevent production incidents.
- Perform root cause analysis of production issues
- Design and manage on-call and escalation processes (nice to have)
- Participate in design reviews and production reviews for new features, products, or pieces of infrastructure
- Design and implement ELK (Elasticsearch, Logstash, and Kibana) stack, Prometheus, and Grafana solutions for monitoring and alerting.
- Debug production issues across services and levels of the stack.
- Establish KPIs to demonstrate maturity, efficiency, and value to our business partners
- Work as an integral part of the DevOps team, with complementary skills and common goals
- L3 support experience is an asset.
- Help create a release management process and assist with out-of-business-hours deployments and support (on rotation with team members)
- Familiar and comfortable with agile development techniques.
Technology skills (Mandatory)
ELK, Dynatrace, CloudWatch, Cloud Logging, Cloud Monitoring, BMC Surveyor, BMC Patrol, Grafana, Prometheus
Required qualifications to be successful in this role:
- Bachelor’s degree in computer science, engineering, or a related field.
- 8-10 years of experience as an SRE.
- Proven experience as an SRE, DevOps engineer, or similar role.
- Strong programming skills in languages such as Python, Go, Java, or Ruby.
- Strong problem-solving skills and ability to work under pressure.
- Excellent communication and collaboration skills.
- Flexibility to work in the Eastern time zone (9 a.m. to 5 p.m. EST)