Senior Site Reliability Engineer
Playson
Job Summary
Playson is seeking an experienced Senior Site Reliability Engineer/DevOps to join their dynamic Platform Tribe. This role involves managing day-to-day alerts, system checks, and issue escalation, alongside providing 24x7 on-call support for critical SaaS events. Key responsibilities include proactively creating monitors within the EKS/K8s ecosystem, deploying to EKS/K8s clusters using Terraform and Helm/Flux, and enhancing infrastructure health. The engineer will also maintain deployment code, integrate new technologies into Cloud Infrastructure, and collaborate with other teams to ensure top-notch support and minimal impact during deployments.
Must Have
- Manage day-to-day alerts, system checks, and issue escalation.
- Provide 24x7 on-call support for critical SaaS events.
- Document issues and remediation steps.
- Proactively create monitors within the EKS/K8s ecosystem.
- Deploy to EKS/K8s clusters using Terraform and Helm/Flux.
- Enhance infrastructure health with checks and scripts.
- Maintain and develop deployment code.
- Implement/integrate new technologies into Cloud Infrastructure.
- Collaborate with other teams for support and assistance.
- Prioritize customer focus in planning deployments/updates.
- Conduct RCA and take corrective actions to prevent recurrence.
- Assign alert-related actions to appropriate teams after investigation.
- Handle support requests for environment-specific actions.
- Strong experience with issue processing (RCA, Postmortems).
- Proficiency in Kubernetes (deployment, scaling, troubleshooting).
- Familiarity with AWS, Terraform, Docker, CI/CD.
- Experience with monitoring tools like Datadog, Prometheus, and Grafana.
- Experience with logging solutions like ELK Stack or AWS CloudWatch.
- Strong understanding of networking concepts and protocols.
- Proficiency in at least one scripting or programming language (e.g., Python, JavaScript/Node.js, Go).
- Experience with GitOps continuous delivery tools like FluxCD/ArgoCD.
- Proficiency in Git or other version control systems.
- Familiarity with incident response and management tools.
Perks & Benefits
- Professional development
- Flexibility in your schedule
- Full Medical Insurance for you and your +1
- Special Life Event financial support
- Unlimited paid vacation leave
- Bonus system
- Unlimited sick leave
- Remote work
- Courses and training reimbursement
Job Description
We are currently seeking an experienced Senior Site Reliability Engineer/DevOps to join our dynamic Platform Tribe.
###### What will you be doing:
- Manage day-to-day alerts, system checks, and issue escalation as necessary.
- Provide 24x7 on-call support for critical SaaS events.
- Document issues and remediation steps.
- Proactively create monitors within the EKS/K8s ecosystem.
- Deploy to EKS/K8s clusters using Terraform and Helm/Flux.
- Enhance infrastructure health by implementing checks and scripts to address known issues.
- Maintain and develop deployment code.
- Implement/integrate new technologies into our Cloud Infrastructure.
- Collaborate with other teams to provide top-notch support and assistance.
- Prioritize customer focus in planning deployments/updates, ensuring minimal impact.
- Conduct RCA and take necessary corrective actions to prevent issue recurrence.
- Assign alert-related actions to the appropriate team after investigation.
- Handle support requests for environment-specific actions.
###### To succeed in this role, you will need:
- Strong experience with issue processing (RCA, Postmortems).
- Proficiency in Kubernetes (deployment, scaling, troubleshooting).
- Familiarity with AWS, Terraform, Docker, CI/CD.
- Experience with monitoring tools like Datadog, Prometheus, and Grafana, and with logging solutions like the ELK Stack (Elasticsearch, Logstash, Kibana) or AWS CloudWatch.
- Strong understanding of networking concepts and protocols.
- Proficiency in at least one scripting or programming language (e.g., Python, JavaScript/Node.js, Go).
- Experience with GitOps continuous delivery tools like FluxCD/ArgoCD.
- Proficiency in Git or other version control systems.
- Familiarity with incident response and management tools like PagerDuty, Opsgenie, or VictorOps.
- Ownership, proactiveness, persistence, and passion for maintaining a high-traffic online platform.
###### Recruitment Process:
1. HR Interview
2. Hiring Manager Interview
3. Technical Interview
4. Final Interview with Head of Platform & CTO
If you're ready to embrace ambitious goals and thrive in a dynamic environment, apply now and become part of Playson's exciting journey in the iGaming world!