Principal Site Reliability Engineer
Playson
Job Summary
Playson, founded in 2012, is a leading iGaming supplier whose high-end, microservice-based platform is built to process billions of financial transactions per day. The company focuses on delivering the best game experience with low latency. The Principal Site Reliability Engineer joins the Platform Tribe and is responsible for managing alerts, providing 24x7 on-call support, setting up proactive monitoring, deploying to EKS/K8s, improving infrastructure health, and collaborating with other teams to minimize customer impact during deployments.
Job Description
Founded in 2012, Playson is a leading iGaming supplier recognized worldwide. We provide our customers with a high-end, microservice-based platform as a service that aims to process billions of financial transactions per day.
We run a cross-regional setup and aim to push latency as close to zero as possible. We invest heavily in delivering the best game experience and a smooth connection, regardless of the game client's internet coverage and bandwidth.
We are currently seeking an experienced Principal Site Reliability Engineer to join our dynamic Platform Tribe.
###### Key Responsibilities:
- Manage day-to-day alerts, system checks, and issue escalation as necessary.
- Provide 24x7 on-call support for critical SaaS events.
- Document issues and remediation steps.
- Proactively create monitors within the EKS/K8s ecosystem.
- Deploy to EKS/K8s cluster using Terraform and Helm/Flux.
- Enhance infrastructure health by implementing checks and scripts to address known issues.
- Maintain and develop deployment code.
- Implement/integrate new technologies into our Cloud Infrastructure.
- Collaborate with other teams to provide top-notch support and assistance.
- Prioritize customer focus in planning deployments/updates, ensuring minimal impact.
- Conduct RCA and take necessary corrective actions to prevent issue recurrence.
- Assign alert-related actions to the appropriate team after investigation.
- Handle support requests for environment-specific actions.
###### Requirements:
- Strong experience with issue processing (RCA, Postmortems).
- Proficiency in Kubernetes (deployment, scaling, troubleshooting).
- Familiarity with AWS, Terraform, Docker, CI/CD.
- Experience with monitoring tools like DataDog, Prometheus, Grafana, and logging solutions like Elasticsearch, Logstash, and Kibana (ELK Stack) or AWS CloudWatch.
- Strong understanding of networking concepts and protocols.
- Proficiency in at least one scripting language (e.g., Python, Node.js, Go).
- Experience with configuration management tools like FluxCD/ArgoCD.
- Proficiency in Git or other version control systems.
- Familiarity with incident response and management tools like PagerDuty, Opsgenie, or VictorOps.
- Ownership, proactiveness, persistence, and passion for maintaining a high-traffic online platform.
###### Recruitment Process:
1. HR Interview
2. Hiring Manager Interview
3. Technical Interview
4. Final Interview with Head of Platform & CTO
If you're ready to embrace ambitious goals and thrive in a dynamic environment, apply now and become part of Playson's exciting journey in the iGaming world!
We bring entertainment and satisfaction to people's busy lives.
###### What you get in return:
- Professional development
- Flexibility in your schedule
- Full Medical Insurance for you and your +1
- Special Life Event financial support
- Unlimited paid vacation leave
- Bonus system
- Unlimited sick leave
- Remote work
- Courses and training reimbursement