Site Reliability Engineer

Eneba Games

Job Summary

At Eneba, we're building an open, safe, and sustainable marketplace for gamers, supporting over 20 million active users. The Platform team builds, deploys, and monitors core infrastructure, creating tools for stability and speed. We are expanding the team with a dedicated Site Reliability Engineer to own and evolve the observability and reliability layer of our platform. This role involves improving metrics, logs, and tracing, guiding teams in building reliable services, introducing SLOs, and supporting incident response across a highly distributed environment.


Job Description

About Eneba

At Eneba, we’re building an open, safe, and sustainable marketplace for the gamers of today and tomorrow. Our marketplace supports over 20 million active users (and growing fast!) and provides a level of trust, safety, and market accessibility that is second to none. We’re proud of what we’ve accomplished in such a short time and look forward to sharing this journey with you. Join us as we continue to scale, diversify our portfolio, and grow with the evolving community of gamers.

About your team

The Platform team builds, deploys, monitors, and is on call for the platform components and the underlying infrastructure. It creates the tools other teams rely on to work in the most stable, fast, and precise manner. Platform team members do not shy away from architecture-level assignments; they keep a pulse on the latest tech trends and know the most effective tools of the moment. Eneba’s users never see the Platform team’s work directly, but they feel it through speed, quality, and new features.

We’re expanding the team with a dedicated Site Reliability Engineer who will take ownership of observability, reliability practices, and system visibility across a highly distributed environment.

As a Site Reliability Engineer, you will own and evolve the entire observability and reliability layer of our platform. You’ll improve our metrics, logs, and tracing ecosystem; guide teams in building reliable services; introduce SLOs and error budgets; implement production readiness processes; and support developers during incidents by helping identify failing components across distributed systems.

You will be the driving force behind making reliability and observability a first-class part of our platform and self-service workflows.

Responsibilities

  • Own and evolve our observability stack across metrics, logs, and tracing using Prometheus CRDs, Thanos, Alertmanager, Loki, Sentry, Grafana, and supporting AWS services.
  • Improve system reliability by designing, implementing, and maintaining SLIs, SLOs, and error budgets, ensuring our services meet reliability objectives (the sketch after this list illustrates the underlying math).
  • Enhance system visibility, enabling teams to proactively detect issues, reduce MTTR, and improve incident response workflows.
  • Build internal self-service capabilities for metrics, alerts, dashboards, and instrumentation to empower development teams.
  • Tune and optimize the Thanos stack, improving query performance, cache effectiveness, retention policies, and cost efficiency.
  • Extend and maintain monitoring Helm charts, Prometheus rules, exporters, and dashboards-as-code.
  • Collaborate with Backend, DevOps, and Platform teams to ensure reliability and observability are built into services from the design phase.
  • Support incident investigations, help pinpoint root causes, correlate metrics/logs/traces, and contribute to blameless postmortems.
  • Maintain observability cost efficiency, reducing waste through retention strategy, metric cardinality tuning, and performance improvements.
  • Keep the monitoring stack healthy and up to date, ensuring reliability, security, and alignment with best practices.
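For a concrete sense of the SLO work above, here is a minimal, hypothetical sketch of error-budget and burn-rate arithmetic for a request-based availability SLI; the target, window, and traffic numbers are illustrative assumptions, not Eneba figures:

    # Hypothetical numbers: a 99.9% availability SLO over a 30-day window.
    SLO_TARGET = 0.999
    WINDOW_DAYS = 30

    def error_budget(total_requests: int) -> float:
        """Requests allowed to fail in the window without breaching the SLO."""
        return total_requests * (1 - SLO_TARGET)

    def burn_rate(failed: int, total: int) -> float:
        """Budget consumption speed: 1.0 spends the budget exactly at the
        end of the window; above 1.0 it runs out early and should alert."""
        return (failed / total) / (1 - SLO_TARGET)

    total, failed = 10_000_000, 25_000
    print(f"error budget: {error_budget(total):,.0f} failed requests per {WINDOW_DAYS}d")
    print(f"burn rate: {burn_rate(failed, total):.1f}x")  # 2.5x here: well over budget

Multi-window variants of this burn rate (for example, a fast 1-hour window paired with a slow 6-hour window) are a common way to page on real budget consumption without alerting on brief blips.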

Requirements

  • Hands-on experience with production observability systems, especially Prometheus, Alertmanager, Grafana, and log/trace platforms like Elasticsearch, Loki, Sentry, or their equivalents.
  • Experience with Thanos or large-scale metrics systems, including tuning, caching strategies, and long-term storage.
  • Strong understanding of SLIs, SLOs, error budgets, MTTR, reliability patterns, and incident response workflows.
  • Solid experience with Kubernetes in production and deep understanding of how to monitor it (exporters, node metrics, service mesh signals).
  • Proficiency with Infrastructure as Code (Terraform preferred) and automation best practices.
  • Experience with AWS monitoring, scaling, and distributed cloud resource observability.
  • Proficiency in scripting or programming (Go, Python, or Bash) to build automation and tooling (see the sketch after this list for the kind of automation we mean).
  • Ability to reason about distributed systems failures, correlate signals, and guide teams through root-cause analysis.
  • Strong ownership mindset, excellent communication, and eagerness to collaborate with development teams.
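To illustrate the scripting expectation above, a small hedged sketch that queries Prometheus’s standard HTTP API (/api/v1/query) for a per-service error ratio; the server address and the PromQL expression are assumptions made for the example:

    import requests  # assumes the 'requests' package is available

    PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical address

    def instant_query(expr: str) -> list:
        """Run an instant PromQL query via Prometheus's /api/v1/query endpoint."""
        resp = requests.get(f"{PROM_URL}/api/v1/query",
                            params={"query": expr}, timeout=10)
        resp.raise_for_status()
        body = resp.json()
        if body["status"] != "success":
            raise RuntimeError(f"query failed: {body}")
        return body["data"]["result"]

    # Illustrative PromQL: per-service 5xx ratio over the last 5 minutes.
    expr = ('sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))'
            ' / sum by (service) (rate(http_requests_total[5m]))')
    for sample in instant_query(expr):
        service = sample["metric"].get("service", "unknown")
        print(f'{service}: {float(sample["value"][1]):.4%} error ratio')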

Extra points

  • Experience designing, tuning, or operating Thanos at scale.
  • Experience building self-service observability tooling or dashboards-as-code frameworks.
  • Deep understanding of alert fatigue reduction, signal-to-noise optimization, and high-quality alerting patterns.
  • Experience implementing resilience testing, fault injection, or chaos engineering.
  • Familiarity with service meshes (Istio, Linkerd) or service-level reliability patterns such as circuit breakers, retries, and rate limiting (a retry sketch follows this list).
  • Background operating multi-region or global-scale systems with complex telemetry needs.
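As a hedged illustration of the reliability patterns named above, a minimal sketch of retries with capped exponential backoff and full jitter; a production version would typically sit behind a circuit breaker and respect a retry budget:

    import random
    import time

    def call_with_retries(fn, max_attempts=4, base=0.1, cap=2.0):
        """Call fn(); on failure, sleep with capped exponential backoff and
        full jitter, then retry, up to max_attempts total attempts."""
        for attempt in range(1, max_attempts + 1):
            try:
                return fn()
            except Exception:
                if attempt == max_attempts:
                    raise
                # Sleep uniformly in [0, min(cap, base * 2**attempt)].
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

    # Usage: a deliberately flaky callable that succeeds on the third try.
    attempts = iter([RuntimeError("transient"), RuntimeError("transient"), "ok"])

    def flaky():
        outcome = next(attempts)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    print(call_with_retries(flaky))  # -> "ok"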

What it’s like to work at Eneba

  • Opportunity to join our Employee Stock Options program.
  • Opportunity to help scale a unique product.
  • Various bonus systems: performance-based, referral, additional paid leave, personal learning budget.
  • Paid volunteering opportunities.
  • Work location of your choice: office, remote, opportunity to work and travel.
  • Personal and professional growth at an exponential rate supported by well-defined feedback and promotion processes.

*Please attach CVs in English.

*To find out about how we handle your personal data, make sure to check out our Candidate Privacy Notice https://www.eneba.com/candidate-privacy-notice

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.
