Site Reliability Engineer

1 Month ago • 5-10 Years • Devops • $120,000 PA - $150,000 PA

Job Summary

Job Description

As a Site Reliability Engineer at Xsolla, you will be responsible for ensuring the high reliability and availability of large-scale production environments and for meeting service level objectives (SLOs), as measured by service level indicators (SLIs). This role involves monitoring systems, responding to incidents for quick resolution, and driving improvements to minimize downtime. You will develop comprehensive monitoring solutions using tools like Kubernetes, Datadog, Prometheus, and Grafana, and support services before they go live with capacity planning and readiness reviews. You will also engage in performance analysis and system tuning, collaborate with development teams to enhance operational stability, and build automation systems for maintaining system health.
Must have:
  • 5-10 years as SRE or Software Engineer in large-scale production
  • Proficiency in Python and Bash scripting
  • Deep knowledge of monitoring systems (Datadog, Prometheus, Grafana)
  • Understanding of CI/CD processes (GitLab preferred)
  • Experience with container orchestration (Docker, Kubernetes)
  • Experience with infrastructure automation (Terraform)
  • Experience with Linux-based infrastructures and administration
  • Strong problem-solving and debugging skills
  • Excellent communication skills
Good to have:
  • Strong understanding of Go and PHP
  • Experience with Helm
  • Experience with automation and system hardening
  • Certified Kubernetes Administrator or Developer
  • HashiCorp Certifications
  • GCP Certifications
Perks:
  • 100% company-paid medical, dental, and vision plans
  • Unlimited Flexible Time Off
  • Personalized career roadmap
  • Professional development, training, and educational opportunities

Job Details

ABOUT US

At Xsolla, we believe that great games begin as ideas, driven by the curiosity, dedication, and grit of creators around the world. Our mission is to empower these visionaries by providing the support and resources they need to bring their games to life. We are committed to leveling the playing field, ensuring that every creator has the opportunity to share their passion with the world. 

Headquartered in Los Angeles, with offices in Berlin, Seoul, and beyond, we partner with industry leaders like Valve, Twitch, and Ubisoft to clear the paths for innovation in gaming. Our global reach spans over 200 geographies, offering more than 700 payment methods in 130+ currencies.

Longevity, Opportunity, Vision, Enjoy the game!

Requirements

    • Proven experience as a Site Reliability Engineer, or in a similar Software Engineering role, in a large-scale production environment (5 to 10 years overall in IT, as Ops or Developer).
    • Proficiency in scripting languages such as Python and Bash. A strong understanding of Go and PHP will be a plus.
    • Deep knowledge of monitoring systems such as Datadog, Prometheus, Grafana.
    • Good understanding of continuous integration/continuous delivery processes and platforms (GitLab preferred). Experience with Helm.
    • Experience with Docker, Kubernetes, or other container orchestration systems.
    • Familiarity with infrastructure automation tools like Terraform.
    • Experience with automation, system administration, and system hardening.
    • Experience with Linux-based infrastructures, Linux/Unix administration.
    • Demonstrated problem-solving skills, particularly debugging and troubleshooting complex software systems. Ability to work under pressure.
    • Excellent communication skills with a capacity to articulate and solve complex technical problems.
    • Xsolla Technology Stack: Ubuntu, Kubernetes, GitLab, Terraform, Terragrunt, Puppet, Nginx, Google Cloud Platform, Datadog, Prometheus, Grafana, ELK, Zabbix, and Harbor.

Responsibilities

    • Ensure high reliability and availability, meeting SLAs and SLOs as tracked through SLIs.
    • Monitor the system for issues and respond to incidents, ensuring quick resolution to maintain high system availability.
    • Drive incident resolution and process improvements to minimize downtime and increase operational transparency.
    • Ensure all key services are measured and monitored, and that they raise alerts when needed.
    • Develop comprehensive monitoring solutions to provide full visibility to the different platform components using tools and services like Kubernetes, Datadog, Prometheus, Grafana and others.
    • Support services before they go live through activities such as capacity planning, monitoring setup, logging, and production readiness reviews.
    • Engage in service capacity planning and demand forecasting, performance analysis, and system tuning.
    • Collaborate with the development teams to enhance the product's operational stability.
    • Build and drive the automation systems that maintain system health.

Education

    • IT professional certifications are not required, but the following will be a plus:
    • Certified Kubernetes Administrator or Developer
    • HashiCorp Certifications
    • GCP Certifications


Benefits:
We are passionate about fostering a supportive environment for our team, so we prioritize the physical, mental, and emotional well-being of our employees and their families through a comprehensive Benefits Program. This includes 100% company-paid medical, dental, and vision plans, unlimited Flexible Time Off, and a personalized career roadmap for each employee. By investing in professional development through training and educational opportunities, we ensure that our team thrives both personally and professionally. Together, we’re not just building a business; we’re cultivating a community that values creativity, collaboration, and the transformative power of play.

By submitting the following job application form, you consent to Xsolla processing your data for career-related inquiries and potential employment opportunities. We process your data in accordance with this Xsolla Privacy Notice for Job Applicants. Please direct any inquiries regarding your data privacy to careers@xsolla.com.
