Site Reliability Engineer (Mid/Senior)

Razer

Job Summary

Razer is seeking a Site Reliability Engineer (SRE) to join their AI Software team. This role focuses on ensuring the reliability, performance, scalability, and operational excellence of AI products, model-serving infrastructure, and backend API systems. The SRE will collaborate with software and AI teams to automate operations, enhance observability, and streamline deployments in a cloud-scale environment, building resilient systems and supporting AI workloads in production. Razer offers a global mission to revolutionize gaming and a unique, gamer-centric work experience for accelerated personal and professional growth.

Must Have

  • 4+ years of experience in SRE, DevOps, infrastructure engineering, or cloud operations
  • Experience operating production services with significant availability or scaling demands
  • Strong knowledge of Web Technologies such as HTTP, REST, SSL, Load Balancers, Web Proxies (NGINX)
  • Comfortable with Linux and Docker administration
  • Basic knowledge of AWS, CI/CD (Jenkins), IaC (Terraform), container orchestration (AWS ECS or K8s), version control (Git), and databases (MySQL, NoSQL)
  • Strong ability to code and script (preferably Bash scripting and Python)
  • Ability to use or quickly pick up a wide variety of open source technologies and automation tools
  • Understanding of GPU-based workloads and resource scheduling
  • Familiarity with vector databases, embeddings, and inference pipelines
  • Comfort with frequent, incremental code testing and deployment
  • Good analytical skills to debug deployment problems without developer help
  • Deep hands-on technical expertise and problem-solving skills
  • Ability to work in a collaborative, technically challenging environment with rapidly changing requirements
  • Bachelor’s or Master’s degree in computer science, AI, or a similar discipline

Perks & Benefits

  • Opportunity to make a global impact
  • Work with a global team spread across 5 continents
  • Unique, gamer-centric #LifeAtRazer experience
  • Accelerated personal and professional growth
  • Certified as a Great Place to Work® in the United States and Singapore

Job Description

Job Responsibilities:

We are looking for Site Reliability Engineers (SRE) to join our AI Software team. In this role, you will ensure the reliability, performance, scalability, and operational excellence of AI products, model-serving infrastructure, and backend API systems. You’ll work closely with software engineers, AI teams and release teams to automate operations, enhance observability, and streamline deployments in a cloud-scale environment. This role is ideal for someone who enjoys building resilient systems, solving complex infrastructure problems, and supporting AI workloads in production.

Essential Duties and Responsibilities

  • Administer, monitor, and manage cloud-scale production environments for AI model APIs, backend services, and high-traffic web systems serving global users.
  • Design and implement fault-tolerant, autoscaling cloud architectures tailored for AI inference workloads, including GPU-based environments and software products.
  • Build automated self-recovery systems to ensure high availability, rapid failover, and cost-efficient resource usage for all software products.
  • Manage and monitor AI model-serving platforms, inference engines, vector databases, data pipelines, and software applications.
  • Ensure reliability and uptime for both experimental and production AI software environments.
  • Implement and maintain comprehensive monitoring, logging, and alerting for all AI and backend services.
  • Reduce MTTR through actionable alerts, runbooks, and automated diagnostics.
  • Automate infrastructure using IaC (Terraform/CloudFormation) and configuration management.
  • Improve release workflows and integrate with QA for smooth handoff to Release Candidate testing.
  • Work closely with software engineering, ML engineering, and release management to enhance operational procedures, deployment processes, and incident response workflows.
  • Participate in on-call rotations, incident reviews, and continuous improvement initiatives.

Prerequisites:

Qualifications

  • 4+ years of relevant experience in SRE, DevOps, infrastructure engineering, or cloud operations
  • Experience operating production services with significant availability or scaling demands
  • Strong knowledge of Web Technologies such as HTTP, REST, SSL, Load Balancers, Web Proxies (NGINX)
  • Comfortable with Linux and Docker administration
  • Basic knowledge of AWS, CI/CD (Jenkins), IaC (Terraform), container orchestration (AWS ECS or K8s), version control (Git), and databases (MySQL, NoSQL)
  • Strong ability to code and script (preferably Bash scripting and Python)
  • Ability to use or quickly pick up a wide variety of open source technologies and automation tools
  • Understanding of GPU-based workloads and resource scheduling
  • Familiarity with vector databases, embeddings, and inference pipelines
  • Comfort with frequent, incremental code testing and deployment
  • Good analytical skills to debug deployment problems without help from developers
  • Deep hands-on technical expertise and problem-solving skills
  • Ability to work in a collaborative, technically challenging environment with rapidly changing requirements

Education & Experience

  • Bachelor’s or Master’s degree in computer science, AI, or a similar discipline from an accredited institution

Travel Requirements

  • Role is based in the Singapore office and may require up to 1 trip per year.

17 Skills Required For This Role

GitHub, Game Texts, Quality Control, Release Management, MySQL, NGINX, Incident Response, Linux, AWS, NoSQL, Terraform, CI/CD, Docker, Git, Python, Bash, Jenkins
