Site Reliability Engineer (Mid/Senior)
Razer
Job Summary
Razer is seeking a Site Reliability Engineer (SRE) to join their AI Software team. This role focuses on ensuring the reliability, performance, scalability, and operational excellence of AI products, model-serving infrastructure, and backend API systems. The SRE will collaborate with software and AI teams to automate operations, enhance observability, and streamline deployments in a cloud-scale environment, building resilient systems and supporting AI workloads in production. Razer offers a global mission to revolutionize gaming and a unique, gamer-centric work experience for accelerated personal and professional growth.
Must Have
- 4+ years of experience in SRE, DevOps, infrastructure engineering, or cloud operations
- Experience operating production services with significant availability or scaling demands
- Strong knowledge of Web Technologies such as HTTP, REST, SSL, Load Balancers, Web Proxies (NGINX)
- Comfortable with Linux and Docker administration
- Basic knowledge of AWS, CI/CD (Jenkins), IaC (Terraform), Container Orchestration (AWS ECS or K8s), Version Control (Git), Databases (MySQL, NoSQL)
- Strong ability to code and script (preferably Bash scripting and Python)
- Ability to use or quickly pick up a wide variety of open source technologies and automation tools
- Understanding of GPU-based workloads and resource scheduling
- Familiarity with vector databases, embeddings, and inference pipelines
- Comfort with frequent, incremental code testing and deployment
- Good analytical skills to debug deployment problems without developer help
- Deep hands-on technical expertise and problem-solving skills
- Ability to work in a collaborative, technically challenging environment with rapidly changing requirements
- Bachelor’s or Master’s degree in computer science, AI or similar discipline
Perks & Benefits
- Opportunity to make a global impact
- Work with a global team spread across 5 continents
- Unique, gamer-centric #LifeAtRazer experience
- Accelerated personal and professional growth
- Certified as a Great Place to Work® in the United States and Singapore
Job Description
Job Responsibilities:
We are looking for Site Reliability Engineers (SRE) to join our AI Software team. In this role, you will ensure the reliability, performance, scalability, and operational excellence of AI products, model-serving infrastructure, and backend API systems. You’ll work closely with software engineers, AI teams and release teams to automate operations, enhance observability, and streamline deployments in a cloud-scale environment. This role is ideal for someone who enjoys building resilient systems, solving complex infrastructure problems, and supporting AI workloads in production.
Essential Duties and Responsibilities
- Administer, monitor, and manage cloud-scale production environments for AI model APIs, backend services, and high-traffic web systems serving global users.
- Design and implement fault-tolerant, autoscaling cloud architectures tailored for AI inference workloads, including GPU-based environments and software products.
- Build automated self-recovery systems to ensure high availability, rapid failover, and cost-efficient resource usage for all software products.
- Manage and monitor AI model-serving platforms, inference engines, vector databases, data pipelines, and software applications.
- Ensure reliability and uptime for experimental and production AI software environments.
- Implement and maintain comprehensive monitoring, logging, and alerting for all AI and backend services.
- Reduce MTTR through actionable alerts, runbooks, and automated diagnostics.
- Automate infrastructure using IaC (Terraform/CloudFormation) and configuration management.
- Improve release workflows and integrate with QA for smooth handoff to Release Candidate testing.
- Work closely with software engineering, ML engineering, and release management to enhance operational procedures, deployment processes, and incident response workflows.
- Participate in on-call rotations, incident reviews, and continuous improvement initiatives.
Pre-Requisites:
Qualifications
- 4+ years of relevant experience in SRE, DevOps, infrastructure engineering, or cloud operations
- Experience operating production services with significant availability or scaling demands.
- Strong knowledge of Web Technologies such as HTTP, REST, SSL, Load Balancers, Web Proxies (NGINX)
- Comfortable with Linux and Docker administration
- Basic knowledge of AWS, CI/CD (Jenkins), IaC (Terraform), Container Orchestration (AWS ECS or K8s), Version Control (Git), Databases (MySQL, NoSQL)
- Strong ability to code and script (preferably Bash scripting and Python)
- Ability to use or quickly pick up a wide variety of open source technologies and automation tools
- Understanding of GPU-based workloads and resource scheduling.
- Familiarity with vector databases, embeddings, and inference pipelines
- Comfort with frequent, incremental code testing and deployment
- Good analytical skills to debug deployment problems without help from developers
- Deep hands-on technical expertise and problem-solving skills
- Ability to work in a collaborative, technically challenging environment with rapidly changing requirements.
Education & Experience
- Bachelor’s or Master’s degree in computer science, AI or similar discipline from an accredited institution
Travel Requirements
- Role is based in the Singapore office and may require up to one trip per year.