Joining Razer will place you on a global mission to revolutionize the way the world games. Razer is a place to do great work, offering you the opportunity to make an impact globally while working with a team located across 5 continents. Razer is also a great place to work, providing you the unique, gamer-centric #LifeAtRazer experience that will put you on a path of accelerated growth, both personally and professionally.
Job Responsibilities :
We are seeking a skilled and driven Site Reliability Engineer (SRE) to join our growing infrastructure and platform engineering team. The ideal candidate will have hands-on experience in Amazon Web Services (AWS), strong troubleshooting capabilities, and a passion for building scalable, observable, and resilient systems using modern Infrastructure as Code (IaC) and automation tools.
REQUIREMENTS:
- Bachelor’s degree in Computer Science, Software Engineering, Information Technology, or a related field.
- Minimum 2 years of experience in SRE, DevOps, Cloud Infrastructure, or Systems Administration roles.
- Solid hands-on experience with AWS Cloud services, including (but not limited to):
  - Compute: EC2, Lambda, ECS, Auto Scaling
  - Networking: VPC, Load Balancers, Route 53
  - Messaging & Storage: SQS, S3, RDS, ElastiCache, SES
  - Monitoring: CloudWatch, X-Ray
- Proficient in Infrastructure as Code using Terraform and/or CloudFormation.
- Experience with CI/CD tools (e.g., GitLab CI, Jenkins, CodePipeline, ArgoCD).
- Strong understanding of Linux and Windows system administration and troubleshooting.
- Comfortable with one or more scripting/programming languages (such as Python, Node.js, Bash, or Ruby) and with data formats such as JSON/YAML for automation.
- Strong grasp of network fundamentals, including DNS, HTTP(S), TLS/SSL, firewalls, and TCP/IP.
- Experience with containerization and orchestration (Docker, ECS, or Kubernetes is a plus).
- Familiar with observability tools and incident management best practices.
JOB DESCRIPTION:
- Design, develop, and maintain Infrastructure as Code (IaC) using tools like Terraform or AWS CloudFormation.
- Implement and operate reliable, scalable cloud infrastructure primarily on AWS (e.g., EC2, ECS, RDS, S3, Lambda, ElastiCache, SQS, SES, Auto Scaling, Load Balancers).
- Lead and participate in architecture reviews focusing on reliability, scalability, security, and performance.
- Develop and manage robust monitoring, alerting, and logging solutions (e.g., CloudWatch, Prometheus, Grafana, ELK) to detect and resolve issues proactively.
- Perform incident management, postmortems, and root cause analysis, and implement continuous improvement strategies.
- Collaborate with software engineering teams to improve CI/CD pipelines, deployment automation, and release management.
- Automate infrastructure operations, reduce manual toil, and improve reliability using scripting (Python, Bash, Node.js, or Ruby).
- Maintain and troubleshoot environments involving web servers, databases, firewalls, DNS, load balancers, and networking.
- Ensure systems are compliant with security standards, including patching, hardening, and secure access policies.
- Provide on-call support and participate in incident rotations.
- Monitor and maintain service-level objectives (SLOs), service-level agreements (SLAs), and error budgets to ensure reliability targets are met.
- Support the 5:00 PM to 2:00 AM (UTC+8) shift to ensure continuous SRE coverage.
- Undergo an initial familiarization period during regular working hours before transitioning to the designated shift.
- Provide support and resolution for assigned incidents and tickets.
Are you game?