Senior Site Reliability Engineer

Tencent

5+ Years | Shanghai, China (On Site) | Full Time | 16 months ago

Job Summary

Senior Site Reliability Engineer with 5+ years of experience in cloud and on-prem SRE design and implementation. Must have expertise in infrastructure automation, distributed systems, and cloud platforms such as AWS, Azure, and GCP. Strong knowledge of monitoring, logging, and configuration management is essential.

Must Have

Infrastructure Automation
Distributed Systems
Cloud Platforms
Monitoring Concepts

Good to Have

Containerization Tech
Network Experience
Elasticsearch
Prometheus

Perks & Benefits

Global IT Team
Fast-Paced Environment

Job Description

Responsibilities:

About Tencent Overseas IT:
Tencent Overseas IT has the mission to empower Tencent’s rapid global growth with future-ready, global IT platforms, applications, and services. We are chartered to lead the Overseas IT strategy, architecture, roadmap, and execution. Satisfying our internal/external customers and becoming a world-class global IT team are our top aspirations.

We are seeking a Sr. Site Reliability Engineer with extensive cloud and on-prem SRE design and implementation experience.

Duties and Responsibilities:
This senior role will work closely with our internal IT teams and cloud providers to design the best global SRE architecture and solutions in the cloud. The role will also support studio infrastructure, game publishing infrastructure, and their evolution to the cloud. Our customers include internal or acquired gaming studios, game publishing services, innovative offices/workplaces, various business groups, and external customers. The work scope includes understanding internal customers’ business requirements, collecting technical requirements, developing reference architectures and prototypes based on industry best practices, leading implementation and deployment for global locations, and troubleshooting issues when necessary.

For this SRE job, you will:
• Design, implement, and support the operations and reliability of large-scale cloud-enabled studios, with a focus on performance at scale and on real-time monitoring, logging, analysis, and alerting.
• Maintain services once they go live by measuring and monitoring availability, latency, and overall system health.
• Design and develop robust and scalable products and tools to enhance operational efficiency.
• Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
• Participate in incident response and troubleshooting efforts to minimize downtime and ensure system reliability.
• Maintain project and product documentation and knowledge.
• Be part of an on-call rotation to support production systems (if needed).

Based in Shanghai, China, this person will work closely with the global IT team and HQ teams.

Whom we are looking for:

A quick learner.
A positive, self-motivated, and passionate person.
Independent, insistent, and open-minded.
A great team player who is both dependable and autonomous.
Customer-oriented and able to work at a very fast pace.

Requirements:

5+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production.
In-depth knowledge and understanding of monitoring concepts, alert mechanisms, log monitoring, anomaly detection, and the creation and setup of dashboards.
In-depth knowledge of and experience with Elasticsearch and Prometheus.
Expertise in configuration management with a framework such as Ansible, Terraform, or Helm.
Proficiency with programming languages such as Python and Golang, as well as shell scripting, to automate tasks.
Passion for infrastructure and monitoring as code.

Bachelor’s degree (or higher) in Computer Science, Mathematics, or a related science or engineering major.
Solid understanding of cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
Good understanding of and hands-on experience with networking is a plus.
Bilingual preferred (English, Chinese).

17 Skills Required For This Role

Problem Solving
Team Player
Game Texts
Prototyping
Incident Response
AWS
Azure
Prometheus
Ansible
Terraform
Helm
Docker
Kubernetes
Python
Shell
CSS
System Design