Responsibilities:
About Tencent Overseas IT:
Tencent Overseas IT's mission is to empower Tencent’s rapid global growth with future-ready, global IT platforms, applications, and services. We are chartered to lead the Overseas IT strategy, architecture, roadmap, and execution. Our top aspirations are to satisfy our internal and external customers and to become a world-class global IT team.
We are seeking a Sr. Site Reliability Engineer with extensive cloud and on-prem SRE design and implementation experience.
Duties and Responsibilities:
This senior role will work closely with our internal IT teams and cloud providers to design the best global SRE architecture and solutions in the cloud. The role will also support studio infrastructure, game publishing infrastructure, and their evolution to the cloud. Our customers include internal or acquired gaming studios, game publishing services, innovative offices/workplaces, various business groups, and external customers. The work scope includes understanding internal customers’ business requirements, collecting technical requirements, developing reference architectures and prototypes based on industry best practices, leading implementation and deployment for global locations, and troubleshooting issues when necessary.
For this SRE job, you will:
• Design, implement, and support the operations and reliability of large-scale, cloud-enabled studio infrastructure, with a focus on performance at scale, real-time monitoring, logging, analysis, and alerting.
• Maintain services once they go live by measuring and monitoring availability, latency, and overall system health.
• Design and develop robust and scalable products and tools to enhance operational efficiency.
• Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
• Participate in incident response and troubleshooting efforts to minimize downtime and ensure system reliability.
• Maintain project and product documentation and knowledge.
• Be part of an on-call rotation to support production systems (if needed).
Based in Shanghai, China, this person will work closely with the global IT team and HQ teams.
Whom we are looking for:
- A quick learner
- A positive, self-motivated, and passionate person
- Independent, persistent, and open-minded.
- A great team player who is both dependable and autonomous.
- Customer-oriented and able to work at a very fast pace.
Requirements:
- 5+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production
- In-depth knowledge and understanding of monitoring concepts, alert mechanisms, log monitoring, anomaly detection, and the creation and setup of dashboards
- In-depth knowledge of and experience with Elasticsearch and Prometheus
- Expertise in configuration management with frameworks such as Ansible, Terraform, or Helm
- Proficiency in programming languages such as Python and Golang, as well as shell scripting, to automate tasks
- Passion for infrastructure and monitoring as code
- Bachelor’s degree (or higher) in Computer Science, Mathematics, or a related science or engineering major
- Solid understanding of cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes)
- Good understanding of and hands-on experience with networking is a plus
- Bilingual preferred (English, Chinese)