Senior Distributed Storage SRE Engineer

Tencent

Job Summary

Tencent is seeking a Senior Distributed Storage SRE Engineer to manage the daily operations and stability of distributed storage systems. This role involves ensuring system SLA, designing disaster recovery solutions, optimizing performance, and improving resource efficiency. The engineer will also develop tools for operational efficiency, respond to and resolve online incidents, and implement fault recovery strategies. Candidates should have a Bachelor's degree in Computer Science or related field, experience with Unix/Linux, networking, storage system troubleshooting, and programming in Shell, Python, or Go.

Must Have

  • Responsible for daily operation and maintenance of distributed storage systems.
  • Ensure stability of block storage and guarantee system SLA.
  • Design and implement disaster recovery solutions.
  • Promote service reliability, scalability, and performance optimization.
  • Manage and plan block storage resources to improve efficiency.
  • Participate in O&M support platform construction and tool development.
  • Quickly respond to online incidents, debug, and solve faults.
  • Implement emergency plans and fault recovery strategies.
  • Bachelor’s degree in Computer Science or related technical field.
  • Experience with Unix/Linux operating systems internals.
  • Experience with networking or cloud systems.
  • Experience analyzing and troubleshooting storage systems.
  • Experience programming in Shell, Python, or Go.

Good to Have

  • Experience designing or managing large-scale distributed storage systems.
  • Understanding of distributed system principles and open-source storage systems (NAS, HDFS, Ceph).
  • Familiarity with cloud products and practical experience in block storage.
  • Experience with SRE jobs (online release, monitoring, daily inspection) and script programming.
  • Strong sense of responsibility and timely problem-solving ability.

Perks & Benefits

  • Medical benefits
  • Dental benefits
  • Vision benefits
  • Life and disability benefits
  • Participation in the Company’s 401(k) plan
  • Up to 15 to 25 days of vacation per year (depending on tenure)
  • Up to 13 days of holidays throughout the calendar year
  • Up to 10 days of paid sick leave per year
  • Sign on payment (case-by-case basis)
  • Relocation package (case-by-case basis)
  • Restricted stock units (case-by-case basis)

Job Description

What the Role Entails

1. Responsible for the daily operation and maintenance of distributed storage systems (e.g., online release, software deployment, monitoring, inspection).

2. Responsible for the stability of block storage: design and implement disaster recovery solutions, drive improvements in service reliability, scalability, and performance, and guarantee the system SLA.

3. Responsible for resource management and planning of block storage and related systems to improve resource efficiency.

4. Participate in the construction of the operation and maintenance support platform, develop tools, and improve operational efficiency.

5. Respond quickly to online incidents; discover, debug, and resolve common faults, latent risks, and performance problems; and take responsibility for implementing emergency plans and fault recovery strategies.

Who We Look For

1. Bachelor’s degree in Computer Science or a related technical field, or equivalent practical experience.

2. Experience with Unix/Linux operating system internals (e.g., filesystems, storage devices) and with networking (e.g., TCP/IP, routing) or cloud systems.

3. Experience with analyzing and troubleshooting storage systems.

4. Experience programming in one or more of the following: Shell, Python, Go, etc.

Preferred qualifications:

1. Experience in designing or managing large-scale distributed storage systems, an understanding of distributed system principles, and familiarity with open-source distributed storage systems (e.g., NAS, HDFS, Ceph).

2. Familiarity with cloud products, practical experience in block storage, and the ability to handle common block storage-related problems.

3. Experience with SRE jobs (e.g., online release, monitoring, daily inspection) and script programming.

4. Strong sense of responsibility and the ability to respond to and deal with problems in a timely manner.
