Production System Engineer - DSI

ByteDance

Job Summary

The Data Systems Infrastructure (DSI) team at ByteDance is seeking a Production System Engineer to enhance the stability, efficiency, and scalability of global data center and server operations. This role involves managing the entire server fleet lifecycle, from design and deployment to troubleshooting and decommissioning. The engineer will develop automation tools, improve monitoring, lead disaster recovery, and collaborate with stakeholders across teams to build and maintain large-scale data centers as ByteDance expands in the US, Europe, and Asia.

Must Have

  • Contribute to enhancing stability, efficiency, effectiveness, and scalability of data center and server operations.
  • Participate in and enhance the entire lifecycle of the server fleet.
  • Develop and deploy tools and solutions to enhance automation, reliability, scalability, and operability of servers.
  • Develop and deploy tools and solutions for improving availability, latency, and overall service of datacenter infrastructure.
  • Troubleshoot and resolve complex technical issues, conduct root-cause analysis, and establish preventive measures.
  • Collaborate with infrastructure architects, project managers, data center operations engineers, platform developers, and supply chain teams.
  • Engage in on-call support and incident response.
  • Bachelor's degree in Computer Science, Electronic Engineering, or relevant technical field.
  • 5+ years of experience in Server Operations.
  • Demonstrated proficiency in Linux system administration tasks, Linux kernels, drivers, and modules.
  • Capable of scripting in Bash and Python to automate routine system operations.
  • In-depth understanding of server hardware and ability to perform troubleshooting and diagnostics.
  • Over 5 years of experience in planning, implementation, and operation of large-scale data centers.
  • Experience in managing and coordinating teams in the global context.

Good to Have

  • Data Center experience.
  • Proficiency in the operation and maintenance of GPU servers.

Job Description

Responsibilities

The Data Systems Infrastructure (DSI) team is the unseen architect behind ByteDance's global technology platform. We support the company's rapid growth by building and operating large-scale data centers, managing the life cycle of server fleets, delivering cloud solutions, and providing a full suite of infrastructure services. Our mission is to ensure scalability and unwavering reliability across ByteDance's digital footprint.

ByteDance is expanding in the US, Europe, and Asia, and the DSI team is building data centers for hundreds of thousands of servers. As the owner of production systems, you'll manage server life cycles, from deployment and troubleshooting through decommissioning and recycling, and contribute to ByteDance's technical evolution.

Responsibilities:

  • Operation: As a Production Systems Engineer, your mission is to contribute to enhancing the stability, efficiency, effectiveness, and scalability of our data center and server operations, platform, and service on a worldwide scale.
  • Lifecycle Enhancement: Participate in and enhance the entire lifecycle of the server fleet - from system design/introduction consultation to launch reviews, deployment, operation, and retirement.
  • Automation: Develop and deploy tools and solutions to enhance the automation, reliability, scalability, and operability of servers in the data center.
  • Monitoring: Develop and deploy tools and solutions that improve the availability, latency, and overall service quality of data center infrastructure, and that track server and network health.
  • Disaster Recovery: Troubleshoot and resolve complex technical issues in a high-pressure, time-sensitive environment. Conduct high-level root-cause analysis for service interruptions and establish preventive measures. Practice sustainable incident response and postmortems.
  • Cross-team Collaboration: Collaborate with stakeholders such as infrastructure architects, project managers, data center operations engineers, platform developers, supply chain teams, and our internal customers to comprehend overarching business objectives. Additionally, you will have the chance to design and implement innovative solutions for our Core IDCs and CDN/Edge.
  • On-call: Participate in our cross-region on-call rotation and work with incident response teams to address critical issues in the production environment.

Qualifications

Minimum Qualifications:

  • Bachelor's degree in Computer Science, Electronic Engineering, relevant technical field, or equivalent practical experience.
  • 5+ years of experience in Server Operations, including the qualifications below.
  • Demonstrated proficiency in Linux system administration tasks, with an in-depth understanding of Linux kernels, drivers, and modules.
  • Capable of scripting in Bash and Python to automate routine system operations, including system configuration, performance tuning, and security management in the Linux environment.
  • Has an in-depth understanding of server hardware and is able to perform troubleshooting and diagnostics on complex faults.
  • Has over 5 years of experience participating in the planning, implementation, and operation of large-scale data centers in different countries.
  • Experience in managing and coordinating teams in the global context.

Preferred Qualifications:

  • Data Center experience is preferred. We are seeking individuals who are proficient in areas spanning from operating system installations and break-fix operations to substantial projects such as planning and operations (covering the entire infrastructure lifecycle), as well as new design-build or retrofit activities for existing systems.
  • Proficiency in the operation and maintenance of GPU servers is highly preferred.

