Production System Engineer, Infrastructure Engineering

2 Minutes ago • 3 Years + • Product Management

Job Summary

Job Description

The Production System Engineer, Infrastructure Engineering, at ByteDance supports the company's rapid growth by building and operating hyperscale datacenters globally. This role involves managing the end-to-end lifecycle of server fleets, from deployment and OS installation to service provisioning, inventory monitoring, troubleshooting, and decommissioning. The engineer will enhance the stability, efficiency, and scalability of data center operations, participate in server fleet lifecycle enhancement, develop automation tools, improve monitoring, and conduct disaster recovery. Cross-team collaboration and on-call support are also key aspects of this role.
Must have:
  • Contribute to enhancing stability, efficiency, effectiveness, and scalability of data center and server operations.
  • Participate in and enhance the entire lifecycle of the server fleet.
  • Develop and deploy tools and solutions for automation, reliability, scalability, and operability.
  • Develop and deploy tools and solutions for improving availability, latency, and overall service of infrastructure.
  • Troubleshoot and resolve complex technical issues in high-pressure environments.
  • Conduct high-level root-cause analysis for service interruption and establish preventive measures.
  • Practice sustainable incident response and postmortem.
  • Collaborate with stakeholders to comprehend business objectives.
  • Engage in on-call support and incident response.
  • Proficiency in Linux system administration tasks.
  • In-depth comprehension of Linux kernels, drivers, and modules.
  • Scripting in Bash and Python to automate routine system operations.
  • In-depth understanding of server hardware, able to conduct troubleshooting or diagnostics.
  • Experience in planning, delivery, and operation of large-scale data centers.
  • Proficient in customizing operation and maintenance tools for new server hardware.
  • Competent in managing the entire software tool lifecycle.
  • Professional proficiency level in English.
  • Experience in managing and coordinating teams in the global context.
Good to have:
  • Intermediate level of expertise in Data Center operations (OS installations, break-fix, planning, new design-build, retrofit).
  • Proficiency in the operation and maintenance of GPU servers.
  • Proficient in full stack software development.
  • Capable of creating and integrating RESTful APIs (e.g., using Flask).
  • Profound understanding of JavaScript and capable of leveraging it with Node.js.
  • Proficiency in SQL for efficient database management.
  • Familiarity with Redis.
  • Experience in Ansible Configuration Management, Application Deployment, and Task Execution.

Job Details

Responsibilities

About the Team

The Infrastructure Engineering team supports the company's fast growth by building and operating hyperscale datacenters. The team manages the end-to-end lifecycle of the server fleet, providing cloud solutions and various infrastructure services and ensuring that they are scalable and reliable. Embark on an exciting expedition to explore the rapidly expanding ByteDance domain in the United States, Europe, and Asia. Here, the Infrastructure Engineering team is crafting monumental data citadels that encircle the planet, sheltering legions of hundreds of thousands of servers. As the maestro of our production systems, you will embark on a captivating odyssey, taming the life cycles of these servers. Your adventure will begin with the orchestration of their initial deployment, navigating the intricate terrain of OS installation, summoning services like a digital magician, and maintaining vigilant watch over our inventory. But, like any epic tale, there will be times of challenge when you become a troubleshooter extraordinaire, mending and restoring with unwavering dedication. Eventually, you'll guide them into the sunset, orchestrating their decommissioning and ensuring their rebirth through recycling, all while contributing to the pulsating rhythm of ByteDance's technological evolution.

Key Responsibilities:

  • Operation: As a Production Systems Engineer, your mission is to contribute to enhancing the stability, efficiency, effectiveness, and scalability of our data center and server operations, platform, and service on a worldwide scale.
  • Lifecycle Enhancement: Participate in and enhance the entire lifecycle of the server fleet - from system design/introduction consultation to launch reviews, deployment, operation, and retirement.
  • Automation: Develop and deploy tools and solutions to enhance the automation, reliability, scalability, and operability of servers in the datacenter.
  • Monitoring: Develop and deploy tools and solutions for improving the availability, latency, and overall service of the datacenter infrastructure, server, and network health.
  • Disaster Recovery: Troubleshoot and resolve complex technical issues in a high-pressure, time-sensitive environment. Conduct high-level root-cause analysis for service interruption and establish preventive measures. Practice sustainable incident response and postmortem.
  • Cross-team Collaboration: Collaborate with stakeholders such as infrastructure architects, project managers, data center operations engineers, platform developers, supply chain teams, and our internal customers to comprehend overarching business objectives. Additionally, you will have the chance to design and implement innovative solutions for our Core IDCs and CDN/Edge.
  • On-call: Engage in our on-call support spanning across regions and incident response teams to address critical issues in the production environment.

Qualifications

Minimum Qualifications

  • Education: Bachelor's degree in Computer Science, Electronic Engineering, a relevant technical field, or equivalent practical experience.
  • Experience: A minimum of 3 years of experience in at least one of the areas below:
  • Server Operations: Demonstrated proficiency in Linux system administration tasks. In-depth comprehension of Linux kernels, drivers, and modules. Capable of scripting in Bash and Python to automate routine system operations, encompassing skills such as system configuration, performance tuning, and security management within the Linux environment. In-depth understanding of server hardware and the ability to conduct troubleshooting or diagnostics. 3+ years of experience participating in the planning, delivery, and operation of large-scale data centers in different countries.
  • Tooling Adaptation, Deployment, and Maintenance: Proficient in customizing operation and maintenance tools to satisfy specific demands for new server hardware. Competent in managing the entire software tool lifecycle, ranging from deployment to continuous maintenance. This encompasses tasks associated with facilitating the monitoring of server performance, effectively provisioning resources, timely handling of fault management, and conducting repairs to guarantee the smooth operation of new server hardware. Over 3 years of experience in developing and maintaining hardware, network, or service monitoring software for more than 10,000 servers.
  • Communication: Professional proficiency level is required in English. Experience in managing and coordinating teams in the global context.

Preferred Qualifications

  • Data Center: An intermediate level of expertise is preferred. We are looking for individuals who are proficient in areas ranging from OS installations and break-fix operations to significant projects such as planning and operations (encompassing the entire infrastructure lifecycle), as well as new design-build or retrofit activities for existing systems.
  • GPU Servers: Proficiency in the operation and maintenance of GPU servers is strongly preferred.
  • Full Stack Software Development: We are actively seeking individuals proficient in full stack software development. Ideal candidates are expected to possess the following preferred skills:
  • Be capable of creating and integrating RESTful APIs. This encompasses expertise in using Flask for Python-based back-end development to establish robust API endpoints.
  • Have a profound understanding of JavaScript and be capable of leveraging it, along with Node.js, for both front-end and back-end development tasks.
  • Demonstrate proficiency in SQL for efficient database management, including designing database schemas, composing queries, and ensuring data integrity; be familiar with Redis.
  • Possess experience in Ansible Configuration Management, Application Deployment, and Task Execution.

About The Company

Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.