Production System Engineer, Infrastructure Engineering

Bytedance

Singapore (On-site)

5+ Years • Product Management

Job Summary

Job Description

The Infrastructure Engineering team at ByteDance builds and operates hyperscale datacenters globally, managing the end-to-end lifecycle of server fleets. As a Production System Engineer, you will enhance the stability, efficiency, effectiveness, and scalability of data center and server operations worldwide. This involves orchestrating server deployment, OS installation, service provisioning, inventory management, troubleshooting, and decommissioning, contributing to ByteDance's technological evolution.

Must have:

Enhance stability, efficiency, effectiveness, and scalability of data center and server operations, platform, and service on a worldwide scale
Participate in and enhance the entire lifecycle of the server fleet
Develop and deploy tools and solutions to enhance automation, reliability, scalability, and operability of servers in the datacenter
Develop and deploy tools and solutions for improving availability, latency, and overall service of the datacenter infrastructure, server, and network health
Troubleshoot and resolve complex technical issues in a high-pressure, time-sensitive environment
Conduct high-level root-cause analysis for service interruption and establish preventive measures
Practice sustainable incident response and postmortem
Collaborate with stakeholders such as infrastructure architects, project managers, data center operations engineers, platform developers, supply chain teams, and internal customers
Design and implement innovative solutions for Core IDCs and CDN/Edge
Engage in on-call support spanning across regions and incident response teams
Demonstrated proficiency in Linux system administration tasks
In-depth comprehension of Linux kernels, drivers, and modules
Capable of scripting in Bash and Python to automate routine system operations
In-depth understanding of server hardware, and able to conduct troubleshooting or diagnostics
Experience participating in the planning, delivery, and operation of large-scale data centers
Proficient in customizing operation and maintenance tools to satisfy specific demands for new server hardware
Competent in managing the entire software tool lifecycle, ranging from deployment to continuous maintenance
Professional proficiency in English
Experience in managing and coordinating teams in the global context

Good to have:

Intermediate expertise in Data Center operations (OS installations, break-fix, planning, design-build, retrofit)
Proficiency in operation and maintenance of GPU servers
Proficient in full stack software development
Capable of creating and integrating RESTful APIs
Expertise in using Flask for Python-based back-end development
Profound understanding of JavaScript
Capable of leveraging Node.js for front-end and back-end development
Proficiency in SQL for database management
Familiarity with Redis
Experience in Ansible Configuration Management, Application Deployment, and Task Execution

17 skills required for this role

team-management

problem-solving

game-texts

incident-response

linux

ansible

node.js

redis

front-end

flask

back-end

retrofit

python

sql

bash

javascript

system-design

Job Details

Responsibilities

About the Team

The Infrastructure Engineering team supports the company's fast growth by building and operating hyperscale datacenters. The team manages the end-to-end lifecycle of the server fleet, providing cloud solutions and various infrastructure services and ensuring that they are scalable and reliable.

Embark on an exciting expedition to explore the rapidly expanding ByteDance domain in the United States, Europe, and Asia. Here, the Infrastructure Engineering team is crafting monumental data citadels that encircle the planet, sheltering hundreds of thousands of servers. As the maestro of our production systems, you will tame the life cycles of these servers. Your adventure will begin with the orchestration of their initial deployment, navigating the intricate terrain of OS installation, summoning services like a digital magician, and maintaining vigilant watch over our inventory. But, like any epic tale, there will be times of challenge when you become a troubleshooter extraordinaire, mending and restoring with unwavering dedication. Eventually, you'll guide them into the sunset, orchestrating their decommissioning and ensuring their rebirth through recycling, all while contributing to the pulsating rhythm of ByteDance's technological evolution.

Key Responsibilities:

- Operation: As a Production System Engineer, your mission is to contribute to enhancing the stability, efficiency, effectiveness, and scalability of our data center and server operations, platform, and service on a worldwide scale.
- Lifecycle Enhancement: Participate in and enhance the entire lifecycle of the server fleet, from system design/introduction consultation to launch reviews, deployment, operation, and retirement.
- Automation: Develop and deploy tools and solutions to enhance the automation, reliability, scalability, and operability of servers in the datacenter.
- Monitoring: Develop and deploy tools and solutions for improving the availability, latency, and overall service of the datacenter infrastructure, server, and network health.
- Disaster Recovery: Troubleshoot and resolve complex technical issues in a high-pressure, time-sensitive environment. Conduct high-level root-cause analysis for service interruption and establish preventive measures. Practice sustainable incident response and postmortem.
- Cross-team Collaboration: Collaborate with stakeholders such as infrastructure architects, project managers, data center operations engineers, platform developers, supply chain teams, and our internal customers to comprehend overarching business objectives. Additionally, you will have the chance to design and implement innovative solutions for our Core IDCs and CDN/Edge.
- On-call: Engage in our on-call support spanning across regions and incident response teams to address critical issues in the production environment.

Qualifications

Minimum Qualifications:

- Education: Bachelor's degree in Computer Science, Electronic Engineering, a relevant technical field, or equivalent practical experience.
- Experience: A minimum of 5 years of experience in at least one of the areas below:
  - Server Operations: Demonstrated proficiency in Linux system administration tasks. In-depth comprehension of Linux kernels, drivers, and modules. Capable of scripting in Bash and Python to automate routine system operations, encompassing skills such as system configuration, performance tuning, and security management within the Linux environment. In-depth understanding of server hardware, with the ability to conduct troubleshooting and diagnostics. 5+ years of experience participating in the planning, delivery, and operation of large-scale data centers in different countries.
  - Tooling Adaptation, Deployment, and Maintenance: Proficient in customizing operation and maintenance tools to satisfy specific demands for new server hardware. Competent in managing the entire software tool lifecycle, ranging from deployment to continuous maintenance. This encompasses monitoring server performance, provisioning resources effectively, handling faults in a timely manner, and conducting repairs to guarantee the smooth operation of new server hardware. Over 3 years of experience in developing and maintaining hardware, network, or service monitoring software for more than 10,000 servers.
- Communication: Professional proficiency in English is required. Experience in managing and coordinating teams in a global context.

Preferred Qualifications:

- Data Center: An intermediate level of expertise is preferred. We are looking for individuals who are proficient in areas ranging from OS installations and break-fix operations to significant projects such as planning and operations (encompassing the entire infrastructure lifecycle), as well as new design-build or retrofit activities for existing systems.
- Proficiency in the operation and maintenance of GPU servers is strongly preferred.
- Full Stack Software Development: We are actively seeking individuals proficient in full stack software development. The ideal candidates are expected to possess the following preferred skills:
  - Be capable of creating and integrating RESTful APIs. This encompasses expertise in using Flask for Python-based back-end development to establish robust API endpoints.
  - Have a profound understanding of JavaScript and be capable of leveraging it, along with Node.js, for both front-end and back-end development tasks.
  - Demonstrate proficiency in SQL for efficient database management, including designing database schemas, composing queries, and ensuring data integrity; be familiar with Redis.
  - Possess experience in Ansible Configuration Management, Application Deployment, and Task Execution.

About The Company

ByteDance

Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.
