Tech Lead - Data Infrastructure Site Reliability

5+ Years • DevOps • $208,800 PA - $438,000 PA

Job Summary

Job Description

The Site Reliability Engineering (SRE) team at ByteDance is seeking a Technical Lead to build and operate large-scale, distributed systems with high reliability and efficiency. This role involves applying expertise in coding, algorithms, complexity analysis, and system design to solve scaling and reliability challenges. The lead will provide deep technical leadership, drive architectural improvements, and collaborate with engineering, product, data, and infrastructure teams to deliver resilient, scalable platforms, focusing on automation, troubleshooting, and promoting best practices.
Must have:
  • Strong hands-on skills in the design, development, and operation of large-scale cloud infrastructure and distributed systems.
  • Collaborate with cross-functional teams to drive system reliability, performance, and scalability.
  • Lead initiatives to automate operations, eliminate toil, and improve overall system efficiency.
  • Troubleshoot complex production issues, perform root-cause analysis, and drive long-term reliability improvements.
  • Promote best practices in system design, observability, performance optimization, and cost efficiency.
  • Communicate complex technical concepts effectively to both technical and non-technical stakeholders.
  • 5+ years of experience in Site Reliability Engineering, Software Development, or related fields.
  • Deep hands-on expertise in at least one of the following: databases (SQL/NoSQL), Kubernetes or container orchestration, or Big Data processing and storage systems.
  • Strong knowledge of system architecture, distributed systems, and performance bottlenecks.
  • Excellent communication and collaboration skills.
Good to have:
  • Proven track record of driving automation, tooling, and process improvements that enhance reliability and efficiency.
  • Experience in cost optimization and performance tuning at scale, backed by data-driven decision making.
  • Thought leadership in adopting new technologies, improving operational practices, and influencing system design.
Perks:
  • Additional discretionary bonuses/incentives
  • Restricted stock units
  • Day one access to medical, dental, and vision insurance
  • 401(k) savings plan with company match
  • Paid parental leave
  • Short-term and long-term disability coverage
  • Life insurance
  • Wellbeing benefits
  • 10 paid holidays per year
  • 10 paid sick days per year
  • 17 days of Paid Personal Time (prorated upon hire with increasing accruals by tenure).

Job Details

Responsibilities

Team Introduction: Our Site Reliability Engineering (SRE) team combines software and systems to build and operate large-scale, distributed systems with high reliability and efficiency. In this role, you’ll apply your expertise in coding, algorithms, complexity analysis, and system design to solve scaling and reliability challenges. We’re looking for a Technical Lead (SRE) who can provide deep technical leadership, drive architectural improvements, and collaborate effectively across multiple organizations. You’ll partner with engineering, product, data, and infrastructure teams to deliver resilient, scalable platforms. This is a highly technical, hands-on role that requires strong problem-solving ability, clear communication, and the ability to influence without formal authority.

  • Apply strong hands-on skills to the design, development, and operation of large-scale cloud infrastructure and distributed systems.
  • Collaborate with cross-functional teams (e.g., Advertising, Machine Learning, E-commerce, and Core Infra) to drive system reliability, performance, and scalability.
  • Lead initiatives to automate operations, eliminate toil, and improve overall system efficiency.
  • Troubleshoot complex production issues, perform root-cause analysis, and drive long-term reliability improvements.
  • Promote best practices in system design, observability, performance optimization, and cost efficiency.
  • Communicate complex technical concepts effectively to both technical and non-technical stakeholders.

Qualifications

Minimum Qualifications:

  • 5+ years of experience in Site Reliability Engineering, Software Development, or related fields, with a strong focus on designing, building, scaling, and operating cloud-based systems.
  • Deep hands-on expertise in at least one of the following areas:
      • Databases (SQL/NoSQL)
      • Kubernetes or container orchestration
      • Big Data processing and storage systems (streaming and batch)
  • Strong knowledge of system architecture, distributed systems, and performance bottlenecks.
  • Excellent communication and collaboration skills, with experience working across engineering, product, and data science teams.

Preferred Qualifications:

  • Proven track record of driving automation, tooling, and process improvements that enhance reliability and efficiency.
  • Experience in cost optimization and performance tuning at scale, backed by data-driven decision making.
  • Thought leadership in adopting new technologies, improving operational practices, and influencing system design.

About The Company

Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.