Senior Site Reliability Engineer

2 Hours ago • 5 Years +

Job Summary

Job Description

As a Senior Site Reliability Engineer at Reddit, you will enhance the reliability and performance of Reddit's engineering platforms. You will work closely with infrastructure and development teams, utilizing open-source solutions like Prometheus, Thanos, and Grafana. Your responsibilities include risk management, collaborating with teams, and implementing best practices to ensure system resilience and optimize service delivery. You will advise engineering teams on system design, identify and build capabilities for foundational services, automate tasks, diagnose and fix issues, and optimize system performance.
Must have:
  • 5+ years of experience in Software Engineering or SRE
  • Proficiency in one or more programming languages (Go, Python)
  • Experience with Kubernetes and Cloud systems
  • Familiarity with distributed systems development
  • Experience with high-traffic backend systems
  • Demonstrated ability to debug, fix, and optimize code
  • Troubleshooting skills spanning applications and systems
  • Strong working knowledge of Linux and containers
Good to have:
  • Familiarity with Prometheus, Thanos, Grafana, Vector, ClickHouse, OTel, Loki
Perks:
  • Retirement Savings plan
  • Workspace benefits for your home office
  • Personal & Professional development funds
  • Family Planning Support
  • Flexible Vacation & Reddit Global Days Off

Job Details

Reddit is a community of communities. It’s built on shared interests, passion, and trust and is home to the most open and authentic conversations on the internet. Every day, Reddit users submit, vote, and comment on the topics they care most about. With 100,000+ active communities and 101M+ daily active unique visitors, Reddit is one of the internet’s largest sources of information. For more information, visit redditinc.com.

Reddit SRE is rapidly innovating and our teams are working to meet the needs of infrastructure and development teams as they evolve our product faster than ever before. This is a unique opportunity to leave your mark on one of the most influential and trafficked corners of the internet.

As a Senior Site Reliability Engineer on Reddit’s Infrastructure SRE team, you’ll use your knowledge of distributed systems and architecture to improve the reliability and performance of Reddit’s engineering platforms and services. We are looking for someone who thrives at the intersection of infrastructure and software development. The team works very closely with the Compute, Traffic, and Observability infrastructure teams, and owns a suite of tools, built primarily on open-source solutions running at scale, that help engineers understand the systems they build. We’re active users of and contributors to Prometheus, Thanos, Grafana, Vector, and more.

In this role, you will also take ownership of risk management, ensuring the reliability and performance of our systems. You will collaborate with cross-functional teams to identify, assess, and mitigate risks, implementing best practices to enhance system resilience. Your expertise will drive proactive measures to maintain uptime and optimize service delivery, making a significant impact on our operational excellence.

Join us and help build the future of Reddit!

Responsibilities:

  • Advise
    • Work closely with engineering teams in designing and developing systems that are resilient and highly performant at a tremendous scale, and maintaining the foundational platform for running Reddit’s infrastructure.
  • Amplify
    • Identify and build capabilities into our foundational Infrastructure and Platform services, which are used by Reddit engineering teams to build, deploy, and operate Reddit. 
    • Deliver software to improve the availability, scalability, latency, and efficiency of observability components.
    • Identify and engineer away risk across Reddit’s systems.
  • Automate
    • Take repetitive, manual, or risky tasks and automate them out of existence. Build tools and integrate systems to support Reddit’s evolution.
    • Automate critical aspects of the event-driven development process.
  • Diagnose
    • Draw on your knowledge of distributed systems to identify and fix network, system, and service-level issues. Practice sustainable incident response, and drive structural improvement with blameless postmortems.
    • Share on-call responsibilities. 
  • Optimize
    • Observe and improve performance, reduce cost, and improve the experience for millions of users.
    • Contribute upstream changes to the open-source projects we use.

Qualifications

  • 5+ years of experience in Software Engineering, Site Reliability Engineering, or a development-focused DevOps role.
  • Proficiency in one or more programming languages. We’re predominantly writing code in Go and Python.
  • Experience with Kubernetes and Cloud systems.
  • Familiarity with distributed systems development; experience with any of the specific tools we use (Prometheus, Thanos, Grafana, Vector, ClickHouse, OTel, Loki) is a bonus.
  • Experience with the development and operation of high-traffic backend systems.
  • A demonstrated ability to debug, fix, and optimize code.
  • Troubleshooting skills that span applications, networking (TCP/IP), and systems.
  • Strong working knowledge of Linux and containers.
  • Excellent communication and collaborative skills.

Benefits:

  • Retirement Savings plan 
  • Workspace benefits for your home office 
  • Personal & Professional development funds
  • Family Planning Support 
  • Flexible Vacation & Reddit Global Days Off

Reddit is proud to be an equal opportunity employer, and is committed to building a workforce representative of the diverse communities we serve. Reddit is committed to providing reasonable accommodations for qualified individuals with disabilities and disabled veterans in our job application procedures. If, due to a disability, you need an accommodation during the interview process, please let your recruiter know.
