Toast is driven by building the restaurant platform that helps restaurants adapt, take control, and get back to what they do best: building the businesses they love.
The Manager, Site Reliability Engineering Observability role at Toast sits within the Observability Enablement & Administration team. This team is part of Site Reliability Engineering, which oversees Toast production services with a commitment to quality, reliability, and low latency. The Observability Enablement & Administration team is responsible for setting the overall observability strategy, choosing the right tools and technologies, developing best practices, and providing guidance to other teams, while maintaining and administering the observability platform and log pipelines and governing their cost.
As a Manager of the Observability Enablement & Administration team, you will provide technical leadership and hands-on contributions, incorporating reliability best practices for programming and scripting, observability, production triage, incident resolution, and retrospective/root cause analysis to maintain the world-class reliability and uptime of our platform.
About this roll\* (Responsibilities)
In this role you will be responsible for the architecture, administration, maintenance, and enhancement of our observability platforms, ensuring optimal performance and availability for our critical security and business operations.
- Create and drive strategic organization-wide observability initiatives in collaboration with technical leadership and Product Management
- Drive day-to-day operations of the team and contribute to the development and prioritization of the SRE roadmap for observability initiatives
- Enable a geographically distributed team of engineers to continue performing at a high level and help increase the impact of their work
- Manage observability architecture design, support, and platform management
- Implement strategies to increase observability platform reliability and performance
- Lead and contribute to initiatives that automate operational toil for observability-focused tasks, such as those needed for legal and compliance requirements
- Guide teams to build and maintain systems that are observable
- Support end users with training and technical guidance on observability tools and capabilities
- Gather and analyze metrics from operating systems and applications to give development teams observability insights
Do you have the right ingredients\*? (Requirements)
- Hands-on experience managing an SRE or Observability team, including hiring, mentoring, and cross-functional collaboration
- Hands-on coding/scripting experience with languages such as Go or Python
- Deep understanding of observability systems and tools such as APM, RUM, Synthetics, Splunk, OTEL, log pipelines, SIEM, and Terraform
- Background in leading complex engineering projects in a Scrum environment
- Direct exposure to cloud infrastructure and SaaS solutions
- Polyglot technologist/generalist with a thirst for learning