Senior Site Reliability Engineering Manager, Production Engineering

4 Months ago • All levels

Product Management

Job Description

As the Senior Engineering Manager for our Production Engineering SRE team, you will lead a group of skilled engineers responsible for the design and management of large-scale, highly available distributed systems in the cloud. You will collaborate directly with application development teams to enhance the reliability, performance, and security of our platform. The role involves leading and mentoring a high-performing team, developing and implementing strategies to improve platform reliability, security, and performance, and overseeing the design and implementation of scalable operations tooling. You will also ensure effective management of incident response, lead efforts to automate production operations, and partner with teams to enhance the security posture of systems, while working closely with software development teams to optimize architecture and services for availability and performance.

Good To Have:

Strong communication and leadership skills.
Demonstrated ability in SRE/DevOps.
Background in security engineering or DevSecOps.
Familiarity with CNCF tools.

Must Have:

Lead and scale SRE teams in a fast-paced environment.
Deep knowledge of site reliability principles, including incident response and SLOs.
Expert-level knowledge of Kubernetes and its ecosystem.
Strong understanding of cloud platforms, preferably AWS.
Experience with microservices architecture and distributed systems.

Please note that we have a hybrid approach to work and would like to find someone who can come into the office in London at least one day a week.

Who We Are

Cisco ThousandEyes is a Digital Experience Assurance platform that empowers organizations to deliver flawless digital experiences across every network – even the ones they don’t own. Powered by AI and an unmatched set of cloud, Internet and enterprise network telemetry data, ThousandEyes enables IT teams to proactively detect, diagnose, and remediate issues – before they impact end-user experiences.

ThousandEyes is deeply integrated across the entire Cisco technology portfolio and beyond, helping customers deploy at scale while also delivering AI-powered assurance insights within Cisco’s leading Networking, Security, Collaboration, and Observability portfolios.

About The Role

What You’ll Do

Team Leadership and Development:

Build and mentor a high-performing team of Site Reliability Engineers that embed with application development teams
Foster a culture of continuous learning, innovation, and best practices
Manage performance, set goals, and provide career development opportunities

Strategic Planning and Execution:

Develop and implement strategies to improve platform reliability, security, and performance
Collaborate with other engineering leaders to align SRE initiatives with overall business objectives
Establish and execute a roadmap for building common platform solutions to the reliability, security, and scale challenges that engineering teams at ThousandEyes face

Operational Excellence:

Oversee the design and implementation of scalable operations tooling for SREs and Developers
Ensure the effective management of our 24x7 incident response and on-call rotation
Lead efforts to automate production operations and adopt robust monitoring solutions

Security and Compliance:

Partner with application development teams and other platform engineering teams to enhance the security posture of our containerized and cloud-native systems
Ensure compliance with Cisco and industry standards for data protection, scanning, and system security

Cross-functional Collaboration:

Work closely with software development teams to optimize architecture and services for availability and performance
Collaborate with product management to align SRE initiatives with product roadmaps
Represent the Production Engineering SRE team in cross-functional meetings and initiatives

Minimum Qualifications

Proven track record of leading and scaling SRE teams in a fast-paced environment
Deep knowledge of site reliability principles, including incident response, change management, and SLOs
Expert-level knowledge of Kubernetes and its ecosystem
Strong understanding of cloud platforms, preferably AWS
Experience with microservices architecture and distributed systems

Preferred Qualifications

Strong communication and leadership skills, with the ability to influence cross-functional stakeholders
Demonstrated ability in SRE, DevOps, or related fields, with at least 3 years in a management role
Background in security engineering or DevSecOps, or a strong understanding of security best practices in cloud-native environments
Familiarity with CNCF tools such as Prometheus, OpenTelemetry, and ArgoCD

Cisco values the perspectives and skills that emerge from employees with diverse backgrounds. That's why Cisco is expanding the boundaries of discovering top talent by not only focusing on candidates with educational degrees and experience but also placing more emphasis on unlocking potential. We believe that everyone has something to offer and that diverse teams are better equipped to solve problems, innovate, and create a positive impact.

We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification. Research shows that people from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy. We urge you not to prematurely exclude yourself and to apply if you're interested in this work.
