Senior Site Reliability Engineer, Production Engineering

1 Month ago • All levels

Job Summary

Job Description

The Senior Site Reliability Engineer (SRE) will be responsible for designing and managing large-scale, highly available distributed systems in the cloud. This includes collaborating with application development teams to enhance the reliability, performance, and security of the platform. The role involves using cloud-native tools, designing and implementing scalable operations tooling, deploying and maintaining AWS cloud-native services, participating in incident response, and automating production operations. The SRE will also develop automation solutions, stay updated on industry best practices, identify and provide solutions to obstacles, and standardize solutions for the microservice-based platform. The role emphasizes operations, infrastructure, and 'everything as code' in a rapidly growing infrastructure.
Must have:
  • Expert knowledge of Kubernetes and its ecosystem.
  • Proficiency in software development with languages like Python or Go.
  • In-depth knowledge of cloud providers, preferably AWS.
  • Proven ability to build and implement scalable and well-tested solutions.
  • Strong understanding of Unix/Linux systems and client-server protocols.
  • Knowledge of Site Reliability principles: Incident Response, Change Management, Distributed Systems, Deployment Strategies, and SLOs.
Good to have:
  • Familiarity with best practices for operating a large-scale, highly available enterprise platform.
  • 5+ years of experience in a related role.
  • Excellent communication and documentation skills.
  • Strong sense of ownership, drive, and attention to detail.

Job Details

Please note that we have a hybrid approach to work and would like to find someone who can come into our offices in London at least one day a week.

Who We Are

Cisco ThousandEyes is a leading Digital Experience Assurance platform that empowers organizations to deliver seamless digital experiences across every network—even those beyond their ownership. Leveraging AI and an unparalleled set of cloud, internet, and enterprise network telemetry data, ThousandEyes enables IT teams to proactively detect, diagnose, and resolve issues before they impact end-user experiences.

ThousandEyes is deeply integrated across Cisco's extensive technology portfolio, supporting customers in scaling deployments while offering AI-powered assurance insights within Cisco’s Networking, Security, Collaboration, and Observability portfolios.

About The Role

We are seeking a skilled Senior Site Reliability Engineer (SRE) in Production Engineering with a strong background in SaaS and operations. You will design and manage large-scale, highly available distributed systems in the cloud, collaborating directly with application development teams to enhance the reliability, performance, and security of our platform.

What You’ll Do

  • Collaborate with software engineers to optimize architecture and services for availability, latency, performance, and reliability using cloud-native tools.
  • Design and implement scalable operations tooling to support platform growth and scaling across multiple regions.
  • Design, deploy, and maintain AWS cloud-native services that are elastic and resilient to failure.
  • Participate in and improve our 24x7 incident response and on-call rotation.
  • Use and expand our existing CNCF solutions like Kubernetes, Service Mesh, Prometheus, OpenTelemetry, and ArgoCD to increase platform reliability.
  • Automate production operations to provide guardrails and continuous platform operation.
  • Develop automation solutions for scalable service and platform operations, including deployment, scale testing, graceful failure, and chaos testing.
  • Stay current on industry best practices for scalability and reliability, and apply them to the ThousandEyes platform.
  • Identify and provide solutions to common obstacles hindering operational excellence across engineering teams.
  • Generalize and standardize solutions and processes to enable repeated success across our microservice-based multi-region platform.
  • Play a key role in the ThousandEyes platform by leveraging scale testing and additional environments, and by working with application teams to improve system reliability.
  • Manage a rapidly growing infrastructure capable of handling substantial daily data volumes, emphasizing operations/infrastructure/everything as code.

Qualifications

  • Expert-level knowledge of Kubernetes and its ecosystem.
  • Proficiency in software development with languages such as Python or Go.
  • In-depth knowledge of cloud providers, preferably AWS.
  • Proven ability to build and implement scalable and well-tested solutions.
  • Strong understanding of Unix/Linux systems, including kernel, system libraries, file systems, and client-server protocols.
  • Knowledge of Site Reliability principles: Incident Response, Change Management, Distributed Systems, Deployment Strategies, and SLOs.

Preferred Qualifications

  • Familiarity with best practices for operating a large-scale, highly available enterprise platform.
  • 5+ years of experience in a related role.
  • Excellent communication and documentation skills.
  • Strong sense of ownership, drive, and attention to detail.

Cisco values the perspectives and skills that emerge from employees with diverse backgrounds. That's why Cisco is expanding the boundaries of discovering top talent by not only focusing on candidates with educational degrees and experience but also placing more emphasis on unlocking potential. We believe that everyone has something to offer and that diverse teams are better equipped to solve problems, innovate, and create a positive impact.

We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification. Research shows that people from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy. We urge you not to prematurely exclude yourself and to apply if you're interested in this work.

About The Company

The name ThousandEyes was born from two big ideas: the power to see things not ordinarily possible and the ability to collect insights from a multitude of vantage points. As organizations rely more on cloud services and the Internet, the network has become a black box they can't understand. ThousandEyes gives organizations visibility into the now borderless network, arming them with an accurate understanding of how the network impacts their applications, users and customers. ThousandEyes is used by some of the world's largest and fastest growing brands, including all of the top 5 global software companies, 5 of the top 6 US banks, and 45 of the Fortune 500.
