Senior Site Reliability Engineer, Production Engineering

2 Months ago • All levels • Product Management

Job Summary

Job Description

The Senior Site Reliability Engineer (SRE) will be responsible for designing and managing large-scale, highly available distributed systems in the cloud. This includes collaborating with application development teams to enhance the reliability, performance, and security of the platform. The role involves using cloud-native tools, designing and implementing scalable operations tooling, deploying and maintaining AWS cloud-native services, participating in incident response, and automating production operations. The SRE will also develop automation solutions, stay updated on industry best practices, identify and provide solutions to obstacles, and standardize solutions for the microservice-based platform. The role emphasizes operations, infrastructure, and 'everything as code' in a rapidly growing infrastructure.
Must have:
  • Expert knowledge of Kubernetes and its ecosystem.
  • Proficiency in software development with languages like Python or Go.
  • In-depth knowledge of cloud providers, preferably AWS.
  • Proven ability to build and implement scalable and well-tested solutions.
  • Strong understanding of Unix/Linux systems and client-server protocols.
  • Knowledge of Site Reliability principles: Incident Response, Change Management, Distributed Systems, Deployment Strategies, and SLOs.
Good to have:
  • Familiarity with best practices for operating a large-scale, highly available enterprise platform.
  • 5+ years of experience in a related role.
  • Excellent communication and documentation skills.
  • Strong sense of ownership, drive, and attention to detail.

Job Details

Please note that we have a hybrid approach to work and would like to find someone who can come into our offices in London at least one day a week.

Who We Are

Cisco ThousandEyes is a leading Digital Experience Assurance platform that empowers organizations to deliver seamless digital experiences across every network—even those beyond their ownership. Leveraging AI and an unparalleled set of cloud, internet, and enterprise network telemetry data, ThousandEyes enables IT teams to proactively detect, diagnose, and resolve issues before they impact end-user experiences.

ThousandEyes is deeply integrated across Cisco's extensive technology portfolio, supporting customers in scaling deployments while offering AI-powered assurance insights within Cisco’s Networking, Security, Collaboration, and Observability portfolios.

About The Role

We are seeking a skilled Senior Site Reliability Engineer (SRE) in Production Engineering with a strong background in SaaS and operations. You will design and manage large-scale, highly available distributed systems in the cloud, collaborating directly with application development teams to enhance the reliability, performance, and security of our platform.

What You’ll Do

  • Collaborate with software engineers to optimize architecture and services for availability, latency, performance, and reliability using cloud-native tools.
  • Design and implement scalable operations tooling to support platform growth and scaling across multiple regions.
  • Design, deploy, and maintain AWS cloud-native services that are elastic and resilient to failure.
  • Participate in and improve our 24x7 incident response and on-call rotation.
  • Use and expand our existing CNCF solutions like Kubernetes, Service Mesh, Prometheus, OpenTelemetry, and ArgoCD to increase platform reliability.
  • Automate production operations to provide guardrails and continuous platform operation.
  • Develop automation solutions for scalable service and platform operations, including deployment, scale testing, graceful failure, and chaos testing.
  • Stay current on industry best practices for scalability and reliability, and apply them to the ThousandEyes platform.
  • Identify and provide solutions to common obstacles hindering operational excellence across engineering teams.
  • Generalize and standardize solutions and processes to enable repeated success across our microservice-based multi-region platform.
  • Play a key role in improving the reliability of the ThousandEyes platform through scale testing, additional environments, and close work with application teams.
  • Manage a rapidly growing infrastructure capable of handling substantial daily data volumes, emphasizing operations/infrastructure/everything as code.

Qualifications

  • Expert-level knowledge of Kubernetes and its ecosystem.
  • Proficiency in software development with languages such as Python or Go.
  • In-depth knowledge of cloud providers, preferably AWS.
  • Proven ability to build and implement scalable and well-tested solutions.
  • Strong understanding of Unix/Linux systems, including kernel, system libraries, file systems, and client-server protocols.
  • Knowledge of Site Reliability principles: Incident Response, Change Management, Distributed Systems, Deployment Strategies, and SLOs.

Preferred Qualifications

  • Familiarity with best practices for operating a large-scale, highly available enterprise platform.
  • 5+ years of experience in a related role.
  • Excellent communication and documentation skills.
  • Strong sense of ownership, drive, and attention to detail.

Cisco values the perspectives and skills that emerge from employees with diverse backgrounds. That's why Cisco is expanding the boundaries of discovering top talent by not only focusing on candidates with educational degrees and experience but also placing more emphasis on unlocking potential. We believe that everyone has something to offer and that diverse teams are better equipped to solve problems, innovate, and create a positive impact.

We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification. Research shows that people from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy. We urge you not to prematurely exclude yourself and to apply if you're interested in this work.

About The Company

The name ThousandEyes was born from two big ideas: the power to see things not ordinarily possible and the ability to collect insights from a multitude of vantage points. As organizations rely more on cloud services and the Internet, the network has become a black box they can't understand. ThousandEyes gives organizations visibility into the now borderless network, arming them with an accurate understanding of how the network impacts their applications, users and customers. ThousandEyes is used by some of the world's largest and fastest growing brands, including all of the top 5 global software companies, 5 of the top 6 US banks, and 45 of the Fortune 500.

