Staff Site Reliability Engineer


Job Description

Stord is seeking a Staff Site Reliability Engineer to join a fast-moving SRE team. This role involves leading architecture decisions for scalable and reliable infrastructure on Google Cloud Platform, implementing Infrastructure as Code, and managing containerized environments with Docker and Kubernetes. The engineer will define SLOs, build robust monitoring solutions, and design CI/CD pipelines. They will also automate operational workflows, partner with engineering teams, provide escalation support, and mentor junior engineers, taking full ownership of critical infrastructure decisions.
Must Have:
  • Lead architecture for scalable and reliable infrastructure on GCP.
  • Implement Infrastructure as Code (IaC) using tools like Terraform.
  • Manage containerized environments with Docker and Kubernetes.
  • Define and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
  • Build robust monitoring, alerting, and observability solutions.
  • Design and maintain CI/CD pipelines.
  • Automate operational workflows and infrastructure provisioning.
  • Provide escalation support for production incidents.
  • 8+ years of experience in SRE, platform, or infrastructure roles, with leadership exposure.
  • Proficiency in Python, Go, or Java.
  • Strong hands-on experience with GCP.
  • Deep knowledge of networking and distributed systems.
  • Experience with Git.

What You’ll Do:

Infrastructure & Platform Management

  • Lead architecture decisions to deliver scalable and reliable infrastructure, primarily on Google Cloud Platform (GCP)
  • Implement Infrastructure as Code (IaC) using Terraform, CloudFormation, Pulumi, or similar
  • Manage containerized environments with Docker and Kubernetes
  • Drive system performance tuning, capacity planning, and resource optimization

Reliability & Monitoring

  • Define and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
  • Build robust monitoring, alerting, and observability solutions using Prometheus, Grafana, Datadog, or New Relic
  • Develop and maintain disaster recovery and business continuity strategies

Automation & DevOps

  • Design and maintain CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions, etc.)
  • Automate operational workflows and infrastructure provisioning
  • Implement configuration management with Ansible, Chef, Puppet, or similar tools
  • Develop custom tooling and scripts to enhance operational efficiency

Collaboration & Support

  • Partner with engineering teams to improve deployment practices and application reliability
  • Provide escalation support for production incidents and lead post-incident reviews
  • Conduct technical design reviews and offer architectural guidance
  • Mentor junior engineers on SRE and infrastructure best practices
  • Participate in on-call rotations for critical systems

What You’ll Need:

Technical Skills

  • 8+ years of experience in site reliability, platform engineering, or infrastructure roles with leadership exposure
  • Proficiency in at least one programming language (Python, Go, Java, etc.)
  • Strong hands-on experience with GCP and its core services
  • Expertise in containerization (Docker) and orchestration (Kubernetes)
  • Deep knowledge of Infrastructure as Code (Terraform, CloudFormation, etc.)
  • Skilled in monitoring/observability (Prometheus, Grafana, ELK, etc.)
  • Solid understanding of networking, load balancing, and distributed systems
  • Experience with Git and collaborative development workflows

Core Competencies

  • Exceptional troubleshooting and problem-solving abilities
  • Strong grasp of system design principles and scalability patterns
  • Experience with incident management and post-mortem practices
  • Familiarity with security best practices and compliance standards
  • Excellent communication skills and ability to work cross-functionally

Preferred Qualifications:

  • Database administration experience (PostgreSQL, MySQL, Redis, etc.)
  • Familiarity with event-driven systems and platforms (Kafka, Pub/Sub, etc.)
  • Experience with log aggregation tools (ELK, Splunk, Fluentd)
  • Exposure to chaos engineering and resilience testing
  • Performance testing and optimization experience
  • Relevant GCP certifications (Cloud Architect, Cloud DevOps Engineer)
  • Knowledge of GCP-specific services (Cloud Run, GKE, Cloud Functions, BigQuery, etc.)
  • Experience with multi-cloud or hybrid architectures
  • Background in functional programming (Elixir, Haskell, F#, Clojure, etc.)
  • Strong DevOps background and mindset
