Stord is seeking a Staff Site Reliability Engineer to join a fast-moving SRE team. This role involves leading architecture decisions for scalable and reliable infrastructure on Google Cloud Platform, implementing Infrastructure as Code, and managing containerized environments with Docker and Kubernetes. The engineer will define SLOs, build robust monitoring solutions, and design CI/CD pipelines. They will also automate operational workflows, partner with engineering teams, provide escalation support, and mentor junior engineers, taking full ownership of critical infrastructure decisions.
Good To Have:
- Database administration experience (PostgreSQL, MySQL, Redis, etc.).
- Familiarity with event-driven systems (Kafka, Pub/Sub, etc.).
- Experience with log aggregation tools (ELK, Splunk, Fluentd).
- Exposure to chaos engineering and resilience testing.
- Performance testing and optimization experience.
- Relevant GCP certifications (Cloud Architect, Cloud DevOps Engineer).
- Knowledge of GCP-specific services (Cloud Run, GKE, Cloud Functions, BigQuery, etc.).
- Experience with multi-cloud or hybrid architectures.
- Background in functional programming (Elixir, Haskell, F#, Clojure, etc.).
- Strong DevOps background and mindset.
Must Have:
- Lead architecture for scalable and reliable infrastructure on GCP.
- Implement Infrastructure as Code (IaC) using tools like Terraform.
- Manage containerized environments with Docker and Kubernetes.
- Define and maintain Service Level Objectives (SLOs) and SLIs.
- Build robust monitoring, alerting, and observability solutions.
- Design and maintain CI/CD pipelines.
- Automate operational workflows and infrastructure provisioning.
- Provide escalation support for production incidents.
- 8+ years of experience in SRE, platform, or infrastructure engineering, including leadership experience.
- Proficiency in Python, Go, or Java.
- Strong hands-on experience with GCP.
- Deep knowledge of networking and distributed systems.
- Experience with Git.