Senior Site Reliability Engineer (SRE)

3 Months ago • 6-12 Years • $199,000 PA - $252,000 PA
Devops

Job Description

Stacklok is seeking a Senior Site Reliability Engineer (SRE) to design, build, and operate the infrastructure powering their AI-first products and services. This role involves owning key production systems, leading reliability initiatives, and delivering secure, scalable infrastructure for AI use cases. The engineer will work with technologies like Kubernetes, Terraform, and ArgoCD, automating deployments and incident response, enhancing service health through telemetry and SLOs, and ensuring infrastructure scalability. The ideal candidate thrives in dynamic environments, builds reliable infrastructure, and applies strong technical judgment to complex operational challenges. Responsibilities include collaborating across teams, developing automation and internal tooling, and mentoring junior engineers, with a focus on reducing toil and making AI systems dependable.
Good To Have:
  • Experience with AWS
  • Experience with GitOps practices
  • Experience with PagerDuty
Must Have:
  • Strong SRE foundation
  • Proficiency in Python, Go, or Bash
  • Deep experience with Terraform
  • Hands-on Kubernetes and Docker experience
  • Experience with at least one major cloud provider
  • Experience with ArgoCD or Flux
  • Experience automating incident response
  • Proficient with observability tools
  • Experience defining and using SLOs/KPIs
  • Familiarity with security best practices
  • Strong collaboration and communication skills
Perks:
  • Competitive compensation
  • Equity
  • Comprehensive healthcare
  • Flexible work environment
  • Adaptable work hours
  • Flexible Paid Time Off

At Stacklok, we’re an AI-first company led by Kubernetes co-founder Craig McLuckie, helping enterprise developers connect the data, systems, and services that power their businesses today with the agentic and assistive AI systems they’re building for tomorrow. We believe the shift from applications to agents is the next major evolution in software, and we’re building the foundation that helps teams make that leap with confidence.

Our open source platform, ToolHive, provides developers with a powerful yet simple way to securely connect AI systems to real-world environments, delivering the right context at the right time. It solves tough challenges like security, access control, and observability without adding friction to the developer experience. By using open protocols like MCP (Model Context Protocol) and a highly pluggable architecture, supported by a community-first development approach, ToolHive allows enterprises to run AI agents safely behind firewalls, with full control over data flow, context, and decision-making.


Location

This is a hybrid role that requires in-person work three days a week: Tuesday, Wednesday, and Thursday. We believe this approach balances flexibility with the value of in-person collaboration and community.

Our current office is located at:
3120 139th Avenue SE, Suite 500
Bellevue, WA 98005

Please note: we are planning to relocate to a more central location in the near future.

Elevator Pitch

Stacklok is seeking a Senior Site Reliability Engineer to design, build, and operate the infrastructure that powers our products and services. In this role, you’ll own key production systems, lead reliability-focused engineering efforts, and help deliver secure, scalable infrastructure for real-world AI use cases.

You’ll work hands-on with technologies like Kubernetes, Terraform, and ArgoCD to evolve cloud-native systems. You’ll automate deployments and incident response, enhance service health through telemetry and SLOs, and ensure our infrastructure can scale with product adoption.

We’re looking for an engineer who thrives in high-change environments, builds reliable and maintainable infrastructure, and applies strong technical judgment to complex operational challenges. You should be comfortable collaborating across teams, developing automation and internal tooling, and mentoring less experienced engineers.

If you're excited about reducing toil, scaling infrastructure through code, and making AI-powered systems dependable in production, we’d love to hear from you.

Success In The Role: 6-12 Month Expectations

  • Embedded in Team and Culture: Built strong, trust-based relationships across engineering, product, and design. Adapted quickly to team workflows, values, and collaboration norms. Contributed effectively to team goals with minimal oversight.
  • Product and Platform Fluency Demonstrated: Developed a deep understanding of Stacklok’s products, architecture, and strategy. Used this fluency to inform infrastructure decisions, collaborate effectively with product and engineering teams, and align platform work with near- and long-term goals.

  • Infrastructure Ecosystem Designed and Implemented: Led the design and setup of a scalable deployment ecosystem using Terraform and Kubernetes. Selected and configured tooling for observability, monitoring, and delivery. Embedded infrastructure security and operational best practices from the outset.

  • Automation and Reliability Improved: Delivered automation across provisioning, deployment, recovery, and operational workflows that significantly reduced manual effort and operational risk. Improved consistency, accelerated engineering velocity, and helped eliminate recurring sources of toil. Drove optimizations, including cloud cost reduction.

  • Operational Excellence Established: Defined and implemented meaningful SLOs and KPIs tied to service health and business goals. Designed and rolled out the team’s initial on-call and incident response processes. Contributed to shaping a strong culture of operational readiness and shared accountability.

  • Team Clarity and Production Knowledge Scaled: Produced high-quality documentation, system diagrams, and runbooks that improved team preparedness and visibility. Mentored peers in production ownership, tooling usage, and operational best practices. Helped foster a culture of shared responsibility and engineering excellence.

In This Role You Will:

  • Design and Operate Reliable Infrastructure: Contribute to the evolution of our infrastructure by designing and managing production systems that support multiple engineering teams. Continuously improve platform performance, availability, and operational robustness through well-engineered solutions.

  • Automate Operational Workflows: Apply an automation-first mindset to reduce manual processes in areas like provisioning, deployment, and incident response. Deliver resilient tooling and workflows that enable faster delivery and improve reliability.

  • Monitor and Improve Service Health: Define and maintain key metrics that reflect system performance and reliability. Use telemetry and observability tooling to proactively detect issues and drive systemic improvements.

  • Champion Operational Excellence: Establish and iterate on SLOs, incident response, and on-call practices that ensure reliable service delivery. Promote a culture of accountability, preparedness, and continuous improvement.

  • Mentor and Enable Engineering Teams: Share production knowledge, write and maintain high-quality runbooks and system documentation, and support engineers in adopting sound operational practices. Contribute to a healthy, inclusive engineering culture through mentorship and collaboration.

We Understand

We understand that not everyone will meet every requirement listed, and that’s perfectly okay! We encourage you to apply regardless of your self-assessment. We value a diverse range of skills and experiences and believe that your unique attributes can make a significant impact. We want to hear from you!

Desired Skills & Experience

  • Site Reliability Engineering: Strong foundation in SRE, with experience designing, operating, and scaling reliable production systems in fast-paced environments.

  • Programming: Proficient in applying fundamental programming principles to build reliable, maintainable automation, scripting, and internal tools. Experienced with languages such as Python, Go, Bash, or similar, with an emphasis on clear structure, testing, and operational reliability.

  • Infrastructure as Code (IaC): Deep experience with Terraform or similar tooling to provision, configure, and manage cloud environments using code-driven workflows.

  • Cloud-Native Operations: Hands-on experience with Kubernetes and Docker in production environments. Familiarity with autoscaling, recovery strategies, and cloud-native architecture patterns.

  • Cloud Provider Experience: Proficient with at least one major cloud provider (e.g., AWS, Azure, GCP). Experience with AWS is preferred.

  • GitOps and Deployment Tooling: Experience deploying to Kubernetes using GitOps practices. Familiarity with ArgoCD (preferred) or similar tools like Flux.

  • Incident Response Automation: Experience automating incident response workflows using tools such as PagerDuty to improve response times and operational consistency.

  • Observability and Monitoring: Proficient with log aggregation and telemetry tools such as AWS CloudWatch, Prometheus, Grafana, or similar, to support monitoring, performance tuning, and proactive issue detection.

  • Service Quality and Metrics: Experienced in defining and using SLOs and KPIs to guide reliability goals, improve service quality, and drive operational focus.

  • Operational Security Awareness: Familiar with operational and infrastructure security best practices, including secure software supply chain considerations.

  • Business-Aligned Impact: Track record of delivering technical solutions that drive measurable business outcomes. Applies engineering judgment with product and customer context in mind.

  • Collaboration and Communication: Strong written and verbal communication skills. Comfortable collaborating across technical and non-technical audiences, mentoring peers, and contributing to inclusive team culture.

  • Startup Agility and Versatility: Thrives in fast-moving, high-growth environments. Adaptable across responsibilities, self-directed, and proactive in driving clarity and execution.

Why Join Us?

At Stacklok, we believe great technology is built by teams that support, challenge, and inspire one another. We are AI maximalists, confident in its potential and committed to ensuring it is used in ways that are safe and sustainable.

You will join a highly motivated, collaborative team with deep experience building some of the world’s most impactful technologies. We work in the open, side by side with the community, with strong roots in open source, cloud-native technologies, security, and developer tools.

We offer competitive compensation, equity, comprehensive healthcare, and a flexible work environment, including adaptable work hours and flexible PTO to support your success.

If you're excited about the future of AI and want to build alongside people who care deeply about their craft, their community, and each other, we would love to hear from you.

Stacklok Inc. is proud to be an equal opportunity employer. We are committed to providing equal employment opportunities for all people and place great value on both diversity and inclusiveness. All qualified applicants will be considered for employment without regard to their, or any other person's, perceived or actual race, color, religion, sex, gender, gender identity, gender expression, sexual orientation, national origin, ancestry, citizenship, age, physical or mental disability, medical condition, family care status, or any other basis protected by law.
