The Junior Site Reliability Engineer assists in designing, building, and maintaining the infrastructure and deployment systems that underpin our live environments. This role is hands-on and highly collaborative, working closely with development teams and senior SREs to ensure our systems are reliable, scalable, and well-instrumented. Junior SREs are expected to learn and apply best practices in building robust, automated solutions, and to ensure their work is repeatable and understandable to others. Every contribution should be accompanied by documentation to support knowledge-sharing within the team and across engineering.
Core Responsibilities
- Infrastructure Design & Maintenance
  - Assist in building and maintaining infrastructure using infrastructure-as-code (IaC) tools (e.g., Terraform, CloudFormation).
  - Support the provisioning and lifecycle management of production, staging, and other critical environments.
  - Help implement shared infrastructure components (e.g., logging, metrics, service mesh, load balancing).
  - Contribute to improving infrastructure scalability, availability, and performance under the guidance of senior engineers.
  - Collaborate with development teams to provide infrastructure support for their deployment needs.
- Deployment Systems & CI/CD
  - Support and help extend CI/CD pipelines (GitHub Actions, Argo CD) to improve reliability and automation of deployments.
  - Help promote consistency and best practices across environments for deployment, rollback, and observability.
  - Work with developers to streamline testing and delivery of code to production.
  - Assist in reducing manual steps in deployment and operations workflows.
- Reliability, Observability & Tooling
  - Assist in the implementation and maintenance of our monitoring, alerting, and logging infrastructure (Kube-Prometheus-Grafana stack).
  - Help track SLOs/SLIs for core services in partnership with service owners.
  - Learn to identify and help eliminate single points of failure, performance bottlenecks, and sources of instability.
  - Participate in reliability reviews and post-incident analysis.
- Documentation & Knowledge Sharing
  - Ensure that all systems and processes you work on are accompanied by thorough, up-to-date documentation.
  - Contribute to shared knowledge bases, runbooks, and developer-facing onboarding materials.
  - Participate in internal training and pairing sessions to learn from teammates.
- Collaboration & Culture
  - Work closely with the SRE Lead and other team members to execute work aligned with team goals.
  - Engage constructively with other teams across engineering.
  - Participate in on-call rotations with strong support from senior members.
  - Embrace a culture of blameless learning, transparency, and continuous improvement.
Qualifications & Skills
- Experience: 3+ years in a DevOps, SRE, or related role.
- Cloud: Basic understanding of cloud computing concepts, with some hands-on experience in AWS.
- Containers & Orchestration: Familiarity with Docker and a foundational understanding of Kubernetes concepts. Experience with AWS ECS is a plus.
- CI/CD: Exposure to CI/CD principles and tools like GitHub Actions. Familiarity with Argo CD is a bonus.
- IaC: Some experience with or exposure to Infrastructure as Code tools like Terraform or CloudFormation.
- Scripting: Proficiency in at least one scripting language (e.g., Bash, Python).
- Observability: A basic understanding of monitoring and logging. Exposure to Prometheus and Grafana is desirable.
- Collaboration: Strong communication skills and a desire to learn and work within a team.
- Problem Solving: An enthusiastic and curious approach to solving technical challenges.