Senior Site Reliability Engineer (Observability & Resilience)

1 Month ago • 5+ Years • DevOps • $130,000 PA - $150,000 PA

Job Summary

Job Description

MagicSchool is seeking a Senior Site Reliability Engineer specializing in Observability and Resilience. This role involves leading the platform's observability strategy, designing resilient infrastructure, and driving instrumentation and telemetry. Responsibilities include creating observability patterns (metrics, logging, tracing, alerting), building internal tooling and dashboards, defining SLIs/SLOs, ensuring high availability and disaster recovery with Terraform, and collaborating with engineering teams to embed resilient design. The ideal candidate has at least 5 years of experience in SRE or a related field, expertise in observability tools like Grafana and Prometheus, strong Terraform skills, and excellent communication abilities.
Must have:
  • 5+ years in SRE, DevOps, or observability role
  • Design/operate systems for high availability
  • Experience with Grafana, Prometheus, Loki, Datadog
  • Strong Terraform and infrastructure-as-code skills
  • Enable product engineers on observability patterns
  • Calm and decisive communication during incidents
Good to have:
  • Experience with Sentinel or similar logging stacks
  • Exposure to educational or compliance environments
  • Strong debugging skills
Perks:
  • Work on cutting-edge AI technology
  • Mission-driven team
  • Flexibility of working from home
  • Unlimited time off
  • Employer-paid health insurance
  • Stock options
  • 401k match
  • Monthly wellness stipend

Job Details

WHO WE ARE: MagicSchool is the premier generative AI platform for teachers. We're just over 2 years old, and more than 5.5 million teachers from all over the world have joined our platform. Join a top team at a fast-growing company that is working toward real social impact. Create an account and try us out on our website, and connect with our passionate community on our Wall of Love.

Role Description:

As Senior Site Reliability Engineer (Observability & Resilience), you will lead observability across our platform and help design the resilient infrastructure our customers and educators rely on every day. In this hands-on, individual contributor role, you’ll drive instrumentation and telemetry strategy while partnering closely with product and engineering to plan for resilience, recovery, and availability.

Responsibilities:

In this role, you will be responsible for driving the following outcomes:

  • Observability Leadership: Design and implement observability patterns—including metrics, logging, tracing, and alerting—to ensure we have clear, actionable visibility into platform behavior and performance.

  • Internal Tooling and Dashboards: Build internal tooling and dashboards that give our teams real-time insight into system behavior.

  • Operational Excellence: Define and maintain SLIs and SLOs in partnership with product and engineering teams. Establish best practices for alert tuning and signal-to-noise balancing to reduce incident fatigue and improve response accuracy.

  • Platform Resilience: Architect and support infrastructure that prioritizes high availability, disaster recovery, and graceful degradation. Leverage Terraform and infrastructure-as-code to ensure consistent, reliable deployments across AWS and Google Cloud.

  • Cross-Functional Enablement: Collaborate with engineers across teams to embed resilient design and observability from the ground up. Provide training and pairing support to product engineers, helping them build and maintain telemetry that supports the full software lifecycle.

Experience & Qualifications:

To be successful in this role, you’ll bring the following experience and qualifications:

  • Professional Experience: At least 5 years in an SRE, DevOps, or observability-focused role, with a track record of success in fast-paced, high-growth environments.

  • Observability & Resilience: Experience designing and operating systems for high availability and disaster recovery. Familiarity with incident response, alert fatigue reduction, and signal-to-noise balancing.

  • Tooling Expertise: Deep experience with observability tools such as Grafana, Prometheus, Loki, Datadog, and OpenTelemetry. Proven ability to operationalize these tools for maximum team impact.

  • Infrastructure Skills: Strong proficiency with Terraform and infrastructure-as-code workflows. Experience with multi-cloud deployments and operating resilient systems at scale.

  • Enablement & Collaboration: Passion for enabling product engineers through training and pairing on observability patterns. Ability to drive cross-functional initiatives that improve system health and team effectiveness.

  • Communication Skills: Skilled at explaining complex infrastructure and observability concepts to both technical and non-technical audiences. Calm and decisive under pressure, especially during incident response.

Nice to Have:

  • Experience with Sentinel, Loki, or similar logging/metrics stacks.

  • Exposure to educational or compliance-heavy environments.

  • Strong debugging skills and a calm presence during incidents.

Notice: Priority Deadline and Review Start Date

The priority deadline for applications to this position is 7/18/25. Applications received after this date will be reviewed on an intermittent basis. While we encourage early submissions, all applications received by the priority deadline will receive equal consideration. Thank you for your interest, and we look forward to reviewing your application.

Why Join Us?

  • Work on cutting-edge AI technology that directly impacts educators and students.

  • Join a mission-driven team passionate about making education more efficient and equitable.

  • Flexibility of working from home, while fostering a unique culture built on relationships, trust, communication, and collaboration with our team - no matter where they live.

  • Unlimited time off to empower our employees to manage their work-life balance. We work hard for our teachers and users, and encourage our employees to rest and take the time they need.

  • Choice of employer-paid health insurance plans so that you can take care of yourself and your family. Dental and vision are also offered at very low premiums.

  • Generous stock options for every employee, vesting over 4 years.

  • A 401(k) match and a monthly wellness stipend.

Our Values:

  • Educators are Magic: Educators are the most important ingredient in the educational process - they are the magic, not the AI. Trust them, empower them, and put them at the center of leading change in service of students and families.

  • Joy and Magic: Bring joy and magic into every learning experience - push the boundaries of what’s possible with AI.

  • Community: Foster a community whose members support one another during a time of rapid technological change. Listen to them and serve their needs.

  • Innovation: The education system is outdated and in need of innovation and change - AI is an opportunity to bring equity and access, and to serve the individual needs of students better than we ever have before.

  • Responsibility: Put responsibility and safety at the forefront of the technological change that AI is bringing to education.

  • Diversity: Diversity of thought, perspectives, and backgrounds helps us serve the wide audience of educators and students around the world.

  • Excellence: Educators and students deserve the best - and we strive for the highest quality in everything we do.
