Principal Site Reliability Engineer (AI-first SRE)
Groupon
Job Summary
Groupon is modernizing its global platform, with reliability at the core of this transformation. We are seeking a Principal Site Reliability Engineer to lead the shift from reactive maintenance to predictive, AI-driven resilience. This role involves designing intelligent, self-healing systems to prevent incidents, ensuring fast, secure, and reliable customer experiences across millions of daily interactions. You will architect highly available systems, leverage AI/ML for infrastructure governance, build AIOps pipelines, and lead chaos engineering programs to enhance revenue resilience.
Must Have
- Architect and maintain self-healing systems with 99.9%+ availability targets.
- Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
- Implement adaptive SLIs/SLOs that evolve automatically from real-time data.
- Build AIOps-based observability and auto-remediation pipelines.
- Apply predictive modeling to forecast failures before they impact users.
- Lead chaos, performance, and resilience testing programs.
- Map platform and service behavior to revenue impact and drive improved revenue resilience.
- Mentor engineers and drive reliability standards across teams.
- Partner with platform, data, and product teams to ensure stability aligns with business goals.
- Support major incident response and incident reviews, and participate in on-call rotations.
- 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability.
- Strong experience with GCP (preferred) or AWS, plus Kubernetes and Terraform.
- Proficiency in Python or Go for automation and tooling.
- Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy).
- Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations.
- Strong communication and influencing skills.
Good to Have
- Experience with MLOps or large-scale data infrastructure.
- Exposure to FinOps or cloud cost optimization.
- Previous leadership of global incident response or SRE transformation programs.
Perks & Benefits
- Opportunity to work with cutting-edge technologies in a transformative environment.
- A collaborative and innovative work culture that values your expertise and contributions.
- Professional growth and leadership development pathways tailored to your aspirations.
- A chance to leave a lasting impact by shaping the future of reliable and scalable systems.
Job Description
Groupon is a marketplace where customers discover new experiences and services every day and local businesses thrive. To date, we have worked with over a million merchant partners worldwide, connecting over 16 million customers with deals across various categories. In a world often dominated by e-commerce giants, we stand out as one of the few platforms uniquely committed to helping local businesses succeed on a performance basis.
Groupon is on a radical journey to transform our business with a relentless pursuit of results. Even with thousands of employees spread across multiple continents, we still maintain a culture that inspires innovation, rewards risk-taking, and celebrates success. The impact here can be immediate due to our scale and the speed of our transformation. We're a "best of both worlds" kind of company: big enough to have the resources and scale, but small enough that a single person has a surprising amount of autonomy and can make a meaningful impact.
About the Role
Groupon is modernizing its global platform — and reliability is at the center of that transformation. We’re looking for a Principal Site Reliability Engineer to lead the evolution from reactive maintenance to predictive, AI-driven resilience.
You’ll design intelligent, self-healing systems that prevent incidents before they happen, ensuring our customers enjoy fast, secure, and reliable experiences across millions of daily interactions.
Key Responsibilities:
- Architect and maintain self-healing systems with 99.9%+ availability targets.
- Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
- Implement adaptive SLIs/SLOs that evolve automatically from real-time data.
- Build AIOps-based observability and auto-remediation pipelines.
- Apply predictive modeling to forecast failures before they impact users.
- Lead chaos, performance, and resilience testing programs.
- Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance.
- Mentor engineers and drive reliability standards across teams.
- Partner with platform, data, and product teams to ensure stability aligns with business goals.
- Support major incident response and incident reviews, and participate in on-call rotations.
Key Requirements:
- 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability.
- Strong experience with GCP (preferred) or AWS, plus Kubernetes and Terraform.
- Proficiency in Python or Go for automation and tooling.
- Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy).
- Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations.
- Strong communication and influencing skills — data over hierarchy.
Nice to Have:
- Experience with MLOps or large-scale data infrastructure.
- Exposure to FinOps or cloud cost optimization.
- Previous leadership of global incident response or SRE transformation programs.
What Success Looks Like
- 99.9%+ uptime sustained through predictive rather than reactive responses.
- Faster MTTR via automated detection and auto-remediation.
- Reliability insights used in leadership decisions.
- Mentorship leading to stronger reliability practices across teams.
We Are Interested In
- Technologists who see reliability as a product, not just a metric.
- Engineers who use AI/ML as a tool for scale and insight.
- Leaders who can balance innovation speed with operational excellence.
- Engineers who understand the entire e-commerce stack and how it impacts revenue.
What We Offer:
- The opportunity to work with cutting-edge technologies in a transformative environment.
- A collaborative and innovative work culture that values your expertise and contributions.
- Professional growth and leadership development pathways tailored to your aspirations.
- A chance to leave a lasting impact by shaping the future of reliable and scalable systems.
Join us to push the boundaries of platform reliability and drive meaningful change in a fast-evolving digital world!