Principal Site Reliability Engineer (AI-first SRE)

Groupon

10+ Years | Colombia (Remote) | Full Time | 1 day ago

Job Summary

Groupon is modernizing its global platform, with reliability at the core of this transformation. We are seeking a Principal Site Reliability Engineer to lead the shift from reactive maintenance to predictive, AI-driven resilience. This role involves designing intelligent, self-healing systems to prevent incidents, ensuring fast, secure, and reliable customer experiences across millions of daily interactions. You will architect highly available systems, leverage AI/ML for infrastructure governance, build AIOps pipelines, and lead chaos engineering programs to enhance revenue resilience.

Must Have

Architect and maintain self-healing systems with 99.9%+ availability targets.
Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
Implement adaptive SLIs/SLOs that evolve automatically from real-time data.
Build AIOps-based observability and auto-remediation pipelines.
Apply predictive modeling to forecast failures before they impact users.
Lead chaos, performance, and resilience testing programs.
Map platform and service behavior to revenue impact and drive improved revenue resilience.
Mentor engineers and drive reliability standards across teams.
Partner with platform, data, and product teams to ensure stability aligns with business goals.
Support major incident response and incident reviews, and participate in on-call rotations.
10+ years in software/systems engineering, including 5+ years in SRE or platform reliability.
Strong experience with GCP (preferred) or AWS, plus Kubernetes and Terraform.
Proficiency in Python or Go for automation and tooling.
Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy).
Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations.
Strong communication and influencing skills.

Good to Have

Experience with MLOps or large-scale data infrastructure.
Exposure to FinOps or cloud cost optimization.
Previous leadership of global incident response or SRE transformation programs.

Perks & Benefits

Opportunity to work with cutting-edge technologies in a transformative environment.
A collaborative and innovative work culture that values your expertise and contributions.
Professional growth and leadership development pathways tailored to your aspirations.
A chance to leave a lasting impact by shaping the future of reliable and scalable systems.

Job Description

Groupon is a marketplace where customers discover new experiences and services every day and local businesses thrive. To date we have worked with over a million merchant partners worldwide, connecting over 16 million customers with deals across various categories. In a world often dominated by e-commerce giants, we stand out as one of the few platforms uniquely committed to helping local businesses succeed on a performance basis.

Groupon is on a radical journey to transform our business with a relentless pursuit of results. Even with thousands of employees spread across multiple continents, we still maintain a culture that inspires innovation, rewards risk-taking, and celebrates success. The impact here can be immediate due to our scale and the speed of our transformation. We're a "best of both worlds" kind of company: big enough to have the resources and scale, but small enough that a single person has a surprising amount of autonomy and can make a meaningful impact.

About the Role

Groupon is modernizing its global platform — and reliability is at the center of that transformation. We’re looking for a Principal Site Reliability Engineer to lead the evolution from reactive maintenance to predictive, AI-driven resilience.

You’ll design intelligent, self-healing systems that prevent incidents before they happen, ensuring our customers enjoy fast, secure, and reliable experiences across millions of daily interactions.

Key Responsibilities:

Architect and maintain self-healing systems with 99.9%+ availability targets.
Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
Implement adaptive SLIs/SLOs that evolve automatically from real-time data.
Build AIOps-based observability and auto-remediation pipelines.
Apply predictive modeling to forecast failures before they impact users.
Lead chaos, performance, and resilience testing programs.
Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance.
Mentor engineers and drive reliability standards across teams.
Partner with platform, data, and product teams to ensure stability aligns with business goals.
Support major incident response and incident reviews, and participate in on-call rotations.

Key Requirements:

10+ years in software/systems engineering, including 5+ years in SRE or platform reliability.
Strong experience with GCP (preferred) or AWS, plus Kubernetes and Terraform.
Proficiency in Python or Go for automation and tooling.
Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy).
Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations.
Strong communication and influencing skills — data over hierarchy.

Nice to Have:

Experience with MLOps or large-scale data infrastructure.
Exposure to FinOps or cloud cost optimization.
Previous leadership of global incident response or SRE transformation programs.

What Success Looks Like

99.9%+ uptime sustained through predictive rather than reactive responses.
Faster MTTR via automated detection and auto-remediation.
Reliability insights used in leadership decisions.
Mentorship leading to stronger reliability practices across teams.

We Are Interested In

Technologists who see reliability as a product, not just a metric.
Engineers who use AI/ML as a tool for scale and insight.
Leaders who can balance innovation speed with operational excellence.
Engineers who understand the entire e-commerce stack and how it impacts revenue.

Join us to push the boundaries of platform reliability and drive meaningful change in a fast-evolving digital world!
