Senior Site Reliability Engineer
GoDaddy
Job Summary
GoDaddy is seeking a Senior Site Reliability Engineer to lead reliability for security platforms protecting GoDaddy’s global footprint. This remote role involves defining SLOs, automating operations, and leading incident response for critical security infrastructure like IDS/IPS and DDoS mitigation. The engineer will ensure high availability, performance, and operational excellence, reducing toil and maintaining audit-ready operations.
Must Have
- Define SLIs/SLOs and error budgets for security platforms.
- Architect high availability and disaster recovery for security services.
- Design zero- or minimal-downtime maintenance and upgrade strategies.
- Automate deployments, configuration, and compliance using SaltStack and Python.
- Operate and improve security stack: TrendMicro TippingPoint IPS, Suricata, NetScout/Arbor Sightline/TMS, HAProxy, Nginx, Juniper, Palo Alto, Kentik/KProxy.
- Build and evolve observability: Icinga, Grafana, InfluxDB, rsyslog.
- Lead incident response in a 24/7 on-call rotation.
- Reduce toil via self-service tooling and automated health checks.
- Ensure audit-ready operations aligned to WebTrust and PCI-DSS.
- Collaborate with engineering and product teams; mentor contractors.
- Maintain high-quality operational documentation.
- 5+ years in SRE/production operations or platform engineering.
- Expert-level SaltStack for configuration management.
- Strong Linux administration and troubleshooting.
- Deep understanding of TCP/IP, routing, L4–L7, and load balancing.
- Proficiency in Python for automation and tooling.
- Experience with observability tools: Icinga, Grafana, InfluxDB, rsyslog.
- Familiarity with Git-based workflows and Infrastructure as Code.
- Proven effectiveness in 24/7 on-call and incident management.
- Excellent technical writing and documentation skills.
Good to Have
- Hands-on administration of IDS/IPS and DDoS platforms (TrendMicro TippingPoint, Suricata, NetScout/Arbor Sightline/TMS).
- Experience with HAProxy and Nginx.
- Juniper and Palo Alto administration.
- Bachelor’s degree in Computer Science, Information Technology, or related field.
- Industry certifications (Security+, CISSP, Linux+).
- Experience in web hosting, ISP, or managed service provider environments.
- Background in incident response, change management, and compliance audits.
- Experience operating hybrid cloud/on-premises infrastructure.
- Understanding of WebTrust and PCI-DSS operational requirements.
Perks & Benefits
- Paid time off
- Retirement savings (e.g., 401k, pension schemes)
- Bonus/incentive eligibility
- Equity grants
- Participation in employee stock purchase plan
- Competitive health benefits
- Family-friendly benefits including parental leave
- Employee Resource Groups (Culture)
- Support for entrepreneurs/side hustles
Job Description
Join our team
Are you ready to be a reliability leader for the security platforms that protect GoDaddy’s global footprint? Our Security Infrastructure Operations team is seeking a Senior Site Reliability Engineer to drive availability, performance, and operational excellence across our IDS/IPS, DDoS mitigation, and other security services. You’ll define SLOs, reduce toil with automation, and lead incident response for mission-critical security infrastructure.
What you'll get to do…
- Own reliability outcomes for security platforms by defining SLIs/SLOs and error budgets; build actionable alerting, dashboards, and runbooks.
- Architect and implement high availability, capacity planning, and disaster recovery for IDS/IPS, DDoS mitigation, and supporting services.
- Design zero/minimal-downtime maintenance and upgrade strategies for OS, firmware, and signature updates.
- Automate deployments, configuration, and compliance using SaltStack and Python.
- Operate and improve a heterogeneous stack: TrendMicro TippingPoint IPS, Suricata, NetScout/Arbor Sightline/TMS, HAProxy, Nginx, Juniper, Palo Alto, Kentik/KProxy.
- Build and evolve observability: Icinga alerting, Grafana dashboards, InfluxDB metrics, rsyslog pipelines; drive SLO-based alerting and noise reduction.
- Lead incident response within a 24/7 on-call rotation; act as incident commander, drive rapid mitigation, and run blameless postmortems with durable fixes.
- Reduce toil through self-service tooling, APIs, and automated health checks; champion reliability reviews and game days/chaos testing.
- Ensure audit-ready operations aligned to WebTrust and PCI-DSS; uphold change management, configuration baselines, and access controls.
- Collaborate with Network Engineering, Security Architecture, Hosting, and Product teams; mentor and provide technical guidance to 2–3 contractors.
- Maintain high-quality operational documentation, SOPs, and architectural diagrams.
Your experience should include...
- 5+ years in SRE/production operations or platform engineering supporting large-scale, mission-critical systems; experience with network/security platforms.
- Expert-level SaltStack for configuration management and automation; experience with similar tools (e.g., Puppet, Ansible).
- Strong Linux administration and troubleshooting; deep understanding of TCP/IP, routing, L4–L7, and load balancing.
- Proficiency in Python for automation, tooling, and integrations; familiarity with software engineering practices (code reviews, testing, CI/CD).
- Observability at scale: Icinga, Grafana, InfluxDB, and rsyslog.
- Git-based workflows and Infrastructure as Code concepts.
- Proven effectiveness in a 24/7 operations environment with on-call responsibilities and incident management.
- Excellent technical writing and documentation skills; ability to lead and mentor contractors.
You may also have…
- Hands-on administration of IDS/IPS and DDoS platforms (TrendMicro TippingPoint, Suricata, NetScout/Arbor Sightline/TMS).
- Experience with HAProxy and Nginx; Juniper and Palo Alto administration.
- Bachelor’s degree in Computer Science, Information Technology, or related field.
- Industry certifications (Security+, CISSP, Linux+).
- Experience in web hosting, ISP, or managed service provider environments.
- Background in incident response, change management, and compliance audits.
- Experience operating hybrid cloud/on-premises infrastructure; understanding of WebTrust and PCI-DSS operational requirements.
We've got your back...
We offer a range of total rewards that may include paid time off, retirement savings (e.g., 401k, pension schemes), bonus/incentive eligibility, equity grants, participation in our employee stock purchase plan, competitive health benefits, and other family-friendly benefits including parental leave. GoDaddy’s benefits vary based on individual role and location and can be reviewed in more detail during the interview process.
We also embrace our diverse culture and offer a range of Employee Resource Groups. Have a side hustle? No problem. We love entrepreneurs! Most importantly, come as you are and make your own way.