Senior Site Reliability Engineer

DraftKings

Job Summary

As a Senior Site Reliability Engineer at DraftKings, you will be instrumental in building and scaling critical infrastructure across global data centers, multiple cloud platforms, and on-premise systems. You will design automation-first solutions to enhance performance, eliminate operational friction, and drive stability at scale. This role involves influencing architectural decisions, developing tools for rapid and reliable delivery, and ensuring high uptime and Quality of Service for internal customers.

Must Have

  • Drive stability and scalability across the global compute platform.
  • Implement automation for self-healing, fault-tolerant infrastructure.
  • Develop internal tools to eliminate repetitive tasks.
  • Establish critical performance and reliability metrics for infrastructure.
  • Ensure the highest uptime and Quality of Service for internal customers.
  • Support technical growth by sharing knowledge and participating in design discussions.
  • Participate in on-call rotation, incident reviews, and Root Cause Analysis (RCA) reporting.
  • At least 4 years of experience managing distributed cloud environments.
  • Deep expertise in container orchestration with Kubernetes.
  • Strong experience developing software for automation and infrastructure tooling (e.g., in Go or Python).
  • Kubernetes administration experience (installation, configuration, troubleshooting).
  • Working knowledge of networking and Linux-based systems (container runtimes such as Docker and containerd, packet-level debugging, kernel troubleshooting).
  • Experience with Infrastructure as Code (IaC) and configuration management tools (Terraform, Chef, Pulumi).

Job Description

At DraftKings, AI is becoming an integral part of both our present and future, powering how work gets done today, guiding smarter decisions, and sparking bold ideas. It’s transforming how we enhance customer experiences, streamline operations, and unlock new possibilities. Our teams are energized by innovation and readily embrace emerging technology. We’re not waiting for the future to arrive. We’re shaping it, one bold step at a time. To those who see AI as a driver of progress, come build the future together.

The Crown Is Yours

As a Senior Site Reliability Engineer, you'll build and scale the critical infrastructure behind every product. In this role, you'll take on complex challenges across global data centers, multiple cloud platforms, and on-premise systems, designing automation-first solutions that elevate performance and eliminate operational friction. You'll be trusted to drive stability at scale, influence architectural decisions, and build tools that empower our teams to move fast and deliver reliably. This is where your impact won't just be felt; it'll be foundational.

What You'll Do

  • Drive stability and scalability across our global compute platform spanning numerous data centers, multiple public clouds, and on-premise environments, serving as the foundation for all our products.
  • Implement automation for self-healing, fault-tolerant infrastructure using declarative configurations and event-driven workflows, and develop internal tools to eliminate repetitive tasks.
  • Establish critical performance and reliability metrics for infrastructure platform components.
  • Ensure the highest level of uptime and Quality of Service (QoS) for internal customers through operational excellence.
  • Support technical growth by sharing knowledge, participating in design discussions, and contributing to a collaborative team culture.
  • Participate in an on-call rotation, incident reviews, root cause identification, and Root Cause Analysis (RCA) reporting.

What You'll Bring

  • Bachelor's degree in Computer Science or an equivalent combination of relevant education, experience, and training.
  • At least 4 years of experience managing distributed cloud environments, along with platform automation at scale.
  • Deep expertise in container orchestration with Kubernetes, including the ability to design, scale, and troubleshoot complex workloads.
  • Strong experience developing software for automation and infrastructure tooling in languages such as Go and Python.
  • Kubernetes administration experience, including installation, configuration, and troubleshooting.
  • Working knowledge of networking and Linux-based systems, including container runtimes such as Docker and containerd, packet-level debugging, and kernel troubleshooting.
  • Experience with Infrastructure as Code (IaC) and configuration management tools, such as Terraform, Chef, and Pulumi, to ensure scalable and repeatable infrastructure provisioning.
  • Creative problem-solving skills and excellent communication.

10 Skills Required For This Role

Communication, Problem Solving, Game Texts, Networking, Linux, Terraform, Chef, Docker, Kubernetes, Python