Vice President of SRE

NSCALE

10+ Years | United Kingdom (On Site) | Full Time | 2 weeks ago

Job Summary

Nscale is seeking a Vice President of Site Reliability Engineering (SRE) to lead its global reliability function. This role involves owning the strategy, execution, and leadership of the SRE organisation, ensuring the GPU-accelerated cloud operates with world-class reliability, observability, and operational excellence. The VP of SRE will build and scale SRE teams, define reliability practices, and drive automation and resilience across infrastructure, working closely with Product, Engineering, Infrastructure, and Operations leadership.

Must Have

Define and execute Nscale’s global SRE strategy, aligning reliability goals with business outcomes.
Build, scale, and lead a world-class SRE organisation, including hiring, mentoring, and developing talent.
Own service reliability frameworks, including SLOs, SLIs, and error budgets.
Drive the design, automation, and operation of infrastructure platforms across bare-metal, OpenStack, Kubernetes, and Slurm environments.
Establish best-in-class incident management practices—minimising MTTR and maximising learning from post-mortems.
Partner with Observability, Infrastructure, and Product teams to deliver 360° visibility.
Guide capacity planning and scaling strategies, ensuring platform resilience.
Champion automation-first principles across provisioning, monitoring, CI/CD, and operational workflows.
Provide executive-level reporting on reliability, operational performance, and capacity.
Stay ahead of industry trends in SRE, automation, and AIOps.
10+ years of experience in SRE, Infrastructure, or Reliability Engineering, including 3+ years in a leadership role.
Proven track record building and leading distributed SRE or infrastructure operations teams.
Deep expertise with Linux systems, Kubernetes, and cloud-native platforms.
Strong background in bare-metal and datacentre operations.
Demonstrated experience in defining and enforcing SLOs/SLIs and error budgets.
Strong knowledge of automation and Infrastructure-as-Code (Terraform, Ansible, Crossplane).
Experience driving observability best practices using Prometheus, Grafana, and related tools.
Skilled communicator with the ability to influence cross-functional teams and report at executive level.

Good to Have

Prior experience with OpenStack (OVN networking, KVM virtualization) or HPC environments (Slurm, RDMA, InfiniBand).
Contributions to open-source communities in SRE, infrastructure, or cloud-native spaces.
Experience embedding secure and compliant operational practices (SOC 2, ISO 27001, GDPR).
Background scaling infrastructure for AI, GPU workloads, or HPC environments.

Job Description

About Nscale

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.

At Nscale, our Engineering team plays a critical role in driving the deployment and subsequent management of our infrastructure and software platforms.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency in an environment where everyone is inspired to do their best work. If you join our team, you’ll help build the technology that powers the future.

About the Role

We are seeking a VP of Site Reliability Engineering (SRE) to lead Nscale’s global reliability function. You will own the strategy, execution, and leadership of our SRE organisation, ensuring our GPU-accelerated cloud operates with world-class reliability, observability, and operational excellence.

You’ll be responsible for building and scaling SRE teams, defining reliability practices, and driving automation and resilience across infrastructure. This is a high-impact role that will partner closely with Product, Engineering, Infrastructure, and Operations leadership to deliver a secure, performant, and reliable platform at hyperscale.

What you'll be doing

Define and execute Nscale’s global SRE strategy, aligning reliability goals with business outcomes.
Build, scale, and lead a world-class SRE organisation, including hiring, mentoring, and developing talent across multiple regions.
Own service reliability frameworks, including SLOs, SLIs, and error budgets, embedding them into engineering culture.
Drive the design, automation, and operation of infrastructure platforms across bare-metal, OpenStack, Kubernetes, and Slurm environments.
Establish best-in-class incident management practices—minimising MTTR and maximising learning from post-mortems.
Partner with Observability, Infrastructure, and Product teams to deliver 360° visibility across GPU clusters, fabrics, and services.
Guide capacity planning and scaling strategies, ensuring platform resilience as Nscale expands globally.
Champion automation-first principles across provisioning, monitoring, CI/CD, and operational workflows.
Provide executive-level reporting on reliability, operational performance, and capacity to senior leadership.
Stay ahead of industry trends in SRE, automation, and AIOps, applying them to Nscale’s infrastructure at scale.

About you

10+ years of experience in SRE, Infrastructure, or Reliability Engineering, including 3+ years in a leadership role.
Proven track record building and leading distributed SRE or infrastructure operations teams.
Deep expertise with Linux systems, Kubernetes, and cloud-native platforms.
Strong background in bare-metal and datacentre operations, including provisioning (PXE, IPMI), networking, and hardware lifecycle.
Demonstrated experience in defining and enforcing SLOs/SLIs and error budgets.
Strong knowledge of automation and Infrastructure-as-Code (Terraform, Ansible, Crossplane).
Experience driving observability best practices using Prometheus, Grafana, and related tools.
Skilled communicator with the ability to influence cross-functional teams and report at executive level.

Preferred Qualifications

Prior experience with OpenStack (OVN networking, KVM virtualization) or HPC environments (Slurm, RDMA, InfiniBand).
Contributions to open-source communities in SRE, infrastructure, or cloud-native spaces.
Experience embedding secure and compliant operational practices (SOC 2, ISO 27001, GDPR).
Background scaling infrastructure for AI, GPU workloads, or HPC environments.

In all we do, our core values guide us.

Relentless Innovation

Ownership and Accountability

Openness and Transparency

Customer-Centric Focus

Sustainability

Full-Speed Collaboration

Equal Opportunities Statement

We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds.

If there’s anything we can do to accommodate your specific situation, please let us know.

The responsibilities outlined in this job description are not exhaustive and are intended to provide a general overview of the position. The employee may be required to perform additional duties, tasks, and responsibilities as assigned by management, consistent with the skills and qualifications required for the role.
