Infrastructure Support Manager

23 Minutes ago • All levels
IT & Infrastructure

Job Description

Nscale is seeking an Infrastructure Support Manager to lead the daily operations and support of its global datacenter infrastructure. This role involves managing a team of engineers responsible for monitoring, troubleshooting, and incident response for critical GPU, networking, and storage systems across multiple datacenters. The manager will ensure quick incident resolution, continuous infrastructure health monitoring, and consistent adherence to support processes, driving operational excellence and reliability.
Good To Have:
  • Experience in AI/ML or high-performance computing datacenter support.
Must Have:
  • Lead and manage a team of infrastructure support engineers.
  • Oversee daily monitoring and support of GPU, networking, and storage systems.
  • Ensure rapid and effective incident response, escalation, and resolution.
  • Develop and maintain support processes, runbooks, and escalation procedures.
  • Collaborate with engineering, buildout, and operations teams.
  • Conduct root cause analysis and implement preventative measures.
  • Track and report on support metrics (SLAs, uptime, MTTR, incident volume).
  • Drive adoption of monitoring, observability, and automation tools.
  • Mentor and develop team members.
  • Participate in the on-call rotation.
  • Proven experience in datacenter infrastructure support or operations management.
  • Strong technical knowledge of servers, GPUs, networking, and storage systems.
  • Solid understanding of monitoring and observability practices and tools.
  • Experience leading support teams in mission-critical 24/7 environments.
  • Excellent troubleshooting and problem-solving skills with a focus on root cause analysis.
  • Familiarity with ITIL or other support frameworks.
  • Strong leadership, communication, and coaching skills.
Perks:
  • Collaborative, supportive, and innovative environment.
  • Highly competitive package (base + equity).
  • Reviews every 12 months.
  • Join the fastest-growing tech startup.
  • Dynamic progression plan tailored to ambitions.
  • Human-First Flexibility and autonomy.
  • Thriving remote-first team with seamless virtual collaboration.

About Nscale

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you’ll be contributing to building the technology that powers the future.

About the Role (Job Purpose)

Nscale is seeking an Infrastructure Support Manager to lead the daily operations and support of our global datacenter infrastructure. This role will manage a team of engineers providing monitoring, troubleshooting, and incident response for mission-critical GPU, networking, and storage systems across multiple datacenters.

You will ensure that incidents are resolved quickly, infrastructure health is continuously monitored, and support processes are followed consistently. This leadership role is key to guaranteeing operational excellence and reliability across Nscale’s datacenter footprint.

What You’ll be Doing (Responsibilities)

  • Lead and manage a team of infrastructure support engineers at our London datacenter site.
  • Oversee daily monitoring and support of GPU, networking, and storage systems.
  • Ensure rapid and effective incident response, escalation, and resolution.
  • Develop and maintain support processes, runbooks, and escalation procedures.
  • Collaborate with engineering, buildout, and operations teams to improve reliability and reduce recurring issues.
  • Conduct root cause analysis and implement preventative measures for critical incidents.
  • Track and report on support metrics (SLAs, uptime, MTTR, incident volume) to leadership.
  • Drive adoption of monitoring, observability, and automation tools across the team.
  • Mentor and develop team members, fostering a culture of operational excellence.
  • Participate in the on-call rotation and ensure adequate coverage across regions.

The responsibilities outlined in this job description are not exhaustive and are intended to provide a general overview of the position. The employee may be required to perform additional duties, tasks, and responsibilities as assigned by management, consistent with the skills and qualifications required for the role.

About You (Skills / Qualifications / Experience)

  • Proven experience in datacenter infrastructure support or operations management.
  • Strong technical knowledge of servers, GPUs, networking, and storage systems.
  • Solid understanding of monitoring and observability practices and tools (e.g., Prometheus, Grafana, Datadog).
  • Experience leading support teams in mission-critical 24/7 environments.
  • Excellent troubleshooting and problem-solving skills with a focus on root cause analysis.
  • Familiarity with ITIL or other support frameworks for incident, problem, and change management.
  • Strong leadership, communication, and coaching skills with the ability to manage global teams.
  • Experience in AI/ML or high-performance computing datacenter support.
  • Ability to collaborate across engineering, operations, and vendor partners.

What We Can Offer You

At Nscale, you'll find a collaborative, supportive, and innovative environment where your contributions spark real impact. We're building something extraordinary, and we want you at the core.

  • Highly competitive package (base + equity) with reviews every 12 months. 🚀
  • Join the fastest-growing tech startup: your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI. ✨
  • Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
  • Human-First Flexibility: We treat you as humans first. 🫶🏽 Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
  • Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.

Equal Opportunities Statement

We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds.

If there’s anything we can do to accommodate your specific situation, please let us know.
