Observability Platform Engineer

5 Minutes ago • 2-5 Years
Devops

Job Description

As an Observability Platform Engineer at Nscale, you will be responsible for designing, building, and managing systems that provide deep visibility into Nscale’s infrastructure and AI workloads. This role involves treating observability as a product, collaborating with engineering and SRE teams to ensure robust, scalable, and user-friendly monitoring, logging, tracing, and alerting platforms. You will ensure infrastructure health, reliability, and performance by enabling proactive insights and reducing operational friction, combining hands-on engineering with an understanding of user needs.
Good To Have:
  • Hands-on experience operating observability infrastructure at scale.
  • Knowledge of Infrastructure-as-Code (e.g. Terraform) to automate deployments.
  • Exposure to streaming systems or pipelines for observability data.
Must Have:
  • Design, build, and support scalable observability infrastructure.
  • Collaborate with teams on observability integration across GPU clusters, Kubernetes, Slurm, and AI services.
  • Implement and refine monitoring and alerting patterns.
  • Automate observability pipelines using IaC tools and scripting.
  • Troubleshoot observability platform issues and support incident remediation.
  • 2-5 years of experience in Software Engineering, SRE, DevOps, or observability-related roles.
  • Proficiency in Python, Go, or Bash.
  • Experience with Kubernetes or containerised environments.
  • Familiarity with on-call responsibilities, triaging, and escalating live production issues.
  • Comfortable with Grafana, Prometheus, Loki, OpenTelemetry, ClickHouse, Elastic, Thanos, VictoriaMetrics.
  • Strong communication and collaboration skills.

About Nscale

Nscale is the GPU cloud engineered for AI. We offer high‑performance, cost‑efficient infrastructure designed for modern AI workloads, blending the power of bespoke supercomputers with the flexibility of cloud services. Our vertically integrated platform spans GPU‑dense, energy‑efficient data centres through Kubernetes and Slurm orchestration to AI‑ready services.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you’ll be contributing to building the technology that powers the future.

About the Role (Job Purpose)

As an Observability Platform Engineer, you will design, build, and manage the systems that surface deep visibility into Nscale’s infrastructure and AI workloads. You’ll treat observability as a product, partnering with engineering and SRE teams to ensure our monitoring, logging, tracing, and alerting platforms are robust, scalable, and easy to use.

This role requires hands-on engineering experience combined with empathy for how other teams consume observability data. You’ll ensure infrastructure health, reliability, and performance by enabling proactive insights and reducing operational friction.

What You’ll Do

  • Design, build, and support scalable observability infrastructure (metrics, logs, traces, alerts).
  • Collaborate with internal teams to embed observability as a seamless product across GPU clusters, Kubernetes, Slurm, and AI services.
  • Implement and refine monitoring and alerting patterns to strengthen system reliability and foster a culture of reliability.
  • Maintain production and pre-production observability clusters and help others adopt best practices.
  • Automate observability pipelines using IaC tools and scripting for repeatability and consistency.
  • Troubleshoot observability platform issues and support incident remediation efforts.
  • Serve as an advocate for observability best practices, training teams on effective usage and instrumentation.

About You

Skills / Experience

  • 2–5 years of experience in Software Engineering, SRE, DevOps, or observability-related roles.
  • Proficiency in at least one scripting or programming language (Python, Go, Bash).
  • Experience with Kubernetes or containerised environments.
  • Familiarity with on-call responsibilities, triaging, and escalating live production issues.
  • Comfortable with observability tooling such as Grafana, Prometheus, Loki, OpenTelemetry, ClickHouse, Elastic, Thanos, and VictoriaMetrics.
  • Strong communication and collaboration skills, able to empathise with users of observability systems and translate needs into solutions.

Preferred

  • Hands-on experience operating observability infrastructure at scale.
  • Knowledge of Infrastructure-as-Code (e.g. Terraform) to automate deployments.
  • Exposure to streaming systems or pipelines for observability data.

In all we do, our core values guide us:

Relentless Innovation

At Nscale, we constantly push the boundaries of innovation, embracing creative risks to shape the future. Our aim is to deliver products that not only meet but exceed today’s expectations, setting new standards for tomorrow.

Ownership and Accountability

Every Nscaler is fully accountable for their work, driving it with excellence and urgency. We set high standards, ensuring that our contributions are not just good but exceptional.

Openness and Transparency

We believe trust and transparency are key to our success. We maintain open communication within our teams and with stakeholders, sharing both successes and challenges. Our open-source approach allows customers to explore our technology, building trust and ensuring our solutions are innovative, secure, and reliable.

Customer-Centric Focus

Our customers are central to our mission, and we are committed to delivering impactful solutions that drive real-world success. We focus on deeply understanding their needs and challenges, striving to exceed expectations in both product quality and service.

Sustainability

We are dedicated to considering the long-term environmental and societal impacts of our technologies. By integrating sustainability into our operations and product development, we ensure that our innovations are both effective and responsible, contributing positively to the world around us.

Full-Speed Collaboration

Collaboration at Nscale is fast, efficient, and respectful. We work together seamlessly, with clear communication and mutual respect, ensuring our shared goals are met with high standards and impactful outcomes.

Equal Opportunities Statement

We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds.

If there’s anything we can do to accommodate your specific situation, please let us know.

The responsibilities outlined in this job description are not exhaustive and are intended to provide a general overview of the position. The employee may be required to perform additional duties, tasks, and responsibilities as assigned by management, consistent with the skills and qualifications required for the role.
