Platform Engineer - Reliability & Scale

LangChain

5+ Years | On Site | Full Time | 1 day ago

Job Summary

LangChain's mission is to make intelligent agents ubiquitous by providing an agent engineering platform and open-source frameworks like LangChain and LangGraph, which see over 90 million downloads monthly. LangSmith offers observability, evaluation, and deployment for LLM systems. This role on the platform engineering team involves scaling the LangSmith and LangGraph Platform products, architecting and operating critical systems for AI observability and app deployments, and working with cutting-edge AI and distributed systems technologies.

Must Have

Design and implement high-throughput, data-intensive systems supporting our flagship SaaS products (LangSmith and LangGraph Platform)
Build monitoring, alerting, and automated recovery systems that maintain high uptime
Debug performance bottlenecks, optimize database queries, and architect solutions for distributed system challenges
Influence technical decisions around infrastructure, tooling, and operational practices as we grow from startup to enterprise scale
Participate in the on-call rotation with a focus on post-incident learning, automation, and prevention
5+ years building and operating production systems at scale
Production experience with OSS datastores (PostgreSQL, Redis)
Deep knowledge of cloud object storage, Kubernetes, containerized infrastructure, and cloud platforms (e.g., GCP)
Hands-on experience with observability stacks (Datadog, Prometheus/Grafana, OpenTelemetry or similar)
Strong hands-on software engineering skills (Python, Go, Rust)
"You build it, you run it, you own it" philosophy with a focus on sustainable practices

Good to Have

Knowledge of columnar file and memory formats
Proficiency with analytical databases
Background in high-growth startups
Previous experience in AI infrastructure

Perks & Benefits

Competitive compensation that includes base salary and meaningful equity
Health and dental coverage
Flexible vacation
401(k) plan
Life insurance
Locally competitive benefits aligned with regional norms and regulations (for EU and UK team members)

Job Description

About LangChain

At LangChain, our mission is to make intelligent agents ubiquitous. We provide the agent engineering platform and open-source frameworks developers need to ship reliable agents fast.

Our open-source frameworks, LangChain and LangGraph, see over 90 million downloads per month and help developers build agents with speed and granular control. LangSmith offers observability, evaluation, and deployment for rapid iteration, enabling teams to transform LLM systems into dependable production experiences.

LangChain is trusted by millions of developers worldwide and powers AI teams at companies like Replit, Clay, Cloudflare, Harvey, Rippling, Vanta, Workday, and more.

About the role

Join our platform engineering team as we scale the LangSmith and LangGraph Platform products. You'll architect and operate the critical systems that power our customers' AI observability and LangGraph app deployments, working directly with cutting-edge technologies at the intersection of AI and distributed systems.

Scale critical systems: Design and implement high-throughput, data-intensive systems supporting our flagship SaaS products (LangSmith and LangGraph Platform)
Drive reliability: Build monitoring, alerting, and automated recovery systems that maintain high uptime
Solve complex problems: Debug performance bottlenecks, optimize database queries, and architect solutions for distributed system challenges
Shape platform strategy: Influence technical decisions around infrastructure, tooling, and operational practices as we grow from startup to enterprise scale
Respond to incidents: Participate in the on-call rotation with a focus on post-incident learning, automation, and prevention

How to be successful in this role

Experience: 5+ years building and operating production systems at scale
Database expertise: Production experience with OSS datastores (PostgreSQL, Redis)
Infrastructure expertise: Deep knowledge of cloud object storage, Kubernetes, containerized infrastructure, and cloud platforms (e.g., GCP)
Observability mastery: Hands-on experience with observability stacks (Datadog, Prometheus/Grafana, OpenTelemetry or similar)
Programming proficiency: Strong hands-on software engineering skills (Python, Go, Rust)
Operational mindset: "You build it, you run it, you own it" philosophy with a focus on sustainable practices

Nice to Have

Knowledge of columnar file and memory formats
Proficiency with analytical databases
Background in high-growth startups
Previous experience in AI infrastructure

Compensation & Benefits

We offer competitive compensation that includes base salary, meaningful equity, and benefits such as health and dental coverage, flexible vacation, a 401(k) plan, and life insurance. Actual compensation will vary based on role, level, and location. For team members in the EU and UK, we provide locally competitive benefits aligned with regional norms and regulations.
Annual salary range: $175,000-$225,000 USD for Senior Engineers
