Lead DevOps / Platform Engineer

Interface AI

6-9 Years | San Jose, California, United States (On Site) | Full Time | 2 months ago

Job Summary

interface.ai is seeking a Lead Platform Engineer to design, build, and evolve its core AI platform infrastructure. This role focuses on software engineering, infrastructure automation, and platform reliability, enabling product and AI teams to ship faster. Responsibilities include designing developer-facing platforms, defining reliability standards, and scaling complex workloads like LLM orchestration, vector databases, and event-driven systems for products like Sphere, Orbit, and Nexus.

Must Have

Design, implement, and maintain core platform services and internal APIs.
Build internal developer platforms (IDP) that streamline CI/CD, environment provisioning, and observability.
Architect for fault tolerance, auto-scaling, and zero-downtime deployments for distributed microservices and AI pipelines.
Own and extend Terraform/Crossplane configurations to standardize provisioning across environments.
Implement deep observability (OpenTelemetry, Prometheus, Grafana) for tracing, metrics, and proactive alerting.
Manage Kubernetes, Helm, and service mesh (Istio/Linkerd) to ensure secure and efficient service communication.
Build and evolve backend services in Go/Node.js/Python for internal orchestration, configuration, and workload routing.
Collaborate with AI teams to optimize LLM workflows, caching strategies, and retrieval pipelines for low-latency inference.
Write high-quality scripts/tools in Python/Go to automate operational tasks, resilience testing, and rollout management.
6–9 years of engineering experience, with at least 3 years in platform, infrastructure, or DevOps-heavy roles.
Strong proficiency in at least two backend languages (Go, Node.js, or Python).
Hands-on experience with Kubernetes, Helm, Terraform, and declarative infrastructure management.
Deep understanding of distributed systems, container orchestration, and microservice communication.
Proficiency in AWS cloud architecture (EKS, S3, RDS, Lambda, IAM, VPC).
Proven experience with observability and tracing systems (OpenTelemetry, Prometheus, Grafana).
Experience with CI/CD pipeline design (Jenkins, GitHub Actions, ArgoCD, GitOps workflows).
Knowledge of networking, service mesh, and security controls in production-grade environments.
Strong debugging and performance tuning skills; ability to reason about failure modes and resilience.

Good to Have

Exposure to AI/ML or data-intensive systems, including model serving, vector databases, or RAG pipelines.

Perks & Benefits

100% paid health, dental & vision care
401(k) match & financial wellness perks
Discretionary PTO + paid parental leave
Remote-first flexibility
Mental health, wellness & family benefits
A mission-driven team shaping the future of banking

Job Description

About the Role

We are looking for a Lead Platform Engineer to design, build, and evolve our core AI platform infrastructure. This role is at the intersection of software engineering, infrastructure automation, and platform reliability, enabling product and AI teams to ship faster with confidence.

You will design developer-facing platforms, define standards for reliability and observability, and help scale complex workloads like LLM orchestration, vector databases, and event-driven systems.

This is a hands-on role where you’ll shape the foundational components that power our multi-product ecosystem: Sphere (Voice AI), Orbit (Chat AI), and Nexus (Employee Copilot).

What You’ll Do

Platform Architecture: Design, implement, and maintain core platform services and internal APIs for scalable, multi-tenant workloads.
Developer Experience: Build internal developer platforms (IDP) that streamline CI/CD, environment provisioning, and observability across teams.
System Reliability: Architect for fault tolerance, auto-scaling, and zero-downtime deployments for distributed microservices and AI pipelines.
Infrastructure as Code: Own and extend Terraform/Crossplane configurations to standardize provisioning across environments.
Performance & Observability: Implement deep observability (OpenTelemetry, Prometheus, Grafana) for tracing, metrics, and proactive alerting.
Service Orchestration: Manage Kubernetes, Helm, and service mesh (Istio/Linkerd) to ensure secure and efficient service communication.
Platform APIs: Build and evolve backend services in Go/Node.js/Python for internal orchestration, configuration, and workload routing.
AI Platform Integration: Collaborate with AI teams to optimize LLM workflows, caching strategies, and retrieval pipelines for low-latency inference.
Automation: Write high-quality scripts/tools in Python/Go to automate operational tasks, resilience testing, and rollout management.
Cross-Functional Partnership: Work with Product, DevOps, and Security to ensure every platform capability meets performance, compliance, and reliability goals.

What You’ll Bring

6–9 years of engineering experience, with at least 3 years in platform, infrastructure, or DevOps-heavy roles.
Strong proficiency in at least two backend languages (Go, Node.js, or Python).
Hands-on experience with Kubernetes, Helm, Terraform, and declarative infrastructure management.
Deep understanding of distributed systems, container orchestration, and microservice communication.
Proficiency in AWS cloud architecture (EKS, S3, RDS, Lambda, IAM, VPC).
Proven experience with observability and tracing systems (OpenTelemetry, Prometheus, Grafana).
Experience with CI/CD pipeline design (Jenkins, GitHub Actions, ArgoCD, GitOps workflows).
Exposure to AI/ML or data-intensive systems, including model serving, vector databases, or RAG pipelines.
Knowledge of networking, service mesh, and security controls in production-grade environments.
Strong debugging and performance tuning skills; ability to reason about failure modes and resilience.
Excellent collaboration skills, with the ability to partner effectively with developers, product managers, and AI researchers.

Why Join Us

Build core platform systems that power one of the fastest-growing AI companies in fintech.
Shape developer experience, infrastructure standards, and reliability practices for an AI-first ecosystem.
Collaborate with top-tier engineers, AI researchers, and architects on large-scale distributed systems.
Work in a high-trust, fast-growth environment where innovation meets real-world impact.

Compensation

Compensation is expected to be between $170,000 and $200,000. Exact compensation may vary based on skills and location.

At interface.ai, we are committed to providing an inclusive and welcoming environment for all employees and applicants. We celebrate diversity and believe it is critical to our success as a company. We do not discriminate on the basis of race, color, religion, national origin, age, sex, gender identity, gender expression, sexual orientation, marital status, veteran status, disability status, or any other legally protected status. All employment decisions at interface.ai are based on business needs, job requirements, and individual qualifications. We strive to create a culture that values and respects each person's unique perspective and contributions. We encourage all qualified individuals to apply for employment opportunities with interface.ai and are committed to ensuring that our hiring process is inclusive and accessible.
