Infrastructure Engineer — Systems & Platform

Sixtyfour

1+ Years | San Francisco, California, United States (On Site) | Full Time | 1 month ago


Job Summary

Sixtyfour builds AI research agents that can discover, link, and reason over information about people and companies, automating research workflows for sales, recruiting, and marketing. This role involves designing and maintaining highly available, scalable infrastructure on AWS, architecting CI/CD pipelines, optimizing LLM inference, improving deployment workflows, and enhancing system reliability and developer productivity.

Must Have

Strong experience with cloud infrastructure (AWS preferred) including EC2, ECS, EKS, Lambda, S3, VPCs, networking, and IAM.
Proficiency with Docker and CI/CD tools such as GitHub Actions or CircleCI.
Experience scaling Python backend systems and modern web APIs (FastAPI preferred).
Hands-on experience with API servers and background workers (Celery, Redis queues, etc.).
Comfort with Postgres and Redis, including schema design, caching, rate limiting, and locks.
Strong observability mindset, including logs, metrics, and traces.
Production experience with autoscaling, load testing, and cost-aware resource optimization.
Excellent debugging and on-call discipline with a focus on uptime and reliability.

Good to Have

Experience managing LLM serving infrastructure (OpenAI-compatible APIs, vLLM, Triton, or similar).
Familiarity with Next.js and TypeScript to understand end-to-end deployment pipelines.
Experience with Terraform, Pulumi, or similar IaC tools.
Security-focused mindset, including network boundaries, secret management, and RBAC.
Knowledge of real-time systems (SSE or WebSockets) or stream processing.
Experience building developer platform tools or internal DevOps systems.

Job Description

About Sixtyfour

We build AI research agents that can discover, link, and reason over everything about people and companies. The platform turns that intelligence into automated research workflows for sales, recruiting, and marketing.

About the role

Skills: Kubernetes, Amazon Web Services (AWS)

What you’ll do

Design and maintain highly available, scalable infrastructure across AWS (ECS, EKS, Lambda, SQS, CloudFront, CloudWatch).
Architect automated CI/CD pipelines (GitHub Actions, Terraform) with strong testing, observability, and rollback safety.
Optimize LLM inference infrastructure, including autoscaling GPU/CPU clusters, caching, async queues, batching, and tracing.
Improve deployment workflows and environment consistency using Docker, IaC, and lightweight configuration management.
Work on backend performance, including queue throughput, caching strategies, database indexing, and load balancing.
Monitor, debug, and improve system reliability and latency across all services (API, inference, and web app).
Build internal tools that enhance developer productivity and operational visibility.
Partner with engineers to evolve the workflow and job execution engine for better parallelism, retry logic, and observability.
Set up metrics, tracing, and alerting (OpenTelemetry, Prometheus, Grafana, Sentry) to make reliability measurable and actionable.

Technology

Language Models, OpenSearch/Elasticsearch, Next.js (TypeScript), Python, FastAPI, AWS, Docker, Celery workers, Playwright, Supabase, Stripe

26 Skills Required For This Role

Problem Solving, GitHub, Data Structures, Game Texts, Load Testing, Playwright, Networking, AWS, Load Balancing, Prometheus, Grafana, Terraform, Elasticsearch, CircleCI, Amazon Web Services, FastAPI, Redis, CI/CD, Docker, WebSockets, Kubernetes, Python, Next.js, GitHub Actions, TypeScript, Stripe
