About Sixtyfour
We build AI research agents that can discover, link, and reason over everything about people and companies. The platform turns that intelligence into automated research workflows for sales, recruiting, and marketing.
About the role
Skills: Kubernetes, Amazon Web Services (AWS)
What you’ll do
- Design and maintain highly available, scalable infrastructure across AWS (ECS, EKS, Lambda, SQS, CloudFront, CloudWatch).
- Architect automated CI/CD pipelines (GitHub Actions, Terraform) with strong testing, observability, and rollback safety.
- Optimize LLM inference infrastructure, including autoscaling GPU/CPU clusters, caching, async queues, batching, and tracing.
- Improve deployment workflows and environment consistency using Docker, IaC, and lightweight configuration management.
- Work on backend performance, including queue throughput, caching strategies, database indexing, and load balancing.
- Monitor, debug, and improve system reliability and latency across all services (API, inference, and web app).
- Build internal tools that enhance developer productivity and operational visibility.
- Partner with engineers to evolve the workflow and job execution engine for better parallelism, retry logic, and observability.
- Set up metrics, tracing, and alerting (OpenTelemetry, Prometheus, Grafana, Sentry) to make reliability measurable and actionable.
Minimum requirements
- Strong experience with cloud infrastructure (AWS preferred), including EC2, ECS, EKS, Lambda, S3, VPCs, networking, and IAM.
- Proficiency with Docker and CI/CD tools such as GitHub Actions or CircleCI.
- Experience scaling Python backend systems and modern web APIs (FastAPI preferred).
- Hands-on experience with API servers and background workers (Celery, Redis queues, etc.).
- Comfort with Postgres and Redis, including schema design, caching, rate limiting, and locks.
- Strong observability mindset, including logs, metrics, and traces.
- Production experience with autoscaling, load testing, and cost-aware resource optimization.
- Excellent debugging and on-call discipline with a focus on uptime and reliability.
Nice to have
- Experience managing LLM serving infrastructure (OpenAI-compatible APIs, vLLM, Triton, or similar).
- Familiarity with Next.js and TypeScript to understand end-to-end deployment pipelines.
- Experience with Terraform, Pulumi, or similar IaC tools.
- Security-focused mindset, including network boundaries, secret management, and RBAC.
- Knowledge of real-time systems (SSE or WebSockets) or stream processing.
- Experience building developer platform tools or internal DevOps systems.
Technology
Language Models, OpenSearch/Elasticsearch, Next.js (TypeScript), Python, FastAPI, AWS, Docker, Celery workers, Playwright, Supabase, Stripe