Staff Software Engineer, Compute

4 Hours ago • All levels • $180,000 PA - $250,000 PA
Software Development & Engineering

Job Description

You are an experienced software engineer who thrives on building large-scale computation platforms. You have deep expertise in backend systems that orchestrate workloads and route requests efficiently while respecting capacity and resource constraints. You have a strong understanding of foundational cloud infrastructure and of Linux provisioning and management tools. You know how to achieve reliability and scale with minimal operational load.

Key responsibilities

  • Develop and maintain our core Python platform, which handles request routing, orchestration of AI workloads, GPU server capacity management, observability, authentication, rate limiting, and more
  • Develop and maintain our infrastructure layer where we use Terraform, Ansible, and provider APIs to manage our fleet of GPU workers
  • Own K8s, FluxCD, Nomad, Prometheus, Thanos, Grafana, Loki, distributed networking storage, and other technologies that underpin our platform
  • Create the vision and lay the foundation for where our infrastructure should go over the next one, two, and five years

Requirements

  • Deep experience building distributed compute platforms, preferably with Python
  • Strong foundation in managing both cloud and bare metal infrastructure
  • Solid understanding of K8s and of running CI/CD on it
  • Excellent communication
  • Self-starter who executes quickly, takes ownership, and constantly seeks improvement

What we offer at fal

  • Interesting and challenging work
  • Employee-friendly equity terms (early exercise, extended exercise)
  • A lot of learning and growth opportunities
  • Visa sponsorship and relocation assistance to San Francisco
  • Health, dental, and vision insurance (US)
  • Regular team events and offsites
