Distributed Systems Engineer

10 Minutes ago • All levels • System Design

Job Description

At Krea, we are building next-generation AI creative tools, dedicated to making AI intuitive and controllable for creatives. This role focuses on designing, building, and maintaining robust, reliable, and scalable distributed systems that form the backbone of Krea's infrastructure. These systems support AI research, real-time user experiences, and large-scale model deployments. The work includes managing multi-thousand-node Kubernetes GPU clusters and collaborating with ML engineers.
Must have:
  • Design, build, and maintain large-scale distributed infrastructure.
  • Own and scale multi-thousand-node Kubernetes GPU clusters.
  • Collaborate with ML engineers and researchers to architect systems.
  • Improve network architecture, optimize load balancing, and streamline operational practices.
Good to have:
  • Kubernetes at scale (thousands of nodes)
  • Cloud infrastructure management (AWS/GCP/Azure)
  • High-performance and fault-tolerant networking
  • Low-level Linux interfaces and administration
  • Debugging complex distributed systems in production
  • Python, Golang, Ruby, Rust, and similar systems languages
  • Infrastructure as Code (e.g. Terraform)

Job Details

About Krea

At Krea, we are building next-generation AI creative tools.

We are dedicated to making AI intuitive and controllable for creatives. Our mission is to build tools that empower human creativity, not replace it.

We believe AI is a new medium that allows us to express ourselves through various formats—text, images, video, sound, and even 3D. We're building better, smarter, and more controllable tools to harness this medium.

This job

Robust, reliable, and scalable distributed systems form the backbone of Krea. These systems support the infrastructure that powers our AI research, real-time user experiences, and large-scale model deployments.

As a Distributed Systems Engineer, you will…

  • … design, build, and maintain large-scale distributed infrastructure to reliably support AI research and real-time model serving.
  • … own and scale our multi-thousand-node Kubernetes GPU clusters, ensuring efficient and fault-tolerant operations.
  • … collaborate closely with ML engineers and researchers to architect systems that enable rapid experimentation and deployment.
  • … improve network architecture, optimize load balancing, and streamline operational practices across multi-zone cloud deployments.

Example projects

  • Own and manage a large-scale Kubernetes cluster designed to run extensive ML training and inference workloads.
  • Architect fault-tolerant systems ensuring uninterrupted model training and real-time inference despite individual node failures.
  • Develop and implement optimized load-balancing strategies to efficiently distribute workloads across zones.
  • Create comprehensive monitoring and alerting systems, along with operational playbooks, for high-availability clusters.
  • Migrate existing deployments to Infrastructure as Code (Terraform) for reproducibility and scalability.
  • Set up IP-based rate limiting to prevent GPU abuse.

Strong candidates may have experience with…

  • Kubernetes at scale (thousands of nodes)
  • Cloud infrastructure management (AWS/GCP/Azure)
  • High-performance and fault-tolerant networking
  • Low-level Linux interfaces and administration
  • Debugging complex distributed systems in production
  • Python, Golang, Ruby, Rust, and similar systems languages
  • Bonus: Infrastructure as Code (e.g. Terraform)

About us

  • We’re building AI creative tooling.
  • We’ve raised over $83M from the best investors in Silicon Valley.
  • We’re a team of 12 serving millions of active users, and we’re scaling aggressively.

About The Company

San Francisco, California, United States (On-Site)
