Distributed Systems Engineer

1 Month ago • All levels • System Design

Job Summary

Job Description

At Krea, we are building next-generation AI creative tools, dedicated to making AI intuitive and controllable for creatives. This role focuses on designing, building, and maintaining robust, reliable, and scalable distributed systems that form the backbone of Krea's infrastructure. These systems support AI research, real-time user experiences, and large-scale model deployments, including managing multi-thousand-node Kubernetes GPU clusters and collaborating with ML engineers.
Must have:
  • Design, build, and maintain large-scale distributed infrastructure.
  • Own and scale multi-thousand-node Kubernetes GPU clusters.
  • Collaborate with ML engineers and researchers to architect systems.
  • Improve network architecture, optimize load balancing, and streamline operational practices.
Good to have:
  • Kubernetes at scale (thousands of nodes)
  • Cloud infrastructure management (AWS/GCP/Azure)
  • High-performance and fault-tolerant networking
  • Low-level Linux interfaces and administration
  • Debugging complex distributed systems in production
  • Python, Golang, Ruby, Rust, and similar systems languages
  • Infrastructure as Code (e.g. Terraform)

Job Details

About Krea

At Krea, we are building next-generation AI creative tools.

We are dedicated to making AI intuitive and controllable for creatives. Our mission is to build tools that empower human creativity, not replace it.

We believe AI is a new medium that allows us to express ourselves through various formats—text, images, video, sound, and even 3D. We're building better, smarter, and more controllable tools to harness this medium.

This job

Robust, reliable, and scalable distributed systems form the backbone of Krea. These systems support the infrastructure that powers our AI research, real-time user experiences, and large-scale model deployments.

As a Distributed Systems Engineer, you will…

  • … design, build, and maintain large-scale distributed infrastructure to reliably support AI research and real-time model serving.
  • … own and scale our multi-thousand-node Kubernetes GPU clusters, ensuring efficient and fault-tolerant operations.
  • … collaborate closely with ML engineers and researchers to architect systems that enable rapid experimentation and deployment.
  • … improve network architecture, optimize load balancing, and streamline operational practices across multi-zone cloud deployments.

Example projects

  • Own and manage a large-scale Kubernetes cluster designed to run extensive ML training and inference workloads.
  • Architect fault-tolerant systems ensuring uninterrupted model training and real-time inference despite individual node failures.
  • Develop and implement optimized load-balancing strategies to efficiently distribute workloads across zones.
  • Create comprehensive monitoring and alerting systems, along with operational playbooks, for high-availability clusters.
  • Migrate existing deployments to Infrastructure as Code (Terraform) for reproducibility and scalability.
  • Set up IP-based rate limiting to prevent GPU abuse.

Strong candidates may have experience with…

  • Kubernetes at scale (thousands of nodes)
  • Cloud infrastructure management (AWS/GCP/Azure)
  • High-performance and fault-tolerant networking
  • Low-level Linux interfaces and administration
  • Debugging complex distributed systems in production
  • Python, Golang, Ruby, Rust, and similar systems languages
  • Bonus: Infrastructure as Code (e.g. Terraform)

About us

  • We’re building AI creative tooling.
  • We’ve raised over $83M from the best investors in Silicon Valley.
  • We’re a team of 12 serving millions of active users, and we’re scaling aggressively.
