Staff Software Engineer, Site Reliability (SRE)

1 Hour ago • 5 Years + • DevOps

Job Summary

Job Description

As a founding member of the Site Reliability Engineering (SRE) function at Character.AI, you'll maintain and optimize a large-scale infrastructure supporting millions of daily active users. Responsibilities include ensuring reliability, scalability, and performance; developing monitoring and automation tools (Python, Golang); collaborating with development teams on CI/CD and system design; establishing SLAs/SLOs; managing incidents and outages; and contributing to disaster recovery planning. The goal is to scale the platform to 3 billion users.
Must have:
  • 5+ years DevOps/SRE experience in a large-scale organization
  • Expertise in Python and Golang for automation
  • Experience with SQL, Linux, Kubernetes, Terraform, GCP
  • Troubleshooting across various platforms
  • Incident management and postmortems
Good to have:
  • Familiarity with GPU clusters/HPC
  • Experience with Prometheus and Grafana

Job Details

About the role

As one of the founding members of our Site Reliability Engineering function here at Character, you’ll have the opportunity to support our infrastructure of thousands of nodes, terabytes of data, and millions of daily active users on our site. You’ll be responsible for ensuring our product’s reliability, scalability, and performance as we aggressively grow our user base, with a goal of reaching 3 billion users. You’ll work closely with our development team to design and implement processes and systems that ensure the stability and availability of our service.

What you’ll do

  • Maintain production services and keep them operational.

  • Develop tools, instrumentation, and automation to monitor and optimize the performance and reliability of our service.

  • Develop, implement and maintain automation tools and processes to prevent and mitigate service disruptions.

  • Collaborate with development teams to design and implement scalable, reliable systems and CI/CD processes for deployment.

  • Establish and support SLAs and SLOs for our site.

  • Provide system monitoring and incident alerts.

  • Participate in on-call rotations to provide support for critical incidents and outages.

  • Develop plans for site reliability and disaster recovery.

Who you are

Competitive candidates will have:

  • 5+ years of experience in a development-focused DevOps/SRE role within a technology organization that operates at significant scale

  • Deep experience with and proven success in developing software tools and automation wherever needed using Python and Golang

  • Expertise with SQL, Linux, CI/CD, Kubernetes, and Terraform to support a site/application within a large multi-node infrastructure and a growing user base

  • Experience working with multiple cloud computing platforms, such as GCP, is also a must

  • Demonstrated ability to troubleshoot technical issues and challenges successfully and reliably across a range of platforms and systems

  • Experience with incident management and event postmortems

Outstanding candidates will have one or more of the following:

  • Familiarity with GPU clusters and/or HPC environments

  • Experience with monitoring and logging tools such as Prometheus and Grafana

  • Hands-on experience scaling a consumer product from early days into hypergrowth

About Character.AI

Character.AI empowers people to connect, learn and tell stories through interactive entertainment. Over 20 million people visit Character.AI every month, using our technology to supercharge their creativity and imagination. Our platform lets users engage with tens of millions of characters, enjoy unlimited conversations, and embark on infinite adventures.


In just two years, we achieved unicorn status and were honored as Google Play's AI App of the Year—a testament to our innovative technology and visionary approach.


Join us and be a part of establishing this new entertainment paradigm while shaping the future of Consumer AI!

At Character, we value diversity and welcome applicants from all backgrounds. As an equal opportunity employer, we firmly uphold a policy of non-discrimination on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, or disability. Your unique perspectives are vital to our success.

Compensation Range: $150K - $300K





About The Company

Character is one of the world's leading personal AI platforms. Founded in 2021 by AI pioneers Noam Shazeer and Daniel De Freitas, Character is a full-stack AI company with a globally scaled direct-to-consumer platform. 
