Staff Software Engineer, Site Reliability (SRE)

1 Month ago • 5 Years + • DevOps

Job Summary

Job Description

As a founding member of the Site Reliability Engineering (SRE) function at Character.AI, you'll maintain and optimize a large-scale infrastructure supporting millions of daily active users. Responsibilities include ensuring reliability, scalability, and performance; developing monitoring and automation tools (Python, Golang); collaborating with development teams on CI/CD and system design; establishing SLAs/SLOs; managing incidents and outages; and contributing to disaster recovery planning. The goal is to scale the platform to 3 billion users.
Must have:
  • 5+ years DevOps/SRE experience in a large-scale organization
  • Expertise in Python and Golang for automation
  • Experience with SQL, Linux, Kubernetes, Terraform, GCP
  • Troubleshooting across various platforms
  • Incident management and postmortems
Good to have:
  • Familiarity with GPU clusters/HPC
  • Experience with Prometheus and Grafana

Job Details

About the role

As one of the founding members of our Site Reliability Engineering function here at Character, you’ll have the opportunity to support an infrastructure spanning thousands of nodes, terabytes of data, and millions of daily active users on our site. You’ll be responsible for ensuring our product's reliability, scalability, and performance as we aggressively grow our user base, with a goal of reaching 3 billion users. You’ll work closely with our development team to design and implement processes and systems that ensure the stability and availability of our service.

What you’ll do

  • Maintain production services and keep them operational.

  • Develop tools, instrumentation, and automation to monitor and optimize the performance and reliability of our service.

  • Develop, implement and maintain automation tools and processes to prevent and mitigate service disruptions.

  • Collaborate with development teams to design and implement scalable, reliable systems and CI/CD processes for deployment.

  • Establish and support SLAs and SLOs for our site.

  • Provide system monitoring and incident alerts.

  • Participate in on-call rotations to provide support for critical incidents and outages.

  • Develop plans for site reliability and disaster recovery.

Who you are

Competitive candidates will have:

  • 5+ years of experience in a development-focused DevOps/SRE role within a technology organization operating at significant scale

  • Deep experience with and proven success in developing software tools and automation wherever needed using Python and Golang

  • Expertise with SQL, Linux, CI/CD, Kubernetes, and Terraform to support a site/application within a large multi-node infrastructure and a growing user base

  • Experience working with multiple cloud computing platforms, such as GCP, is also a must

  • Demonstrated ability to troubleshoot technical issues and challenges successfully and reliably across a range of platforms and systems

  • Experience with incident management and event postmortems

Outstanding candidates will have one or more of the following:

  • Familiarity with GPU clusters and/or HPC environments

  • Experience with monitoring and logging tools such as Prometheus and Grafana

  • Hands-on experience scaling a consumer product from early days into hypergrowth

About Character.AI

Character.AI empowers people to connect, learn and tell stories through interactive entertainment. Over 20 million people visit Character.AI every month, using our technology to supercharge their creativity and imagination. Our platform lets users engage with tens of millions of characters, enjoy unlimited conversations, and embark on infinite adventures.


In just two years, we achieved unicorn status and were honored as Google Play's AI App of the Year—a testament to our innovative technology and visionary approach.


Join us and be a part of establishing this new entertainment paradigm while shaping the future of Consumer AI!

At Character, we value diversity and welcome applicants from all backgrounds. As an equal opportunity employer, we firmly uphold a non-discrimination policy based on race, religion, national origin, gender, sexual orientation, age, veteran status, or disability. Your unique perspectives are vital to our success.

Compensation Range: $150K - $300K

About The Company

Character is one of the world's leading personal AI platforms. Founded in 2021 by AI pioneers Noam Shazeer and Daniel De Freitas, Character is a full-stack AI company with a globally scaled direct-to-consumer platform. 
