Staff Software Engineer, Site Reliability (SRE)

1 Month ago • 5 Years + • DevOps

Job Summary

Job Description

As a founding member of the Site Reliability Engineering (SRE) function at Character.AI, you'll maintain and enhance infrastructure supporting thousands of nodes, terabytes of data, and millions of daily active users. Responsibilities include ensuring reliability, scalability, and performance as the user base grows towards 3 billion. You'll collaborate with development teams, design and implement processes for stability and availability, develop monitoring and automation tools, establish SLAs/SLOs, and participate in on-call rotations. This role demands experience with production services, software tool development (Python/Golang), SQL, Linux, CI/CD, Kubernetes, Terraform, and multiple cloud platforms (GCP is a must). You will also develop plans for site reliability and disaster recovery.
Must have:
  • 5+ years DevOps/SRE experience in large-scale organizations
  • Proficiency in Python and Golang for automation
  • Expertise with SQL, Linux, CI/CD, Kubernetes, Terraform
  • GCP experience
  • Troubleshooting and incident management skills
Good to have:
  • GPU cluster/HPC experience
  • Experience with Prometheus and Grafana

Job Details

About the role

As one of the founding members of our Site Reliability Engineering function here at Character, you’ll have the opportunity to support our infrastructure of thousands of nodes, terabytes of data, and millions of daily active users. You’ll be responsible for ensuring our product's reliability, scalability, and performance as we aggressively grow our user base toward a goal of 3 billion users. You’ll work closely with our development team to design and implement processes and systems that ensure the stability and availability of our service.

What you’ll do

  • Maintain production services and keep them operational.

  • Develop tools, instrumentation, and automation to monitor and optimize the performance and reliability of our service.

  • Develop, implement and maintain automation tools and processes to prevent and mitigate service disruptions.

  • Collaborate with development teams to design and implement scalable, reliable systems and CI/CD processes for deployment.

  • Establish and support SLAs and SLOs for our site.

  • Provide system monitoring and incident alerts.

  • Participate in on-call rotations to provide support for critical incidents and outages.

  • Develop plans for site reliability and disaster recovery.

Who you are

Competitive candidates will have:

  • 5+ years of experience in a development-focused DevOps/SRE role within a technology organization operating at significant scale

  • Deep experience and proven success developing software tools and automation in Python and Golang

  • Expertise with SQL, Linux, CI/CD, Kubernetes, and Terraform to support a site/application within a large multi-node infrastructure and a growing user base

  • Experience working with multiple cloud computing platforms; GCP experience is a must

  • Demonstrated ability to troubleshoot technical issues reliably across a range of platforms and systems

  • Experience with incident management and event postmortems

Outstanding candidates will have one or more of the following:

  • Familiarity with GPU clusters and/or HPC environments

  • Experience with monitoring and logging tools such as Prometheus and Grafana

  • Hands-on experience scaling a consumer product from early days into hypergrowth

About Character.AI

Character.AI empowers people to connect, learn and tell stories through interactive entertainment. Over 20 million people visit Character.AI every month, using our technology to supercharge their creativity and imagination. Our platform lets users engage with tens of millions of characters, enjoy unlimited conversations, and embark on infinite adventures.


In just two years, we achieved unicorn status and were honored as Google Play's AI App of the Year—a testament to our innovative technology and visionary approach.


Join us and be a part of establishing this new entertainment paradigm while shaping the future of Consumer AI!

At Character, we value diversity and welcome applicants from all backgrounds. As an equal opportunity employer, we firmly uphold a policy of non-discrimination on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, or disability. Your unique perspectives are vital to our success.

Compensation Range: $150K - $350K

About The Company

Character is one of the world's leading personal AI platforms. Founded in 2021 by AI pioneers Noam Shazeer and Daniel De Freitas, Character is a full-stack AI company with a globally scaled direct-to-consumer platform. 
