Staff Software Engineer, Site Reliability (SRE)

1 Month ago • 5 Years + • DevOps

Job Summary

Job Description

As a founding member of the Site Reliability Engineering (SRE) function at Character.AI, you'll support a massive infrastructure (thousands of nodes, terabytes of data, millions of daily active users). You'll ensure reliability, scalability, and performance as the user base grows toward a goal of 3 billion users. Responsibilities include maintaining production services, developing monitoring and automation tools (Python, Golang), collaborating with development teams on scalable systems and CI/CD, establishing SLAs and SLOs, providing system monitoring and incident alerts, participating in on-call rotations, and developing disaster recovery plans. You'll work with SQL, Linux, Kubernetes, Terraform, and GCP.
Must have:
  • 5+ years DevOps/SRE experience in large-scale organizations
  • Software tool & automation development (Python, Golang)
  • Expertise with SQL, Linux, CI/CD, Kubernetes, Terraform, GCP
  • Troubleshooting across multiple platforms
  • Incident management and postmortems
Good to have:
  • GPU clusters/HPC experience
  • Prometheus and Grafana experience

Job Details

About the role

As one of the founding members of our Site Reliability Engineering function here at Character, you’ll have the opportunity to support our infrastructure, which spans thousands of nodes, terabytes of data, and millions of daily active users on our site. You’ll be responsible for ensuring our product's reliability, scalability, and performance as we aggressively grow our user base, with a goal of growing to 3 billion users. You’ll work closely with our development team to design and implement processes and systems that ensure the stability and availability of our service.

What you’ll do

  • Maintain production services and keep them operational.

  • Develop tools, instrumentation, and automation to monitor and optimize the performance and reliability of our service.

  • Develop, implement and maintain automation tools and processes to prevent and mitigate service disruptions.

  • Collaborate with development teams to design and implement scalable, reliable systems and CI/CD processes for deployment.

  • Establish and support SLAs and SLOs for our site.

  • Provide system monitoring and incident alerts.

  • Participate in on-call rotations to provide support for critical incidents and outages.

  • Develop plans for site reliability and disaster recovery.

Who you are

Competitive candidates will have:

  • 5+ years of experience in a development-focused DevOps/SRE role within a technology organization that has significant scale

  • Deep experience with and proven success in developing software tools and automation wherever needed using Python and Golang

  • Expertise with SQL, Linux, CI/CD, Kubernetes, and Terraform to support a site/application within a large multi-node infrastructure and a growing user base

  • Experience with multiple cloud computing platforms, including GCP, is also a must

  • Demonstrated ability to troubleshoot technical issues and challenges reliably across a range of platforms and systems

  • Experience with incident management and event postmortems

Outstanding candidates will have one or more of the following:

  • Familiarity with GPU clusters and/or HPC environments

  • Experience with monitoring and logging tools such as Prometheus and Grafana

  • Hands-on experience scaling a consumer product from early days into hypergrowth

About Character.AI

Character.AI empowers people to connect, learn and tell stories through interactive entertainment. Over 20 million people visit Character.AI every month, using our technology to supercharge their creativity and imagination. Our platform lets users engage with tens of millions of characters, enjoy unlimited conversations, and embark on infinite adventures.


In just two years, we achieved unicorn status and were honored as Google Play's AI App of the Year—a testament to our innovative technology and visionary approach.


Join us and be a part of establishing this new entertainment paradigm while shaping the future of Consumer AI!

At Character, we value diversity and welcome applicants from all backgrounds. As an equal opportunity employer, we firmly uphold a policy of non-discrimination on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, or disability. Your unique perspectives are vital to our success.

Compensation Range: $150K - $350K

About The Company

Character is one of the world's leading personal AI platforms. Founded in 2021 by AI pioneers Noam Shazeer and Daniel De Freitas, Character is a full-stack AI company with a globally scaled direct-to-consumer platform. 
