Staff Software Engineer, Site Reliability (SRE)

2 Weeks ago • 5 Years + • DevOps

Job Summary

Job Description

As a founding member of the Site Reliability Engineering (SRE) function at Character.AI, you'll maintain and optimize a large-scale infrastructure supporting millions of daily active users. Responsibilities include ensuring reliability, scalability, and performance; developing monitoring and automation tools (Python, Golang); collaborating with development teams on CI/CD and system design; establishing SLAs/SLOs; managing incidents and outages; and contributing to disaster recovery planning. The goal is to scale the platform to 3 billion users.
Must have:
  • 5+ years DevOps/SRE experience in a large-scale organization
  • Expertise in Python and Golang for automation
  • Experience with SQL, Linux, Kubernetes, Terraform, GCP
  • Troubleshooting across various platforms
  • Incident management and postmortems
Good to have:
  • Familiarity with GPU clusters/HPC
  • Experience with Prometheus and Grafana

Job Details

About the role

As one of the founding members of our Site Reliability Engineering function here at Character, you’ll have the opportunity to support our infrastructure of thousands of nodes, terabytes of data, and millions of daily active users on our site. You’ll be responsible for ensuring our product's reliability, scalability, and performance as we aggressively grow our user base, with a goal of reaching 3 billion users. You’ll work closely with our development team to design and implement processes and systems that ensure the stability and availability of our service.

What you’ll do

  • Maintain production services and keep them operational.

  • Develop tools, instrumentation, and automation to monitor and optimize the performance and reliability of our service.

  • Develop, implement and maintain automation tools and processes to prevent and mitigate service disruptions.

  • Collaborate with development teams to design and implement scalable, reliable systems and CI/CD processes for deployment.

  • Establish and support SLAs and SLOs for our site.

  • Provide system monitoring and incident alerts.

  • Participate in on-call rotations to provide support for critical incidents and outages.

  • Develop plans for site reliability and disaster recovery.

Who you are

Competitive candidates will have:

  • 5+ years of experience in a development-focused DevOps/SRE role within a technology organization operating at significant scale

  • Deep experience with and proven success in developing software tools and automation wherever needed using Python and Golang

  • Expertise with SQL, Linux, CI/CD, Kubernetes, and Terraform to support a site/application within a large multi-node infrastructure and a growing user base

  • Experience working with multiple cloud computing platforms, such as GCP, is also a must

  • Demonstrated ability to troubleshoot technical issues and challenges successfully and reliably across a range of platforms and systems

  • Experience with incident management and event postmortems

Outstanding candidates will have one or more of the following:

  • Familiarity with GPU clusters and/or HPC environments

  • Experience with monitoring and logging tools such as Prometheus and Grafana

  • Hands-on experience scaling a consumer product from early days into hypergrowth

About Character.AI

Character.AI empowers people to connect, learn and tell stories through interactive entertainment. Over 20 million people visit Character.AI every month, using our technology to supercharge their creativity and imagination. Our platform lets users engage with tens of millions of characters, enjoy unlimited conversations, and embark on infinite adventures.


In just two years, we achieved unicorn status and were honored as Google Play's AI App of the Year—a testament to our innovative technology and visionary approach.


Join us and be a part of establishing this new entertainment paradigm while shaping the future of Consumer AI!

At Character, we value diversity and welcome applicants from all backgrounds. As an equal opportunity employer, we firmly uphold a non-discrimination policy based on race, religion, national origin, gender, sexual orientation, age, veteran status, or disability. Your unique perspectives are vital to our success.

Compensation Range: $150K - $300K

About The Company

Character is one of the world's leading personal AI platforms. Founded in 2021 by AI pioneers Noam Shazeer and Daniel De Freitas, Character is a full-stack AI company with a globally scaled direct-to-consumer platform. 
