Staff Software Engineer, Site Reliability (SRE)

2 Months ago • 5 Years + • DevOps

Job Summary

Job Description

As a founding member of the Site Reliability Engineering (SRE) function at Character.AI, you'll support a massive infrastructure (thousands of nodes, terabytes of data, millions of daily active users) with a goal of reaching 3 billion users. Responsibilities include maintaining production services, developing monitoring and automation tools (Python, Golang), implementing CI/CD processes, collaborating with development teams on scalable systems, establishing SLAs/SLOs, providing system monitoring and incident alerts, participating in on-call rotations, and developing disaster recovery plans. You'll work with Kubernetes, Terraform, and multiple cloud platforms (GCP is a must). The role requires troubleshooting across various platforms and handling incident management and postmortems.
Must have:
  • 5+ years DevOps/SRE experience in a large-scale organization
  • Software tool and automation development (Python, Golang)
  • Expertise with SQL, Linux, CI/CD, Kubernetes, Terraform
  • Experience with GCP and troubleshooting across platforms
  • Incident management and postmortems
Good to have:
  • Familiarity with GPU clusters/HPC environments
  • Experience with Prometheus and Grafana

Job Details

About the role

As one of the founding members of our Site Reliability Engineering function here at Character, you’ll have the opportunity to support our infrastructure with thousands of nodes, terabytes of data and millions of daily active users on our site. You’ll be responsible for ensuring our product's reliability, scalability, and performance as we aggressively grow our user base, with a goal of reaching 3 billion users. You’ll work closely with our development team to design and implement processes and systems that ensure the stability and availability of our service.

What you’ll do

  • Maintain production services and keep them operational.

  • Develop tools, instrumentation and automation to monitor and optimize the performance and reliability of our service.

  • Develop, implement and maintain automation tools and processes to prevent and mitigate service disruptions.

  • Collaborate with development teams to design and implement scalable, reliable systems and CI/CD processes for deployment.

  • Establish and support SLAs and SLOs for our site.

  • Provide system monitoring and incident alerts.

  • Participate in on-call rotations to provide support for critical incidents and outages.

  • Develop plans for site reliability and disaster recovery.

Who you are

Competitive candidates will have:

  • 5+ years of experience in a development-focused DevOps/SRE role within a technology organization that has significant scale

  • Deep experience with and proven success in developing software tools and automation wherever needed using Python and Golang

  • Expertise with SQL, Linux, CI/CD, Kubernetes and Terraform to support a site/application within a large multi-node infrastructure and a growing user base

  • Experience working with multiple cloud computing platforms; GCP is a must

  • Demonstrated ability to troubleshoot technical issues and challenges successfully and reliably across a range of platforms and systems

  • Experience with incident management and event postmortems

Outstanding candidates will have one or more of the following:

  • Familiarity with GPU clusters and/or HPC environments

  • Experience with monitoring and logging tools such as Prometheus and Grafana

  • Hands-on experience scaling a consumer product from early days into hypergrowth

About Character.AI

Character.AI empowers people to connect, learn and tell stories through interactive entertainment. Over 20 million people visit Character.AI every month, using our technology to supercharge their creativity and imagination. Our platform lets users engage with tens of millions of characters, enjoy unlimited conversations, and embark on infinite adventures.


In just two years, we achieved unicorn status and were honored as Google Play's AI App of the Year—a testament to our innovative technology and visionary approach.


Join us and be a part of establishing this new entertainment paradigm while shaping the future of Consumer AI!

At Character, we value diversity and welcome applicants from all backgrounds. As an equal opportunity employer, we firmly uphold a policy of non-discrimination on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, or disability. Your unique perspectives are vital to our success.

Compensation Range: $150K - $350K

About The Company

Character is one of the world's leading personal AI platforms. Founded in 2021 by AI pioneers Noam Shazeer and Daniel De Freitas, Character is a full-stack AI company with a globally scaled direct-to-consumer platform. 