Staff Software Engineer, Site Reliability (SRE)

1 Month ago • 5 Years + • DevOps

Job Summary

Job Description

As a founding member of the Site Reliability Engineering (SRE) function at Character.AI, you'll support a massive infrastructure (thousands of nodes, terabytes of data, millions of daily active users) with a goal of reaching 3 billion users. Responsibilities include maintaining production services, developing monitoring and automation tools (Python, Golang), implementing CI/CD processes, collaborating with development teams on scalable systems, establishing SLAs/SLOs, providing system monitoring and incident alerts, participating in on-call rotations, and developing disaster recovery plans. You'll work with Kubernetes, Terraform, and multiple cloud platforms (GCP is a must). The role requires troubleshooting across various platforms and handling incident management and postmortems.
Must have:
  • 5+ years DevOps/SRE experience in a large-scale organization
  • Software tool and automation development (Python, Golang)
  • Expertise with SQL, Linux, CI/CD, Kubernetes, Terraform
  • Experience with GCP and troubleshooting across platforms
  • Incident management and postmortems
Good to have:
  • Familiarity with GPU clusters/HPC environments
  • Experience with Prometheus and Grafana

Job Details

About the role

As one of the founding members of our Site Reliability Engineering function here at Character, you’ll have the opportunity to support our infrastructure with thousands of nodes, terabytes of data and millions of daily active users on our site. You’ll be responsible for ensuring our product's reliability, scalability, and performance as we aggressively grow our user base, with a goal of growing to 3 billion users. You’ll work closely with our development team to design and implement processes and systems that ensure the stability and availability of our service.

What you’ll do

  • Maintain production services and keep them operational.

  • Develop tools, instrumentation and automation to monitor and optimize the performance and reliability of our service.

  • Develop, implement and maintain automation tools and processes to prevent and mitigate service disruptions.

  • Collaborate with development teams to design and implement scalable, reliable systems and CI/CD processes for deployment.

  • Establish and support SLAs and SLOs for our site.

  • Provide system monitoring and incident alerts.

  • Participate in on-call rotations to provide support for critical incidents and outages.

  • Develop plans for site reliability and disaster recovery.

Who you are

Competitive candidates will have:

  • 5+ years of experience in a development-focused DevOps/SRE role within a technology organization that has significant scale

  • Deep experience with and proven success in developing software tools and automation wherever needed using Python and Golang

  • Expertise with SQL, Linux, CI/CD, Kubernetes and Terraform to support a site/application within a large multi-node infrastructure and a growing user base

  • Experience working with multiple cloud computing platforms; GCP experience is a must

  • Demonstrated ability to troubleshoot technical issues and challenges successfully and reliably across a range of platforms and systems

  • Experience with incident management and event postmortems

Outstanding candidates will have one or more of the following:

  • Familiarity with GPU clusters and/or HPC environments

  • Experience with monitoring and logging tools such as Prometheus and Grafana

  • Hands-on experience scaling a consumer product from early days into hypergrowth

About Character.AI

Character.AI empowers people to connect, learn and tell stories through interactive entertainment. Over 20 million people visit Character.AI every month, using our technology to supercharge their creativity and imagination. Our platform lets users engage with tens of millions of characters, enjoy unlimited conversations, and embark on infinite adventures.


In just two years, we achieved unicorn status and were honored as Google Play's AI App of the Year—a testament to our innovative technology and visionary approach.


Join us and be a part of establishing this new entertainment paradigm while shaping the future of Consumer AI!

At Character, we value diversity and welcome applicants from all backgrounds. As an equal opportunity employer, we do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, or disability. Your unique perspectives are vital to our success.

Compensation Range: $150K - $350K

About The Company

Character is one of the world's leading personal AI platforms. Founded in 2021 by AI pioneers Noam Shazeer and Daniel De Freitas, Character is a full-stack AI company with a globally scaled direct-to-consumer platform. 
