Software Engineer, Machine Learning Infrastructure

1 Month ago • 4 Years + • DevOps • Artificial Intelligence

Job Summary

Job Description

Character.AI seeks a seasoned Software Engineer specializing in Machine Learning Infrastructure. Responsibilities include providing infrastructure support for ML research and product development, building diagnostic tools for cluster issues and hardware failures, monitoring deployments and experiments, and maximizing GPU utilization. The ideal candidate has 4+ years of experience supporting ML infrastructure, along with experience developing diagnostic tools, working with cloud platforms (e.g., Compute Engine, Kubernetes, Cloud Storage), and working with GPUs. The role involves building and maintaining training and serving infrastructure for ML research within a rapidly growing AI company.
Must have:
  • 4+ years supporting ML infrastructure
  • Experience developing diagnostic tools for ML infrastructure
  • Experience with cloud platforms (Compute Engine, Kubernetes, Cloud Storage)
  • GPU experience
Good to have:
  • Large GPU clusters & high-performance computing/networking
  • Large language model training support
  • ML frameworks (PyTorch/TensorFlow/JAX)
  • GPU kernel development

Job Details

About the role

We’re looking for seasoned ML Infrastructure engineers with experience designing, building and maintaining training and serving infrastructure for ML research.

Responsibilities:

  • Provide infrastructure support to our ML research and product

  • Build tooling to diagnose cluster issues and hardware failures

  • Monitor deployments, manage experiments, and generally support our research

  • Maximize GPU allocation and utilization for both serving and training

Requirements:

  • 4+ years of experience supporting the infrastructure within an ML environment

  • Experience in developing tools used to diagnose ML infrastructure problems and failures

  • Experience with cloud platforms (e.g., Compute Engine, Kubernetes, Cloud Storage)

  • Experience working with GPUs

Nice to have:

  • Experience with large GPU clusters and high-performance computing/networking

  • Experience with supporting large language model training

  • Experience with ML frameworks like PyTorch/TensorFlow/JAX

  • Experience with GPU kernel development

About Character.AI

Character.AI empowers people to connect, learn and tell stories through interactive entertainment. Over 20 million people visit Character.AI every month, using our technology to supercharge their creativity and imagination. Our platform lets users engage with tens of millions of characters, enjoy unlimited conversations, and embark on infinite adventures.


In just two years, we achieved unicorn status and were honored as Google Play's AI App of the Year—a testament to our innovative technology and visionary approach.


Join us and be a part of establishing this new entertainment paradigm while shaping the future of Consumer AI!

At Character, we value diversity and welcome applicants from all backgrounds. As an equal opportunity employer, we firmly uphold a policy of non-discrimination on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, or disability. Your unique perspectives are vital to our success.

Compensation Range: $150K - $350K

About The Company

Character is one of the world's leading personal AI platforms. Founded in 2021 by AI pioneers Noam Shazeer and Daniel De Freitas, Character is a full-stack AI company with a globally scaled direct-to-consumer platform. 
