Software Engineer, Machine Learning Infrastructure

1 Month ago • 4 Years + • DevOps • Artificial Intelligence

Job Summary

Job Description

Character.AI seeks a seasoned Software Engineer for Machine Learning Infrastructure. Responsibilities include providing infrastructure support for ML research and product development, building tooling for diagnosing cluster issues and hardware failures, monitoring deployments, managing experiments, and maximizing GPU allocation. The ideal candidate has 4+ years of experience supporting ML infrastructure, developing diagnostic tools, and working with cloud platforms (Compute Engine, Kubernetes, Cloud Storage) and GPUs. Experience with large GPU clusters, high-performance computing, large language model training, ML frameworks (PyTorch/TensorFlow/JAX), and GPU kernel development is highly desirable.
Must have:
  • 4+ years ML infrastructure support experience
  • Experience developing ML infrastructure diagnostic tools
  • Cloud platform experience (Compute Engine, Kubernetes, Cloud Storage)
  • GPU experience
Good to have:
  • Large GPU cluster & high-performance computing experience
  • Large language model training experience
  • ML framework experience (PyTorch/TensorFlow/JAX)
  • GPU kernel development experience

Job Details

About the role

We’re looking for seasoned ML Infrastructure engineers with experience designing, building and maintaining training and serving infrastructure for ML research.

Responsibilities:

  • Provide infrastructure support for our ML research and product development

  • Build tooling to diagnose cluster issues and hardware failures

  • Monitor deployments, manage experiments, and generally support our research

  • Maximize GPU allocation and utilization for both serving and training

Requirements:

  • 4+ years of experience supporting the infrastructure within an ML environment

  • Experience in developing tools used to diagnose ML infrastructure problems and failures

  • Experience with cloud platforms (e.g., Compute Engine, Kubernetes, Cloud Storage)

  • Experience working with GPUs

Nice to have:

  • Experience with large GPU clusters and high-performance computing/networking

  • Experience with supporting large language model training

  • Experience with ML frameworks like PyTorch/TensorFlow/JAX

  • Experience with GPU kernel development

About Character.AI

Character.AI empowers people to connect, learn and tell stories through interactive entertainment. Over 20 million people visit Character.AI every month, using our technology to supercharge their creativity and imagination. Our platform lets users engage with tens of millions of characters, enjoy unlimited conversations, and embark on infinite adventures.


In just two years, we achieved unicorn status and were honored as Google Play's AI App of the Year—a testament to our innovative technology and visionary approach.


Join us and be a part of establishing this new entertainment paradigm while shaping the future of Consumer AI!

At Character, we value diversity and welcome applicants from all backgrounds. As an equal opportunity employer, we firmly uphold a policy of non-discrimination on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, or disability. Your unique perspectives are vital to our success.

Compensation Range: $150K - $350K

About The Company

Character is one of the world's leading personal AI platforms. Founded in 2021 by AI pioneers Noam Shazeer and Daniel De Freitas, Character is a full-stack AI company with a globally scaled direct-to-consumer platform. 
