Software Engineer, Machine Learning Infrastructure

4 Months ago • 4+ Years • DevOps

Job Summary

Job Description

Character.AI seeks a seasoned Software Engineer for Machine Learning Infrastructure. Responsibilities include providing infrastructure support for ML research and product development, building tooling to diagnose cluster issues and hardware failures, monitoring deployments, managing experiments, and maximizing GPU allocation and utilization. The ideal candidate has 4+ years of experience supporting ML infrastructure, developing diagnostic tools, and working with cloud platforms (Compute Engine, Kubernetes, Cloud Storage) and GPUs. Experience with large GPU clusters, high-performance computing, large language model training, ML frameworks (PyTorch/TensorFlow/JAX), and GPU kernel development is highly desirable.
Must have:
  • 4+ years of ML infrastructure support experience
  • Experience developing ML infrastructure diagnostic tools
  • Cloud platform experience (Compute Engine, Kubernetes, Cloud Storage)
  • GPU experience
Good to have:
  • Large GPU cluster & high-performance computing experience
  • Large language model training experience
  • ML framework experience (PyTorch/TensorFlow/JAX)
  • GPU kernel development experience

Job Details

About the role

We’re looking for seasoned ML Infrastructure engineers with experience designing, building and maintaining training and serving infrastructure for ML research.

Responsibilities:

  • Provide infrastructure support for our ML research and product development

  • Build tooling to diagnose cluster issues and hardware failures

  • Monitor deployments, manage experiments, and generally support our research

  • Maximize GPU allocation and utilization for both serving and training

Requirements:

  • 4+ years of experience supporting infrastructure in an ML environment

  • Experience developing tools to diagnose ML infrastructure problems and failures

  • Experience with cloud platforms (e.g., Compute Engine, Kubernetes, Cloud Storage)

  • Experience working with GPUs

Nice to have:

  • Experience with large GPU clusters and high-performance computing/networking

  • Experience with supporting large language model training

  • Experience with ML frameworks like PyTorch/TensorFlow/JAX

  • Experience with GPU kernel development

About Character.AI

Character.AI empowers people to connect, learn and tell stories through interactive entertainment. Over 20 million people visit Character.AI every month, using our technology to supercharge their creativity and imagination. Our platform lets users engage with tens of millions of characters, enjoy unlimited conversations, and embark on infinite adventures.


In just two years, we achieved unicorn status and were honored as Google Play's AI App of the Year—a testament to our innovative technology and visionary approach.


Join us and be a part of establishing this new entertainment paradigm while shaping the future of Consumer AI!

At Character, we value diversity and welcome applicants from all backgrounds. As an equal opportunity employer, we firmly uphold a policy of non-discrimination on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, or disability. Your unique perspectives are vital to our success.

Compensation Range: $150K - $350K

Similar Jobs

Ettain Group - Customer Support Engineer

Ettain Group

Richardson, Texas, United States (On-Site)
10 Years ago
quience - Senior Recruiter

quience

United States (Remote)
1 Month ago
Brillio - Senior Data Specialist- R01531001

Brillio

Bengaluru, Karnataka, India (Hybrid)
9 Months ago
Ion - Senior Windows Engineer

Ion

Jersey City, New Jersey, United States (On-Site)
6 Months ago
bytedance - Research Engineer (Machine Learning Training System) - 2025 Start

bytedance

Singapore (On-Site)
9 Months ago
attentive - Staff Site Reliability Engineer

attentive

United States (Remote)
1 Month ago
bytedance - Software Engineer - Edge Cloud Infrastructure

bytedance

Singapore (On-Site)
2 Months ago
Jane Street - Facilities Automation Engineer

Jane Street

New York, United States (On-Site)
2 Months ago
CyberArk - Senior DevOps Engineer

CyberArk

United States (On-Site)
2 Months ago
Shield AI - Sales Solutions Engineer, APAC (R3660)

Shield AI

Seoul, South Korea (On-Site)
1 Week ago

Similar Skill Jobs

gyb games - Senior Game Developer (Casual)

gyb games

Istanbul, İstanbul, Türkiye (On-Site)
3 Months ago
Scopely - Senior Client Engineer - Star Trek Fleet Command

Scopely

Dublin, County Dublin, Ireland (Hybrid)
4 Months ago
Sonar Source - Mid-Market Account Executive - DACH

Sonar Source

London, England, United Kingdom (On-Site)
5 Months ago
Interface AI - Senior Strategic Partner Manager

Interface AI

United States (Remote)
4 Months ago
Wrike - CloudOps Team Lead

Wrike

Estonia (Remote)
1 Week ago
NCR Atleos - SW Dev Ops Engineer II

NCR Atleos

Hyderabad, Telangana, India (On-Site)
2 Years ago
DMG - Senior Staff Engineer

DMG

Bengaluru, Karnataka, India (On-Site)
6 Months ago
Alphawave Semi - Analog Design Engineer

Alphawave Semi

Vancouver, British Columbia, Canada (On-Site)
2 Months ago
DevRev - Sales Development Representative

DevRev

Chennai, Tamil Nadu, India (On-Site)
2 Months ago
Capgemini - Powerflex Engineer

Capgemini

Gurugram, Haryana, India (On-Site)
3 Months ago

Jobs in New York, New York, United States

bytedance - Backend Software Engineer Intern

bytedance

Seattle, Washington, United States (On-Site)
2 Months ago
Yodlee - Technical Architect - Credit & Analytics Domain

Yodlee

Raleigh, North Carolina, United States (Remote)
3 Months ago
Canva - Customer Success Manager

Canva

Austin, Texas, United States (Remote)
1 Month ago
Match Group - Sr. Software Development Engineer in Test (SDET) – iOS & Android

Match Group

West Hollywood, California, United States (Hybrid)
1 Month ago
Apple - Coatings Engineer | Materials Product Design

Apple

Cupertino, California, United States (On-Site)
1 Month ago
Crunchyroll - People Experience Communications Manager

Crunchyroll

Culver City, California, United States (Hybrid)
5 Months ago
Thatgamecompany - Social Content & Growth Associate (Contract)

Thatgamecompany

United States (Remote)
4 Months ago
Blinkhealth - Supervisor, Pharmacy Operations (Claims and Patient Outreach)

Blinkhealth

Pittsburgh, Pennsylvania, United States (On-Site)
2 Months ago
Mixpanel - Senior Software Engineer, Mobile

Mixpanel

San Francisco, California, United States (Remote)
2 Weeks ago
Reddit - Senior Client Partner, Mid-Market (App Dev - Acquisitions)

Reddit

San Francisco, California, United States (On-Site)
2 Months ago

Devops Jobs

Attio - Site Reliability Engineer

Attio

London, England, United Kingdom (Hybrid)
2 Weeks ago
Capgemini - Selenium Java + Azure DevOps

Capgemini

Mumbai, Maharashtra, India (On-Site)
2 Months ago
Interactive Brokers - Platform Engineer - Support

Interactive Brokers

Mumbai, Maharashtra, India (On-Site)
2 Months ago
Mistral AI - AI Solution Architect

Mistral AI

Paris, Île-de-France, France (On-Site)
2 Weeks ago
Tesla - Distributed Systems Engineer, Autobidder Platform

Tesla

North Holland, Netherlands (On-Site)
5 Months ago
Intel - Senior Infrastructure Engineer - Linux OS

Intel

Phoenix, Arizona, United States (On-Site)
1 Month ago
Argus - Software Engineer (Infrastructure/Backend)

Argus

(Remote)
4 Months ago
miniclip - Cloud Infrastructure Engineer - Cloud Engineer II

miniclip

Lisbon, Lisbon, Portugal (On-Site)
2 Months ago
Palo Alto Networks - Program Coordinator (Cloud Service Providers)

Palo Alto Networks

United Kingdom (Remote)
2 Months ago
illumio - Senior Software Engineer, Kubernetes

illumio

Sunnyvale, California, United States (On-Site)
2 Weeks ago

About The Company

Character is one of the world's leading personal AI platforms. Founded in 2021 by AI pioneers Noam Shazeer and Daniel De Freitas, Character is a full-stack AI company with a globally scaled direct-to-consumer platform. 

Redwood City, California, United States (Hybrid)

Redwood City, California, United States (On-Site)

Redwood City, California, United States (On-Site)

Redwood City, California, United States (On-Site)

New York, New York, United States (On-Site)

San Francisco, California, United States (On-Site)

Palo Alto, California, United States (On-Site)

San Francisco, California, United States (On-Site)

San Francisco, California, United States (On-Site)

San Francisco, California, United States (On-Site)