Software Engineer, Machine Learning Infrastructure

3 Months ago • 4 Years + • Devops

Job Summary

Job Description

Character.AI seeks a seasoned Software Engineer for Machine Learning Infrastructure. Responsibilities include providing infrastructure support for ML research and product development, building tooling for diagnosing cluster issues and hardware failures, monitoring deployments, managing experiments, and maximizing GPU allocation. The ideal candidate has 4+ years of experience supporting ML infrastructure, developing diagnostic tools, and working with cloud platforms (Compute Engine, Kubernetes, Cloud Storage) and GPUs. Experience with large GPU clusters, high-performance computing, large language model training, ML frameworks (PyTorch/TensorFlow/JAX), and GPU kernel development is highly desirable.
Must have:
  • 4+ years ML infrastructure support experience
  • Experience developing ML infrastructure diagnostic tools
  • Cloud platform experience (Compute Engine, Kubernetes, Cloud Storage)
  • GPU experience
Good to have:
  • Large GPU cluster & high-performance computing experience
  • Large language model training experience
  • ML framework experience (PyTorch/TensorFlow/JAX)
  • GPU kernel development experience

Job Details

About the role

We’re looking for seasoned ML Infrastructure engineers with experience designing, building and maintaining training and serving infrastructure for ML research.

Responsibilities:

  • Provide infrastructure support for our ML research and product development

  • Build tooling to diagnose cluster issues and hardware failures

  • Monitor deployments, manage experiments, and generally support our research

  • Maximize GPU allocation and utilization for both serving and training

Requirements:

  • 4+ years of experience supporting infrastructure in an ML environment

  • Experience in developing tools used to diagnose ML infrastructure problems and failures

  • Experience with cloud platforms (e.g., Compute Engine, Kubernetes, Cloud Storage)

  • Experience working with GPUs

Nice to have:

  • Experience with large GPU clusters and high-performance computing/networking

  • Experience with supporting large language model training

  • Experience with ML frameworks like PyTorch/TensorFlow/JAX

  • Experience with GPU kernel development

About Character.AI

Character.AI empowers people to connect, learn and tell stories through interactive entertainment. Over 20 million people visit Character.AI every month, using our technology to supercharge their creativity and imagination. Our platform lets users engage with tens of millions of characters, enjoy unlimited conversations, and embark on infinite adventures.


In just two years, we achieved unicorn status and were honored as Google Play's AI App of the Year—a testament to our innovative technology and visionary approach.


Join us and be a part of establishing this new entertainment paradigm while shaping the future of Consumer AI!

At Character, we value diversity and welcome applicants from all backgrounds. As an equal opportunity employer, we firmly uphold a policy of non-discrimination on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, or disability. Your unique perspectives are vital to our success.

Compensation Range: $150K - $350K

About The Company

Character is one of the world's leading personal AI platforms. Founded in 2021 by AI pioneers Noam Shazeer and Daniel De Freitas, Character is a full-stack AI company with a globally scaled direct-to-consumer platform. 
