AI Infra Engineer

Perplexity

California, United States (On-site)

2 Months ago • 3-5 Years • Research Development • $190,000 PA - $250,000 PA

Job Summary

Job Description

Perplexity, a rapidly growing AI-powered answer engine, is seeking an AI Infra Engineer to join its team. The role is a hybrid of SRE and development engineering, focused on building, deploying, and optimizing large-scale AI training and inference clusters. Key responsibilities include managing Kubernetes and Slurm environments, developing APIs for AI workloads, implementing resource scheduling, and improving system performance and observability. The role requires strong expertise in Kubernetes and Slurm, Python and C++ programming for automation, and experience with ML frameworks such as PyTorch in distributed training scenarios.

Must have:

Expert Kubernetes administration
Hands-on Slurm workload management
Experience with distributed training systems
Deep understanding of container orchestration
Proficiency in Python and C++
Experience with PyTorch for distributed training
Strong debugging and monitoring skills

Good to have:

Kubernetes operators for ML
Advanced Slurm administration
GPU cluster management
Experience with TensorFlow
HPC environments knowledge
Infrastructure as Code (Terraform, Ansible)

Perks:

Equity may be part of total compensation
Comprehensive health, dental, and vision insurance
401(k) plan

15 skills required

team-management

problem-solving

resource-allocation

cpp

resource-planning

cuda

networking

yaml

aws

ansible

terraform

pytorch

kubernetes

python

tensorflow

Job Details

Perplexity is an AI-powered answer engine founded in December 2022 and growing rapidly as one of the world’s leading AI platforms. Perplexity has raised over $1B in venture investment from some of the world’s most visionary and successful leaders, including Elad Gil, Daniel Gross, Jeff Bezos, Accel, IVP, NEA, NVIDIA, Samsung, and many more. Our objective is to build accurate, trustworthy AI that powers decision-making for people and assistive AI wherever decisions are being made. Throughout human history, change and innovation have always been driven by curious people. Today, curious people use Perplexity to answer more than 780 million queries every month, a number that’s growing rapidly for one simple reason: everyone can be curious.

We are looking for an AI Infra Engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. As an AI Infra Engineer, you will work in a hybrid SRE/Dev Engineering capacity, partnering closely with our Infrastructure and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.

Responsibilities

Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
Manage and optimize Slurm-based HPC environments for distributed training of large language models
Develop robust APIs and orchestration systems for both training pipelines and inference services
Implement resource scheduling and job management systems across heterogeneous compute environments
Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

Qualifications

Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
Experience with deploying and managing distributed training systems at scale
Deep understanding of container orchestration and distributed systems architecture
High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query/Grouped-Query Attention, distributed training strategies)
Experience managing GPU clusters and optimizing compute resource utilization

Required Skills

Expert-level Kubernetes administration and YAML configuration management
Proficiency with Slurm job scheduling, resource management, and cluster configuration
Python and C++ programming with focus on systems and infrastructure automation
Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
Strong understanding of networking, storage, and compute resource management for ML workloads
Experience developing APIs and managing distributed systems for both batch and real-time workloads
Solid debugging and monitoring skills with expertise in observability tools for containerized environments

Preferred Skills

Experience with Kubernetes operators and custom controllers for ML workloads
Advanced Slurm administration including multi-cluster federation and advanced scheduling policies
Familiarity with GPU cluster management and CUDA optimization
Experience with other ML frameworks like TensorFlow or distributed training libraries
Background in HPC environments, parallel computing, and high-performance networking
Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices
Experience with container registries, image optimization, and multi-stage builds for ML workloads

Required Experience

Demonstrated experience managing large-scale Kubernetes deployments in production environments
Proven track record with Slurm cluster administration and HPC workload management
Previous roles in SRE, DevOps, or Platform Engineering with focus on ML infrastructure
Experience supporting both long-running training jobs and high-availability inference services
Ideally, 3-5 years of relevant experience in ML systems deployment with specific focus on cluster orchestration and resource management

The cash compensation range for this role is $190,000 - $250,000.

Final offer amounts are determined by multiple factors, including experience and expertise, and may vary from the amounts listed above.

Equity: In addition to the base salary, equity may be part of the total compensation package.
Benefits: Comprehensive health, dental, and vision insurance for you and your dependents. Includes a 401(k) plan.
