Senior Research Scientist/Engineer - AI Infrastructure

1 Month ago • All levels • Devops • $184,000 PA - $337,000 PA

Job Summary

Job Description

The role involves designing, building, and maintaining robust AI infrastructure for training and serving large ML workloads. Responsibilities include leading the design of scalable architectures, implementing service-oriented architectures, profiling and optimizing the ML stack, building and operating large-scale distributed deployment systems, architecting data ingestion pipelines, integrating experiment management tools, and collaborating with researchers. The ideal candidate is proficient in infrastructure design, performance optimization, distributed systems, and data pipeline engineering, and will also mentor engineers on best practices.
Must have:
  • Expertise in infrastructure design and architecture.
  • Experience in performance optimization.
  • Skills in building distributed systems and scalability.
  • Knowledge of data pipeline and workflow engineering.
Perks:
  • Medical, dental, and vision insurance.
  • 401(k) savings plan with company match.
  • Paid parental leave.
  • Short-term and long-term disability coverage.
  • Life insurance.
  • Wellbeing benefits.
  • 10 paid holidays per year.
  • 10 paid sick days per year.
  • 17 days of Paid Personal Time (prorated).

Job Details

Team Introduction: The infra4AI Research and Architecture Team is responsible for the foundational hardware and software systems specifically engineered to support the demanding and often experimental workloads of developing new artificial intelligence models and systems. It serves as the bedrock upon which researchers and engineers create, train, test, and iterate on novel AI architectures, from large language models (LLMs) to specialized neural networks.

We are seeking highly skilled and motivated AI Infrastructure Researchers and Engineers to join our dynamic team. In this role, you will be responsible for designing, building, deploying, and maintaining the robust and scalable infrastructure that powers our cutting-edge artificial intelligence (AI) and machine learning (ML) initiatives. You will work closely with our AI/ML researchers, data scientists, and software engineers to create an efficient, high-performance environment for training, inference, and data processing. Your expertise will be critical in enabling the next generation of AI-driven products and services.

Responsibilities:
The ideal candidate should be an expert in at least one of the following fields to define and design the next-gen AI infrastructure.
Infrastructure Design & Architecture:
  • Lead end-to-end design of scalable, reliable AI infrastructure (AI accelerators, compute clusters, storage, networking) for training and serving large ML workloads.
  • Define and implement service-oriented, containerized architectures (Kubernetes, VM frameworks, unikernels) optimized for ML performance and security.
Performance Optimization:
  • Profile and optimize every layer of the ML stack: ML compiler, GPU/TPU scheduling, NCCL/RDMA networking, data preprocessing, and training/inference frameworks.
  • Develop low-overhead telemetry and benchmarking frameworks to identify and eliminate bottlenecks in distributed training and serving.
Distributed Systems & Scalability:
  • Build and operate large-scale deployment and orchestration systems that auto-scale across multiple data centers (on-premises and cloud).
  • Champion fault tolerance, high availability, and cost efficiency through smart resource management and workload placement.
Data Pipeline & Workflow Engineering:
  • Architect and implement robust ETL and data ingestion pipelines (Spark/Beam/Dask/Flume) tailored for petabyte-scale ML datasets.
  • Integrate experiment management and workflow orchestration tools (Airflow, Kubeflow, Metaflow) to streamline research-to-production.
Collaboration & Mentorship:
  • Partner with ML researchers to translate prototype requirements into production-grade systems.
  • Mentor and coach engineers on best practices in performance tuning, systems design, and reliability engineering.

About The Company

Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.