Software Engineer, LLM Storage System Intern

3 Months ago • All levels • System Design

Job Summary

Job Description

This Software Engineer internship focuses on designing and developing components for large model storage systems. Responsibilities include optimizing the KV Cache and its hit rate, improving data IO, and managing a multi-level storage system that spans media from HBM to remote storage. The role also involves designing data access interfaces and ensuring system stability on Kubernetes. The intern will also handle system setup, disaster recovery, and data placement across clusters in multi-datacenter scenarios. This internship offers an opportunity to contribute to cutting-edge AI technology and work with a world-class research team.
Must have:
  • Design and develop components for machine learning systems.
  • Optimize KV Cache for large model inference.
  • Implement multi-level storage systems.
  • Design efficient and user-friendly data access interfaces.
  • Manage and maintain storage systems in Kubernetes.
  • Handle system setup and disaster recovery in multi-cloud scenarios.

Job Details

About the Team

The ByteDance Doubao Large Model Team was established in 2023. It is dedicated to developing the most advanced AI large model technology in the industry, becoming a world-class research team, and contributing to the development of technology and society. The team has a long-term vision and determination in the field of AI, with research directions covering NLP, CV, speech, and more, and it has laboratories and research positions in China, Singapore, the US, and other locations. Drawing on the platform's ample data, computing, and other resources, the team continuously invests in related fields and has launched self-developed general large models that provide multimodal machine learning capabilities. Downstream, it supports 50+ businesses such as Doubao, Coze, and Dreamina, and it is open to enterprise customers through Volcengine. Currently, the Doubao app is the largest AIGC application in the Chinese market.

Responsibilities

1. Design and develop the storage-related components of machine learning systems, serving the diverse business scenarios of large model inference (LLM, S2S, VLM, multimodal, etc.). This includes model distribution and loading, KV Cache optimization, data IO performance, and improving time to first token (TTFT) and time between tokens (TBT) in LLM serving.
2. Design and implement a multi-level storage system for large model inference. Use a range of media, including HBM, host memory, distributed disk, and remote large-capacity storage systems (HDFS, object storage), for data storage and migration management, forming an integrated hierarchy of "near-compute cache + remote large-capacity storage".
3. Optimize the hit rate of the large model KV Cache, with customized strategies across multiple system dimensions such as the inference framework, traffic scheduling, and multi-level cache. Optimize data IO performance by fully leveraging NVLink, RDMA high-speed networking, and GPU Direct technologies on the near-compute side for efficient data transmission. Optimize the storage strategy for data replicas so that load traffic and stored data are reasonably distributed.
4. Design and implement efficient, user-friendly data access interfaces that integrate seamlessly with the inference framework and manage the lifecycle of the KV Cache.
5. Take responsibility for the integration, management, operation and maintenance, and monitoring of the multi-level storage system on Kubernetes to ensure stability.
6. Handle system setup and disaster recovery in multi-datacenter, multi-region, and multi-cloud scenarios, and optimize data placement across clusters.

Similar Jobs

Guardian - Senior Lead Engineer - IT

Guardian

Chennai, Tamil Nadu, India (Hybrid)
1 Year ago
Balbix - Staff DevOps Engineer

Balbix

Gurugram, India (On-Site)
4 Months ago
Sprinkler - Senior Director- Product Engineering

Sprinkler

Gurugram, Haryana, India (On-Site)
1 Year ago
Ello - Tech Lead, Generative AI & Machine Learning

Ello

San Francisco, California, United States (On-Site)
4 Months ago
Many Chat Inc - Senior Python Engineer (Analytics & Insights Services)

Many Chat Inc

Amsterdam, North Holland, Netherlands (Hybrid)
2 Weeks ago
Capgemini - Application Security Architect (AppSec)

Capgemini

Bengaluru, Karnataka, India (On-Site)
1 Month ago
Apple - Software Engineer - Embedded Systems

Apple

Cupertino, California, United States (On-Site)
2 Months ago
extreme network - STAFF SW SYSTEMS ENGINEER

extreme network

Bengaluru, Karnataka, India (Hybrid)
3 Months ago
NXP - Automotive E/E System Architect

NXP

San Jose, California, United States (Hybrid)
1 Month ago
SoftSwiss - Systems Engineer

SoftSwiss

Poznań, Greater Poland Voivodeship, Poland (Remote)
2 Months ago

Similar Skill Jobs

HappyRobot - QA Engineer

HappyRobot

San Francisco, California, United States (Hybrid)
4 Months ago
Abridge - Senior Platform Engineer

Abridge

San Francisco, California, United States (Hybrid)
2 Months ago
bytedance - Software Engineer Graduate (AIGC Platform - Monetization GenAI) - 2025 Start (PhD)

bytedance

San Jose, California, United States (On-Site)
2 Weeks ago
Qualcomm - DSP Tools Automation Engineer (With expertise in Python and GIT)

Qualcomm

Bengaluru, Karnataka, India (On-Site)
2 Months ago
luxsoft - Technical Lead / Senior Data Engineer

luxsoft

Italy, New York, United States (Remote)
1 Month ago
Thousand Eyes - Senior Software Engineer, Security and Reliability

Thousand Eyes

San Francisco, California, United States (On-Site)
1 Month ago
Ethos Life - Senior Backend Engineer

Ethos Life

Bengaluru, Karnataka, India (On-Site)
3 Months ago
Brillio - Enterprise Architect, AWS - R01535258

Brillio

Bengaluru, Karnataka, India (Hybrid)
9 Months ago
TransUnion - Architect

TransUnion

Pune, Maharashtra, India (Hybrid)
2 Weeks ago
Reltio - Sr Director, Software Engineering-AI/ML

Reltio

Bengaluru, Karnataka, India (Hybrid)
2 Months ago

Jobs in Singapore

Applied materials - Quality Manager

Applied materials

Singapore (On-Site)
2 Weeks ago
bytedance - Software Engineer - Data Engineering (Video Arch)

bytedance

Singapore (On-Site)
9 Months ago
NinjaVan - Class 3 Delivery Staff Driver (Full Time, 10 Footer Lorry)

NinjaVan

Singapore, Singapore (On-Site)
10 Months ago
bytedance - Digital Product Designer - Enterprise Products

bytedance

Singapore (On-Site)
3 Months ago
Enverus - Business Development Representative, South-East Asia

Enverus

Singapore (On-Site)
3 Weeks ago
bytedance - Software Engineer (Distributed Storage), Cloud Infrastructure

bytedance

Singapore (On-Site)
9 Months ago
Argus - Technical Artist (APAC)

Argus

Singapore (Remote)
4 Months ago
hogarth - Content Creator Intern

hogarth

Singapore (On-Site)
2 Months ago
Riot Games - Senior Manager, Game Production - League of Legends

Riot Games

Singapore (On-Site)
7 Months ago
IGG - Unity Programmer Intern

IGG

Singapore (On-Site)
9 Months ago

System Design Jobs

Luxoft - Lead Java Developer (for Trading Application)

Luxoft

Kuala Lumpur, Federal Territory Of Kuala Lumpur, Malaysia (Remote)
8 Months ago
Apple - Software Engineer - Backend Systems (Golang)

Apple

San Diego, California, United States (On-Site)
2 Months ago
Cubic corporation - Associate Systems Support Engineer

Cubic corporation

Salfords, England, United Kingdom (Hybrid)
1 Year ago
Palo Alto Networks - Senior Systems Engineer - Orange EMEAL

Palo Alto Networks

Paris, Île-de-France, France (Remote)
9 Months ago
Regent craft - Embedded Systems Engineer

Regent craft

North Kingstown, Rhode Island, United States (On-Site)
1 Month ago
Illumina - Staff Software Systems Engineer (MES Camstar)

Illumina

Bengaluru, Karnataka, India (On-Site)
1 Month ago
Accenture - Application Developer

Accenture

Mumbai, Maharashtra, India (On-Site)
1 Month ago
AECOM - Electrical Engineer – Power Systems / Federal Projects

AECOM

Roanoke, Virginia, United States (On-Site)
1 Month ago
bytedance - Backend Engineer, Machine Learning Systems - Singapore

bytedance

Singapore (On-Site)
9 Months ago

About The Company

Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.