Research Scientist / Engineer – Multimodal Capabilities

All levels • Research Development • $200,000 - $300,000 PA

Job Description

The Multimodal Capabilities team at Luma focuses on unlocking advanced capabilities in our foundation models through strategic research into multimodal understanding and generation. This team tackles fundamental research questions around how different modalities can be combined to enable new behaviors and capabilities, working on the open-ended challenges of what makes multimodal AI systems truly powerful and versatile.
Perks:
  • Offers Equity

Job Details

Responsibilities

  • Identify capability gaps in our foundation models and research solutions to close them
  • Design datasets, experiments, and methodologies to systematically improve model capabilities across vision, audio, and language
  • Develop evaluation frameworks and benchmarking approaches for multimodal AI capabilities
  • Create prototypes and demonstrations that showcase new multimodal capabilities

Experience

  • Strong programming skills in Python and PyTorch
  • Experience with multimodal data processing pipelines and large-scale dataset curation
  • Understanding of computer vision, audio processing, and/or natural language processing techniques
  • (Preferred) Expertise working with interleaved multimodal data
  • (Preferred) Hands-on experience with Vision Language Models, Audio Language Models, or generative video models


About The Company

Palo Alto, California, United States (Hybrid)