Multimodal

Multimodal pipelines that feel real-time

Stream audio and visuals, then fuse the signals into a single response loop with tight latency control.

Build multimodal systems that blend audio, vision, and avatar streams while keeping latency low.

3 guides · 4 focus areas · Speech in/out
Starter kit
  • Define latency targets for each stream.
  • Normalize audio sample rates and chunk sizes (see the sketch after this list).
  • Add fallback text responses for degraded streams.
  • Test with low bandwidth and high jitter scenarios.
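
As a starting point for the normalization item, here is a minimal sketch that resamples incoming PCM with NumPy and splits it into fixed-size frames. The 16 kHz mono target and 20 ms chunk size are assumptions; use whatever your transcription and speech models expect.

```python
# Minimal PCM normalization sketch (assumed targets: 16 kHz mono, 20 ms chunks).
import numpy as np

TARGET_RATE = 16_000          # assumed transcription-friendly sample rate
CHUNK_MS = 20                 # assumed chunk duration per network frame
CHUNK_SAMPLES = TARGET_RATE * CHUNK_MS // 1000

def resample_linear(pcm: np.ndarray, src_rate: int) -> np.ndarray:
    """Resample mono float32 PCM to TARGET_RATE with linear interpolation."""
    if src_rate == TARGET_RATE:
        return pcm
    duration = len(pcm) / src_rate
    n_out = int(duration * TARGET_RATE)
    src_times = np.arange(len(pcm)) / src_rate
    dst_times = np.arange(n_out) / TARGET_RATE
    return np.interp(dst_times, src_times, pcm).astype(np.float32)

def chunk(pcm: np.ndarray):
    """Yield fixed-size chunks; the final partial chunk is zero-padded."""
    for start in range(0, len(pcm), CHUNK_SAMPLES):
        block = pcm[start:start + CHUNK_SAMPLES]
        if len(block) < CHUNK_SAMPLES:
            block = np.pad(block, (0, CHUNK_SAMPLES - len(block)))
        yield block

# Usage: normalize a 44.1 kHz capture into 20 ms frames.
capture = np.random.uniform(-1, 1, 44_100).astype(np.float32)  # 1 s of fake audio
frames = list(chunk(resample_linear(capture, 44_100)))
print(len(frames), "frames of", CHUNK_SAMPLES, "samples")
```
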
Focus areas

Audio pipelines

Stream PCM and manage buffering for transcription and speech.
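
A minimal sketch of the buffering side, assuming a drop-oldest policy and a 200 ms cap so stale audio never inflates latency. The class name and limits are illustrative, not a specific library API.

```python
# Bounded PCM buffer sketch for streaming transcription (assumed 200 ms cap,
# drop-oldest overflow policy).
from collections import deque

FRAME_MS = 20
MAX_BUFFERED_MS = 200  # assumed cap: beyond this, old audio is dropped to protect latency

class PcmBuffer:
    def __init__(self):
        self.frames = deque(maxlen=MAX_BUFFERED_MS // FRAME_MS)
        self.dropped = 0

    def push(self, frame: bytes) -> None:
        if len(self.frames) == self.frames.maxlen:
            self.dropped += 1            # oldest frame is evicted automatically
        self.frames.append(frame)

    def drain(self) -> bytes:
        """Hand the transcriber everything buffered so far as one contiguous blob."""
        data = b"".join(self.frames)
        self.frames.clear()
        return data

buf = PcmBuffer()
for _ in range(15):                      # 300 ms of 20 ms frames overflows the cap
    buf.push(b"\x00" * 640)              # 20 ms of 16 kHz 16-bit mono PCM
print(len(buf.drain()), "bytes kept,", buf.dropped, "frames dropped")
```
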

Vision analysis

Fuse image inputs with text instructions for grounded responses.
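
One way to fuse a captured frame with a text instruction, sketched against an OpenAI-style Chat Completions endpoint that accepts image_url content parts. The model name is a placeholder and the helper function is illustrative.

```python
# Sketch: send one image plus a grounding instruction in a single turn.
import base64
from openai import OpenAI

client = OpenAI()

def describe_frame(jpeg_bytes: bytes, instruction: str) -> str:
    """Attach the frame as a data URL alongside the text instruction."""
    data_url = "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever vision-capable model you run
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )
    return response.choices[0].message.content

# Usage: ground the answer in what the camera actually sees.
# answer = describe_frame(frame_bytes, "List the objects on the desk, left to right.")
```
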

Avatar sync

Coordinate audio, frames, and blendshapes in lockstep.
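
A minimal sketch of lockstep release: audio, video frames, and blendshape weights for a timestamp are emitted only once all three streams have arrived. The 20 ms tick and stream layout are assumptions.

```python
# Lockstep scheduler sketch: nothing plays until every stream is ready for the tick.
from dataclasses import dataclass, field

TICK_MS = 20

@dataclass
class SyncBuffer:
    audio: dict[int, bytes] = field(default_factory=dict)
    frames: dict[int, bytes] = field(default_factory=dict)
    blendshapes: dict[int, list[float]] = field(default_factory=dict)
    next_tick: int = 0

    def try_emit(self):
        """Return (audio, frame, blendshapes) for the next tick, or None if any
        stream is still missing data for that timestamp."""
        t = self.next_tick
        if t in self.audio and t in self.frames and t in self.blendshapes:
            self.next_tick += TICK_MS
            return self.audio.pop(t), self.frames.pop(t), self.blendshapes.pop(t)
        return None

buf = SyncBuffer()
buf.audio[0] = b"pcm"
buf.frames[0] = b"jpeg"
print(buf.try_emit())            # None: blendshapes for t=0 not here yet
buf.blendshapes[0] = [0.1] * 52  # e.g. ARKit-style 52 blendshape weights
print(buf.try_emit())            # all three streams aligned, emitted together
```
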

Realtime UX

Design turn-taking and interruption handling for voice systems.
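
A minimal sketch of turn-taking with barge-in, assuming a VAD that reports when the user starts and stops speaking. The state names and callbacks are illustrative, not a specific framework's API.

```python
# Barge-in state machine sketch for a voice agent.
from enum import Enum, auto

class Turn(Enum):
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()

class VoiceSession:
    def __init__(self):
        self.state = Turn.LISTENING

    def on_user_speech(self):
        """Called by the VAD whenever the user starts talking."""
        if self.state is Turn.SPEAKING:
            self.cancel_playback()          # barge-in: stop talking immediately
        self.state = Turn.LISTENING

    def on_user_silence(self):
        """Called when the VAD detects the end of the user's turn."""
        if self.state is Turn.LISTENING:
            self.state = Turn.THINKING      # hand the turn to the model

    def on_response_audio_ready(self):
        if self.state is Turn.THINKING:
            self.state = Turn.SPEAKING

    def cancel_playback(self):
        print("flush the audio queue and note that the response was cut off")

session = VoiceSession()
session.on_user_silence()             # user finished a turn
session.on_response_audio_ready()     # agent starts speaking
session.on_user_speech()              # user interrupts: playback is cancelled
```
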

Guides in this topic

Multimodal guides

Curated recipes, playbooks, and walkthroughs for this topic area.
