# Avatar streaming guide

Build real-time avatar experiences with synchronized audio and facial animation.

- Date: Sep 15, 2025
- Reading time: 20 min
- Level: Advanced
- Tags: Avatar, Streaming, Multimodal

## Takeaways
- Stream synchronized audio and blendshape frames over WebSocket.
- Handle emotion parameters for expressive avatar responses.
- Implement robust reconnection and buffering strategies.

## Avatar streaming overview

The Disruptive Rain Avatar API generates synchronized audio and facial animation in real time. It outputs ARKit-compatible blendshapes that can drive 3D avatar models.

Audio and animation frames stream together over a single WebSocket connection for tight synchronization.

- Audio and blendshapes stream over the same WebSocket connection.
- Frame synchronization uses timestamp-based alignment (see the message sketch after this list).
- Emotion parameters control facial expressions.
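
The guide doesn't pin down the wire format, but based on the fields used in the connection example below, the two frame types might look roughly like this. The `timestampMs` field and the audio encoding are assumptions, not documented API:

```ts
// Hypothetical message shapes, inferred from the connection example below.
// Fields other than type/audio/weights/frameIndex are assumptions.
type AvatarMessage =
  | {
      type: 'audio';
      audio: string;        // base64-encoded audio chunk (assumed encoding)
      timestampMs: number;  // presentation timestamp for alignment (assumed)
    }
  | {
      type: 'blendshape';
      weights: number[];    // ARKit blendshape weights (52 coefficients)
      frameIndex: number;
      timestampMs: number;  // aligned with the matching audio chunk (assumed)
    };
```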

## WebSocket connection setup

Connect to the avatar streaming endpoint and send the session configuration as the first message. The server then streams audio chunks and blendshape frames in lockstep.

```ts
// Queue of raw audio chunks awaiting playback (drained elsewhere).
const audioQueue: string[] = [];

// renderBlendshapes applies ARKit blendshape weights to the 3D model;
// its implementation depends on your rendering stack.
declare function renderBlendshapes(weights: number[], frameIndex: number): void;

const ws = new WebSocket('wss://<gateway-host>/v1/avatar/stream');

ws.onopen = () => {
  // The session configuration must be the first message on the socket.
  ws.send(JSON.stringify({
    sessionId: 'avatar_' + Date.now(),
    voiceId: 'default',
    emotion: 'neutral',
    speed: 1.0,
  }));
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === 'audio') {
    // Buffer audio chunks for synchronized playback.
    audioQueue.push(msg.audio);
  } else if (msg.type === 'blendshape') {
    renderBlendshapes(msg.weights, msg.frameIndex);
  }
};
```
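
The handler above only queues audio. One way to drain that queue is the Web Audio sketch below, which assumes base64-encoded 16-bit PCM mono at 24 kHz; the actual codec and sample rate depend on your session configuration and are not specified in this guide.

```ts
// Minimal playback sketch, assuming base64-encoded 16-bit PCM mono @ 24 kHz.
const audioCtx = new AudioContext({ sampleRate: 24000 });
let nextStartTime = 0;

function playChunk(base64Audio: string): void {
  // Decode base64 into 16-bit PCM samples.
  const bytes = Uint8Array.from(atob(base64Audio), (c) => c.charCodeAt(0));
  const pcm = new Int16Array(bytes.buffer);

  // Convert to float samples in [-1, 1] for Web Audio.
  const buffer = audioCtx.createBuffer(1, pcm.length, audioCtx.sampleRate);
  const channel = buffer.getChannelData(0);
  for (let i = 0; i < pcm.length; i++) channel[i] = pcm[i] / 32768;

  // Schedule chunks back to back so playback is gapless.
  const source = audioCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(audioCtx.destination);
  nextStartTime = Math.max(nextStartTime, audioCtx.currentTime);
  source.start(nextStartTime);
  nextStartTime += buffer.duration;
}

// Drain the queue from the message handler, e.g.:
// while (audioQueue.length > 0) playChunk(audioQueue.shift()!);
```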

## Emotion and expression control

Control avatar expressions by setting emotion parameters. The system interpolates between emotion states smoothly.

- Supported emotions: neutral, happy, sad, angry, surprised, fearful.
- Blend multiple emotions with weighted parameters (sketched below).
- Speech content automatically influences lip sync.
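
The session config in the connection example sets a single `emotion` string; the exact schema for mid-session, weighted updates isn't specified here, so the following is a hypothetical sketch modeled on that config message:

```ts
// Hypothetical mid-session emotion update; the message schema is an
// assumption, not documented API.
function setEmotion(ws: WebSocket, weights: Record<string, number>): void {
  // Weights blend supported emotions, e.g. { happy: 0.7, surprised: 0.3 }.
  ws.send(JSON.stringify({ type: 'emotion', weights }));
}

// Example: shift toward a mostly happy, slightly surprised expression.
// The server interpolates smoothly from the current emotion state.
setEmotion(ws, { happy: 0.7, surprised: 0.3 });
```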

## Latency optimization

Avatar streaming is latency-sensitive. Buffer a small number of frames and use adaptive playback to handle network jitter.

- Target 100-200 ms end-to-end latency for conversational UX.
- Pre-buffer 3-5 frames before starting playback.
- Implement frame-skip logic for sustained network issues (see the buffering sketch after this list).
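
Putting those rules into code, here is a minimal jitter-buffer sketch for blendshape frames. Only the 3-5 frame pre-buffer comes from the guideline above; the frame rate, lag threshold, and `renderBlendshapes` renderer are assumptions:

```ts
// Jitter-buffer sketch: pre-buffer a few frames before starting playback,
// and skip stale frames when the network falls behind.
declare function renderBlendshapes(weights: number[], frameIndex: number): void;

const PREBUFFER_FRAMES = 4;           // within the 3-5 frame guideline above
const FRAME_INTERVAL_MS = 1000 / 30;  // assumes 30 fps animation frames
const MAX_LAG_FRAMES = 6;             // skip threshold (assumed)

const frameBuffer: { weights: number[]; frameIndex: number }[] = [];
let playing = false;

function onBlendshapeFrame(frame: { weights: number[]; frameIndex: number }) {
  frameBuffer.push(frame);
  // Start draining only once enough frames are buffered.
  if (!playing && frameBuffer.length >= PREBUFFER_FRAMES) {
    playing = true;
    setInterval(drainFrame, FRAME_INTERVAL_MS);
  }
}

function drainFrame(): void {
  // If the buffer has grown too deep, drop old frames to catch up.
  while (frameBuffer.length > MAX_LAG_FRAMES) frameBuffer.shift();
  const frame = frameBuffer.shift();
  if (frame) renderBlendshapes(frame.weights, frame.frameIndex);
}
```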