Normalize inputs
Convert audio, images, and text into one consistent representation (text summaries or embeddings) before fusion, so that every modality reaches the fusion step in the same shape.
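A minimal sketch of what a normalized record could look like, assuming each modality has already been reduced to a text summary and an embedding vector by some upstream model. The names NormalizedInput, l2_normalize, and normalize_input are hypothetical and only illustrate the idea; L2 scaling of the embedding is one possible convention, not something these notes prescribe.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class NormalizedInput:
    """Hypothetical common record shared by all modalities."""
    modality: str                  # e.g. "audio", "image", or "text"
    summary: str                   # short text summary of the input
    embedding: Tuple[float, ...]   # vector scaled to unit length


def l2_normalize(vector: Sequence[float]) -> Tuple[float, ...]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return tuple(0.0 for _ in vector)
    return tuple(x / norm for x in vector)


def normalize_input(modality: str, summary: str,
                    embedding: Sequence[float]) -> NormalizedInput:
    """Put one input into the shared shape before fusion."""
    return NormalizedInput(
        modality=modality.strip().lower(),   # consistent modality labels
        summary=" ".join(summary.split()),   # collapse runs of whitespace
        embedding=l2_normalize(embedding),   # comparable vector magnitudes
    )
```

With records like this, the fusion step can treat audio, image, and text inputs uniformly instead of branching on the modality.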
Fusion prompt
Use a single prompt contract that describes how each modality should be weighted.
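One way to read "prompt contract" is a single function that always renders modalities and their weights in the same layout. The sketch below assumes weights are plain non-negative numbers that get rescaled to sum to 1; build_fusion_prompt and the bracketed line format are hypothetical choices, not a fixed specification.

```python
from typing import Dict


def build_fusion_prompt(summaries: Dict[str, str],
                        weights: Dict[str, float]) -> str:
    """Render one prompt that states each modality's relative weight.

    Hypothetical contract: one line per modality, sorted by name,
    with weights rescaled so that they sum to 1.
    """
    missing = set(summaries) - set(weights)
    if missing:
        raise ValueError(f"no weight given for: {sorted(missing)}")

    total = sum(weights[m] for m in summaries)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    lines = ["Combine the inputs below. Weights sum to 1; "
             "rely more on inputs with higher weights."]
    for modality in sorted(summaries):
        share = weights[modality] / total
        lines.append(f"[{modality} weight={share:.2f}] {summaries[modality]}")
    return "\n".join(lines)
```

Keeping the layout in one function means a weighting change is made in a single place, and every caller sends the model the same structure.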
Latency handling
When one modality is slower to arrive, stream partial results from the modalities that are already available instead of blocking on the slowest one.
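A small asyncio sketch of this idea: each modality is processed concurrently, and a snapshot of everything available so far is yielded as soon as any one of them finishes. fetch_modality and stream_partial_results are hypothetical names, and the sleep stands in for real model or network latency.

```python
import asyncio
from typing import AsyncIterator, Dict, Iterable, Tuple


async def fetch_modality(name: str, delay: float, result: str) -> Tuple[str, str]:
    """Stand-in for a real modality pipeline; the sleep simulates latency."""
    await asyncio.sleep(delay)
    return name, result


async def stream_partial_results(
    sources: Iterable[Tuple[str, float, str]],
) -> AsyncIterator[Dict[str, str]]:
    """Yield a growing snapshot each time another modality completes."""
    tasks = [asyncio.create_task(fetch_modality(*source)) for source in sources]
    partial: Dict[str, str] = {}
    # as_completed hands results back in completion order, not input order.
    for finished in asyncio.as_completed(tasks):
        name, result = await finished
        partial[name] = result
        yield dict(partial)  # copy, so earlier snapshots are not mutated


async def main() -> None:
    sources = [("audio", 0.05, "speech transcript"), ("text", 0.0, "user message")]
    async for snapshot in stream_partial_results(sources):
        print(snapshot)


if __name__ == "__main__":
    asyncio.run(main())
```

A consumer can start fusing the first snapshot right away and refine its answer when the slower modality arrives.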