Cheatsheet
Multimodal AI
Systems that process and generate multiple data modalities — images, audio, and text. Interviewers focus on fusion strategies, modality-specific failure modes, and latency.
01
Vision inputs
Vision-Language Models (VLMs)
- Image encoder (ViT or CNN) + projection layer + language model decoder
- Vision tokens are prepended to text tokens in the LLM context
- Image resolution affects token count: higher resolution → more tokens → longer context (see the back-of-envelope helper below)
- Models: GPT-4V, Claude 3, Gemini 1.5, LLaVA (open source)
✓
Discuss the vision-token budget when asked about multimodal latency and cost
✗
Treat images as free — a high-resolution image can consume hundreds of tokens
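To make the token budget concrete, here is a back-of-envelope helper; the one-token-per-patch rule and the 14 px patch size are illustrative ViT-style assumptions, not any provider's billing formula.

```python
def vision_token_count(width: int, height: int, patch_size: int = 14) -> int:
    """Rough ViT-style estimate: one vision token per patch (illustrative assumption)."""
    return (width // patch_size) * (height // patch_size)

print(vision_token_count(336, 336))    # 576 tokens for a small thumbnail
print(vision_token_count(1344, 1344))  # 9216 tokens -- 16x the context cost
```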
Vision Input Preprocessing
- Resize images to the model's expected resolution range to control token count and cost
- Dynamic tiling: split large images into tiles for fine-detail tasks (e.g., document reading); see the sketch below
- Normalize pixel values per model requirements (model-specific channel mean/std)
- For documents, send both the image and extracted text for best accuracy
✓
Profile image token cost in your use case — tiling large documents can 10× the context length
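A minimal Pillow sketch of resize-then-tile preprocessing; the tile and max-side values are illustrative, and the real limits should come from your model's documented resolution range.

```python
from PIL import Image

def tile_image(path: str, tile: int = 512, max_side: int = 2048) -> list[Image.Image]:
    """Cap resolution to control token count, then tile for fine-detail tasks."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))  # in-place downscale, preserves aspect ratio
    tiles = []
    for top in range(0, img.height, tile):
        for left in range(0, img.width, tile):
            box = (left, top, min(left + tile, img.width), min(top + tile, img.height))
            tiles.append(img.crop(box))
    return tiles
```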
02
OCR
OCR in Multimodal Pipelines
- VLMs have reasonable OCR capability, but dedicated OCR engines (Tesseract, Google Vision, AWS Textract) are more accurate for dense text
- Hybrid: run OCR to get text, then pass both the text and the image to the VLM for layout-aware reasoning (sketched below)
- Table extraction from images requires specialized tools or models (e.g., PaddleOCR, Unstructured.io)
- Handwriting and low-contrast scans degrade OCR accuracy significantly
✓
Use dedicated OCR for text-heavy documents; use VLM for understanding layout, charts, and mixed content
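A sketch of the hybrid pattern using pytesseract; the message list shape is illustrative, since field names differ across VLM APIs.

```python
import base64

import pytesseract
from PIL import Image

def hybrid_document_prompt(path: str) -> list[dict]:
    """Run dedicated OCR, then send BOTH the transcript and the image to the VLM."""
    ocr_text = pytesseract.image_to_string(Image.open(path))
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    return [
        {"type": "text", "text": f"OCR transcript (may contain errors):\n{ocr_text}"},
        {"type": "image", "data": image_b64},  # field names vary by provider
        {"type": "text", "text": "Use both sources; prefer the image for layout questions."},
    ]
```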
03
Speech
Speech-to-Text Integration
- Whisper (OpenAI) and cloud ASR APIs are the most common transcription layers (usage example below)
- Word error rate (WER) varies by accent, domain vocabulary, and audio quality
- Punctuation and speaker diarization are often separate post-processing steps
- Streaming ASR reduces latency by providing partial transcripts before the utterance ends
✓
Evaluate ASR on domain-specific vocabulary — general models degrade on technical jargon
✗
Feed raw transcripts without punctuation directly to the LLM — it hurts comprehension
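A minimal example with the open-source openai-whisper package; the file name and jargon list are placeholders. The `initial_prompt` argument biases decoding toward domain vocabulary that general checkpoints otherwise mis-hear.

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")  # larger checkpoints lower WER at higher latency
# initial_prompt nudges decoding toward domain jargon (placeholder terms below)
result = model.transcribe("call.wav", initial_prompt="Kubernetes, etcd, kubelet, Istio")
print(result["text"])  # punctuated transcript, safe to pass to the LLM
for seg in result["segments"]:  # timestamps for diarization/alignment post-processing
    print(f'{seg["start"]:.1f}s  {seg["text"]}')
```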
Text-to-Speech (TTS)
- Neural TTS (ElevenLabs, OpenAI TTS, Google Neural2) produces near-human-quality speech
- Latency optimization: start generating audio before the full LLM response is complete (streaming TTS; sketched below)
- Voice cloning requires careful consent and policy management
- SSML tags control prosody, pauses, and emphasis for more natural speech
✓
Stream LLM output into TTS for lowest perceived latency in voice applications
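A sketch of sentence-level streaming from LLM to TTS; `llm_stream` and `speak` are stand-ins for a streaming LLM client and a TTS call, and the sentence-boundary regex is a deliberate simplification.

```python
import re

def stream_llm_to_tts(llm_stream, speak) -> None:
    """Flush complete sentences to TTS while the LLM is still generating."""
    buffer = ""
    for delta in llm_stream:          # stand-in: yields text chunks from the LLM
        buffer += delta
        while (m := re.search(r"[.!?]\s", buffer)):
            sentence, buffer = buffer[: m.end()], buffer[m.end():]
            speak(sentence)           # stand-in: your TTS call, ideally itself streaming
    if buffer.strip():
        speak(buffer)                 # flush any trailing fragment
```

The first audio plays after the first sentence instead of after the full response, which is where most of the perceived-latency win comes from.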
04
Fusion
Multimodal Fusion Strategies
- Early fusion: combine raw modalities before any encoding — rare in LLM systems
- Late fusion: encode each modality separately, then merge representations
- Cross-attention fusion: modalities attend to each other during encoding (most expressive)
- Projection layer fusion: encode each modality independently, project into a shared embedding space, and concatenate for the LLM (sketched below)
✓
Describe projection-layer fusion (encode → project → concatenate) for VLM architectures
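A minimal PyTorch sketch of the pattern; dimensions and the frozen-encoder choice are illustrative, loosely in the style of the LLaVA recipe.

```python
import torch
import torch.nn as nn

class ProjectionFusion(nn.Module):
    """Encode -> project into the LLM embedding space -> concatenate."""
    def __init__(self, vision_encoder: nn.Module, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder           # e.g. a frozen ViT
        self.project = nn.Linear(vision_dim, llm_dim)  # the trained "glue" layer

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patches = self.vision_encoder(pixel_values)    # (B, n_patches, vision_dim)
        vision_tokens = self.project(patches)          # (B, n_patches, llm_dim)
        # Vision tokens are prepended to the text tokens in the LLM context
        return torch.cat([vision_tokens, text_embeds], dim=1)
```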
05
Latency
Multimodal Latency Optimization
- Image encoding is fast (<50 ms), but high resolution adds vision tokens that slow generation
- ASR transcription latency can be hidden with streaming; fire the LLM call on partial transcripts
- Cache encoded image representations for documents that are queried repeatedly
- Compress images before sending to API — unnecessary full-resolution uploads add network latency
✓
Parallelize independent modality preprocessing (OCR + ASR) before feeding both to the LLM
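A standard-library sketch of that parallelization; `run_ocr` and `run_asr` are injected stand-ins for your actual pipelines.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def preprocess(run_ocr: Callable[[str], str], run_asr: Callable[[str], str],
               image_path: str, audio_path: str) -> tuple[str, str]:
    """OCR and ASR are independent, so overlap their wall-clock time."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ocr = pool.submit(run_ocr, image_path)   # e.g. Tesseract
        asr = pool.submit(run_asr, audio_path)   # e.g. Whisper
        return ocr.result(), asr.result()        # both ready before the single LLM call
```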
06
Grounding
Visual Grounding
- Grounding: associating model output claims with specific regions or elements in the image
- Bounding box predictions allow the model to localize objects (GLIP, Grounding DINO)
- VLMs can hallucinate objects not present in the image — visual grounding reduces this
- Verification: cross-check text claims against image content with a secondary vision classifier (sketched below)
✓
Ask the model to cite specific image regions when accuracy is critical — reduces hallucination
✗
Trust VLM outputs about image content without downstream verification for high-stakes applications
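One way to implement the secondary check is zero-shot verification with CLIP via Hugging Face transformers; the contrast caption and the 0.7 threshold are illustrative assumptions, not a calibrated verifier.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def claim_score(image: Image.Image, claim: str, contrast: str) -> float:
    """Probability the claim fits the image better than a contrast caption."""
    inputs = processor(text=[claim, contrast], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, 2)
    return logits.softmax(dim=-1)[0, 0].item()

img = Image.open("frame.png")  # placeholder path
score = claim_score(img, "a photo containing a stop sign", "a street with no signs")
if score < 0.7:  # threshold is a tunable assumption
    print("Claim not supported by the image -- flag for human review")
```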