Cheatsheet

Multimodal AI

Systems that process and generate multiple data modalities — images, audio, and text. Interviewers focus on fusion strategies, modality-specific failure modes, and latency.

01

Vision inputs

Vision-Language Models (VLMs)

  • Image encoder (ViT or CNN) + projection layer + language model decoder
  • Vision tokens are prepended to text tokens in the LLM context
  • Image resolution affects token count: higher resolution → more tokens → longer context
  • Models: GPT-4V, Claude 3, Gemini 1.5, LLaVA (open source)
Discuss the vision-token budget when asked about multimodal latency and cost
Don't treat images as free — a high-resolution image can consume hundreds of tokens
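
How many tokens an image costs depends on the provider's patching and tiling scheme, but a rough estimate is easy to compute. A minimal sketch, assuming a ViT-style encoder with a 14-pixel patch and 336-pixel tiles (illustrative numbers, not any vendor's actual pricing rules):

```python
def estimate_vision_tokens(width: int, height: int,
                           patch_size: int = 14,
                           tile: int = 336) -> int:
    """Rough vision-token estimate: the image is split into fixed-size
    tiles and each tile contributes (tile / patch_size)^2 patch tokens."""
    tokens_per_tile = (tile // patch_size) ** 2   # 24 * 24 = 576
    tiles_x = -(-width // tile)                   # ceiling division
    tiles_y = -(-height // tile)
    return tiles_x * tiles_y * tokens_per_tile

# A full-resolution 1920x1080 screenshot is far from free:
print(estimate_vision_tokens(1920, 1080))  # 6 x 4 tiles * 576 = 13824 tokens
```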

Vision Input Preprocessing

  • Resize images to model's expected resolution range to control token count and cost
  • Dynamic tiling: split large images into tiles for fine-detail tasks (e.g., document reading)
  • Normalize pixel values per model requirements (model-specific channel mean/std)
  • For documents, send both the image and extracted text for best accuracy
Profile image token cost in your use case — tiling large documents can 10× the context length
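
A minimal preprocessing sketch with Pillow and NumPy. The 336-pixel target size and the CLIP-style channel mean/std below are illustrative assumptions; substitute whatever your model's documentation specifies.

```python
from PIL import Image
import numpy as np

TARGET = 336                                  # assumed model input size
MEAN = np.array([0.4815, 0.4578, 0.4082])     # illustrative channel means
STD = np.array([0.2686, 0.2613, 0.2758])      # illustrative channel stds

def preprocess(path: str) -> np.ndarray:
    img = Image.open(path).convert("RGB")
    img = img.resize((TARGET, TARGET), Image.BICUBIC)  # control token count
    x = np.asarray(img, dtype=np.float32) / 255.0      # scale to [0, 1]
    x = (x - MEAN) / STD                               # per-channel normalize
    return x.transpose(2, 0, 1)                        # HWC -> CHW for the encoder
```
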
02

OCR

OCR in Multimodal Pipelines

  • VLMs have reasonable OCR capability, but dedicated OCR engines (Tesseract, Google Vision, AWS Textract) are more accurate for dense text
  • Hybrid: run OCR to get text, pass both text and image to the VLM for layout-aware reasoning
  • Table extraction from images requires specialized tools or models (e.g., PaddleOCR, Unstructured.io)
  • Handwriting and low-contrast scans degrade OCR accuracy significantly
Use dedicated OCR for text-heavy documents; use VLM for understanding layout, charts, and mixed content
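
A sketch of the hybrid pattern, assuming Tesseract via pytesseract for extraction; `call_vlm` is a placeholder for whichever multimodal API you use, not a real client.

```python
import pytesseract
from PIL import Image

def answer_about_document(image_path: str, question: str) -> str:
    image = Image.open(image_path)
    ocr_text = pytesseract.image_to_string(image)   # accurate dense-text extraction

    prompt = (
        "OCR transcript of the attached document:\n"
        f"{ocr_text}\n\n"
        "Using both the transcript and the image (for layout, tables, and "
        f"figures), answer: {question}"
    )
    return call_vlm(prompt=prompt, image=image)     # placeholder VLM call
```
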
03

Speech

Speech-to-Text Integration

  • Whisper (OpenAI) and cloud ASR APIs are the common transcription layers
  • Word error rate (WER) varies by accent, domain vocabulary, and audio quality
  • Punctuation and speaker diarization are often separate post-processing steps
  • Streaming ASR reduces latency by providing partial transcripts before the utterance ends
Evaluate ASR on domain-specific vocabulary — general models degrade on technical jargon
Don't feed raw transcripts without punctuation directly to the LLM — it hurts comprehension
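
A transcription sketch using the open-source whisper package; the file name and vocabulary hints are illustrative. Passing domain terms through `initial_prompt` nudges the decoder toward that vocabulary, which is one practical way to reduce WER on jargon.

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe(
    "meeting.wav",                                           # illustrative file
    initial_prompt="Kubernetes, gRPC, OAuth, p99 latency",   # domain vocabulary hints
)
print(result["text"])  # punctuated transcript; diarization still needs a separate step
```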

Text-to-Speech (TTS)

  • Neural TTS (ElevenLabs, OpenAI TTS, Google Neural2) produces near-human-quality speech
  • Latency optimization: start generating audio before the full LLM response is complete (streaming TTS)
  • Voice cloning requires careful consent and policy management
  • SSML tags control prosody, pauses, and emphasis for more natural speech
Stream LLM output into TTS for lowest perceived latency in voice applications
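
A sketch of sentence-level streaming from LLM to TTS. `stream_llm_tokens` and `tts_speak` are placeholders for your streaming LLM client and TTS engine; the point is flushing on sentence boundaries instead of waiting for the full response.

```python
import re

def speak_streaming(prompt: str) -> None:
    buffer = ""
    for chunk in stream_llm_tokens(prompt):           # placeholder: yields text chunks
        buffer += chunk
        # Flush complete sentences so audio starts almost immediately.
        while match := re.search(r"[.!?]\s", buffer):
            sentence, buffer = buffer[:match.end()], buffer[match.end():]
            tts_speak(sentence)                        # placeholder: synthesize + play
    if buffer.strip():
        tts_speak(buffer)                              # flush the trailing fragment
```
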
04

Fusion

Multimodal Fusion Strategies

  • Early fusion: combine raw modalities before any encoding — rare in LLM systems
  • Late fusion: encode each modality separately, then merge representations
  • Cross-attention fusion: modalities attend to each other during encoding (most expressive)
  • Projection layer fusion: encode each modality independently, project into shared embedding space, concatenate for LLM
Describe projection-layer fusion (encode → project → concatenate) for VLM architectures
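
A minimal PyTorch sketch of projection-layer fusion; the 1024 and 4096 dimensions are illustrative, and real VLMs add details (interleaving, special tokens, multi-layer projectors) that are omitted here.

```python
import torch
import torch.nn as nn

class ProjectionFusion(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)   # map vision features into LLM space

    def forward(self,
                vision_feats: torch.Tensor,   # (B, n_image_tokens, vision_dim)
                text_embeds: torch.Tensor     # (B, n_text_tokens, llm_dim)
                ) -> torch.Tensor:
        vision_tokens = self.proj(vision_feats)               # (B, n_image_tokens, llm_dim)
        # Prepend vision tokens; the decoder then attends over both modalities.
        return torch.cat([vision_tokens, text_embeds], dim=1)
```
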
05

Latency

Multimodal Latency Optimization

  • Image encoding is fast (<50 ms), but high resolution adds vision tokens that slow generation
  • ASR transcription latency can be hidden with streaming; fire LLM call on partial transcript
  • Cache encoded image representations for documents that are queried repeatedly
  • Compress images before sending to API — unnecessary full-resolution uploads add network latency
Parallelize independent modality preprocessing (OCR + ASR) before feeding both to the LLM
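
A sketch of overlapping independent preprocessing with asyncio. `run_ocr`, `run_asr`, and `call_llm` are placeholder async clients, not real libraries.

```python
import asyncio

async def answer(image_path: str, audio_path: str, question: str) -> str:
    # OCR and ASR are independent, so run them concurrently.
    ocr_text, transcript = await asyncio.gather(
        run_ocr(image_path),     # placeholder async OCR client
        run_asr(audio_path),     # placeholder async ASR client
    )
    prompt = (
        f"Document text:\n{ocr_text}\n\n"
        f"Call transcript:\n{transcript}\n\n"
        f"Question: {question}"
    )
    return await call_llm(prompt)  # placeholder async LLM call
```
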
06

Grounding

Visual Grounding

  • Grounding: associating model output claims with specific regions or elements in the image
  • Bounding box predictions allow the model to localize objects (GLIP, Grounding DINO)
  • VLMs can hallucinate objects not present in the image — visual grounding reduces this
  • Verification: cross-check text claims against image content with a secondary vision classifier
Ask the model to cite specific image regions when accuracy is critical — reduces hallucination
Don't trust VLM outputs about image content without downstream verification in high-stakes applications
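
One way to implement the cross-check, sketched under assumptions: `detect_objects` stands in for an open-vocabulary detector such as Grounding DINO, and the 0.5 confidence threshold is an arbitrary starting point to tune.

```python
def verify_vlm_claims(image, vlm_answer: str, claimed_objects: list[str]) -> dict:
    detections = detect_objects(image, labels=claimed_objects)  # placeholder detector
    verified, unsupported = [], []
    for obj in claimed_objects:
        scores = [d.score for d in detections if d.label == obj]
        (verified if scores and max(scores) > 0.5 else unsupported).append(obj)
    return {
        "answer": vlm_answer,
        "verified": verified,
        "unsupported": unsupported,  # flag these for human review or a re-query
    }
```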