Cheatsheet
Multimodal AI
Systems that process and generate multiple data modalities — images, audio, and text. Interviewers focus on fusion strategies, modality-specific failure modes, and latency.
01
Vision inputs
Vision-Language Models (VLMs)
- Image encoder (ViT or CNN) + projection layer + language model decoder
- Vision tokens are prepended to text tokens in the LLM context
- Image resolution affects token count: higher resolution → more tokens → longer context (see the back-of-envelope helper below)
- Models: GPT-4V, Claude 3, Gemini 1.5, LLaVA (open source)
✓
Discuss the vision-token budget when asked about multimodal latency and cost
✗
Treat images as free — a high-resolution image can consume hundreds of tokens
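To make the token budget concrete, here is a back-of-envelope helper; the one-token-per-patch rule and the 14 px patch size are illustrative ViT-style assumptions, not any provider's billing formula.

```python
def vision_token_count(width: int, height: int, patch_size: int = 14) -> int:
    """Rough ViT-style estimate: one vision token per patch (illustrative assumption)."""
    return (width // patch_size) * (height // patch_size)

print(vision_token_count(336, 336))    # 576 tokens for a small thumbnail
print(vision_token_count(1344, 1344))  # 9216 tokens -- 16x the context cost
```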
Vision Input Preprocessing
- Resize images to the model's expected resolution range to control token count and cost
- Dynamic tiling: split large images into tiles for fine-detail tasks (e.g., document reading); see the sketch below
- Normalize pixel values per model requirements (model-specific channel mean/std)
- For documents, send both the image and extracted text for best accuracy
✓
Profile image token cost in your use case — tiling large documents can 10× the context length
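A minimal Pillow sketch of resize-then-tile preprocessing; the tile and max-side values are illustrative, and the real limits should come from your model's documented resolution range.

```python
from PIL import Image

def tile_image(path: str, tile: int = 512, max_side: int = 2048) -> list[Image.Image]:
    """Cap resolution to control token count, then tile for fine-detail tasks."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))  # in-place downscale, preserves aspect ratio
    tiles = []
    for top in range(0, img.height, tile):
        for left in range(0, img.width, tile):
            box = (left, top, min(left + tile, img.width), min(top + tile, img.height))
            tiles.append(img.crop(box))
    return tiles
```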
02
OCR
OCR in Multimodal Pipelines
- VLMs have reasonable OCR capability, but dedicated OCR engines (Tesseract, Google Vision, AWS Textract) are more accurate for dense text
- Hybrid: run OCR to get text, then pass both the text and the image to the VLM for layout-aware reasoning (sketched below)
- Table extraction from images requires specialized tools or models (e.g., PaddleOCR, Unstructured.io)
- Handwriting and low-contrast scans degrade OCR accuracy significantly
✓
Use dedicated OCR for text-heavy documents; use VLM for understanding layout, charts, and mixed content
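A sketch of the hybrid pattern using pytesseract; the message list shape is illustrative, since field names differ across VLM APIs.

```python
import base64

import pytesseract
from PIL import Image

def hybrid_document_prompt(path: str) -> list[dict]:
    """Run dedicated OCR, then send BOTH the transcript and the image to the VLM."""
    ocr_text = pytesseract.image_to_string(Image.open(path))
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    return [
        {"type": "text", "text": f"OCR transcript (may contain errors):\n{ocr_text}"},
        {"type": "image", "data": image_b64},  # field names vary by provider
        {"type": "text", "text": "Use both sources; prefer the image for layout questions."},
    ]
```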
03
Speech
Speech-to-Text Integration
- Whisper (OpenAI) and cloud ASR APIs are the most common transcription layers (usage example below)
- Word error rate (WER) varies by accent, domain vocabulary, and audio quality
- Punctuation and speaker diarization are often separate post-processing steps
- Streaming ASR reduces latency by providing partial transcripts before the utterance ends
✓
Evaluate ASR on domain-specific vocabulary — general models degrade on technical jargon
✗
Feed raw transcripts without punctuation directly to the LLM — it hurts comprehension
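A minimal example with the open-source openai-whisper package; the file name and jargon list are placeholders. The `initial_prompt` argument biases decoding toward domain vocabulary that general checkpoints otherwise mis-hear.

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")  # larger checkpoints lower WER at higher latency
# initial_prompt nudges decoding toward domain jargon (placeholder terms below)
result = model.transcribe("call.wav", initial_prompt="Kubernetes, etcd, kubelet, Istio")
print(result["text"])  # punctuated transcript, safe to pass to the LLM
for seg in result["segments"]:  # timestamps for diarization/alignment post-processing
    print(f'{seg["start"]:.1f}s  {seg["text"]}')
```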
Text-to-Speech (TTS)
- Neural TTS (ElevenLabs, OpenAI TTS, Google Neural2) produces near-human-quality speech
- Latency optimization: start generating audio before the full LLM response is complete (streaming TTS; sketched below)
- Voice cloning requires careful consent and policy management
- SSML tags control prosody, pauses, and emphasis for more natural speech
✓
Stream LLM output into TTS for lowest perceived latency in voice applications
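A sketch of sentence-level streaming from LLM to TTS; `llm_stream` and `speak` are stand-ins for a streaming LLM client and a TTS call, and the sentence-boundary regex is a deliberate simplification.

```python
import re

def stream_llm_to_tts(llm_stream, speak) -> None:
    """Flush complete sentences to TTS while the LLM is still generating."""
    buffer = ""
    for delta in llm_stream:          # stand-in: yields text chunks from the LLM
        buffer += delta
        while (m := re.search(r"[.!?]\s", buffer)):
            sentence, buffer = buffer[: m.end()], buffer[m.end():]
            speak(sentence)           # stand-in: your TTS call, ideally itself streaming
    if buffer.strip():
        speak(buffer)                 # flush any trailing fragment
```

The first audio plays after the first sentence instead of after the full response, which is where most of the perceived-latency win comes from.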
04
Fusion
Multimodal Fusion Strategies
- Early fusion: combine raw modalities before any encoding — rare in LLM systems
- Late fusion: encode each modality separately, then merge representations
- Cross-attention fusion: modalities attend to each other during encoding (most expressive)
- Projection layer fusion: encode each modality independently, project into a shared embedding space, and concatenate for the LLM (sketched below)
✓
Describe projection-layer fusion (encode → project → concatenate) for VLM architectures
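A minimal PyTorch sketch of the pattern; dimensions and the frozen-encoder choice are illustrative, loosely in the style of the LLaVA recipe.

```python
import torch
import torch.nn as nn

class ProjectionFusion(nn.Module):
    """Encode -> project into the LLM embedding space -> concatenate."""
    def __init__(self, vision_encoder: nn.Module, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder           # e.g. a frozen ViT
        self.project = nn.Linear(vision_dim, llm_dim)  # the trained "glue" layer

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patches = self.vision_encoder(pixel_values)    # (B, n_patches, vision_dim)
        vision_tokens = self.project(patches)          # (B, n_patches, llm_dim)
        # Vision tokens are prepended to the text tokens in the LLM context
        return torch.cat([vision_tokens, text_embeds], dim=1)
```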
05
Latency
Multimodal Latency Optimization
- Image encoding is fast (<50 ms), but high resolution adds vision tokens that slow generation
- ASR transcription latency can be hidden with streaming; fire the LLM call on partial transcripts
- Cache encoded image representations for documents that are queried repeatedly
- Compress images before sending to API — unnecessary full-resolution uploads add network latency
✓
Parallelize independent modality preprocessing (OCR + ASR) before feeding both to the LLM
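A standard-library sketch of that parallelization; `run_ocr` and `run_asr` are injected stand-ins for your actual pipelines.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def preprocess(run_ocr: Callable[[str], str], run_asr: Callable[[str], str],
               image_path: str, audio_path: str) -> tuple[str, str]:
    """OCR and ASR are independent, so overlap their wall-clock time."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ocr = pool.submit(run_ocr, image_path)   # e.g. Tesseract
        asr = pool.submit(run_asr, audio_path)   # e.g. Whisper
        return ocr.result(), asr.result()        # both ready before the single LLM call
```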
06
Grounding
Visual Grounding
- Grounding: associating model output claims with specific regions or elements in the image
- Bounding box predictions allow the model to localize objects (GLIP, Grounding DINO)
- VLMs can hallucinate objects not present in the image — visual grounding reduces this
- Verification: cross-check text claims against image content with a secondary vision classifier (sketched below)
✓
Ask the model to cite specific image regions when accuracy is critical — reduces hallucination
✗
Trust VLM outputs about image content without downstream verification for high-stakes applications
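One way to implement the secondary check is zero-shot verification with CLIP via Hugging Face transformers; the contrast caption and the 0.7 threshold are illustrative assumptions, not a calibrated verifier.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def claim_score(image: Image.Image, claim: str, contrast: str) -> float:
    """Probability the claim fits the image better than a contrast caption."""
    inputs = processor(text=[claim, contrast], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, 2)
    return logits.softmax(dim=-1)[0, 0].item()

img = Image.open("frame.png")  # placeholder path
score = claim_score(img, "a photo containing a stop sign", "a street with no signs")
if score < 0.7:  # threshold is a tunable assumption
    print("Claim not supported by the image -- flag for human review")
```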