
Vector DBs

How to store, index, and retrieve dense vector representations at scale. Interviewers test your understanding of ANN tradeoffs, embedding model selection, and operational concerns.

01

Embedding choice

Embedding Model Selection

  • Domain match matters: general-purpose models (OpenAI ada, E5, BGE) can underperform domain-tuned ones on specialized text such as legal, medical, or code
  • Evaluate on your own dataset with MTEB or a custom recall@K benchmark; leaderboard rank ≠ production rank
  • Larger embedding models give better quality but cost more per token
  • Embedding dimension: higher dims (1536, 3072) capture more information but increase storage and ANN cost
Do: run retrieval recall@K on a labeled sample of your production queries before choosing an embedding model (see the sketch below)
Don't: default to the highest-ranked MTEB model without evaluating on your specific domain
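
A minimal sketch of that recall@K check, assuming you already have query and document vectors from the candidate model plus labeled relevant-document IDs per query (all names here are hypothetical):

```python
import numpy as np

def recall_at_k(query_vecs, doc_vecs, doc_ids, relevant_ids, k=10):
    """Fraction of queries for which any labeled-relevant doc appears in the top-k.

    query_vecs:   (Q, d) embeddings of labeled production queries
    doc_vecs:     (N, d) embeddings of the corpus
    doc_ids:      list of N document ids, aligned with doc_vecs
    relevant_ids: list of Q sets of relevant document ids (the labels)
    """
    # Cosine similarity via normalized dot products
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    topk = np.argsort(-(q @ d.T), axis=1)[:, :k]
    hits = sum(
        1 for i, row in enumerate(topk)
        if {doc_ids[j] for j in row} & relevant_ids[i]
    )
    return hits / len(query_vecs)
```

Run this once per candidate model on the same labeled sample; the model ranking you get on your own queries is the one that matters.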

Bi-Encoder vs Cross-Encoder

  • Bi-encoder: embed query and document independently; compare with cosine/dot product
  • Cross-encoder: concatenate query + document and score the pair jointly in one forward pass; much more accurate, but nothing can be precomputed
  • Bi-encoders are fast and precomputable; cross-encoders are slow but superior for re-ranking
  • Production pattern: bi-encoder for ANN recall, cross-encoder for re-ranking top-K
Do: use bi-encoders for retrieval and cross-encoders for re-ranking; don't run cross-encoders at ANN scale (see the sketch below)
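
A sketch of that two-stage pattern with sentence-transformers; the checkpoint names are common public models, and brute-force dot product stands in for the ANN call:

```python
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                   # fast, precomputable
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # slow, accurate

docs = ["first passage ...", "second passage ...", "third passage ..."]  # stand-in corpus
doc_vecs = bi_encoder.encode(docs, normalize_embeddings=True)            # precompute once

def search(query, recall_k=100, final_k=10):
    # Stage 1: bi-encoder recall (brute-force here; ANN index in production)
    q = bi_encoder.encode([query], normalize_embeddings=True)[0]
    candidates = np.argsort(-(doc_vecs @ q))[:recall_k]
    # Stage 2: cross-encoder scores only the small candidate set
    scores = cross_encoder.predict([(query, docs[i]) for i in candidates])
    order = np.argsort(-scores)[:final_k]
    return [(docs[candidates[i]], float(scores[i])) for i in order]
```

The cross-encoder cost is bounded by recall_k, not corpus size, which is why the pattern scales.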
02

ANN indexes

HNSW (Hierarchical Navigable Small World)

  • Graph-based ANN index; achieves high recall with low query latency
  • Key parameters: M (graph connectivity) and ef_construction (build-time quality)
  • Query-time ef controls recall vs latency tradeoff per request
  • Memory-intensive: full graph lives in RAM; not ideal for billions of vectors on a budget
Do: recommend HNSW for latency-sensitive production systems with millions of vectors (see the sketch below)
Don't: use HNSW for billions of vectors without compression; memory will be prohibitive
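
A minimal hnswlib sketch showing where M, ef_construction, and query-time ef enter; random vectors stand in for real embeddings:

```python
import numpy as np
import hnswlib

dim, n = 768, 100_000
data = np.random.rand(n, dim).astype(np.float32)   # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
# M = graph connectivity, ef_construction = build-time candidate list size;
# raising either improves recall at the cost of memory and build time
index.init_index(max_elements=n, M=16, ef_construction=200)
index.add_items(data, np.arange(n))

index.set_ef(64)   # query-time ef: raise for recall, lower for latency, per request
labels, distances = index.knn_query(data[:5], k=10)
```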

IVF (Inverted File Index) and Product Quantization

  • IVF clusters vectors into Voronoi cells; search is restricted to the nearest nprobe cells
  • PQ compresses vectors into short codes (e.g., 64 bytes vs 768×4 = 3072 bytes) for memory savings
  • IVF-PQ is the dominant approach for billion-scale retrieval in FAISS
  • PQ compression trades recall for memory; tune nlist and nprobe against your recall target
Do: name FAISS IVF-PQ when discussing billion-scale vector search under memory constraints (sketched below)
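
A FAISS IVF-PQ sketch under assumed parameters (d=768, nlist=4096, 64 sub-quantizers at 8 bits each, i.e. 64-byte codes); tune nlist and nprobe against your measured recall target:

```python
import numpy as np
import faiss

d, nlist, m = 768, 4096, 64            # m sub-quantizers x 8 bits = 64-byte codes
xb = np.random.rand(1_000_000, d).astype(np.float32)   # stand-in corpus

quantizer = faiss.IndexFlatL2(d)                 # coarse quantizer defining IVF cells
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
index.train(xb[:200_000])                        # learn cells and PQ codebooks on a sample
index.add(xb)

index.nprobe = 16                                # cells scanned per query: recall vs latency
D, I = index.search(xb[:5], 10)
```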

Managed Vector DBs

  • Pinecone, Weaviate, Qdrant, Chroma, pgvector, and Milvus are common choices
  • pgvector is convenient for teams already on PostgreSQL but is slower at scale than dedicated vector DBs
  • Managed services handle sharding, replication, and index builds automatically
  • Evaluate on your own query volume and latency SLA before choosing
Do: know at least two vector DB products and their relative tradeoffs (e.g., Qdrant vs pgvector; a pgvector query is sketched below)
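
With pgvector, nearest-neighbor search is plain SQL; a sketch using psycopg2, assuming a hypothetical items(id, embedding vector(768)) table with the pgvector extension installed:

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb")   # hypothetical connection string

def nearest(query_vec, k=10):
    # pgvector distance operators: <-> is L2, <=> is cosine distance, <#> is
    # negative inner product; ORDER BY on the operator lets pgvector use its index
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id FROM items ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec_literal, k),
        )
        return cur.fetchall()
```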
03

Metadata filters

Pre- and Post-Filtering

  • Pre-filter: restrict ANN search to matching metadata subset before distance computation
  • Post-filter: run ANN over full index, then discard results that don't match metadata predicates
  • Pre-filtering with a highly selective predicate can starve the ANN search of traversable candidates, degrading recall
  • Most managed vector DBs support hybrid pre-filter + ANN to balance precision and recall
Do: ask about the selectivity of the metadata filter; highly selective filters need special handling
Don't: post-filter with a tiny top-K; you may discard the true best matches (both patterns are sketched below)
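
Both strategies in a brute-force numpy sketch (a real system would replace the dot products with an ANN call); note how post-filtering has to over-fetch:

```python
import numpy as np

def post_filter_search(query_vec, doc_vecs, metadata_ok, k=10, oversample=5):
    """Post-filter: over-fetch, then drop results failing the metadata predicate."""
    sims = doc_vecs @ query_vec                        # stand-in for an ANN call
    candidates = np.argsort(-sims)[: k * oversample]   # over-fetch to survive filtering
    kept = [i for i in candidates if metadata_ok[i]]
    return kept[:k]   # can still come up short under a highly selective filter

def pre_filter_search(query_vec, doc_vecs, metadata_ok, k=10):
    """Pre-filter: restrict the search to the matching subset first (exact here)."""
    ids = np.flatnonzero(metadata_ok)
    sims = doc_vecs[ids] @ query_vec
    return ids[np.argsort(-sims)[:k]]
```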
04

Recall

Measuring ANN Recall

  • Recall@K: fraction of true top-K (from exact search) returned by ANN within the top-K results
  • Baseline: run exact brute-force search on a sample to get ground truth
  • Tune index parameters to hit a target recall (e.g., 95% recall@10) at acceptable latency
  • Recall degrades as data distribution shifts — re-benchmark after major data changes
Do: always measure recall@K on your actual data distribution, not just index defaults (sketched below)
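
A self-contained FAISS sketch of the measurement: exact IndexFlatL2 results serve as ground truth for an HNSW index built over the same (here random) vectors:

```python
import numpy as np
import faiss

d, k = 768, 10
xb = np.random.rand(100_000, d).astype(np.float32)   # stand-in corpus
xq = np.random.rand(1_000, d).astype(np.float32)     # sampled queries

exact = faiss.IndexFlatL2(d)          # brute force gives the ground-truth top-k
exact.add(xb)
_, gt = exact.search(xq, k)

ann = faiss.IndexHNSWFlat(d, 32)      # the index actually under test
ann.hnsw.efSearch = 64
ann.add(xb)
_, approx = ann.search(xq, k)

# recall@k: per-query overlap between the ANN top-k and the exact top-k
recall = np.mean([len(set(a) & set(g)) / k for a, g in zip(approx, gt)])
print(f"recall@{k}: {recall:.3f}")
```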

Recall vs Latency Tradeoff

  • Higher ef / nprobe values increase recall at the cost of higher latency
  • For most RAG applications, 90–95% recall@10 is sufficient; 99%+ is rarely worth the latency cost
  • Quantization (PQ, SQ) reduces memory and latency but lowers recall
  • Re-ranking candidates with full-precision vectors can recover accuracy lost to aggressive quantization
Do: frame the recall-latency decision in terms of downstream task accuracy, not just ANN metrics (a parameter sweep is sketched below)
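
Continuing the recall-measurement sketch above (ann, xq, gt, k, and np are defined there), sweeping query-time efSearch makes the tradeoff concrete; the same idea applies to nprobe on IVF indexes:

```python
import time

for ef in (16, 32, 64, 128, 256):
    ann.hnsw.efSearch = ef
    t0 = time.perf_counter()
    _, approx = ann.search(xq, k)
    ms = (time.perf_counter() - t0) / len(xq) * 1e3
    recall = np.mean([len(set(a) & set(g)) / k for a, g in zip(approx, gt)])
    print(f"ef={ef:4d}  recall@{k}={recall:.3f}  latency={ms:.2f} ms/query")
```

Pick the smallest ef that keeps downstream task quality flat, not the one that maximizes recall.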
05

Dimensionality

Embedding Dimensionality

  • Higher dims → better representational capacity but more storage, slower ANN search, and heavier network payloads
  • Matryoshka embeddings (e.g., OpenAI text-embedding-3) support truncation to lower dims with modest quality loss
  • PCA or ICA can reduce dims post-hoc; retraining the embedding model at the target dimension preserves more quality
  • Balance between embedding quality and infrastructure cost is a system design decision
Do: mention Matryoshka embeddings as a way to tune the quality-cost tradeoff without re-embedding the corpus (see the sketch below)
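
Truncation itself is trivial, which is the point; a numpy sketch, valid only for Matryoshka-trained models:

```python
import numpy as np

def truncate_matryoshka(embs, target_dim):
    """Keep the leading target_dim coordinates, then re-normalize.

    Only valid for Matryoshka-trained models (e.g., OpenAI text-embedding-3),
    where the leading dimensions carry the coarsest information."""
    out = embs[:, :target_dim]
    return out / np.linalg.norm(out, axis=1, keepdims=True)

full = np.random.rand(1000, 3072).astype(np.float32)   # stand-in embeddings
small = truncate_matryoshka(full, 768)                 # 4x less storage and ANN cost
```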
06

Updates

Online Index Updates

  • HNSW supports online inserts; IVF-flat supports inserts but cluster quality drifts over time
  • Batch rebuilds restore index quality after many incremental inserts
  • Deletions are expensive in many ANN libraries — use a tombstone + periodic compaction pattern
  • Blue-green index swap: build a new index offline and atomically swap in production for zero downtime
Do: plan for periodic full index rebuilds even if your system supports online inserts (pattern sketched below)
Don't: ignore index rebuild cadence; ANN quality degrades silently after many incremental updates
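
A sketch of the tombstone + compaction pattern around hnswlib; the wrapper class and its methods are hypothetical, not a library API:

```python
import numpy as np
import hnswlib

class TombstoneIndex:
    """Logical deletes via a tombstone set; periodic compaction restores quality."""

    def __init__(self, dim, capacity):
        self.dim, self.capacity = dim, capacity
        self.vectors = {}        # id -> vector, the source of truth
        self.deleted = set()     # tombstoned ids, filtered at query time
        self.index = self._fresh_index(capacity)

    def _fresh_index(self, capacity):
        idx = hnswlib.Index(space="cosine", dim=self.dim)
        idx.init_index(max_elements=max(capacity, 1), M=16, ef_construction=200)
        return idx

    def add(self, vid, vec):                 # assumes capacity is not exceeded
        self.vectors[vid] = vec
        self.index.add_items(vec.reshape(1, -1), np.array([vid]))

    def delete(self, vid):
        self.deleted.add(vid)                # cheap logical delete, no graph surgery

    def query(self, q, k=10):
        # Over-fetch so tombstoned hits can be dropped without shrinking the result
        k_eff = min(k + len(self.deleted), self.index.get_current_count())
        labels, _ = self.index.knn_query(q.reshape(1, -1), k=k_eff)
        return [v for v in labels[0] if v not in self.deleted][:k]

    def compact(self):
        # Blue-green style: build a fresh index offline, then swap it in atomically
        live = {v: vec for v, vec in self.vectors.items() if v not in self.deleted}
        new = self._fresh_index(self.capacity)
        if live:
            new.add_items(np.stack(list(live.values())), np.array(list(live.keys())))
        self.vectors, self.deleted, self.index = live, set(), new
```

Trigger compact() on a schedule or once the tombstone ratio crosses a threshold, and keep serving from the old index until the new one is ready.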