
Vector DBs

How to store, index, and retrieve dense vector representations at scale. Interviewers test your understanding of ANN tradeoffs, embedding model selection, and operational concerns.

01

Embedding choice

Embedding Model Selection

  • Domain match matters: general-purpose models (OpenAI ada, E5, BGE) can underperform domain-tuned ones on specialized text such as legal, medical, or code
  • Evaluate on your own dataset with MTEB or a custom recall@K benchmark; leaderboard rank ≠ production rank
  • Larger embedding models give better quality but cost more per token
  • Embedding dimension: higher dims (1536, 3072) capture more information but increase storage and ANN cost
Do: run retrieval recall@K on a labeled sample of your production queries before choosing an embedding model (see the sketch below)
Don't: default to the highest-ranked MTEB model without evaluating on your specific domain
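
A minimal sketch of that recall@K check, assuming you already have query and document vectors from the candidate model plus labeled relevant-document IDs per query (all names here are hypothetical):

```python
import numpy as np

def recall_at_k(query_vecs, doc_vecs, doc_ids, relevant_ids, k=10):
    """Fraction of queries for which any labeled-relevant doc appears in the top-k.

    query_vecs:   (Q, d) embeddings of labeled production queries
    doc_vecs:     (N, d) embeddings of the corpus
    doc_ids:      list of N document ids, aligned with doc_vecs
    relevant_ids: list of Q sets of relevant document ids (the labels)
    """
    # Cosine similarity via normalized dot products
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    topk = np.argsort(-(q @ d.T), axis=1)[:, :k]
    hits = sum(
        1 for i, row in enumerate(topk)
        if {doc_ids[j] for j in row} & relevant_ids[i]
    )
    return hits / len(query_vecs)
```

Run this once per candidate model on the same labeled sample; the model ranking you get on your own queries is the one that matters.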

Bi-Encoder vs Cross-Encoder

  • Bi-encoder: embed query and document independently; compare with cosine/dot product
  • Cross-encoder: concatenate query + document and score the pair jointly in one forward pass; much more accurate, but nothing can be precomputed
  • Bi-encoders are fast and precomputable; cross-encoders are slow but superior for re-ranking
  • Production pattern: bi-encoder for ANN recall, cross-encoder for re-ranking top-K
Do: use bi-encoders for retrieval and cross-encoders for re-ranking; don't run cross-encoders at ANN scale (see the sketch below)
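
A sketch of that two-stage pattern with sentence-transformers; the checkpoint names are common public models, and brute-force dot product stands in for the ANN call:

```python
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                   # fast, precomputable
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # slow, accurate

docs = ["first passage ...", "second passage ...", "third passage ..."]  # stand-in corpus
doc_vecs = bi_encoder.encode(docs, normalize_embeddings=True)            # precompute once

def search(query, recall_k=100, final_k=10):
    # Stage 1: bi-encoder recall (brute-force here; ANN index in production)
    q = bi_encoder.encode([query], normalize_embeddings=True)[0]
    candidates = np.argsort(-(doc_vecs @ q))[:recall_k]
    # Stage 2: cross-encoder scores only the small candidate set
    scores = cross_encoder.predict([(query, docs[i]) for i in candidates])
    order = np.argsort(-scores)[:final_k]
    return [(docs[candidates[i]], float(scores[i])) for i in order]
```

The cross-encoder cost is bounded by recall_k, not corpus size, which is why the pattern scales.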
02

ANN indexes

HNSW (Hierarchical Navigable Small World)

  • Graph-based ANN index; achieves high recall with low query latency
  • Key parameters: M (graph connectivity) and ef_construction (build-time quality)
  • Query-time ef controls recall vs latency tradeoff per request
  • Memory-intensive: full graph lives in RAM; not ideal for billions of vectors on a budget
Do: recommend HNSW for latency-sensitive production systems with millions of vectors (see the sketch below)
Don't: use HNSW for billions of vectors without compression; memory will be prohibitive
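
A minimal hnswlib sketch showing where M, ef_construction, and query-time ef enter; random vectors stand in for real embeddings:

```python
import numpy as np
import hnswlib

dim, n = 768, 100_000
data = np.random.rand(n, dim).astype(np.float32)   # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
# M = graph connectivity, ef_construction = build-time candidate list size;
# raising either improves recall at the cost of memory and build time
index.init_index(max_elements=n, M=16, ef_construction=200)
index.add_items(data, np.arange(n))

index.set_ef(64)   # query-time ef: raise for recall, lower for latency, per request
labels, distances = index.knn_query(data[:5], k=10)
```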

IVF (Inverted File Index) and Product Quantization

  • IVF clusters vectors into Voronoi cells; search is restricted to the nearest nprobe cells
  • PQ compresses vectors into short codes (e.g., 64 bytes vs 768×4 = 3072 bytes) for memory savings
  • IVF-PQ is the dominant approach for billion-scale retrieval in FAISS
  • PQ compression trades recall for memory; tune nlist and nprobe against your recall target
Do: name FAISS IVF-PQ when discussing billion-scale vector search under memory constraints (sketched below)
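
A FAISS IVF-PQ sketch under assumed parameters (d=768, nlist=4096, 64 sub-quantizers at 8 bits each, i.e. 64-byte codes); tune nlist and nprobe against your measured recall target:

```python
import numpy as np
import faiss

d, nlist, m = 768, 4096, 64            # m sub-quantizers x 8 bits = 64-byte codes
xb = np.random.rand(1_000_000, d).astype(np.float32)   # stand-in corpus

quantizer = faiss.IndexFlatL2(d)                 # coarse quantizer defining IVF cells
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
index.train(xb[:200_000])                        # learn cells and PQ codebooks on a sample
index.add(xb)

index.nprobe = 16                                # cells scanned per query: recall vs latency
D, I = index.search(xb[:5], 10)
```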

Managed Vector DBs

  • Pinecone, Weaviate, Qdrant, Chroma, pgvector, and Milvus are common choices
  • pgvector is convenient for teams already on PostgreSQL but is slower at scale than dedicated vector DBs
  • Managed services handle sharding, replication, and index builds automatically
  • Evaluate on your own query volume and latency SLA before choosing
Do: know at least two vector DB products and their relative tradeoffs (e.g., Qdrant vs pgvector; a pgvector query is sketched below)
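
With pgvector, nearest-neighbor search is plain SQL; a sketch using psycopg2, assuming a hypothetical items(id, embedding vector(768)) table with the pgvector extension installed:

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb")   # hypothetical connection string

def nearest(query_vec, k=10):
    # pgvector distance operators: <-> is L2, <=> is cosine distance, <#> is
    # negative inner product; ORDER BY on the operator lets pgvector use its index
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id FROM items ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec_literal, k),
        )
        return cur.fetchall()
```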
03

Metadata filters

Pre- and Post-Filtering

  • Pre-filter: restrict ANN search to matching metadata subset before distance computation
  • Post-filter: run ANN over full index, then discard results that don't match metadata predicates
  • Pre-filtering with a highly selective predicate can starve the ANN search of traversable candidates, degrading recall
  • Most managed vector DBs support hybrid pre-filter + ANN to balance precision and recall
Do: ask about the selectivity of the metadata filter; highly selective filters need special handling
Don't: post-filter with a tiny top-K; you may discard the true best matches (both patterns are sketched below)
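
Both strategies in a brute-force numpy sketch (a real system would replace the dot products with an ANN call); note how post-filtering has to over-fetch:

```python
import numpy as np

def post_filter_search(query_vec, doc_vecs, metadata_ok, k=10, oversample=5):
    """Post-filter: over-fetch, then drop results failing the metadata predicate."""
    sims = doc_vecs @ query_vec                        # stand-in for an ANN call
    candidates = np.argsort(-sims)[: k * oversample]   # over-fetch to survive filtering
    kept = [i for i in candidates if metadata_ok[i]]
    return kept[:k]   # can still come up short under a highly selective filter

def pre_filter_search(query_vec, doc_vecs, metadata_ok, k=10):
    """Pre-filter: restrict the search to the matching subset first (exact here)."""
    ids = np.flatnonzero(metadata_ok)
    sims = doc_vecs[ids] @ query_vec
    return ids[np.argsort(-sims)[:k]]
```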
04

Recall

Measuring ANN Recall

  • Recall@K: fraction of true top-K (from exact search) returned by ANN within the top-K results
  • Baseline: run exact brute-force search on a sample to get ground truth
  • Tune index parameters to hit a target recall (e.g., 95% recall@10) at acceptable latency
  • Recall degrades as data distribution shifts — re-benchmark after major data changes
Do: always measure recall@K on your actual data distribution, not just index defaults (sketched below)
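
A self-contained FAISS sketch of the measurement: exact IndexFlatL2 results serve as ground truth for an HNSW index built over the same (here random) vectors:

```python
import numpy as np
import faiss

d, k = 768, 10
xb = np.random.rand(100_000, d).astype(np.float32)   # stand-in corpus
xq = np.random.rand(1_000, d).astype(np.float32)     # sampled queries

exact = faiss.IndexFlatL2(d)          # brute force gives the ground-truth top-k
exact.add(xb)
_, gt = exact.search(xq, k)

ann = faiss.IndexHNSWFlat(d, 32)      # the index actually under test
ann.hnsw.efSearch = 64
ann.add(xb)
_, approx = ann.search(xq, k)

# recall@k: per-query overlap between the ANN top-k and the exact top-k
recall = np.mean([len(set(a) & set(g)) / k for a, g in zip(approx, gt)])
print(f"recall@{k}: {recall:.3f}")
```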

Recall vs Latency Tradeoff

  • Higher ef / nprobe values increase recall at the cost of higher latency
  • For most RAG applications, 90–95% recall@10 is sufficient; 99%+ is rarely worth the latency cost
  • Quantization (PQ, SQ) reduces memory and latency but lowers recall
  • Re-ranking candidates with full-precision vectors can recover accuracy lost to aggressive quantization
Do: frame the recall-latency decision in terms of downstream task accuracy, not just ANN metrics (a parameter sweep is sketched below)
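
Continuing the recall-measurement sketch above (ann, xq, gt, k, and np are defined there), sweeping query-time efSearch makes the tradeoff concrete; the same idea applies to nprobe on IVF indexes:

```python
import time

for ef in (16, 32, 64, 128, 256):
    ann.hnsw.efSearch = ef
    t0 = time.perf_counter()
    _, approx = ann.search(xq, k)
    ms = (time.perf_counter() - t0) / len(xq) * 1e3
    recall = np.mean([len(set(a) & set(g)) / k for a, g in zip(approx, gt)])
    print(f"ef={ef:4d}  recall@{k}={recall:.3f}  latency={ms:.2f} ms/query")
```

Pick the smallest ef that keeps downstream task quality flat, not the one that maximizes recall.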
05

Dimensionality

Embedding Dimensionality

  • Higher dims → better representational capacity but more storage, slower ANN search, and heavier network payloads
  • Matryoshka embeddings (e.g., OpenAI text-embedding-3) support truncation to lower dims with modest quality loss
  • PCA or ICA can reduce dims post-hoc; retraining the embedding model at the target dimension preserves more quality
  • Balance between embedding quality and infrastructure cost is a system design decision
Do: mention Matryoshka embeddings as a way to tune the quality-cost tradeoff without re-embedding the corpus (see the sketch below)
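
Truncation itself is trivial, which is the point; a numpy sketch, valid only for Matryoshka-trained models:

```python
import numpy as np

def truncate_matryoshka(embs, target_dim):
    """Keep the leading target_dim coordinates, then re-normalize.

    Only valid for Matryoshka-trained models (e.g., OpenAI text-embedding-3),
    where the leading dimensions carry the coarsest information."""
    out = embs[:, :target_dim]
    return out / np.linalg.norm(out, axis=1, keepdims=True)

full = np.random.rand(1000, 3072).astype(np.float32)   # stand-in embeddings
small = truncate_matryoshka(full, 768)                 # 4x less storage and ANN cost
```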
06

Updates

Online Index Updates

  • HNSW supports online inserts; IVF-flat supports inserts but cluster quality drifts over time
  • Batch rebuilds restore index quality after many incremental inserts
  • Deletions are expensive in many ANN libraries — use a tombstone + periodic compaction pattern
  • Blue-green index swap: build a new index offline and atomically swap in production for zero downtime
Do: plan for periodic full index rebuilds even if your system supports online inserts (pattern sketched below)
Don't: ignore index rebuild cadence; ANN quality degrades silently after many incremental updates
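
A sketch of the tombstone + compaction pattern around hnswlib; the wrapper class and its methods are hypothetical, not a library API:

```python
import numpy as np
import hnswlib

class TombstoneIndex:
    """Logical deletes via a tombstone set; periodic compaction restores quality."""

    def __init__(self, dim, capacity):
        self.dim, self.capacity = dim, capacity
        self.vectors = {}        # id -> vector, the source of truth
        self.deleted = set()     # tombstoned ids, filtered at query time
        self.index = self._fresh_index(capacity)

    def _fresh_index(self, capacity):
        idx = hnswlib.Index(space="cosine", dim=self.dim)
        idx.init_index(max_elements=max(capacity, 1), M=16, ef_construction=200)
        return idx

    def add(self, vid, vec):                 # assumes capacity is not exceeded
        self.vectors[vid] = vec
        self.index.add_items(vec.reshape(1, -1), np.array([vid]))

    def delete(self, vid):
        self.deleted.add(vid)                # cheap logical delete, no graph surgery

    def query(self, q, k=10):
        # Over-fetch so tombstoned hits can be dropped without shrinking the result
        k_eff = min(k + len(self.deleted), self.index.get_current_count())
        labels, _ = self.index.knn_query(q.reshape(1, -1), k=k_eff)
        return [v for v in labels[0] if v not in self.deleted][:k]

    def compact(self):
        # Blue-green style: build a fresh index offline, then swap it in atomically
        live = {v: vec for v, vec in self.vectors.items() if v not in self.deleted}
        new = self._fresh_index(self.capacity)
        if live:
            new.add_items(np.stack(list(live.values())), np.array(list(live.keys())))
        self.vectors, self.deleted, self.index = live, set(), new
```

Trigger compact() on a schedule or once the tombstone ratio crosses a threshold, and keep serving from the old index until the new one is ready.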