Cheatsheet
RAG
Patterns for grounding LLM responses in retrieved documents. Interviewers probe your ability to identify and fix quality bottlenecks across the retrieval and generation pipeline.
01
Chunking
Chunking Strategies
- Fixed-size: split every N tokens with overlap; simple but ignores semantic boundaries
- Sentence / paragraph: respects natural boundaries; better coherence per chunk
- Recursive character splitting: tries multiple separators (\n\n, \n, ' ') from coarsest to finest
- Semantic chunking: split when embedding similarity between consecutive sentences drops sharply
✓
Match chunk size to the model's context window and the expected query granularity
✗
Use very large chunks hoping the model will pick relevant parts — retrieval quality drops
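A minimal fixed-size chunking sketch, assuming whitespace-split words stand in for real tokenizer tokens:

```python
# Fixed-size chunking with overlap; "tokens" here are whitespace-split words,
# a stand-in for a real tokenizer.
def chunk_fixed(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```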
Chunk Overlap and Context Preservation
- Overlapping chunks prevent key facts from falling on a chunk boundary and being lost
- Typical overlap: 10–20% of chunk size
- Parent-child chunking: index small child chunks but retrieve their larger parent for context
- Too much overlap inflates index size and causes duplicate retrieved content
✓
Use parent-child chunking when the retrieval granularity and generation context size differ
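A parent-child chunking sketch, reusing `chunk_fixed` from above; the small child chunks are what gets embedded, the parent store is what goes to the generator:

```python
# Index small child chunks for precise retrieval, but hand the larger parent
# chunk to the generator for context.
def build_parent_child(doc_text: str, parent_size: int = 1024, child_size: int = 256):
    parents = chunk_fixed(doc_text, chunk_size=parent_size, overlap=0)
    index_entries = []   # goes into the vector index
    parent_store = {}    # parent_id -> full parent text, looked up at query time
    for p_id, parent in enumerate(parents):
        parent_store[p_id] = parent
        for child in chunk_fixed(parent, chunk_size=child_size, overlap=32):
            index_entries.append({"text": child, "parent_id": p_id})
    return index_entries, parent_store

def expand_to_parents(retrieved_children, parent_store):
    # Swap retrieved children for their (deduplicated) parents before generation.
    seen, contexts = set(), []
    for child in retrieved_children:
        if child["parent_id"] not in seen:
            seen.add(child["parent_id"])
            contexts.append(parent_store[child["parent_id"]])
    return contexts
```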
Document Metadata Augmentation
- Attach source, date, section title, and page number to every chunk
- Prepending document title or section header into the chunk text improves embedding quality
- Hypothetical Document Embeddings (HyDE): embed a generated hypothetical answer for query expansion
✓
Include section headers in the chunk text, not just as metadata, so embeddings capture topic context
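A sketch of building an index record with header-augmented embedding text; the field names are illustrative, not any particular vector store's schema:

```python
# Prepend title and section header into the text that gets embedded, while
# keeping the raw chunk and its metadata separately for generation and citations.
def make_index_record(chunk: str, title: str, section: str, source: str, page: int) -> dict:
    return {
        "text": chunk,                                        # shown to the generator
        "embedding_input": f"{title} > {section}\n{chunk}",   # what actually gets embedded
        "metadata": {"source": source, "section": section, "page": page},
    }
```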
02
Re-ranking
Two-Stage Retrieval
- Stage 1: fast ANN retrieval of top-K candidates (e.g., top-100)
- Stage 2: cross-encoder or LLM re-ranker scores each candidate against the query
- Cross-encoders score the query and document jointly, which is far more accurate than bi-encoder similarity but costlier per pair
- Re-ranking adds latency; only the top-K' (e.g., top-5) from the re-ranked list go into the prompt
✓
Always mention two-stage retrieval when discussing how to improve RAG precision
✗
Skip re-ranking and try to compensate with a bigger top-K in the prompt — LLM gets confused by noise
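A two-stage retrieval sketch using a sentence-transformers cross-encoder checkpoint; `ann_search()` is a placeholder for your vector store's query call:

```python
from sentence_transformers import CrossEncoder

# Stage-2 model: scores (query, passage) pairs jointly.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, top_k: int = 100, final_k: int = 5) -> list[str]:
    candidates = ann_search(query, k=top_k)        # stage 1: cheap, high recall
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:final_k]]    # stage 2: precise, small K
```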
Re-ranker Choices
- Cross-encoder models: Cohere Rerank, BGE-Reranker, ms-marco models
- LLM-based re-rankers: use a small model to score relevance (more expensive, sometimes more accurate)
- Reciprocal Rank Fusion (RRF): merges rankings from multiple retrievers without a separate model
- Re-ranker quality is evaluated by NDCG, MRR, and recall@K on a labeled relevance set
✓
Know at least one re-ranker by name; Cohere Rerank and BGE are common interview examples
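Reciprocal Rank Fusion is small enough to implement directly; a sketch with the conventional k = 60 smoothing constant:

```python
# Merge ranked lists from multiple retrievers; a document's fused score is the
# sum of 1 / (k + rank) over every list it appears in.
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```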
03
Hybrid search
Hybrid Search (Dense + Sparse)
- Dense retrieval: semantic embedding similarity (ANN search)
- Sparse retrieval: keyword matching via BM25 or TF-IDF
- Hybrid combines both: catches exact-match keywords that embeddings miss
- Score fusion: Reciprocal Rank Fusion (RRF) is a simple, robust merging strategy
✓
Default to hybrid search in production RAG — pure dense retrieval misses exact-match queries
✗
Weight keyword and semantic scores with a hand-tuned alpha without measuring on your dataset
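A hybrid search sketch that fuses BM25 and dense rankings with `rrf_fuse` from above instead of a hand-tuned alpha; assumes the rank-bm25 package, with `dense_search()` as a placeholder for your vector store:

```python
from rank_bm25 import BM25Okapi

corpus = {"d1": "refund policy for enterprise plans", "d2": "API rate limit table"}
doc_ids = list(corpus)
bm25 = BM25Okapi([doc.split() for doc in corpus.values()])   # sparse keyword index

def hybrid_search(query: str, k: int = 10) -> list[str]:
    sparse_scores = bm25.get_scores(query.split())
    sparse_ranking = [doc_ids[i] for i in
                      sorted(range(len(doc_ids)), key=lambda i: sparse_scores[i], reverse=True)]
    dense_ranking = dense_search(query, k=k)   # doc ids ordered by embedding similarity
    return rrf_fuse([sparse_ranking, dense_ranking])[:k]
```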
Query Expansion
- Expand the user query with synonyms, hypothetical answers, or sub-queries before retrieval
- HyDE: generate a hypothetical document, embed it, search with that embedding
- Multi-query: generate several query variants, retrieve for each, merge results
- Expansion increases recall but adds latency and may introduce noise
✓
Use query expansion when users phrase queries differently from how content is written
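A multi-query expansion sketch that reuses the two-stage `retrieve()` and `rrf_fuse()` above; `generate_variants()` is a placeholder LLM call that returns paraphrased queries:

```python
def multi_query_retrieve(query: str, n_variants: int = 3, k: int = 10) -> list[str]:
    variants = [query] + generate_variants(query, n=n_variants)  # paraphrases / sub-queries
    rankings = [retrieve(v) for v in variants]                   # one ranked list per variant
    return rrf_fuse(rankings)[:k]                                # merge and dedupe
```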
04
Freshness
Index Freshness
- Stale documents cause hallucinated or outdated answers
- Incremental indexing: process only new/changed documents rather than full re-index
- Time-decay scoring: boost recent documents in re-ranking
- Freshness SLA should be defined per use case (near-real-time vs daily batch)
✓
Ask about acceptable staleness when designing a RAG system; it drives architecture choices
✗
Use a weekly batch re-index for a news or pricing use case — freshness mismatch breaks trust
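A time-decay scoring sketch; the half-life is an assumption you set per use case (hours for news, months for reference docs):

```python
import time

def decayed_score(rerank_score: float, doc_timestamp: float, half_life_days: float = 30.0) -> float:
    age_days = (time.time() - doc_timestamp) / 86400
    recency = 0.5 ** (age_days / half_life_days)   # 1.0 when brand new, 0.5 at the half-life
    return rerank_score * recency
```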
Change Detection and Deletion
- Deleted source documents must be removed from the index; orphaned chunks cause ghost answers
- Change detection: hash document content; re-embed and re-index on hash change
- Soft deletes with timestamps allow time-travel queries
✓
Handle document deletion explicitly in your indexing pipeline — it's an interview differentiator
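A hash-based sync sketch; `upsert_chunks()` and `delete_doc()` are placeholders for your indexing pipeline:

```python
import hashlib

def sync_index(source_docs: dict[str, str], stored_hashes: dict[str, str]) -> dict[str, str]:
    new_hashes = {}
    for doc_id, text in source_docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        new_hashes[doc_id] = digest
        if stored_hashes.get(doc_id) != digest:     # new or changed document
            upsert_chunks(doc_id, text)             # re-chunk, re-embed, re-index
    for doc_id in stored_hashes.keys() - source_docs.keys():
        delete_doc(doc_id)                          # gone at source: purge its chunks
    return new_hashes                               # persist for the next sync run
```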
05
Citations
Source Attribution
- Include chunk source metadata in the prompt so the model can cite it
- An inline citation format such as 'According to [Source A]...' reduces hallucinations
- Post-generation verification: check if cited chunks actually contain the claimed content
- Attribution grounding score: measure fraction of answer claims that are traceable to retrieved chunks
✓
Build citation verification into the output pipeline — models fabricate citations if unchecked
✗
Trust the model to cite only real sources without downstream verification
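A post-generation citation check sketch, assuming citations look like `[Source A]` and `retrieved` maps those labels to chunk text; the lexical-overlap threshold is a crude stand-in for an entailment/NLI check:

```python
import re

def verify_citations(answer: str, retrieved: dict[str, str], min_overlap: float = 0.3) -> list[tuple]:
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        for label in re.findall(r"\[Source ([A-Za-z0-9]+)\]", sentence):
            chunk = retrieved.get(label)
            if chunk is None:
                flagged.append((sentence, label, "cited source was never retrieved"))
                continue
            claim_words = set(sentence.lower().split())
            overlap = len(claim_words & set(chunk.lower().split())) / max(len(claim_words), 1)
            if overlap < min_overlap:
                flagged.append((sentence, label, "claim not supported by cited chunk"))
    return flagged   # empty list = every citation passed the check
```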
Faithfulness vs Relevance
- Faithfulness: does the answer only assert things supported by the retrieved context?
- Relevance: does the answer address the user's question?
- A faithful but irrelevant answer and a relevant but unfaithful (hallucinated) answer are both failures
- RAGAS framework provides automated metrics for both dimensions
✓
Measure faithfulness and relevance separately — they often have independent failure modes
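A sketch of scoring the two dimensions separately with an LLM judge; `llm_judge()` is a placeholder returning a 0–1 score, and frameworks like RAGAS automate the same separation:

```python
def evaluate_answer(question: str, answer: str, contexts: list[str]) -> dict[str, float]:
    context_block = "\n\n".join(contexts)
    faithfulness = llm_judge(        # supported by the retrieved context?
        f"Context:\n{context_block}\n\nAnswer:\n{answer}\n\n"
        "What fraction of the answer's claims are supported by the context?"
    )
    relevance = llm_judge(           # on-topic for the user's question?
        f"Question:\n{question}\n\nAnswer:\n{answer}\n\n"
        "How well does the answer address the question, ignoring factual correctness?"
    )
    return {"faithfulness": faithfulness, "relevance": relevance}
```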
06
Latency budgets
RAG Latency Budget
- End-to-end RAG: embed query → ANN search → re-rank → LLM generate; each step adds latency
- Typical breakdown: embedding 5–20 ms, ANN 5–50 ms, re-rank 50–200 ms, LLM 500–3000 ms
- Parallelize independent retrievals (multi-source) rather than serializing them
- Cache common query embeddings and retrieved results to avoid repeated work
✓
Profile each stage separately in production; LLM generation usually dominates but retrieval can spike
✗
Add re-ranking without measuring its latency contribution against accuracy gain
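A per-stage timing sketch; the four stage functions are placeholders for your own pipeline, and the point is that each stage gets its own metric:

```python
import time

def answer_with_timings(query: str):
    timings_ms = {}

    def timed(name, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        timings_ms[name] = (time.perf_counter() - start) * 1000
        return result

    query_vec  = timed("embed",    embed_query, query)
    candidates = timed("ann",      ann_search, query_vec)
    top_docs   = timed("rerank",   rerank, query, candidates)
    answer     = timed("generate", generate_answer, query, top_docs)
    return answer, timings_ms   # emit per-stage latency metrics alongside the answer
```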
Retrieval Caching
- Cache query embeddings and retrieval results for frequently repeated queries in a fast TTL store (e.g., Redis)
- Semantic cache: cluster semantically similar queries and serve the same retrieved set
- Cache invalidation must align with index freshness — stale cache + fresh index = wrong answers
✓
Implement exact-match embedding cache as a quick latency win before pursuing semantic caching
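An exact-match cache sketch using Redis with a TTL; the key is a hash of the normalized query so only literal repeats hit, and the TTL should not exceed your index freshness window:

```python
import hashlib, json
import redis

r = redis.Redis(host="localhost", port=6379)

def cached_retrieve(query: str, ttl_seconds: int = 3600) -> list[str]:
    key = "rag:" + hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)              # cache hit: skip embed + ANN + re-rank
    results = retrieve(query)               # placeholder: the full retrieval path
    r.setex(key, ttl_seconds, json.dumps(results))
    return results
```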