
RAG

Patterns for grounding LLM responses in retrieved documents. Interviewers probe your ability to identify and fix quality bottlenecks across the retrieval and generation pipeline.

01

Chunking

Chunking Strategies

  • Fixed-size: split every N tokens with overlap; simple but ignores semantic boundaries
  • Sentence / paragraph: respects natural boundaries; better coherence per chunk
  • Recursive character splitting: tries multiple separators (\n\n, \n, ' ') from coarsest to finest
  • Semantic chunking: split when embedding similarity between consecutive sentences drops sharply
Match chunk size to the model's context window and the expected query granularity
Don't use very large chunks and hope the model will pick out the relevant parts; retrieval quality drops
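
A minimal sketch of fixed-size chunking with overlap (token counts are approximated by whitespace-separated words here; a real pipeline would count tokens with the embedding model's tokenizer):

def chunk_fixed_size(text, chunk_size=200, overlap=40):
    # Split text into chunks of roughly chunk_size "tokens" (whitespace words for simplicity),
    # with `overlap` tokens shared between consecutive chunks so boundary facts are not lost.
    assert 0 <= overlap < chunk_size
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks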

Chunk Overlap and Context Preservation

  • Overlapping chunks prevent key facts from falling on a chunk boundary and being lost
  • Typical overlap: 10–20% of chunk size
  • Parent-child chunking: index small child chunks but retrieve their larger parent for context
  • Too much overlap inflates index size and causes duplicate retrieved content
Use parent-child chunking when the retrieval granularity and generation context size differ
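
A minimal sketch of parent-child chunking under the same word-count approximation: small child chunks are what gets embedded and indexed, while each child keeps a pointer to its larger parent, which is what goes into the prompt.

def parent_child_chunks(text, parent_size=800, child_size=200):
    # Returns a list of {"child": ..., "parent": ...} entries. Embed entry["child"];
    # at query time, pass entry["parent"] of the best-matching children to the generator.
    words = text.split()
    entries = []
    for p_start in range(0, len(words), parent_size):
        parent = words[p_start:p_start + parent_size]
        for c_start in range(0, len(parent), child_size):
            child = parent[c_start:c_start + child_size]
            entries.append({"child": " ".join(child), "parent": " ".join(parent)})
    return entries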

Document Metadata Augmentation

  • Attach source, date, section title, and page number to every chunk
  • Prepending document title or section header into the chunk text improves embedding quality
  • Hypothetical Document Embeddings (HyDE): embed a model-generated hypothetical answer and search with that vector instead of the raw query
Include section headers in the chunk text, not just as metadata, so embeddings capture topic context
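
One way to apply this, sketched below: prepend the title and section header to the text that gets embedded, and carry the remaining metadata alongside the vector for filtering and citations (all field names here are illustrative):

def augment_chunk(chunk_text, doc_title, section_header, source, page, updated_at):
    # Text that actually gets embedded: title and section header prepended for topic context.
    text_for_embedding = f"{doc_title} | {section_header}\n{chunk_text}"
    # Metadata stored alongside the vector for filtering, freshness scoring, and citations.
    metadata = {
        "source": source,
        "section": section_header,
        "page": page,
        "updated_at": updated_at,
    }
    return {"text": text_for_embedding, "metadata": metadata}
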
02

Re-ranking

Two-Stage Retrieval

  • Stage 1: fast ANN retrieval of top-K candidates (e.g., top-100)
  • Stage 2: cross-encoder or LLM re-ranker scores each candidate against the query
  • Cross-encoders see query + document together — much more accurate than bi-encoder recall
  • Re-ranking adds latency; only top-K' (e.g., top-5) from re-ranked list go into the prompt
Always mention two-stage retrieval when discussing how to improve RAG precision
Don't skip re-ranking and try to compensate with a bigger top-K in the prompt; the LLM gets confused by the extra noise
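
A sketch of the two stages, assuming the sentence-transformers CrossEncoder API; vector_store.search and embed_query are placeholders for whatever ANN index and embedding model you use:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def two_stage_retrieve(query, vector_store, embed_query, k_candidates=100, k_final=5):
    # Stage 1: cheap ANN recall over the whole index.
    candidates = vector_store.search(embed_query(query), k=k_candidates)  # list of chunk texts
    # Stage 2: the cross-encoder scores each (query, chunk) pair jointly, which is far more
    # accurate than bi-encoder similarity but too slow to run over the full corpus.
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:k_final]]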

Re-ranker Choices

  • Cross-encoder models: Cohere Rerank, BGE-Reranker, ms-marco models
  • LLM-based re-rankers: use a small model to score relevance (more expensive, sometimes more accurate)
  • Reciprocal Rank Fusion (RRF): merges rankings from multiple retrievers without a separate model
  • Re-ranker quality is evaluated by NDCG, MRR, and recall@K on a labeled relevance set
Know at least one re-ranker by name; Cohere Rerank and BGE are common interview examples
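
RRF is simple enough to write out; the sketch below uses the standard formula (each document scores the sum of 1 / (k + rank) across rankings) with the conventional k = 60:

def reciprocal_rank_fusion(rankings, k=60):
    # rankings: ranked lists of document ids (e.g., one from BM25, one from dense retrieval).
    # Each document scores sum(1 / (k + rank)) over the lists it appears in, rank being 1-based.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a keyword ranking with a dense-retrieval ranking.
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])
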
04

Freshness

Index Freshness

  • Stale documents cause hallucinated or outdated answers
  • Incremental indexing: process only new/changed documents rather than full re-index
  • Time-decay scoring: boost recent documents in re-ranking
  • Freshness SLA should be defined per use case (near-real-time vs daily batch)
Ask about acceptable staleness when designing a RAG system; it drives architecture choices
Don't rely on a weekly batch re-index for a news or pricing use case; the freshness mismatch breaks user trust
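
A sketch of time-decay scoring with an exponential half-life; the blend weights are illustrative, and updated_at is assumed to be a timezone-aware datetime:

from datetime import datetime, timezone

def time_decayed_score(relevance, updated_at, half_life_days=30):
    # Freshness halves every `half_life_days`; blend it with the re-ranker relevance score.
    age_days = (datetime.now(timezone.utc) - updated_at).total_seconds() / 86400
    freshness = 0.5 ** (age_days / half_life_days)
    return relevance * (0.5 + 0.5 * freshness)  # weights are tunable, not canonical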

Change Detection and Deletion

  • Deleted source documents must be removed from the index; orphaned chunks cause ghost answers
  • Change detection: hash document content; re-embed and re-index on hash change
  • Soft deletes with timestamps allow time-travel queries
Handle document deletion explicitly in your indexing pipeline — it's an interview differentiator
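
A minimal sketch of hash-based change detection; current_docs and indexed_hashes are illustrative structures (doc id to current text, and doc id to the content hash stored at index time):

import hashlib

def detect_changes(current_docs, indexed_hashes):
    # Returns which documents to (re-)embed and which orphaned ids to delete from the index.
    to_upsert, to_delete = [], []
    for doc_id, text in current_docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if indexed_hashes.get(doc_id) != digest:
            to_upsert.append(doc_id)  # new or changed content: re-embed and re-index
    for doc_id in indexed_hashes:
        if doc_id not in current_docs:
            to_delete.append(doc_id)  # source gone: remove its chunks to avoid ghost answers
    return to_upsert, to_delete
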
05

Citations

Source Attribution

  • Include chunk source metadata in the prompt so the model can cite it
  • Inline citation format: 'According to [Source A]...' reduces hallucinations
  • Post-generation verification: check if cited chunks actually contain the claimed content
  • Attribution grounding score: measure fraction of answer claims that are traceable to retrieved chunks
Build citation verification into the output pipeline — models fabricate citations if unchecked
Don't trust the model to cite only real sources without downstream verification
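
A crude post-generation check, sketched below: every '[Source X]' tag must refer to a chunk that was actually retrieved, and the citing sentence should share some content words with it. A production system would use an entailment/NLI model rather than lexical overlap; the names here are illustrative.

import re

def verify_citations(answer, retrieved):
    # retrieved: {source_id: chunk_text} for the chunks that were actually put in the prompt.
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        for source_id in re.findall(r"\[Source ([^\]]+)\]", sentence):
            chunk = retrieved.get(source_id)
            if chunk is None:
                flagged.append((sentence, source_id, "cited source was never retrieved"))
                continue
            claim_words = set(re.findall(r"\w+", sentence.lower()))
            if len(claim_words & set(re.findall(r"\w+", chunk.lower()))) < 3:
                flagged.append((sentence, source_id, "low overlap with cited chunk"))
    return flagged  # empty list = no obviously fabricated citations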

Faithfulness vs Relevance

  • Faithfulness: does the answer only assert things supported by the retrieved context?
  • Relevance: does the answer address the user's question?
  • A faithful but irrelevant answer and a relevant but unfaithful (hallucinated) answer are both failures
  • RAGAS framework provides automated metrics for both dimensions
Measure faithfulness and relevance separately — they often have independent failure modes
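
The two dimensions take different inputs, which is why they fail independently; a hypothetical LLM-as-judge sketch (llm_judge is a placeholder callable, not a real API):

def score_faithfulness(answer, retrieved_context, llm_judge):
    # Faithfulness compares the answer against the retrieved context only; the question is ignored.
    return llm_judge(
        f"List any claims in the answer not supported by the context.\n"
        f"Context: {retrieved_context}\nAnswer: {answer}"
    )

def score_relevance(answer, question, llm_judge):
    # Relevance compares the answer against the question only; the retrieved context is ignored.
    return llm_judge(
        f"Does the answer address the question?\nQuestion: {question}\nAnswer: {answer}"
    )
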
06

Latency budgets

RAG Latency Budget

  • End-to-end RAG: embed query → ANN search → re-rank → LLM generate; each step adds latency
  • Typical breakdown: embedding 5–20 ms, ANN 5–50 ms, re-rank 50–200 ms, LLM 500–3000 ms
  • Parallelize independent retrievals (multi-source) rather than serializing them
  • Cache common query embeddings and retrieved results to avoid repeated work
Profile each stage separately in production; LLM generation usually dominates but retrieval can spike
Don't add re-ranking without measuring its latency cost against the accuracy gain
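
A sketch of per-stage timing; the four stage functions are placeholders for your actual embedding, ANN, re-ranking, and generation calls:

import time

def answer_with_timings(query, embed, ann_search, rerank, generate):
    # Run the pipeline once and record per-stage latency in milliseconds.
    timings, t0 = {}, time.perf_counter()

    q_vec = embed(query)
    timings["embed_ms"] = (time.perf_counter() - t0) * 1000

    t = time.perf_counter()
    candidates = ann_search(q_vec, k=100)
    timings["ann_ms"] = (time.perf_counter() - t) * 1000

    t = time.perf_counter()
    top_chunks = rerank(query, candidates, k=5)
    timings["rerank_ms"] = (time.perf_counter() - t) * 1000

    t = time.perf_counter()
    answer = generate(query, top_chunks)
    timings["generate_ms"] = (time.perf_counter() - t) * 1000

    timings["total_ms"] = (time.perf_counter() - t0) * 1000
    return answer, timings  # export timings per stage to your metrics system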

Retrieval Caching

  • Cache embedding results for frequently repeated queries with a fast TTL store (Redis)
  • Semantic cache: cluster semantically similar queries and serve the same retrieved set
  • Cache invalidation must align with index freshness — stale cache + fresh index = wrong answers
Implement exact-match embedding cache as a quick latency win before pursuing semantic caching
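
An exact-match cache sketch using redis-py; embed and ann_search are placeholders, and the TTL must not exceed the index freshness window:

import hashlib
import json
import redis  # assumes redis-py and a reachable Redis instance

r = redis.Redis(host="localhost", port=6379)

def cached_retrieve(query, embed, ann_search, ttl_seconds=3600):
    # Identical query strings reuse previously retrieved chunk ids for ttl_seconds.
    key = "rag:retrieval:" + hashlib.sha256(query.encode("utf-8")).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    chunk_ids = ann_search(embed(query), k=100)
    r.set(key, json.dumps(chunk_ids), ex=ttl_seconds)
    return chunk_ids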