
RAG

Patterns for grounding LLM responses in retrieved documents. Interviewers probe your ability to identify and fix quality bottlenecks across the retrieval and generation pipeline.

01

Chunking

Chunking Strategies

  • Fixed-size: split every N tokens with overlap; simple but ignores semantic boundaries
  • Sentence / paragraph: respects natural boundaries; better coherence per chunk
  • Recursive character splitting: tries multiple separators (\n\n, \n, ' ') from coarsest to finest
  • Semantic chunking: split when embedding similarity between consecutive sentences drops sharply
Match chunk size to the model's context window and the expected query granularity
Don't use very large chunks and hope the model will pick out the relevant parts; retrieval quality drops
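
A minimal sketch of fixed-size chunking with overlap (token counts are approximated by whitespace-separated words here; a real pipeline would count tokens with the embedding model's tokenizer):

def chunk_fixed_size(text, chunk_size=200, overlap=40):
    # Split text into chunks of roughly chunk_size "tokens" (whitespace words for simplicity),
    # with `overlap` tokens shared between consecutive chunks so boundary facts are not lost.
    assert 0 <= overlap < chunk_size
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks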

Chunk Overlap and Context Preservation

  • Overlapping chunks prevent key facts from falling on a chunk boundary and being lost
  • Typical overlap: 10–20% of chunk size
  • Parent-child chunking: index small child chunks but retrieve their larger parent for context
  • Too much overlap inflates index size and causes duplicate retrieved content
Use parent-child chunking when the retrieval granularity and generation context size differ
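
A minimal sketch of parent-child chunking under the same word-count approximation: small child chunks are what gets embedded and indexed, while each child keeps a pointer to its larger parent, which is what goes into the prompt.

def parent_child_chunks(text, parent_size=800, child_size=200):
    # Returns a list of {"child": ..., "parent": ...} entries. Embed entry["child"];
    # at query time, pass entry["parent"] of the best-matching children to the generator.
    words = text.split()
    entries = []
    for p_start in range(0, len(words), parent_size):
        parent = words[p_start:p_start + parent_size]
        for c_start in range(0, len(parent), child_size):
            child = parent[c_start:c_start + child_size]
            entries.append({"child": " ".join(child), "parent": " ".join(parent)})
    return entries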

Document Metadata Augmentation

  • Attach source, date, section title, and page number to every chunk
  • Prepending document title or section header into the chunk text improves embedding quality
  • Hypothetical Document Embeddings (HyDE): embed a model-generated hypothetical answer and search with that vector instead of the raw query
Include section headers in the chunk text, not just as metadata, so embeddings capture topic context
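
One way to apply this, sketched below: prepend the title and section header to the text that gets embedded, and carry the remaining metadata alongside the vector for filtering and citations (all field names here are illustrative):

def augment_chunk(chunk_text, doc_title, section_header, source, page, updated_at):
    # Text that actually gets embedded: title and section header prepended for topic context.
    text_for_embedding = f"{doc_title} | {section_header}\n{chunk_text}"
    # Metadata stored alongside the vector for filtering, freshness scoring, and citations.
    metadata = {
        "source": source,
        "section": section_header,
        "page": page,
        "updated_at": updated_at,
    }
    return {"text": text_for_embedding, "metadata": metadata}
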
02

Re-ranking

Two-Stage Retrieval

  • Stage 1: fast ANN retrieval of top-K candidates (e.g., top-100)
  • Stage 2: cross-encoder or LLM re-ranker scores each candidate against the query
  • Cross-encoders see query + document together — much more accurate than bi-encoder recall
  • Re-ranking adds latency; only top-K' (e.g., top-5) from re-ranked list go into the prompt
Always mention two-stage retrieval when discussing how to improve RAG precision
Don't skip re-ranking and try to compensate with a bigger top-K in the prompt; the LLM gets confused by the extra noise
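
A sketch of the two stages, assuming the sentence-transformers CrossEncoder API; vector_store.search and embed_query are placeholders for whatever ANN index and embedding model you use:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def two_stage_retrieve(query, vector_store, embed_query, k_candidates=100, k_final=5):
    # Stage 1: cheap ANN recall over the whole index.
    candidates = vector_store.search(embed_query(query), k=k_candidates)  # list of chunk texts
    # Stage 2: the cross-encoder scores each (query, chunk) pair jointly, which is far more
    # accurate than bi-encoder similarity but too slow to run over the full corpus.
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:k_final]]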

Re-ranker Choices

  • Cross-encoder models: Cohere Rerank, BGE-Reranker, ms-marco models
  • LLM-based re-rankers: use a small model to score relevance (more expensive, sometimes more accurate)
  • Reciprocal Rank Fusion (RRF): merges rankings from multiple retrievers without a separate model
  • Re-ranker quality is evaluated by NDCG, MRR, and recall@K on a labeled relevance set
Know at least one re-ranker by name; Cohere Rerank and BGE are common interview examples
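
RRF is simple enough to write out; the sketch below uses the standard formula (each document scores the sum of 1 / (k + rank) across rankings) with the conventional k = 60:

def reciprocal_rank_fusion(rankings, k=60):
    # rankings: ranked lists of document ids (e.g., one from BM25, one from dense retrieval).
    # Each document scores sum(1 / (k + rank)) over the lists it appears in, rank being 1-based.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a keyword ranking with a dense-retrieval ranking.
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])
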
04

Freshness

Index Freshness

  • Stale documents cause hallucinated or outdated answers
  • Incremental indexing: process only new/changed documents rather than full re-index
  • Time-decay scoring: boost recent documents in re-ranking
  • Freshness SLA should be defined per use case (near-real-time vs daily batch)
Ask about acceptable staleness when designing a RAG system; it drives architecture choices
Don't rely on a weekly batch re-index for a news or pricing use case; the freshness mismatch breaks user trust
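
A sketch of time-decay scoring with an exponential half-life; the blend weights are illustrative, and updated_at is assumed to be a timezone-aware datetime:

from datetime import datetime, timezone

def time_decayed_score(relevance, updated_at, half_life_days=30):
    # Freshness halves every `half_life_days`; blend it with the re-ranker relevance score.
    age_days = (datetime.now(timezone.utc) - updated_at).total_seconds() / 86400
    freshness = 0.5 ** (age_days / half_life_days)
    return relevance * (0.5 + 0.5 * freshness)  # weights are tunable, not canonical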

Change Detection and Deletion

  • Deleted source documents must be removed from the index; orphaned chunks cause ghost answers
  • Change detection: hash document content; re-embed and re-index on hash change
  • Soft deletes with timestamps allow time-travel queries
Handle document deletion explicitly in your indexing pipeline — it's an interview differentiator
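
A minimal sketch of hash-based change detection; current_docs and indexed_hashes are illustrative structures (doc id to current text, and doc id to the content hash stored at index time):

import hashlib

def detect_changes(current_docs, indexed_hashes):
    # Returns which documents to (re-)embed and which orphaned ids to delete from the index.
    to_upsert, to_delete = [], []
    for doc_id, text in current_docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if indexed_hashes.get(doc_id) != digest:
            to_upsert.append(doc_id)  # new or changed content: re-embed and re-index
    for doc_id in indexed_hashes:
        if doc_id not in current_docs:
            to_delete.append(doc_id)  # source gone: remove its chunks to avoid ghost answers
    return to_upsert, to_delete
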
05

Citations

Source Attribution

  • Include chunk source metadata in the prompt so the model can cite it
  • Inline citation format: 'According to [Source A]...' reduces hallucinations
  • Post-generation verification: check if cited chunks actually contain the claimed content
  • Attribution grounding score: measure fraction of answer claims that are traceable to retrieved chunks
Build citation verification into the output pipeline — models fabricate citations if unchecked
Don't trust the model to cite only real sources without downstream verification
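
A crude post-generation check, sketched below: every '[Source X]' tag must refer to a chunk that was actually retrieved, and the citing sentence should share some content words with it. A production system would use an entailment/NLI model rather than lexical overlap; the names here are illustrative.

import re

def verify_citations(answer, retrieved):
    # retrieved: {source_id: chunk_text} for the chunks that were actually put in the prompt.
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        for source_id in re.findall(r"\[Source ([^\]]+)\]", sentence):
            chunk = retrieved.get(source_id)
            if chunk is None:
                flagged.append((sentence, source_id, "cited source was never retrieved"))
                continue
            claim_words = set(re.findall(r"\w+", sentence.lower()))
            if len(claim_words & set(re.findall(r"\w+", chunk.lower()))) < 3:
                flagged.append((sentence, source_id, "low overlap with cited chunk"))
    return flagged  # empty list = no obviously fabricated citations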

Faithfulness vs Relevance

  • Faithfulness: does the answer only assert things supported by the retrieved context?
  • Relevance: does the answer address the user's question?
  • A faithful but irrelevant answer and a relevant but unfaithful (hallucinated) answer are both failures
  • RAGAS framework provides automated metrics for both dimensions
Measure faithfulness and relevance separately — they often have independent failure modes
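
The two dimensions take different inputs, which is why they fail independently; a hypothetical LLM-as-judge sketch (llm_judge is a placeholder callable, not a real API):

def score_faithfulness(answer, retrieved_context, llm_judge):
    # Faithfulness compares the answer against the retrieved context only; the question is ignored.
    return llm_judge(
        f"List any claims in the answer not supported by the context.\n"
        f"Context: {retrieved_context}\nAnswer: {answer}"
    )

def score_relevance(answer, question, llm_judge):
    # Relevance compares the answer against the question only; the retrieved context is ignored.
    return llm_judge(
        f"Does the answer address the question?\nQuestion: {question}\nAnswer: {answer}"
    )
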
06

Latency budgets

RAG Latency Budget

  • End-to-end RAG: embed query → ANN search → re-rank → LLM generate; each step adds latency
  • Typical breakdown: embedding 5–20 ms, ANN 5–50 ms, re-rank 50–200 ms, LLM 500–3000 ms
  • Parallelize independent retrievals (multi-source) rather than serializing them
  • Cache common query embeddings and retrieved results to avoid repeated work
Profile each stage separately in production; LLM generation usually dominates but retrieval can spike
Don't add re-ranking without measuring its latency cost against the accuracy gain
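
A sketch of per-stage timing; the four stage functions are placeholders for your actual embedding, ANN, re-ranking, and generation calls:

import time

def answer_with_timings(query, embed, ann_search, rerank, generate):
    # Run the pipeline once and record per-stage latency in milliseconds.
    timings, t0 = {}, time.perf_counter()

    q_vec = embed(query)
    timings["embed_ms"] = (time.perf_counter() - t0) * 1000

    t = time.perf_counter()
    candidates = ann_search(q_vec, k=100)
    timings["ann_ms"] = (time.perf_counter() - t) * 1000

    t = time.perf_counter()
    top_chunks = rerank(query, candidates, k=5)
    timings["rerank_ms"] = (time.perf_counter() - t) * 1000

    t = time.perf_counter()
    answer = generate(query, top_chunks)
    timings["generate_ms"] = (time.perf_counter() - t) * 1000

    timings["total_ms"] = (time.perf_counter() - t0) * 1000
    return answer, timings  # export timings per stage to your metrics system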

Retrieval Caching

  • Cache embedding results for frequently repeated queries with a fast TTL store (Redis)
  • Semantic cache: cluster semantically similar queries and serve the same retrieved set
  • Cache invalidation must align with index freshness — stale cache + fresh index = wrong answers
Implement exact-match embedding cache as a quick latency win before pursuing semantic caching
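
An exact-match cache sketch using redis-py; embed and ann_search are placeholders, and the TTL must not exceed the index freshness window:

import hashlib
import json
import redis  # assumes redis-py and a reachable Redis instance

r = redis.Redis(host="localhost", port=6379)

def cached_retrieve(query, embed, ann_search, ttl_seconds=3600):
    # Identical query strings reuse previously retrieved chunk ids for ttl_seconds.
    key = "rag:retrieval:" + hashlib.sha256(query.encode("utf-8")).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    chunk_ids = ann_search(embed(query), k=100)
    r.set(key, json.dumps(chunk_ids), ex=ttl_seconds)
    return chunk_ids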