How do you evaluate a RAG system before shipping it to production?

Practical answer framework for AI engineer interview loops.

01 The 90-second answer

I open by separating retrieval evaluation from generation evaluation. They fail for different reasons, so I do not collapse them into one fuzzy quality number. Then I commit to one concrete release path: a frozen golden set for offline retrieval metrics, an answer-quality layer for faithfulness, a shadow run on live traffic, and an explicit rollout gate before any user sees the new version.

If the interviewer wants the short version, I say: no regression on retrieval recall, no unacceptable drop in faithfulness, neutral or better latency at the percentile that matters, then ramp traffic gradually with rollback signals already defined.

02 Weak vs Strong Answer

Weak answer

"We just benchmark final answers with an LLM judge and iterate until the outputs look better."

Strong answer

"I split retrieval evals from generation evals because they fail for different reasons. First I prove the right evidence is being retrieved. Then I score whether the answer used that evidence faithfully. After that I run a shadow rollout and only ramp once the candidate clears an explicit quality and latency gate."

03 Why two layers

The common mistake is treating the final answer as the only thing worth scoring. That hides the actual fault line. Retrieval can fail by missing the right passage, returning noisy neighbors, or ranking the right chunk too low. Generation can fail even when retrieval is correct, because the prompt is weak, the model overgeneralizes, or the answer ignores the cited evidence.

If you conflate those layers, debugging turns into guessing.
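One way to keep the layers apart when triaging a failed example is to blame retrieval first and judge generation only when the evidence was actually available. The sketch below is illustrative, not a prescribed implementation: `EvalRecord`, `FailureLayer`, and `attribute_failure` are hypothetical names, and it assumes answerable queries that carry labeled relevant passage ids.

```python
from dataclasses import dataclass
from enum import Enum


class FailureLayer(Enum):
    NONE = "none"
    RETRIEVAL = "retrieval"
    GENERATION = "generation"


@dataclass
class EvalRecord:
    """One scored example. All field names are illustrative assumptions."""
    retrieved_ids: list[str]   # passage ids from the retriever, best first
    relevant_ids: set[str]     # passage ids labeled as correct evidence
    answer_is_faithful: bool   # verdict from a rubric, a human, or a judge


def attribute_failure(record: EvalRecord, k: int = 5) -> FailureLayer:
    """Blame retrieval first; generation is judged only if evidence was in the top k."""
    top_k = record.retrieved_ids[:k]
    evidence_found = any(pid in record.relevant_ids for pid in top_k)
    if not evidence_found:
        return FailureLayer.RETRIEVAL
    if not record.answer_is_faithful:
        return FailureLayer.GENERATION
    return FailureLayer.NONE
```

Counting failures per layer over the whole eval set then tells you whether to work on the retriever or on the prompt and model.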

04 The golden set

This is the load-bearing artifact. I build it around representative query slices, not one giant bucket of random prompts. I want long-tail questions, ambiguous wording, multi-hop queries, queries with no good answer, and the high-value product tasks users actually repeat in production.
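A rough sketch of what "slices, not one giant bucket" can look like in practice: each golden example carries a slice label, and scores are averaged per slice so a regression on multi-hop queries cannot hide behind a healthy overall mean. The `GoldenExample` fields and the slice names are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenExample:
    query: str
    relevant_ids: frozenset[str]  # empty means the correct behavior is to abstain
    slice_name: str               # e.g. "long_tail", "multi_hop", "no_answer"


def mean_by_slice(examples, scores):
    """Average a per-example score within each slice instead of across the whole set."""
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    buckets = defaultdict(list)
    for example, score in zip(examples, scores):
        buckets[example.slice_name].append(score)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}
```

Reporting the per-slice dictionary next to the overall number makes it obvious which kind of query moved.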

What to measure

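  • Recall@k: what fraction of the relevant passages showed up in the top k? When each query has a single relevant passage, this is the same as asking whether it showed up at all.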
  • nDCG: did the retriever rank the best evidence high enough for the generator to use it?
  • Empty-handed rate: how often did retrieval come back with nothing useful?
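  • No-answer precision and recall: when the system abstains, was abstaining correct, and when the correct behavior is to abstain, does it actually abstain?

# Keep the retrieval gate boring, fast, and stable.
# `retriever`, `golden_set`, and `metrics` come from your own eval harness.
for query, expected_passages in golden_set:
    hits = retriever.search(query, k=10)
    # Score recall on the top 5 only, so the metric matches its name.
    metrics.recall_at_5.update(hits[:5], expected_passages)
    metrics.ndcg.update(hits, expected_passages)
    # Counts literally empty result lists only, not "results but none relevant".
    if not hits:
        metrics.empty_handed.increment()

# Assumes delta = candidate score minus baseline score; tolerate a drop of at most 0.02.
assert metrics.recall_at_5.delta >= -0.02, "retrieval regressed"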
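For concreteness, here is a minimal, dependency-free sketch of the metrics named above. It assumes binary relevance labels and unique passage ids in each ranked list; the function names are illustrative and are not the API of the `metrics` object used in the gate.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("recall is undefined for queries with no relevant passage")
    found = set(ranked_ids[:k]) & set(relevant_ids)
    return len(found) / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: rewards placing relevant passages near the top."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, pid in enumerate(ranked_ids[:k])
        if pid in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def abstention_precision_recall(should_abstain, did_abstain):
    """Precision: abstentions that were correct. Recall: required abstentions made."""
    pairs = list(zip(should_abstain, did_abstain))
    true_pos = sum(1 for should, did in pairs if should and did)
    predicted = sum(1 for _, did in pairs if did)
    actual = sum(1 for should, _ in pairs if should)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall
```

Queries with no good answer are scored by the abstention function rather than by recall, which is why `recall_at_k` refuses an empty label set instead of returning a misleading number.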

Version the golden set. Do not quietly overwrite it when new traffic arrives. Refresh it on purpose so you can explain what changed and why.
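One lightweight way to make silent overwrites visible is to stamp every eval report with a content fingerprint of the golden set. This is a sketch under the assumption that examples are stored as JSON-serializable dicts; `golden_set_fingerprint` is a hypothetical helper.

```python
import hashlib
import json


def golden_set_fingerprint(examples):
    """Stable SHA-256 over the examples, independent of their order in the file."""
    canonical = sorted(
        json.dumps(example, sort_keys=True, ensure_ascii=False) for example in examples
    )
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8"))
    return digest.hexdigest()
```

If two runs report different fingerprints, their metrics are not directly comparable, and the changelog for the refresh should explain why.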

05 Shadow traffic and rollout gates

Offline evals are necessary, but they are not enough. Before I expose a new retriever, reranker, chunking strategy, or index refresh policy to users, I mirror production traffic to the candidate path and compare it against the live system.

The point of the shadow run is not vanity metrics. It is distribution realism. Production queries are messier than your curated eval set, and they reveal whether the candidate mishandles real phrasing, introduces latency spikes, or breaks on noisy tail cases.
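The mirroring step can be sketched as a small harness that sends each query to both retrievers and records result overlap and latency side by side. Everything here is an assumption for illustration: `live_search` and `candidate_search` stand in for callables of the form `search(query, k)` that return passage ids, and the calls run sequentially for simplicity.

```python
import time
from dataclasses import dataclass


@dataclass
class ShadowResult:
    query: str
    live_ids: list
    candidate_ids: list
    live_ms: float
    candidate_ms: float

    @property
    def overlap(self):
        """Jaccard overlap of the two result sets; low values flag queries to review."""
        live, cand = set(self.live_ids), set(self.candidate_ids)
        return len(live & cand) / len(live | cand) if live | cand else 1.0


def _timed(search, query, k):
    """Call a retriever and return its ids plus wall-clock latency in milliseconds."""
    start = time.perf_counter()
    ids = search(query, k)
    return ids, (time.perf_counter() - start) * 1000.0


def shadow_compare(queries, live_search, candidate_search, k=10):
    """Run both retrievers on the same queries; only the live result would be served."""
    results = []
    for query in queries:
        live_ids, live_ms = _timed(live_search, query, k)
        cand_ids, cand_ms = _timed(candidate_search, query, k)
        results.append(ShadowResult(query, live_ids, cand_ids, live_ms, cand_ms))
    return results
```

In a real deployment the candidate call would typically run asynchronously and be isolated from the live path, so a slow or failing candidate never affects what users see.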

Strong vs weak rollout posture

| Strong | Weak |
| --- | --- |
| Mirror production queries to both systems and compare on the same judge or rubric. | Ship and "watch the dashboard." |
| Define the gate before the run starts. | Decide after the graphs move. |
| Do not ramp until faithfulness, retrieval quality, and latency are within bounds. | Trust one blended quality score. |

My gate is explicit: no meaningful regression on the golden set, candidate faithfulness within a defined budget on the shadow slice, and no unacceptable latency hit at the percentile the product cares about. Then I ramp gradually with a rollback trigger already chosen.
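Written down as code, such a gate might look like the sketch below. The thresholds are placeholders, not recommendations: the recall budget mirrors the 0.02 bound used in the golden-set check, while the faithfulness budget, the latency ratio, and the choice of p95 are assumptions to be replaced with values the product team agrees on before the run.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    """Illustrative budgets; fix real values before the shadow run starts."""
    max_recall_drop: float = 0.02        # mirrors the golden-set check
    max_faithfulness_drop: float = 0.01  # assumed value
    max_latency_ratio: float = 1.10      # assumed: candidate p95 at most 10% slower


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_gate(baseline, candidate, thresholds=GateThresholds()):
    """Return the list of failed checks; an empty list means the candidate may ramp."""
    failures = []
    if candidate["recall"] < baseline["recall"] - thresholds.max_recall_drop:
        failures.append("retrieval recall regressed")
    faith_floor = baseline["faithfulness"] - thresholds.max_faithfulness_drop
    if candidate["faithfulness"] < faith_floor:
        failures.append("faithfulness regressed")
    base_p95 = percentile(baseline["latencies_ms"], 95)
    cand_p95 = percentile(candidate["latencies_ms"], 95)
    if cand_p95 > base_p95 * thresholds.max_latency_ratio:
        failures.append("p95 latency regressed")
    return failures
```

Returning the named failures rather than a single boolean keeps the three signals separate, which is the same reason for not trusting one blended quality score.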

06 Tradeoffs interviewers probe

This is usually where the interviewer checks whether your answer came from real launches or just eval vocabulary.

  • Golden-set drift: a frozen eval set gets stale. Refresh it on a schedule, but keep prior versions so you can compare over time instead of masking regressions.
  • LLM-as-judge bias: judges can be too forgiving, correlated with the model being evaluated, or bad at catching subtle factuality errors. Audit a slice with human review, especially for high-risk domains.
  • Latency versus recall: reranking often improves ranking quality but eats latency budget. It is worth it when top-of-list accuracy matters downstream; it is wasted budget when the task tolerates a slightly noisier top-k.
  • Shadow-run cost: shadowing retrieval is cheap compared with shadowing generation. Sample the expensive path if needed, but do not skip production realism entirely.
  • No-answer handling: a polished system knows when not to answer. If your eval never rewards abstention, every iteration gets tuned toward answering confidently even when the system should abstain.
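For the judge-bias point, one common way to audit a slice is to have humans label the same examples and compute chance-corrected agreement. The sketch below implements Cohen's kappa directly, assuming both label lists cover the same examples in the same order; what counts as "good enough" agreement is a judgment call and is not fixed here.

```python
def cohens_kappa(human_labels, judge_labels):
    """Chance-corrected agreement between two binary or categorical label lists."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    categories = set(human_labels) | set(judge_labels)
    # Agreement expected by chance if each rater kept their own label frequencies.
    expected = sum(
        (human_labels.count(c) / n) * (judge_labels.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A judge that is too forgiving shows up as low kappa even when raw agreement looks acceptable, because passing nearly everything agrees with the humans mostly by chance.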

07 Follow-up questions to expect

  1. How would you build the golden set if the product is new and ground truth is weak?
  2. What exactly do you mean by faithfulness, and how would you score it?
  3. When would you trust human review over an LLM judge?
  4. How would you debug a regression that appears only in the shadow run?
  5. What would make you roll back after ramping to 100% traffic?
  6. How would you test a new chunking strategy without fooling yourself?
  7. What changes if the index updates every hour instead of every month?
  8. How do you evaluate no-answer behavior for unsafe or unsupported questions?