How do you evaluate a RAG system before shipping it to production?
01 The 90-second answer
Open by separating retrieval evaluation from generation evaluation. They fail for different reasons, so I do not score them as one fuzzy quality number. Then I commit to one concrete release path: a frozen golden set for offline retrieval metrics, an answer-quality layer for faithfulness, a shadow run on live traffic, and an explicit rollout gate before any user sees the new version.
If the interviewer wants the short version, I say: no regression on retrieval recall, no unacceptable drop in faithfulness, neutral or better latency at the percentile that matters, then ramp traffic gradually with rollback signals already defined.
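If the interviewer pushes for specifics, I can sketch that gate as code. This is a minimal sketch under assumed names: the EvalSummary fields, the tolerance constants, and the failure strings are illustrative, not a fixed implementation.

```python
from dataclasses import dataclass

# Thresholds are illustrative assumptions for this sketch; agree on real ones before the ramp.
MAX_RECALL_DROP = 0.0          # "no regression on retrieval recall"
MAX_FAITHFULNESS_DROP = 0.02   # what counts as an unacceptable drop
MAX_P95_LATENCY_RATIO = 1.0    # neutral or better at the percentile that matters

@dataclass
class EvalSummary:
    recall_at_k: float     # offline, from the frozen golden set
    faithfulness: float    # from the answer-quality layer
    p95_latency_ms: float  # from the shadow run on live traffic

def release_gate(baseline: EvalSummary, candidate: EvalSummary) -> list[str]:
    """Return the reasons the candidate fails the gate; an empty list means ramp."""
    failures = []
    if candidate.recall_at_k < baseline.recall_at_k - MAX_RECALL_DROP:
        failures.append("retrieval recall regressed")
    if candidate.faithfulness < baseline.faithfulness - MAX_FAITHFULNESS_DROP:
        failures.append("faithfulness dropped past the agreed tolerance")
    if candidate.p95_latency_ms > baseline.p95_latency_ms * MAX_P95_LATENCY_RATIO:
        failures.append("p95 latency got worse")
    return failures
```

The useful property is that each signal carries its own failure reason, so when a ramp gets paused the team already knows which layer to debug, and the rollback signal was defined before any user saw the new version.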
02 Weak vs Strong Answer
Weak answer
"We just benchmark final answers with an LLM judge and iterate until the outputs look better."
Strong answer
"I split retrieval evals from generation evals because they fail for different reasons. First I prove the right evidence is being retrieved. Then I score whether the answer used that evidence faithfully. After that I run a shadow rollout and only ramp once the candidate clears an explicit quality and latency gate."
03 Why two layers
The common mistake is treating the final answer as the only thing worth scoring. That hides the actual fault line. Retrieval can fail by missing the right passage, returning noisy neighbors, or ranking the right chunk too low. Generation can fail even when retrieval is correct, because the prompt is weak, the model overgeneralizes, or the answer ignores the cited evidence.
If you conflate those layers, debugging turns into guessing.
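This is why the retrieval layer gets scored on its own against the frozen golden set before any answer is judged. A minimal sketch, assuming each golden record carries a query plus the ids of its known-relevant passages, and that a retrieve callable returns a ranked list of doc ids:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Share of the golden passages that show up in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1/rank of the first relevant passage; catches the right chunk ranked too low."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def score_retrieval(golden_set, retrieve, k: int = 10) -> dict:
    """Aggregate retrieval metrics over the frozen golden set, before any answer is judged."""
    recalls, ranks = [], []
    for example in golden_set:
        retrieved = retrieve(example["query"])  # assumed to return a ranked list of doc ids
        recalls.append(recall_at_k(retrieved, example["relevant_ids"], k))
        ranks.append(reciprocal_rank(retrieved, example["relevant_ids"]))
    return {
        f"recall@{k}": sum(recalls) / len(recalls),
        "mrr": sum(ranks) / len(ranks),
    }
```

Recall@k catches the missing-passage failures and MRR catches the right chunk being ranked too low; noisy neighbors tend to show up downstream, when recall looks fine but the separate faithfulness layer flags answers leaning on the wrong passages.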