Cheatsheet
Evaluation
Rigorous methods for measuring LLM system quality. Interviewers probe your ability to design reliable evals, choose appropriate metrics, and avoid evaluation pitfalls.
01
Golden sets
Golden Dataset Construction
- Curate 200–1000 examples covering typical, edge, and adversarial cases
- Annotate expected outputs or grading rubrics (not always a single correct answer)
- Stratify by task type, difficulty, and sensitive topic categories
- Treat the golden set as a product artifact: version-controlled, reviewed, and maintained (loader sketch below)
✓
Build golden sets incrementally from real user queries — synthetic data misses real failure modes
✗
Let the golden set go stale — a 2-year-old eval set may not reflect current user behavior
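A minimal sketch of loading a version-controlled JSONL golden set; the field names (id, input, rubric, tags) are illustrative, not a standard schema:

```python
# Loader for a JSONL golden set, one example per line.
import json
from pathlib import Path

REQUIRED_FIELDS = {"id", "input", "rubric", "tags"}  # illustrative schema

def load_golden_set(path: str) -> list[dict]:
    """Load the golden set, failing fast on malformed records."""
    examples = []
    for line in Path(path).read_text().splitlines():
        example = json.loads(line)
        missing = REQUIRED_FIELDS - example.keys()
        if missing:
            raise ValueError(f"{example.get('id', '?')} missing {missing}")
        examples.append(example)
    return examples

# One line of golden_set.jsonl might look like:
# {"id": "q-0042", "input": "Summarize this refund policy ...",
#  "rubric": "Must mention the 30-day window; no invented terms.",
#  "tags": ["typical", "summarization"]}
```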
Contamination Avoidance
- Never use golden-set examples in prompt few-shot examples or fine-tuning data
- Hold out a secret test set separate from the development eval set (deterministic split sketched below)
- Rotate test examples periodically to prevent implicit leakage through prompt engineering
✓
Maintain separate dev (iteration) and test (final gate) splits with strict access controls
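One way to keep the split stable is to hash a persistent example ID; a sketch, assuming each example carries such an ID:

```python
# Deterministic dev/test assignment by hashing a stable example ID,
# so the split never drifts between runs or machines.
import hashlib

def split_name(example_id: str, test_fraction: float = 0.3) -> str:
    digest = hashlib.sha256(example_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return "test" if bucket < test_fraction else "dev"

print(split_name("q-0042"))  # stable: always the same answer for this ID
```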
02
LLM-as-judge
LLM-as-Judge Evaluation
- Use a capable LLM (e.g., GPT-4, Claude) to grade model outputs on rubric criteria
- Provide detailed scoring rubrics — vague prompts produce high-variance scores
- Validate judge agreement against human labels (Cohen's kappa above 0.6 indicates substantial agreement; see the sketch below)
- Judge bias: LLMs prefer their own outputs and verbose answers — control for this
✓
Calibrate your LLM judge against human annotations before relying on it for production decisions
✗
Use the same model as both the system under test and the judge — self-serving bias is severe
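A sketch of the calibration step, assuming scikit-learn is available and the judge and humans graded the same outputs; labels are illustrative:

```python
# Compare LLM-judge verdicts against human labels on identical outputs.
from sklearn.metrics import cohen_kappa_score

human = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass"]
judge = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "pass"]

kappa = cohen_kappa_score(human, judge)
print(f"Cohen's kappa: {kappa:.2f}")
if kappa < 0.6:
    print("Judge is not trustworthy yet; tighten the rubric prompt.")
```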
Pairwise vs Pointwise Evaluation
- Pointwise: score each output on a scale (e.g., 1–5) independently
- Pairwise: show two outputs and ask which is better — often higher agreement with human preference
- Pairwise comparisons can be aggregated into Elo ratings across many matchups (sketch below)
- Pointwise is easier to parallelize; pairwise requires O(n²) comparisons for n candidates
✓
Use pairwise evaluation for relative quality comparisons between model versions
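A sketch of the Elo aggregation; the K-factor and base rating are conventional defaults, not prescribed values:

```python
# Fold pairwise judge verdicts into Elo ratings per model.
from collections import defaultdict

def elo_ratings(comparisons, k=32, base=1000.0):
    ratings = defaultdict(lambda: base)
    for winner, loser in comparisons:
        expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1 - expected)
        ratings[loser] -= k * (1 - expected)
    return dict(ratings)

verdicts = [("model-b", "model-a")] * 3 + [("model-a", "model-b")]
print(elo_ratings(verdicts))  # model-b ends above model-a
```

Updates are order-dependent, so shuffling the comparisons (or averaging over several shuffled passes) yields more stable ratings.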
03
A/B tests
Online A/B Testing for LLMs
- Randomly assign users or sessions to control (current) and treatment (new) model
- Primary metrics: task success rate, user engagement, satisfaction score
- Guard metrics: latency, cost, safety classifier flags — must not regress
- Minimum sample size is determined by the expected effect size, significance level (α = 0.05), and statistical power (1 − β = 0.8); see the sketch below
✓
Define primary and guard metrics before running the A/B test — don't decide them after seeing results
✗
Stop A/B tests early when you see a positive result — wait for the pre-specified sample size
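A sketch of the pre-registered sample-size calculation using statsmodels; the 80% to 85% success rates are illustrative:

```python
# Required sample size per arm for a two-proportion A/B test.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.85, 0.80)  # Cohen's h for the two rates
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8
)
print(f"~{n_per_arm:.0f} sessions per arm")  # on the order of 900
```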
04
Offline metrics
Offline Evaluation Metrics
- Exact match (EM): binary; fails to capture partial credit — use for classification, not generation
- ROUGE / BLEU: n-gram overlap; poor correlation with human quality for open-ended generation
- BERTScore: semantic similarity via embeddings — better correlation than n-gram metrics
- Task-specific metrics: F1 for NER, NDCG for ranking, pass@k for code generation (estimator sketched below)
✓
Choose metrics that correlate with your actual business outcome — measure correlation explicitly
✗
Report BLEU as your primary quality metric for LLM generation — interviewers will push back
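For example, pass@k is usually computed with the unbiased estimator from the Codex paper (Chen et al., 2021); a minimal sketch:

```python
# Unbiased pass@k: with n samples per problem and c of them correct,
# pass@k = 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k draw contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=5))  # ~0.60: P(at least 1 of 5 samples passes)
```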
Correlation Between Offline and Online Metrics
- Offline metric improvement does not guarantee online quality improvement
- Validate offline-online correlation periodically against A/B test results (sketch below)
- Goodhart's Law: when an offline metric becomes a target, it ceases to be a good measure
✓
Always close the loop: run A/B tests to verify offline improvements translate to user outcomes
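A sketch of that correlation check across past launches, assuming SciPy; the numbers are illustrative:

```python
# Pair each past launch's offline metric delta with its measured A/B
# delta and check rank correlation.
from scipy.stats import spearmanr

offline_delta = [0.8, 2.1, -0.5, 1.4, 3.0, 0.2]   # eval-score change
online_delta = [0.3, 1.5, -0.2, 0.1, 2.2, -0.4]   # task-success change (pp)

rho, p_value = spearmanr(offline_delta, online_delta)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A weak rho means the offline metric no longer predicts user outcomes.
```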
05
Error taxonomy
Failure Mode Classification
- Factual errors: claims that contradict ground truth (hallucination)
- Format errors: output doesn't match required schema or structure
- Instruction following errors: model ignores part of the prompt
- Safety errors: model generates policy-violating content
✓
Build an error taxonomy early; label each failure into a category to guide prioritization (sketch below)
✗
Lump all failures into 'model made a mistake' — you need error categories to know what to fix
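A minimal sketch of such a taxonomy with a tally to drive prioritization; category names are illustrative:

```python
# A small failure taxonomy mirroring the categories above.
from collections import Counter
from enum import Enum

class FailureMode(Enum):
    FACTUAL = "factual"          # contradicts ground truth
    FORMAT = "format"            # schema or structure violation
    INSTRUCTION = "instruction"  # ignored part of the prompt
    SAFETY = "safety"            # policy-violating content

labels = [FailureMode.FACTUAL, FailureMode.FORMAT,
          FailureMode.FACTUAL, FailureMode.INSTRUCTION]
print(Counter(labels).most_common())  # biggest bucket = first priority
```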
06
Sample size
Eval Sample Size
- Detecting a 5-percentage-point change at 80% power takes roughly 400–1,600 examples per condition, depending on the baseline rate
- Smaller golden sets detect only large regressions; invest in 500+ examples for production systems
- Stratified sampling: ensure rare categories have enough examples to be statistically detectable
- Bootstrap confidence intervals on eval metrics help communicate uncertainty to stakeholders (sketch below)
✓
Report confidence intervals alongside eval scores — 'we improved 2 points' means less without them
✗
Draw conclusions from a 30-example eval set — noise will dominate the signal
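A minimal sketch of a percentile bootstrap over per-example pass/fail outcomes:

```python
# Percentile-bootstrap confidence interval on a pass rate, so a score
# like "72%" ships with its uncertainty attached.
import random

def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05):
    """outcomes: list of 0/1 pass indicators from one eval run."""
    means = sorted(
        sum(random.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_resamples)
    )
    low = means[int(n_resamples * alpha / 2)]
    high = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return low, high

outcomes = [1] * 72 + [0] * 28  # 72% pass rate on a 100-example set
print(bootstrap_ci(outcomes))   # roughly (0.63, 0.81): a wide interval
```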