Cheatsheet

Evaluation

Rigorous methods for measuring LLM system quality. Interviewers probe your ability to design reliable evals, choose appropriate metrics, and avoid evaluation pitfalls.

01

Golden sets

Golden Dataset Construction

  • Curate 200–1000 examples covering typical, edge, and adversarial cases
  • Annotate expected outputs or grading rubrics (not always a single correct answer)
  • Stratify by task type, difficulty, and sensitive topic categories
  • Treat the golden set as a product artifact: version-controlled, reviewed, and maintained
Do: Build golden sets incrementally from real user queries — synthetic data misses real failure modes
Don't: Let the golden set go stale — a 2-year-old eval set may not reflect current user behavior
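
A minimal sketch of what one golden-set record and a stratification check might look like; the field names (id, category, difficulty, input, rubric) and the example content are illustrative, not a standard schema.

  from collections import Counter

  # Illustrative golden-set records; fields are hypothetical, not a standard schema.
  golden_set = [
      {
          "id": "gs-0001",
          "category": "refund_policy",   # stratification key: task type / sensitive topic
          "difficulty": "edge",          # typical / edge / adversarial
          "input": "Can I return a digital gift card after 45 days?",
          "rubric": "Must state the 30-day limit and offer escalation to support.",
      },
      # ... more records, curated from real user queries and kept in version control
  ]

  # Stratification check: flag any (category, difficulty) bucket that is empty or tiny.
  coverage = Counter((ex["category"], ex["difficulty"]) for ex in golden_set)
  for bucket, count in sorted(coverage.items()):
      print(bucket, count)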

Contamination Avoidance

  • Never use golden-set examples in prompt few-shot examples or fine-tuning data
  • Hold out a secret test set separate from the development eval set
  • Rotate test examples periodically to prevent implicit leakage through prompt engineering
Do: Maintain separate dev (iteration) and test (final gate) splits with strict access controls
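
A sketch of one way to keep dev and test splits stable and separate, assuming every example has a persistent id; hashing the id means an example's split assignment never changes as the set grows.

  import hashlib

  def split_for(example_id: str, test_fraction: float = 0.3) -> str:
      """Deterministically assign an example to 'dev' or 'test' by hashing its id."""
      digest = hashlib.sha256(example_id.encode()).hexdigest()
      bucket = int(digest[:8], 16) / 0xFFFFFFFF   # stable value in [0, 1]
      return "test" if bucket < test_fraction else "dev"

  # The test split stays sealed: never pasted into prompts, few-shot examples, or
  # fine-tuning data, and only consulted as a final release gate.
  print(split_for("gs-0001"))   # same answer on every run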
02

LLM-as-judge

LLM-as-Judge Evaluation

  • Use a capable LLM (e.g., GPT-4, Claude) to grade model outputs on rubric criteria
  • Provide detailed scoring rubrics — vague prompts produce high-variance scores
  • Validate judge agreement with human labels (Cohen's kappa > 0.6 is acceptable)
  • Judge bias: LLMs prefer their own outputs and verbose answers — control for this
Do: Calibrate your LLM judge against human annotations before relying on it for production decisions
Don't: Use the same model as both the system under test and the judge — self-serving bias is severe
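
A sketch of judge calibration, assuming you already have judge verdicts and human labels for the same outputs; cohen_kappa_score comes from scikit-learn, and the toy labels below are made up.

  from sklearn.metrics import cohen_kappa_score

  # Hypothetical binary verdicts for the same 10 outputs (1 = acceptable, 0 = not).
  human_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]   # from human annotators
  judge_labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]   # from the LLM judge, same rubric

  kappa = cohen_kappa_score(human_labels, judge_labels)
  print(f"Judge vs human agreement (Cohen's kappa): {kappa:.2f}")   # 0.80 on this toy data
  # Per the rule of thumb above, require kappa > 0.6 before trusting the judge as a
  # gate for production decisions; also audit for verbosity and self-preference bias.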

Pairwise vs Pointwise Evaluation

  • Pointwise: score each output on a scale (e.g., 1–5) independently
  • Pairwise: show two outputs and ask which is better — often higher agreement with human preference
  • Pairwise comparisons can be aggregated into Elo ratings across many comparisons
  • Pointwise is easier to parallelize; pairwise requires O(n²) comparisons for n candidates
Do: Use pairwise evaluation for relative quality comparisons between model versions
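
A minimal sketch of aggregating pairwise judge verdicts into Elo ratings; the model names, K-factor, and comparison data are illustrative.

  def expected_score(r_a: float, r_b: float) -> float:
      """Probability that the player rated r_a beats the player rated r_b."""
      return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

  def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
      """Standard Elo update after one pairwise comparison."""
      e_w = expected_score(ratings[winner], ratings[loser])
      ratings[winner] += k * (1.0 - e_w)
      ratings[loser] -= k * (1.0 - e_w)

  ratings = {"model_v1": 1000.0, "model_v2": 1000.0}
  # Each tuple is (winner, loser) from one judged pairwise comparison.
  for winner, loser in [("model_v2", "model_v1"), ("model_v2", "model_v1"), ("model_v1", "model_v2")]:
      update(ratings, winner, loser)
  print(ratings)   # model_v2 ends slightly ahead after winning 2 of 3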
03

A/B tests

Online A/B Testing for LLMs

  • Randomly assign users or sessions to control (current) and treatment (new) model
  • Primary metrics: task success rate, user engagement, satisfaction score
  • Guard metrics: latency, cost, safety classifier flags — must not regress
  • Minimum sample size is determined by effect size, significance level (α=0.05), and power (1−β=0.8)
Do: Define primary and guard metrics before running the A/B test — don't decide them after seeing results
Don't: Stop A/B tests early when you see a positive result — wait for the pre-specified sample size
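
A sketch of the pre-test sample-size calculation for comparing task success rates, using the standard normal-approximation formula for a two-proportion z-test; the 70% baseline and 75% target are made-up numbers.

  import math
  from scipy.stats import norm

  def samples_per_arm(p_control: float, p_treatment: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
      """Approximate n per arm for a two-sided two-proportion z-test."""
      z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
      z_beta = norm.ppf(power)            # 0.84 for power = 0.80
      p_bar = (p_control + p_treatment) / 2
      n = ((z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar)) / (p_control - p_treatment) ** 2
      return math.ceil(n)

  # Hypothetical: current success rate 70%, new model hoped to lift it to 75%.
  print(samples_per_arm(0.70, 0.75))   # roughly 1,250 users per arm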
04

Offline metrics

Offline Evaluation Metrics

  • Exact match (EM): binary; fails to capture partial credit — use for classification, not generation
  • ROUGE / BLEU: n-gram overlap; poor correlation with human quality for open-ended generation
  • BERTScore: semantic similarity via embeddings — better correlation than n-gram metrics
  • Task-specific metrics: F1 for NER, NDCG for ranking, pass@K for code generation
Do: Choose metrics that correlate with your actual business outcome — measure correlation explicitly
Don't: Report BLEU as your primary quality metric for LLM generation — interviewers will push back
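
For pass@K specifically, the widely used unbiased estimator (from the HumanEval paper) is pass@k = 1 − C(n−c, k) / C(n, k), where n samples were generated and c of them passed the tests. A minimal sketch with made-up numbers:

  from math import comb

  def pass_at_k(n: int, c: int, k: int) -> float:
      """Unbiased pass@k estimate from n generated samples, c of which passed."""
      if n - c < k:
          return 1.0   # every size-k subset contains at least one passing sample
      return 1.0 - comb(n - c, k) / comb(n, k)

  # Hypothetical: 20 samples per problem, 3 passed the unit tests.
  print(round(pass_at_k(n=20, c=3, k=1), 3))   # 0.15
  print(round(pass_at_k(n=20, c=3, k=5), 3))   # ~0.60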

Correlation Between Offline and Online Metrics

  • Offline metric improvement does not guarantee online quality improvement
  • Validate offline-online correlation periodically with A/B tests
  • Goodhart's Law: when an offline metric becomes a target, it ceases to be a good measure
Do: Always close the loop: run A/B tests to verify offline improvements translate to user outcomes
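
A sketch of validating offline-online correlation across past launches, assuming you have logged both the offline eval delta and the online A/B outcome for each one; spearmanr comes from scipy and the deltas below are made up.

  from scipy.stats import spearmanr

  # Hypothetical per-launch records: offline eval score delta vs. online success-rate delta.
  offline_deltas = [2.1, 0.5, 3.4, -1.0, 1.2, 0.3]
  online_deltas = [0.8, 0.1, 1.5, -0.4, 0.2, -0.1]

  rho, p_value = spearmanr(offline_deltas, online_deltas)
  print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
  # A weak or unstable correlation is a warning sign that the offline metric is
  # drifting into Goodhart territory and needs to be revisited.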
05

Error taxonomy

Failure Mode Classification

  • Factual errors: claims that contradict ground truth (hallucination)
  • Format errors: output doesn't match required schema or structure
  • Instruction following errors: model ignores part of the prompt
  • Safety errors: model generates policy-violating content
Do: Build an error taxonomy early; label each failure into a category to guide prioritization
Don't: Lump all failures into 'model made a mistake' — you need error categories to know what to fix
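
A small sketch of turning labeled failures into a prioritization view; the category names follow the taxonomy above and the counts are made up.

  from collections import Counter

  # Hypothetical failure labels collected during one eval run, one per failing example.
  failure_labels = (
      ["factual"] * 34 + ["format"] * 21 + ["instruction_following"] * 12 + ["safety"] * 3
  )

  for category, count in Counter(failure_labels).most_common():
      print(f"{category:24s} {count:4d}  ({count / len(failure_labels):.0%})")
  # Prioritize by frequency x severity: a rare safety failure can still outrank a
  # common formatting failure.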
06

Sample size

Eval Sample Size

  • For a 5% effect size at 80% power, you need roughly 600 examples per condition (the exact number depends on the baseline rate)
  • Smaller golden sets detect only large regressions; invest in 500+ examples for production systems
  • Stratified sampling: ensure rare categories have enough examples to be statistically detectable
  • Bootstrap confidence intervals on eval metrics help communicate uncertainty to stakeholders
Do: Report confidence intervals alongside eval scores — 'we improved 2 points' means less without them
Don't: Draw conclusions from a 30-example eval set — noise will dominate the signal
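
A minimal sketch of a percentile-bootstrap confidence interval on an eval pass rate; the per-example 0/1 scores below are made up.

  import random

  def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
      """Percentile bootstrap CI for the mean of per-example scores."""
      rng = random.Random(seed)
      means = sorted(
          sum(rng.choices(scores, k=len(scores))) / len(scores)
          for _ in range(n_resamples)
      )
      return means[int(alpha / 2 * n_resamples)], means[int((1 - alpha / 2) * n_resamples) - 1]

  # Hypothetical: 200-example eval with 148 passes.
  scores = [1] * 148 + [0] * 52
  low, high = bootstrap_ci(scores)
  print(f"pass rate = {sum(scores) / len(scores):.1%}, 95% CI = [{low:.1%}, {high:.1%}]")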