Cheatsheet

Evaluation

Rigorous methods for measuring LLM system quality. Interviewers probe your ability to design reliable evals, choose appropriate metrics, and avoid evaluation pitfalls.

01

Golden sets

Golden Dataset Construction

  • Curate 200–1000 examples covering typical, edge, and adversarial cases
  • Annotate expected outputs or grading rubrics (not always a single correct answer)
  • Stratify by task type, difficulty, and sensitive topic categories
  • Treat the golden set as a product artifact: version-controlled, reviewed, and maintained
Do: Build golden sets incrementally from real user queries — synthetic data misses real failure modes
Don't: Let the golden set go stale — a 2-year-old eval set may not reflect current user behavior
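
A minimal sketch of what one golden-set record and a stratification check might look like; the field names (id, category, difficulty, input, rubric) and the example content are illustrative, not a standard schema.

  from collections import Counter

  # Illustrative golden-set records; fields are hypothetical, not a standard schema.
  golden_set = [
      {
          "id": "gs-0001",
          "category": "refund_policy",   # stratification key: task type / sensitive topic
          "difficulty": "edge",          # typical / edge / adversarial
          "input": "Can I return a digital gift card after 45 days?",
          "rubric": "Must state the 30-day limit and offer escalation to support.",
      },
      # ... more records, curated from real user queries and kept in version control
  ]

  # Stratification check: flag any (category, difficulty) bucket that is empty or tiny.
  coverage = Counter((ex["category"], ex["difficulty"]) for ex in golden_set)
  for bucket, count in sorted(coverage.items()):
      print(bucket, count)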

Contamination Avoidance

  • Never use golden-set examples in prompt few-shot examples or fine-tuning data
  • Hold out a secret test set separate from the development eval set
  • Rotate test examples periodically to prevent implicit leakage through prompt engineering
Do: Maintain separate dev (iteration) and test (final gate) splits with strict access controls
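
A sketch of one way to keep dev and test splits stable and separate, assuming every example has a persistent id; hashing the id means an example's split assignment never changes as the set grows.

  import hashlib

  def split_for(example_id: str, test_fraction: float = 0.3) -> str:
      """Deterministically assign an example to 'dev' or 'test' by hashing its id."""
      digest = hashlib.sha256(example_id.encode()).hexdigest()
      bucket = int(digest[:8], 16) / 0xFFFFFFFF   # stable value in [0, 1]
      return "test" if bucket < test_fraction else "dev"

  # The test split stays sealed: never pasted into prompts, few-shot examples, or
  # fine-tuning data, and only consulted as a final release gate.
  print(split_for("gs-0001"))   # same answer on every run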
02

LLM-as-judge

LLM-as-Judge Evaluation

  • Use a capable LLM (e.g., GPT-4, Claude) to grade model outputs on rubric criteria
  • Provide detailed scoring rubrics — vague prompts produce high-variance scores
  • Validate judge agreement with human labels (Cohen's kappa > 0.6 is acceptable)
  • Judge bias: LLMs prefer their own outputs and verbose answers — control for this
Do: Calibrate your LLM judge against human annotations before relying on it for production decisions
Don't: Use the same model as both the system under test and the judge — self-serving bias is severe
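
A sketch of judge calibration, assuming you already have judge verdicts and human labels for the same outputs; cohen_kappa_score comes from scikit-learn, and the toy labels below are made up.

  from sklearn.metrics import cohen_kappa_score

  # Hypothetical binary verdicts for the same 10 outputs (1 = acceptable, 0 = not).
  human_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]   # from human annotators
  judge_labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]   # from the LLM judge, same rubric

  kappa = cohen_kappa_score(human_labels, judge_labels)
  print(f"Judge vs human agreement (Cohen's kappa): {kappa:.2f}")   # 0.80 on this toy data
  # Per the rule of thumb above, require kappa > 0.6 before trusting the judge as a
  # gate for production decisions; also audit for verbosity and self-preference bias.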

Pairwise vs Pointwise Evaluation

  • Pointwise: score each output on a scale (e.g., 1–5) independently
  • Pairwise: show two outputs and ask which is better — often higher agreement with human preference
  • Pairwise comparisons can be aggregated into Elo ratings across many comparisons
  • Pointwise is easier to parallelize; pairwise requires O(n²) comparisons for n candidates
Do: Use pairwise evaluation for relative quality comparisons between model versions
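
A minimal sketch of aggregating pairwise judge verdicts into Elo ratings; the model names, K-factor, and comparison data are illustrative.

  def expected_score(r_a: float, r_b: float) -> float:
      """Probability that the player rated r_a beats the player rated r_b."""
      return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

  def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
      """Standard Elo update after one pairwise comparison."""
      e_w = expected_score(ratings[winner], ratings[loser])
      ratings[winner] += k * (1.0 - e_w)
      ratings[loser] -= k * (1.0 - e_w)

  ratings = {"model_v1": 1000.0, "model_v2": 1000.0}
  # Each tuple is (winner, loser) from one judged pairwise comparison.
  for winner, loser in [("model_v2", "model_v1"), ("model_v2", "model_v1"), ("model_v1", "model_v2")]:
      update(ratings, winner, loser)
  print(ratings)   # model_v2 ends slightly ahead after winning 2 of 3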
03

A/B tests

Online A/B Testing for LLMs

  • Randomly assign users or sessions to control (current) and treatment (new) model
  • Primary metrics: task success rate, user engagement, satisfaction score
  • Guard metrics: latency, cost, safety classifier flags — must not regress
  • Minimum sample size is determined by effect size, significance level (α=0.05), and power (1−β=0.8)
Do: Define primary and guard metrics before running the A/B test — don't decide them after seeing results
Don't: Stop A/B tests early when you see a positive result — wait for the pre-specified sample size
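
A sketch of the pre-test sample-size calculation for comparing task success rates, using the standard normal-approximation formula for a two-proportion z-test; the 70% baseline and 75% target are made-up numbers.

  import math
  from scipy.stats import norm

  def samples_per_arm(p_control: float, p_treatment: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
      """Approximate n per arm for a two-sided two-proportion z-test."""
      z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
      z_beta = norm.ppf(power)            # 0.84 for power = 0.80
      p_bar = (p_control + p_treatment) / 2
      n = ((z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar)) / (p_control - p_treatment) ** 2
      return math.ceil(n)

  # Hypothetical: current success rate 70%, new model hoped to lift it to 75%.
  print(samples_per_arm(0.70, 0.75))   # roughly 1,250 users per arm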
04

Offline metrics

Offline Evaluation Metrics

  • Exact match (EM): binary; fails to capture partial credit — use for classification, not generation
  • ROUGE / BLEU: n-gram overlap; poor correlation with human quality for open-ended generation
  • BERTScore: semantic similarity via embeddings — better correlation than n-gram metrics
  • Task-specific metrics: F1 for NER, NDCG for ranking, pass@K for code generation
Do: Choose metrics that correlate with your actual business outcome — measure correlation explicitly
Don't: Report BLEU as your primary quality metric for LLM generation — interviewers will push back
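
For pass@K specifically, the widely used unbiased estimator (from the HumanEval paper) is pass@k = 1 − C(n−c, k) / C(n, k), where n samples were generated and c of them passed the tests. A minimal sketch with made-up numbers:

  from math import comb

  def pass_at_k(n: int, c: int, k: int) -> float:
      """Unbiased pass@k estimate from n generated samples, c of which passed."""
      if n - c < k:
          return 1.0   # every size-k subset contains at least one passing sample
      return 1.0 - comb(n - c, k) / comb(n, k)

  # Hypothetical: 20 samples per problem, 3 passed the unit tests.
  print(round(pass_at_k(n=20, c=3, k=1), 3))   # 0.15
  print(round(pass_at_k(n=20, c=3, k=5), 3))   # ~0.60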

Correlation Between Offline and Online Metrics

  • Offline metric improvement does not guarantee online quality improvement
  • Validate offline-online correlation periodically with A/B tests
  • Goodhart's Law: when an offline metric becomes a target, it ceases to be a good measure
Do: Always close the loop: run A/B tests to verify offline improvements translate to user outcomes
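
A sketch of validating offline-online correlation across past launches, assuming you have logged both the offline eval delta and the online A/B outcome for each one; spearmanr comes from scipy and the deltas below are made up.

  from scipy.stats import spearmanr

  # Hypothetical per-launch records: offline eval score delta vs. online success-rate delta.
  offline_deltas = [2.1, 0.5, 3.4, -1.0, 1.2, 0.3]
  online_deltas = [0.8, 0.1, 1.5, -0.4, 0.2, -0.1]

  rho, p_value = spearmanr(offline_deltas, online_deltas)
  print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
  # A weak or unstable correlation is a warning sign that the offline metric is
  # drifting into Goodhart territory and needs to be revisited.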
05

Error taxonomy

Failure Mode Classification

  • Factual errors: claims that contradict ground truth (hallucination)
  • Format errors: output doesn't match required schema or structure
  • Instruction following errors: model ignores part of the prompt
  • Safety errors: model generates policy-violating content
Do: Build an error taxonomy early; label each failure into a category to guide prioritization
Don't: Lump all failures into 'model made a mistake' — you need error categories to know what to fix
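
A small sketch of turning labeled failures into a prioritization view; the category names follow the taxonomy above and the counts are made up.

  from collections import Counter

  # Hypothetical failure labels collected during one eval run, one per failing example.
  failure_labels = (
      ["factual"] * 34 + ["format"] * 21 + ["instruction_following"] * 12 + ["safety"] * 3
  )

  for category, count in Counter(failure_labels).most_common():
      print(f"{category:24s} {count:4d}  ({count / len(failure_labels):.0%})")
  # Prioritize by frequency x severity: a rare safety failure can still outrank a
  # common formatting failure.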
06

Sample size

Eval Sample Size

  • For a 5% effect size at 80% power, you need roughly 600 examples per condition (the exact number depends on the baseline rate)
  • Smaller golden sets detect only large regressions; invest in 500+ examples for production systems
  • Stratified sampling: ensure rare categories have enough examples to be statistically detectable
  • Bootstrap confidence intervals on eval metrics help communicate uncertainty to stakeholders
Do: Report confidence intervals alongside eval scores — 'we improved 2 points' means less without them
Don't: Draw conclusions from a 30-example eval set — noise will dominate the signal
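
A minimal sketch of a percentile-bootstrap confidence interval on an eval pass rate; the per-example 0/1 scores below are made up.

  import random

  def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
      """Percentile bootstrap CI for the mean of per-example scores."""
      rng = random.Random(seed)
      means = sorted(
          sum(rng.choices(scores, k=len(scores))) / len(scores)
          for _ in range(n_resamples)
      )
      return means[int(alpha / 2 * n_resamples)], means[int((1 - alpha / 2) * n_resamples) - 1]

  # Hypothetical: 200-example eval with 148 passes.
  scores = [1] * 148 + [0] * 52
  low, high = bootstrap_ci(scores)
  print(f"pass rate = {sum(scores) / len(scores):.1%}, 95% CI = [{low:.1%}, {high:.1%}]")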