Cheatsheet
Fine-Tuning
Techniques for adapting pretrained models to new tasks or domains. Interviewers test your ability to choose the right adaptation strategy and diagnose training problems.
01
SFT
Supervised Fine-Tuning (SFT)
- Train on (input, output) pairs using next-token prediction loss on the output tokens
- Typically uses a much smaller dataset than pretraining (hundreds to tens of thousands of examples)
- Learning rate should be 1–2 orders of magnitude lower than pretraining LR
- SFT teaches format and behavior; it does not reliably teach novel knowledge
✓
Use SFT to teach consistent output format, style, and task-specific behavior
✗
Expect SFT to inject new factual knowledge reliably; use RAG for knowledge grounding instead
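A minimal sketch of the SFT loss described above, using Hugging Face transformers with gpt2 as a stand-in model; the prompt/completion pair is illustrative. Loss is computed only on the output tokens by masking the prompt positions with -100.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Translate to French: Hello"                  # illustrative (input, output) pair
completion = " Bonjour" + tok.eos_token

prompt_ids = tok(prompt, return_tensors="pt").input_ids
completion_ids = tok(completion, return_tensors="pt").input_ids

input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100  # -100 = ignored by the loss

# Next-token prediction loss, computed only over the completion tokens
loss = model(input_ids=input_ids, labels=labels).loss
```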
Instruction Tuning
- SFT on diverse instruction-following examples (FLAN, Alpaca, Dolly formats)
- Dramatically improves zero-shot task generalization on instruction-following benchmarks
- Quality and diversity of the instruction set matter more than raw quantity
- Self-instruct: use an LLM to generate instruction-response pairs for bootstrapping
✓
Cite instruction tuning as the step that turns a raw pretrained model into a useful assistant
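As a concrete example, here is a sketch of the Alpaca-style prompt format (the "no input" variant); the template wording follows the public Alpaca repo, but exact templates vary across instruction datasets.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)

example = {
    "instruction": "Summarize the benefits of instruction tuning.",
    "output": "It improves zero-shot generalization to unseen tasks.",
}
print(ALPACA_TEMPLATE.format(**example))  # one training example, fully formatted
```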
02
LoRA
LoRA (Low-Rank Adaptation)
- Adds trainable low-rank decomposition matrices (A·B) alongside frozen pretrained weights
- Only A and B are trained; rank r is typically 4–64 depending on task complexity
- Reduces trainable parameters by 10–10,000× vs full fine-tuning
- At inference, LoRA weights can be merged into the base model with zero added latency
✓
Recommend LoRA as the default PEFT method for most fine-tuning tasks
✗
Set the rank too high (r > 128) without justification; it defeats the purpose of parameter efficiency
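A from-scratch sketch of the LoRA idea: the frozen base weight plus a trainable low-rank update B·A, scaled by alpha/r. Dimensions and hyperparameters here are illustrative, not prescriptive.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)      # freeze the pretrained weights
        # Low-rank factors: B starts at zero, so the adapted layer
        # initially behaves exactly like the pretrained one.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
out = layer(torch.randn(2, 10, 768))     # output shape matches the base layer
```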
Which Layers to Apply LoRA
- Typically applied to query and value projection matrices in attention layers
- Applying LoRA to the key, output, and FFN projections as well can improve performance on harder tasks
- Higher layers (closer to output) are generally more task-specific; lower layers encode general syntax
- Sweep across layer targets as part of hyperparameter search
✓
Start with q_proj + v_proj; expand to more layers if accuracy is insufficient
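With the peft library, layer targeting looks roughly like this; the module names (q_proj, v_proj, ...) match Llama-style architectures and the model id is a placeholder, so adjust both for your model family.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf"  # placeholder model id
)

config = LoraConfig(
    r=8,
    lora_alpha=16,
    # Start with attention query/value projections; expand to
    # ["q_proj", "k_proj", "v_proj", "o_proj"] and FFN layers if needed.
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # confirms the tiny trainable fraction
```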
03
QLoRA
QLoRA
- Quantizes the base model to 4-bit NF4 (NormalFloat4) to reduce memory footprint
- LoRA adapters are trained in full precision on top of the quantized backbone
- Enables fine-tuning 65B+ parameter models on a single 48 GB GPU
- Slight quality degradation vs full-precision LoRA; usually acceptable for most tasks
✓
Use QLoRA when GPU memory is the bottleneck for fine-tuning large models
✗
Use QLoRA when maximum fine-tuning quality is required; full-precision LoRA is preferred when memory allows
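A sketch of a typical QLoRA setup with bitsandbytes and peft; the model id is a placeholder and the hyperparameters are common defaults, not paper-mandated values.

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 over 4-bit weights
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # grad checkpointing, norm casting
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```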
04
Dataset quality
Fine-Tuning Data Principles
- Quality beats quantity: 1K high-quality examples often outperform 100K noisy ones
- Ensure training data matches the exact format and style the model will see at inference
- Deduplicate to prevent memorization of repeated examples
- Balance classes and task types so the model does not skew toward the most frequent patterns
✓
Do a data audit (sample 100 examples manually) before every fine-tuning run
✗
Start fine-tuning without spot-checking the training data: garbage in, garbage out
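A minimal sketch of the dedup-and-audit step; the "prompt" and "completion" field names are assumptions about your data schema.

```python
import hashlib
import random

def dedupe(examples):
    """Drop exact duplicates by hashing each (prompt, completion) pair."""
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(
            (ex["prompt"] + "\x00" + ex["completion"]).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

def audit_sample(examples, n=100, seed=0):
    """Random sample to read manually before every fine-tuning run."""
    return random.Random(seed).sample(examples, min(n, len(examples)))
```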
Prompt–Completion Format Consistency
- Training format must exactly match inference-time format including special tokens
- Chat models require the correct conversation template (ChatML, Llama format, etc.)
- Loss masking: compute loss only on completion tokens, not input/instruction tokens
✓
Verify the tokenizer chat template is applied identically during training and inference
✗
Forget loss masking on the prompt; training on prompt tokens teaches the model to parrot instructions back
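A sketch of a train/inference template consistency check using the tokenizer's chat template; the model id is a placeholder and the assertion is a heuristic for common templates. The real point is to diff the two rendered strings.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # placeholder

messages = [{"role": "user", "content": "What is LoRA?"}]

# Inference-time prompt: template plus the assistant header the model completes
infer_text = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Training-time text: the same template applied to the full conversation
train_text = tok.apply_chat_template(
    messages + [{"role": "assistant", "content": "A low-rank adaptation method."}],
    tokenize=False,
)

# For common templates the training text should extend the inference prompt;
# if not, the two formats have drifted and the model will misbehave at inference.
assert train_text.startswith(infer_text.rstrip()), "template mismatch"
```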
05
Hyperparameters
Key Fine-Tuning Hyperparameters
- Learning rate: 1e-5 to 5e-4 for LoRA; lower for larger models
- Batch size: larger batches stabilize training; gradient accumulation simulates larger batches
- Epochs: 1–3 epochs is typically sufficient; more risks overfitting
- Warmup: 3–10% of total steps prevents large early gradient updates
✓
Run a small sweep over LR (3 values) and rank before committing to a full training run
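A sketch of these knobs expressed as transformers TrainingArguments; the values are mid-range starting points from the ranges above, not tuned recommendations.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-4,              # LoRA range: ~1e-5 to 5e-4
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # simulates an effective batch size of 32
    num_train_epochs=2,              # 1-3 is typically enough
    warmup_ratio=0.05,               # 3-10% of total steps
    lr_scheduler_type="cosine",
    logging_steps=10,
)
```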
Learning Rate Scheduling
- Cosine decay with warmup is the most common schedule for fine-tuning
- Linear warmup + cosine decay: prevents instability at the start and overshooting at the end
- Constant LR without decay can cause the model to forget pretrained capabilities
✓
Always include warmup steps — they prevent loss spikes in early training
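A sketch of linear warmup plus cosine decay using the scheduler helper in transformers; the tiny model and step count are stand-ins.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)        # stand-in for the fine-tuned model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

total_steps = 1000
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.05 * total_steps),  # linear warmup over 5% of steps
    num_training_steps=total_steps,            # then cosine decay toward zero
)

for step in range(total_steps):
    # forward / backward would run here in a real training loop
    optimizer.step()
    scheduler.step()                 # one scheduler step per optimizer step
```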
06
Overfitting
Overfitting Indicators
- Training loss continues to fall while validation loss plateaus or rises
- Model memorizes training examples and fails to generalize to paraphrased inputs
- Catastrophic forgetting: the model loses general capabilities after over-training on a small dataset
- Eval metrics (BLEU, exact match) improve on training set but drop on held-out test set
✓
Monitor train and eval loss curves throughout training; stop early if eval loss diverges
✗
Train to zero training loss on a small dataset; that guarantees overfitting
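A sketch of eval-based early stopping with the transformers Trainer; model, train_ds, and eval_ds are placeholders standing in for the setup from the earlier sketches.

```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    eval_strategy="steps",           # named "evaluation_strategy" in older versions
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,     # restore the best checkpoint, not the last
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
trainer = Trainer(
    model=model,                     # placeholder: model and datasets defined elsewhere
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()  # stops if eval_loss fails to improve for 3 consecutive evals
```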
Regularization for Fine-Tuning
- Early stopping based on eval loss is the simplest and most effective regularizer
- Dropout on LoRA layers (lora_dropout=0.05–0.1) adds regularization
- Data augmentation: paraphrase or back-translate examples to increase diversity
- Replay buffer: mix a small fraction of pretraining data to mitigate catastrophic forgetting
✓
Include a replay buffer from pretraining data when fine-tuning on narrow domain data
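A minimal sketch of the replay-buffer mix; the 5% replay fraction is an illustrative assumption, not a canonical value.

```python
import random

def mix_with_replay(domain_data, general_data, replay_fraction=0.05, seed=0):
    """Blend a small slice of general/pretraining-style examples into the
    fine-tuning set to mitigate catastrophic forgetting."""
    rng = random.Random(seed)
    n_replay = int(len(domain_data) * replay_fraction)
    replay = rng.sample(general_data, min(n_replay, len(general_data)))
    mixed = domain_data + replay
    rng.shuffle(mixed)
    return mixed
```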