Cheatsheet
Fine-Tuning
Techniques for adapting pretrained models to new tasks or domains. Interviewers test your ability to choose the right adaptation strategy and diagnose training problems.
01
SFT
Supervised Fine-Tuning (SFT)
- Train on (input, output) pairs using next-token prediction loss on the output tokens
- Typically uses a much smaller dataset than pretraining (hundreds to tens of thousands of examples)
- Learning rate should be 1–2 orders of magnitude lower than pretraining LR
- SFT teaches format and behavior; it does not reliably teach novel knowledge
✓
Use SFT to teach consistent output format, style, and task-specific behavior
✗
Expect SFT to inject new factual knowledge reliably; use RAG for knowledge grounding instead
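A minimal sketch of the SFT loss described above, using Hugging Face transformers with gpt2 as a stand-in model; the prompt/completion pair is illustrative. Loss is computed only on the output tokens by masking the prompt positions with -100.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Translate to French: Hello"                  # illustrative (input, output) pair
completion = " Bonjour" + tok.eos_token

prompt_ids = tok(prompt, return_tensors="pt").input_ids
completion_ids = tok(completion, return_tensors="pt").input_ids

input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100  # -100 = ignored by the loss

# Next-token prediction loss, computed only over the completion tokens
loss = model(input_ids=input_ids, labels=labels).loss
```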
Instruction Tuning
- SFT on diverse instruction-following examples (FLAN, Alpaca, Dolly formats)
- Dramatically improves zero-shot task generalization on instruction-following benchmarks
- Quality and diversity of the instruction set matter more than raw quantity
- Self-instruct: use an LLM to generate instruction-response pairs for bootstrapping
✓
Cite instruction tuning as the step that turns a raw pretrained model into a useful assistant
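As a concrete example, here is a sketch of the Alpaca-style prompt format (the "no input" variant); the template wording follows the public Alpaca repo, but exact templates vary across instruction datasets.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)

example = {
    "instruction": "Summarize the benefits of instruction tuning.",
    "output": "It improves zero-shot generalization to unseen tasks.",
}
print(ALPACA_TEMPLATE.format(**example))  # one training example, fully formatted
```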
02
LoRA
LoRA (Low-Rank Adaptation)
- Adds trainable low-rank decomposition matrices (A·B) alongside frozen pretrained weights
- Only A and B are trained; rank r is typically 4–64 depending on task complexity
- Reduces trainable parameters by 10–10,000× vs full fine-tuning
- At inference, LoRA weights can be merged into the base model with zero added latency
✓
Recommend LoRA as the default PEFT method for most fine-tuning tasks
✗
Set the rank too high (r > 128) without justification; it defeats the purpose of parameter efficiency
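A from-scratch sketch of the LoRA idea: the frozen base weight plus a trainable low-rank update B·A, scaled by alpha/r. Dimensions and hyperparameters here are illustrative, not prescriptive.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)      # freeze the pretrained weights
        # Low-rank factors: B starts at zero, so the adapted layer
        # initially behaves exactly like the pretrained one.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
out = layer(torch.randn(2, 10, 768))     # output shape matches the base layer
```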
Which Layers to Apply LoRA
- Typically applied to query and value projection matrices in attention layers
- Applying LoRA to the key, output, and FFN projections as well can improve performance on harder tasks
- Higher layers (closer to output) are generally more task-specific; lower layers encode general syntax
- Sweep across layer targets as part of hyperparameter search
✓
Start with q_proj + v_proj; expand to more layers if accuracy is insufficient
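With the peft library, layer targeting looks roughly like this; the module names (q_proj, v_proj, ...) match Llama-style architectures and the model id is a placeholder, so adjust both for your model family.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf"  # placeholder model id
)

config = LoraConfig(
    r=8,
    lora_alpha=16,
    # Start with attention query/value projections; expand to
    # ["q_proj", "k_proj", "v_proj", "o_proj"] and FFN layers if needed.
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # confirms the tiny trainable fraction
```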
03
QLoRA
QLoRA
- Quantizes the base model to 4-bit NF4 (NormalFloat4) to reduce memory footprint
- LoRA adapters are trained in full precision on top of the quantized backbone
- Enables fine-tuning 65B+ parameter models on a single 48 GB GPU
- Slight quality degradation vs full-precision LoRA; usually acceptable for most tasks
✓
Use QLoRA when GPU memory is the bottleneck for fine-tuning large models
✗
Use QLoRA when maximum fine-tuning quality is required; full-precision LoRA is preferred when memory allows
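A sketch of a typical QLoRA setup with bitsandbytes and peft; the model id is a placeholder and the hyperparameters are common defaults, not paper-mandated values.

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 over 4-bit weights
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # grad checkpointing, norm casting
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```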
04
Dataset quality
Fine-Tuning Data Principles
- Quality beats quantity: 1K high-quality examples often outperform 100K noisy ones
- Ensure training data matches the exact format and style the model will see at inference
- Deduplicate to prevent memorization of repeated examples
- Balance classes and task types so the model does not skew toward the most frequent patterns
✓
Do a data audit (sample 100 examples manually) before every fine-tuning run
✗
Start fine-tuning without spot-checking the training data: garbage in, garbage out
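A minimal sketch of the dedup-and-audit step; the "prompt" and "completion" field names are assumptions about your data schema.

```python
import hashlib
import random

def dedupe(examples):
    """Drop exact duplicates by hashing each (prompt, completion) pair."""
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(
            (ex["prompt"] + "\x00" + ex["completion"]).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

def audit_sample(examples, n=100, seed=0):
    """Random sample to read manually before every fine-tuning run."""
    return random.Random(seed).sample(examples, min(n, len(examples)))
```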
Prompt–Completion Format Consistency
- Training format must exactly match inference-time format including special tokens
- Chat models require the correct conversation template (ChatML, Llama format, etc.)
- Loss masking: compute loss only on completion tokens, not input/instruction tokens
✓
Verify the tokenizer chat template is applied identically during training and inference
✗
Forget loss masking on the prompt; training on prompt tokens teaches the model to parrot instructions back
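A sketch of a train/inference template consistency check using the tokenizer's chat template; the model id is a placeholder and the assertion is a heuristic for common templates. The real point is to diff the two rendered strings.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # placeholder

messages = [{"role": "user", "content": "What is LoRA?"}]

# Inference-time prompt: template plus the assistant header the model completes
infer_text = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Training-time text: the same template applied to the full conversation
train_text = tok.apply_chat_template(
    messages + [{"role": "assistant", "content": "A low-rank adaptation method."}],
    tokenize=False,
)

# For common templates the training text should extend the inference prompt;
# if not, the two formats have drifted and the model will misbehave at inference.
assert train_text.startswith(infer_text.rstrip()), "template mismatch"
```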
05
Hyperparameters
Key Fine-Tuning Hyperparameters
- Learning rate: 1e-5 to 5e-4 for LoRA; lower for larger models
- Batch size: larger batches stabilize training; gradient accumulation simulates larger batches
- Epochs: 1–3 epochs is typically sufficient; more risks overfitting
- Warmup: 3–10% of total steps prevents large early gradient updates
✓
Run a small sweep over LR (3 values) and rank before committing to a full training run
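A sketch of these knobs expressed as transformers TrainingArguments; the values are mid-range starting points from the ranges above, not tuned recommendations.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-4,              # LoRA range: ~1e-5 to 5e-4
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # simulates an effective batch size of 32
    num_train_epochs=2,              # 1-3 is typically enough
    warmup_ratio=0.05,               # 3-10% of total steps
    lr_scheduler_type="cosine",
    logging_steps=10,
)
```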
Learning Rate Scheduling
- Cosine decay with warmup is the most common schedule for fine-tuning
- Linear warmup + cosine decay: prevents instability at the start and overshooting at the end
- Constant LR without decay can cause the model to forget pretrained capabilities
✓
Always include warmup steps — they prevent loss spikes in early training
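A sketch of linear warmup plus cosine decay using the scheduler helper in transformers; the tiny model and step count are stand-ins.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)        # stand-in for the fine-tuned model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

total_steps = 1000
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.05 * total_steps),  # linear warmup over 5% of steps
    num_training_steps=total_steps,            # then cosine decay toward zero
)

for step in range(total_steps):
    # forward / backward would run here in a real training loop
    optimizer.step()
    scheduler.step()                 # one scheduler step per optimizer step
```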
06
Overfitting
Overfitting Indicators
- Training loss continues to fall while validation loss plateaus or rises
- Model memorizes training examples and fails to generalize to paraphrased inputs
- Catastrophic forgetting: the model loses general capabilities after over-training on a small dataset
- Eval metrics (BLEU, exact match) improve on training set but drop on held-out test set
✓
Monitor train and eval loss curves throughout training; stop early if eval loss diverges
✗
Train to zero training loss on a small dataset; that guarantees overfitting
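A sketch of eval-based early stopping with the transformers Trainer; model, train_ds, and eval_ds are placeholders standing in for the setup from the earlier sketches.

```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    eval_strategy="steps",           # named "evaluation_strategy" in older versions
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,     # restore the best checkpoint, not the last
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
trainer = Trainer(
    model=model,                     # placeholder: model and datasets defined elsewhere
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()  # stops if eval_loss fails to improve for 3 consecutive evals
```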
Regularization for Fine-Tuning
- Early stopping based on eval loss is the simplest and most effective regularizer
- Dropout on LoRA layers (lora_dropout=0.05–0.1) adds regularization
- Data augmentation: paraphrase or back-translate examples to increase diversity
- Replay buffer: mix a small fraction of pretraining data to mitigate catastrophic forgetting
✓
Include a replay buffer from pretraining data when fine-tuning on narrow domain data
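A minimal sketch of the replay-buffer mix; the 5% replay fraction is an illustrative assumption, not a canonical value.

```python
import random

def mix_with_replay(domain_data, general_data, replay_fraction=0.05, seed=0):
    """Blend a small slice of general/pretraining-style examples into the
    fine-tuning set to mitigate catastrophic forgetting."""
    rng = random.Random(seed)
    n_replay = int(len(domain_data) * replay_fraction)
    replay = rng.sample(general_data, min(n_replay, len(general_data)))
    mixed = domain_data + replay
    rng.shuffle(mixed)
    return mixed
```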