
Fine-Tuning

Techniques for adapting pretrained models to new tasks or domains. Interviewers test your ability to choose the right adaptation strategy and diagnose training problems.


Supervised Fine-Tuning (SFT)

  • Train on (input, output) pairs using next-token prediction loss on the output tokens
  • Typically uses a much smaller dataset than pretraining (hundreds to tens of thousands of examples)
  • Learning rate should be 1–2 orders of magnitude lower than pretraining LR
  • SFT teaches format and behavior; it does not reliably teach novel knowledge
Use SFT to teach consistent output format, style, and task-specific behavior
Don't expect SFT to inject new factual knowledge reliably — use RAG for knowledge grounding
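As a minimal sketch, SFT data preparation amounts to rendering each (input, output) pair into one training string for next-token prediction. The section markers below are Alpaca-style placeholders, not a fixed standard — use whatever template your target model expects at inference time:

```python
def format_example(instruction: str, response: str, eos: str = "</s>") -> str:
    """Render one (input, output) pair into a single SFT training string.

    The "### Instruction:"/"### Response:" markers are illustrative
    (Alpaca-style); the EOS token must match your tokenizer's.
    """
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n{response}{eos}"
    )

example = format_example(
    "Summarize: LoRA freezes base weights.",
    "LoRA trains low-rank adapters on top of a frozen model.",
)
```

At training time, loss would be computed only on the response portion of this string (see loss masking below in the data section).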

Instruction Tuning

  • SFT on diverse instruction-following examples (FLAN, Alpaca, Dolly formats)
  • Dramatically improves zero-shot task generalization on instruction-following benchmarks
  • Quality and diversity of instruction set matters more than raw quantity
  • Self-instruct: use an LLM to generate instruction-response pairs for bootstrapping
Cite instruction tuning as the step that turns a raw pretrained model into a useful assistant
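The self-instruct bootstrapping loop can be sketched as follows; `generate_pair` is a hypothetical stand-in for an LLM call that, given the current instruction pool, proposes a new (instruction, response) pair:

```python
def self_instruct(seed_instructions, generate_pair, n_target=100):
    """Bootstrap an instruction set from a few seeds (self-instruct sketch).

    `generate_pair(pool)` is an assumed callable wrapping an LLM that
    returns one new (instruction, response) tuple conditioned on examples
    drawn from `pool`.
    """
    pool = list(seed_instructions)
    pairs = []
    seen = {i.strip().lower() for i in pool}
    while len(pairs) < n_target:
        instr, resp = generate_pair(pool)
        if instr.strip().lower() in seen:  # crude exact-match dedup;
            continue                       # real pipelines use similarity filters
        seen.add(instr.strip().lower())
        pool.append(instr)                 # new instructions become few-shot context
        pairs.append((instr, resp))
    return pairs
```

The dedup step matters: without it, bootstrapped sets collapse toward a few high-probability instruction patterns.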

LoRA (Low-Rank Adaptation)

  • Adds trainable low-rank decomposition matrices (A·B) alongside frozen pretrained weights
  • Only A and B are trained; rank r is typically 4–64 depending on task complexity
  • Reduces trainable parameters by 10–10,000× vs full fine-tuning
  • At inference, LoRA weights can be merged into the base model with zero added latency
Recommend LoRA as the default PEFT method for most fine-tuning tasks
Don't set the rank too high (r > 128) without a reason — it defeats the purpose of parameter efficiency
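The low-rank update and the zero-latency merge can be verified numerically in a few lines. This is a toy NumPy sketch with made-up dimensions (note: real LoRA initializes B to zero so the adapter starts as a no-op; here B is small-random so the equality check is non-trivial):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 8               # hidden size, LoRA rank, scaling (toy values)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = rng.normal(size=(d, r)) * 0.01   # trainable up-projection (real LoRA inits this to zero)
x = rng.normal(size=d)

# Training-time forward: frozen path plus scaled low-rank update
y_adapter = W @ x + (alpha / r) * (B @ (A @ x))

# Inference-time merge: fold the adapter into W, so no extra latency
W_merged = W + (alpha / r) * (B @ A)
y_merged = W_merged @ x
```

Because `W + (alpha/r)·BA` is just another d×d matrix, the merged model has exactly the base model's inference cost.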

Which Layers to Apply LoRA

  • Typically applied to query and value projection matrices in attention layers
  • Applying it to the key, output, and FFN projections as well can improve performance on harder tasks
  • Higher layers (closer to output) are generally more task-specific; lower layers encode general syntax
  • Sweep across layer targets as part of hyperparameter search
Start with q_proj + v_proj; expand to more layers if accuracy is insufficient
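As a sketch, the baseline-then-expand strategy looks like the following. These are plain dicts whose field names follow common PEFT conventions (`r`, `lora_alpha`, `target_modules`), not real library config objects; the module names assume Llama-style layer naming:

```python
# Baseline: attention query/value projections only
lora_baseline = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],
}

# Expanded: add key/output projections and FFN layers for harder tasks
lora_extended = {
    **lora_baseline,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",   # all attention projections
        "gate_proj", "up_proj", "down_proj",      # FFN (Llama-style names)
    ],
}
```

Sweeping from the baseline to the extended target set (at fixed rank) isolates the effect of layer coverage from rank.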

QLoRA

  • Quantizes the base model to 4-bit NF4 (NormalFloat4) to reduce memory footprint
  • LoRA adapters are trained in full precision on top of the quantized backbone
  • Enables fine-tuning 65B+ parameter models on a single 48 GB GPU
  • Slight quality degradation vs full-precision LoRA; usually acceptable for most tasks
Use QLoRA when GPU memory is the bottleneck for fine-tuning large models
Don't use QLoRA when maximum fine-tune quality is required — full-precision LoRA is preferred when memory allows
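The quality/memory trade-off can be illustrated with a toy 4-bit quantizer. This uses uniform symmetric levels for simplicity — the actual NF4 format uses a non-uniform codebook optimized for normally distributed weights — but the round-trip error it exposes is the same phenomenon:

```python
import numpy as np

def quantize_4bit(w):
    """Toy symmetric 4-bit quantization (uniform levels, NOT the real
    NF4 codebook) to illustrate the memory/precision trade-off."""
    scale = np.abs(w).max() / 7.0                      # map weights into roughly [-7, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.normal(scale=0.02, size=1024).astype(np.float32)  # typical small weight scale
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
err = np.abs(w - w_hat).max()                             # bounded by scale / 2
```

In QLoRA the base model stores only `q` (plus per-block scales) at 4 bits per weight, while the LoRA adapters — which carry all the gradient updates — stay in full precision.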

Fine-Tuning Data Principles

  • Quality beats quantity: 1K high-quality examples often outperform 100K noisy ones
  • Ensure training data matches the exact format and style the model will see at inference
  • Deduplicate to prevent memorization of repeated examples
  • Balance classes and task types so the model does not skew toward frequent examples
Do a data audit (sample 100 examples manually) before every fine-tuning run
Don't start fine-tuning without spot-checking the training data — garbage in, garbage out
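Exact deduplication is the cheapest of these checks to automate. A sketch, assuming examples are dicts with `prompt`/`completion` fields (near-duplicate detection, e.g. MinHash, is a further step not shown):

```python
import hashlib

def dedupe(examples):
    """Drop exact duplicates after whitespace/case normalization.

    Assumes each example is {"prompt": ..., "completion": ...}; adapt the
    key to your own schema.
    """
    seen, kept = set(), []
    for ex in examples:
        normalized = (
            ex["prompt"].strip().lower() + "\x00" + ex["completion"].strip().lower()
        )
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

data = [
    {"prompt": "What is LoRA?", "completion": "A PEFT method."},
    {"prompt": "what is lora? ", "completion": "A PEFT method."},  # duplicate
    {"prompt": "What is QLoRA?", "completion": "LoRA on a 4-bit base."},
]
clean = dedupe(data)
```

Run this before the manual 100-example audit so reviewers don't waste passes on repeats.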

Prompt–Completion Format Consistency

  • Training format must exactly match inference-time format including special tokens
  • Chat models require the correct conversation template (ChatML, Llama format, etc.)
  • Loss masking: compute loss only on completion tokens, not input/instruction tokens
Verify the tokenizer chat template is applied identically during training and inference
Don't forget loss masking on the prompt — training on prompt tokens causes the model to repeat instructions
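Loss masking is usually implemented by setting prompt-position labels to the ignore index (−100 is the conventional value that cross-entropy implementations skip). A minimal sketch over token-id lists:

```python
IGNORE_INDEX = -100  # conventional "ignore this position" label id

def build_labels(prompt_ids, completion_ids):
    """Labels for SFT loss masking: prompt positions are ignored by the
    loss; only completion tokens contribute gradient."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)

# Toy token ids: 3 prompt tokens, then completion ending in EOS (id 2)
labels = build_labels([10, 11, 12], [42, 43, 2])
```

The `input_ids` sequence still contains the full prompt + completion; only the label sequence is masked.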

Key Fine-Tuning Hyperparameters

  • Learning rate: 1e-5 to 5e-4 for LoRA; lower for larger models
  • Batch size: larger batches stabilize training; gradient accumulation simulates larger batches
  • Epochs: 1–3 epochs is typically sufficient; more risks overfitting
  • Warmup: 3–10% of total steps prevents large early gradient updates
Run a small sweep over LR (3 values) and rank before committing to a full training run
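The two bookkeeping pieces of this advice — effective batch size under gradient accumulation, and a small LR × rank sweep grid — are one-liners. Sweep values below are illustrative, not prescriptive:

```python
from itertools import product

def effective_batch_size(per_device, grad_accum_steps, num_devices=1):
    """Gradient accumulation sums gradients over `grad_accum_steps`
    micro-batches before each optimizer step, simulating a larger batch."""
    return per_device * grad_accum_steps * num_devices

# Small sweep over LR (3 values) x LoRA rank before committing to a full run
sweep = [
    {"lr": lr, "r": r}
    for lr, r in product([5e-5, 1e-4, 2e-4], [8, 16])
]
```

Six short runs (one per grid point) on a data subset are usually enough to pick a stable (lr, r) pair.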

Learning Rate Scheduling

  • Cosine decay with warmup is the most common schedule for fine-tuning
  • Linear warmup + cosine decay: prevents instability at the start and overshooting at the end
  • Constant LR without decay can cause the model to forget pretrained capabilities
Always include warmup steps — they prevent loss spikes in early training
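The linear-warmup + cosine-decay schedule described above can be sketched as a pure function of the step index (parameter names and the 5% default warmup fraction are illustrative):

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_frac=0.05, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp: avoids large gradient updates on the first batches
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Plotting `lr_at` over `range(total_steps)` is a quick sanity check that the schedule peaks where you expect before launching a run.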

Overfitting Indicators

  • Training loss continues to fall while validation loss plateaus or rises
  • Model memorizes training examples and fails to generalize to paraphrased inputs
  • Catastrophic forgetting: the model loses general capabilities after over-training on a small dataset
  • Eval metrics (BLEU, exact match) improve on training set but drop on held-out test set
Monitor train and eval loss curves throughout training; stop early if eval loss diverges
Don't train to zero training loss on a small dataset — that guarantees overfitting
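Monitoring-with-early-exit reduces to a small piece of state. A sketch of a patience-based check on the eval loss (class and parameter names are illustrative):

```python
class EarlyStopping:
    """Stop when eval loss hasn't improved for `patience` evaluations."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta      # minimum drop that counts as improvement
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, eval_loss):
        if eval_loss < self.best - self.min_delta:
            self.best = eval_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

Call `should_stop` after each eval pass; a diverging eval loss trips the patience counter even while training loss keeps falling.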

Regularization for Fine-Tuning

  • Early stopping based on eval loss is the simplest and most effective regularizer
  • Dropout on LoRA layers (lora_dropout=0.05–0.1) adds regularization
  • Data augmentation: paraphrase or back-translate examples to increase diversity
  • Replay buffer: mix a small fraction of pretraining data to mitigate catastrophic forgetting
Include a replay buffer from pretraining data when fine-tuning on narrow domain data
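The replay-buffer recipe can be sketched as a simple data-mixing step; the 5% default fraction is an illustrative assumption, not a recommendation from a specific paper:

```python
import random

def mix_replay(finetune_data, pretrain_data, replay_frac=0.05, seed=0):
    """Mix a small fraction of general (pretraining-style) examples into
    the fine-tuning set to mitigate catastrophic forgetting."""
    rng = random.Random(seed)                 # seeded for reproducible mixes
    k = min(len(pretrain_data), int(len(finetune_data) * replay_frac))
    mixed = finetune_data + rng.sample(pretrain_data, k)
    rng.shuffle(mixed)                        # interleave rather than append
    return mixed

ft = [f"domain_example_{i}" for i in range(100)]
pt = [f"general_example_{i}" for i in range(50)]
mixed = mix_replay(ft, pt, replay_frac=0.05)
```

Shuffling matters: appending the replay examples at the end would concentrate the "general capability" gradient signal in the final steps instead of spreading it through training.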