Cheatsheet
LLM Fundamentals
Core concepts behind modern large language models, from transformer architecture to inference. Interviewers test whether you can explain what happens inside the model at a mechanistic level.
01
Transformers
Transformer Architecture
- Encoder-decoder or decoder-only; GPT-style LLMs are decoder-only
- Each layer: multi-head self-attention → feed-forward network (MLP), with each sublayer wrapped in a residual connection + LayerNorm
- Depth (layers) and width (hidden dim) jointly determine parameter count
- Residual connections allow gradients to flow through many layers without vanishing
✓
Sketch the layer stack when asked to explain transformers; mention residuals and LayerNorm
✗
Conflate encoder-only (BERT) with decoder-only (GPT) architectures
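A minimal sketch of one Pre-LN decoder layer in PyTorch, assuming illustrative sizes (d_model=512, 8 heads) and PyTorch's built-in nn.MultiheadAttention rather than any specific model's implementation:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One Pre-LN decoder block: norm -> self-attention -> residual, norm -> MLP -> residual."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # 4x expansion
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        seq_len = x.size(1)
        # Boolean causal mask: True positions may NOT be attended to (future tokens)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out                # residual around attention
        x = x + self.mlp(self.ln2(x))   # residual around the MLP
        return x

x = torch.randn(2, 16, 512)             # (batch, seq, hidden)
print(DecoderLayer()(x).shape)           # torch.Size([2, 16, 512])
```

Stacking N of these layers plus an embedding table and an output projection gives the full decoder-only stack.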
Positional Encoding
- Attention has no built-in sense of order; positional signals must be injected
- Sinusoidal (absolute) encoding used in original paper; learned embeddings common in GPT
- RoPE (Rotary Position Embedding) encodes relative positions and extrapolates to longer contexts
- ALiBi biases attention scores by distance instead of modifying embeddings
✓
Mention RoPE when discussing long-context or context-length extension
✗
Say positional encoding is optional — without it tokens are order-agnostic
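A small NumPy sketch of the original paper's sinusoidal (absolute) encoding; dimensions are illustrative, and RoPE or ALiBi would replace this with rotations or score biases rather than added embeddings:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Absolute sinusoidal encodings from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                 # (seq, 1)
    i = np.arange(0, d_model, 2)[None, :]             # even feature indices
    angles = pos / np.power(10000.0, i / d_model)     # (seq, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe   # added to token embeddings before the first layer

print(sinusoidal_positions(8, 16).shape)   # (8, 16)
```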
Feed-Forward Network (FFN)
- Two-layer MLP applied per position independently after attention
- Hidden dim is typically 4× the model dim (e.g., 4096 → 16384)
- Mixture-of-Experts (MoE) replaces the dense FFN with sparse expert routing
- FFN parameters dominate total model size in large transformers
✓
Explain MoE as a way to scale parameters without scaling compute proportionally
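A quick back-of-envelope in plain Python (illustrative d=4096 with the standard 4× expansion, biases and embeddings ignored) showing why FFN parameters dominate each layer:

```python
d = 4096                          # model (hidden) dim, illustrative
ffn_hidden = 4 * d                # 4x expansion

attn_params = 4 * d * d           # Wq, Wk, Wv, Wo projections
ffn_params = 2 * d * ffn_hidden   # up- and down-projection matrices

print(f"attention per layer: {attn_params / 1e6:.0f}M")   # ~67M
print(f"FFN per layer:       {ffn_params / 1e6:.0f}M")    # ~134M, about 2/3 of the layer
```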
Layer Normalization
- Normalizes activations across the feature dimension (not batch)
- Pre-LN (normalize before each sublayer) trains more stably than Post-LN (the original paper's placement)
- RMSNorm drops the mean-centering term — used in Llama and Mistral for efficiency
✓
Note that most modern open models use Pre-LN with RMSNorm
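A minimal RMSNorm sketch in PyTorch, following the Llama-style formulation as commonly described (learned scale, no mean-centering, no bias):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """LayerNorm without mean-centering or bias: x / rms(x) * learned scale."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight

x = torch.randn(2, 8, 64)
print(RMSNorm(64)(x).shape)   # torch.Size([2, 8, 64])
```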
02
Attention
Scaled Dot-Product Attention
- Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
- Scaling by √d_k keeps dot-product magnitudes in check so the softmax doesn't saturate (saturation would make gradients vanish)
- O(n²) complexity in sequence length — the main bottleneck for long contexts
- Causal masking in decoder prevents attending to future tokens
✓
Write out the formula; interviewers often ask you to derive it
✗
Forget the square-root scaling — a common interview slip
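The formula transcribed directly into NumPy with a causal mask; shapes are illustrative and this is a single head only:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=True):
    """softmax(Q K^T / sqrt(d_k)) V with an optional causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)     # (seq, seq)
    if causal:
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)       # block future positions
    return softmax(scores) @ V

Q = K = V = np.random.randn(6, 32)    # (seq_len, d_k)
print(attention(Q, K, V).shape)       # (6, 32)
```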
Multi-Head Attention (MHA)
- Runs H attention heads in parallel, each with its own Q/K/V projections
- Outputs are concatenated then projected back to model dim
- Multiple heads let the model attend to different positions for different reasons
- Grouped Query Attention (GQA) and Multi-Query Attention (MQA) share K/V heads to cut KV cache size
✓
Mention GQA when discussing inference efficiency; it's used in Llama 2/3 and Mistral
✗
Say more heads always means better — beyond a point, gains diminish
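A shape-level sketch of GQA in PyTorch: the cache stores only the smaller set of K/V heads, which are repeated across their query group at compute time. Head counts are illustrative, not any particular model's:

```python
import torch

batch, seq, head_dim = 1, 16, 128
n_q_heads, n_kv_heads = 32, 8                  # 4 query heads share each K/V head (GQA)

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # KV cache stores only 8 heads, not 32
v = torch.randn(batch, n_kv_heads, seq, head_dim)

group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)          # expand to 32 heads just for the matmul
v = v.repeat_interleave(group, dim=1)

scores = (q @ k.transpose(-1, -2)) / head_dim ** 0.5   # causal mask omitted in this shape demo
out = scores.softmax(dim=-1) @ v
print(out.shape)   # torch.Size([1, 32, 16, 128])
```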
KV Cache
- Stores past K and V tensors during autoregressive generation
- Avoids recomputing K/V for every new token — critical for low latency
- Memory grows linearly with seq length × batch size × layers × KV heads × head dim (×2 for K and V)
- Eviction strategies (sliding window, H2O) are needed for very long sequences
✓
Mention KV cache when discussing inference cost and latency
✗
Forget KV cache eviction strategies for long contexts
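Back-of-envelope KV cache sizing in plain Python. The shapes below are roughly Llama-2-7B-like (32 layers, 32 KV heads, head_dim 128, FP16) and are meant as an illustration:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """2x for K and V; FP16 (2 bytes/element) by default."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

gb = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=1) / 2**30
print(f"{gb:.1f} GiB per 4K-token sequence")   # ~2.0 GiB
```

This is why GQA/MQA (fewer KV heads) and eviction strategies matter so much at long context and high batch sizes.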
Flash Attention
- Reorders attention computation to minimize HBM (GPU DRAM) reads/writes
- Uses tiling and online softmax to keep intermediates in fast SRAM
- Same outputs as standard attention, but significantly faster and memory-efficient
- Flash Attention 2/3 further improves parallelism across the sequence dimension
✓
Bring up Flash Attention as the go-to implementation for production LLM training and inference
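A NumPy sketch of the online-softmax idea for a single query row: stream over K/V tiles, keep a running max and normalizer, and never materialize the full score row. This shows the conceptual trick only, not the fused, tiled CUDA kernel:

```python
import numpy as np

def online_attention_row(q, K, V, tile=16):
    """Attention output for one query, streaming over K/V tiles (online softmax)."""
    d_k = q.shape[-1]
    run_max = -np.inf                  # running max of scores
    denom = 0.0                        # running softmax normalizer
    acc = np.zeros(V.shape[-1])
    for start in range(0, K.shape[0], tile):
        s = q @ K[start:start + tile].T / np.sqrt(d_k)   # scores for this tile only
        new_max = max(run_max, s.max())
        rescale = np.exp(run_max - new_max)              # rescale previous partial results
        p = np.exp(s - new_max)
        denom = denom * rescale + p.sum()
        acc = acc * rescale + p @ V[start:start + tile]
        run_max = new_max
    return acc / denom

q = np.random.randn(64); K = np.random.randn(128, 64); V = np.random.randn(128, 64)
scores = q @ K.T / np.sqrt(64)
p = np.exp(scores - scores.max()); p /= p.sum()
print(np.allclose(online_attention_row(q, K, V), p @ V))   # True: same result as full softmax
```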
03
Context windows
Context Window
- Maximum number of tokens the model can attend to at once (prompt plus generated output combined)
- Limited by quadratic attention cost and KV cache memory
- Modern models: 8K–200K+ tokens; extended via RoPE scaling or sliding window attention
- Lost-in-the-middle effect: retrieval accuracy drops for content placed in the middle of long contexts
✓
Discuss context length tradeoffs: cost, latency, and the lost-in-the-middle pitfall
✗
Assume bigger context always beats RAG — long contexts are slower and costlier
Context Length Extension
- RoPE scaling (linear, NTK, YaRN) lets models generalize beyond training context length
- Sliding window / sparse attention restricts each token to a local window
- Longformer and BigBird combine local + global attention for O(n) complexity
- Fine-tuning on longer sequences is still often needed for stable performance
✓
Name at least one concrete technique (YaRN, sliding window) with its tradeoff
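A tiny sketch of linear position interpolation for RoPE: divide positions by a scale factor so a longer sequence maps into the angle range seen during training. This is the simplified linear variant; NTK-aware scaling and YaRN rescale the frequencies in more nuanced ways:

```python
import numpy as np

def rope_angles(positions, head_dim, base=10000.0, scale=1.0):
    """Rotary angles per position; scale > 1 compresses positions (linear interpolation)."""
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions / scale, inv_freq)   # (seq, head_dim/2)

train_ctx, new_ctx = 4096, 16384
scale = new_ctx / train_ctx                        # 4x extension -> divide positions by 4
angles = rope_angles(np.arange(new_ctx), head_dim=128, scale=scale)
print(angles.shape)   # (16384, 64); effective positions stay in ~[0, 4096)
```

The tradeoff: compressing positions blurs fine-grained relative distances, which is why a short long-context fine-tune usually follows.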
Prompt and Completion Budget
- Input (prompt) tokens and output (completion) tokens share the context window
- Output tokens are generated autoregressively — each costs one forward pass
- Price is usually higher per output token than per input token
- Prefill (prompt processing) runs all prompt tokens in parallel; decode (generation) produces one token per sequential step
✓
Distinguish prefill vs decode phases when asked about inference cost
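A toy cost-and-latency estimate in plain Python; the per-million-token prices are hypothetical placeholders, not any provider's actual pricing:

```python
# Hypothetical prices per 1M tokens -- real provider pricing varies
price_in, price_out = 3.00, 15.00

prompt_tokens, completion_tokens = 6_000, 500
cost = prompt_tokens / 1e6 * price_in + completion_tokens / 1e6 * price_out
print(f"${cost:.4f} per request")   # $0.0255

# Latency intuition: prefill handles all 6,000 prompt tokens in parallel,
# while the 500 completion tokens each require a sequential decode step.
```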
04
Pretraining
Next-Token Prediction (CLM)
- Standard pretraining objective for decoder-only LLMs
- Minimizes cross-entropy loss over all tokens in the sequence
- Trains on web-scale corpora (Common Crawl, books, code, etc.)
- Scaling laws (Chinchilla) suggest compute-optimal: ~20 training tokens per parameter
✓
Cite Chinchilla scaling law when asked about compute-optimal training
✗
Conflate pretraining (CLM) with masked language modeling (BERT's MLM)
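A PyTorch sketch of the next-token objective: shift logits and labels by one position and take cross-entropy. Shapes and vocabulary size are illustrative:

```python
import torch
import torch.nn.functional as F

vocab, batch, seq = 32_000, 2, 16
logits = torch.randn(batch, seq, vocab)           # model output at each position
tokens = torch.randint(0, vocab, (batch, seq))    # the training sequence itself

# Position t predicts token t+1: drop the last logit and the first label
shift_logits = logits[:, :-1, :].reshape(-1, vocab)
shift_labels = tokens[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)
print(loss.item())   # ~ln(32000) ≈ 10.4 for random logits
```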
Data Quality and Curation
- Deduplication (MinHash, exact n-gram) prevents memorization and improves generalization
- Quality filters: language ID, perplexity filtering, classifier-based heuristics
- Data mixture (code, math, multilingual) strongly affects downstream capability
- Repeating data beyond ~4 epochs yields rapidly diminishing returns compared with fresh data
✓
Emphasize that data quality beats raw quantity for efficient training
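A toy exact-duplicate filter (hash of normalized text) to show the idea; real pipelines use MinHash/LSH for near-duplicates and n-gram overlap checks against evaluation sets:

```python
import hashlib

def dedup_exact(docs):
    """Drop exact duplicates by hashing normalized text; MinHash/LSH handles near-duplicates."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

docs = ["The cat sat.", "the  cat sat.", "A different document."]
print(len(dedup_exact(docs)))   # 2
```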
Scaling Laws
- Loss decreases as a power law of model size (N) and training tokens (D)
- Chinchilla: as compute grows, optimal N and D should grow in roughly equal proportion
- Emergent abilities appear unpredictably at certain scales
- Inference costs scale with N, not D — so smaller models trained longer are often preferred
✓
Mention that current production models are typically trained far past Chinchilla-optimal token counts (over-trained) to reduce inference cost
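Back-of-envelope Chinchilla math in plain Python, using the ~20 tokens/parameter rule and the standard C ≈ 6·N·D FLOPs approximation:

```python
def chinchilla_optimal_tokens(n_params):
    return 20 * n_params                      # ~20 training tokens per parameter

def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens            # standard C ≈ 6ND approximation

N = 7e9                                       # a 7B-parameter model
D_opt = chinchilla_optimal_tokens(N)          # ~140B tokens
print(f"Chinchilla-optimal tokens: {D_opt:.2e}")
print(f"Compute at optimum:       {training_flops(N, D_opt):.2e} FLOPs")
# Many production models train on far more tokens than this to cut inference cost.
```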
05
Tokenization
BPE Tokenization
- Iteratively merges the most frequent adjacent byte pairs to build vocabulary
- Vocabulary size typically 32K–100K tokens
- Unknown words decompose into subwords, never OOV
- Average ~1.3 tokens per English word; code and non-Latin scripts can be much higher
✓
Explain BPE when asked why tokens are subwords
✗
Assume 1 word = 1 token — this leads to wrong cost and latency estimates
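A toy BPE training loop: treat each word as a tuple of symbols and repeatedly merge the most frequent adjacent pair. Real tokenizers work on bytes with pre-tokenization rules, but the core loop is the same idea:

```python
from collections import Counter

def bpe_train(words, num_merges=10):
    """Toy BPE: start from characters, repeatedly merge the most frequent adjacent pair."""
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

print(bpe_train(["lower", "lowest", "low", "low"], num_merges=3))
# e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```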
Tokenization Artifacts
- Leading-space sensitivity: ' apple' and 'apple' may map to different tokens
- Numbers are split unpredictably — '1234' might be 2–4 tokens
- Rare languages and code with uncommon symbols get long token sequences
- Tokenizer mismatch between training and fine-tuning can degrade performance
✓
Mention token-efficiency concerns when designing prompts for non-English or structured content
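A quick way to see these artifacts, assuming the tiktoken package is installed; exact token counts depend on the tokenizer, so treat the printed numbers as illustrative:

```python
import tiktoken   # pip install tiktoken; assumes this package is available

enc = tiktoken.get_encoding("cl100k_base")
for text in ["apple", " apple", "1234", "12345678", "naïveté"]:
    ids = enc.encode(text)
    print(f"{text!r:14} -> {len(ids)} token(s): {ids}")
# Leading spaces, digit runs, and accented characters all change token counts.
```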
Special Tokens
- <|endoftext|>, <|im_start|>, <s>, </s> signal turn boundaries or sequence ends
- Chat models use structured templates (ChatML, Llama-3 format) with special tokens
- Mismatched chat templates between training and inference cause subtle degradation
✓
Always apply the model's documented chat template — not just raw prompt concatenation
✗
Ignore special tokens when switching between model providers
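A sketch using Hugging Face transformers' apply_chat_template; the checkpoint name is a placeholder for whichever chat model you actually serve (and may require access approval):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint -- substitute the chat model you are actually serving
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain KV caching in one sentence."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)   # includes the model's own special tokens rather than raw concatenation
```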
06
Inference
Autoregressive Decoding
- Tokens are generated one at a time, each conditioned on all previous tokens
- Greedy decoding: pick the highest-probability token at each step
- Sampling: temperature, top-k, top-p (nucleus) control randomness
- Beam search: keeps top-B candidates; improves quality but increases latency
✓
Explain that temperature > 1 increases diversity while < 1 makes output more deterministic
✗
Recommend beam search for latency-sensitive production systems without caveats
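A NumPy sketch of temperature plus top-k / top-p truncation applied to one logit vector; the logits, thresholds, and seed are illustrative:

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, top_p=None, seed=0):
    """Apply temperature, then optional top-k / top-p truncation, then sample one token id."""
    rng = np.random.default_rng(seed)                  # fixed seed just for reproducibility here
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_k is not None:                              # keep only the k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:                              # smallest set with cumulative prob >= p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs *= mask
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5, -1.0]
print(sample(logits, temperature=0.7, top_p=0.9))      # token 0 or 1; lower temperature favors 0
```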
Speculative Decoding
- A small draft model generates k tokens; the target model verifies all in one forward pass
- Accepted tokens are kept; first rejection truncates the batch
- Can achieve 2–4× speedup with no change in output distribution
- Draft-model quality and alignment to the target model determine the acceptance rate
✓
Pitch speculative decoding when asked how to reduce decode latency without quality loss
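A simplified sketch using greedy verification (accept drafted tokens while they match the target's argmax); the production algorithm uses rejection sampling so the output distribution matches the target model exactly. draft_model and target_model are placeholder callables:

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    """One speculative-decoding step with greedy verification (simplified sketch).

    draft_model(tokens) -> one next token (cheap, called k times);
    target_model(tokens) -> the predicted next token after every position,
    obtained in a single forward pass. Both are placeholders.
    """
    # 1. Draft k tokens autoregressively with the small model
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        nxt = draft_model(ctx)
        drafted.append(nxt)
        ctx.append(nxt)

    # 2. Score the whole drafted block with one target-model pass
    preds = target_model(list(prefix) + drafted)   # preds[i] = token predicted after position i

    # 3. Keep the longest prefix of drafts the target agrees with,
    #    then append the target's own next token (the "bonus" token)
    accepted = []
    for i, tok in enumerate(drafted):
        if tok == preds[len(prefix) + i - 1]:
            accepted.append(tok)
        else:
            break
    accepted.append(preds[len(prefix) + len(accepted) - 1])
    return accepted   # 1 to k+1 new tokens per target forward pass

# Toy demo: the "true" continuation is the alphabet; the draft slips up on its 3rd token
alphabet = "abcdefghij"
target = lambda toks: [alphabet[i + 1] for i in range(len(toks))]   # always predicts correctly
draft = lambda ctx: "x" if len(ctx) == 5 else alphabet[len(ctx)]    # wrong once
print(speculative_step(list("abc"), draft, target, k=4))            # ['d', 'e', 'f']
```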
Quantization
- Reduces model weight precision: FP16 → INT8 or INT4
- Post-training quantization (PTQ): GPTQ, AWQ — no retraining needed
- QAT (quantization-aware training) recovers accuracy lost by PTQ
- INT4 roughly halves memory vs INT8; quality drop is usually small for 7B+ models
✓
Recommend INT4/INT8 quantization as first step when GPU memory is constrained
✗
Apply aggressive quantization (< INT4) without running evals — quality can collapse
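A minimal symmetric per-tensor INT8 round trip in NumPy to show the mechanics; GPTQ and AWQ are per-group, calibration-based, and considerably more sophisticated:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: int8 weights plus one FP scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)   # one FP32 weight matrix, ~64 MB
q, scale = quantize_int8(w)                           # int8 storage, ~16 MB
err = np.abs(dequantize(q, scale) - w).mean()
print(f"mean abs round-trip error: {err:.4f}")        # small relative to weight scale ~1.0
```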
Batching Strategies
- Static batching: pad all sequences to max length — wastes GPU for variable-length requests
- Continuous batching (iteration-level): finished sequences leave the batch and queued requests join mid-stream, keeping the GPU full
- PagedAttention (vLLM) manages KV cache as virtual memory pages to reduce fragmentation
- Throughput vs latency tradeoff: larger batches improve GPU utilization but increase time-to-first-token
✓
Name continuous batching as the standard for production LLM serving
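A toy scheduler loop (no real model, just bookkeeping) showing iteration-level batching: finished requests free their slot each decode step and queued requests join immediately:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy decode loop: `requests` lists the remaining tokens to generate per request."""
    queue = deque(enumerate(requests))
    running, steps = {}, 0
    while queue or running:
        while queue and len(running) < max_batch:   # admit new requests mid-stream
            rid, remaining = queue.popleft()
            running[rid] = remaining
        for rid in list(running):                   # one decode step for the whole batch
            running[rid] -= 1
            if running[rid] == 0:                   # finished requests free their slot immediately
                del running[rid]
        steps += 1
    return steps

# One long and four short requests: short ones don't wait for the long one to finish
print(continuous_batching([100, 5, 5, 5, 5]))   # 100 decode steps total
```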