LLM Fundamentals

Core concepts behind modern large language models, from transformer architecture to inference. Interviewers test whether you can explain what happens inside the model at a mechanistic level.

01

Transformers

Transformer Architecture

  • Encoder-decoder or decoder-only; GPT-style LLMs are decoder-only
  • Each layer: multi-head self-attention followed by a position-wise feed-forward network (MLP), each sublayer wrapped in a residual connection + LayerNorm
  • Depth (layers) and width (hidden dim) jointly determine parameter count
  • Residual connections allow gradients to flow through many layers without vanishing
Sketch the layer stack when asked to explain transformers; mention residuals and LayerNorm
Don't conflate encoder-only (BERT) with decoder-only (GPT) architectures
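A minimal Pre-LN decoder block sketch in PyTorch, as a concrete version of the layer stack above. The dimensions and the use of `nn.MultiheadAttention` are illustrative assumptions, not any particular model's implementation:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One Pre-LN transformer decoder block: norm -> attention -> residual,
    then norm -> feed-forward MLP -> residual."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand (~4x model dim)
            nn.GELU(),
            nn.Linear(d_ff, d_model),   # project back
        )

    def forward(self, x):               # x: (batch, seq, d_model)
        seq_len = x.size(1)
        # Causal mask: position i may not attend to positions > i
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(self.norm1(x), self.norm1(x), self.norm1(x), attn_mask=mask)
        x = x + attn_out                     # residual connection
        x = x + self.ffn(self.norm2(x))      # residual connection
        return x
```

Stacking many of these blocks (depth) with a given d_model (width) is what sets the parameter count.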

Positional Encoding

  • Attention has no built-in sense of order; positional signals must be injected
  • Sinusoidal (absolute) encoding used in original paper; learned embeddings common in GPT
  • RoPE (Rotary Position Embedding) encodes relative positions and extrapolates to longer contexts
  • ALiBi biases attention scores by distance instead of modifying embeddings
Mention RoPE when discussing long-context or context-length extension
Don't say positional encoding is optional; without it, attention treats tokens as order-agnostic
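A minimal sketch of rotary position embedding (RoPE) applied to a single query/key vector, assuming the standard even/odd dimension pairing and base 10000 (the vector and positions are placeholders):

```python
import numpy as np

def apply_rope(x, position, base=10000.0):
    """Rotate consecutive dimension pairs of x by a position-dependent angle.
    x: (d,) vector for one token, d even."""
    d = x.shape[0]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per dimension pair
    angles = position * theta                   # (d/2,)
    x_even, x_odd = x[0::2], x[1::2]
    # 2D rotation of each (even, odd) pair
    rot_even = x_even * np.cos(angles) - x_odd * np.sin(angles)
    rot_odd = x_even * np.sin(angles) + x_odd * np.cos(angles)
    out = np.empty_like(x)
    out[0::2], out[1::2] = rot_even, rot_odd
    return out

q = apply_rope(np.ones(8), position=5)
k = apply_rope(np.ones(8), position=3)
```

Because the rotation angle is linear in position, the dot product of rotated q and k depends only on their relative offset, which is what makes RoPE amenable to context-length scaling tricks.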

Feed-Forward Network (FFN)

  • Two-layer MLP applied per position independently after attention
  • Hidden dim is typically 4× the model dim (e.g., 4096 → 16384)
  • Mixture-of-Experts (MoE) replaces the dense FFN with sparse expert routing
  • FFN parameters dominate total model size in large transformers
Explain MoE as a way to scale parameters without scaling compute proportionally
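A toy top-2 MoE routing sketch illustrating how the dense FFN is replaced by expert routing. Expert count, gating, and dimensions are assumptions for illustration; real MoE layers add load-balancing losses and capacity limits:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """N expert FFNs; each token is routed to its top-k experts and the
    expert outputs are combined with the (renormalized) gate weights."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.gate(x)                    # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                routed = idx[:, slot] == e       # tokens sent to expert e in this slot
                if routed.any():
                    out[routed] += weights[routed, slot].unsqueeze(-1) * expert(x[routed])
        return out
```

Only k experts run per token, so parameters grow with n_experts while per-token compute stays roughly that of k dense FFNs.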

Layer Normalization

  • Normalizes activations across the feature dimension (not batch)
  • Pre-LN (normalization before each sublayer) is more training-stable than Post-LN (the original paper's placement)
  • RMSNorm drops the mean-centering term — used in Llama and Mistral for efficiency
Note that most modern open models use Pre-LN with RMSNorm
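RMSNorm in a few lines of PyTorch, as a sketch of the general idea; the epsilon and learned scale follow common convention rather than any specific model's code:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Like LayerNorm but with no mean-centering and no bias:
    x / rms(x) * learned_scale."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight
```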
02

Attention

Scaled Dot-Product Attention

  • Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
  • Scaling by √d_k keeps large dot products from saturating the softmax, which would otherwise produce vanishing gradients
  • O(n²) complexity in sequence length — the main bottleneck for long contexts
  • Causal masking in decoder prevents attending to future tokens
Write out the formula; interviewers often ask you to derive it
Don't forget the square-root scaling; it's a common interview slip
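The formula translates almost directly into code; a minimal NumPy version for a single head, with an optional causal mask (shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=True):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q, K: (seq, d_k); V: (seq, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (seq, seq)
    if causal:
        # Mask future positions so token i only attends to tokens <= i
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Row-wise softmax, subtracting the max for numerical stability
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V
```

The (seq, seq) score matrix is where the O(n²) cost in sequence length comes from.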

Multi-Head Attention (MHA)

  • Runs H attention heads in parallel, each with its own Q/K/V projections
  • Outputs are concatenated then projected back to model dim
  • Multiple heads let the model attend to different positions for different reasons
  • Grouped Query Attention (GQA) and Multi-Query Attention (MQA) share K/V heads to cut KV cache size
Mention GQA when discussing inference efficiency; it's used in Llama 2/3 and Mistral
Don't say more heads always means better; beyond a point, gains diminish
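A shape-level sketch of grouped-query attention: many query heads share a smaller set of K/V heads, which is what shrinks the KV cache. The head counts below are illustrative, not any specific model's configuration:

```python
import torch

n_q_heads, n_kv_heads, head_dim, seq = 32, 8, 128, 1024
q = torch.randn(n_q_heads, seq, head_dim)
k = torch.randn(n_kv_heads, seq, head_dim)   # only 8 K/V heads are ever cached
v = torch.randn(n_kv_heads, seq, head_dim)

# Each group of 32 / 8 = 4 query heads reuses the same K/V head
group = n_q_heads // n_kv_heads
k_expanded = k.repeat_interleave(group, dim=0)   # expanded only at compute time
v_expanded = v.repeat_interleave(group, dim=0)

scores = q @ k_expanded.transpose(-1, -2) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v_expanded
# The KV cache stores 8 heads instead of 32: a 4x reduction in K/V memory
```

MQA is the extreme case with a single shared K/V head.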

KV Cache

  • Stores past K and V tensors during autoregressive generation
  • Avoids recomputing K/V for every new token — critical for low latency
  • Memory grows linearly with seq length × batch size × layers × KV heads × head dim (×2 for K and V)
  • Eviction strategies (sliding window, H2O) are needed for very long sequences
Mention KV cache when discussing inference cost and latency
Don't forget KV cache eviction strategies for long contexts
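A toy single-head decode loop showing why the cache helps: K and V for past tokens are computed once and appended, so each new step only projects the newest token. The projection matrices and shapes are placeholders, not a real model:

```python
import torch

d_model, head_dim = 512, 64
Wq = torch.randn(d_model, head_dim)
Wk = torch.randn(d_model, head_dim)
Wv = torch.randn(d_model, head_dim)

k_cache, v_cache = [], []            # grows by one entry per generated token

def decode_step(x_new):
    """x_new: (1, d_model) hidden state of the newest token only."""
    q = x_new @ Wq                               # recomputed every step
    k_cache.append(x_new @ Wk)                   # computed once, then reused
    v_cache.append(x_new @ Wv)
    K = torch.cat(k_cache, dim=0)                # (t, head_dim) after t tokens
    V = torch.cat(v_cache, dim=0)
    attn = torch.softmax(q @ K.T / head_dim ** 0.5, dim=-1)
    return attn @ V                              # (1, head_dim)

for _ in range(5):                               # generate 5 tokens
    out = decode_step(torch.randn(1, d_model))
```

The cache lists above are exactly the memory that grows with sequence length, batch size, and layer count.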

Flash Attention

  • Reorders attention computation to minimize HBM (GPU DRAM) reads/writes
  • Uses tiling and online softmax to keep intermediates in fast SRAM
  • Same outputs as standard attention, but significantly faster and memory-efficient
  • Flash Attention 2/3 further improves parallelism across the sequence dimension
Bring up Flash Attention as the go-to implementation for production LLM training and inference
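The enabling trick is the online (streaming) softmax: attention can be accumulated over one tile of keys at a time while carrying a running max and running denominator, so the full n×n score matrix is never materialized. A scalar sketch of that recurrence (not the real kernel, which also tiles V and keeps the working set in SRAM):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values, tile=4):
    """Compute softmax(scores) @ values one tile at a time, keeping only a
    running max (m), running denominator (l), and running output."""
    m = -np.inf
    l = 0.0
    out = np.zeros_like(values[0], dtype=float)
    for i in range(0, len(scores), tile):
        s, v = scores[i:i + tile], values[i:i + tile]
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)      # rescale previous partial results
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        out = out * correction + p @ v
        m = m_new
    return out / l

scores = np.random.randn(16)
values = np.random.randn(16, 8)
reference = (np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()) @ values
assert np.allclose(online_softmax_weighted_sum(scores, values), reference)
```

Because the result is mathematically identical to standard attention, Flash Attention changes speed and memory use but not model outputs.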
03

Context windows

Context Window

  • Maximum number of tokens the model can attend over; prompt and generated output share this budget
  • Limited by quadratic attention cost and KV cache memory
  • Modern models: 8K–200K+ tokens; extended via RoPE scaling or sliding window attention
  • Lost-in-the-middle effect: retrieval accuracy drops for content placed in the middle of long contexts
Discuss context length tradeoffs: cost, latency, and the lost-in-the-middle pitfall
Don't assume a bigger context always beats RAG; long contexts are slower and costlier

Context Length Extension

  • RoPE scaling (linear, NTK, YaRN) lets models generalize beyond training context length
  • Sliding window / sparse attention restricts each token to a local window
  • Longformer and BigBird combine local + global attention for O(n) complexity
  • Fine-tuning on longer sequences is still often needed for stable performance
Name at least one concrete technique (YaRN, sliding window) with its tradeoff
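The simplest extension trick, linear position interpolation, just compresses positions so a longer sequence maps back into the trained range; a short sketch assuming a RoPE-style angle computation (lengths are illustrative):

```python
import numpy as np

train_len, target_len = 4096, 16384
scale = train_len / target_len        # 0.25: squeeze positions into the trained range

def rope_angles(position, d=128, base=10000.0, position_scale=1.0):
    theta = base ** (-np.arange(0, d, 2) / d)
    return (position * position_scale) * theta

# Position 10000 is outside the trained range, but scaled it behaves like position 2500
angles_extended = rope_angles(10000, position_scale=scale)
```

NTK-aware and YaRN scaling instead adjust the frequencies non-uniformly, trading extra complexity for better retention of short-range resolution; all of these usually still benefit from some long-sequence fine-tuning.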

Prompt and Completion Budget

  • Input (prompt) tokens and output (completion) tokens share the context window
  • Output tokens are generated autoregressively — each costs one forward pass
  • Price is usually higher per output token than per input token
  • Prefill (prompt processing) runs in parallel over all prompt tokens; decode (generation) is sequential, one token per step
Distinguish prefill vs decode phases when asked about inference cost
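A back-of-the-envelope budget and cost check; the context window and prices below are made-up placeholders, so substitute your provider's real numbers:

```python
# Hypothetical pricing and limits; replace with real values for your model
CONTEXT_WINDOW = 128_000           # total tokens shared by prompt + completion
PRICE_IN_PER_1K = 0.0005           # $ per 1K prompt (prefill) tokens
PRICE_OUT_PER_1K = 0.0015          # $ per 1K completion (decode) tokens

prompt_tokens, max_output_tokens = 6_000, 1_000
assert prompt_tokens + max_output_tokens <= CONTEXT_WINDOW, "request won't fit"

cost = (prompt_tokens / 1000) * PRICE_IN_PER_1K + (max_output_tokens / 1000) * PRICE_OUT_PER_1K
# Decode dominates latency: roughly one sequential forward pass per output token,
# while the whole prompt is processed in parallel during prefill.
print(f"worst-case cost per request: ${cost:.4f}")
```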
04

Pretraining

Next-Token Prediction (CLM)

  • Standard pretraining objective for decoder-only LLMs
  • Minimizes cross-entropy loss over all tokens in the sequence
  • Trains on web-scale corpora (Common Crawl, books, code, etc.)
  • Scaling laws (Chinchilla) suggest compute-optimal: ~20 training tokens per parameter
Cite Chinchilla scaling law when asked about compute-optimal training
Don't conflate pretraining (CLM) with masked language modeling (BERT's MLM)
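The objective in code: shift the sequence by one so each position predicts the next token, then average cross-entropy over all positions. Vocabulary size and the logits tensor are stand-ins for a real model's output:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 32_000, 16
token_ids = torch.randint(0, vocab_size, (1, seq_len))    # one training sequence
logits = torch.randn(1, seq_len, vocab_size)              # model output (stand-in)

# Position t's logits predict token t+1: drop the last logit and the first label
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = token_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(pred, target)   # averaged over all next-token predictions
```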

Data Quality and Curation

  • Deduplication (MinHash, exact n-gram) prevents memorization and improves generalization
  • Quality filters: language ID, perplexity filtering, classifier-based heuristics
  • Data mixture (code, math, multilingual) strongly affects downstream capability
  • Repeating data more than ~4× can degrade performance
Emphasize that data quality beats raw quantity for efficient training

Scaling Laws

  • Loss decreases as a power law of model size (N) and training tokens (D)
  • Chinchilla: optimal N and D scale together proportionally for a fixed compute budget
  • Emergent abilities appear unpredictably at certain scales
  • Inference costs scale with N, not D — so smaller models trained longer are often preferred
Mention that current frontier models are often trained well past the Chinchilla-optimal token count (overtrained) to save inference cost
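A rough compute-optimal calculation using the common rules of thumb C ≈ 6·N·D training FLOPs and D ≈ 20·N tokens; both are approximations, not exact constants:

```python
# Chinchilla-style rules of thumb: training FLOPs ~ 6 * params * tokens,
# and compute-optimal token count ~ 20 * params.
params = 7e9                          # a 7B-parameter model
optimal_tokens = 20 * params          # ~140B tokens for compute-optimal training
train_flops = 6 * params * optimal_tokens

print(f"compute-optimal tokens: {optimal_tokens:.2e}")   # ~1.4e11
print(f"approx training FLOPs:  {train_flops:.2e}")      # ~5.9e21
```

Training on far more than 20 tokens per parameter is suboptimal for training compute but buys a smaller, cheaper-to-serve model.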
05

Tokenization

BPE Tokenization

  • Iteratively merges the most frequent adjacent byte pairs to build vocabulary
  • Vocabulary size typically 32K–100K tokens
  • Unknown words decompose into subwords, never OOV
  • Average ~1.3 tokens per English word; code and non-Latin scripts can be much higher
Explain BPE when asked why tokens are subwords
Don't assume 1 word = 1 token; this leads to wrong cost and latency estimates
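A toy BPE training loop over word frequencies, simplified with no byte-level handling or special tokens, showing how the vocabulary grows by repeatedly merging the most frequent adjacent pair:

```python
from collections import Counter

# Toy corpus: word (as a tuple of symbols) -> frequency
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_frequent_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])   # merge into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

merges = []
for _ in range(5):                        # learn 5 merge rules
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)
# merges now holds learned rules, e.g. ('e', 's'), ('es', 't'), ...
```

At inference the learned merges are applied in order, so any unseen word decomposes into known subwords rather than becoming OOV.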

Tokenization Artifacts

  • Leading-space sensitivity: ' apple' and 'apple' may map to different tokens
  • Numbers are split unpredictably — '1234' might be 2–4 tokens
  • Rare languages and code with uncommon symbols get long token sequences
  • Tokenizer mismatch between training and fine-tuning can degrade performance
Mention token-efficiency concerns when designing prompts for non-English or structured content

Special Tokens

  • <|endoftext|>, <|im_start|>, <s>, </s> signal turn boundaries or sequence ends
  • Chat models use structured templates (ChatML, Llama-3 format) with special tokens
  • Mismatched chat templates between training and inference cause subtle degradation
Always apply the model's documented chat template — not just raw prompt concatenation
Don't ignore special tokens when switching between model providers
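With Hugging Face transformers, the safe path is the tokenizer's own chat template rather than hand-concatenating strings. The checkpoint name below is only an example; any chat-tuned model that ships a template works the same way:

```python
from transformers import AutoTokenizer

# Example checkpoint; substitute whatever chat model you actually serve
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain KV caching in one sentence."},
]

# Renders the model's documented template, including its special tokens,
# and appends the assistant header so generation starts in the right place.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```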
06

Inference

Autoregressive Decoding

  • Tokens are generated one at a time, each conditioned on all previous tokens
  • Greedy decoding: pick the highest-probability token at each step
  • Sampling: temperature, top-k, top-p (nucleus) control randomness
  • Beam search: keeps top-B candidates; improves quality but increases latency
Explain that temperature > 1 increases diversity and temperature < 1 makes output more deterministic
Don't recommend beam search for latency-sensitive production systems without caveats
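A sketch of temperature plus top-p (nucleus) sampling over one decoding step's logits; the vocabulary size and thresholds are illustrative:

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.9, rng=np.random.default_rng()):
    """Temperature rescales the logits; top-p keeps the smallest set of tokens
    whose cumulative probability reaches p, then samples from that set."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # tokens from most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # size of the nucleus
    keep = order[:cutoff]
    nucleus = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=nucleus)

logits = np.random.randn(32_000)       # one step's vocabulary logits (stand-in)
token_id = sample_next_token(logits)
```

Greedy decoding is the degenerate case: take `np.argmax(logits)` and skip the sampling entirely.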

Speculative Decoding

  • A small draft model generates k tokens; the target model verifies all in one forward pass
  • Accepted tokens are kept; first rejection truncates the batch
  • Can achieve 2–4× speedup with no change in output distribution
  • Draft model quality and alignment with the target model determine the acceptance rate
Pitch speculative decoding when asked how to reduce decode latency without quality loss
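A simplified sketch of the accept/reject loop using greedy verification; the full algorithm uses rejection sampling so the target model's sampling distribution is preserved exactly. `draft_model` and `target_model` are placeholder callables, not real APIs:

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    """draft_model(tokens) -> next token id (cheap, called k times);
    target_model(tokens) -> greedy next-token id for every position,
    obtained in a single batched forward pass."""
    # 1. Draft model proposes k tokens autoregressively (cheap, sequential)
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Target model scores prefix + draft once; its prediction at every
    #    drafted position comes out of that single pass
    target_preds = target_model(list(prefix) + draft)

    # 3. Accept draft tokens while they match the target's own choice;
    #    at the first mismatch, take the target's token and stop
    accepted = []
    for i, proposed in enumerate(draft):
        target_tok = target_preds[len(prefix) + i - 1]   # prediction for this position
        if proposed == target_tok:
            accepted.append(proposed)
        else:
            accepted.append(target_tok)
            break
    return accepted
```

Every accepted token is one target-model decode step saved, which is where the 2–4× speedup comes from.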

Quantization

  • Reduces model weight precision: FP16 → INT8 or INT4
  • Post-training quantization (PTQ): GPTQ, AWQ — no retraining needed
  • QAT (quantization-aware training) recovers accuracy lost by PTQ
  • INT4 roughly halves memory vs INT8; quality drop is usually small for 7B+ models
Recommend INT4/INT8 quantization as first step when GPU memory is constrained
Don't apply aggressive quantization (below INT4) without running evals; quality can collapse
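A minimal symmetric per-tensor INT8 quantize/dequantize round trip showing where the error comes from; real PTQ methods like GPTQ and AWQ quantize per-channel or per-group and calibrate on data:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(weights)

memory_saving = weights.nbytes / q.nbytes      # 4x vs FP32, 2x vs FP16
error = np.abs(weights - dequantize(q, scale)).mean()
print(f"memory reduction vs FP32: {memory_saving:.0f}x, mean abs error: {error:.5f}")
```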

Batching Strategies

  • Static batching: pad all sequences to max length — wastes GPU for variable-length requests
  • Continuous batching (iteration-level): finished sequences leave the batch and waiting requests join mid-batch, keeping GPUs full
  • PagedAttention (vLLM) manages KV cache as virtual memory pages to reduce fragmentation
  • Throughput vs latency tradeoff: larger batches improve GPU utilization but increase time-to-first-token
Name continuous batching as the standard for production LLM serving
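A toy scheduler loop showing the idea behind continuous batching: finished sequences free their slots immediately and waiting requests are admitted each iteration. `generate_one_token` is a stub standing in for a real batched model forward pass:

```python
import random
from collections import deque

def generate_one_token(seq):
    """Stub for one decode step of a real model; ends a sequence with some
    probability so requests finish at different times."""
    return "<eos>" if random.random() < 0.1 else "tok"

MAX_BATCH = 4
waiting = deque(f"request-{i}" for i in range(10))
active = {}                                   # request id -> tokens generated so far
finished = {}

while waiting or active:
    # Admit new requests into any free slots (iteration-level scheduling)
    while waiting and len(active) < MAX_BATCH:
        active[waiting.popleft()] = []

    # One decode iteration for every active sequence (conceptually one batched pass)
    for req in list(active):
        token = generate_one_token(active[req])
        if token == "<eos>" or len(active[req]) >= 64:   # done or hit max_tokens
            finished[req] = active.pop(req)              # slot frees up immediately
        else:
            active[req].append(token)

print(f"served {len(finished)} requests")
```

With static batching, every request in the batch would instead wait for the slowest one to finish before new work could start.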