Cheatsheet
LLM Fundamentals
Core concepts behind modern large language models, from transformer architecture to inference. Interviewers test whether you can explain what happens inside the model at a mechanistic level.
01
Transformers
Transformer Architecture
- Encoder-decoder or decoder-only; GPT-style LLMs are decoder-only
- Each layer: multi-head self-attention → feed-forward network (MLP), with each sublayer wrapped in a residual connection + LayerNorm
- Depth (layers) and width (hidden dim) jointly determine parameter count
- Residual connections allow gradients to flow through many layers without vanishing
✓
Sketch the layer stack when asked to explain transformers; mention residuals and LayerNorm
✗
Conflate encoder-only (BERT) with decoder-only (GPT) architectures
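A minimal sketch of one Pre-LN decoder layer in PyTorch, assuming illustrative sizes (d_model=512, 8 heads) and PyTorch's built-in nn.MultiheadAttention rather than any specific model's implementation:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One Pre-LN decoder block: norm -> self-attention -> residual, norm -> MLP -> residual."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # 4x expansion
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        seq_len = x.size(1)
        # Boolean causal mask: True positions may NOT be attended to (future tokens)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out                # residual around attention
        x = x + self.mlp(self.ln2(x))   # residual around the MLP
        return x

x = torch.randn(2, 16, 512)             # (batch, seq, hidden)
print(DecoderLayer()(x).shape)           # torch.Size([2, 16, 512])
```

Stacking N of these layers plus an embedding table and an output projection gives the full decoder-only stack.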
Positional Encoding
- Attention has no built-in sense of order; positional signals must be injected
- Sinusoidal (absolute) encoding used in original paper; learned embeddings common in GPT
- RoPE (Rotary Position Embedding) encodes relative positions and extrapolates to longer contexts
- ALiBi biases attention scores by distance instead of modifying embeddings
✓
Mention RoPE when discussing long-context or context-length extension
✗
Say positional encoding is optional — without it tokens are order-agnostic
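A small NumPy sketch of the original paper's sinusoidal (absolute) encoding; dimensions are illustrative, and RoPE or ALiBi would replace this with rotations or score biases rather than added embeddings:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Absolute sinusoidal encodings from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                 # (seq, 1)
    i = np.arange(0, d_model, 2)[None, :]             # even feature indices
    angles = pos / np.power(10000.0, i / d_model)     # (seq, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe   # added to token embeddings before the first layer

print(sinusoidal_positions(8, 16).shape)   # (8, 16)
```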
Feed-Forward Network (FFN)
- Two-layer MLP applied per position independently after attention
- Hidden dim is typically 4× the model dim (e.g., 4096 → 16384)
- Mixture-of-Experts (MoE) replaces the dense FFN with sparse expert routing
- FFN parameters dominate total model size in large transformers
✓
Explain MoE as a way to scale parameters without scaling compute proportionally
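A quick back-of-envelope in plain Python (illustrative d=4096 with the standard 4× expansion, biases and embeddings ignored) showing why FFN parameters dominate each layer:

```python
d = 4096                          # model (hidden) dim, illustrative
ffn_hidden = 4 * d                # 4x expansion

attn_params = 4 * d * d           # Wq, Wk, Wv, Wo projections
ffn_params = 2 * d * ffn_hidden   # up- and down-projection matrices

print(f"attention per layer: {attn_params / 1e6:.0f}M")   # ~67M
print(f"FFN per layer:       {ffn_params / 1e6:.0f}M")    # ~134M, about 2/3 of the layer
```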
Layer Normalization
- Normalizes activations across the feature dimension (not batch)
- Pre-LN (normalize before each sublayer) trains more stably than Post-LN (the original paper's placement)
- RMSNorm drops the mean-centering term — used in Llama and Mistral for efficiency
✓
Note that most modern open models use Pre-LN with RMSNorm
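A minimal RMSNorm sketch in PyTorch, following the Llama-style formulation as commonly described (learned scale, no mean-centering, no bias):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """LayerNorm without mean-centering or bias: x / rms(x) * learned scale."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight

x = torch.randn(2, 8, 64)
print(RMSNorm(64)(x).shape)   # torch.Size([2, 8, 64])
```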
02
Attention
Scaled Dot-Product Attention
- Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
- Scaling by √d_k keeps dot-product magnitudes in check so the softmax doesn't saturate (saturation would make gradients vanish)
- O(n²) complexity in sequence length — the main bottleneck for long contexts
- Causal masking in decoder prevents attending to future tokens
✓
Write out the formula; interviewers often ask you to derive it
✗
Forget the square-root scaling — a common interview slip
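The formula transcribed directly into NumPy with a causal mask; shapes are illustrative and this is a single head only:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=True):
    """softmax(Q K^T / sqrt(d_k)) V with an optional causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)     # (seq, seq)
    if causal:
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)       # block future positions
    return softmax(scores) @ V

Q = K = V = np.random.randn(6, 32)    # (seq_len, d_k)
print(attention(Q, K, V).shape)       # (6, 32)
```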
Multi-Head Attention (MHA)
- Runs H attention heads in parallel, each with its own Q/K/V projections
- Outputs are concatenated then projected back to model dim
- Multiple heads let the model attend to different positions for different reasons
- Grouped Query Attention (GQA) and Multi-Query Attention (MQA) share K/V heads to cut KV cache size
✓
Mention GQA when discussing inference efficiency; it's used in Llama 2/3 and Mistral
✗
Say more heads always means better — beyond a point, gains diminish
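A shape-level sketch of GQA in PyTorch: the cache stores only the smaller set of K/V heads, which are repeated across their query group at compute time. Head counts are illustrative, not any particular model's:

```python
import torch

batch, seq, head_dim = 1, 16, 128
n_q_heads, n_kv_heads = 32, 8                  # 4 query heads share each K/V head (GQA)

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # KV cache stores only 8 heads, not 32
v = torch.randn(batch, n_kv_heads, seq, head_dim)

group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)          # expand to 32 heads just for the matmul
v = v.repeat_interleave(group, dim=1)

scores = (q @ k.transpose(-1, -2)) / head_dim ** 0.5   # causal mask omitted in this shape demo
out = scores.softmax(dim=-1) @ v
print(out.shape)   # torch.Size([1, 32, 16, 128])
```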
KV Cache
- Stores past K and V tensors during autoregressive generation
- Avoids recomputing K/V for every new token — critical for low latency
- Memory grows linearly with seq length × batch size × layers × KV heads × head dim (×2 for K and V)
- Eviction strategies (sliding window, H2O) are needed for very long sequences
✓
Mention KV cache when discussing inference cost and latency
✗
Forget KV cache eviction strategies for long contexts
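Back-of-envelope KV cache sizing in plain Python. The shapes below are roughly Llama-2-7B-like (32 layers, 32 KV heads, head_dim 128, FP16) and are meant as an illustration:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """2x for K and V; FP16 (2 bytes/element) by default."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

gb = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=1) / 2**30
print(f"{gb:.1f} GiB per 4K-token sequence")   # ~2.0 GiB
```

This is why GQA/MQA (fewer KV heads) and eviction strategies matter so much at long context and high batch sizes.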
Flash Attention
- Reorders attention computation to minimize HBM (GPU DRAM) reads/writes
- Uses tiling and online softmax to keep intermediates in fast SRAM
- Same outputs as standard attention, but significantly faster and memory-efficient
- Flash Attention 2/3 further improves parallelism across the sequence dimension
✓
Bring up Flash Attention as the go-to implementation for production LLM training and inference
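A NumPy sketch of the online-softmax idea for a single query row: stream over K/V tiles, keep a running max and normalizer, and never materialize the full score row. This shows the conceptual trick only, not the fused, tiled CUDA kernel:

```python
import numpy as np

def online_attention_row(q, K, V, tile=16):
    """Attention output for one query, streaming over K/V tiles (online softmax)."""
    d_k = q.shape[-1]
    run_max = -np.inf                  # running max of scores
    denom = 0.0                        # running softmax normalizer
    acc = np.zeros(V.shape[-1])
    for start in range(0, K.shape[0], tile):
        s = q @ K[start:start + tile].T / np.sqrt(d_k)   # scores for this tile only
        new_max = max(run_max, s.max())
        rescale = np.exp(run_max - new_max)              # rescale previous partial results
        p = np.exp(s - new_max)
        denom = denom * rescale + p.sum()
        acc = acc * rescale + p @ V[start:start + tile]
        run_max = new_max
    return acc / denom

q = np.random.randn(64); K = np.random.randn(128, 64); V = np.random.randn(128, 64)
scores = q @ K.T / np.sqrt(64)
p = np.exp(scores - scores.max()); p /= p.sum()
print(np.allclose(online_attention_row(q, K, V), p @ V))   # True: same result as full softmax
```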
03
Context windows
Context Window
- Maximum number of tokens the model can attend to at once (prompt plus generated output combined)
- Limited by quadratic attention cost and KV cache memory
- Modern models: 8K–200K+ tokens; extended via RoPE scaling or sliding window attention
- Lost-in-the-middle effect: retrieval accuracy drops for content placed in the middle of long contexts
✓
Discuss context length tradeoffs: cost, latency, and the lost-in-the-middle pitfall
✗
Assume bigger context always beats RAG — long contexts are slower and costlier
Context Length Extension
- RoPE scaling (linear, NTK, YaRN) lets models generalize beyond training context length
- Sliding window / sparse attention restricts each token to a local window
- Longformer and BigBird combine local + global attention for O(n) complexity
- Fine-tuning on longer sequences is still often needed for stable performance
✓
Name at least one concrete technique (YaRN, sliding window) with its tradeoff
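A tiny sketch of linear position interpolation for RoPE: divide positions by a scale factor so a longer sequence maps into the angle range seen during training. This is the simplified linear variant; NTK-aware scaling and YaRN rescale the frequencies in more nuanced ways:

```python
import numpy as np

def rope_angles(positions, head_dim, base=10000.0, scale=1.0):
    """Rotary angles per position; scale > 1 compresses positions (linear interpolation)."""
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions / scale, inv_freq)   # (seq, head_dim/2)

train_ctx, new_ctx = 4096, 16384
scale = new_ctx / train_ctx                        # 4x extension -> divide positions by 4
angles = rope_angles(np.arange(new_ctx), head_dim=128, scale=scale)
print(angles.shape)   # (16384, 64); effective positions stay in ~[0, 4096)
```

The tradeoff: compressing positions blurs fine-grained relative distances, which is why a short long-context fine-tune usually follows.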
Prompt and Completion Budget
- Input (prompt) tokens and output (completion) tokens share the context window
- Output tokens are generated autoregressively — each costs one forward pass
- Price is usually higher per output token than per input token
- Prefill (prompt processing) runs all prompt tokens in parallel; decode (generation) produces one token per sequential step
✓
Distinguish prefill vs decode phases when asked about inference cost
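A toy cost-and-latency estimate in plain Python; the per-million-token prices are hypothetical placeholders, not any provider's actual pricing:

```python
# Hypothetical prices per 1M tokens -- real provider pricing varies
price_in, price_out = 3.00, 15.00

prompt_tokens, completion_tokens = 6_000, 500
cost = prompt_tokens / 1e6 * price_in + completion_tokens / 1e6 * price_out
print(f"${cost:.4f} per request")   # $0.0255

# Latency intuition: prefill handles all 6,000 prompt tokens in parallel,
# while the 500 completion tokens each require a sequential decode step.
```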
04
Pretraining
Next-Token Prediction (CLM)
- Standard pretraining objective for decoder-only LLMs
- Minimizes cross-entropy loss over all tokens in the sequence
- Trains on web-scale corpora (Common Crawl, books, code, etc.)
- Scaling laws (Chinchilla) suggest compute-optimal: ~20 training tokens per parameter
✓
Cite Chinchilla scaling law when asked about compute-optimal training
✗
Conflate pretraining (CLM) with masked language modeling (BERT's MLM)
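A PyTorch sketch of the next-token objective: shift logits and labels by one position and take cross-entropy. Shapes and vocabulary size are illustrative:

```python
import torch
import torch.nn.functional as F

vocab, batch, seq = 32_000, 2, 16
logits = torch.randn(batch, seq, vocab)           # model output at each position
tokens = torch.randint(0, vocab, (batch, seq))    # the training sequence itself

# Position t predicts token t+1: drop the last logit and the first label
shift_logits = logits[:, :-1, :].reshape(-1, vocab)
shift_labels = tokens[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)
print(loss.item())   # ~ln(32000) ≈ 10.4 for random logits
```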
Data Quality and Curation
- Deduplication (MinHash, exact n-gram) prevents memorization and improves generalization
- Quality filters: language ID, perplexity filtering, classifier-based heuristics
- Data mixture (code, math, multilingual) strongly affects downstream capability
- Repeating data beyond ~4 epochs yields rapidly diminishing returns compared with fresh data
✓
Emphasize that data quality beats raw quantity for efficient training
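A toy exact-duplicate filter (hash of normalized text) to show the idea; real pipelines use MinHash/LSH for near-duplicates and n-gram overlap checks against evaluation sets:

```python
import hashlib

def dedup_exact(docs):
    """Drop exact duplicates by hashing normalized text; MinHash/LSH handles near-duplicates."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

docs = ["The cat sat.", "the  cat sat.", "A different document."]
print(len(dedup_exact(docs)))   # 2
```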
Scaling Laws
- Loss decreases as a power law of model size (N) and training tokens (D)
- Chinchilla: as compute grows, optimal N and D should grow in roughly equal proportion
- Emergent abilities appear unpredictably at certain scales
- Inference costs scale with N, not D — so smaller models trained longer are often preferred
✓
Mention that current production models are typically trained far past Chinchilla-optimal token counts (over-trained) to reduce inference cost
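Back-of-envelope Chinchilla math in plain Python, using the ~20 tokens/parameter rule and the standard C ≈ 6·N·D FLOPs approximation:

```python
def chinchilla_optimal_tokens(n_params):
    return 20 * n_params                      # ~20 training tokens per parameter

def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens            # standard C ≈ 6ND approximation

N = 7e9                                       # a 7B-parameter model
D_opt = chinchilla_optimal_tokens(N)          # ~140B tokens
print(f"Chinchilla-optimal tokens: {D_opt:.2e}")
print(f"Compute at optimum:       {training_flops(N, D_opt):.2e} FLOPs")
# Many production models train on far more tokens than this to cut inference cost.
```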
05
Tokenization
BPE Tokenization
- Iteratively merges the most frequent adjacent byte pairs to build vocabulary
- Vocabulary size typically 32K–100K tokens
- Unknown words decompose into subwords, never OOV
- Average ~1.3 tokens per English word; code and non-Latin scripts can be much higher
✓
Explain BPE when asked why tokens are subwords
✗
Assume 1 word = 1 token — this leads to wrong cost and latency estimates
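A toy BPE training loop: treat each word as a tuple of symbols and repeatedly merge the most frequent adjacent pair. Real tokenizers work on bytes with pre-tokenization rules, but the core loop is the same idea:

```python
from collections import Counter

def bpe_train(words, num_merges=10):
    """Toy BPE: start from characters, repeatedly merge the most frequent adjacent pair."""
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

print(bpe_train(["lower", "lowest", "low", "low"], num_merges=3))
# e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```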
Tokenization Artifacts
- Leading-space sensitivity: ' apple' and 'apple' may map to different tokens
- Numbers are split unpredictably — '1234' might be 2–4 tokens
- Rare languages and code with uncommon symbols get long token sequences
- Tokenizer mismatch between training and fine-tuning can degrade performance
✓
Mention token-efficiency concerns when designing prompts for non-English or structured content
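A quick way to see these artifacts, assuming the tiktoken package is installed; exact token counts depend on the tokenizer, so treat the printed numbers as illustrative:

```python
import tiktoken   # pip install tiktoken; assumes this package is available

enc = tiktoken.get_encoding("cl100k_base")
for text in ["apple", " apple", "1234", "12345678", "naïveté"]:
    ids = enc.encode(text)
    print(f"{text!r:14} -> {len(ids)} token(s): {ids}")
# Leading spaces, digit runs, and accented characters all change token counts.
```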
Special Tokens
- <|endoftext|>, <|im_start|>, <s>, </s> signal turn boundaries or sequence ends
- Chat models use structured templates (ChatML, Llama-3 format) with special tokens
- Mismatched chat templates between training and inference cause subtle degradation
✓
Always apply the model's documented chat template — not just raw prompt concatenation
✗
Ignore special tokens when switching between model providers
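A sketch using Hugging Face transformers' apply_chat_template; the checkpoint name is a placeholder for whichever chat model you actually serve (and may require access approval):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint -- substitute the chat model you are actually serving
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain KV caching in one sentence."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)   # includes the model's own special tokens rather than raw concatenation
```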
06
Inference
Autoregressive Decoding
- Tokens are generated one at a time, each conditioned on all previous tokens
- Greedy decoding: pick the highest-probability token at each step
- Sampling: temperature, top-k, top-p (nucleus) control randomness
- Beam search: keeps top-B candidates; improves quality but increases latency
✓
Explain that temperature > 1 increases diversity while < 1 makes output more deterministic
✗
Recommend beam search for latency-sensitive production systems without caveats
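A NumPy sketch of temperature plus top-k / top-p truncation applied to one logit vector; the logits, thresholds, and seed are illustrative:

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, top_p=None, seed=0):
    """Apply temperature, then optional top-k / top-p truncation, then sample one token id."""
    rng = np.random.default_rng(seed)                  # fixed seed just for reproducibility here
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_k is not None:                              # keep only the k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:                              # smallest set with cumulative prob >= p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs *= mask
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5, -1.0]
print(sample(logits, temperature=0.7, top_p=0.9))      # token 0 or 1; lower temperature favors 0
```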
Speculative Decoding
- A small draft model generates k tokens; the target model verifies all in one forward pass
- Accepted tokens are kept; first rejection truncates the batch
- Can achieve 2–4× speedup with no change in output distribution
- Draft-model quality and alignment to the target model determine the acceptance rate
✓
Pitch speculative decoding when asked how to reduce decode latency without quality loss
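A simplified sketch using greedy verification (accept drafted tokens while they match the target's argmax); the production algorithm uses rejection sampling so the output distribution matches the target model exactly. draft_model and target_model are placeholder callables:

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    """One speculative-decoding step with greedy verification (simplified sketch).

    draft_model(tokens) -> one next token (cheap, called k times);
    target_model(tokens) -> the predicted next token after every position,
    obtained in a single forward pass. Both are placeholders.
    """
    # 1. Draft k tokens autoregressively with the small model
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        nxt = draft_model(ctx)
        drafted.append(nxt)
        ctx.append(nxt)

    # 2. Score the whole drafted block with one target-model pass
    preds = target_model(list(prefix) + drafted)   # preds[i] = token predicted after position i

    # 3. Keep the longest prefix of drafts the target agrees with,
    #    then append the target's own next token (the "bonus" token)
    accepted = []
    for i, tok in enumerate(drafted):
        if tok == preds[len(prefix) + i - 1]:
            accepted.append(tok)
        else:
            break
    accepted.append(preds[len(prefix) + len(accepted) - 1])
    return accepted   # 1 to k+1 new tokens per target forward pass

# Toy demo: the "true" continuation is the alphabet; the draft slips up on its 3rd token
alphabet = "abcdefghij"
target = lambda toks: [alphabet[i + 1] for i in range(len(toks))]   # always predicts correctly
draft = lambda ctx: "x" if len(ctx) == 5 else alphabet[len(ctx)]    # wrong once
print(speculative_step(list("abc"), draft, target, k=4))            # ['d', 'e', 'f']
```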
Quantization
- Reduces model weight precision: FP16 → INT8 or INT4
- Post-training quantization (PTQ): GPTQ, AWQ — no retraining needed
- QAT (quantization-aware training) recovers accuracy lost by PTQ
- INT4 roughly halves memory vs INT8; quality drop is usually small for 7B+ models
✓
Recommend INT4/INT8 quantization as first step when GPU memory is constrained
✗
Apply aggressive quantization (< INT4) without running evals — quality can collapse
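A minimal symmetric per-tensor INT8 round trip in NumPy to show the mechanics; GPTQ and AWQ are per-group, calibration-based, and considerably more sophisticated:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: int8 weights plus one FP scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)   # one FP32 weight matrix, ~64 MB
q, scale = quantize_int8(w)                           # int8 storage, ~16 MB
err = np.abs(dequantize(q, scale) - w).mean()
print(f"mean abs round-trip error: {err:.4f}")        # small relative to weight scale ~1.0
```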
Batching Strategies
- Static batching: pad all sequences to max length — wastes GPU for variable-length requests
- Continuous batching (iteration-level): finished sequences leave the batch and queued requests join mid-stream, keeping the GPU full
- PagedAttention (vLLM) manages KV cache as virtual memory pages to reduce fragmentation
- Throughput vs latency tradeoff: larger batches improve GPU utilization but increase time-to-first-token
✓
Name continuous batching as the standard for production LLM serving
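A toy scheduler loop (no real model, just bookkeeping) showing iteration-level batching: finished requests free their slot each decode step and queued requests join immediately:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy decode loop: `requests` lists the remaining tokens to generate per request."""
    queue = deque(enumerate(requests))
    running, steps = {}, 0
    while queue or running:
        while queue and len(running) < max_batch:   # admit new requests mid-stream
            rid, remaining = queue.popleft()
            running[rid] = remaining
        for rid in list(running):                   # one decode step for the whole batch
            running[rid] -= 1
            if running[rid] == 0:                   # finished requests free their slot immediately
                del running[rid]
        steps += 1
    return steps

# One long and four short requests: short ones don't wait for the long one to finish
print(continuous_batching([100, 5, 5, 5, 5]))   # 100 decode steps total
```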