AI Infrastructure
Hardware, orchestration, and cost strategies for running AI workloads at scale. Interviewers test your ability to design systems that remain performant and cost-efficient under load.
01
GPU scheduling
GPU Resource Management
- LLM decode is memory-bandwidth bound, not compute bound (prefill is compute bound) — maximize GPU memory utilization
- Pack multiple LoRA adapters on a single GPU using shared base model weights
- MIG (Multi-Instance GPU, on A100/H100) spatially partitions one GPU into isolated instances for multiple smaller workloads; time-slicing instead shares a whole GPU temporally, without memory isolation
- Scheduler should co-locate prefill-heavy (compute-bound) and decode-heavy (bandwidth-bound) requests to balance compute and memory bandwidth
✓
Know the difference between compute-bound (training) and memory-bandwidth-bound (inference) regimes
Model Parallelism Strategies
- Tensor parallelism (TP): split a single layer's weight matrix across multiple GPUs
- Pipeline parallelism (PP): split the model layer-by-layer across GPUs; introduces pipeline bubbles
- Data parallelism (DP): replicate the model, split the batch — standard for training
- TP reduces per-layer latency; PP reduces per-GPU memory; combine for large model serving
✓
Describe TP + PP as the standard combination for serving 70B+ models across multiple GPUs
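A minimal sketch of column-parallel TP, using a toy list-based matmul instead of real GPUs (all names here are hypothetical, for illustration only): each "GPU" holds a column shard of the weight matrix, computes its slice of the layer output, and the slices are concatenated (the all-gather step).

```python
def matmul(x, w):
    """Naive matmul: x is (m x k), w is (k x n), both lists of lists."""
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def column_shard(w, num_gpus):
    """Column-parallel TP: split the weight matrix's output columns
    across GPUs; each GPU computes only its slice of the layer output."""
    per = len(w[0]) // num_gpus
    return [[row[g * per:(g + 1) * per] for row in w] for g in range(num_gpus)]

x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
partials = [matmul(x, shard) for shard in column_shard(w, num_gpus=2)]
combined = [sum((p[0] for p in partials), [])]  # "all-gather" the slices
assert combined == matmul(x, w)                 # matches the single-GPU result
```

The same shape-splitting idea is what lets TP cut per-layer latency: each device does 1/N of the matmul, at the cost of a collective communication per layer.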
02
Batching
Continuous Batching
- Evicts completed sequences and inserts new ones within the same iteration
- Eliminates GPU idle time from sequences completing at different lengths
- Implemented in vLLM, TGI, and TensorRT-LLM — industry standard for production LLM serving
- Requires a scheduler to manage KV cache allocation across batch members
✓
Name continuous batching (and vLLM) as the default answer for LLM serving throughput
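A toy model of the scheduling loop, assuming one decode step per iteration and a known step count per request (real schedulers like vLLM's also manage KV cache blocks; all names here are hypothetical):

```python
from collections import deque

def continuous_batch(queue, max_batch, steps_needed):
    """Toy continuous-batching loop: each iteration decodes one token per
    active sequence, evicts finished sequences, and immediately admits
    waiting requests into the freed slots."""
    active, done, iterations = {}, [], 0
    while queue or active:
        # Admit new requests into freed batch slots.
        while queue and len(active) < max_batch:
            rid = queue.popleft()
            active[rid] = steps_needed[rid]
        # One decode step for every active sequence.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:   # finished: evict mid-batch, don't wait
                done.append(rid)
                del active[rid]
        iterations += 1
    return done, iterations

queue = deque(["a", "b", "c", "d"])
done, iters = continuous_batch(queue, max_batch=2,
                               steps_needed={"a": 3, "b": 1, "c": 2, "d": 1})
assert iters == 4  # static batching of {a,b} then {c,d} would need 5
```

The point of the sketch: because "b" finishes at iteration 1, "c" starts at iteration 2 instead of waiting for "a", which is exactly the idle time continuous batching eliminates.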
Chunked Prefill
- Split long prompts into chunks and process one chunk per iteration
- Prevents long prompts from monopolizing GPU memory and delaying shorter requests
- Interleaves prefill and decode work across iterations for better latency distribution
- Available in vLLM v0.4+ as a first-class feature
✓
Mention chunked prefill when discussing TTFT fairness for mixed long/short prompt workloads
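A sketch of the chunking idea, assuming fixed-size chunks and round-robin scheduling (a simplification of what serving engines actually do; the function name is hypothetical):

```python
def schedule_chunked_prefill(prompts, chunk_size):
    """Toy scheduler: split each prompt's tokens into fixed-size chunks and
    process at most one chunk per prompt per iteration, so a very long
    prompt cannot monopolize an iteration. Returns per-iteration work."""
    chunks = {pid: [tokens[i:i + chunk_size]
                    for i in range(0, len(tokens), chunk_size)]
              for pid, tokens in prompts.items()}
    schedule = []
    while any(chunks.values()):
        # Each iteration takes the next chunk from every unfinished prompt.
        step = [(pid, c.pop(0)) for pid, c in chunks.items() if c]
        schedule.append(step)
    return schedule

sched = schedule_chunked_prefill({"long": list(range(8)), "short": [1, 2]},
                                 chunk_size=2)
assert ("short", [1, 2]) in sched[0]  # short prompt finishes prefill in iter 1
assert len(sched) == 4                # long prompt spreads over 4 iterations
```

The short prompt's prefill completes in the first iteration rather than queuing behind the long one, which is the TTFT-fairness property the tip above refers to.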
03
Rate limits
Token-Based Rate Limiting
- Rate limit on tokens per minute (TPM), not just requests per minute (RPM)
- Token-bucket algorithm: allows bursts up to bucket capacity; refills at a constant rate
- Per-tier limits: free, pro, and enterprise users get different buckets
- Expose rate limit headers (X-RateLimit-Remaining, Retry-After) so clients can back off gracefully
✓
Limit on TPM in addition to RPM — a single large request can exhaust capacity as much as many small ones
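The token-bucket algorithm above is small enough to sketch directly; this minimal version meters tokens (TPM) rather than requests, with illustrative limits:

```python
import time

class TokenBucket:
    """Token bucket: allows bursts up to `capacity` tokens,
    refills continuously at `refill_rate` tokens per second."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last = time.monotonic()

    def try_consume(self, cost: float) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(capacity=10_000, refill_rate=10_000 / 60)  # 10k TPM tier
assert bucket.try_consume(8_000)      # one large request fits in the burst
assert not bucket.try_consume(8_000)  # a second must wait for refill
```

Per-tier limits fall out naturally: instantiate one bucket per (user, tier) with tier-specific capacity and refill rate, and surface `self.tokens` in an X-RateLimit-Remaining header.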
04
Autoscaling
GPU Autoscaling
- Scale on GPU memory utilization and request queue depth, not CPU metrics
- Scale-out lag for GPUs is 2–5 minutes (cold start); maintain a warm minimum replica count
- Kubernetes with KEDA or custom HPA using GPU metrics is a common pattern
- Preemptible / spot GPU instances reduce cost but require request replay on eviction
✓
Keep a minimum warm replica count to absorb traffic spikes while scale-out completes
✗
Scale on CPU/memory alone — GPU metrics are the true bottleneck for LLM workloads
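The scaling decision can be sketched as a pure function of queue depth and GPU memory utilization; the thresholds and names below are illustrative assumptions, not values from any real autoscaler:

```python
def desired_replicas(current, queue_depth, gpu_mem_util, *,
                     min_warm=2, max_replicas=16,
                     queue_per_replica=8, mem_target=0.85):
    """Toy scale decision driven by queue depth and GPU memory utilization
    (not CPU). Keeps `min_warm` replicas warm so traffic spikes are
    absorbed during the 2-5 minute GPU cold start."""
    by_queue = -(-queue_depth // queue_per_replica)  # ceil division
    by_mem = current + 1 if gpu_mem_util > mem_target else current
    target = max(by_queue, by_mem, min_warm)
    return min(target, max_replicas)

# An empty queue still keeps the warm floor of 2 replicas.
assert desired_replicas(1, queue_depth=0, gpu_mem_util=0.3) == 2
# A deep queue scales out ahead of request timeouts.
assert desired_replicas(2, queue_depth=40, gpu_mem_util=0.9) == 5
```

In a Kubernetes setup this logic is typically what KEDA or a custom HPA expresses declaratively over exported GPU metrics.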
Scale-to-Zero for Batch Workloads
- Non-interactive batch jobs can tolerate cold-start latency — scale to zero between jobs
- Store model weights in high-throughput storage (S3 with S3 Express One Zone, or FSx for Lustre) for fast loading
- Prefetch model weights before the scheduled job start to reduce observed cold-start
✓
Use scale-to-zero for overnight batch pipelines to eliminate idle GPU cost
05
Queues
Request Queue Architecture
- Queue inference requests during traffic spikes to avoid dropping them
- Priority queues: premium users and synchronous requests ahead of batch async requests
- Queue depth is a leading indicator of capacity issues — alert before requests time out
- Dead-letter queues capture failed requests for investigation and retry
✓
Separate synchronous (low-latency) and asynchronous (batch) request queues with different SLAs
✗
Use a single FIFO queue for all requests — a slow batch job will delay interactive users
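One way to sketch the priority ordering with the stdlib (a production setup would more likely run physically separate queues per SLA class, as the ✓ above suggests; the class and request names are hypothetical):

```python
import heapq
from itertools import count

class PriorityRequestQueue:
    """Toy priority queue: interactive (sync) requests always dequeue ahead
    of batch (async) ones; a monotonic counter keeps FIFO order per class
    and makes heap entries totally ordered."""
    SYNC, BATCH = 0, 1

    def __init__(self):
        self._heap = []
        self._seq = count()

    def put(self, request, priority):
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def get(self):
        return heapq.heappop(self._heap)[2]

q = PriorityRequestQueue()
q.put("nightly-batch-1", q.BATCH)
q.put("chat-user-1", q.SYNC)
q.put("chat-user-2", q.SYNC)
# Interactive requests jump ahead of the earlier-enqueued batch job.
assert [q.get(), q.get(), q.get()] == ["chat-user-1", "chat-user-2",
                                       "nightly-batch-1"]
```

Requests that repeatedly fail would be popped off and routed to a dead-letter queue instead of being re-enqueued indefinitely.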
06
Cost per request
GPU Cost Attribution
- Cost per request = (GPU $/hour × number of GPUs serving) ÷ requests served per hour
- GPU utilization directly drives cost efficiency — idle GPUs are wasted spend
- Compare cost per 1M tokens: self-hosted (at scale) is typically 5–20× cheaper than managed API
- Track cost by model, user tier, and feature to identify expensive outliers
✓
Build a cost-per-request dashboard from day one; it informs model routing and pricing decisions
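The attribution math is a one-liner worth internalizing; the GPU price and throughput below are illustrative assumptions, not quoted rates:

```python
def cost_per_request(gpu_hour_cost, num_gpus, requests_per_hour):
    """Cost per request = fleet GPU spend per hour / requests served per hour."""
    return gpu_hour_cost * num_gpus / requests_per_hour

def cost_per_million_tokens(gpu_hour_cost, num_gpus, tokens_per_second):
    """Normalize GPU spend to $ per 1M generated tokens, the unit managed
    APIs price in, so self-hosted vs API comparisons are apples-to-apples."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_cost * num_gpus / tokens_per_hour * 1_000_000

# Assumed: one GPU at $4.00/hr sustaining 2,000 tok/s aggregate throughput.
assert round(cost_per_million_tokens(4.00, 1, 2000), 2) == 0.56
```

Tagging each request with (model, user tier, feature) before feeding it through `cost_per_request` is what makes the outlier analysis in the bullets possible.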
Cost Optimization Levers
- Quantization (INT4/INT8): 2–4× memory reduction → more requests per GPU → lower cost per request
- Speculative decoding: higher throughput at same quality → lower cost per token
- Spot / preemptible instances: 60–80% discount vs on-demand for fault-tolerant batch workloads
- Reserved instances: commit to 1–3 year terms for predictable baseline load
✓
Stack cost levers (quantization + continuous batching + spot instances) for maximum savings
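To a first approximation the levers multiply; the multipliers below are illustrative midpoints of the ranges in the bullets, and assuming independence is optimistic (e.g. quantization and batching both compete for the same memory headroom):

```python
def stacked_savings(levers):
    """Multiply per-lever cost multipliers to estimate the combined effect.
    Assumes levers compose independently, which overstates real savings."""
    result = 1.0
    for _name, multiplier in levers:
        result *= multiplier
    return result

levers = [
    ("INT4 quantization, ~3x request density", 1 / 3),   # assumed midpoint of 2-4x
    ("continuous batching, ~2x throughput", 1 / 2),      # assumed
    ("spot instances, 70% discount", 0.30),
]
# Roughly 5% of naive on-demand cost per request under these assumptions.
assert round(stacked_savings(levers), 2) == 0.05
```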