AI Infrastructure

Hardware, orchestration, and cost strategies for running AI workloads at scale. Interviewers test your ability to design systems that remain performant and cost-efficient under load.

01. GPU scheduling

GPU Resource Management

  • Decode-phase LLM inference is memory-bandwidth bound, not compute bound; maximize GPU memory utilization to batch more sequences per weight read
  • Pack multiple LoRA adapters on a single GPU using shared base model weights
  • MIG (Multi-Instance GPU) on A100/H100 spatially partitions one GPU into isolated instances for smaller workloads; time-slicing shares a whole GPU without isolation
  • Scheduler should co-locate prefill-heavy (compute-bound) and decode-heavy (bandwidth-bound) requests so both compute and memory bandwidth stay busy
Know the difference between compute-bound (training, prefill) and memory-bandwidth-bound (decode) regimes; a back-of-envelope decode bound is sketched below
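
To see why decode is bandwidth-bound: every generated token streams the full weight set through HBM, so bandwidth divided by weight bytes caps per-sequence speed. A rough sketch under assumed figures (not vendor specs, and ignoring KV-cache traffic):

```python
# Back-of-envelope decode bound: each generated token must read all model
# weights from HBM, so bandwidth / weight-bytes caps tokens/sec per sequence.
# Figures below are illustrative assumptions, not vendor specs.

def decode_tokens_per_sec(params_b: float, bytes_per_param: float,
                          hbm_bandwidth_tb_s: float) -> float:
    weight_bytes = params_b * 1e9 * bytes_per_param
    return hbm_bandwidth_tb_s * 1e12 / weight_bytes

# 70B model, FP16 weights, ~3.35 TB/s HBM (H100-class): ~24 tokens/sec.
# Batching amortizes the same weight read over many sequences, which is
# why maximizing batch size (and GPU memory) drives throughput.
print(f"{decode_tokens_per_sec(70, 2, 3.35):.0f} tok/s per sequence")
```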

Model Parallelism Strategies

  • Tensor parallelism (TP): split a single layer's weight matrix across multiple GPUs
  • Pipeline parallelism (PP): split the model layer-by-layer across GPUs; introduces pipeline bubbles
  • Data parallelism (DP): replicate the model, split the batch — standard for training
  • TP reduces per-layer latency; PP reduces per-GPU memory; combine for large model serving
Describe TP + PP as the standard combination for serving 70B+ models across multiple GPUs
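
A minimal sketch of tensor parallelism on one linear layer, with two devices simulated as NumPy shards; a real serving stack would reassemble the output with an all-gather over NCCL rather than an in-process concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # activations (batch 4, hidden 8)
W = rng.standard_normal((8, 16))   # full weight matrix of one layer

# Tensor parallelism: shard W column-wise so each "GPU" owns half
# of the output features and computes its part independently.
W0, W1 = np.split(W, 2, axis=1)
y0, y1 = x @ W0, x @ W1

# An all-gather across devices reassembles the full activation.
y = np.concatenate([y0, y1], axis=1)
assert np.allclose(y, x @ W)
```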

02. Batching

Continuous Batching

  • Evicts completed sequences and inserts new ones within the same iteration
  • Eliminates GPU idle time from sequences completing at different lengths
  • Implemented in vLLM, TGI, and TensorRT-LLM — industry standard for production LLM serving
  • Requires a scheduler to manage KV cache allocation across batch members
Name continuous batching (and vLLM) as the default answer for LLM serving throughput
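
The scheduling loop in miniature, with a stand-in Request class (hypothetical; real engines like vLLM layer KV-cache paging and admission policies on top of this idea):

```python
from collections import deque

class Request:
    """Toy stand-in: a sequence that finishes after N decode steps."""
    def __init__(self, steps_left: int):
        self.steps_left = steps_left

    def decode_step(self) -> bool:
        self.steps_left -= 1
        return self.steps_left <= 0  # True when the sequence completes

def serve(requests, max_batch: int = 8) -> int:
    waiting, running = deque(requests), []
    iterations = 0
    while waiting or running:
        # Admit new sequences into free slots every iteration instead of
        # waiting for the whole batch to drain (the static-batching flaw).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step for the batch; evict finished sequences now.
        running = [r for r in running if not r.decode_step()]
        iterations += 1
    return iterations

print(serve([Request(n) for n in (3, 10, 5, 2)]))  # 10 iterations total
```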

Chunked Prefill

  • Split long prompts into chunks and process one chunk per iteration
  • Prevents long prompts from monopolizing GPU memory and delaying shorter requests
  • Interleaves prefill and decode work across iterations for better latency distribution
  • Available in vLLM v0.4+ as a first-class feature
Mention chunked prefill when discussing TTFT fairness for mixed long/short prompt workloads
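
The chunking itself, sketched with a stand-in model object (names are hypothetical; vLLM's actual scheduler handles the interleaving):

```python
CHUNK = 512  # tokens per prefill chunk; a tuning knob, value assumed

class FakeModel:
    """Stand-in for an engine; records how much KV cache the prompt fills."""
    def __init__(self):
        self.kv_len = 0

    def prefill(self, chunk):
        self.kv_len += len(chunk)

def chunked_prefill(model, prompt_tokens):
    # One slice per scheduler iteration; decode steps for other sequences
    # can run between slices, so short requests keep a low TTFT.
    for start in range(0, len(prompt_tokens), CHUNK):
        model.prefill(prompt_tokens[start:start + CHUNK])

m = FakeModel()
chunked_prefill(m, list(range(2000)))  # 2000-token prompt -> 4 chunks
assert m.kv_len == 2000
```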

03. Rate limits

Token-Based Rate Limiting

  • Rate limit on tokens per minute (TPM), not just requests per minute (RPM)
  • Token-bucket algorithm: allows bursts up to bucket capacity; refills at a constant rate
  • Per-tier limits: free, pro, and enterprise users get different buckets
  • Expose rate limit headers (X-RateLimit-Remaining, Retry-After) so clients can back off gracefully
Rate-limit on TPM in addition to RPM; a single large request can exhaust capacity as much as many small ones
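
A minimal in-process token bucket keyed on TPM (sketch only; a production limiter would keep the bucket in a shared store such as Redis and apply per-tier capacities):

```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)   # start full: bursts allowed
        self.refill = refill_per_sec
        self.last = time.monotonic()

    def allow(self, cost_tokens: int) -> bool:
        now = time.monotonic()
        # Refill at a constant rate, capped at capacity (the burst limit).
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= cost_tokens:
            self.tokens -= cost_tokens
            return True
        return False  # caller returns 429 with a Retry-After header

# A 100k-TPM tier: bucket capacity sets the burst, refill the sustained rate.
bucket = TokenBucket(capacity=100_000, refill_per_sec=100_000 / 60)
print(bucket.allow(cost_tokens=4_096))  # True while budget remains
```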

04. Autoscaling

GPU Autoscaling

  • Scale on GPU memory utilization and request queue depth, not CPU metrics
  • Scale-out lag for GPUs is 2–5 minutes (cold start); maintain a warm minimum replica count
  • Kubernetes with KEDA or custom HPA using GPU metrics is a common pattern
  • Preemptible / spot GPU instances reduce cost but require request replay on eviction
Keep a minimum warm replica count to absorb traffic spikes while scale-out completes
Don't scale on CPU/memory alone; GPU metrics and queue depth are the true bottleneck signals for LLM workloads
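
The replica math a KEDA scaler or custom HPA effectively applies, sketched with assumed thresholds:

```python
import math

def desired_replicas(queue_depth: int,
                     target_queue_per_replica: int = 10,  # assumed SLO knob
                     min_warm: int = 2,
                     max_replicas: int = 32) -> int:
    """Size the fleet so each replica owns a bounded share of the queue."""
    want = math.ceil(queue_depth / target_queue_per_replica)
    # Never scale below the warm floor: GPU cold starts take minutes,
    # so the warm minimum absorbs spikes while scale-out completes.
    return max(min_warm, min(max_replicas, want))

print(desired_replicas(queue_depth=0))    # 2  (warm floor)
print(desired_replicas(queue_depth=85))   # 9
```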

Scale-to-Zero for Batch Workloads

  • Non-interactive batch jobs can tolerate cold-start latency — scale to zero between jobs
  • Store model weights in high-throughput storage (e.g. S3 Express One Zone or FSx for Lustre) for fast loading
  • Prefetch model weights before the scheduled job start to reduce observed cold-start
Use scale-to-zero for overnight batch pipelines to eliminate idle GPU cost
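
A prefetch step that could run shortly before the scheduled job (bucket and prefix names are placeholders; the boto3 list/download calls are real APIs):

```python
import boto3

def prefetch_weights(bucket: str, prefix: str, dest_dir: str) -> None:
    """Pull all weight shards under a prefix so the job starts warm."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            filename = obj["Key"].rsplit("/", 1)[-1]
            s3.download_file(bucket, obj["Key"], f"{dest_dir}/{filename}")

# Placeholder names; trigger from cron a few minutes before the batch window.
prefetch_weights("models-bucket", "llama-70b/", "/local/weights")
```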

05. Queues

Request Queue Architecture

  • Queue inference requests during traffic spikes to avoid dropping them
  • Priority queues: premium users and synchronous requests ahead of batch async requests
  • Queue depth is a leading indicator of capacity issues — alert before requests time out
  • Dead-letter queues capture failed requests for investigation and retry
Separate synchronous (low-latency) and asynchronous (batch) request queues with different SLAs
Don't use a single FIFO queue for all requests; a slow batch job will delay interactive users
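
The dequeue policy in miniature (sketch only; a real deployment would put this behind a broker such as SQS or RabbitMQ, with the two SLAs on separate consumers):

```python
import heapq
import itertools

class RequestQueue:
    SYNC, ASYNC = 0, 1  # lower value dequeues first

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO per class

    def put(self, request, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def get(self):
        # Interactive (sync) work always dequeues before queued batch work.
        _, _, request = heapq.heappop(self._heap)
        return request

q = RequestQueue()
q.put("batch-embed-job", RequestQueue.ASYNC)
q.put("chat-completion", RequestQueue.SYNC)
print(q.get())  # chat-completion dequeues first despite arriving later
```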

06. Cost per request

GPU Cost Attribution

  • Cost per request = GPU cost per hour ÷ requests served per GPU per hour
  • GPU utilization directly drives cost efficiency — idle GPUs are wasted spend
  • Compare cost per 1M tokens: self-hosted (at scale) is typically 5–20× cheaper than managed API
  • Track cost by model, user tier, and feature to identify expensive outliers
Build a cost-per-request dashboard from day one; it informs model routing and pricing decisions
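
The arithmetic, with assumed figures (a $4/hr GPU serving 3,600 requests/hr at target utilization):

```python
# Illustrative numbers only; plug in your own rates and throughput.
gpu_cost_per_hour = 4.00          # $/GPU-hour, assumed on-demand rate
requests_per_gpu_hour = 3_600     # measured throughput at target load

cost_per_request = gpu_cost_per_hour / requests_per_gpu_hour
print(f"${cost_per_request:.4f} per request")  # $0.0011

# Attribution: tag each request with (model, tier) and aggregate spend.
spend: dict[tuple[str, str], float] = {}
for tag in [("llama-70b", "pro"), ("llama-8b", "free"), ("llama-70b", "pro")]:
    spend[tag] = spend.get(tag, 0.0) + cost_per_request
print(spend)
```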

Cost Optimization Levers

  • Quantization (INT4/INT8): 2–4× memory reduction → more requests per GPU → lower cost per request
  • Speculative decoding: higher throughput at same quality → lower cost per token
  • Spot / preemptible instances: 60–80% discount vs on-demand for fault-tolerant batch workloads
  • Reserved instances: commit to 1–3 year terms for predictable baseline load
Stack cost levers (quantization + continuous batching + spot instances) for maximum savings
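
A rough multiplicative model of stacking the levers; the factors are illustrative midpoints of the ranges above, not measured results:

```python
baseline_per_1m_tokens = 10.00  # assumed self-hosted baseline, $ per 1M tokens

levers = {
    "int8_quantization":   1 / 2.0,   # ~2x more requests per GPU
    "continuous_batching": 1 / 3.0,   # assumed gain vs static batching
    "spot_instances":      1 - 0.70,  # ~70% discount on compute
}

cost = baseline_per_1m_tokens
for lever, factor in levers.items():
    cost *= factor
    print(f"after {lever}: ${cost:.2f} per 1M tokens")
# Ends around $0.50 per 1M tokens under these assumptions.
```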