AI Infrastructure

Hardware, orchestration, and cost strategies for running AI workloads at scale. Interviewers test your ability to design systems that remain performant and cost-efficient under load.

01. GPU scheduling

GPU Resource Management

  • Decode-phase LLM inference is memory-bandwidth bound, not compute bound; maximize GPU memory utilization to batch more sequences per weight read
  • Pack multiple LoRA adapters on a single GPU using shared base model weights
  • MIG (Multi-Instance GPU) on A100/H100 spatially partitions one GPU into isolated instances for smaller workloads; time-slicing shares a whole GPU without isolation
  • Scheduler should co-locate prefill-heavy (compute-bound) and decode-heavy (bandwidth-bound) requests so both compute and memory bandwidth stay busy
Know the difference between compute-bound (training, prefill) and memory-bandwidth-bound (decode) regimes; a back-of-envelope decode bound is sketched below
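
To see why decode is bandwidth-bound: every generated token streams the full weight set through HBM, so bandwidth divided by weight bytes caps per-sequence speed. A rough sketch under assumed figures (not vendor specs, and ignoring KV-cache traffic):

```python
# Back-of-envelope decode bound: each generated token must read all model
# weights from HBM, so bandwidth / weight-bytes caps tokens/sec per sequence.
# Figures below are illustrative assumptions, not vendor specs.

def decode_tokens_per_sec(params_b: float, bytes_per_param: float,
                          hbm_bandwidth_tb_s: float) -> float:
    weight_bytes = params_b * 1e9 * bytes_per_param
    return hbm_bandwidth_tb_s * 1e12 / weight_bytes

# 70B model, FP16 weights, ~3.35 TB/s HBM (H100-class): ~24 tokens/sec.
# Batching amortizes the same weight read over many sequences, which is
# why maximizing batch size (and GPU memory) drives throughput.
print(f"{decode_tokens_per_sec(70, 2, 3.35):.0f} tok/s per sequence")
```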

Model Parallelism Strategies

  • Tensor parallelism (TP): split a single layer's weight matrix across multiple GPUs
  • Pipeline parallelism (PP): split the model layer-by-layer across GPUs; introduces pipeline bubbles
  • Data parallelism (DP): replicate the model, split the batch — standard for training
  • TP reduces per-layer latency; PP reduces per-GPU memory; combine for large model serving
Describe TP + PP as the standard combination for serving 70B+ models across multiple GPUs
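
A minimal sketch of tensor parallelism on one linear layer, with two devices simulated as NumPy shards; a real serving stack would reassemble the output with an all-gather over NCCL rather than an in-process concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # activations (batch 4, hidden 8)
W = rng.standard_normal((8, 16))   # full weight matrix of one layer

# Tensor parallelism: shard W column-wise so each "GPU" owns half
# of the output features and computes its part independently.
W0, W1 = np.split(W, 2, axis=1)
y0, y1 = x @ W0, x @ W1

# An all-gather across devices reassembles the full activation.
y = np.concatenate([y0, y1], axis=1)
assert np.allclose(y, x @ W)
```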

02. Batching

Continuous Batching

  • Evicts completed sequences and inserts new ones within the same iteration
  • Eliminates GPU idle time from sequences completing at different lengths
  • Implemented in vLLM, TGI, and TensorRT-LLM — industry standard for production LLM serving
  • Requires a scheduler to manage KV cache allocation across batch members
Name continuous batching (and vLLM) as the default answer for LLM serving throughput
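
The scheduling loop in miniature, with a stand-in Request class (hypothetical; real engines like vLLM layer KV-cache paging and admission policies on top of this idea):

```python
from collections import deque

class Request:
    """Toy stand-in: a sequence that finishes after N decode steps."""
    def __init__(self, steps_left: int):
        self.steps_left = steps_left

    def decode_step(self) -> bool:
        self.steps_left -= 1
        return self.steps_left <= 0  # True when the sequence completes

def serve(requests, max_batch: int = 8) -> int:
    waiting, running = deque(requests), []
    iterations = 0
    while waiting or running:
        # Admit new sequences into free slots every iteration instead of
        # waiting for the whole batch to drain (the static-batching flaw).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step for the batch; evict finished sequences now.
        running = [r for r in running if not r.decode_step()]
        iterations += 1
    return iterations

print(serve([Request(n) for n in (3, 10, 5, 2)]))  # 10 iterations total
```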

Chunked Prefill

  • Split long prompts into chunks and process one chunk per iteration
  • Prevents long prompts from monopolizing GPU memory and delaying shorter requests
  • Interleaves prefill and decode work across iterations for better latency distribution
  • Available in vLLM v0.4+ as a first-class feature
Mention chunked prefill when discussing TTFT fairness for mixed long/short prompt workloads
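
The chunking itself, sketched with a stand-in model object (names are hypothetical; vLLM's actual scheduler handles the interleaving):

```python
CHUNK = 512  # tokens per prefill chunk; a tuning knob, value assumed

class FakeModel:
    """Stand-in for an engine; records how much KV cache the prompt fills."""
    def __init__(self):
        self.kv_len = 0

    def prefill(self, chunk):
        self.kv_len += len(chunk)

def chunked_prefill(model, prompt_tokens):
    # One slice per scheduler iteration; decode steps for other sequences
    # can run between slices, so short requests keep a low TTFT.
    for start in range(0, len(prompt_tokens), CHUNK):
        model.prefill(prompt_tokens[start:start + CHUNK])

m = FakeModel()
chunked_prefill(m, list(range(2000)))  # 2000-token prompt -> 4 chunks
assert m.kv_len == 2000
```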

03. Rate limits

Token-Based Rate Limiting

  • Rate limit on tokens per minute (TPM), not just requests per minute (RPM)
  • Token-bucket algorithm: allows bursts up to bucket capacity; refills at a constant rate
  • Per-tier limits: free, pro, and enterprise users get different buckets
  • Expose rate limit headers (X-RateLimit-Remaining, Retry-After) so clients can back off gracefully
Rate-limit on TPM in addition to RPM; a single large request can exhaust capacity as much as many small ones
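
A minimal in-process token bucket keyed on TPM (sketch only; a production limiter would keep the bucket in a shared store such as Redis and apply per-tier capacities):

```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)   # start full: bursts allowed
        self.refill = refill_per_sec
        self.last = time.monotonic()

    def allow(self, cost_tokens: int) -> bool:
        now = time.monotonic()
        # Refill at a constant rate, capped at capacity (the burst limit).
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= cost_tokens:
            self.tokens -= cost_tokens
            return True
        return False  # caller returns 429 with a Retry-After header

# A 100k-TPM tier: bucket capacity sets the burst, refill the sustained rate.
bucket = TokenBucket(capacity=100_000, refill_per_sec=100_000 / 60)
print(bucket.allow(cost_tokens=4_096))  # True while budget remains
```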

04. Autoscaling

GPU Autoscaling

  • Scale on GPU memory utilization and request queue depth, not CPU metrics
  • Scale-out lag for GPUs is 2–5 minutes (cold start); maintain a warm minimum replica count
  • Kubernetes with KEDA or custom HPA using GPU metrics is a common pattern
  • Preemptible / spot GPU instances reduce cost but require request replay on eviction
Keep a minimum warm replica count to absorb traffic spikes while scale-out completes
Don't scale on CPU/memory alone; GPU metrics and queue depth are the true bottleneck signals for LLM workloads
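
The replica math a KEDA scaler or custom HPA effectively applies, sketched with assumed thresholds:

```python
import math

def desired_replicas(queue_depth: int,
                     target_queue_per_replica: int = 10,  # assumed SLO knob
                     min_warm: int = 2,
                     max_replicas: int = 32) -> int:
    """Size the fleet so each replica owns a bounded share of the queue."""
    want = math.ceil(queue_depth / target_queue_per_replica)
    # Never scale below the warm floor: GPU cold starts take minutes,
    # so the warm minimum absorbs spikes while scale-out completes.
    return max(min_warm, min(max_replicas, want))

print(desired_replicas(queue_depth=0))    # 2  (warm floor)
print(desired_replicas(queue_depth=85))   # 9
```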

Scale-to-Zero for Batch Workloads

  • Non-interactive batch jobs can tolerate cold-start latency — scale to zero between jobs
  • Store model weights in high-throughput storage (e.g. S3 Express One Zone or FSx for Lustre) for fast loading
  • Prefetch model weights before the scheduled job start to reduce observed cold-start
Use scale-to-zero for overnight batch pipelines to eliminate idle GPU cost
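
A prefetch step that could run shortly before the scheduled job (bucket and prefix names are placeholders; the boto3 list/download calls are real APIs):

```python
import boto3

def prefetch_weights(bucket: str, prefix: str, dest_dir: str) -> None:
    """Pull all weight shards under a prefix so the job starts warm."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            filename = obj["Key"].rsplit("/", 1)[-1]
            s3.download_file(bucket, obj["Key"], f"{dest_dir}/{filename}")

# Placeholder names; trigger from cron a few minutes before the batch window.
prefetch_weights("models-bucket", "llama-70b/", "/local/weights")
```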

05. Queues

Request Queue Architecture

  • Queue inference requests during traffic spikes to avoid dropping them
  • Priority queues: premium users and synchronous requests ahead of batch async requests
  • Queue depth is a leading indicator of capacity issues — alert before requests time out
  • Dead-letter queues capture failed requests for investigation and retry
Separate synchronous (low-latency) and asynchronous (batch) request queues with different SLAs
Don't use a single FIFO queue for all requests; a slow batch job will delay interactive users
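
The dequeue policy in miniature (sketch only; a real deployment would put this behind a broker such as SQS or RabbitMQ, with the two SLAs on separate consumers):

```python
import heapq
import itertools

class RequestQueue:
    SYNC, ASYNC = 0, 1  # lower value dequeues first

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO per class

    def put(self, request, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def get(self):
        # Interactive (sync) work always dequeues before queued batch work.
        _, _, request = heapq.heappop(self._heap)
        return request

q = RequestQueue()
q.put("batch-embed-job", RequestQueue.ASYNC)
q.put("chat-completion", RequestQueue.SYNC)
print(q.get())  # chat-completion dequeues first despite arriving later
```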

06. Cost per request

GPU Cost Attribution

  • Cost per request = GPU cost per hour ÷ requests served per GPU per hour
  • GPU utilization directly drives cost efficiency — idle GPUs are wasted spend
  • Compare cost per 1M tokens: self-hosted (at scale) is typically 5–20× cheaper than managed API
  • Track cost by model, user tier, and feature to identify expensive outliers
Build a cost-per-request dashboard from day one; it informs model routing and pricing decisions
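
The arithmetic, with assumed figures (a $4/hr GPU serving 3,600 requests/hr at target utilization):

```python
# Illustrative numbers only; plug in your own rates and throughput.
gpu_cost_per_hour = 4.00          # $/GPU-hour, assumed on-demand rate
requests_per_gpu_hour = 3_600     # measured throughput at target load

cost_per_request = gpu_cost_per_hour / requests_per_gpu_hour
print(f"${cost_per_request:.4f} per request")  # $0.0011

# Attribution: tag each request with (model, tier) and aggregate spend.
spend: dict[tuple[str, str], float] = {}
for tag in [("llama-70b", "pro"), ("llama-8b", "free"), ("llama-70b", "pro")]:
    spend[tag] = spend.get(tag, 0.0) + cost_per_request
print(spend)
```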

Cost Optimization Levers

  • Quantization (INT4/INT8): 2–4× memory reduction → more requests per GPU → lower cost per request
  • Speculative decoding: higher throughput at same quality → lower cost per token
  • Spot / preemptible instances: 60–80% discount vs on-demand for fault-tolerant batch workloads
  • Reserved instances: commit to 1–3 year terms for predictable baseline load
Stack cost levers (quantization + continuous batching + spot instances) for maximum savings
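
A rough multiplicative model of stacking the levers; the factors are illustrative midpoints of the ranges above, not measured results:

```python
baseline_per_1m_tokens = 10.00  # assumed self-hosted baseline, $ per 1M tokens

levers = {
    "int8_quantization":   1 / 2.0,   # ~2x more requests per GPU
    "continuous_batching": 1 / 3.0,   # assumed gain vs static batching
    "spot_instances":      1 - 0.70,  # ~70% discount on compute
}

cost = baseline_per_1m_tokens
for lever, factor in levers.items():
    cost *= factor
    print(f"after {lever}: ${cost:.2f} per 1M tokens")
# Ends around $0.50 per 1M tokens under these assumptions.
```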