Cheatsheet
AI System Design
End-to-end architecture of AI-powered systems at production scale. Interviewers expect you to reason about serving, retrieval topology, caching, multi-tenancy, fallbacks, and observability.
01
Serving
LLM Serving Architectures
- Managed API (OpenAI, Anthropic): easiest, no infra, but limited control and higher per-token cost
- Self-hosted inference: vLLM, TGI, TensorRT-LLM — lower cost at scale, higher ops burden
- Disaggregated serving: separate prefill and decode phases on different hardware for efficiency
- Model parallelism: tensor parallel (TP) splits a layer across GPUs; pipeline parallel (PP) splits layers
✓
Know when self-hosted inference becomes cost-justified over managed API (typically >10M tokens/day)
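As a sanity check on that cost-crossover claim, a back-of-envelope comparison is worth being able to do on the spot. Every number in this sketch is a placeholder; substitute your provider's current pricing and your actual GPU quotes:

```python
# Illustrative break-even arithmetic only -- every number below is an
# assumption; plug in your own provider pricing and GPU costs.

DAILY_TOKENS = 10_000_000          # assumed traffic
API_COST_PER_1M_TOKENS = 5.00      # hypothetical blended input/output price (USD)
GPU_HOURLY_COST = 2.50             # hypothetical on-demand GPU price (USD/hr)
GPUS_NEEDED = 2                    # assumed capacity for this traffic
OPS_OVERHEAD = 1.3                 # fudge factor for engineering/ops burden

api_daily = DAILY_TOKENS / 1_000_000 * API_COST_PER_1M_TOKENS
self_hosted_daily = GPU_HOURLY_COST * 24 * GPUS_NEEDED * OPS_OVERHEAD

print(f"managed API:  ${api_daily:.2f}/day")
print(f"self-hosted:  ${self_hosted_daily:.2f}/day")
print("self-hosting pays off" if self_hosted_daily < api_daily else "stay on the API")
```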
Throughput and Latency SLAs
- TTFT (time to first token): dominated by prompt processing (prefill) latency
- TPOT (time per output token): dominated by decode speed and batch contention
- SLA targets vary by use case: chatbots need low TTFT; batch jobs can tolerate high latency
- Load testing should cover burst traffic at 2–5× average request rate
✓
Define TTFT and TPOT separately when presenting latency requirements
✗
Report only average latency — p95 and p99 matter for user experience and SLA compliance
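A minimal sketch of measuring TTFT and TPOT client-side; the token iterator is a stand-in for whatever streaming object your provider's SDK returns:

```python
import time

def measure_latency(stream):
    """Measure TTFT and TPOT from any iterator that yields tokens.

    `stream` is a stand-in for a streaming LLM response; swap in
    your provider SDK's stream iterator.
    """
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # TTFT: queueing + prefill
        n_tokens += 1
    end = time.perf_counter()

    if first_token_at is None:
        raise ValueError("stream yielded no tokens")

    ttft = first_token_at - start
    # TPOT: average spacing of tokens after the first one (decode speed)
    tpot = (end - first_token_at) / max(n_tokens - 1, 1)
    return ttft, tpot
```

Aggregate these per-request measurements into p95/p99 distributions before quoting an SLA.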
02
RAG Topology
Naive vs Advanced RAG
- Naive RAG: embed query → top-K ANN → stuff chunks into prompt → generate
- Advanced RAG adds: query rewriting, re-ranking, iterative retrieval, and answer verification
- Modular RAG: treat each stage as a swappable component with its own latency and quality SLA
- Agentic RAG: the model decides when to retrieve and what to search for
✓
Start with naive RAG, measure quality gaps, then add complexity one component at a time
✗
Build advanced RAG from day one without establishing a naive RAG baseline to compare against
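For reference, the naive baseline fits in a few lines. `embed`, `index`, and `generate` below are placeholders for your embedding model, ANN index, and LLM client:

```python
# Minimal naive-RAG sketch; every dependency is injected as a placeholder.

def naive_rag(query: str, index, embed, generate, k: int = 5) -> str:
    query_vec = embed(query)                       # 1. embed the query
    chunks = index.search(query_vec, top_k=k)      # 2. top-K ANN retrieval
    context = "\n\n".join(c.text for c in chunks)  # 3. stuff chunks into prompt
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)                        # 4. generate
```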
Multi-Source RAG
- Retrieve from multiple indexes (web, docs, SQL, APIs) in parallel
- Fusion: merge ranked lists from each source with reciprocal rank fusion (RRF) or a learned combiner (see the sketch after this list)
- Source weighting: apply per-source trust or recency weights before fusion
- Citation tracking must be maintained across sources for transparency
✓
Parallelize multi-source retrieval; merging is cheap compared to sequential fetch
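A sketch of reciprocal rank fusion with optional per-source weights; the constant k=60 comes from the original RRF paper and rarely needs tuning:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, weights=None):
    """Merge per-source ranked lists of doc IDs with RRF.

    score(d) = sum over sources s of w_s / (k + rank_s(d)).
    """
    weights = weights or [1.0] * len(ranked_lists)
    scores = defaultdict(float)
    for w, ranking in zip(weights, ranked_lists):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse web, docs, and SQL results with per-source trust weights
fused = reciprocal_rank_fusion(
    [["a", "b", "c"], ["b", "d"], ["a", "d"]],
    weights=[1.0, 0.8, 0.5],
)
```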
03
Caching
Prompt Caching
- Cache repeated prompt prefixes (system prompt, few-shot examples) at the KV-cache level
- Providers (Anthropic, OpenAI) offer native prompt caching with cost discounts on cache hits
- Prefix must be stable and appear at the start of the prompt for cache hits to trigger
- Significant cost and latency savings for high-traffic applications with shared prefixes
✓
Put stable system prompt and few-shot examples first to maximize prefix cache hit rates
✗
Randomize or timestamp system prompts — it defeats prefix caching entirely
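As one concrete example, Anthropic's Messages API marks a cacheable prefix with a `cache_control` block (API shape current at the time of writing; check the provider docs, and substitute a current model id):

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "You are a support assistant for Acme..."  # long, stable prefix

def answer(user_query: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",   # substitute a current model id
        max_tokens=512,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            # marks the stable prefix as cacheable; hits bill at a discount
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": user_query}],  # varies per call
    )
    return response.content[0].text
```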
Semantic Response Caching
- Cache LLM responses keyed by the query embedding, looked up by vector similarity rather than exact string match
- Use cosine similarity threshold to serve cached responses to near-duplicate queries
- TTL should reflect content freshness requirements; shorter for dynamic data
- Cache-aside pattern: check cache → if miss, call LLM, store response, return to user
✓
Implement exact-match cache first; add semantic cache only if you have data showing high query repetition
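A minimal cache-aside semantic cache, assuming `embed` returns a NumPy vector and `call_llm` is your model client (TTL eviction omitted for brevity):

```python
import numpy as np

class SemanticCache:
    """Cache-aside semantic cache sketch; `embed` and `call_llm` are
    placeholders for your embedding model and LLM client."""

    def __init__(self, embed, call_llm, threshold: float = 0.95):
        self.embed, self.call_llm, self.threshold = embed, call_llm, threshold
        self.keys: list[np.ndarray] = []   # normalized query embeddings
        self.values: list[str] = []        # cached responses

    def get(self, query: str) -> str:
        q = self.embed(query)
        q = q / np.linalg.norm(q)
        for key, value in zip(self.keys, self.values):  # linear scan; use an
            if float(q @ key) >= self.threshold:        # ANN index at scale
                return value                            # near-duplicate hit
        answer = self.call_llm(query)                   # miss: call the LLM
        self.keys.append(q)                             # store for next time
        self.values.append(answer)
        return answer
```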
04
Multi-Tenancy
Multi-Tenant AI Systems
- Namespace isolation: partition vector indexes, document stores, and logs per tenant
- Rate limiting per tenant prevents one customer's load from starving others
- Shared model serving with per-tenant system prompts is the most cost-efficient architecture
- Data isolation requirements (GDPR, SOC2) may mandate separate storage per tenant
✓
Ask about tenant isolation requirements (data, model, compliance) early in any system design discussion
✗
Mix tenant data in the same vector namespace without metadata filtering — cross-tenant leakage is a security incident, not just a bug
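One way to make that rule hard to bypass is to inject the tenant filter at a single retrieval chokepoint. The `index.search` interface below is hypothetical, and filter syntax varies by vector store:

```python
# Sketch of enforcing tenant isolation at the retrieval layer: tenant_id
# is injected server-side, never taken from model output or user text.

def tenant_search(index, tenant_id: str, query_vec, top_k: int = 5):
    if not tenant_id:
        raise ValueError("tenant_id is required for every retrieval call")
    return index.search(
        vector=query_vec,
        top_k=top_k,
        # hard filter: even in a shared index, only this tenant's chunks
        filter={"tenant_id": {"$eq": tenant_id}},
    )
```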
05
Fallbacks
LLM Fallback Strategy
- Primary → fallback: try primary model, fall back to secondary on timeout or error
- Model routing: use a fast small model for simple queries; large model only for complex ones
- Graceful degradation: return cached or templated response if all models are down
- Circuit breaker: stop calling a failing provider for a cooldown period to prevent cascading load
✓
Design fallback chains and test them explicitly — primary provider outages do happen
✗
Treat multi-provider routing as a day-two concern — model API outages are common
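A sketch of a fallback chain wrapped in a minimal circuit breaker; `providers` is an ordered list of model-call functions, primary first, and all thresholds are illustrative:

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive errors, skip the provider for
    `cooldown` seconds, then let one request through (half-open)."""

    def __init__(self, max_failures: int = 3, cooldown: float = 60.0):
        self.max_failures, self.cooldown = max_failures, cooldown
        self.failures, self.opened_at = 0, 0.0

    def available(self) -> bool:
        if self.failures < self.max_failures:
            return True
        return time.monotonic() - self.opened_at > self.cooldown  # half-open

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

def complete(prompt: str, providers, fallback_text: str) -> str:
    """`providers` is a list of (call_fn, CircuitBreaker) pairs."""
    for call, breaker in providers:
        if not breaker.available():
            continue                      # provider is cooling down
        try:
            result = call(prompt)
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)      # try the next provider
    return fallback_text                  # graceful degradation
```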
06
Observability
LLM Observability Pillars
- Traces: full request path including retrieval, model calls, and tool use with timing
- Metrics: TTFT, TPOT, token usage, cost per request, error rate, cache hit rate
- Logs: prompt/response pairs (sampled) for debugging and audit
- Evaluations: automated quality checks run against sampled production traffic
✓
Name at least one observability tool (LangSmith, Langfuse, Helicone, Arize) in system design answers
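A toy version of the tracing pillar: record per-stage timings into a per-request structured log. In production you would hand these spans to a tool like Langfuse or LangSmith rather than rolling your own:

```python
import json, time, uuid
from contextlib import contextmanager

@contextmanager
def traced(request_log: dict, span: str):
    """Record wall-clock timing for one pipeline stage (retrieval,
    model call, tool use) into a per-request trace dict."""
    start = time.perf_counter()
    try:
        yield
    finally:
        request_log.setdefault("spans", {})[span] = time.perf_counter() - start

request_log = {"request_id": str(uuid.uuid4())}
with traced(request_log, "retrieval"):
    time.sleep(0.01)        # stand-in for the retrieval call
with traced(request_log, "llm_call"):
    time.sleep(0.02)        # stand-in for the model call
print(json.dumps(request_log))
```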
Cost Observability
- Track token usage (input + output) per request, user, and feature
- Set per-user and per-tenant spending limits to prevent cost spikes
- Prompt efficiency metrics: a rising tokens-per-useful-output ratio signals prompt bloat
- Alert on unexpected cost increases — they often indicate prompt injection or runaway agents
✓
Instrument token cost at the request level from day one — retrofitting is painful
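A sketch of request-level cost instrumentation. The price table and the `metrics_sink.emit` interface are hypothetical stand-ins for your providers' current rates and your metrics client:

```python
# Hypothetical price table -- replace with your providers' current rates.
PRICE_PER_1M = {                     # USD per 1M tokens (input, output)
    "small-model": (0.15, 0.60),
    "large-model": (3.00, 15.00),
}

def record_cost(metrics_sink, *, model: str, user: str, feature: str,
                input_tokens: int, output_tokens: int) -> float:
    """Compute request cost and emit it tagged by user and feature so
    spend can be aggregated, capped, and alerted on per dimension."""
    in_rate, out_rate = PRICE_PER_1M[model]
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    metrics_sink.emit("llm_cost_usd", cost,
                      tags={"model": model, "user": user, "feature": feature})
    return cost
```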