AI System Design

End-to-end architecture of AI-powered systems at production scale. Interviewers expect you to reason about serving, caching, multi-tenancy, fallbacks, and observability.

01 Serving

LLM Serving Architectures

  • Managed API (OpenAI, Anthropic): easiest, no infra, but limited control and higher per-token cost
  • Self-hosted inference: vLLM, TGI, TensorRT-LLM — lower cost at scale, higher ops burden
  • Disaggregated serving: separate prefill and decode phases on different hardware for efficiency
  • Model parallelism: tensor parallel (TP) splits a layer across GPUs; pipeline parallel (PP) splits layers
Know when self-hosted inference becomes cost-justified over managed API (typically >10M tokens/day)
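
A back-of-the-envelope comparison makes that break-even point concrete. Every price and cluster size below is an illustrative assumption, not a current list price; substitute your own numbers:

```python
# Rough daily cost: managed API vs. self-hosted GPUs. All figures are assumptions.
API_PRICE_PER_1M_TOKENS = 10.00   # assumed blended large-model price, USD
GPU_HOURLY_COST = 2.50            # assumed per-GPU on-demand cost, USD
GPUS_NEEDED = 2                   # assumed cluster size for the target model
OPS_OVERHEAD_PER_DAY = 30.0       # assumed monitoring/engineering overhead, USD

def managed_api_cost(tokens_per_day: float) -> float:
    # Scales linearly with traffic.
    return tokens_per_day / 1_000_000 * API_PRICE_PER_1M_TOKENS

def self_hosted_cost() -> float:
    # Mostly fixed: the GPUs cost the same whether or not they are busy.
    return GPUS_NEEDED * GPU_HOURLY_COST * 24 + OPS_OVERHEAD_PER_DAY

for tokens in (1e6, 10e6, 100e6):
    print(f"{tokens / 1e6:>5.0f}M tokens/day:  "
          f"API ${managed_api_cost(tokens):8.2f}   self-hosted ${self_hosted_cost():8.2f}")
```

Where the break-even actually lands depends heavily on model size, GPU utilization, and how much engineering time you count as overhead.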

Throughput and Latency SLAs

  • TTFT (time to first token): dominated by prompt processing (prefill) latency
  • TPOT (time per output token): dominated by decode speed and batch contention
  • SLA targets vary by use case: chatbots need low TTFT; batch jobs can tolerate high latency
  • Load testing should cover burst traffic at 2–5× average request rate
Define TTFT and TPOT separately when presenting latency requirements
Don't report only average latency — p95 and p99 matter for user experience and SLA compliance
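
A minimal sketch of measuring TTFT and TPOT per request and summarizing them at p95/p99. Here stream_tokens stands in for whatever streaming client you use; it is a hypothetical placeholder, not a real SDK call:

```python
import statistics
import time

def measure_request(stream_tokens, prompt: str) -> tuple[float, float]:
    """Return (TTFT, TPOT) in seconds for one streamed generation."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream_tokens(prompt):          # placeholder: yields tokens as they arrive
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    assert first_token_at is not None, "no tokens streamed"
    ttft = first_token_at - start                        # prefill-dominated
    tpot = (end - first_token_at) / max(count - 1, 1)    # decode-dominated
    return ttft, tpot

def pctile(samples: list[float], p: int) -> float:
    # statistics.quantiles returns 99 cut points; index p-1 approximates the p-th percentile
    return statistics.quantiles(samples, n=100)[p - 1]

# ttfts, tpots = zip(*(measure_request(stream_tokens, q) for q in test_prompts))
# print("TTFT p95:", pctile(list(ttfts), 95), " p99:", pctile(list(ttfts), 99))
```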

02 RAG Topology

Naive vs Advanced RAG

  • Naive RAG: embed query → top-K ANN → stuff chunks into prompt → generate
  • Advanced RAG adds: query rewriting, re-ranking, iterative retrieval, and answer verification
  • Modular RAG: treat each stage as a swappable component with its own latency and quality SLA
  • Agentic RAG: the model decides when to retrieve and what to search for
Start with naive RAG, measure quality gaps, then add complexity one component at a time
Don't build advanced RAG from day one without establishing a naive RAG baseline to compare against
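
A minimal naive-RAG sketch to use as that baseline. embed, vector_index, and llm are placeholders for your embedding model, ANN index, and model client, not any particular library's API:

```python
def naive_rag(query: str, embed, vector_index, llm, k: int = 5) -> str:
    query_vec = embed(query)                          # 1. embed the query
    hits = vector_index.search(query_vec, top_k=k)    # 2. top-K approximate nearest neighbors
    context = "\n\n".join(hit.text for hit in hits)   # 3. stuff retrieved chunks into the prompt
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return llm.generate(prompt)                       # 4. generate
```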

Multi-Source RAG

  • Retrieve from multiple indexes (web, docs, SQL, APIs) in parallel
  • Fusion: merge ranked lists from each source using RRF or a learned combiner
  • Source weighting: apply per-source trust or recency weights before fusion
  • Citation tracking must be maintained across sources for transparency
Parallelize multi-source retrieval; merging is cheap compared to sequential fetch
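
A sketch of Reciprocal Rank Fusion (RRF) with optional per-source weights. Each ranked list is just document IDs, best first, and k = 60 is the constant conventionally used with RRF:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60,
             source_weights: list[float] | None = None) -> list[str]:
    """Merge ranked lists: score(d) = sum over sources of weight / (k + rank)."""
    weights = source_weights or [1.0] * len(ranked_lists)
    scores: defaultdict[str, float] = defaultdict(float)
    for weight, ranking in zip(weights, ranked_lists):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)     # per-source trust/recency weight applied here
    return sorted(scores, key=scores.get, reverse=True)

# Fuse results retrieved in parallel from, say, a docs index, a web index, and a SQL source.
merged = rrf_fuse([["d3", "d1", "d7"], ["d1", "d9", "d3"], ["d7", "d1"]],
                  source_weights=[1.0, 0.7, 0.5])
```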

03 Caching

Prompt Caching

  • Cache repeated prompt prefixes (system prompt, few-shot examples) at the KV-cache level
  • Providers (Anthropic, OpenAI) offer native prompt caching with cost discounts on cache hits
  • Prefix must be stable and appear at the start of the prompt for cache hits to trigger
  • Significant cost and latency savings for high-traffic applications with shared prefixes
Put stable system prompt and few-shot examples first to maximize prefix cache hit rates
Don't randomize or timestamp system prompts — doing so defeats prefix caching entirely
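
A sketch of prompt assembly that keeps the cacheable prefix stable and first. Provider-specific cache markers (for example Anthropic's cache_control block) are omitted here; check your provider's docs for how the prefix is flagged:

```python
SYSTEM_PROMPT = "You are a support assistant for ..."     # stable, identical on every request
FEW_SHOT = [
    {"role": "user", "content": "Example question"},      # stable few-shot examples
    {"role": "assistant", "content": "Example answer"},
]

def build_messages(user_query: str, retrieved_context: str) -> list[dict]:
    # Volatile content (timestamps, request IDs, retrieved chunks) must come AFTER
    # the stable prefix, or the prefix cache will never hit.
    return (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + FEW_SHOT
        + [{"role": "user", "content": f"{retrieved_context}\n\n{user_query}"}]
    )
```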

Semantic Response Caching

  • Cache LLM responses keyed by a semantic hash of the query embedding
  • Use cosine similarity threshold to serve cached responses to near-duplicate queries
  • TTL should reflect content freshness requirements; shorter for dynamic data
  • Cache-aside pattern: check cache → if miss, call LLM, store response, return to user
Implement exact-match cache first; add semantic cache only if you have data showing high query repetition
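
A cache-aside sketch that layers an optional semantic cache on top of an exact-match cache. embed and call_llm are placeholders, and the similarity threshold and TTL are assumptions to tune against your own traffic:

```python
import hashlib
import time
import numpy as np

TTL_SECONDS = 3600          # assumed freshness window
SIM_THRESHOLD = 0.95        # assumed cosine threshold for "near-duplicate"

exact_cache: dict[str, tuple[float, str]] = {}             # query hash -> (stored_at, response)
semantic_cache: list[tuple[np.ndarray, float, str]] = []   # (query embedding, stored_at, response)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(query: str, embed, call_llm) -> str:
    now = time.time()
    key = hashlib.sha256(query.encode()).hexdigest()
    if key in exact_cache and now - exact_cache[key][0] < TTL_SECONDS:   # 1. exact hit
        return exact_cache[key][1]
    qvec = embed(query)
    for vec, stored_at, cached in semantic_cache:                        # 2. semantic hit
        # Linear scan is fine for a sketch; use an ANN index at scale.
        if now - stored_at < TTL_SECONDS and cosine(qvec, vec) >= SIM_THRESHOLD:
            return cached
    response = call_llm(query)                                           # 3. miss: call the model
    exact_cache[key] = (now, response)
    semantic_cache.append((qvec, now, response))
    return response
```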

04 Multi-Tenancy

Multi-Tenant AI Systems

  • Namespace isolation: partition vector indexes, document stores, and logs per tenant
  • Rate limiting per tenant prevents one customer's load from starving others
  • Shared model serving with per-tenant system prompts is the most cost-efficient architecture
  • Data isolation requirements (GDPR, SOC2) may mandate separate storage per tenant
Ask about tenant isolation requirements (data, model, compliance) early in any system design discussion
Don't mix tenant data in the same vector namespace without metadata filtering — cross-tenant leakage is a serious bug
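
A sketch of query-time tenant scoping when tenants share an index. vector_index.search and its filter argument stand in for your vector store's filtered-query API rather than a specific product's signature:

```python
def tenant_scoped_search(vector_index, embed, tenant_id: str, query: str, k: int = 5):
    # Fail closed: never run a retrieval query without an explicit tenant scope.
    if not tenant_id:
        raise ValueError("tenant_id is required")
    return vector_index.search(
        vector=embed(query),
        top_k=k,
        filter={"tenant_id": tenant_id},   # metadata filter as the isolation boundary
    )
```

Separate namespaces or indexes per tenant are the stronger (and sometimes compliance-mandated) form of the same idea.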

05 Fallbacks

LLM Fallback Strategy

  • Primary → fallback: try primary model, fall back to secondary on timeout or error
  • Model routing: use a fast small model for simple queries; large model only for complex ones
  • Graceful degradation: return cached or templated response if all models are down
  • Circuit breaker: stop calling a failing provider for a cooldown period to prevent cascading load
Design fallback chains and test them explicitly — primary provider outages do happen
Don't treat multi-provider routing as a day-two concern — model API outages are common
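
A simplified fallback chain with a per-provider circuit breaker. The provider callables, the cooldown, and the trip-on-one-failure behaviour are assumptions; real breakers usually require several consecutive failures before opening:

```python
import time
from typing import Callable

COOLDOWN_SECONDS = 60
_tripped_until: dict[str, float] = {}   # provider name -> timestamp until which it is skipped

def call_with_fallback(providers: list[tuple[str, Callable[[str], str]]], prompt: str) -> str:
    for name, call in providers:
        if time.time() < _tripped_until.get(name, 0.0):
            continue                                   # circuit open: skip this provider
        try:
            return call(prompt)                        # primary first, then fallbacks in order
        except Exception:                              # timeout, 5xx, rate limit, ...
            _tripped_until[name] = time.time() + COOLDOWN_SECONDS
    # Graceful degradation: everything failed, return a templated (or cached) response.
    return "We're having trouble answering right now. Please try again in a moment."
```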

06 Observability

LLM Observability Pillars

  • Traces: full request path including retrieval, model calls, and tool use with timing
  • Metrics: TTFT, TPOT, token usage, cost per request, error rate, cache hit rate
  • Logs: prompt/response pairs (sampled) for debugging and audit
  • Evaluations: automated quality checks run against sampled production traffic
Name at least one observability tool (LangSmith, Langfuse, Helicone, Arize) in system design answers
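
A minimal in-process sketch of span timing for one request; in practice you would export spans and metrics to one of the tools above instead of keeping them in an in-memory dict:

```python
import time
from contextlib import contextmanager

trace: dict = {"spans": [], "metrics": {}}

@contextmanager
def span(name: str):
    # Times one stage of the request (retrieval, model call, tool use) and records it.
    start = time.perf_counter()
    try:
        yield
    finally:
        trace["spans"].append({"name": name, "duration_s": time.perf_counter() - start})

# Usage inside a request handler:
# with span("retrieval"):
#     docs = retrieve(query)
# with span("llm_call"):
#     reply = llm.generate(prompt)
# trace["metrics"].update({"input_tokens": 1200, "output_tokens": 350, "cache_hit": False})
```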

Cost Observability

  • Track token usage (input + output) per request, user, and feature
  • Set per-user and per-tenant spending limits to prevent cost spikes
  • Prompt efficiency metrics: tokens-per-useful-output signals prompt bloat
  • Alert on unexpected cost increases — they often indicate prompt injection or runaway agents
Instrument token cost at the request level from day one — retrofitting is painful
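
A sketch of request-level cost accounting with a per-tenant spending limit. Model names, prices, and the limit are illustrative assumptions:

```python
from collections import defaultdict

PRICE_PER_1M_TOKENS = {            # assumed (input, output) USD prices per model
    "small-model": (0.25, 1.25),
    "large-model": (3.00, 15.00),
}
spend_by_tenant: defaultdict[str, float] = defaultdict(float)

def record_cost(tenant_id: str, model: str, input_tokens: int, output_tokens: int,
                tenant_daily_limit_usd: float = 50.0) -> float:
    in_price, out_price = PRICE_PER_1M_TOKENS[model]
    cost = input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price
    spend_by_tenant[tenant_id] += cost
    if spend_by_tenant[tenant_id] > tenant_daily_limit_usd:
        # Alert or throttle rather than silently absorbing a cost spike.
        raise RuntimeError(f"tenant {tenant_id} exceeded its daily spending limit")
    return cost
```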