Cheatsheet
AI System Design
End-to-end architecture of AI-powered systems at production scale. Interviewers expect you to reason about serving, retrieval topology, caching, multi-tenancy, fallbacks, and observability.
01
Serving
LLM Serving Architectures
- Managed API (OpenAI, Anthropic): easiest, no infra, but limited control and higher per-token cost
- Self-hosted inference: vLLM, TGI, TensorRT-LLM — lower cost at scale, higher ops burden
- Disaggregated serving: separate prefill and decode phases on different hardware for efficiency
- Model parallelism: tensor parallel (TP) splits a layer across GPUs; pipeline parallel (PP) splits layers
✓
Know when self-hosted inference becomes cost-justified over managed API (typically >10M tokens/day)
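As a sanity check on that cost-crossover claim, a back-of-envelope comparison is worth being able to do on the spot. Every number in this sketch is a placeholder; substitute your provider's current pricing and your actual GPU quotes:

```python
# Illustrative break-even arithmetic only -- every number below is an
# assumption; plug in your own provider pricing and GPU costs.

DAILY_TOKENS = 10_000_000          # assumed traffic
API_COST_PER_1M_TOKENS = 5.00      # hypothetical blended input/output price (USD)
GPU_HOURLY_COST = 2.50             # hypothetical on-demand GPU price (USD/hr)
GPUS_NEEDED = 2                    # assumed capacity for this traffic
OPS_OVERHEAD = 1.3                 # fudge factor for engineering/ops burden

api_daily = DAILY_TOKENS / 1_000_000 * API_COST_PER_1M_TOKENS
self_hosted_daily = GPU_HOURLY_COST * 24 * GPUS_NEEDED * OPS_OVERHEAD

print(f"managed API:  ${api_daily:.2f}/day")
print(f"self-hosted:  ${self_hosted_daily:.2f}/day")
print("self-hosting pays off" if self_hosted_daily < api_daily else "stay on the API")
```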
Throughput and Latency SLAs
- TTFT (time to first token): dominated by prompt processing (prefill) latency
- TPOT (time per output token): dominated by decode speed and batch contention
- SLA targets vary by use case: chatbots need low TTFT; batch jobs can tolerate high latency
- Load testing should cover burst traffic at 2–5× average request rate
✓
Define TTFT and TPOT separately when presenting latency requirements
✗
Report only average latency — p95 and p99 matter for user experience and SLA compliance
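A minimal sketch of measuring TTFT and TPOT client-side; the token iterator is a stand-in for whatever streaming object your provider's SDK returns:

```python
import time

def measure_latency(stream):
    """Measure TTFT and TPOT from any iterator that yields tokens.

    `stream` is a stand-in for a streaming LLM response; swap in
    your provider SDK's stream iterator.
    """
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # TTFT: queueing + prefill
        n_tokens += 1
    end = time.perf_counter()

    if first_token_at is None:
        raise ValueError("stream yielded no tokens")

    ttft = first_token_at - start
    # TPOT: average spacing of tokens after the first one (decode speed)
    tpot = (end - first_token_at) / max(n_tokens - 1, 1)
    return ttft, tpot
```

Aggregate these per-request measurements into p95/p99 distributions before quoting an SLA.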
02
RAG Topology
Naive vs Advanced RAG
- Naive RAG: embed query → top-K ANN → stuff chunks into prompt → generate
- Advanced RAG adds: query rewriting, re-ranking, iterative retrieval, and answer verification
- Modular RAG: treat each stage as a swappable component with its own latency and quality SLA
- Agentic RAG: the model decides when to retrieve and what to search for
✓
Start with naive RAG, measure quality gaps, then add complexity one component at a time
✗
Build advanced RAG from day one without establishing a naive RAG baseline to compare against
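For reference, the naive baseline fits in a few lines. `embed`, `index`, and `generate` below are placeholders for your embedding model, ANN index, and LLM client:

```python
# Minimal naive-RAG sketch; every dependency is injected as a placeholder.

def naive_rag(query: str, index, embed, generate, k: int = 5) -> str:
    query_vec = embed(query)                       # 1. embed the query
    chunks = index.search(query_vec, top_k=k)      # 2. top-K ANN retrieval
    context = "\n\n".join(c.text for c in chunks)  # 3. stuff chunks into prompt
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)                        # 4. generate
```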
Multi-Source RAG
- Retrieve from multiple indexes (web, docs, SQL, APIs) in parallel
- Fusion: merge ranked lists from each source with reciprocal rank fusion (RRF) or a learned combiner (see the sketch after this list)
- Source weighting: apply per-source trust or recency weights before fusion
- Citation tracking must be maintained across sources for transparency
✓
Parallelize multi-source retrieval; merging is cheap compared to sequential fetch
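A sketch of reciprocal rank fusion with optional per-source weights; the constant k=60 comes from the original RRF paper and rarely needs tuning:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, weights=None):
    """Merge per-source ranked lists of doc IDs with RRF.

    score(d) = sum over sources s of w_s / (k + rank_s(d)).
    """
    weights = weights or [1.0] * len(ranked_lists)
    scores = defaultdict(float)
    for w, ranking in zip(weights, ranked_lists):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse web, docs, and SQL results with per-source trust weights
fused = reciprocal_rank_fusion(
    [["a", "b", "c"], ["b", "d"], ["a", "d"]],
    weights=[1.0, 0.8, 0.5],
)
```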
03
Caching
Prompt Caching
- Cache repeated prompt prefixes (system prompt, few-shot examples) at the KV-cache level
- Providers (Anthropic, OpenAI) offer native prompt caching with cost discounts on cache hits
- Prefix must be stable and appear at the start of the prompt for cache hits to trigger
- Significant cost and latency savings for high-traffic applications with shared prefixes
✓
Put stable system prompt and few-shot examples first to maximize prefix cache hit rates
✗
Randomize or timestamp system prompts — it defeats prefix caching entirely
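As one concrete example, Anthropic's Messages API marks a cacheable prefix with a `cache_control` block (API shape current at the time of writing; check the provider docs, and substitute a current model id):

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "You are a support assistant for Acme..."  # long, stable prefix

def answer(user_query: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",   # substitute a current model id
        max_tokens=512,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            # marks the stable prefix as cacheable; hits bill at a discount
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": user_query}],  # varies per call
    )
    return response.content[0].text
```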
Semantic Response Caching
- Cache LLM responses keyed by the query embedding, looked up by vector similarity rather than exact string match
- Use cosine similarity threshold to serve cached responses to near-duplicate queries
- TTL should reflect content freshness requirements; shorter for dynamic data
- Cache-aside pattern: check cache → if miss, call LLM, store response, return to user
✓
Implement exact-match cache first; add semantic cache only if you have data showing high query repetition
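A minimal cache-aside semantic cache, assuming `embed` returns a NumPy vector and `call_llm` is your model client (TTL eviction omitted for brevity):

```python
import numpy as np

class SemanticCache:
    """Cache-aside semantic cache sketch; `embed` and `call_llm` are
    placeholders for your embedding model and LLM client."""

    def __init__(self, embed, call_llm, threshold: float = 0.95):
        self.embed, self.call_llm, self.threshold = embed, call_llm, threshold
        self.keys: list[np.ndarray] = []   # normalized query embeddings
        self.values: list[str] = []        # cached responses

    def get(self, query: str) -> str:
        q = self.embed(query)
        q = q / np.linalg.norm(q)
        for key, value in zip(self.keys, self.values):  # linear scan; use an
            if float(q @ key) >= self.threshold:        # ANN index at scale
                return value                            # near-duplicate hit
        answer = self.call_llm(query)                   # miss: call the LLM
        self.keys.append(q)                             # store for next time
        self.values.append(answer)
        return answer
```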
04
Multi-Tenancy
Multi-Tenant AI Systems
- Namespace isolation: partition vector indexes, document stores, and logs per tenant
- Rate limiting per tenant prevents one customer's load from starving others
- Shared model serving with per-tenant system prompts is the most cost-efficient architecture
- Data isolation requirements (GDPR, SOC2) may mandate separate storage per tenant
✓
Ask about tenant isolation requirements (data, model, compliance) early in any system design discussion
✗
Mix tenant data in the same vector namespace without metadata filtering — cross-tenant leakage is a security incident, not just a bug
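One way to make that rule hard to bypass is to inject the tenant filter at a single retrieval chokepoint. The `index.search` interface below is hypothetical, and filter syntax varies by vector store:

```python
# Sketch of enforcing tenant isolation at the retrieval layer: tenant_id
# is injected server-side, never taken from model output or user text.

def tenant_search(index, tenant_id: str, query_vec, top_k: int = 5):
    if not tenant_id:
        raise ValueError("tenant_id is required for every retrieval call")
    return index.search(
        vector=query_vec,
        top_k=top_k,
        # hard filter: even in a shared index, only this tenant's chunks
        filter={"tenant_id": {"$eq": tenant_id}},
    )
```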
05
Fallbacks
LLM Fallback Strategy
- Primary → fallback: try primary model, fall back to secondary on timeout or error
- Model routing: use a fast small model for simple queries; large model only for complex ones
- Graceful degradation: return cached or templated response if all models are down
- Circuit breaker: stop calling a failing provider for a cooldown period to prevent cascading load
✓
Design fallback chains and test them explicitly — primary provider outages do happen
✗
Treat multi-provider routing as a day-two concern — model API outages are common
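A sketch of a fallback chain wrapped in a minimal circuit breaker; `providers` is an ordered list of model-call functions, primary first, and all thresholds are illustrative:

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive errors, skip the provider for
    `cooldown` seconds, then let one request through (half-open)."""

    def __init__(self, max_failures: int = 3, cooldown: float = 60.0):
        self.max_failures, self.cooldown = max_failures, cooldown
        self.failures, self.opened_at = 0, 0.0

    def available(self) -> bool:
        if self.failures < self.max_failures:
            return True
        return time.monotonic() - self.opened_at > self.cooldown  # half-open

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

def complete(prompt: str, providers, fallback_text: str) -> str:
    """`providers` is a list of (call_fn, CircuitBreaker) pairs."""
    for call, breaker in providers:
        if not breaker.available():
            continue                      # provider is cooling down
        try:
            result = call(prompt)
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)      # try the next provider
    return fallback_text                  # graceful degradation
```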
06
Observability
LLM Observability Pillars
- Traces: full request path including retrieval, model calls, and tool use with timing
- Metrics: TTFT, TPOT, token usage, cost per request, error rate, cache hit rate
- Logs: prompt/response pairs (sampled) for debugging and audit
- Evaluations: automated quality checks run against sampled production traffic
✓
Name at least one observability tool (LangSmith, Langfuse, Helicone, Arize) in system design answers
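A toy version of the tracing pillar: record per-stage timings into a per-request structured log. In production you would hand these spans to a tool like Langfuse or LangSmith rather than rolling your own:

```python
import json, time, uuid
from contextlib import contextmanager

@contextmanager
def traced(request_log: dict, span: str):
    """Record wall-clock timing for one pipeline stage (retrieval,
    model call, tool use) into a per-request trace dict."""
    start = time.perf_counter()
    try:
        yield
    finally:
        request_log.setdefault("spans", {})[span] = time.perf_counter() - start

request_log = {"request_id": str(uuid.uuid4())}
with traced(request_log, "retrieval"):
    time.sleep(0.01)        # stand-in for the retrieval call
with traced(request_log, "llm_call"):
    time.sleep(0.02)        # stand-in for the model call
print(json.dumps(request_log))
```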
Cost Observability
- Track token usage (input + output) per request, user, and feature
- Set per-user and per-tenant spending limits to prevent cost spikes
- Prompt efficiency metrics: a rising tokens-per-useful-output ratio signals prompt bloat
- Alert on unexpected cost increases — they often indicate prompt injection or runaway agents
✓
Instrument token cost at the request level from day one — retrofitting is painful
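A sketch of request-level cost instrumentation. The price table and the `metrics_sink.emit` interface are hypothetical stand-ins for your providers' current rates and your metrics client:

```python
# Hypothetical price table -- replace with your providers' current rates.
PRICE_PER_1M = {                     # USD per 1M tokens (input, output)
    "small-model": (0.15, 0.60),
    "large-model": (3.00, 15.00),
}

def record_cost(metrics_sink, *, model: str, user: str, feature: str,
                input_tokens: int, output_tokens: int) -> float:
    """Compute request cost and emit it tagged by user and feature so
    spend can be aggregated, capped, and alerted on per dimension."""
    in_rate, out_rate = PRICE_PER_1M[model]
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    metrics_sink.emit("llm_cost_usd", cost,
                      tags={"model": model, "user": user, "feature": feature})
    return cost
```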