Cheatsheet

LLMOps

Operational practices for shipping and maintaining LLM-powered systems reliably. Interviewers look for experience with rollout safety, cost control, and regression prevention.

01

Rollouts

Canary and Shadow Deployments

  • Canary: send a small percentage of live traffic to the new model version; monitor before full rollout
  • Shadow mode: duplicate requests to new version in parallel; compare outputs without affecting users
  • Feature flags allow instant rollback without redeployment
  • Auto-rollback: trigger on quality metric drop (e.g., >5% increase in safety classifier flags)
Do: Default to canary + shadow before any model or prompt change touches >1% of users (routing sketch below)
Don't: Big-bang deploy a new model version without a staged rollout and rollback plan
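
A minimal sketch of canary routing with a flag-based kill switch, assuming an in-memory flag store and illustrative model names; a real setup would read the flags from a config service so they can be flipped without a redeploy.

```python
import random

# Hypothetical flag store; in practice a config service or feature-flag system
# that can be flipped without redeploying.
FLAGS = {"canary_enabled": True, "canary_fraction": 0.05, "canary_model": "model-v2"}

def pick_model_version(stable: str = "model-v1") -> str:
    """Send a small percentage of live traffic to the canary version."""
    if FLAGS["canary_enabled"] and random.random() < FLAGS["canary_fraction"]:
        return FLAGS["canary_model"]
    return stable

def maybe_auto_rollback(canary_flag_rate: float, baseline_flag_rate: float) -> None:
    """Kill the canary if safety-classifier flags rise more than 5% over baseline."""
    if canary_flag_rate > baseline_flag_rate * 1.05:
        FLAGS["canary_enabled"] = False  # instant rollback, no redeployment
```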

Prompt Versioning and Deployment

  • Treat prompts as code: version-control in git, code review, CI eval pipeline
  • Immutable prompt versions: tag with a hash or semver; never mutate a deployed version in place
  • A/B test prompt changes on live traffic with statistical significance checks
  • Rollback should be one command: flip the active prompt version in your config store
Do: Enforce peer review for prompt changes the same as for code changes
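
A sketch of immutable, hash-tagged prompt versions with one-call rollback; the in-process dicts stand in for a real config store, and the function names are hypothetical.

```python
import hashlib

PROMPTS: dict[str, str] = {}   # version hash -> prompt text, never mutated in place
ACTIVE: dict[str, str] = {}    # task -> currently deployed version hash

def register_prompt(text: str) -> str:
    """Create an immutable prompt version identified by a content hash."""
    version = hashlib.sha256(text.encode()).hexdigest()[:12]
    PROMPTS[version] = text
    return version

def set_active(task: str, version: str) -> None:
    """Deploy or roll back: flipping the active pointer is the whole operation."""
    ACTIVE[task] = version

# Illustrative usage: rollback is one call with the previous hash.
v1 = register_prompt("Summarize the document in three bullet points.")
set_active("summarize", v1)
```
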
02

Tracing

Distributed Tracing for LLM Pipelines

  • Propagate a trace ID through every step: retrieval, model call, tool use, response
  • Record latency, token counts, and error codes at each span
  • OpenTelemetry + an LLM-aware backend (Langfuse, Arize, LangSmith) is the standard stack
  • Sample 100% during development; reduce to 1–10% in production based on cost
Do: Attach trace IDs to user-reported issues so you can replay the exact pipeline execution
Don't: Log only the final output — without the full trace, debugging multi-step failures is nearly impossible
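
A sketch of span-per-step tracing using the OpenTelemetry Python SDK with ratio-based sampling; `retrieve` and `call_model` are stubs for the real pipeline steps, and the attribute names are illustrative rather than a required convention.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Sample 10% of traces; swap the console exporter for your LLM-aware backend.
provider = TracerProvider(sampler=TraceIdRatioBased(0.10))
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-pipeline")

def retrieve(question: str) -> list[str]:          # stub for the real retriever
    return ["doc-1", "doc-2"]

def call_model(question: str, docs: list[str]):    # stub for the real model call
    return "answer", {"input_tokens": 120, "output_tokens": 45}

def answer(question: str) -> str:
    with tracer.start_as_current_span("pipeline"):  # one trace ID spans every step
        with tracer.start_as_current_span("retrieval") as span:
            docs = retrieve(question)
            span.set_attribute("retrieval.num_docs", len(docs))
        with tracer.start_as_current_span("model_call") as span:
            reply, usage = call_model(question, docs)
            span.set_attribute("llm.input_tokens", usage["input_tokens"])
            span.set_attribute("llm.output_tokens", usage["output_tokens"])
        return reply
```
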
03

Cost control

Token Cost Optimization

  • Shorten prompts: remove redundant instructions, examples, and padding
  • Route simple queries to smaller, cheaper models; use large models only when needed
  • Batch non-urgent requests using async batch API (e.g., Anthropic Batch API — 50% cost reduction)
  • Caching (prompt + response) eliminates cost entirely for repeated queries
Do: Establish a cost-per-useful-output metric and track it alongside quality
Don't: Over-engineer prompt compression — measure token counts before and after to confirm savings
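
One way to confirm that compression actually saves tokens and to cache repeated queries; this assumes the `tiktoken` tokenizer (other providers expose their own counters) and a hypothetical `call_model` function.

```python
import hashlib
import tiktoken  # OpenAI-style tokenizer; use your provider's counter if different

enc = tiktoken.get_encoding("cl100k_base")

def token_savings(before: str, after: str) -> float:
    """Percent of tokens removed by a prompt-compression change."""
    b, a = len(enc.encode(before)), len(enc.encode(after))
    return 100.0 * (b - a) / b

# Prompt + response cache: repeated identical queries cost nothing.
_cache: dict[str, str] = {}

def cached_call(prompt: str, call_model) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)  # assumed model-call function
    return _cache[key]
```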

Cost Anomaly Detection

  • Set per-request token budget limits to cap runaway completions
  • Alert when daily spend exceeds 2× the rolling 7-day average
  • Prompt injection and agent loops are common causes of unexpected cost spikes
Do: Add max_tokens and a cost alert from day one — cost surprises happen at the worst times
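
A sketch of the 2× rolling-average alert; the numbers and the alerting hook are made up, and the per-request cap is just the max_tokens parameter on whichever client you use.

```python
from statistics import mean

def spend_alert(daily_spend: list[float], today: float, factor: float = 2.0) -> bool:
    """True when today's spend exceeds 2x the rolling 7-day average."""
    return today > factor * mean(daily_spend[-7:])

# Illustrative usage.
history = [41.0, 39.5, 44.2, 40.8, 42.1, 43.0, 40.6]
if spend_alert(history, today=95.0):
    print("cost anomaly: page the on-call")  # hook into your real alerting here
```
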
04

Prompt versioning

Prompt as Code

  • Store prompts in version-controlled files (YAML, TOML, or Python) alongside the application code
  • Use semantic versioning or content hashes to identify prompt versions in logs
  • Link each deployed prompt version to the eval results that qualified it for production
  • Prompt registries (LangSmith, Langfuse, PromptLayer) provide audit trails and A/B testing
Do: Require a passing eval suite to promote any prompt version from staging to production
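
A sketch of that promotion gate, assuming each registry entry records the eval run that qualified it; the names, paths, and 95% threshold are illustrative.

```python
# Hypothetical registry entry: every deployed version links to its eval results.
PROMPT_REGISTRY = {
    "summarize@1.3.0": {
        "file": "prompts/summarize.yaml",
        "content_hash": "a3f9c2d1e8b4",
        "eval_run": "evals/summarize-run-042",  # illustrative path
        "eval_pass_rate": 0.97,
    }
}

def promote(version: str, min_pass_rate: float = 0.95) -> None:
    """Staging -> production only with a passing eval suite on record."""
    entry = PROMPT_REGISTRY[version]
    if entry["eval_pass_rate"] < min_pass_rate:
        raise ValueError(f"{version} failed the eval gate ({entry['eval_pass_rate']:.0%})")
    # flip the active version pointer in the config store here
```
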
05

Safety checks

Production Safety Pipeline

  • Input classifier: block or flag policy-violating input before it reaches the LLM
  • Output classifier: check model response for policy violations before returning to user
  • PII detection: scan outputs for exposed personal data and redact before delivery
  • Keep safety-classification cost bounded: use lightweight classifier models for high-volume checks
Do: Layer multiple safety checks: input filter + model alignment + output filter
Don't: Run a heavy LLM-based safety check on every single token — cost will be prohibitive at scale
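
A layered-check sketch: the stub classifiers stand in for lightweight moderation models, the regex covers only emails rather than real PII detection, and `call_model` is an assumed function.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # minimal stand-in for a PII detector

def input_check(text: str) -> bool:
    """Stub input classifier; in production a small moderation model or API."""
    return "ignore previous instructions" not in text.lower()

def output_check(text: str) -> bool:
    """Stub output classifier for policy violations."""
    return True

def redact_pii(text: str) -> str:
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def safe_answer(user_input: str, call_model) -> str:
    if not input_check(user_input):    # layer 1: filter before the LLM
        return "Sorry, I can't help with that."
    reply = call_model(user_input)     # layer 2: the aligned model itself
    if not output_check(reply):        # layer 3: check before returning
        return "Sorry, I can't help with that."
    return redact_pii(reply)           # layer 4: redact exposed personal data
```
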
06

Regression suites

LLM Regression Testing

  • Curate a golden test set covering critical behaviors: format, accuracy, safety, and edge cases
  • Run the suite on every prompt change and model version update
  • Regression threshold: define the acceptable degradation up front (e.g., ≤2% drop in pass rate); anything worse blocks the release
  • Capture failures from production into the regression suite — real failure modes are the best test cases
Do: Build the regression suite from production incidents — mine your failure logs for edge cases
Don't: Only test on synthetic or easy examples — the suite won't catch real-world regressions
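
A sketch of the release gate, assuming a golden set of dicts and a `grade` function that returns True/False per case; the 2-point threshold mirrors the example above.

```python
def run_regression(golden_set, call_model, grade,
                   baseline_pass_rate: float, max_drop: float = 0.02) -> bool:
    """Block the release if the pass rate drops more than 2 points below baseline."""
    passed = sum(grade(case, call_model(case["input"])) for case in golden_set)
    pass_rate = passed / len(golden_set)
    print(f"pass rate {pass_rate:.1%} vs baseline {baseline_pass_rate:.1%}")
    return pass_rate >= baseline_pass_rate - max_drop

# A golden-set entry mined from a production incident might look like:
# {"input": "Summarize: ...", "expect_format": "bullets", "source": "prod failure log"}
```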