Cheatsheet
LLMOps
Operational practices for shipping and maintaining LLM-powered systems reliably. Interviewers look for experience with rollout safety, cost control, and regression prevention.
01
Rollouts
Canary and Shadow Deployments
- Canary: send a small percentage of live traffic to the new model version; monitor before full rollout
- Shadow mode: duplicate requests to new version in parallel; compare outputs without affecting users
- Feature flags allow instant rollback without redeployment
- Auto-rollback: trigger on quality metric drop (e.g., >5% increase in safety classifier flags)
✓
Default to canary + shadow before any model or prompt change touches >1% of users
✗
Big-bang deploy a new model version without a staged rollout and rollback plan
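A minimal sketch of the canary routing and auto-rollback trigger described above, in Python. The model IDs, the in-memory feature flag, and the `call_model` / `safety_flag` stubs are illustrative assumptions; a real deployment would read the flag from a config store and call an actual safety classifier.

```python
import random

# Hypothetical in-memory feature flag; in production this lives in a config store
# so it can be flipped without a redeploy.
ROLLOUT = {"stable": "model-v1", "canary": "model-v2", "canary_pct": 0.05}

# Rolling counters used by the auto-rollback check.
stats = {"stable": {"calls": 0, "flags": 0}, "canary": {"calls": 0, "flags": 0}}

def call_model(model_id: str, prompt: str) -> str:
    return f"[{model_id}] response to: {prompt}"          # stub for a real model client

def safety_flag(response: str) -> bool:
    return False                                          # stub for a lightweight safety classifier

def pick_version() -> str:
    """Send a small slice of traffic to the canary; everything else stays on stable."""
    return "canary" if random.random() < ROLLOUT["canary_pct"] else "stable"

def should_rollback(min_calls: int = 500, max_relative_increase: float = 0.05) -> bool:
    """Auto-rollback if the canary's safety-flag rate is >5% worse than stable's."""
    s, c = stats["stable"], stats["canary"]
    if s["calls"] < min_calls or c["calls"] < min_calls:
        return False                                      # not enough traffic to judge yet
    stable_rate = s["flags"] / s["calls"]
    canary_rate = c["flags"] / c["calls"]
    return canary_rate > stable_rate * (1 + max_relative_increase)

def handle_request(prompt: str) -> str:
    version = pick_version()
    response = call_model(ROLLOUT[version], prompt)
    stats[version]["calls"] += 1
    stats[version]["flags"] += int(safety_flag(response))
    if should_rollback():
        ROLLOUT["canary_pct"] = 0.0                       # instant rollback: stop canary traffic
    return response
```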
Prompt Versioning and Deployment
- Treat prompts as code: version-control in git, code review, CI eval pipeline
- Immutable prompt versions: tag with a hash or semver; never mutate a deployed version in place
- A/B test prompt changes on live traffic with statistical significance checks
- Rollback should be one command: flip the active prompt version in your config store
✓
Enforce peer review for prompt changes the same as for code changes
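One way the immutable-version and one-command-rollback ideas might look in code. The file layout, task name, and in-memory `ACTIVE` mapping are assumptions; in practice the active version would live in your config store.

```python
import hashlib
from pathlib import Path

PROMPT_DIR = Path("prompts")            # hypothetical: prompt files live next to the app code
ACTIVE = {"summarize": "a3f91c2e"}      # hypothetical config store: task -> active version hash

def prompt_hash(text: str) -> str:
    """Content hash used as an immutable version identifier in logs and deploys."""
    return hashlib.sha256(text.encode()).hexdigest()[:8]

def register_prompt(task: str, text: str) -> str:
    """Write the prompt to a hash-named file; a deployed version is never edited in place."""
    version = prompt_hash(text)
    path = PROMPT_DIR / task / f"{version}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        path.write_text(text)
    return version

def load_active_prompt(task: str) -> str:
    """Rollback is one change: point ACTIVE[task] back at a previous hash."""
    return (PROMPT_DIR / task / f"{ACTIVE[task]}.txt").read_text()
```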
02
Tracing
Distributed Tracing for LLM Pipelines
- Propagate a trace ID through every step: retrieval, model call, tool use, response
- Record latency, token counts, and error codes at each span
- OpenTelemetry + an LLM-aware backend (Langfuse, Arize, LangSmith) is the standard stack
- Sample 100% during development; reduce to 1–10% in production based on cost
✓
Attach trace IDs to user-reported issues so you can replay the exact pipeline execution
✗
Log only the final output — without the full trace, debugging multi-step failures is nearly impossible
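A sketch of span-per-step tracing with the OpenTelemetry Python SDK, exporting to the console for illustration. The `retrieve` and `generate` stubs stand in for a real retriever and model client; an LLM-aware backend (Langfuse, Arize, LangSmith) would replace the console exporter.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("llm-pipeline")

def retrieve(question: str) -> list[str]:
    return ["doc-1", "doc-2"]                                     # stub retriever

def generate(question: str, docs: list[str]) -> tuple[str, dict]:
    return "answer", {"input_tokens": 120, "output_tokens": 45}   # stub model client

def answer(question: str) -> str:
    # The parent span carries one trace ID through every step of the pipeline.
    with tracer.start_as_current_span("rag_request"):
        with tracer.start_as_current_span("retrieval") as span:
            docs = retrieve(question)
            span.set_attribute("retrieval.num_docs", len(docs))
        with tracer.start_as_current_span("llm_call") as span:
            response, usage = generate(question, docs)
            span.set_attribute("llm.input_tokens", usage["input_tokens"])
            span.set_attribute("llm.output_tokens", usage["output_tokens"])
        return response

print(answer("What is a canary deployment?"))
```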
03
Cost control
Token Cost Optimization
- Shorten prompts: remove redundant instructions, examples, and padding
- Route simple queries to smaller, cheaper models; use large models only when needed
- Batch non-urgent requests through an async batch API (e.g., the Anthropic Batch API offers a 50% cost reduction)
- Caching: prompt caching cuts input-token cost on repeated prefixes; exact-match response caching skips the model call entirely for repeated queries
✓
Establish a cost-per-useful-output metric and track it alongside quality
✗
Over-engineer prompt compression — measure token counts before and after to confirm savings
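An illustrative heuristic router plus an exact-match response cache, assuming a hypothetical `call_model` client; the model names, keyword list, and length threshold are placeholders, not recommendations.

```python
import functools

def call_model(model_id: str, prompt: str) -> str:
    return f"[{model_id}] response"                       # stub for a real model client

def pick_model(query: str) -> str:
    """Heuristic router: cheap model for short, simple queries; large model otherwise."""
    hard_keywords = ("analyze", "compare", "write code", "step by step")
    simple = len(query) < 200 and not any(k in query.lower() for k in hard_keywords)
    return "small-cheap-model" if simple else "large-capable-model"

@functools.lru_cache(maxsize=4096)
def cached_completion(model_id: str, prompt: str) -> str:
    """Exact-match response cache: a repeated (model, prompt) pair never hits the API again."""
    return call_model(model_id, prompt)

def answer(query: str) -> str:
    return cached_completion(pick_model(query), query)
```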
Cost Anomaly Detection
- Set per-request token budget limits to cap runaway completions
- Alert when daily spend exceeds 2× the rolling 7-day average
- Prompt injection and agent loops are common causes of unexpected cost spikes
✓
Add max_tokens and a cost alert from day one — cost surprises happen at the worst times
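A minimal sketch of the 2× rolling-average alert, assuming you can record per-request token usage and per-million-token prices; the class name and thresholds are illustrative.

```python
from collections import deque

class SpendMonitor:
    """Fires an alert when a day's spend exceeds `multiplier` x the rolling average
    of the previous `window_days` days (thresholds are illustrative)."""

    def __init__(self, window_days: int = 7, multiplier: float = 2.0):
        self.history = deque(maxlen=window_days)
        self.multiplier = multiplier
        self.today = 0.0

    def record_request(self, input_tokens: int, output_tokens: int,
                       in_price: float, out_price: float) -> None:
        # Prices are per-million-token rates for whichever model served the request.
        self.today += (input_tokens * in_price + output_tokens * out_price) / 1_000_000

    def close_day(self) -> bool:
        """Roll the day over; return True if spend is anomalous versus the baseline."""
        alert = bool(self.history) and self.today > self.multiplier * (sum(self.history) / len(self.history))
        self.history.append(self.today)
        self.today = 0.0
        return alert
```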
04
Prompt versioning
Prompt as Code
- Store prompts in version-controlled files (YAML, TOML, or Python) alongside the application code
- Use semantic versioning or content hashes to identify prompt versions in logs
- Link each deployed prompt version to the eval results that qualified it for production
- Prompt registries (LangSmith, Langfuse, PromptLayer) provide audit trails and A/B testing
✓
Require a passing eval suite to promote any prompt version from staging to production
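One possible shape for a version-controlled prompt record that links the prompt to the eval run that qualified it, with a promotion gate in Python. The YAML fields, file path, and 0.95 threshold are assumptions, not a standard format.

```python
import yaml  # pip install pyyaml

# Hypothetical prompt record, e.g. stored as prompts/summarize/a3f91c2e.yaml in the repo.
EXAMPLE_RECORD = """
task: summarize
version: a3f91c2e            # content hash of the template
template: |
  Summarize the following document in three bullet points:
  {document}
eval:
  suite: summarize_golden_v3
  pass_rate: 0.97            # result of the eval run that qualified this version
"""

def can_promote(record: dict, min_pass_rate: float = 0.95) -> bool:
    """Gate: a prompt version is only promotable if its linked eval run passed the bar."""
    return record.get("eval", {}).get("pass_rate", 0.0) >= min_pass_rate

record = yaml.safe_load(EXAMPLE_RECORD)
print(record["version"], "promotable:", can_promote(record))
```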
05
Safety checks
Production Safety Pipeline
- Input classifier: block or flag policy-violating input before it reaches the LLM
- Output classifier: check model response for policy violations before returning to user
- PII detection: scan outputs for exposed personal data and redact before delivery
- Keep safety-classification cost in check: use lightweight models for high-volume checks
✓
Layer multiple safety checks: input filter + model alignment + output filter
✗
Run a heavy LLM-based safety check on every single request; the cost becomes prohibitive at scale
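A toy layered pipeline showing where the input filter, output check, and PII redaction sit relative to the model call. The regex patterns and stub classifiers are placeholders; production systems would call lightweight moderation models at those points.

```python
import re

def call_model(model_id: str, prompt: str) -> str:
    return "Sure! You can reach support at help@example.com."   # stub model client

def input_check(user_text: str) -> bool:
    """Cheap first-pass input filter; real systems call a lightweight moderation model here."""
    blocked = [r"ignore (all|previous) instructions"]            # illustrative pattern only
    return not any(re.search(p, user_text, re.IGNORECASE) for p in blocked)

def output_check(model_text: str) -> bool:
    """Stub output classifier; in production this is a second, output-side safety model."""
    return True

def redact_pii(text: str) -> str:
    """Illustrative PII redaction: mask email addresses before delivery."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED EMAIL]", text)

def safe_respond(user_text: str, refusal: str = "Sorry, I can't help with that.") -> str:
    if not input_check(user_text):                               # layer 1: input filter
        return refusal
    response = call_model("aligned-model", user_text)            # layer 2: model-level alignment
    if not output_check(response):                               # layer 3: output filter
        return refusal
    return redact_pii(response)                                  # final pass: PII redaction

print(safe_respond("How do I contact support?"))
```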
06
Regression suites
LLM Regression Testing
- Curate a golden test set covering critical behaviors: format, accuracy, safety, and edge cases
- Run the suite on every prompt change and model version update
- Regression threshold: define the acceptable degradation (e.g., ≤2% drop in pass rate) beyond which a release is blocked
- Capture failures from production into the regression suite — real failure modes are the best test cases
✓
Build the regression suite from production incidents — mine your failure logs for edge cases
✗
Only test on synthetic or easy examples — the suite won't catch real-world regressions
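A minimal golden-suite runner that blocks a release when the pass rate drops more than the allowed amount below baseline. The JSONL format, the substring check, and the thresholds are illustrative; many teams swap in an LLM judge as the checker.

```python
import json

def run_regression(golden_path: str, generate, baseline_pass_rate: float = 0.95,
                   max_drop: float = 0.02) -> bool:
    """Run the golden suite against `generate`; return False (block the release) if the
    pass rate falls more than `max_drop` below the recorded baseline."""
    with open(golden_path) as f:
        # Assumed format: one JSON object per line, e.g. {"input": "...", "must_contain": "..."}
        cases = [json.loads(line) for line in f if line.strip()]

    passed = 0
    for case in cases:
        output = generate(case["input"])
        # Simple substring check for illustration; swap in an LLM judge or exact-match scorer as needed.
        if case["must_contain"].lower() in output.lower():
            passed += 1

    pass_rate = passed / len(cases)
    print(f"pass rate: {pass_rate:.1%} (baseline {baseline_pass_rate:.1%})")
    return pass_rate >= baseline_pass_rate - max_drop
```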