Cheatsheet

LLMOps

Operational practices for shipping and maintaining LLM-powered systems reliably. Interviewers look for experience with rollout safety, cost control, and regression prevention.

01

Rollouts

Canary and Shadow Deployments

  • Canary: send a small percentage of live traffic to the new model version; monitor before full rollout
  • Shadow mode: duplicate requests to new version in parallel; compare outputs without affecting users
  • Feature flags allow instant rollback without redeployment
  • Auto-rollback: trigger on quality metric drop (e.g., >5% increase in safety classifier flags)
Do: Default to canary + shadow before any model or prompt change touches >1% of users (routing sketch below)
Don't: Big-bang deploy a new model version without a staged rollout and rollback plan
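
A minimal sketch of canary routing with a flag-based kill switch, assuming an in-memory flag store and illustrative model names; a real setup would read the flags from a config service so they can be flipped without a redeploy.

```python
import random

# Hypothetical flag store; in practice a config service or feature-flag system
# that can be flipped without redeploying.
FLAGS = {"canary_enabled": True, "canary_fraction": 0.05, "canary_model": "model-v2"}

def pick_model_version(stable: str = "model-v1") -> str:
    """Send a small percentage of live traffic to the canary version."""
    if FLAGS["canary_enabled"] and random.random() < FLAGS["canary_fraction"]:
        return FLAGS["canary_model"]
    return stable

def maybe_auto_rollback(canary_flag_rate: float, baseline_flag_rate: float) -> None:
    """Kill the canary if safety-classifier flags rise more than 5% over baseline."""
    if canary_flag_rate > baseline_flag_rate * 1.05:
        FLAGS["canary_enabled"] = False  # instant rollback, no redeployment
```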

Prompt Versioning and Deployment

  • Treat prompts as code: version-control in git, code review, CI eval pipeline
  • Immutable prompt versions: tag with a hash or semver; never mutate a deployed version in place
  • A/B test prompt changes on live traffic with statistical significance checks
  • Rollback should be one command: flip the active prompt version in your config store
Do: Enforce peer review for prompt changes the same as for code changes
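
A sketch of immutable, hash-tagged prompt versions with one-call rollback; the in-process dicts stand in for a real config store, and the function names are hypothetical.

```python
import hashlib

PROMPTS: dict[str, str] = {}   # version hash -> prompt text, never mutated in place
ACTIVE: dict[str, str] = {}    # task -> currently deployed version hash

def register_prompt(text: str) -> str:
    """Create an immutable prompt version identified by a content hash."""
    version = hashlib.sha256(text.encode()).hexdigest()[:12]
    PROMPTS[version] = text
    return version

def set_active(task: str, version: str) -> None:
    """Deploy or roll back: flipping the active pointer is the whole operation."""
    ACTIVE[task] = version

# Illustrative usage: rollback is one call with the previous hash.
v1 = register_prompt("Summarize the document in three bullet points.")
set_active("summarize", v1)
```
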
02

Tracing

Distributed Tracing for LLM Pipelines

  • Propagate a trace ID through every step: retrieval, model call, tool use, response
  • Record latency, token counts, and error codes at each span
  • OpenTelemetry + an LLM-aware backend (Langfuse, Arize, LangSmith) is the standard stack
  • Sample 100% during development; reduce to 1–10% in production based on cost
Do: Attach trace IDs to user-reported issues so you can replay the exact pipeline execution
Don't: Log only the final output — without the full trace, debugging multi-step failures is nearly impossible
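
A sketch of span-per-step tracing using the OpenTelemetry Python SDK with ratio-based sampling; `retrieve` and `call_model` are stubs for the real pipeline steps, and the attribute names are illustrative rather than a required convention.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Sample 10% of traces; swap the console exporter for your LLM-aware backend.
provider = TracerProvider(sampler=TraceIdRatioBased(0.10))
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-pipeline")

def retrieve(question: str) -> list[str]:          # stub for the real retriever
    return ["doc-1", "doc-2"]

def call_model(question: str, docs: list[str]):    # stub for the real model call
    return "answer", {"input_tokens": 120, "output_tokens": 45}

def answer(question: str) -> str:
    with tracer.start_as_current_span("pipeline"):  # one trace ID spans every step
        with tracer.start_as_current_span("retrieval") as span:
            docs = retrieve(question)
            span.set_attribute("retrieval.num_docs", len(docs))
        with tracer.start_as_current_span("model_call") as span:
            reply, usage = call_model(question, docs)
            span.set_attribute("llm.input_tokens", usage["input_tokens"])
            span.set_attribute("llm.output_tokens", usage["output_tokens"])
        return reply
```
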
03

Cost control

Token Cost Optimization

  • Shorten prompts: remove redundant instructions, examples, and padding
  • Route simple queries to smaller, cheaper models; use large models only when needed
  • Batch non-urgent requests using async batch API (e.g., Anthropic Batch API — 50% cost reduction)
  • Caching (prompt + response) eliminates cost entirely for repeated queries
Do: Establish a cost-per-useful-output metric and track it alongside quality
Don't: Over-engineer prompt compression — measure token counts before and after to confirm savings
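
One way to confirm that compression actually saves tokens and to cache repeated queries; this assumes the `tiktoken` tokenizer (other providers expose their own counters) and a hypothetical `call_model` function.

```python
import hashlib
import tiktoken  # OpenAI-style tokenizer; use your provider's counter if different

enc = tiktoken.get_encoding("cl100k_base")

def token_savings(before: str, after: str) -> float:
    """Percent of tokens removed by a prompt-compression change."""
    b, a = len(enc.encode(before)), len(enc.encode(after))
    return 100.0 * (b - a) / b

# Prompt + response cache: repeated identical queries cost nothing.
_cache: dict[str, str] = {}

def cached_call(prompt: str, call_model) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)  # assumed model-call function
    return _cache[key]
```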

Cost Anomaly Detection

  • Set per-request token budget limits to cap runaway completions
  • Alert when daily spend exceeds 2× the rolling 7-day average
  • Prompt injection and agent loops are common causes of unexpected cost spikes
Do: Add max_tokens and a cost alert from day one — cost surprises happen at the worst times
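
A sketch of the 2× rolling-average alert; the numbers and the alerting hook are made up, and the per-request cap is just the max_tokens parameter on whichever client you use.

```python
from statistics import mean

def spend_alert(daily_spend: list[float], today: float, factor: float = 2.0) -> bool:
    """True when today's spend exceeds 2x the rolling 7-day average."""
    return today > factor * mean(daily_spend[-7:])

# Illustrative usage.
history = [41.0, 39.5, 44.2, 40.8, 42.1, 43.0, 40.6]
if spend_alert(history, today=95.0):
    print("cost anomaly: page the on-call")  # hook into your real alerting here
```
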
04

Prompt versioning

Prompt as Code

  • Store prompts in version-controlled files (YAML, TOML, or Python) alongside the application code
  • Use semantic versioning or content hashes to identify prompt versions in logs
  • Link each deployed prompt version to the eval results that qualified it for production
  • Prompt registries (LangSmith, Langfuse, PromptLayer) provide audit trails and A/B testing
Do: Require a passing eval suite to promote any prompt version from staging to production
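
A sketch of that promotion gate, assuming each registry entry records the eval run that qualified it; the names, paths, and 95% threshold are illustrative.

```python
# Hypothetical registry entry: every deployed version links to its eval results.
PROMPT_REGISTRY = {
    "summarize@1.3.0": {
        "file": "prompts/summarize.yaml",
        "content_hash": "a3f9c2d1e8b4",
        "eval_run": "evals/summarize-run-042",  # illustrative path
        "eval_pass_rate": 0.97,
    }
}

def promote(version: str, min_pass_rate: float = 0.95) -> None:
    """Staging -> production only with a passing eval suite on record."""
    entry = PROMPT_REGISTRY[version]
    if entry["eval_pass_rate"] < min_pass_rate:
        raise ValueError(f"{version} failed the eval gate ({entry['eval_pass_rate']:.0%})")
    # flip the active version pointer in the config store here
```
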
05

Safety checks

Production Safety Pipeline

  • Input classifier: block or flag policy-violating input before it reaches the LLM
  • Output classifier: check model response for policy violations before returning to user
  • PII detection: scan outputs for exposed personal data and redact before delivery
  • Keep safety-classification cost bounded: use lightweight classifier models for high-volume checks
Do: Layer multiple safety checks: input filter + model alignment + output filter
Don't: Run a heavy LLM-based safety check on every single token — cost will be prohibitive at scale
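
A layered-check sketch: the stub classifiers stand in for lightweight moderation models, the regex covers only emails rather than real PII detection, and `call_model` is an assumed function.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # minimal stand-in for a PII detector

def input_check(text: str) -> bool:
    """Stub input classifier; in production a small moderation model or API."""
    return "ignore previous instructions" not in text.lower()

def output_check(text: str) -> bool:
    """Stub output classifier for policy violations."""
    return True

def redact_pii(text: str) -> str:
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def safe_answer(user_input: str, call_model) -> str:
    if not input_check(user_input):    # layer 1: filter before the LLM
        return "Sorry, I can't help with that."
    reply = call_model(user_input)     # layer 2: the aligned model itself
    if not output_check(reply):        # layer 3: check before returning
        return "Sorry, I can't help with that."
    return redact_pii(reply)           # layer 4: redact exposed personal data
```
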
06

Regression suites

LLM Regression Testing

  • Curate a golden test set covering critical behaviors: format, accuracy, safety, and edge cases
  • Run the suite on every prompt change and model version update
  • Regression threshold: define the acceptable degradation up front (e.g., ≤2% drop in pass rate); anything worse blocks the release
  • Capture failures from production into the regression suite — real failure modes are the best test cases
Do: Build the regression suite from production incidents — mine your failure logs for edge cases
Don't: Only test on synthetic or easy examples — the suite won't catch real-world regressions
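
A sketch of the release gate, assuming a golden set of dicts and a `grade` function that returns True/False per case; the 2-point threshold mirrors the example above.

```python
def run_regression(golden_set, call_model, grade,
                   baseline_pass_rate: float, max_drop: float = 0.02) -> bool:
    """Block the release if the pass rate drops more than 2 points below baseline."""
    passed = sum(grade(case, call_model(case["input"])) for case in golden_set)
    pass_rate = passed / len(golden_set)
    print(f"pass rate {pass_rate:.1%} vs baseline {baseline_pass_rate:.1%}")
    return pass_rate >= baseline_pass_rate - max_drop

# A golden-set entry mined from a production incident might look like:
# {"input": "Summarize: ...", "expect_format": "bullets", "source": "prod failure log"}
```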