← All cheatsheets
Cheatsheet
Prompt Engineering
Techniques for reliably controlling LLM behavior through prompt design. Interviewers look for systematic thinking about instruction clarity, robustness, and measurable evaluation.
01
Instruction design
System vs User Prompt Separation
- System prompt sets persona, constraints, and output format before user content
- Keeps policy rules out of the user turn so they are harder to override
- Some models weight system prompt instructions more than user turn by default
- Privileged instructions in system prompt are still vulnerable to prompt injection
✓
Put invariant rules (tone, format, safety) in the system prompt; keep user content in the user turn
✗
Assume system prompt is a security boundary — it is a convention, not a hard sandbox
Clarity and Specificity
- Ambiguous instructions cause high output variance — be explicit about format and scope
- Enumerate steps for multi-part tasks rather than using vague verbs like 'analyze'
- Negative constraints ('do not include…') are often less reliable than positive framing
- Providing output examples in the prompt is one of the highest-ROI improvements
✓
Show the model exactly what the output should look like rather than only describing it
Chain-of-Thought (CoT)
- 'Think step by step' or explicit reasoning instructions unlock latent model reasoning ability
- Zero-shot CoT: just adding 'think step by step' to the prompt
- Few-shot CoT: include worked examples with reasoning traces
- Longer thinking increases accuracy but also increases token cost and latency
✓
Use CoT for multi-step reasoning tasks (math, logic, code); skip it for simple classification
✗
Always force CoT — it wastes tokens on trivial tasks and can overthink simple answers
Persona and Role Assignment
- Assigning a role ('You are an expert security engineer') can improve domain-specific outputs
- Effectiveness varies across models; some are more responsive to role assignment than others
- Over-indexed personas can cause the model to refuse reasonable requests
- Combine role assignment with concrete output instructions for best results
02
Few-shot prompting
Few-Shot Examples
- Include 3–8 input/output pairs in the prompt to demonstrate expected behavior
- Example quality matters more than quantity — one bad example degrades the pattern
- Order effects: the last few examples have disproportionate influence
- Diverse examples covering edge cases outperform near-duplicate easy cases
✓
Curate and version-control your few-shot examples as you would training data
✗
Use examples that all look alike — the model won't learn the general pattern
In-Context Learning (ICL) Mechanics
- LLMs infer task format and rules from examples without updating weights
- Performance scales with model size — smaller models benefit less from many shots
- ICL is sensitive to label noise; even one wrong example can shift predictions significantly
- Dynamic example retrieval (selecting similar examples per query) often outperforms fixed sets
✓
Consider dynamic few-shot retrieval when you have a large labeled example library
Zero-Shot vs Few-Shot Decision
- Zero-shot: rely entirely on the model's pretrained knowledge and instruction following
- Few-shot: provide exemplars when task format is non-standard or accuracy is critical
- Few-shot costs more tokens and increases latency; justify with measured accuracy gain
- Fine-tuning outperforms few-shot prompting when labeled data is plentiful and format is stable
✓
Start with zero-shot, measure performance, then add few-shot or fine-tuning only if needed
03
Guardrails
Input and Output Guardrails
- Input guardrails: classify or filter user input before it reaches the model
- Output guardrails: validate or rewrite model responses before returning to the user
- NLP classifiers, regex rules, and secondary LLM judges are common guard implementations
- Defense in depth: stack multiple lightweight checks rather than one heavy one
✓
Describe guardrails as a separate layer, not just prompt instructions, when asked about safety
✗
Rely solely on system prompt instructions to block harmful content
Jailbreak Resistance
- Role-play, hypothetical framing, and base64 encoding are common jailbreak vectors
- Prompt injection can override system instructions via user-provided content
- RLHF alignment helps but is not a complete defense — treat all user input as untrusted
- Red-teaming your prompts before production is standard practice
✓
Treat the system prompt as policy documentation, not a security boundary
Refusal Calibration
- Over-refusal (false positives) degrades user experience and can be a liability
- Under-refusal (false negatives) creates safety and reputational risk
- Calibrate by measuring refusal rate on a policy-negative test set vs a benign test set
- Per-use-case thresholds are often better than a single global threshold
✓
Report both false-positive and false-negative refusal rates when evaluating safety
✗
Only measure harmful content miss rate — ignoring over-refusal harms product quality
04
Output schemas
Structured Output / JSON Mode
- Constrained decoding or post-processing forces model output to match a schema
- Most frontier APIs support native JSON mode or function-calling schemas
- Using Pydantic / TypeScript interfaces as schema source reduces schema drift
- Validate schema compliance programmatically; don't trust the model never to deviate
✓
Use the API's native structured output feature over ad-hoc parsing when available
✗
Parse free-text output with regex in production — schema validation is more robust
Grammar-Constrained Generation
- Libraries (Outlines, Guidance, LMQL) constrain tokens to a formal grammar at decode time
- Guarantees schema compliance at the token level, not just as a post-process
- Adds negligible latency; useful when JSON mode isn't available via API
- Works well for SQL, code generation, and classification tasks with fixed output sets
✓
Mention grammar-constrained decoding when discussing locally hosted model deployments
05
Prompt evals
Prompt Regression Testing
- Run a fixed eval suite before and after every prompt change
- Even 'obviously better' prompt edits can regress edge cases
- Golden test sets: curated examples with expected outputs or labeled criteria
- CI pipeline integration ensures evals block bad prompt deploys
✓
Treat prompt changes like code changes: version-controlled, peer-reviewed, and tested
✗
Ship prompt changes based on manual spot-checking alone
Prompt Sensitivity Analysis
- Small wording changes can cause large output distribution shifts
- Test paraphrased versions of your prompt to measure variance
- Capitalization, punctuation, and whitespace can all affect outputs
- Model updates from providers can silently change prompt behavior
✓
Schedule periodic re-evaluation after model version updates from your API provider
06
Tool prompts
Tool / Function Calling
- Model emits a structured tool call (name + arguments) instead of free text
- The host application executes the tool and returns results to the model
- Tool descriptions in the prompt heavily influence when and how tools are called
- Too many tools in one prompt degrades selection accuracy; keep the tool list focused
✓
Write tool descriptions as if documenting a public API: clear name, purpose, and param types
✗
Register 20+ tools in one prompt without measuring selection accuracy degradation
Parallel and Sequential Tool Calls
- Modern APIs support parallel tool calls in one response — reduces round-trip latency
- Sequential calls (tool output feeds next call) require multiple model turns
- Design tools to be idempotent where possible — agents may call them multiple times
- Surface tool errors clearly so the model can recover or escalate
✓
Prefer parallel tool calls for independent data fetches; use sequential only when there is a data dependency