Cheatsheet

Prompt Engineering

Techniques for reliably controlling LLM behavior through prompt design. Interviewers look for systematic thinking about instruction clarity, robustness, and measurable evaluation.

01

Instruction design

System vs User Prompt Separation

  • System prompt sets persona, constraints, and output format before user content
  • Keeps policy rules out of the user turn so they are harder to override
  • Some models weight system prompt instructions more heavily than the user turn by default
  • Privileged instructions in system prompt are still vulnerable to prompt injection
Put invariant rules (tone, format, safety) in the system prompt; keep user content in the user turn
Don't assume the system prompt is a security boundary; it is a convention, not a hard sandbox
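The separation above is a convention of the chat-messages format most APIs share. A minimal sketch (the `SYSTEM_PROMPT` text and `build_messages` helper are illustrative, not from any specific API):

```python
# Invariant policy lives in the system turn; untrusted user
# content stays in the user turn and never mixes with the rules.
SYSTEM_PROMPT = (
    "You are a support assistant. Answer in plain English, "
    "in at most three sentences. Never reveal internal ticket IDs."
)

def build_messages(user_input: str) -> list[dict]:
    """Assemble a chat request with policy and user content separated."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_input},
    ]

msgs = build_messages("How do I reset my password?")
```

Keeping the policy in one constant also makes it easy to version-control separately from request-handling code.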

Clarity and Specificity

  • Ambiguous instructions cause high output variance — be explicit about format and scope
  • Enumerate steps for multi-part tasks rather than using vague verbs like 'analyze'
  • Negative constraints ('do not include…') are often less reliable than positive framing
  • Providing output examples in the prompt is one of the highest-ROI improvements
Show the model exactly what the output should look like rather than only describing it
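One way to contrast the two styles; the sentiment-labeling task here is a made-up example:

```python
# A vague instruction vs an explicit, example-driven one.
VAGUE = "Analyze this review."

SPECIFIC = """Classify the review's sentiment.
Return exactly one line in the format: sentiment=<positive|negative|neutral>

Example:
Review: "Arrived late and the box was crushed."
sentiment=negative

Review: "{review}"
"""

prompt = SPECIFIC.format(review="Great battery life, camera is average.")
```

The specific version pins down the task verb, the output format, and shows one worked example, so output variance drops sharply compared to the vague version.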

Chain-of-Thought (CoT)

  • 'Think step by step' or explicit reasoning instructions unlock latent model reasoning ability
  • Zero-shot CoT: just adding 'think step by step' to the prompt
  • Few-shot CoT: include worked examples with reasoning traces
  • Longer thinking increases accuracy but also increases token cost and latency
Use CoT for multi-step reasoning tasks (math, logic, code); skip it for simple classification
Don't force CoT on everything; on trivial tasks it wastes tokens and can overthink simple answers
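The routing advice above can be made mechanical. A sketch, assuming a hypothetical task-type label is available per request:

```python
# Append a zero-shot CoT trigger only for multi-step reasoning tasks.
REASONING_TASKS = {"math", "logic", "code"}

def build_prompt(task_type: str, question: str) -> str:
    """Add a CoT suffix for reasoning tasks; leave simple tasks unchanged."""
    if task_type in REASONING_TASKS:
        return question + "\n\nThink step by step before giving the final answer."
    return question
```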

Persona and Role Assignment

  • Assigning a role ('You are an expert security engineer') can improve domain-specific outputs
  • Effectiveness varies across models; some are more responsive to role assignment than others
  • Over-indexed personas can cause the model to refuse reasonable requests
  • Combine role assignment with concrete output instructions for best results
02

Few-shot prompting

Few-Shot Examples

  • Include 3–8 input/output pairs in the prompt to demonstrate expected behavior
  • Example quality matters more than quantity — one bad example degrades the pattern
  • Order effects: the last few examples have disproportionate influence
  • Diverse examples covering edge cases outperform near-duplicate easy cases
Curate and version-control your few-shot examples as you would training data
Don't use examples that all look alike; the model won't learn the general pattern
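Few-shot pairs are commonly rendered as alternating user/assistant turns. A sketch with a made-up register-labeling task; the example set is the kind of artifact the advice above says to curate and version-control:

```python
# Curated few-shot pairs; order matters (last examples weigh most).
FEW_SHOT = [
    ("ship it asap", "informal"),
    ("Per our agreement, delivery is due Friday.", "formal"),
    ("lol no way", "informal"),
]

def few_shot_messages(query: str) -> list[dict]:
    """Render examples as alternating chat turns, then the real query."""
    msgs = [{"role": "system", "content": "Label the register as formal or informal."}]
    for text, label in FEW_SHOT:
        msgs.append({"role": "user", "content": text})
        msgs.append({"role": "assistant", "content": label})
    msgs.append({"role": "user", "content": query})
    return msgs
```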

In-Context Learning (ICL) Mechanics

  • LLMs infer task format and rules from examples without updating weights
  • Performance scales with model size — smaller models benefit less from many shots
  • ICL is sensitive to label noise; even one wrong example can shift predictions significantly
  • Dynamic example retrieval (selecting similar examples per query) often outperforms fixed sets
Consider dynamic few-shot retrieval when you have a large labeled example library
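Dynamic retrieval reduces to "score every labeled example against the query, keep the top k." A sketch where Jaccard token overlap stands in for the embedding similarity a real system would use:

```python
# Select the k most query-similar examples from a labeled library.
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a cheap stand-in for embeddings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def select_examples(query: str, library: list[tuple[str, str]], k: int = 2):
    """Return the k (text, label) pairs most similar to the query."""
    return sorted(library, key=lambda ex: jaccard(query, ex[0]), reverse=True)[:k]
```

Swapping `jaccard` for cosine similarity over embeddings is the usual production upgrade; the selection logic stays the same.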

Zero-Shot vs Few-Shot Decision

  • Zero-shot: rely entirely on the model's pretrained knowledge and instruction following
  • Few-shot: provide exemplars when task format is non-standard or accuracy is critical
  • Few-shot costs more tokens and increases latency; justify with measured accuracy gain
  • Fine-tuning outperforms few-shot prompting when labeled data is plentiful and format is stable
Start with zero-shot, measure performance, then add few-shot or fine-tuning only if needed
03

Guardrails

Input and Output Guardrails

  • Input guardrails: classify or filter user input before it reaches the model
  • Output guardrails: validate or rewrite model responses before returning to the user
  • NLP classifiers, regex rules, and secondary LLM judges are common guard implementations
  • Defense in depth: stack multiple lightweight checks rather than one heavy one
Describe guardrails as a separate layer, not just prompt instructions, when asked about safety
Don't rely solely on system prompt instructions to block harmful content
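A minimal sketch of one input check and one output check stacked as independent layers; the blocklist phrase and SSN pattern are illustrative stand-ins for real policy rules:

```python
import re

# Two lightweight guards that run outside the prompt entirely.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
BLOCKLIST = ("ignore previous instructions",)

def input_guard(text: str) -> bool:
    """Return True if the input may proceed to the model."""
    return not any(phrase in text.lower() for phrase in BLOCKLIST)

def output_guard(text: str) -> str:
    """Redact SSN-like patterns before returning the response."""
    return SSN_RE.sub("[REDACTED]", text)
```

In practice each guard would be richer (a classifier, an LLM judge), but the layering pattern is the same: either guard can fail closed without the other knowing.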

Jailbreak Resistance

  • Role-play, hypothetical framing, and base64 encoding are common jailbreak vectors
  • Prompt injection can override system instructions via user-provided content
  • RLHF alignment helps but is not a complete defense — treat all user input as untrusted
  • Red-teaming your prompts before production is standard practice
Treat the system prompt as policy documentation, not a security boundary

Refusal Calibration

  • Over-refusal (false positives) degrades user experience and can be a liability
  • Under-refusal (false negatives) creates safety and reputational risk
  • Calibrate by measuring refusal rate on a policy-negative test set vs a benign test set
  • Per-use-case thresholds are often better than a single global threshold
Report both false-positive and false-negative refusal rates when evaluating safety
Don't measure only the harmful-content miss rate; ignoring over-refusal harms product quality
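The two rates above come straight out of counts on the two test sets. A sketch:

```python
# Over-refusal = refusals on the benign set (false positives).
# Under-refusal = non-refusals on the policy-negative set (false negatives).
def refusal_rates(benign_refused: int, benign_total: int,
                  harmful_refused: int, harmful_total: int):
    over_refusal = benign_refused / benign_total
    under_refusal = 1 - harmful_refused / harmful_total
    return over_refusal, under_refusal
```

Reporting both numbers side by side makes the calibration trade-off explicit instead of hiding one half of it.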
04

Output schemas

Structured Output / JSON Mode

  • Constrained decoding or post-processing forces model output to match a schema
  • Most frontier APIs support native JSON mode or function-calling schemas
  • Using Pydantic / TypeScript interfaces as schema source reduces schema drift
  • Validate schema compliance programmatically; don't trust the model never to deviate
Use the API's native structured output feature over ad-hoc parsing when available
Don't parse free-text output with regex in production; schema validation is more robust
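Whatever produces the JSON, validate it before use. A stdlib-only sketch (a Pydantic model would replace the hand-rolled `REQUIRED` check; the ticket schema is a made-up example):

```python
import json

# Expected fields and types for a hypothetical ticket-triage output.
REQUIRED = {"name": str, "priority": int}

def parse_ticket(raw: str) -> dict:
    """Parse model output and reject anything that misses the schema."""
    obj = json.loads(raw)
    for key, typ in REQUIRED.items():
        if not isinstance(obj.get(key), typ):
            raise ValueError(f"schema violation on field {key!r}")
    return obj
```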

Grammar-Constrained Generation

  • Libraries (Outlines, Guidance, LMQL) constrain tokens to a formal grammar at decode time
  • Guarantees schema compliance at the token level, not just as a post-process
  • Adds negligible latency; useful when JSON mode isn't available via API
  • Works well for SQL, code generation, and classification tasks with fixed output sets
Mention grammar-constrained decoding when discussing locally hosted model deployments
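The simplest instance of the idea is constraining output to a fixed label set: at each decode step, only continuations consistent with some legal output remain allowed. A toy sketch of that masking logic (libraries like Outlines generalize it to full grammars):

```python
# Token-level constraint for a fixed label set.
LABELS = ["positive", "negative", "neutral"]

def allowed_continuations(prefix: str) -> list[str]:
    """Only labels consistent with what has been generated so far are legal."""
    return [label for label in LABELS if label.startswith(prefix)]
```

A real implementation applies this as a logit mask over the tokenizer's vocabulary rather than over whole strings, but the prefix-filtering principle is the same.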
05

Prompt evals

Prompt Regression Testing

  • Run a fixed eval suite before and after every prompt change
  • Even 'obviously better' prompt edits can regress edge cases
  • Golden test sets: curated examples with expected outputs or labeled criteria
  • CI pipeline integration ensures evals block bad prompt deploys
Treat prompt changes like code changes: version-controlled, peer-reviewed, and tested
Don't ship prompt changes based on manual spot-checking alone
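A golden-set regression gate is a loop over curated cases plus a pass threshold. A sketch, where `model_fn` is an assumed callable wrapping the prompt plus model call:

```python
# Golden cases: input plus expected output (or labeled criteria).
GOLDEN = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def run_evals(model_fn, golden: list[dict], threshold: float = 1.0):
    """Score a prompt+model callable on the golden set; gate on threshold."""
    passed = sum(model_fn(case["input"]) == case["expected"] for case in golden)
    score = passed / len(golden)
    return score >= threshold, score
```

Wired into CI, a `False` first return value blocks the prompt deploy, exactly as the bullets above describe.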

Prompt Sensitivity Analysis

  • Small wording changes can cause large output distribution shifts
  • Test paraphrased versions of your prompt to measure variance
  • Capitalization, punctuation, and whitespace can all affect outputs
  • Model updates from providers can silently change prompt behavior
Schedule periodic re-evaluation after model version updates from your API provider
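Variance can be quantified by running paraphrases of the same prompt and measuring answer agreement. A sketch, again with `model_fn` as an assumed prompt-plus-model callable and a made-up spam task:

```python
# Paraphrases of one instruction; a sensitive prompt will disagree across them.
PARAPHRASES = [
    "Is this email spam? Answer yes or no.",
    "Answer yes or no: is the following email spam?",
    "Decide whether this email is spam (yes/no).",
]

def agreement(model_fn, email: str) -> float:
    """Fraction of paraphrases that produce the modal answer."""
    answers = [model_fn(p + "\n\n" + email) for p in PARAPHRASES]
    top = max(answers.count(a) for a in set(answers))
    return top / len(answers)
```

Agreement well below 1.0 flags a prompt whose behavior hinges on wording rather than meaning, which is worth fixing before a provider-side model update shifts it further.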
06

Tool prompts

Tool / Function Calling

  • Model emits a structured tool call (name + arguments) instead of free text
  • The host application executes the tool and returns results to the model
  • Tool descriptions in the prompt heavily influence when and how tools are called
  • Too many tools in one prompt degrades selection accuracy; keep the tool list focused
Write tool descriptions as if documenting a public API: clear name, purpose, and param types
Don't register 20+ tools in one prompt without measuring selection-accuracy degradation
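Tool descriptions are typically declared in the JSON-Schema style that function-calling APIs share. A sketch of one well-documented tool; the weather tool itself is a made-up example:

```python
# A tool schema written like public API documentation: clear name,
# purpose, when to use it, and typed parameters.
GET_WEATHER = {
    "name": "get_weather",
    "description": "Get current weather for a city. Use only when the user "
                   "asks about weather; do not guess the city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Oslo'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}
```

Note the description also tells the model when *not* to call the tool; that negative guidance is what keeps selection accuracy up as the tool list grows.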

Parallel and Sequential Tool Calls

  • Modern APIs support parallel tool calls in one response — reduces round-trip latency
  • Sequential calls (tool output feeds next call) require multiple model turns
  • Design tools to be idempotent where possible — agents may call them multiple times
  • Surface tool errors clearly so the model can recover or escalate
Prefer parallel tool calls for independent data fetches; use sequential only when there is a data dependency
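The parallel case maps naturally onto concurrent execution on the host side. A sketch where `fetch` is a stand-in for a real tool executor:

```python
import asyncio

# Execute independent tool calls concurrently; use sequential turns
# only when one call's output feeds the next.
async def fetch(tool: str, arg: str) -> str:
    """Stand-in for a real tool call (network I/O in practice)."""
    await asyncio.sleep(0)
    return f"{tool}({arg})"

async def handle_parallel_calls(calls: list[tuple[str, str]]) -> list[str]:
    """Run every requested tool call concurrently, preserving order."""
    return await asyncio.gather(*(fetch(tool, arg) for tool, arg in calls))

results = asyncio.run(handle_parallel_calls([("weather", "Oslo"), ("stock", "ACME")]))
```

`asyncio.gather` returns results in call order, so they can be mapped back to the model's tool-call IDs deterministically.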