Cheatsheet

Prompt Engineering

Techniques for reliably controlling LLM behavior through prompt design. Interviewers look for systematic thinking about instruction clarity, robustness, and measurable evaluation.

01

Instruction design

System vs User Prompt Separation

  • System prompt sets persona, constraints, and output format before user content
  • Keeps policy rules out of the user turn so they are harder to override
  • Some models weight system prompt instructions more heavily than the user turn by default
  • Privileged instructions in system prompt are still vulnerable to prompt injection
Put invariant rules (tone, format, safety) in the system prompt; keep user content in the user turn
Don't assume the system prompt is a security boundary; it is a convention, not a hard sandbox
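The separation above is a convention of the chat-messages format most APIs share. A minimal sketch (the `SYSTEM_PROMPT` text and `build_messages` helper are illustrative, not from any specific API):

```python
# Invariant policy lives in the system turn; untrusted user
# content stays in the user turn and never mixes with the rules.
SYSTEM_PROMPT = (
    "You are a support assistant. Answer in plain English, "
    "in at most three sentences. Never reveal internal ticket IDs."
)

def build_messages(user_input: str) -> list[dict]:
    """Assemble a chat request with policy and user content separated."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_input},
    ]

msgs = build_messages("How do I reset my password?")
```

Keeping the policy in one constant also makes it easy to version-control separately from request-handling code.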

Clarity and Specificity

  • Ambiguous instructions cause high output variance — be explicit about format and scope
  • Enumerate steps for multi-part tasks rather than using vague verbs like 'analyze'
  • Negative constraints ('do not include…') are often less reliable than positive framing
  • Providing output examples in the prompt is one of the highest-ROI improvements
Show the model exactly what the output should look like rather than only describing it
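One way to contrast the two styles; the sentiment-labeling task here is a made-up example:

```python
# A vague instruction vs an explicit, example-driven one.
VAGUE = "Analyze this review."

SPECIFIC = """Classify the review's sentiment.
Return exactly one line in the format: sentiment=<positive|negative|neutral>

Example:
Review: "Arrived late and the box was crushed."
sentiment=negative

Review: "{review}"
"""

prompt = SPECIFIC.format(review="Great battery life, camera is average.")
```

The specific version pins down the task verb, the output format, and shows one worked example, so output variance drops sharply compared to the vague version.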

Chain-of-Thought (CoT)

  • 'Think step by step' or explicit reasoning instructions unlock latent model reasoning ability
  • Zero-shot CoT: just adding 'think step by step' to the prompt
  • Few-shot CoT: include worked examples with reasoning traces
  • Longer thinking increases accuracy but also increases token cost and latency
Use CoT for multi-step reasoning tasks (math, logic, code); skip it for simple classification
Don't force CoT on everything; on trivial tasks it wastes tokens and can overthink simple answers
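The routing advice above can be made mechanical. A sketch, assuming a hypothetical task-type label is available per request:

```python
# Append a zero-shot CoT trigger only for multi-step reasoning tasks.
REASONING_TASKS = {"math", "logic", "code"}

def build_prompt(task_type: str, question: str) -> str:
    """Add a CoT suffix for reasoning tasks; leave simple tasks unchanged."""
    if task_type in REASONING_TASKS:
        return question + "\n\nThink step by step before giving the final answer."
    return question
```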

Persona and Role Assignment

  • Assigning a role ('You are an expert security engineer') can improve domain-specific outputs
  • Effectiveness varies across models; some are more responsive to role assignment than others
  • Over-indexed personas can cause the model to refuse reasonable requests
  • Combine role assignment with concrete output instructions for best results
02

Few-shot prompting

Few-Shot Examples

  • Include 3–8 input/output pairs in the prompt to demonstrate expected behavior
  • Example quality matters more than quantity — one bad example degrades the pattern
  • Order effects: the last few examples have disproportionate influence
  • Diverse examples covering edge cases outperform near-duplicate easy cases
Curate and version-control your few-shot examples as you would training data
Don't use examples that all look alike; the model won't learn the general pattern
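Few-shot pairs are commonly rendered as alternating user/assistant turns. A sketch with a made-up register-labeling task; the example set is the kind of artifact the advice above says to curate and version-control:

```python
# Curated few-shot pairs; order matters (last examples weigh most).
FEW_SHOT = [
    ("ship it asap", "informal"),
    ("Per our agreement, delivery is due Friday.", "formal"),
    ("lol no way", "informal"),
]

def few_shot_messages(query: str) -> list[dict]:
    """Render examples as alternating chat turns, then the real query."""
    msgs = [{"role": "system", "content": "Label the register as formal or informal."}]
    for text, label in FEW_SHOT:
        msgs.append({"role": "user", "content": text})
        msgs.append({"role": "assistant", "content": label})
    msgs.append({"role": "user", "content": query})
    return msgs
```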

In-Context Learning (ICL) Mechanics

  • LLMs infer task format and rules from examples without updating weights
  • Performance scales with model size — smaller models benefit less from many shots
  • ICL is sensitive to label noise; even one wrong example can shift predictions significantly
  • Dynamic example retrieval (selecting similar examples per query) often outperforms fixed sets
Consider dynamic few-shot retrieval when you have a large labeled example library
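Dynamic retrieval reduces to "score every labeled example against the query, keep the top k." A sketch where Jaccard token overlap stands in for the embedding similarity a real system would use:

```python
# Select the k most query-similar examples from a labeled library.
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a cheap stand-in for embeddings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def select_examples(query: str, library: list[tuple[str, str]], k: int = 2):
    """Return the k (text, label) pairs most similar to the query."""
    return sorted(library, key=lambda ex: jaccard(query, ex[0]), reverse=True)[:k]
```

Swapping `jaccard` for cosine similarity over embeddings is the usual production upgrade; the selection logic stays the same.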

Zero-Shot vs Few-Shot Decision

  • Zero-shot: rely entirely on the model's pretrained knowledge and instruction following
  • Few-shot: provide exemplars when task format is non-standard or accuracy is critical
  • Few-shot costs more tokens and increases latency; justify with measured accuracy gain
  • Fine-tuning outperforms few-shot prompting when labeled data is plentiful and format is stable
Start with zero-shot, measure performance, then add few-shot or fine-tuning only if needed
03

Guardrails

Input and Output Guardrails

  • Input guardrails: classify or filter user input before it reaches the model
  • Output guardrails: validate or rewrite model responses before returning to the user
  • NLP classifiers, regex rules, and secondary LLM judges are common guard implementations
  • Defense in depth: stack multiple lightweight checks rather than one heavy one
Describe guardrails as a separate layer, not just prompt instructions, when asked about safety
Don't rely solely on system prompt instructions to block harmful content
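A minimal sketch of one input check and one output check stacked as independent layers; the blocklist phrase and SSN pattern are illustrative stand-ins for real policy rules:

```python
import re

# Two lightweight guards that run outside the prompt entirely.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
BLOCKLIST = ("ignore previous instructions",)

def input_guard(text: str) -> bool:
    """Return True if the input may proceed to the model."""
    return not any(phrase in text.lower() for phrase in BLOCKLIST)

def output_guard(text: str) -> str:
    """Redact SSN-like patterns before returning the response."""
    return SSN_RE.sub("[REDACTED]", text)
```

In practice each guard would be richer (a classifier, an LLM judge), but the layering pattern is the same: either guard can fail closed without the other knowing.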

Jailbreak Resistance

  • Role-play, hypothetical framing, and base64 encoding are common jailbreak vectors
  • Prompt injection can override system instructions via user-provided content
  • RLHF alignment helps but is not a complete defense — treat all user input as untrusted
  • Red-teaming your prompts before production is standard practice
Treat the system prompt as policy documentation, not a security boundary

Refusal Calibration

  • Over-refusal (false positives) degrades user experience and can be a liability
  • Under-refusal (false negatives) creates safety and reputational risk
  • Calibrate by measuring refusal rate on a policy-negative test set vs a benign test set
  • Per-use-case thresholds are often better than a single global threshold
Report both false-positive and false-negative refusal rates when evaluating safety
Don't measure only the harmful-content miss rate; ignoring over-refusal harms product quality
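The two rates above come straight out of counts on the two test sets. A sketch:

```python
# Over-refusal = refusals on the benign set (false positives).
# Under-refusal = non-refusals on the policy-negative set (false negatives).
def refusal_rates(benign_refused: int, benign_total: int,
                  harmful_refused: int, harmful_total: int):
    over_refusal = benign_refused / benign_total
    under_refusal = 1 - harmful_refused / harmful_total
    return over_refusal, under_refusal
```

Reporting both numbers side by side makes the calibration trade-off explicit instead of hiding one half of it.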
04

Output schemas

Structured Output / JSON Mode

  • Constrained decoding or post-processing forces model output to match a schema
  • Most frontier APIs support native JSON mode or function-calling schemas
  • Using Pydantic / TypeScript interfaces as schema source reduces schema drift
  • Validate schema compliance programmatically; don't trust the model never to deviate
Use the API's native structured output feature over ad-hoc parsing when available
Don't parse free-text output with regex in production; schema validation is more robust
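Whatever produces the JSON, validate it before use. A stdlib-only sketch (a Pydantic model would replace the hand-rolled `REQUIRED` check; the ticket schema is a made-up example):

```python
import json

# Expected fields and types for a hypothetical ticket-triage output.
REQUIRED = {"name": str, "priority": int}

def parse_ticket(raw: str) -> dict:
    """Parse model output and reject anything that misses the schema."""
    obj = json.loads(raw)
    for key, typ in REQUIRED.items():
        if not isinstance(obj.get(key), typ):
            raise ValueError(f"schema violation on field {key!r}")
    return obj
```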

Grammar-Constrained Generation

  • Libraries (Outlines, Guidance, LMQL) constrain tokens to a formal grammar at decode time
  • Guarantees schema compliance at the token level, not just as a post-process
  • Adds negligible latency; useful when JSON mode isn't available via API
  • Works well for SQL, code generation, and classification tasks with fixed output sets
Mention grammar-constrained decoding when discussing locally hosted model deployments
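The simplest instance of the idea is constraining output to a fixed label set: at each decode step, only continuations consistent with some legal output remain allowed. A toy sketch of that masking logic (libraries like Outlines generalize it to full grammars):

```python
# Token-level constraint for a fixed label set.
LABELS = ["positive", "negative", "neutral"]

def allowed_continuations(prefix: str) -> list[str]:
    """Only labels consistent with what has been generated so far are legal."""
    return [label for label in LABELS if label.startswith(prefix)]
```

A real implementation applies this as a logit mask over the tokenizer's vocabulary rather than over whole strings, but the prefix-filtering principle is the same.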
05

Prompt evals

Prompt Regression Testing

  • Run a fixed eval suite before and after every prompt change
  • Even 'obviously better' prompt edits can regress edge cases
  • Golden test sets: curated examples with expected outputs or labeled criteria
  • CI pipeline integration ensures evals block bad prompt deploys
Treat prompt changes like code changes: version-controlled, peer-reviewed, and tested
Don't ship prompt changes based on manual spot-checking alone
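A golden-set regression gate is a loop over curated cases plus a pass threshold. A sketch, where `model_fn` is an assumed callable wrapping the prompt plus model call:

```python
# Golden cases: input plus expected output (or labeled criteria).
GOLDEN = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def run_evals(model_fn, golden: list[dict], threshold: float = 1.0):
    """Score a prompt+model callable on the golden set; gate on threshold."""
    passed = sum(model_fn(case["input"]) == case["expected"] for case in golden)
    score = passed / len(golden)
    return score >= threshold, score
```

Wired into CI, a `False` first return value blocks the prompt deploy, exactly as the bullets above describe.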

Prompt Sensitivity Analysis

  • Small wording changes can cause large output distribution shifts
  • Test paraphrased versions of your prompt to measure variance
  • Capitalization, punctuation, and whitespace can all affect outputs
  • Model updates from providers can silently change prompt behavior
Schedule periodic re-evaluation after model version updates from your API provider
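Variance can be quantified by running paraphrases of the same prompt and measuring answer agreement. A sketch, again with `model_fn` as an assumed prompt-plus-model callable and a made-up spam task:

```python
# Paraphrases of one instruction; a sensitive prompt will disagree across them.
PARAPHRASES = [
    "Is this email spam? Answer yes or no.",
    "Answer yes or no: is the following email spam?",
    "Decide whether this email is spam (yes/no).",
]

def agreement(model_fn, email: str) -> float:
    """Fraction of paraphrases that produce the modal answer."""
    answers = [model_fn(p + "\n\n" + email) for p in PARAPHRASES]
    top = max(answers.count(a) for a in set(answers))
    return top / len(answers)
```

Agreement well below 1.0 flags a prompt whose behavior hinges on wording rather than meaning, which is worth fixing before a provider-side model update shifts it further.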
06

Tool prompts

Tool / Function Calling

  • Model emits a structured tool call (name + arguments) instead of free text
  • The host application executes the tool and returns results to the model
  • Tool descriptions in the prompt heavily influence when and how tools are called
  • Too many tools in one prompt degrades selection accuracy; keep the tool list focused
Write tool descriptions as if documenting a public API: clear name, purpose, and param types
Don't register 20+ tools in one prompt without measuring selection-accuracy degradation
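Tool descriptions are typically declared in the JSON-Schema style that function-calling APIs share. A sketch of one well-documented tool; the weather tool itself is a made-up example:

```python
# A tool schema written like public API documentation: clear name,
# purpose, when to use it, and typed parameters.
GET_WEATHER = {
    "name": "get_weather",
    "description": "Get current weather for a city. Use only when the user "
                   "asks about weather; do not guess the city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Oslo'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}
```

Note the description also tells the model when *not* to call the tool; that negative guidance is what keeps selection accuracy up as the tool list grows.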

Parallel and Sequential Tool Calls

  • Modern APIs support parallel tool calls in one response — reduces round-trip latency
  • Sequential calls (tool output feeds next call) require multiple model turns
  • Design tools to be idempotent where possible — agents may call them multiple times
  • Surface tool errors clearly so the model can recover or escalate
Prefer parallel tool calls for independent data fetches; use sequential only when there is a data dependency
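The parallel case maps naturally onto concurrent execution on the host side. A sketch where `fetch` is a stand-in for a real tool executor:

```python
import asyncio

# Execute independent tool calls concurrently; use sequential turns
# only when one call's output feeds the next.
async def fetch(tool: str, arg: str) -> str:
    """Stand-in for a real tool call (network I/O in practice)."""
    await asyncio.sleep(0)
    return f"{tool}({arg})"

async def handle_parallel_calls(calls: list[tuple[str, str]]) -> list[str]:
    """Run every requested tool call concurrently, preserving order."""
    return await asyncio.gather(*(fetch(tool, arg) for tool, arg in calls))

results = asyncio.run(handle_parallel_calls([("weather", "Oslo"), ("stock", "ACME")]))
```

`asyncio.gather` returns results in call order, so they can be mapped back to the model's tool-call IDs deterministically.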