Cheatsheet
AI Agents
Design patterns for LLM-driven agents that plan, use tools, and complete multi-step tasks. Interviewers focus on reliability, failure modes, and evaluation of open-ended loops.
01
Planning
ReAct (Reason + Act)
- Interleaves reasoning traces (Thought) with tool calls (Action) and results (Observation)
- Grounding reasoning in tool outputs reduces hallucination compared to pure chain-of-thought
- Stopping condition: agent generates a final answer or reaches a maximum step count
- ReAct is the baseline pattern behind most production agents
✓
Describe ReAct when asked how an agent decides which tool to call and when to stop
✗
Assume ReAct agents always converge — they can loop or get stuck without explicit step limits
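The Thought → Action → Observation loop can be sketched in a few lines. This is a minimal illustration, not a real implementation: `fake_model` stands in for an LLM client and `word_count` for a real tool.

```python
def fake_model(history):
    # Stand-in "model": first requests a tool call, then answers from the observation.
    observations = [step for step in history if step[0] == "Observation"]
    if not observations:
        return ("Action", "word_count", "hello agent world")
    return ("Final", f"The text has {observations[-1][1]} words.")

tools = {"word_count": lambda text: len(text.split())}

def react_loop(model, tools, max_steps=5):
    history = []
    for _ in range(max_steps):          # explicit step cap: no unbounded loops
        step = model(history)
        if step[0] == "Final":          # stopping condition: final answer produced
            return step[1], history
        _, name, tool_input = step
        history.append(("Thought", f"calling {name}"))
        history.append(("Observation", tools[name](tool_input)))
    return None, history                # hit max_steps without an answer
```

Note the two exits: a final answer, or the step cap. Returning `None` on cap exhaustion makes the "agent got stuck" case observable instead of silent.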
Plan-and-Execute
- Planner model generates a high-level task list; executor model carries out individual steps
- Separating planning from execution allows specialized prompts and smaller executor models
- Dynamic replanning: executor can request revised plan if an earlier step fails
- Suitable for long-horizon tasks where the full plan can be specified upfront
✓
Use plan-and-execute for complex tasks; cite LangGraph or Autogen as frameworks that support it
Task Decomposition
- Break large tasks into atomic subtasks the model can reliably complete
- Decomposition reduces per-step ambiguity and makes evals easier
- Tree-of-Thought explores multiple decompositions and selects the best branch
- Subtask granularity should match tool capabilities — too coarse → tool misuse, too fine → overhead
✓
Decompose tasks explicitly in your prompt or orchestration logic rather than hoping the model handles it
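One way to decompose in orchestration logic rather than inside a single prompt is a task-to-subtasks map, sized so each subtask maps to one tool call. The map and names here are illustrative.

```python
# Illustrative task map: each subtask is atomic (one tool call each).
SUBTASK_MAP = {
    "write_report": ["gather_sources", "draft_sections", "compile_summary"],
}

def decompose(task):
    # Unknown tasks are treated as already atomic.
    return SUBTASK_MAP.get(task, [task])

def run_task(task, run_subtask):
    # run_subtask stands in for one model/tool invocation per subtask.
    return [run_subtask(sub) for sub in decompose(task)]
```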
02
Tool use
Tool Design Principles
- Each tool should do one thing well — single responsibility reduces misuse
- Return structured data from tools so the model can parse results reliably
- Include explicit error messages and suggested recovery actions in tool outputs
- Idempotent tools are safer — agents may call them multiple times on retries
✓
Treat tool interfaces as a contract; document input schema, output schema, and error modes
✗
Return raw HTML or unstructured text from tools — the model will waste tokens parsing it
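A tool that follows these principles might return JSON with a typed error and a recovery hint. The `ok`/`error`/`fix` schema is one reasonable convention, not a standard.

```python
import json

def search_tool(query: str) -> str:
    """Single-responsibility search tool returning structured JSON."""
    if not query.strip():
        # Typed error plus an explicit recovery action for the model.
        return json.dumps({
            "ok": False,
            "error": "empty_query",
            "fix": "Call again with a non-empty search string.",
        })
    # Stand-in result; a real tool would query a search backend here.
    return json.dumps({"ok": True, "results": [{"title": "Agent patterns"}]})
```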
Tool Selection Accuracy
- Accuracy degrades as tool count increases beyond ~10–15 tools in one prompt
- Categorize tools and dynamically load only relevant ones per query
- Fine-tuning the model on tool-use examples significantly improves selection accuracy
- Measure tool selection precision and recall on a labeled evaluation set
✓
Implement a tool routing layer if you have more than ~15 tools
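A routing layer can be as simple as keyword-to-category matching; a production version would more likely embed tool descriptions and retrieve by similarity. Categories, keywords, and tool names below are illustrative.

```python
TOOL_CATEGORIES = {
    "math": ["calculator", "unit_converter"],
    "web": ["web_search", "fetch_page"],
    "files": ["read_file", "write_file"],
}
CATEGORY_KEYWORDS = {
    "math": ["sum", "convert", "calculate"],
    "web": ["search", "latest", "news"],
    "files": ["file", "save", "open"],
}

def route_tools(query, max_tools=5):
    """Expose only the matching categories' tools to the model."""
    q = query.lower()
    selected = []
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(k in q for k in keywords):
            selected.extend(TOOL_CATEGORIES[category])
    return selected[:max_tools]
```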
03
Memory
Agent Memory Types
- In-context: full conversation history in the prompt; limited by context window
- External (episodic): retrieve past interactions from a vector store
- Semantic: distilled summaries or knowledge stored in structured form
- Procedural: learned behaviors encoded in fine-tuned weights or system prompt instructions
✓
Distinguish memory types when asked; in-context is easiest, external scales further
✗
Treat in-context memory as sufficient for multi-session agents — it resets each conversation
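A toy episodic store makes the external-memory idea concrete. Retrieval here is keyword overlap for simplicity; a production version would use embeddings and a vector store.

```python
class EpisodicMemory:
    """Store past interactions; retrieve the most relevant for the prompt."""

    def __init__(self):
        self.episodes = []

    def add(self, text):
        self.episodes.append(text)

    def retrieve(self, query, k=2):
        # Keyword-overlap scoring as a stand-in for embedding similarity.
        q = set(query.lower().split())
        scored = sorted(self.episodes,
                        key=lambda e: len(q & set(e.lower().split())),
                        reverse=True)
        return scored[:k]
```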
Context Window Management
- Summarize or truncate older turns when context fills up
- Keep the most recent N turns and a running summary of earlier history
- Moving summary window: update the summary incrementally as new turns arrive
- Retrieving relevant past episodes on demand outperforms naive full-history truncation
✓
Describe sliding window summary + episodic retrieval as a production-grade memory strategy
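The sliding-window-plus-summary strategy can be sketched as below. `summarize` stands in for an LLM summarization call; the default here just concatenates, for demonstration only.

```python
def manage_context(turns, summary, keep_last=4, summarize=None):
    """Keep the last N turns verbatim; fold older turns into a running summary."""
    if len(turns) <= keep_last:
        return turns, summary
    summarize = summarize or (
        lambda prev, old: (prev + " | " if prev else "") + "; ".join(old)
    )
    older, recent = turns[:-keep_last], turns[-keep_last:]
    return recent, summarize(summary, older)   # incremental summary update
```

Calling this after every turn keeps the update incremental: only the newly evicted turns are summarized, not the whole history.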
04
Multi-step loops
Agentic Loop Control
- Every agent loop must have a maximum step count / timeout to prevent infinite loops
- Termination conditions: task complete, explicit STOP tool call, or error threshold exceeded
- Checkpointing intermediate state lets you resume or replay from a known good point
- Human-in-the-loop gates can interrupt the loop for high-stakes decisions
✓
Always define termination criteria and max steps before deploying an agent loop
✗
Ship an agent with unbounded loops — it will eventually loop forever on edge cases
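A bounded loop with all three termination paths fits in one function. `step_fn` stands in for one model-plus-tool iteration; the return conventions are illustrative.

```python
def run_agent(step_fn, max_steps=10, max_errors=3):
    """Bounded agent loop: stops on completion, step cap, or error threshold."""
    errors = 0
    for _ in range(max_steps):
        try:
            done, result = step_fn()
        except Exception:
            errors += 1
            if errors >= max_errors:
                return ("aborted", None)   # error threshold exceeded: abort + alert
            continue
        if done:
            return ("done", result)        # task complete
    return ("max_steps", None)             # step budget exhausted
```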
Determinism and Reproducibility
- Set temperature=0 for agents where reproducibility matters more than diversity
- Log every tool call and model response for debugging and replay
- Seed randomness in any sampling steps for reproducible test runs
✓
Log the full agent trajectory (thoughts, tool calls, results) to a structured store for debugging
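Trajectory logging can be one JSONL record per event. The record schema below is illustrative; the point is that every thought, tool call, and result is structured and replayable.

```python
import json

class TrajectoryLogger:
    """Append-only structured log of an agent run, serializable as JSONL."""

    def __init__(self):
        self.records = []

    def log(self, kind, payload):
        self.records.append({"step": len(self.records), "kind": kind,
                             "payload": payload})

    def to_jsonl(self):
        return "\n".join(json.dumps(r) for r in self.records)
```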
05
Failure recovery
Error Handling Patterns
- Return typed errors from tools with a 'how to fix this' field in the schema
- Retry with exponential backoff on transient tool failures (network, rate limits)
- If retries fail, escalate to a different tool or ask for human clarification
- Track consecutive failures; abort and alert after N failed attempts
✓
Design agent error handling as a state machine: try → retry → fallback → escalate → abort
✗
Pass raw exception stack traces to the model — it often hallucinates a fix
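The try → retry → fallback → escalate chain can be sketched as a single function. Treating only `TimeoutError` as transient is an assumption for the demo; catch whatever exception classes your tools actually raise.

```python
import time

def call_with_recovery(tool, fallback, max_retries=3, base_delay=0.01):
    """try -> retry (exponential backoff) -> fallback tool -> escalate."""
    for attempt in range(max_retries):
        try:
            return ("ok", tool())
        except TimeoutError:                      # transient failure class
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    try:
        return ("fallback", fallback())            # escalate to a different tool
    except Exception:
        return ("escalate", None)                  # hand off to a human
```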
Sandboxing and Side-Effect Management
- Run code execution tools in isolated environments (Docker, Firecracker microVMs)
- Limit tool permissions to least privilege: read-only file access unless write is needed
- Dry-run mode: simulate irreversible actions before committing
- Audit log every write operation for compliance and rollback
✓
Scope tool permissions tightly; interviewers look for security awareness in agentic system design
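A dry-run gate plus audit log for an irreversible action might look like this; the function and field names are illustrative, and a real implementation would perform the deletion where the comment indicates.

```python
def delete_records(ids, dry_run=True, audit_log=None):
    """Simulate the irreversible action first; audit every real write."""
    if dry_run:
        return f"DRY RUN: would delete {len(ids)} records"
    if audit_log is not None:
        audit_log.append({"action": "delete", "ids": list(ids)})  # for rollback
    # ... real deletion would happen here ...
    return f"deleted {len(ids)} records"
```

Defaulting `dry_run=True` means the agent must opt in to the destructive path, which is the safer failure mode.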
06
Agent evals
Agent Evaluation Challenges
- Final-answer accuracy alone misses planning and tool-use quality
- Trajectory evaluation: score each step (right tool? right arguments? right handling of the result?)
- Task completion rate and steps-to-completion are key throughput metrics
- Regression suites must cover diverse task types to catch capability drift
✓
Measure trajectory quality and final outcome separately — both matter in production
✗
Eval only on easy tasks; agents fail precisely at edge cases, so adversarial evals are essential
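Trajectory scoring can be computed alongside the final outcome by comparing each step against a labeled reference. The metrics below are one reasonable choice, not a standard.

```python
def score_trajectory(steps, reference):
    """Per-step tool/arg accuracy plus an efficiency signal (extra steps)."""
    n = max(len(reference), 1)
    tool_acc = sum(s.get("tool") == r["tool"]
                   for s, r in zip(steps, reference)) / n
    args_acc = sum(s.get("args") == r["args"]
                   for s, r in zip(steps, reference)) / n
    return {"tool_acc": tool_acc, "args_acc": args_acc,
            "extra_steps": max(len(steps) - len(reference), 0)}
```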
Offline vs Online Agent Evals
- Offline: replay recorded tool responses so evals run fast and cheaply
- Online: run against real tools; more realistic but slower and costlier
- Shadow mode: run new agent version in parallel with production; compare outcomes
- LLM-as-judge can score trajectory quality at scale when human annotation is too expensive
✓
Build an offline eval harness with recorded tool responses before launching online eval
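The recorded-response idea can be sketched as a replay layer that the agent calls instead of live tools. The key format and recordings below are illustrative.

```python
class ReplayTools:
    """Offline eval harness: serve recorded tool responses, deterministically."""

    def __init__(self, recordings):
        # recordings: {(tool_name, sorted_kwargs_tuple): canned_response}
        self.recordings = recordings

    def call(self, tool_name, **kwargs):
        key = (tool_name, tuple(sorted(kwargs.items())))
        if key not in self.recordings:
            # Fail loudly: a missing recording means the trace must be re-recorded.
            raise KeyError(f"no recording for {key}; re-record this trace")
        return self.recordings[key]
```

Failing loudly on a cache miss is deliberate: silently falling through to a live call would make offline runs slow and nondeterministic again.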