Cheatsheet

AI Agents

Design patterns for LLM-driven agents that plan, use tools, and complete multi-step tasks. Interviewers focus on reliability, failure modes, and evaluation of open-ended loops.

01

Planning

ReAct (Reason + Act)

  • Interleaves reasoning traces (Thought) with tool calls (Action) and results (Observation)
  • Grounding reasoning in tool outputs reduces hallucination compared to pure chain-of-thought
  • Stopping condition: agent generates a final answer or reaches a maximum step count
  • ReAct is the baseline pattern behind most production agents
Do: describe ReAct when asked how an agent decides which tool to call and when to stop; a minimal loop is sketched below
Don't: assume ReAct agents always converge; without explicit step limits they can loop or get stuck
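A minimal ReAct loop sketch, assuming a hypothetical `call_llm` client and a `tools` dict of callables; the `Action:`/`Answer:` text format is illustrative, not any particular framework's API:

```python
import re

MAX_STEPS = 10  # explicit step limit so the loop cannot run forever

def react_agent(task: str, call_llm, tools: dict) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(MAX_STEPS):
        # Model emits a Thought followed by either an Action or a final Answer.
        step = call_llm(transcript + "Thought:")
        transcript += f"Thought:{step}\n"
        match = re.search(r"Action:\s*(\w+)\((.*)\)", step)
        if match is None:                           # no tool call: look for a final answer
            answer = re.search(r"Answer:\s*(.*)", step)
            if answer:
                return answer.group(1)
            continue                                # malformed step; prompt again
        name, arg = match.group(1), match.group(2)
        result = tools[name](arg) if name in tools else f"unknown tool {name}"
        transcript += f"Observation: {result}\n"    # ground the next Thought in tool output
    return "Stopped: step limit reached without a final answer."
```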

Plan-and-Execute

  • Planner model generates a high-level task list; executor model carries out individual steps
  • Separating planning from execution allows specialized prompts and smaller executor models
  • Dynamic replanning: executor can request revised plan if an earlier step fails
  • Suitable for long-horizon tasks where the full plan can be specified upfront
Do: use plan-and-execute for complex tasks; cite LangGraph or AutoGen as frameworks that support it
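A plan-and-execute sketch, assuming hypothetical `planner_llm` and `executor_llm` clients and an illustrative `FAILED` convention for executor errors; the replan cap keeps revision itself from looping forever:

```python
MAX_REPLANS = 2  # guard against endless replan cycles

def plan_and_execute(task: str, planner_llm, executor_llm) -> list[str]:
    # Planner writes the whole step list upfront; executor handles one step at a time.
    plan = planner_llm(f"Break this task into numbered steps:\n{task}")
    steps = [s.strip() for s in plan.splitlines() if s.strip()]
    results, replans, i = [], 0, 0
    while i < len(steps):
        result = executor_llm(f"Step: {steps[i]}\nPrior results: {results}")
        if result.startswith("FAILED") and replans < MAX_REPLANS:
            # Dynamic replanning: revise only the remaining steps, then retry.
            revised = planner_llm(
                f"Task: {task}\nStep {i + 1} failed: {result}\nRevise the remaining steps:"
            )
            steps[i:] = [s.strip() for s in revised.splitlines() if s.strip()]
            replans += 1
            continue
        results.append(result)
        i += 1
    return results
```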

Task Decomposition

  • Break large tasks into atomic subtasks the model can reliably complete
  • Decomposition reduces per-step ambiguity and makes evals easier
  • Tree-of-Thoughts explores multiple decompositions and selects the best branch
  • Subtask granularity should match tool capabilities — too coarse → tool misuse, too fine → overhead
Do: decompose tasks explicitly in your prompt or orchestration logic rather than hoping the model handles it
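One way to make decomposition explicit in orchestration code rather than in the prompt; the `SUBTASKS` list and tool names (`search`, `summarize`) are invented for illustration:

```python
# Each subtask is atomic: one tool, one unambiguous instruction, so the
# granularity matches what the tools can reliably do.
SUBTASKS = [
    ("search",    "Find three recent papers on agent evaluation."),
    ("summarize", "Summarize each paper in two sentences."),
    ("summarize", "Synthesize the summaries into one paragraph."),
]

def run_decomposed(tools: dict) -> str:
    context = ""
    for tool_name, instruction in SUBTASKS:
        # Thread prior results through so each subtask builds on the last.
        context = tools[tool_name](f"{instruction}\nContext so far:\n{context}")
    return context
```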
02

Tool use

Tool Design Principles

  • Each tool should do one thing well — single responsibility reduces misuse
  • Return structured data from tools so the model can parse results reliably
  • Include explicit error messages and suggested recovery actions in tool outputs
  • Idempotent tools are safer — agents may call them multiple times on retries
Do: treat tool interfaces as a contract; document input schema, output schema, and error modes
Don't: return raw HTML or unstructured text from tools; the model will waste tokens parsing it
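A sketch of a structured, idempotent tool following these principles; the `ToolResult` schema and `lookup_order` tool are invented for illustration:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ToolResult:
    ok: bool
    data: dict | None = None
    error: str | None = None
    how_to_fix: str | None = None   # suggested recovery action for the agent

def lookup_order(order_id: str) -> str:
    """Single-responsibility, idempotent lookup; safe for the agent to retry."""
    if not order_id.isdigit():
        # Explicit error plus recovery hint, returned as structured JSON.
        return json.dumps(asdict(ToolResult(
            ok=False,
            error=f"invalid order_id {order_id!r}",
            how_to_fix="Pass the numeric order ID only, e.g. '41532'.",
        )))
    return json.dumps(asdict(ToolResult(
        ok=True, data={"order_id": order_id, "status": "shipped"},
    )))
```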

Tool Selection Accuracy

  • Accuracy degrades as tool count increases beyond ~10–15 tools in one prompt
  • Categorize tools and dynamically load only relevant ones per query
  • Fine-tuning the model on tool-use examples significantly improves selection accuracy
  • Measure tool selection precision and recall on a labeled evaluation set
Do: implement a tool routing layer if you have more than ~15 tools
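A toy routing layer, assuming a hypothetical `embed` client; it exposes only the tool category whose description is closest to the query:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def route_tools(query: str, categories: dict, embed) -> list:
    """categories maps a category description string to its list of tool schemas."""
    q = embed(query)
    best = max(categories, key=lambda desc: cosine(q, embed(desc)))
    return categories[best]   # load only this category's tools into the prompt
```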
03

Memory

Agent Memory Types

  • In-context: full conversation history in the prompt; limited by context window
  • External (episodic): retrieve past interactions from a vector store
  • Semantic: distilled summaries or knowledge stored in structured form
  • Procedural: learned behaviors encoded in fine-tuned weights or system prompt instructions
Do: distinguish memory types when asked; in-context is easiest, external scales further
Don't: treat in-context memory as sufficient for multi-session agents; it resets each conversation

Context Window Management

  • Summarize or truncate older turns when context fills up
  • Keep the most recent N turns and a running summary of earlier history
  • Moving summary window: update the summary incrementally as new turns arrive
  • Retrieving relevant past episodes on demand outperforms naive full-history truncation
Do: describe sliding-window summary + episodic retrieval as a production-grade memory strategy
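A sliding-window + running-summary sketch, assuming a hypothetical `summarize_llm` client; episodic retrieval from a vector store would be layered on top of this:

```python
RECENT_TURNS = 8  # keep this many turns verbatim

def build_context(history: list[str], summary: str, summarize_llm):
    """Returns (prompt, trimmed_history, updated_summary)."""
    if len(history) > RECENT_TURNS:
        evicted, history = history[:-RECENT_TURNS], history[-RECENT_TURNS:]
        # Moving summary window: fold evicted turns into the running summary.
        summary = summarize_llm(
            f"Current summary:\n{summary}\n\nFold in these turns:\n" + "\n".join(evicted)
        )
    prompt = (f"Summary of earlier conversation:\n{summary}\n\n"
              "Recent turns:\n" + "\n".join(history))
    return prompt, history, summary
```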
04

Multi-step loops

Agentic Loop Control

  • Every agent loop must have a maximum step count / timeout to prevent infinite loops
  • Termination conditions: task complete, explicit STOP tool call, or error threshold exceeded
  • Checkpointing intermediate state lets you resume or replay from a known good point
  • Human-in-the-loop gates can interrupt the loop for high-stakes decisions
Do: always define termination criteria and max steps before deploying an agent loop
Don't: ship an agent with unbounded loops; it will eventually loop forever on edge cases
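A loop-control sketch showing bounded steps, the termination conditions above, checkpointing, and a human-in-the-loop gate; `run_step`, the state fields, and the checkpoint path are illustrative:

```python
import json, pathlib

MAX_STEPS = 20
MAX_ERRORS = 3

def controlled_loop(state: dict, run_step, checkpoint="agent_state.json") -> dict:
    errors = 0
    for _ in range(MAX_STEPS):                        # hard cap on iterations
        state = run_step(state)
        pathlib.Path(checkpoint).write_text(json.dumps(state))  # resume point
        if state.get("action") == "STOP":             # explicit STOP tool call
            break
        if state.get("error"):
            errors += 1
            if errors >= MAX_ERRORS:                  # error threshold exceeded
                state["aborted"] = True
                break
        if state.get("needs_approval"):               # human-in-the-loop gate
            input("High-stakes action pending; press Enter to approve: ")
    return state
```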

Determinism and Reproducibility

  • Set temperature=0 for agents where reproducibility matters more than diversity
  • Log every tool call and model response for debugging and replay
  • Seed randomness in any sampling steps for reproducible test runs
Do: log the full agent trajectory (thoughts, tool calls, results) to a structured store for debugging
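A minimal JSONL trajectory logger; the record fields and file name are illustrative:

```python
import json, time

def log_event(kind: str, payload: dict, path="trajectory.jsonl") -> None:
    # One append-only record per model response or tool call, for replay.
    record = {"ts": time.time(), "kind": kind, **payload}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Usage inside the agent loop:
# log_event("model", {"prompt": prompt, "response": response})
# log_event("tool",  {"name": name, "args": args, "result": result})
```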
05

Failure recovery

Error Handling Patterns

  • Return typed errors from tools with a 'how to fix this' field in the schema
  • Retry with exponential backoff on transient tool failures (network, rate limits)
  • If retries fail, escalate to a different tool or ask for human clarification
  • Track consecutive failures; abort and alert after N failed attempts
Do: design agent error handling as a state machine: try → retry → fallback → escalate → abort
Don't: pass raw exception stack traces to the model; it often hallucinates a fix
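A sketch of the try → retry → fallback → escalate chain for one tool call; the exception types and fallback tool are illustrative stand-ins for your tools' typed errors:

```python
import time

def call_with_recovery(tool, fallback, args: dict, retries: int = 3):
    delay = 1.0
    for _ in range(retries):
        try:
            return tool(**args)                 # try
        except TimeoutError:                    # transient: retry with backoff
            time.sleep(delay)
            delay *= 2
        except ValueError:                      # permanent: don't bother retrying
            break
    try:
        return fallback(**args)                 # fallback: different tool
    except Exception:
        # escalate: surface to a human rather than looping on a broken path
        raise RuntimeError("escalate: ask a human for clarification")
```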

Sandboxing and Side-Effect Management

  • Run code execution tools in isolated environments (Docker, Firecracker microVMs)
  • Limit tool permissions to least privilege: read-only file access unless write is needed
  • Dry-run mode: simulate irreversible actions before committing
  • Audit log every write operation for compliance and rollback
Do: scope tool permissions tightly; interviewers look for security awareness in agentic system design
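A sandboxed-execution sketch using the Docker CLI with least-privilege flags (no network, read-only filesystem, resource caps); the image and limits are illustrative, and Docker must be installed:

```python
import subprocess

def run_sandboxed(code: str, timeout: int = 10) -> str:
    """Run untrusted agent-generated Python in a throwaway container."""
    result = subprocess.run(
        ["docker", "run", "--rm",
         "--network", "none",                   # no network access
         "--read-only",                         # read-only root filesystem
         "--memory", "256m", "--cpus", "0.5",   # resource caps
         "python:3.12-slim", "python", "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout or result.stderr
```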
06

Agent evals

Agent Evaluation Challenges

  • Final-answer accuracy alone misses planning and tool-use quality
  • Trajectory evaluation: score each step (correct tool?, correct args?, correct handling of result?)
  • Task completion rate and steps-to-completion are key throughput metrics
  • Regression suites must cover diverse task types to catch capability drift
Do: measure trajectory quality and final outcome separately; both matter in production
Don't: eval only on easy tasks; agents fail at edge cases, so adversarial evals are essential
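A trajectory-scoring sketch that grades tool and argument choice per step against a labeled reference, separately from the final answer; the record shape is illustrative:

```python
def score_trajectory(steps: list[dict], expected: list[dict]) -> dict:
    # Step-level grading: did the agent pick the right tool with the right args?
    correct_tool = sum(s["tool"] == e["tool"] for s, e in zip(steps, expected))
    correct_args = sum(
        s["tool"] == e["tool"] and s["args"] == e["args"]
        for s, e in zip(steps, expected)
    )
    n = max(len(expected), 1)
    return {
        "tool_accuracy": correct_tool / n,
        "arg_accuracy": correct_args / n,
        "steps_to_completion": len(steps),   # throughput metric
    }
```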

Offline vs Online Agent Evals

  • Offline: replay recorded tool responses so evals run fast and cheaply
  • Online: run against real tools; more realistic but slower and costlier
  • Shadow mode: run new agent version in parallel with production; compare outcomes
  • LLM-as-judge can score trajectory quality at scale when human annotation is too expensive
Do: build an offline eval harness with recorded tool responses before launching online evals
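An offline replay harness sketch: recorded tool responses, keyed by tool name plus serialized args, stand in for live tools; the recording format is an assumption:

```python
import json

class ReplayTools:
    """Serves recorded tool responses so evals run fast, cheap, and deterministic."""

    def __init__(self, recording_path: str):
        with open(recording_path) as f:
            # e.g. {"search::{\"q\": \"agents\"}": "three results ..."}
            self.recorded = json.load(f)

    def call(self, name: str, args: dict) -> str:
        key = f"{name}::{json.dumps(args, sort_keys=True)}"
        if key not in self.recorded:
            raise KeyError(f"no recording for {key}; re-record or run online")
        return self.recorded[key]
```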