Cheatsheet
AI Agents
Design patterns for LLM-driven agents that plan, use tools, and complete multi-step tasks. Interviewers focus on reliability, failure modes, and evaluation of open-ended loops.
01
Planning
ReAct (Reason + Act)
- Interleaves reasoning traces (Thought) with tool calls (Action) and results (Observation)
- Grounding reasoning in tool outputs reduces hallucination compared to pure chain-of-thought
- Stopping condition: agent generates a final answer or reaches a maximum step count
- ReAct is the baseline pattern behind most production agents
✓
Describe ReAct when asked how an agent decides which tool to call and when to stop
✗
Assume ReAct agents always converge — they can loop or get stuck without explicit step limits
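The Thought → Action → Observation loop can be sketched in a few lines. This is a minimal illustration, not a real implementation: `fake_model` stands in for an LLM client and `word_count` for a real tool.

```python
def fake_model(history):
    # Stand-in "model": first requests a tool call, then answers from the observation.
    observations = [step for step in history if step[0] == "Observation"]
    if not observations:
        return ("Action", "word_count", "hello agent world")
    return ("Final", f"The text has {observations[-1][1]} words.")

tools = {"word_count": lambda text: len(text.split())}

def react_loop(model, tools, max_steps=5):
    history = []
    for _ in range(max_steps):          # explicit step cap: no unbounded loops
        step = model(history)
        if step[0] == "Final":          # stopping condition: final answer produced
            return step[1], history
        _, name, tool_input = step
        history.append(("Thought", f"calling {name}"))
        history.append(("Observation", tools[name](tool_input)))
    return None, history                # hit max_steps without an answer
```

Note the two exits: a final answer, or the step cap. Returning `None` on cap exhaustion makes the "agent got stuck" case observable instead of silent.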
Plan-and-Execute
- Planner model generates a high-level task list; executor model carries out individual steps
- Separating planning from execution allows specialized prompts and smaller executor models
- Dynamic replanning: executor can request revised plan if an earlier step fails
- Suitable for long-horizon tasks where the full plan can be specified upfront
✓
Use plan-and-execute for complex tasks; cite LangGraph or Autogen as frameworks that support it
Task Decomposition
- Break large tasks into atomic subtasks the model can reliably complete
- Decomposition reduces per-step ambiguity and makes evals easier
- Tree-of-Thought explores multiple decompositions and selects the best branch
- Subtask granularity should match tool capabilities — too coarse → tool misuse, too fine → overhead
✓
Decompose tasks explicitly in your prompt or orchestration logic rather than hoping the model handles it
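One way to decompose in orchestration logic rather than inside a single prompt is a task-to-subtasks map, sized so each subtask maps to one tool call. The map and names here are illustrative.

```python
# Illustrative task map: each subtask is atomic (one tool call each).
SUBTASK_MAP = {
    "write_report": ["gather_sources", "draft_sections", "compile_summary"],
}

def decompose(task):
    # Unknown tasks are treated as already atomic.
    return SUBTASK_MAP.get(task, [task])

def run_task(task, run_subtask):
    # run_subtask stands in for one model/tool invocation per subtask.
    return [run_subtask(sub) for sub in decompose(task)]
```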
02
Tool use
Tool Design Principles
- Each tool should do one thing well — single responsibility reduces misuse
- Return structured data from tools so the model can parse results reliably
- Include explicit error messages and suggested recovery actions in tool outputs
- Idempotent tools are safer — agents may call them multiple times on retries
✓
Treat tool interfaces as a contract; document input schema, output schema, and error modes
✗
Return raw HTML or unstructured text from tools — the model will waste tokens parsing it
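A tool that follows these principles might return JSON with a typed error and a recovery hint. The `ok`/`error`/`fix` schema is one reasonable convention, not a standard.

```python
import json

def search_tool(query: str) -> str:
    """Single-responsibility search tool returning structured JSON."""
    if not query.strip():
        # Typed error plus an explicit recovery action for the model.
        return json.dumps({
            "ok": False,
            "error": "empty_query",
            "fix": "Call again with a non-empty search string.",
        })
    # Stand-in result; a real tool would query a search backend here.
    return json.dumps({"ok": True, "results": [{"title": "Agent patterns"}]})
```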
Tool Selection Accuracy
- Accuracy degrades as tool count increases beyond ~10–15 tools in one prompt
- Categorize tools and dynamically load only relevant ones per query
- Fine-tuning the model on tool-use examples significantly improves selection accuracy
- Measure tool selection precision and recall on a labeled evaluation set
✓
Implement a tool routing layer if you have more than ~15 tools
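A routing layer can be as simple as keyword-to-category matching; a production version would more likely embed tool descriptions and retrieve by similarity. Categories, keywords, and tool names below are illustrative.

```python
TOOL_CATEGORIES = {
    "math": ["calculator", "unit_converter"],
    "web": ["web_search", "fetch_page"],
    "files": ["read_file", "write_file"],
}
CATEGORY_KEYWORDS = {
    "math": ["sum", "convert", "calculate"],
    "web": ["search", "latest", "news"],
    "files": ["file", "save", "open"],
}

def route_tools(query, max_tools=5):
    """Expose only the matching categories' tools to the model."""
    q = query.lower()
    selected = []
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(k in q for k in keywords):
            selected.extend(TOOL_CATEGORIES[category])
    return selected[:max_tools]
```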
03
Memory
Agent Memory Types
- In-context: full conversation history in the prompt; limited by context window
- External (episodic): retrieve past interactions from a vector store
- Semantic: distilled summaries or knowledge stored in structured form
- Procedural: learned behaviors encoded in fine-tuned weights or system prompt instructions
✓
Distinguish memory types when asked; in-context is easiest, external scales further
✗
Treat in-context memory as sufficient for multi-session agents — it resets each conversation
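A toy episodic store makes the external-memory idea concrete. Retrieval here is keyword overlap for simplicity; a production version would use embeddings and a vector store.

```python
class EpisodicMemory:
    """Store past interactions; retrieve the most relevant for the prompt."""

    def __init__(self):
        self.episodes = []

    def add(self, text):
        self.episodes.append(text)

    def retrieve(self, query, k=2):
        # Keyword-overlap scoring as a stand-in for embedding similarity.
        q = set(query.lower().split())
        scored = sorted(self.episodes,
                        key=lambda e: len(q & set(e.lower().split())),
                        reverse=True)
        return scored[:k]
```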
Context Window Management
- Summarize or truncate older turns when context fills up
- Keep the most recent N turns and a running summary of earlier history
- Moving summary window: update the summary incrementally as new turns arrive
- Retrieving relevant past episodes on demand outperforms naive full-history truncation
✓
Describe sliding window summary + episodic retrieval as a production-grade memory strategy
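The sliding-window-plus-summary strategy can be sketched as below. `summarize` stands in for an LLM summarization call; the default here just concatenates, for demonstration only.

```python
def manage_context(turns, summary, keep_last=4, summarize=None):
    """Keep the last N turns verbatim; fold older turns into a running summary."""
    if len(turns) <= keep_last:
        return turns, summary
    summarize = summarize or (
        lambda prev, old: (prev + " | " if prev else "") + "; ".join(old)
    )
    older, recent = turns[:-keep_last], turns[-keep_last:]
    return recent, summarize(summary, older)   # incremental summary update
```

Calling this after every turn keeps the update incremental: only the newly evicted turns are summarized, not the whole history.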
04
Multi-step loops
Agentic Loop Control
- Every agent loop must have a maximum step count / timeout to prevent infinite loops
- Termination conditions: task complete, explicit STOP tool call, or error threshold exceeded
- Checkpointing intermediate state lets you resume or replay from a known good point
- Human-in-the-loop gates can interrupt the loop for high-stakes decisions
✓
Always define termination criteria and max steps before deploying an agent loop
✗
Ship an agent with unbounded loops — it will eventually loop forever on edge cases
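A bounded loop with all three termination paths fits in one function. `step_fn` stands in for one model-plus-tool iteration; the return conventions are illustrative.

```python
def run_agent(step_fn, max_steps=10, max_errors=3):
    """Bounded agent loop: stops on completion, step cap, or error threshold."""
    errors = 0
    for _ in range(max_steps):
        try:
            done, result = step_fn()
        except Exception:
            errors += 1
            if errors >= max_errors:
                return ("aborted", None)   # error threshold exceeded: abort + alert
            continue
        if done:
            return ("done", result)        # task complete
    return ("max_steps", None)             # step budget exhausted
```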
Determinism and Reproducibility
- Set temperature=0 for agents where reproducibility matters more than diversity
- Log every tool call and model response for debugging and replay
- Seed randomness in any sampling steps for reproducible test runs
✓
Log the full agent trajectory (thoughts, tool calls, results) to a structured store for debugging
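Trajectory logging can be one JSONL record per event. The record schema below is illustrative; the point is that every thought, tool call, and result is structured and replayable.

```python
import json

class TrajectoryLogger:
    """Append-only structured log of an agent run, serializable as JSONL."""

    def __init__(self):
        self.records = []

    def log(self, kind, payload):
        self.records.append({"step": len(self.records), "kind": kind,
                             "payload": payload})

    def to_jsonl(self):
        return "\n".join(json.dumps(r) for r in self.records)
```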
05
Failure recovery
Error Handling Patterns
- Return typed errors from tools with a 'how to fix this' field in the schema
- Retry with exponential backoff on transient tool failures (network, rate limits)
- If retries fail, escalate to a different tool or ask for human clarification
- Track consecutive failures; abort and alert after N failed attempts
✓
Design agent error handling as a state machine: try → retry → fallback → escalate → abort
✗
Pass raw exception stack traces to the model — it often hallucinates a fix
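The try → retry → fallback → escalate chain can be sketched as a single function. Treating only `TimeoutError` as transient is an assumption for the demo; catch whatever exception classes your tools actually raise.

```python
import time

def call_with_recovery(tool, fallback, max_retries=3, base_delay=0.01):
    """try -> retry (exponential backoff) -> fallback tool -> escalate."""
    for attempt in range(max_retries):
        try:
            return ("ok", tool())
        except TimeoutError:                      # transient failure class
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    try:
        return ("fallback", fallback())            # escalate to a different tool
    except Exception:
        return ("escalate", None)                  # hand off to a human
```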
Sandboxing and Side-Effect Management
- Run code execution tools in isolated environments (Docker, Firecracker microVMs)
- Limit tool permissions to least privilege: read-only file access unless write is needed
- Dry-run mode: simulate irreversible actions before committing
- Audit log every write operation for compliance and rollback
✓
Scope tool permissions tightly; interviewers look for security awareness in agentic system design
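A dry-run gate plus audit log for an irreversible action might look like this; the function and field names are illustrative, and a real implementation would perform the deletion where the comment indicates.

```python
def delete_records(ids, dry_run=True, audit_log=None):
    """Simulate the irreversible action first; audit every real write."""
    if dry_run:
        return f"DRY RUN: would delete {len(ids)} records"
    if audit_log is not None:
        audit_log.append({"action": "delete", "ids": list(ids)})  # for rollback
    # ... real deletion would happen here ...
    return f"deleted {len(ids)} records"
```

Defaulting `dry_run=True` means the agent must opt in to the destructive path, which is the safer failure mode.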
06
Agent evals
Agent Evaluation Challenges
- Final-answer accuracy alone misses planning and tool-use quality
- Trajectory evaluation: score each step (right tool? right arguments? right handling of the result?)
- Task completion rate and steps-to-completion are key throughput metrics
- Regression suites must cover diverse task types to catch capability drift
✓
Measure trajectory quality and final outcome separately — both matter in production
✗
Eval only on easy tasks; agents fail precisely at edge cases, so adversarial evals are essential
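Trajectory scoring can be computed alongside the final outcome by comparing each step against a labeled reference. The metrics below are one reasonable choice, not a standard.

```python
def score_trajectory(steps, reference):
    """Per-step tool/arg accuracy plus an efficiency signal (extra steps)."""
    n = max(len(reference), 1)
    tool_acc = sum(s.get("tool") == r["tool"]
                   for s, r in zip(steps, reference)) / n
    args_acc = sum(s.get("args") == r["args"]
                   for s, r in zip(steps, reference)) / n
    return {"tool_acc": tool_acc, "args_acc": args_acc,
            "extra_steps": max(len(steps) - len(reference), 0)}
```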
Offline vs Online Agent Evals
- Offline: replay recorded tool responses so evals run fast and cheaply
- Online: run against real tools; more realistic but slower and costlier
- Shadow mode: run new agent version in parallel with production; compare outcomes
- LLM-as-judge can score trajectory quality at scale when human annotation is too expensive
✓
Build an offline eval harness with recorded tool responses before launching online eval
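The recorded-response idea can be sketched as a replay layer that the agent calls instead of live tools. The key format and recordings below are illustrative.

```python
class ReplayTools:
    """Offline eval harness: serve recorded tool responses, deterministically."""

    def __init__(self, recordings):
        # recordings: {(tool_name, sorted_kwargs_tuple): canned_response}
        self.recordings = recordings

    def call(self, tool_name, **kwargs):
        key = (tool_name, tuple(sorted(kwargs.items())))
        if key not in self.recordings:
            # Fail loudly: a missing recording means the trace must be re-recorded.
            raise KeyError(f"no recording for {key}; re-record this trace")
        return self.recordings[key]
```

Failing loudly on a cache miss is deliberate: silently falling through to a live call would make offline runs slow and nondeterministic again.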