
Playbook · AI Agents

How do you build effective AI agents?

Senior · High frequency · 14 min read · Free
Practical answer framework for AI engineer interview loops.

01 Interview Context

The interviewer is watching for the opposite of what most candidates do. Most candidates open with complex multi-agent systems and framework names. The signal they are looking for is restraint: do you know when not to build an agent? Do you start with the simplest thing that could work and add complexity only when forced to? The trap is treating "agent" as the default architecture instead of the last resort.

02 The 90-second answer

I start by asking whether I even need an agent. A single well-tuned LLM call with retrieval handles more use cases than people expect. If the task has steps that cannot be hardcoded — because the required actions depend on what the model discovers at runtime — that is when I move to an agent loop. Everything in between has a simpler name: prompt chaining, routing, parallelization, or an orchestrator-worker setup. I pick the least powerful pattern that solves the problem, then add autonomy only when the simpler options break.

03 Weak vs Strong Answer

Weak answer

"I'd use an agent framework, connect it to tools, and let the model decide what to do."

Strong answer

"The first question is whether I need an agent at all. For most tasks, a workflow — fixed steps coordinated by code — is cheaper, more reliable, and easier to debug than a fully autonomous loop. I only reach for true agent autonomy when the problem is genuinely open-ended and I can't predict what steps will be needed."

04 The five patterns, in order of complexity

There are five workflow patterns, and they form a ladder. I pick the lowest rung that works.

Prompt chaining — decompose a task into sequential steps, each one feeding the next. I use this when the subtasks are fixed and predictable. The failure mode is fragile handoffs: if one step produces noisy output, downstream steps amplify the noise.
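
A minimal sketch of the chain, assuming a hypothetical call_llm(prompt) helper standing in for whatever model client you use; the gate between steps is where noisy output gets caught before it propagates:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for whatever model client you use (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

def summarize_then_translate(document: str) -> str:
    # Step 1: produce an intermediate artifact.
    summary = call_llm(f"Summarize the key points of this document:\n\n{document}")
    # Gate between steps: catch noisy output here before downstream steps amplify it.
    if len(summary.split()) > 300:
        summary = call_llm(f"Condense this summary to under 200 words:\n\n{summary}")
    # Step 2: the next call sees only the checked intermediate output.
    return call_llm(f"Translate this summary into French:\n\n{summary}")
```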

Routing — classify the input first, then direct it to a specialized prompt or model. I use this when different query types genuinely need different handling — routing customer complaints separately from billing questions, or sending simple queries to a cheaper model. The failure mode is a router that is too coarse and lumps together things that need separate treatment.
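
A routing sketch under the same assumption (call_llm is a hypothetical placeholder); the labels and prompts here are made-up examples:

```python
def call_llm(prompt: str) -> str: ...  # placeholder for your model client

ROUTES = {
    "billing": "You are a billing specialist. Resolve the customer's billing question.",
    "complaint": "You are a support lead. Acknowledge the complaint, then propose a fix.",
    "general": "You are a helpful support assistant.",
}

def answer(query: str) -> str:
    # Cheap classification call first; keep the label set small and explicit.
    label = call_llm(
        "Classify this query as exactly one of: billing, complaint, general.\n"
        f"Query: {query}\nReply with the label only."
    ).strip().lower()
    # Fall back to the general route if the router returns something unexpected.
    system_prompt = ROUTES.get(label, ROUTES["general"])
    return call_llm(f"{system_prompt}\n\nCustomer query: {query}")
```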

Parallelization — run independent subtasks at the same time, or run the same task multiple times and aggregate results. I use the voting variant when I want higher confidence on a safety check. The failure mode is parallelizing the wrong step — if synthesis is the bottleneck, speeding up retrieval does not help much.
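
A sketch of the voting variant, assuming a hypothetical async call_llm helper:

```python
import asyncio

async def call_llm(prompt: str) -> str: ...  # placeholder for your async model client

async def flags_policy_violation(text: str, votes: int = 3) -> bool:
    # Voting variant: run the same check several times, then aggregate.
    prompt = (
        "Does the following text violate the content policy? "
        f"Answer YES or NO only.\n\n{text}"
    )
    answers = await asyncio.gather(*(call_llm(prompt) for _ in range(votes)))
    yes_votes = sum(1 for a in answers if a.strip().upper().startswith("YES"))
    # Majority vote buys higher confidence at roughly `votes` times the cost.
    return yes_votes > votes / 2
```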

Orchestrator-workers — a central model figures out what subtasks are needed, dispatches workers, and synthesizes results. I reach for this when I cannot predict the subtasks upfront, like complex search tasks or code changes that touch many files. The failure mode is an orchestrator that plans poorly and dispatches redundant or contradictory work.
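
The orchestrator-worker shape in miniature, again with call_llm as a hypothetical placeholder; asking for a JSON plan is just one way to get a machine-readable task list out of the orchestrator:

```python
import json

def call_llm(prompt: str) -> str: ...  # placeholder for your model client

def research(question: str) -> str:
    # Orchestrator: the model decides at runtime which subtasks are needed.
    subtasks = json.loads(call_llm(
        "Break this question into independent research subtasks. "
        f"Respond with a JSON list of strings only.\n\nQuestion: {question}"
    ))
    # Workers: each subtask is handled separately (and could run in parallel).
    findings = [call_llm(f"Research and summarize: {task}") for task in subtasks]
    # Synthesis: the orchestrator combines worker output into one answer.
    return call_llm(
        f"Question: {question}\n\nFindings:\n\n" + "\n\n".join(findings)
        + "\n\nSynthesize the findings into a single answer."
    )
```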

Evaluator-optimizer — one model generates, another evaluates and feeds back, in a loop. I use this when I have clear evaluation criteria and iterative refinement demonstrably improves the output. The failure mode is looping without convergence — the evaluator keeps flagging problems the generator cannot fix, burning time and tokens.
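
A sketch of the loop with a hard round cap, call_llm again being a hypothetical placeholder:

```python
def call_llm(prompt: str) -> str: ...  # placeholder for your model client

def refine(task: str, max_rounds: int = 3) -> str:
    draft = call_llm(task)
    for _ in range(max_rounds):  # hard cap so the loop cannot run forever
        critique = call_llm(
            f"Task: {task}\n\nDraft:\n{draft}\n\n"
            "Evaluate the draft against the task requirements. "
            "Reply APPROVED if it passes, otherwise list the specific problems."
        )
        if critique.strip().upper().startswith("APPROVED"):
            break  # converged; stop spending tokens
        draft = call_llm(
            f"Task: {task}\n\nDraft:\n{draft}\n\nFix these problems:\n{critique}"
        )
    return draft
```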

The upgrade path matters. Before I reach for an orchestrator, I ask whether a prompt chain would work. Before I reach for an evaluator-optimizer loop, I ask whether a single generation with a better prompt would do. Complexity is only justified when the simpler version measurably fails.

05 Tool design is not an afterthought

The place where agent implementations break most often is not the orchestration logic — it is the tool definitions. A tool with an ambiguous schema or missing edge case documentation will cause the model to call it wrong, and the model will do so confidently.

I treat tool documentation the way I treat system prompts: it needs the same level of care and the same iterative testing. That means including example inputs and edge cases in the description, not just parameter types. It also means designing arguments to prevent mistakes rather than just documenting them. If the model keeps passing relative file paths when you need absolute ones, change the argument to require absolute paths. Documenting the rule is not enough — the model will still get it wrong under pressure.
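
As an illustration, a hypothetical read_file tool; the schema is generic JSON-schema style rather than any particular framework's format, and the validation line is the point:

```python
import os

# Hypothetical tool definition; not tied to a specific framework's API.
READ_FILE_TOOL = {
    "name": "read_file",
    "description": (
        "Read a UTF-8 text file and return its contents. "
        "Example: read_file(path='/repo/src/main.py'). "
        "Fails on missing files and binary files."
    ),
    "parameters": {
        "path": {
            "type": "string",
            "description": "Absolute path to the file. Relative paths are rejected.",
        }
    },
}

def read_file(path: str) -> str:
    # Enforce the rule in code; documenting it alone is not enough.
    if not os.path.isabs(path):
        raise ValueError(f"read_file requires an absolute path, got {path!r}")
    with open(path, encoding="utf-8") as f:
        return f.read()
```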

The test I run: call each tool in isolation with realistic inputs, including bad inputs, before wiring it into the agent loop. Most tool bugs show up in five minutes of manual testing that never happened.
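
The kind of isolation checks I mean, written against the hypothetical read_file tool above and assuming pytest as the runner:

```python
import pytest  # any test runner works; pytest assumed here

# read_file is the hypothetical tool sketched above.

def test_rejects_relative_paths():
    with pytest.raises(ValueError):
        read_file("src/main.py")  # the exact mistake the model keeps making

def test_missing_file_fails_loudly(tmp_path):
    with pytest.raises(FileNotFoundError):
        read_file(str(tmp_path / "does_not_exist.txt"))
```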

06 What keeps agents from going off the rails

There are two things I always build in before shipping any agent.

The first is stopping conditions. A max step count, a token budget, or a confidence threshold that halts the loop and escalates to a human. An agent without a hard stop is an incident waiting to happen. The model will not decide on its own that it has been stuck for too long.
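
A sketch of what that harness looks like; decide_next_action, execute, and escalate_to_human are hypothetical stand-ins for your own loop:

```python
MAX_STEPS = 20
TOKEN_BUDGET = 200_000

def run_agent(task: str) -> str:
    tokens_used = 0
    for _ in range(MAX_STEPS):                      # hard cap on iterations
        action, tokens = decide_next_action(task)   # hypothetical: one model call
        tokens_used += tokens
        if tokens_used > TOKEN_BUDGET:              # hard cap on spend
            return escalate_to_human(task, reason="token budget exceeded")
        if action.is_final:
            return action.answer
        execute(action)                             # hypothetical tool executor
    # The model will not notice it is stuck; the harness has to.
    return escalate_to_human(task, reason="max step count reached")
```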

The second is ground truth at each step. The model should not be reasoning from its memory of what it did — it should be reading the actual tool output before deciding what to do next. Agents go wrong when they hallucinate the result of a tool call instead of observing it. Treating the environment as the source of truth, not the model's internal state, is the single most reliable way to keep loops honest.
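
In code, the observation step looks roughly like this fragment, where execute_tool, tool_call, and messages are hypothetical stand-ins and the message shape follows the OpenAI-style tool-result format:

```python
# Inside the loop: what goes back into context is the tool's real output,
# captured from the environment, never the model's guess at what it returned.
result = execute_tool(tool_call.name, tool_call.arguments)  # hypothetical executor
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": str(result)[:8000],  # truncate if huge, but always from the real result
})
```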

For anything destructive — writing to a production system, sending a message, deleting data — I require an explicit human checkpoint. That is not safety theater. It is the thing that makes the system recoverable when the model reasons incorrectly.
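
A sketch of that checkpoint; request_human_approval and execute_tool are hypothetical stand-ins for however your system pauses for review:

```python
DESTRUCTIVE_TOOLS = {"delete_record", "send_message", "write_to_prod"}

def guarded_execute(tool_name: str, args: dict):
    # Irreversible actions pause the loop and wait for explicit approval.
    if tool_name in DESTRUCTIVE_TOOLS:
        approved = request_human_approval(tool_name, args)  # blocks until a human responds
        if not approved:
            return {"status": "rejected", "reason": "human reviewer declined"}
    return execute_tool(tool_name, args)  # hypothetical executor
```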

07 Follow-up questions to expect

  1. When would you use a workflow instead of a true agent, and how do you decide?
  2. Walk me through how you would design the tools for a coding agent.
  3. What do you do when the agent loops without making progress?
  4. How would you evaluate an agent's performance end-to-end?
  5. What is the failure mode for the evaluator-optimizer pattern, and when would you abandon it?
  6. How do you handle prompt injection through tool results?