Playbook · System Design

Design an AI-powered customer support chatbot.

The interviewer is probing three things, usually in this order: do you understand the trust boundary between informational and account-changing actions, do you know what makes a support bot fail in operations (not just in a demo), and can you design for the human handoff. The trap is treating this as a RAG question. RAG is just the read…

Senior · High frequency · 18 min read

Practical answer framework for AI engineer interview loops.

01 Interview Context

02 The 90-second answer

Open by splitting the bot into two paths: a read path for informational help and a write path for account-changing actions. They have different safety requirements, different latency tolerances, and different audit needs, so lumping them together is the first design mistake.

On the read path, I use RAG: retrieve from the support knowledge base, score retrieval confidence, generate a grounded answer, and gate on confidence before responding. On the write path, I require identity verification, scoped tool permissions, and rollback-safe API calls — the bot should never write to a production system without being able to name what it changed and undo it.

Both paths share escalation logic: if intent confidence is low, sentiment signals frustration, the user hits three failed turns, or the action carries account risk, route to a human agent.

03 Weak vs Strong Answer

Weak answer

"I would use an LLM with a vector database and connect it to Zendesk."

Strong answer

"The architecture has two distinct paths with different safety postures. Read-only informational answers use RAG with a confidence gate — if retrieval confidence is low, I do not let the LLM hallucinate an answer, I escalate or surface a fallback. Account-changing actions need identity verification, scoped tool permissions, audit logs, and rollback-safe APIs. Blurring those two paths is where support bots go wrong in production."

04 The architecture I would actually describe

User Channel (chat, email, voice)
  └─ API Gateway (auth, rate limit, session)
      └─ Orchestrator
          ├─ Intent + Policy Classifier
          ├─ [Read path]  RAG Retriever → Confidence Gate → LLM Answer
          ├─ [Write path] Identity Check → Scoped Action Tools → Order/CRM APIs
          └─ Escalation Queue → Human Agent CRM

Logs + Traces → Monitoring + Eval Pipeline

The orchestrator is the system. The LLM is one component of it. A strong answer makes that distinction explicit.
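To make the read/write split concrete, here is a minimal sketch of the routing decision the orchestrator might make once the intent classifier has run. The `Classification` fields, the `Route` names, and the `MIN_INTENT_CONFIDENCE` cutoff are illustrative assumptions, not part of any particular framework.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    READ = "read"          # RAG retriever -> confidence gate -> LLM answer
    WRITE = "write"        # identity check -> scoped action tools
    ESCALATE = "escalate"  # human agent queue


@dataclass
class Classification:
    intent: str            # e.g. "refund_policy", "cancel_subscription"
    confidence: float      # classifier confidence in [0, 1]
    changes_account: bool  # True if fulfilling the intent mutates account state


# Hypothetical cutoff; in practice this would be tuned on labelled sessions.
MIN_INTENT_CONFIDENCE = 0.6


def route(c: Classification) -> Route:
    """Pick a path for one turn. Unclear intent never reaches the write path."""
    if c.confidence < MIN_INTENT_CONFIDENCE:
        return Route.ESCALATE
    return Route.WRITE if c.changes_account else Route.READ
```

The point of the sketch is that routing is ordinary, testable code that sits outside the LLM: the model can propose an intent, but the orchestrator decides which path that intent is allowed to take.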

05 The read path: RAG with confidence gates

The retrieval side works like any production RAG system. I want knowledge base docs chunked and indexed for both semantic (embedding) and keyword search, so the retriever can return a ranked set. Before the LLM generates an answer, I score retrieval confidence: specifically, whether the top-ranked passages are actually relevant to the query.

The trap here is letting the LLM answer when retrieval is empty or noisy. That is where hallucinated policies and fake refund windows appear.

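# Retrieval confidence gate: don't let the LLM fill gaps with invention.
# retriever, llm, and escalate are placeholders for your own components.
CONFIDENCE_THRESHOLD = 0.7  # placeholder value; calibrate on the golden set

def answer_query(query, retriever, llm, escalate):
    hits = retriever.search(query, k=5)
    # Assumes hits come back ranked best-first.
    top_score = hits[0].score if hits else 0.0

    if top_score < CONFIDENCE_THRESHOLD:
        return escalate(reason="low_retrieval_confidence")

    return llm.generate(query, context=hits)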

The threshold is calibrated on a golden set of representative support queries. I also track the empty-handed rate (how often retrieval comes back with nothing useful) as an ongoing eval signal. If that rate climbs after a knowledge base refresh, retrieval has regressed, whether the cause is the new content, the chunking, or the index.
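One way to calibrate the threshold, sketched under assumptions: each golden-set example carries the retriever's top score and a human label saying whether the top passage was relevant. The `GoldenExample` type, the candidate list, and the 0.95 precision target are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class GoldenExample:
    top_score: float  # retriever's best score for the query
    relevant: bool    # human label: was the top passage actually relevant?


def pick_threshold(examples, candidates, min_precision=0.95):
    """Return the lowest candidate whose answered set meets min_precision.

    Lower thresholds answer more queries, so the lowest one that still meets
    the precision target gives the most coverage. Returns None if no
    candidate qualifies.
    """
    for t in sorted(candidates):
        answered = [e for e in examples if e.top_score >= t]
        if not answered:
            continue
        precision = sum(e.relevant for e in answered) / len(answered)
        if precision >= min_precision:
            return t
    return None


def empty_handed_rate(top_scores, threshold):
    """Fraction of queries whose best retrieval score falls below the gate."""
    if not top_scores:
        return 0.0
    return sum(s < threshold for s in top_scores) / len(top_scores)
```

Re-running `empty_handed_rate` on the same golden queries before and after a knowledge base refresh gives a like-for-like comparison, which is what makes a jump in the rate attributable to the refresh.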

06 The write path: safety posture for account actions

This is the section most candidates skip. The LLM can understand a user intent like "cancel my subscription." The dangerous part is executing it.

A scoped tool design means the bot has explicit, narrow permissions — it can call cancel_subscription(account_id, plan_id) but not delete_account(). Every tool invocation is audited: who called it, with what parameters, and what the system state was before and after.

I also want rollback-safe APIs on the downstream systems. If the bot cancels a subscription and the user meant to pause it, a human agent should be able to undo that call. An agentic bot writing to systems that have no undo path is the design that will cause an incident.

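# Explicit tool scope, not open API access.
# audit_log and crm_client are placeholders for your own audit sink and CRM client.
ALLOWED_TOOLS = {
    "cancel_subscription",
    "get_order_status",
    "submit_return_request",
}

def execute_action(tool_name, params, user_identity):
    # Assumes user_identity was already verified by the upstream identity check.
    if tool_name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {tool_name!r} is not in bot scope")
    # Log before calling so a failed downstream call still leaves a trail.
    audit_log(tool_name, params, user_identity)
    return crm_client.call(tool_name, params)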
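The audit and rollback requirements described in this section can be sketched as follows. This is an illustration under assumptions: `InMemorySubscriptions` stands in for a real billing system, and `AuditRecord`, `cancel_subscription`, and `undo` are hypothetical names, not an existing API.

```python
import time
from dataclasses import dataclass


@dataclass
class AuditRecord:
    tool: str
    params: dict
    user_id: str
    pre_state: dict
    post_state: dict
    timestamp: float


class InMemorySubscriptions:
    """Stand-in for a billing system; a real one would be a remote API."""

    def __init__(self):
        self.status = {}  # (account_id, plan_id) -> "active" | "cancelled"

    def get(self, account_id, plan_id):
        return {"status": self.status.get((account_id, plan_id), "unknown")}

    def set_status(self, account_id, plan_id, status):
        self.status[(account_id, plan_id)] = status


def cancel_subscription(store, audit_trail, user_id, account_id, plan_id):
    """Cancel a plan and record enough state for a human to undo it."""
    pre = store.get(account_id, plan_id)
    store.set_status(account_id, plan_id, "cancelled")
    post = store.get(account_id, plan_id)
    record = AuditRecord(
        tool="cancel_subscription",
        params={"account_id": account_id, "plan_id": plan_id},
        user_id=user_id,
        pre_state=pre,
        post_state=post,
        timestamp=time.time(),
    )
    audit_trail.append(record)
    return record


def undo(store, record):
    """Restore the pre-call status captured in the audit record."""
    p = record.params
    store.set_status(p["account_id"], p["plan_id"], record.pre_state["status"])
```

Capturing the pre-state at call time is what makes the undo possible: the human agent does not need to reconstruct what the account looked like, because the record already names what changed.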

07 Escalation logic

The weak answer is "escalate when confidence is low." The strong answer names the specific signals:

Retrieval confidence below threshold on a repeated query — the user is not getting help
Sentiment score indicating frustration or urgency
Three or more failed turns in the same session
User explicitly asks for a human
Intent maps to an action outside the bot's permitted tool scope
Account risk flag set on the customer record (VIP, at-risk, open dispute)

Escalation should pass context — the full session, the intent classification, and the last retrieval trace — so the human agent does not re-ask everything the bot already collected.
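The signals listed above can be expressed as an explicit, ordered check. The sketch below is one possible shape, with assumptions stated: the `SessionState` fields, the sentiment scale of -1 to 1, and the cutoffs (two low-confidence turns, sentiment at or below -0.5) are hypothetical and would be tuned on real sessions. Only the three-failed-turns rule comes directly from the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SessionState:
    low_confidence_turns: int = 0  # turns where retrieval fell below the gate
    failed_turns: int = 0
    sentiment: float = 0.0         # assumed scale: -1 (frustrated) to 1 (happy)
    asked_for_human: bool = False
    intent_in_scope: bool = True
    account_risk_flag: bool = False
    intent: str = ""
    transcript: list = field(default_factory=list)
    last_retrieval_trace: list = field(default_factory=list)


def escalation_reason(s: SessionState) -> Optional[str]:
    """Return the first matching escalation reason, or None to keep going."""
    if s.asked_for_human:
        return "user_requested_human"
    if s.account_risk_flag:
        return "account_risk"
    if not s.intent_in_scope:
        return "out_of_scope_action"
    if s.failed_turns >= 3:
        return "repeated_failed_turns"
    if s.low_confidence_turns >= 2:
        return "repeated_low_retrieval_confidence"
    if s.sentiment <= -0.5:
        return "negative_sentiment"
    return None


def handoff_payload(s: SessionState, reason: str) -> dict:
    """Bundle what the human agent needs so they do not re-ask the user."""
    return {
        "reason": reason,
        "intent": s.intent,
        "transcript": s.transcript,
        "last_retrieval_trace": s.last_retrieval_trace,
    }
```

Returning a named reason rather than a boolean keeps the escalation rate inspectable: you can break it down by cause and see which rule is firing too often.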

08 Metrics that matter

A support bot is an operations system. If you cannot inspect these, you cannot improve it safely:

Metric	What it catches
Containment rate	Sessions resolved without human handoff — primary success signal
Escalation rate	How often the bot punts — high is not always bad if escalation rules are correct
Bad-answer rate	Answers flagged unhelpful or wrong by users or QA sampling
Failed action rate	Write-path failures — API errors, permission rejections, rollback events
Retrieval empty-handed rate	Queries where retrieval found nothing useful
Latency p95	Most support UIs have strict latency expectations; measure the full path

I also run a rolling eval on a golden set of support queries — the same pattern as retrieval evals in a production RAG system. If bad-answer rate on the frozen eval set ticks up, I want that signal before users report it.
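As a rough illustration of how a few of these metrics might be computed from session logs, here is a small sketch. The `SessionLog` fields are assumptions about what the logging pipeline records, and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
from dataclasses import dataclass


@dataclass
class SessionLog:
    escalated: bool  # session was handed to a human
    resolved: bool   # user's issue was marked resolved


def containment_rate(sessions):
    """Share of sessions resolved without a human handoff."""
    if not sessions:
        return 0.0
    contained = sum(1 for s in sessions if s.resolved and not s.escalated)
    return contained / len(sessions)


def escalation_rate(sessions):
    """Share of sessions handed to a human agent."""
    if not sessions:
        return 0.0
    return sum(1 for s in sessions if s.escalated) / len(sessions)


def p95(latencies_ms):
    """Nearest-rank 95th percentile of end-to-end latencies."""
    if not latencies_ms:
        raise ValueError("p95 needs at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Note that containment and escalation do not have to sum to one: a session can end unresolved without ever being escalated, and that gap is itself worth tracking.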

09 Tradeoffs interviewers probe

Hallucination on policy questions: the most common production failure. A customer asks "can I return after 60 days?" and the bot invents a window. Guard: retrieval confidence gate plus a citation requirement in the prompt. The answer must point to a retrieved passage, not generate from model memory.
Over-escalation vs under-escalation: a bot that escalates too easily is useless. One that escalates too rarely causes costly errors. Tune thresholds on real session data, not assumed failure rates.
Session state bloat: long support conversations accumulate context that overflows the window. Decide early whether you summarize, truncate, or hand off sessions past a turn limit. Unbounded growth is the agentic version of a memory leak.
Multilingual users: if the user base is multilingual, retrieval must handle cross-lingual queries. Embedding models trained primarily on English degrade on other languages. Either use multilingual embeddings or maintain per-language indexes.
Compliance and audit: support bots frequently touch PII, payment data, or health information. Every action on the write path needs an audit trail. A log that says "the bot did something" is useless; you need the tool name, the parameters, the state before and after, and the user identity.
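The citation requirement from the hallucination tradeoff above can be enforced in code as well as in the prompt. The sketch below assumes the prompt instructs the model to cite passages as bracketed IDs such as `[kb_12]`; that format and the function names are hypothetical.

```python
import re

# Assumed citation format: a passage ID in square brackets, e.g. "[kb_12]".
CITATION_PATTERN = re.compile(r"\[(\w+)\]")


def cited_ids(answer: str) -> set:
    """Extract every bracketed passage ID from the generated answer."""
    return set(CITATION_PATTERN.findall(answer))


def passes_citation_check(answer: str, retrieved_ids: set) -> bool:
    """True only if the answer cites something and every cited ID was retrieved."""
    cited = cited_ids(answer)
    return bool(cited) and cited <= retrieved_ids
```

For example, `passes_citation_check("Returns are accepted within 30 days [kb_12].", {"kb_12", "kb_40"})` passes, while an answer with no citation or with an ID that was never retrieved fails and can be routed to a fallback. This check only confirms that a citation exists and points at a retrieved passage; it does not verify that the passage supports the claim, which still needs QA sampling or an eval.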

10 Follow-up questions to expect

How would you decide when to retry with a different retrieval strategy versus escalating immediately?
What happens when a user tries to manipulate the bot into taking an action outside its permitted scope?
How would you evaluate whether the bot's answers stay accurate after a knowledge base update?
How would you detect and handle a knowledge base refresh that introduced retrieval regressions?
What would make you roll back a bot version after it has already gone live?
How would you design the bot to handle multilingual support without duplicating the knowledge base?
How do you prevent prompt injection through user-submitted content from reaching the write path?
What is your strategy for building the golden eval set when the product is new and there is no historical session data?
