Playbook · Production AI

Explain the AI product lifecycle from ideation to production.

Senior · Medium frequency · 18 min read · Free

Practical answer framework for AI engineer interview loops.

01 Interview Context

This question sounds broad because the interviewer wants to see whether you can impose release discipline on a probabilistic system. A weak answer gives them the standard software delivery lifecycle with "AI" sprinkled in. A strong answer names the stages where AI products need extra gates — evaluation before hardening, rollout control before exposure, and continuous learning after launch — and explains why those gates exist specifically because LLM output quality is probabilistic, not deterministic.

02 The 90-second answer

Open by naming the five stages that matter: problem framing, thin-slice prototype, eval discipline, production hardening, and continuous improvement. Then make the pivot: the AI-specific work happens at the boundary between prototype and production. In a standard product, passing tests and meeting spec gets you close to done. In an AI product, the output quality is probabilistic, so the main job is proving the system is good enough under realistic traffic and then keeping it good enough after release.

My release gate before any launch: the candidate beats or holds baseline quality on a frozen eval set, stays within latency and cost budgets, has a tested fallback path, and rolls out behind a flag or narrow cohort. Without those four, it is a demo with a deployment pipeline attached.

03 Weak vs Strong Answer

Weak answer

"You ideate, build an MVP, test it, deploy it, and monitor it."

Strong answer

"The lifecycle looks standard until you hit the prototype-to-production boundary. That is where AI products need extra discipline: a representative eval set, a baseline to compare against, versioned prompts or models, a fallback path, and staged rollout. Skip any of those and you are shipping a demo, not a product."

04 Stage 1: Problem framing

The trap here is starting with model selection. Before picking anything, I want to know two things: what does a wrong answer cost, and what does the right answer look like in production.

Failure cost asymmetry shapes the whole design. A false positive in a content moderation system is different from a false positive in a medical diagnosis system. The acceptable error budget determines the eval bar, the fallback posture, and how aggressively you roll out. If you skip this, you will build against the wrong success metric.
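
One way to make the error budget concrete is to put a price on each error type and pick the decision threshold that minimizes expected cost. The sketch below is illustrative only: the cost figures, scores, and labels are made up, and `ErrorCosts`, `expected_cost`, and `best_threshold` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorCosts:
    """Hypothetical per-error costs, in whatever unit the business uses."""
    false_positive: float
    false_negative: float


def expected_cost(scores, labels, threshold, costs):
    """Average cost per example when every score >= threshold is flagged."""
    total = 0.0
    for score, is_positive in zip(scores, labels):
        flagged = score >= threshold
        if flagged and not is_positive:
            total += costs.false_positive
        elif not flagged and is_positive:
            total += costs.false_negative
    return total / len(scores)


def best_threshold(scores, labels, costs, candidates):
    """Pick the candidate threshold with the lowest expected cost."""
    return min(candidates, key=lambda t: expected_cost(scores, labels, t, costs))


# Toy model scores and ground-truth labels (assumed, not real data).
scores = [0.1, 0.3, 0.45, 0.6, 0.8, 0.9]
labels = [False, False, True, False, True, True]
candidates = [0.2, 0.4, 0.5, 0.7]

# Scenario A: a false positive is the expensive mistake.
fp_heavy = ErrorCosts(false_positive=5.0, false_negative=1.0)
# Scenario B: a false negative is the expensive mistake.
fn_heavy = ErrorCosts(false_positive=1.0, false_negative=20.0)

threshold_a = best_threshold(scores, labels, fp_heavy, candidates)
threshold_b = best_threshold(scores, labels, fn_heavy, candidates)
```

With these toy numbers the same model scores lead to a threshold of 0.7 in scenario A and 0.4 in scenario B, which is why the cost question comes before the model question.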

I also want to know whether AI is necessary at all. A rule-based system that is 95% accurate and always explainable is often better than a model that is 97% accurate and opaque. Make that tradeoff explicitly, not by default.

05 Stage 2: Thin-slice prototype

The prototype exists to answer one question: is this feature useful enough to invest in? Not "does it work" — usefulness is the only signal worth optimizing for at this stage.

The trap is confusing "impressive in a demo" with "safe to ship." A model that generates fluent text is not the same as a model whose outputs are trustworthy at scale. Stop prototyping as soon as you have enough signal to decide whether to invest in eval discipline. Every hour past that point is waste.

06 Stage 3: Eval discipline

This is the AI-specific gate that most product teams underinvest in.

The foundation is a golden dataset: a frozen, representative set of inputs with known-good expected outputs. Not a random sample. I want long-tail inputs, adversarial edge cases, inputs where the correct answer is to abstain, and the high-value tasks users actually repeat. I version this dataset: when I refresh it, I keep the prior version so I can compare scores over time and detect whether the new additions made the bar easier to clear.
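
A minimal sketch of what versioning the golden set could look like, assuming each case is a dict with `input` and `expected` keys. The helper names and the sample cases are hypothetical. The idea is to hash the content so any edit produces a new version, then report pass rates separately for carried-over cases and newly added ones.

```python
import hashlib
import json


def case_id(case):
    """Stable ID derived from the input text, so a case matches across versions."""
    return hashlib.sha256(case["input"].encode("utf-8")).hexdigest()[:12]


def dataset_version(cases):
    """Content hash of the whole set; editing any case changes it."""
    canonical = json.dumps(sorted(cases, key=case_id), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def compare_versions(old_cases, new_cases, passed_ids):
    """Pass rate on carried-over cases versus newly added cases.

    passed_ids holds the IDs of cases the current system passed on the new set.
    """
    old_ids = {case_id(c) for c in old_cases}
    new_ids = {case_id(c) for c in new_cases}
    shared, added = new_ids & old_ids, new_ids - old_ids

    def rate(ids):
        return len(ids & passed_ids) / len(ids) if ids else None

    return {"shared": rate(shared), "added": rate(added)}


# Illustrative cases only.
v1 = [
    {"input": "How do I reset my password?", "expected": "reset_link"},
    {"input": "What is my coworker's salary?", "expected": "ABSTAIN"},
]
v2 = v1 + [
    {"input": "Cancel my order from last week", "expected": "cancel_flow"},
    {"input": "Change my shipping address", "expected": "address_flow"},
]

# Pretend the system passed the first carried-over case and both new cases.
passed_ids = {case_id(v2[0]), case_id(v2[2]), case_id(v2[3])}
report = compare_versions(v1, v2, passed_ids)
```

Here the added cases pass at 1.0 while the carried-over cases pass at 0.5. A gap in that direction is a hint that the refresh made the set easier, so the headline score is not comparable with the previous version.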

# Every prompt or model change must clear this gate before staging
for example_input, expected in golden_set:
    actual = system.run(example_input)
    metrics.quality.update(actual, expected)

# A negative delta means the candidate scored below the baseline
delta = metrics.quality.score() - baseline.quality_score
assert delta >= -ALLOWED_REGRESSION, f"quality regressed by {-delta:.3f}"
assert metrics.latency_p95 <= LATENCY_BUDGET_MS, "p95 latency over budget"
assert metrics.cost_per_query <= COST_BUDGET_USD, "cost per query over budget"

Version prompts and models the same way you version code. A prompt change that looks small can shift output quality significantly across input distributions. You cannot measure that without a baseline.
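
One possible way to version prompts like code is to derive the version from the content itself, so that no edit can reuse an old baseline by accident. This is a sketch under assumptions: `PromptVersion`, `baseline_scores`, `needs_eval`, the model name, and the score are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    """A prompt template pinned to a model name (both hypothetical here)."""
    name: str
    template: str
    model: str

    @property
    def version(self) -> str:
        # Hash the template together with the model: changing either one
        # yields a new version that needs its own eval run.
        payload = f"{self.model}\n{self.template}"
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:10]


# Maps version -> quality score from the last eval run on the golden set.
baseline_scores: dict[str, float] = {}


def needs_eval(prompt: PromptVersion) -> bool:
    """True when this exact template and model pair has never been scored."""
    return prompt.version not in baseline_scores


v1 = PromptVersion("support_answer", "Answer the question concisely: {question}", "model-a")
v2 = PromptVersion("support_answer", "Answer the question briefly: {question}", "model-a")

baseline_scores[v1.version] = 0.91  # illustrative score, not a real result
```

The one-word change from "concisely" to "briefly" produces a different version, so `needs_eval(v2)` is true until the new prompt has been run against the golden set.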

07 Stage 4: Production hardening

Three things that most candidates omit:

Fallback behavior. Every path the system can take must have a graceful failure mode. If the model times out, if retrieval returns nothing, if the output fails a safety filter — what happens? The answer cannot be "an error page." Design the fallback before launch, test it explicitly, and make it observable.

def answer_with_fallback(prompt):
    try:
        answer = model.generate(prompt, timeout=3.0)
        if safety_filter.reject(answer):
            return fallback_response("content_policy")
        return answer  # happy path: the answer passed the safety filter
    except TimeoutError:
        return fallback_response("timeout")
    except ModelError as e:
        log_failure(e)
        return fallback_response("model_error")

Staged rollout. Launch behind a feature flag to a narrow cohort — internal users, opted-in beta users, or a small traffic slice. Define the rollback trigger before you ramp, not after you see the dashboards move.
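
A rough sketch of both halves of that advice, assuming no particular feature-flag product: deterministic hashing puts each user in a stable bucket so the cohort only grows as the percentage ramps, and the rollback triggers are written down as data before launch. The function names, metric names, and limits are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Deterministically decide whether a user is inside the rollout slice.

    The same user always lands in the same bucket for a given feature, so
    raising `percent` only adds users and never reshuffles the cohort.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # bucket in 0..9999
    return bucket < percent * 100


# Rollback triggers agreed before the ramp (illustrative limits).
ROLLBACK_TRIGGERS = {
    "error_rate": 0.02,
    "fallback_rate": 0.10,
    "thumbs_down_rate": 0.15,
}


def breached_triggers(observed: dict) -> list:
    """Names of the triggers whose observed value exceeds the agreed limit."""
    return [
        name
        for name, limit in ROLLBACK_TRIGGERS.items()
        if observed.get(name, 0.0) > limit
    ]


users = [f"user-{i}" for i in range(1000)]
cohort_5 = {u for u in users if in_rollout(u, "smart_reply", 5)}
cohort_50 = {u for u in users if in_rollout(u, "smart_reply", 50)}
```

Any non-empty result from `breached_triggers` means the flag goes back to 0% first and the investigation happens afterwards.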

Cost and latency gates. LLM inference costs scale with usage in ways that are easy to underestimate. Know your per-query cost before launch. Set a hard budget and alert when it drifts.
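
As a back-of-the-envelope sketch, per-query cost can be estimated from token counts and checked against the budget. The prices below are placeholders, not any provider's real rates, and `Pricing`, `cost_per_query`, and `budget_status` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in USD per 1,000 tokens; substitute real rates."""
    input_per_1k: float
    output_per_1k: float


def cost_per_query(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Estimated cost of one call, given its input and output token counts."""
    return (
        input_tokens / 1000 * pricing.input_per_1k
        + output_tokens / 1000 * pricing.output_per_1k
    )


def budget_status(costs: list, budget_usd: float, warn_ratio: float = 0.8) -> str:
    """Compare the mean observed cost per query with the hard budget."""
    mean_cost = sum(costs) / len(costs)
    if mean_cost > budget_usd:
        return "over_budget"
    if mean_cost > warn_ratio * budget_usd:
        return "warning"
    return "ok"


pricing = Pricing(input_per_1k=0.003, output_per_1k=0.015)  # placeholder rates

# A query with 2,000 input tokens (prompt plus retrieved context) and 500 output tokens.
sample_cost = cost_per_query(2000, 500, pricing)
status = budget_status([sample_cost], budget_usd=0.015)
```

At these placeholder rates the sample query costs about $0.0135, which is already inside the warning band of a $0.015 budget. Input tokens are worth watching separately, because retrieved context and long system prompts tend to grow after launch.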

08 Stage 5: Continuous improvement

The job after launch is not maintenance — it is learning. The best teams do two things standard software teams often skip.

First, they feed production failures back into the eval set. When a user reports a bad answer, or QA sampling catches a regression, that input goes into the golden set so the same failure cannot silently recur and the bar gets harder to clear over time. Teams that skip this let quality drift without noticing.
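
A small sketch of that feedback loop, assuming the golden set is a list of dicts and that a human has reviewed the failure and written the expected output. The function name, field names, and sample data are hypothetical.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical inputs deduplicate."""
    return " ".join(text.lower().split())


def add_failure_to_golden_set(golden_set, failure, reviewed_expected):
    """Return a new golden set that includes the failure as an eval case.

    The original list is left untouched, so the prior version stays available
    for comparison. Inputs already covered by the set are not added twice.
    """
    existing = {normalize(case["input"]) for case in golden_set}
    if normalize(failure["input"]) in existing:
        return golden_set
    new_case = {
        "input": failure["input"],
        "expected": reviewed_expected,  # written by a human reviewer, never the bad output
        "source": failure.get("source", "production"),
    }
    return golden_set + [new_case]


golden_v1 = [{"input": "How do I reset my password?", "expected": "reset_link"}]

user_report = {"input": "Can I get a refund after 90 days?", "source": "user_report"}
golden_v2 = add_failure_to_golden_set(golden_v1, user_report, "ABSTAIN")

# The same input reported again with different casing and spacing is a no-op.
duplicate = {"input": "  can I get a refund after 90 DAYS? "}
golden_v3 = add_failure_to_golden_set(golden_v2, duplicate, "ABSTAIN")
```

The expected output comes from review rather than from the system, otherwise the set would encode the very failure it is meant to catch.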

Second, they separate signal from noise in their production metrics. A latency spike is not the same as a quality regression. Cost drift is not the same as a safety incident. Instrument each dimension separately so you can diagnose without guessing.
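
To illustrate what separate instrumentation might look like, the sketch below keeps one threshold per dimension and emits one alert per breach, each with its own paging policy. The metric names, limits, and paging choices are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    dimension: str  # what kind of problem this is
    metric: str     # key expected in the metrics snapshot
    limit: float    # alert when the observed value exceeds this
    page: bool      # True: page on-call, False: open a ticket


# Illustrative limits; real values come from the error budget and cost model.
THRESHOLDS = [
    Threshold("latency", "p95_ms", 3000, page=False),
    Threshold("cost", "usd_per_query", 0.02, page=False),
    Threshold("quality", "sampled_fail_rate", 0.05, page=True),
    Threshold("safety", "filter_reject_rate", 0.01, page=True),
]


def breached_alerts(snapshot: dict) -> list:
    """Return one alert per breached dimension so each is diagnosed on its own."""
    alerts = []
    for t in THRESHOLDS:
        value = snapshot.get(t.metric)
        if value is not None and value > t.limit:
            alerts.append(
                {"dimension": t.dimension, "metric": t.metric, "value": value, "page": t.page}
            )
    return alerts


# A latency spike with healthy quality, cost, and safety numbers.
snapshot = {
    "p95_ms": 4200,
    "usd_per_query": 0.01,
    "sampled_fail_rate": 0.02,
    "filter_reject_rate": 0.0,
}
alerts = breached_alerts(snapshot)
```

This snapshot produces a single latency alert that does not page anyone, which keeps a slow afternoon from being mistaken for a quality regression.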

09 Tradeoffs interviewers probe

Speed vs eval rigor: the pressure to ship fast is real, but teams that skip the eval set tend to ship regressions and build against the wrong quality metric. The eval set pays for itself within weeks if you are iterating on prompts or models.
LLM judge vs human review: LLM judges are cheap and fast, but they can share blind spots with the model being evaluated and often miss subtle factuality errors. Human review is the closest thing to ground truth, but it is expensive. Run both: automate coverage with the judge, then have humans audit a sample, especially in high-risk domains.
Prompt engineering vs fine-tuning: fine-tuning adds cost and version-management complexity. Exhaust prompt engineering first. The bar for fine-tuning is: prompt engineering is not reaching the quality bar on your eval set.
Cohort rollout vs dark launch: dark launching (mirroring live traffic to the new system without showing its results to users) tests the real query distribution without user exposure. It is more expensive but catches edge cases a synthetic eval set misses. Worth it for high-stakes or high-traffic features.
Model upgrade timing: newer models improve on some dimensions and regress on others. Run the full eval set before adopting a new model, not after. Assume nothing transfers from prior evals.
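
The model upgrade point is easiest to see with per-slice scores. In this sketch, which assumes every eval result is tagged with a slice name, the candidate model improves the overall pass rate while regressing on one slice. The slice names, results, and tolerance are made up for illustration.

```python
from collections import defaultdict


def per_slice_scores(results):
    """results: iterable of (slice_name, passed). Returns slice -> pass rate."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [passed, total]
    for slice_name, passed in results:
        totals[slice_name][0] += int(passed)
        totals[slice_name][1] += 1
    return {name: passed / total for name, (passed, total) in totals.items()}


def overall_score(results):
    """Pass rate across all cases, ignoring slices."""
    results = list(results)
    return sum(int(passed) for _, passed in results) / len(results)


def regressed_slices(baseline, candidate, tolerance=0.02):
    """Slices where the candidate scores below baseline by more than tolerance."""
    base = per_slice_scores(baseline)
    cand = per_slice_scores(candidate)
    return sorted(
        name for name in base if cand.get(name, 0.0) < base[name] - tolerance
    )


# Illustrative eval results for the current model and a newer candidate.
baseline = [
    ("summarize", True), ("summarize", False), ("summarize", False), ("summarize", False),
    ("abstain", True), ("abstain", True),
]
candidate = [
    ("summarize", True), ("summarize", True), ("summarize", True), ("summarize", True),
    ("abstain", True), ("abstain", False),
]

regressions = regressed_slices(baseline, candidate)
```

The overall pass rate rises from 3 of 6 to 5 of 6, yet `regressions` contains the abstain slice. A gate that only checks the aggregate would have approved the upgrade.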

10 Follow-up questions to expect

What should happen at the prototype-to-hardening transition — what evidence do you need to proceed?
How do you build a representative eval set when the product is new and you have no historical traffic?
How would you structure a staged rollout for an LLM feature, and what would trigger a rollback?
What production metrics matter most in the first month after launch?
How do you detect quality drift after a model upgrade that your eval set did not catch?
When would you choose fine-tuning over prompt engineering, and how do you make that call?
How do you balance the cost of a thorough eval run with the pressure to ship quickly?
What does your monitoring setup look like for an LLM feature — what signals would wake you up at 2am?


Best for: AI Engineer, Product Engineer, ML Engineer
