Playbook · Retrieval & RAG

How would you implement a RAG pipeline from scratch?

Senior · High frequency · 15 min read · Free
Practical answer framework for AI engineer interview loops.

01 Interview Context

The interviewer is not checking whether you can name the libraries. They want to see if you understand where each stage of the pipeline breaks and what production concerns look like before you have even added complexity. The trap is to list components: embeddings, vector store, LLM. That reads as surface knowledge. The strong answer explains the judgment calls at each stage — chunking granularity, retrieval confidence, generation constraints — and names the failure modes before being asked.

02 The 90-second answer

A basic RAG pipeline has five stages: ingestion, chunking, embedding, indexing, and then retrieval and generation at query time. The implementation is not the hard part. The judgment is. I would start with fixed-size chunking with overlap, pick a sentence-transformer for embeddings, and use any vector store with approximate nearest neighbor search. Before I trust any output, I write a small eval to check whether retrieval actually returns the right passages. Everything else follows from what that eval shows.

03 Weak vs Strong Answer

Weak answer

"I would use ChromaDB for the vector store, a sentence-transformer for embeddings, and OpenAI for generation."

Strong answer

"The stack is the easy part. The judgment calls are chunk size, retrieval confidence gates, and what happens when retrieval fails. A prototype that hits the happy path is not the same as one that fails cleanly when the query has no good answer in the index."

04 The five stages and where each one breaks

Chunking

Fixed-size chunking with overlap is fine to start:

def chunk_text(text, chunk_size=500, overlap=100):
    # Slide a fixed-size window over the text; the overlap keeps a fact that
    # straddles a boundary from being cut off in both of its chunks
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

It breaks at the extremes. Chunks too large: the embedding averages over multiple concepts and retrieval gets fuzzy — the right document scores well but the wrong passage gets pulled. Chunks too small: each chunk loses surrounding context, so the answer is grounded on paper but missing the part that actually matters. In production I switch to sentence-boundary or semantic chunking once I have an eval that shows where retrieval is misfiring, not before.
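
If the eval points at boundary problems, a sentence-boundary variant is a small step up. A minimal sketch, not the production version: the regex split stands in for a real sentence tokenizer (nltk, spaCy), and chunk_by_sentences and max_chars are illustrative names.

import re

def chunk_by_sentences(text, max_chars=500):
    # Naive split on terminal punctuation; a real sentence tokenizer
    # handles abbreviations and decimals better
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        # Close the current chunk once adding this sentence would blow the budget
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks

The point is that chunk boundaries now respect meaning; whether that actually moves retrieval quality is what the eval tells you.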

Embedding

all-MiniLM-L6-v2 is a reasonable starting point. It runs locally, doesn't need an API call, and is fast enough to measure whether retrieval is working at all before committing to a more powerful model.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks).tolist()

The trap is upgrading the embedding model before proving the pipeline structure works. Model swaps are cheap. Fixing a broken chunking or retrieval architecture after the fact is not. Measure Recall@k first. Only swap the model if recall is actually the bottleneck.
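
A small version of that eval can be brute force: hand-label a handful of (query, index of the chunk that answers it) pairs and check whether the right chunk lands in the top k. This sketch assumes numpy and cosine similarity, reuses the model from above, and golden_set is an illustrative name:

import numpy as np

def recall_at_k(golden_set, chunk_embeddings, k=5):
    # golden_set: hand-labeled (query, index of the chunk that answers it) pairs
    matrix = np.array(chunk_embeddings)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    hits = 0
    for query, expected_idx in golden_set:
        q = model.encode([query])[0]
        q = q / np.linalg.norm(q)
        top_k = np.argsort(matrix @ q)[::-1][:k]
        hits += int(expected_idx in top_k)
    return hits / len(golden_set)

If this number is low, the problem is chunking or embedding, and no amount of prompt work at the generation stage will fix it.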

Indexing

Store the chunks, embeddings, and enough metadata to trace each chunk back to where it came from:

collection.add(
    ids=[f"doc_{doc_id}_chunk_{i}" for i, _ in enumerate(chunks)],
    documents=chunks,
    embeddings=embeddings,
    metadatas=[{"document_id": doc_id, "chunk_id": i} for i, _ in enumerate(chunks)],
)

The metadata is not optional. If retrieval returns a wrong answer and you cannot say which document it came from, you cannot debug or improve anything. Attribution is also the first thing enterprise users ask for.
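
That traceability is easy to exercise. A sketch assuming a Chroma-style collection.get; the chunk id below is purely an example of the id format built in the add call:

# Pull a suspect chunk and its provenance back out by id
# ("doc_42_chunk_7" is just an example of the id format above)
chunk = collection.get(ids=["doc_42_chunk_7"], include=["documents", "metadatas"])
print(chunk["metadatas"][0]["document_id"], chunk["documents"][0][:200])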

Retrieval

At query time, I embed the question with the same model, search for nearest neighbors, and return the top-k chunks.

query_embedding = model.encode([query]).tolist()[0]
results = collection.query(query_embeddings=[query_embedding], n_results=top_k)

The common mistake is treating top-k as a cutoff you always hit regardless of match quality. If the top result is a poor match, you should not be generating from it. Add a confidence gate:

# Refuse to generate when retrieval found nothing credible
if not results["distances"][0] or results["distances"][0][0] > DISTANCE_THRESHOLD:
    return {"answer": "I don't have a reliable answer for that.", "retrieved": []}

Without this gate, the LLM confidently fills gaps from parametric memory. The answer looks grounded but is not. That is the root cause of most hallucinations in early RAG implementations.
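
DISTANCE_THRESHOLD should come from data, not a gut feel. One way to set it, reusing the hand-labeled golden set from the embedding section and the query call above; best_distance is an illustrative helper:

def best_distance(query):
    # Distance of the single closest chunk for a query
    query_embedding = model.encode([query]).tolist()[0]
    results = collection.query(query_embeddings=[query_embedding], n_results=1)
    return results["distances"][0][0]

answerable = sorted(best_distance(q) for q, _ in golden_set)
# Gate near the high end of what answerable queries score, then sanity-check
# that known-unanswerable queries land above it before trusting the number
DISTANCE_THRESHOLD = answerable[int(0.95 * len(answerable))]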

Generation

Put the retrieved chunks into a prompt and tell the LLM to stay within what's there:

context = "\n\n".join(results["documents"][0])
prompt = f"""Answer the question using only the context below.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {query}
"""

The phrase "using only the context below" matters. Without it, the model fills gaps from training memory. The answer might look right, but it is no longer grounded in what you actually retrieved — you can't point to a source, can't debug it, and can't trust it stays correct when the documents change.
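
The call itself is the least interesting part. A minimal sketch, assuming an OpenAI-style chat client (the model name is a placeholder; any generation API fits), that also returns the source document ids so the attribution from the indexing stage survives to the user:

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whatever model the team runs
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)

answer = {
    "answer": response.choices[0].message.content,
    # The metadata stored at index time is what makes this list possible
    "sources": sorted({m["document_id"] for m in results["metadatas"][0]}),
}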

05 Tradeoffs interviewers probe

Decision           | Simple default                 | When to revisit
Chunking strategy  | Fixed-size with overlap        | Retrieval misses facts that span a chunk boundary
Embedding model    | sentence-transformers locally  | Recall@5 plateaus even after tuning chunk size
Vector DB          | ChromaDB in-process            | Index size exceeds memory, or you need multi-tenant isolation
top-k              | Fixed integer (3–5)            | Long-tail queries need more context; simple queries get confused by too much
Confidence gate    | Distance threshold             | Calibrate on a golden eval set; a gut-feel number will drift

06 Follow-up questions to expect

  1. What would you measure to know whether retrieval is actually working before you trust generation?
  2. How would you choose between fixed-size and semantic chunking?
  3. When does adding a reranker on top of vector search pay off?
  4. What breaks in this pipeline when documents are updated frequently?
  5. How would you handle queries that require reasoning across multiple documents?
  6. What would make you replace the embedding model, and how would you validate the swap?

Next playbook

What is Retrieval-Augmented Generation (RAG), and why is it important?

10 min · Retrieval & RAG