Design a RAG Knowledge Assistant

The prompt. "Design a question-answering assistant over a company's internal knowledge base — wikis, docs, tickets. Employees ask natural-language questions and get cited answers."

This is the single most-asked agentic design question, and the one most candidates fumble — not because the architecture is hard, but because they jump to "vector DB plus an LLM" before they have quantified anything. This page runs the whole thing end to end on the ADEPT framework: Align, Design the knowledge layer, Engineer the loop, Protect and optimize, Test and evolve. Treat it as the answer you model your own on.

First, anchor the numbers. The prompt gives you none, so you state the constraints you'd clarify and proceed against concrete targets — designing in a vacuum is the fastest way to lose the room.

~5M documents in corpus

~300 QPS at peak

≤ 2s p95 latency

$0.002–0.01 budget per query

~5 min freshness SLA

Constraints I'd confirm up front. Multi-tenant with per-document ACLs. Citations mandatory. A confident wrong answer is worse than an honest abstention — so the system must prefer 'I don't know.' Every later tradeoff traces back to one of these numbers.

Phase A — Align

Functional scope, smallest viable first. Single-turn cited Q&A is v1; multi-turn follow-ups (resolving "what about for contractors?" against prior context) come next. Citations are non-negotiable — every answer links to the source chunks it used. The system is read-only, so blast radius is low: the worst case is a wrong answer, not a deleted database. That framing matters — it justifies a lighter autonomy posture than an agent that takes actions.

Non-functionals as numbers. Classic system design does capacity estimation in requests and bytes; here you do it in tokens and dollars. Assume each query assembles ~3k context tokens (a handful of reranked chunks plus the system prompt) and emits ~300 output tokens.

Token-budget estimate: the agentic version of capacity planning

Per query: ~3,000 input + ~300 output tokens

Throughput at peak: 3,000 × 300 QPS ≈ 900k input tok/s; 300 × 300 QPS ≈ 90k output tok/s → ~1M tok/s aggregate

Cost, routed (mid-tier): 3,000 × $0.25/M + 300 × $1.25/M ≈ $0.0011/query

Cost, frontier-only: 3,000 × $3/M + 300 × $15/M ≈ $0.0135/query

Daily inference (~50 QPS avg): ~4.3M queries/day → ~$4.7k/day routed vs ~$58k/day frontier-only

Representative blended prices, stated as assumptions. The arithmetic does real work: ~1M tok/s sets your rate-limit / provisioned-throughput conversation, and the roughly 12× cost gap between routed and frontier-only ($0.0011 vs $0.0135 per query) is the entire reason Phase E routes by query difficulty. The frontier-only design exceeds the $0.01 per-query budget ceiling on its own.
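The estimate is easy to keep honest in a few lines of code. The sketch below recomputes the figures from the stated assumptions; the names (`Pricing`, `cost_per_query`) are illustrative, and the prices are the same assumed blended rates used in the estimate, not quotes from any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Assumed blended prices, in dollars per million tokens."""
    input_per_m: float
    output_per_m: float


INPUT_TOKENS = 3_000    # context assembled per query
OUTPUT_TOKENS = 300     # tokens generated per query
PEAK_QPS = 300
AVG_QPS = 50
SECONDS_PER_DAY = 86_400


def cost_per_query(p: Pricing) -> float:
    """Dollar cost of one query at the assumed token counts."""
    return (INPUT_TOKENS * p.input_per_m + OUTPUT_TOKENS * p.output_per_m) / 1_000_000


routed = Pricing(input_per_m=0.25, output_per_m=1.25)
frontier = Pricing(input_per_m=3.0, output_per_m=15.0)

peak_tokens_per_s = (INPUT_TOKENS + OUTPUT_TOKENS) * PEAK_QPS
queries_per_day = AVG_QPS * SECONDS_PER_DAY

print(f"peak throughput: {peak_tokens_per_s:,} tok/s")
for name, pricing in [("routed", routed), ("frontier-only", frontier)]:
    c = cost_per_query(pricing)
    print(f"{name}: ${c:.4f}/query, ${c * queries_per_day:,.0f}/day")
```

Unrounded, the routed figure comes to about $4.9k/day; the ~$4.7k in the estimate comes from rounding the per-query cost to $0.0011 before multiplying. Either way the gap to frontier-only is about 12×.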

The killer clarifying question: what is the cost of a wrong answer versus the cost of an abstention? For an HR-policy or security-runbook assistant, a confident wrong answer is a real liability and abstention is cheap — so you tune hard for faithfulness and accept lower coverage. For a brainstorming aid, the calculus flips. This one question sets the entire guardrail and validation posture downstream; ask it before you draw a single box.

Quantify before you architect

The opening minutes set your level more than any other stretch. A strong candidate turns a vague prompt into a measured problem; a weak one starts drawing boxes around an unquantified void.

Every component choice later cites a number from this phase.
The cost-of-wrong-answer question decides how aggressive your abstention and citation-validation gates are.
Read-only scope is a gift — say so, and design proportionally.

Phase D — Design the knowledge layer

This is the heart of a RAG problem, and where depth earns the most credit. The model knows nothing about the company; retrieval quality is the dominant lever on answer quality. Walk the pipeline.

Chunking

recursive / semantic, parent-child

Split on structure (headings, sections) with recursive fallback, not blind fixed-size windows. Use parent-child: index small, precise child chunks for retrieval, but feed the larger parent section to the model so it has surrounding context. Store chunk → parent → document lineage for citations.
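A minimal sketch of parent-child chunking with lineage, under simplifying assumptions: blank lines stand in for structural boundaries (headings, sections), and children are fixed-size word windows inside each parent. The `Chunk` type and the ID scheme are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str    # child ID, used for retrieval and citation
    parent_id: str   # section fed to the model
    doc_id: str      # source document
    text: str


def split_parent_child(
    doc_id: str, text: str, child_words: int = 40
) -> tuple[dict[str, str], list[Chunk]]:
    """Split a document into parent sections and small child chunks.

    Returns (parents, children): parents maps parent_id -> section text,
    and every child records its parent and document for lineage.
    """
    parents: dict[str, str] = {}
    children: list[Chunk] = []
    sections = [s.strip() for s in text.split("\n\n") if s.strip()]
    for p_idx, section in enumerate(sections):
        parent_id = f"{doc_id}#p{p_idx}"
        parents[parent_id] = section
        words = section.split()
        for c_idx, start in enumerate(range(0, len(words), child_words)):
            children.append(
                Chunk(
                    chunk_id=f"{parent_id}.c{c_idx}",
                    parent_id=parent_id,
                    doc_id=doc_id,
                    text=" ".join(words[start:start + child_words]),
                )
            )
    return parents, children
```

At query time you would embed and search the children, then look up `parents[child.parent_id]` to build the context while citing `child.chunk_id`.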

Embeddings

and version them

Pick a strong general embedding model and pin its version. The non-negotiable: version embeddings so you can re-index when you upgrade — a model swap silently invalidates the whole index otherwise. Embedding choice is reversible; treating it as permanent is the trap.

Vector store

fit to scale

At ~5M docs you are in the zone where the choice is real — see the matrix below. Whatever you pick must do metadata-filtered search natively so ACL filtering happens in the query, not after.

Hybrid retrieval

BM25 + dense, fused

Dense vectors catch paraphrase and intent; sparse BM25 catches exact terms — error codes, ticket IDs, acronyms, product SKUs that embeddings smear together. Run both, merge with reciprocal rank fusion. Dense-only RAG quietly fails on the exact-match queries enterprises ask most.
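Reciprocal rank fusion is small enough to sketch in full. Each ranker contributes 1 / (k + rank) per document and the contributions are summed; k = 60 is the conventional default. The document IDs in the usage example are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking.

    Each list is ordered best-first. Ties are broken by document ID so the
    output is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["E_4012-runbook", "vpn-faq", "travel-policy"]
dense_hits = ["vpn-faq", "network-guide", "E_4012-runbook"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

RRF uses only ranks, never raw scores, so you do not have to normalize BM25 scores against cosine similarities before merging.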

Reranking

precision before the window

Retrieve a wide top-k (say 50), then a cross-encoder reranker scores each against the query and keeps the best ~5. The retriever optimizes recall cheaply; the reranker buys precision before you spend context tokens. This is the highest-leverage quality fix in most RAG systems.

ACL-aware retrieval

filter at retrieval, never after

Tenant and permission filters are applied at query time as metadata constraints, so a user can never retrieve a chunk they cannot read. Post-hoc filtering of results is a data leak waiting to happen: forbidden chunks have already competed for top-k slots and shaped the ranking, and a single missed filter path exposes them.
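The difference between filtering during retrieval and filtering afterwards shows up even in a toy index. In the sketch below, `score` stands in for similarity to the current query, and both function names are hypothetical; a real vector store would express the same constraint as a metadata filter inside the search call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    tenant: str
    allowed_groups: frozenset[str]
    score: float  # stand-in for similarity to the current query


def _can_read(chunk: IndexedChunk, tenant: str, user_groups: frozenset[str]) -> bool:
    return chunk.tenant == tenant and bool(chunk.allowed_groups & user_groups)


def search_with_acl(index, tenant, user_groups, top_k):
    """Apply tenant and group constraints first, then rank and cut to top_k."""
    visible = [c for c in index if _can_read(c, tenant, user_groups)]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]


def search_then_filter(index, tenant, user_groups, top_k):
    """Anti-pattern: rank everything, cut to top_k, then drop forbidden chunks."""
    ranked = sorted(index, key=lambda c: c.score, reverse=True)[:top_k]
    return [c for c in ranked if _can_read(c, tenant, user_groups)]


index = [
    IndexedChunk("secret-1", "acme", frozenset({"finance"}), 0.95),
    IndexedChunk("secret-2", "acme", frozenset({"finance"}), 0.93),
    IndexedChunk("public-1", "acme", frozenset({"eng"}), 0.80),
    IndexedChunk("public-2", "acme", frozenset({"eng"}), 0.70),
]
pre = search_with_acl(index, "acme", frozenset({"eng"}), top_k=2)
post = search_then_filter(index, "acme", frozenset({"eng"}), top_k=2)
```

With these inputs `pre` holds the two chunks the engineer may read, while `post` is empty: the forbidden chunks consumed both top-k slots before the filter ran.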

The retrieval pipeline. Most of the answer-quality wins live here, not in the model.

Criterion	pgvector	Qdrant / Weaviate / Milvus
Already on Postgres	yes	separate service
Comfortable under ~10M vectors	yes	yes
Scales well past ~10M	no	yes
Native hybrid + filtered search	partial	yes
Operational simplicity	yes	new infra to run

At ~5M documents, the vector count will be a multiple of that once documents are split into child chunks, so with ACL filtering and hybrid search on the requirements a dedicated store earns its keep. If the org already runs Postgres and the vector count stays under ~10M, pgvector avoids standing up new infrastructure. State the threshold out loud; that's the senior move.

Freshness. The 5-minute SLA rules out nightly batch re-indexing. Stand up an incremental indexing pipeline: document changes emit events, a worker chunks and embeds only the delta and upserts into the store. Deletes and ACL changes must propagate just as fast — a stale chunk for a revoked document is both a wrong answer and a security problem.
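A sketch of the event-driven path, assuming three hypothetical event kinds (`upsert`, `delete`, `acl_change`) and an in-memory dictionary standing in for the real store. Embedding is omitted; in a real worker it would happen in the `upsert` branch.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    kind: str                                    # "upsert", "delete", or "acl_change"
    doc_id: str
    text: str = ""
    allowed_groups: frozenset[str] = frozenset()


@dataclass
class InMemoryIndex:
    # doc_id -> list of (chunk_text, allowed_groups); stand-in for the real store
    docs: dict[str, list[tuple[str, frozenset[str]]]] = field(default_factory=dict)

    def apply(self, event: ChangeEvent) -> None:
        """Apply one change event so the index reflects only the delta."""
        if event.kind == "upsert":
            chunks = [p.strip() for p in event.text.split("\n\n") if p.strip()]
            # Replace every chunk of the document so removed sections do not linger.
            self.docs[event.doc_id] = [(c, event.allowed_groups) for c in chunks]
        elif event.kind == "delete":
            self.docs.pop(event.doc_id, None)
        elif event.kind == "acl_change":
            if event.doc_id in self.docs:
                # Rewrite permissions in place; the text does not need re-chunking.
                self.docs[event.doc_id] = [
                    (c, event.allowed_groups) for c, _ in self.docs[event.doc_id]
                ]
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
```

Treating deletes and ACL changes as first-class events, rather than waiting for the next content update, is what keeps a revoked document from being served inside the freshness window.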

Retrieval comes before generation in your mental model. Two tradeoffs worth saying out loud:

Retrieve small / feed large

Retrieve

Small child chunks — sharp embeddings, precise matching, high recall@k.

Feed

The parent section — the model gets enough surrounding context to actually answer.

Cite

Back to the precise child, so citations are specific, not whole-page.

Dense vs sparse retrieval

Dense (vectors)

Wins on paraphrase, synonyms, intent — "how do I expense a flight?" finds the travel-reimbursement policy.

Sparse (BM25)

Wins on exact tokens — error code "E_4012", ticket "JIRA-8821", acronyms. Embeddings blur these.

Hybrid (RRF)

Take both and fuse. Almost always beats either alone on a mixed enterprise query stream.

Two decisions interviewers probe. Naming why hybrid beats dense-only — exact-match queries — signals you've operated RAG, not just read about it.

State the metrics you'll track from day one: recall@k (does the right chunk get retrieved at all), context precision (how much of the retrieved context is actually relevant), and faithfulness (does the answer stay grounded in it). These map cleanly onto the context hierarchy and context assembly — retrieval is just the dynamic tier of a context budget you allocate deliberately.

AI Engineering. Chip Huyen, O'Reilly, 2025.

Frames RAG evaluation as two separable concerns — retrieval quality (is the right context fetched?) and generation quality (is the answer grounded in it?) — and argues that most production RAG failures are retrieval failures masquerading as model failures. This separation drives the Phase T eval design directly.

Phase E — Engineer the loop

For v1 this is a workflow, not an agent — and you should say so and defend it. The control flow is fixed and known: retrieve → assemble context under budget → generate with citations → validate. There is no open-ended planning, no unpredictable tool use, no reason to hand the model the steering wheel. A deterministic workflow is cheaper, faster, easier to evaluate, and easier to debug. Reaching for an agent here is over-engineering, and a good interviewer is listening for whether you know the difference.

Query rewrite / expansion (recall)

Rewrite the raw question for retrieval: expand acronyms, decompose compound questions, add synonyms. This lifts recall, but it is an extra LLM hop with a latency cost, so use a cheap fast model and skip it for short keyword queries.

Result: +recall, +~100ms

Hybrid retrieve + rerank (fetch)

The Phase D pipeline: ACL-filtered BM25 + dense, RRF merge, cross-encoder rerank to the top ~5.

Result: recall@k → precision

Assemble context under budget (pack)

Fit the reranked chunks into the ~3k-token budget, parents included, with explicit source tags so the model can cite. Drop lowest-ranked chunks first when over budget.

Result: fits the window

Generate with citations (answer)

The model answers using only the provided context and must attach a citation to every claim. Stream tokens for perceived latency.

Result: cited draft

Validate (check)

Reject any answer with uncited claims; fall back to abstention. (Detail in Phase P.)

Result: cited answer or abstain

A deterministic five-step workflow. Each step is independently measurable — which is exactly what makes Phase T tractable.
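The context-assembly step can be sketched as a best-first pack that stops at the first chunk that would exceed the budget, which is equivalent to dropping the lowest-ranked chunks first. The `[source:ID]` tag format is an assumption, and `approx_tokens` is a crude whitespace count standing in for the model's real tokenizer.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: whitespace-separated word count."""
    return len(text.split())


def assemble_context(
    ranked: list[tuple[str, str]],
    budget: int,
    count: Callable[[str], int] = approx_tokens,
) -> str:
    """Pack (source_id, text) pairs, best first, until the budget is reached.

    Each block carries an explicit source tag so the model can cite it.
    """
    blocks: list[str] = []
    used = 0
    for source_id, text in ranked:
        block = f"[source:{source_id}]\n{text}"
        cost = count(block)
        if used + cost > budget:
            break  # everything ranked below this point is dropped too
        blocks.append(block)
        used += cost
    return "\n\n".join(blocks)
```

Stopping at the first overflow, rather than skipping it and trying smaller chunks further down, keeps the packed context a strict prefix of the reranker's order, which makes the step easier to reason about in evals.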

When it graduates to agentic. Two triggers: multi-hop questions that need iterative retrieval ("compare last quarter's policy to this quarter's" → retrieve, read, formulate a second retrieval), and live tool use (query a ticketing system or a dashboard rather than indexed text). At that point you adopt an agent loop — but only then, and you say what you'd watch (cost, latency, loop bounds) before flipping the switch.

Model routing. Route by query difficulty: a cheap model handles simple lookups and the query-rewrite hop; a frontier model handles synthesis and multi-source reasoning. A lightweight classifier (or a confidence/complexity heuristic) picks the tier. On timeout or provider error, fall back to a secondary model rather than failing the request. This is what closes the roughly 12× cost gap from the Phase A arithmetic.
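A minimal sketch of tier selection plus fallback. Everything here is hypothetical: the `Model` interface (prompt in, answer out), the keyword list, and the length threshold are placeholders for whatever classifier and client you would actually use.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: prompt in, answer out

SYNTHESIS_HINTS = frozenset({"compare", "difference", "summarize", "why"})
LONG_QUESTION_WORDS = 25  # assumed threshold, to be tuned against the golden set


def pick_tier(question: str) -> str:
    """Return 'frontier' for questions that look like synthesis, else 'cheap'."""
    words = question.lower().replace("?", " ").split()
    looks_hard = len(words) > LONG_QUESTION_WORDS or bool(SYNTHESIS_HINTS & set(words))
    return "frontier" if looks_hard else "cheap"


def answer_with_fallback(prompt: str, primary: Model, secondary: Model) -> str:
    """Try the primary model; on timeout or connection error use the secondary."""
    try:
        return primary(prompt)
    except (TimeoutError, ConnectionError):
        return secondary(prompt)
```

A keyword heuristic like this is only a starting point; the routing decision is itself something to measure, since every misrouted hard question costs quality and every misrouted easy one costs money.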

Phase P — Protect and optimize

The defining security fact of RAG: retrieved content is untrusted input. The moment a document lands in the context window, anything written in it can try to hijack the model.

Guardrails: retrieved docs are an attack surface

Indirect prompt injection. A poisoned wiki page or ticket ("ignore previous instructions and…") becomes an attack the instant it is retrieved. There is no reliable prompt-only fix: treat retrieved text as data, never as instructions, and isolate it structurally in the prompt.

The lethal trifecta. Private data + untrusted content + external communication = exfiltration. You already have the first two. Keep the third off: no outbound tools, no rendering of model-emitted links/images. Break the triangle and injection cannot phone home.

PII handling. Redact or tokenize sensitive fields at indexing time; enforce ACLs at retrieval so a user never sees data they lack rights to; scrub PII from logs and traces.

Output validation = hallucination guard. Every claim must map to a retrieved chunk. Reject answers with uncited claims and abstain instead. This single check is simultaneously your citation enforcement and your primary hallucination defense, and it operationalizes the "prefer I don't know" constraint from Phase A.

Citation validation does double duty: it is the guardrail and the quality gate. An uncited claim is an ungrounded claim — reject it.
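A sketch of the citation gate, under two assumptions: the model emits citations as `[source:ID]` tags, and each tag sits before the sentence's closing punctuation. The sentence splitter is deliberately naive, and the abstention wording is a placeholder.

```python
import re

CITATION = re.compile(r"\[source:([^\]\s]+)\]")
ABSTAIN = "I don't know based on the documents I can access."


def validate_citations(answer: str, retrieved_ids: set[str]) -> str:
    """Return the answer only if every sentence cites a retrieved chunk.

    Any sentence with no citation, or with a citation to a chunk that was
    not in the retrieved set, turns the whole response into an abstention.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return ABSTAIN
    for sentence in sentences:
        cited = set(CITATION.findall(sentence))
        if not cited or not cited <= retrieved_ids:
            return ABSTAIN
    return answer
```

This checks only that citations exist and point at chunks that were actually retrieved. Whether the cited chunk supports the claim is a separate question, which is what the citation-accuracy metric in Phase T measures.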

See prompt injection for why the input/instruction boundary cannot be patched with prompt wording alone.

Cost and latency levers. Prompt caching the stable system prefix cuts input cost and TTFT on every call: the system prompt and any fixed instructions are identical across queries, so cache them. Semantic caching of repeated or near-duplicate questions ("how do I reset my password?" gets asked hundreds of times) serves a cached answer for a fraction of the cost, with a freshness-aware TTL so cached answers don't outlive the docs behind them and a cache key scoped to the asker's permissions so a cached answer never crosses an ACL boundary. Streaming hides latency: a first token in a few hundred ms reads as fast even when full synthesis takes longer, which is how you live under the p95 ≤ 2s bar.
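A toy semantic cache with a freshness TTL, to make the idea concrete. Token-set Jaccard overlap stands in for embedding similarity, the 300-second TTL mirrors the 5-minute freshness SLA, and the `scope` argument (an assumption added here) keeps cached answers from crossing permission boundaries in a multi-tenant deployment.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    question_tokens: frozenset[str]
    scope: str          # e.g. tenant plus permission group of the asker
    answer: str
    expires_at: float


def tokens(question: str) -> frozenset[str]:
    return frozenset(question.lower().replace("?", " ").split())


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class SemanticCache:
    def __init__(self, ttl_seconds: float = 300.0, threshold: float = 0.8):
        self.ttl_seconds = ttl_seconds
        self.threshold = threshold
        self.entries: list[CacheEntry] = []

    def put(self, question: str, answer: str, scope: str,
            now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self.entries.append(
            CacheEntry(tokens(question), scope, answer, now + self.ttl_seconds)
        )

    def get(self, question: str, scope: str,
            now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        # Drop expired entries so answers never outlive the TTL.
        self.entries = [e for e in self.entries if e.expires_at > now]
        q = tokens(question)
        candidates = [e for e in self.entries if e.scope == scope]
        best = max(candidates, key=lambda e: jaccard(q, e.question_tokens), default=None)
        if best is not None and jaccard(q, best.question_tokens) >= self.threshold:
            return best.answer
        return None
```

A production version would compare embeddings rather than token sets, and would also invalidate entries when the documents behind an answer change, not just when the TTL expires.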

Phase T — Test and evolve

This is the phase that separates levels — and the one candidates most often skip. You cannot unit-test a probabilistic system; you evaluate it. The core move is to separate retrieval metrics from generation metrics, because they fail for different reasons and you fix them in different places.

Start with a golden set of ~100–500 question / answer / source triples, drawn from real query logs and labeled by people who know the domain. This is the asset the whole hill-climb runs on.

Two metric families, measured separately: retrieval failures masquerade as model failures

recall@k (process)

Is the correct chunk in the top-k retrieved?

If recall@k is low, no model can answer — fix retrieval (chunking, hybrid weights, reranker), not the prompt.

MRR (process)

How high does the right chunk rank?

Rewards getting the answer near the top, where the reranker and context budget can actually use it.

Faithfulness / groundedness (runtime)

Is every claim supported by the retrieved context?

The headline anti-hallucination metric. Scored by a binary LLM-as-judge, validated against human labels.

Answer relevance (runtime)

Does the answer actually address the question?

Catches the grounded-but-useless answer — faithful to context that wasn't what the user asked.

Citation accuracy (runtime)

Do the cited chunks actually support the cited claims?

A citation to the wrong chunk is worse than none — it manufactures false trust.

Diagnose retrieval and generation independently. A low faithfulness score with high recall@k is a generation/prompt problem; low recall@k is a retrieval problem. Conflating them sends you fixing the wrong layer.
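The two retrieval metrics are simple to compute over a golden set. In this sketch, `retrieved` is one ranked list of chunk IDs per question and `relevant` is the matching set of correct chunk IDs. `recall_at_k` follows the reading used above (is a correct chunk in the top k?); when a question has several relevant chunks, recall is often computed instead as the fraction of them retrieved.

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of questions with at least one relevant chunk in the top k."""
    if not retrieved:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(retrieved, relevant) if gold & set(ranked[:k])
    )
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[set[str]]) -> float:
    """Mean of 1/rank of the first relevant chunk; 0 for questions with none."""
    if not retrieved:
        return 0.0
    total = 0.0
    for ranked, gold in zip(retrieved, relevant):
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in gold:
                total += 1.0 / rank
                break
    return total / len(retrieved)


runs = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
gold = [{"b"}, {"z"}, {"q"}]
```

On this toy data, recall@2 is 1/3 and recall@3 is 2/3: the second question's answer sits at rank 3, which is exactly the kind of case a reranker is meant to fix.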

LLM-as-judge scores faithfulness and relevance at scale — but only after you've validated the judge against human labels (binary verdicts agree more reliably than 1–5 scores). Online, collect thumbs up/down, log every low-confidence answer and abstention, and promote production failures into the golden set so the eval suite grows toward where the system actually breaks. Wire a regression gate into CI: no prompt, model, embedding, or chunking change ships without clearing the golden-set bar — a chunking tweak that helps one query class silently regresses another, and only the gate catches it.
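The regression gate itself can be very small. The sketch below compares candidate metrics against a stored baseline and reports anything that dropped by more than a tolerance; the metric names, the per-query-class breakdown, and the 0.01 tolerance are illustrative assumptions.

```python
def regression_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> list[str]:
    """Return the metrics that regressed; an empty list means the change may ship.

    A metric fails if it is missing from the candidate run or falls more
    than `tolerance` below its baseline value.
    """
    failures = []
    for metric, base in baseline.items():
        new = candidate.get(metric)
        if new is None or new < base - tolerance:
            failures.append(metric)
    return failures


baseline = {"recall@5/paraphrase": 0.90, "recall@5/exact_match": 0.85, "faithfulness": 0.96}
candidate = {"recall@5/paraphrase": 0.94, "recall@5/exact_match": 0.78, "faithfulness": 0.96}
failed = regression_gate(baseline, candidate)
```

Here a chunking change lifts paraphrase recall but drops exact-match recall, so `failed` contains `recall@5/exact_match`. Tracking metrics per query class is what makes that trade visible instead of averaging it away.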

This is the "metrics that matter" and "replay-and-evals" loop applied to RAG: the team's job is to operate this hill-climb, not to ship once.

Common mistakes

✓ What earns the offer

Quantify first — token budget, $/query, the cost-of-wrong-answer question — then architect.

Treat retrieval quality as the dominant lever, with measured recall@k and faithfulness.

Hybrid retrieval + reranking + ACL filtering at query time.

A workflow for v1, with a stated trigger for going agentic.

Citation validation as both guardrail and hallucination gate; a real abstention path.

An eval strategy that separates retrieval from generation metrics, gated in CI.

✕ Classic red flags

Jumping to boxes before quantifying anything.

Treating retrieval as an afterthought — "embed it and search."

No eval strategy, or "I'll add evals at the end."

Ignoring ACLs, or filtering results after retrieval instead of during.

Forgetting retrieved docs are untrusted — no indirect-injection defense.

"Just use a bigger context window" instead of retrieving well — and no way to say "I don't know."

The red flags are every candidate who has read about RAG but never operated it. The first list is someone who has shipped one.

Next, run the framework on a system that takes actions instead of just answering: the Customer Support Agent.
