Design a RAG Knowledge Assistant
The prompt. "Design a question-answering assistant over a company's internal knowledge base — wikis, docs, tickets. Employees ask natural-language questions and get cited answers."
This is the single most-asked agentic design question, and the one most candidates fumble — not because the architecture is hard, but because they jump to "vector DB plus an LLM" before they have quantified anything. This page runs the whole thing end to end on the ADEPT framework: Align, Design the knowledge layer, Engineer the loop, Protect and optimize, Test and evolve. Treat it as the answer you model your own on.
First, anchor the numbers. The prompt gives you none, so you state the constraints you'd clarify and proceed against concrete targets — designing in a vacuum is the fastest way to lose the room.
Phase A — Align
Functional scope, smallest viable first. Single-turn cited Q&A is v1; multi-turn follow-ups (resolving "what about for contractors?" against prior context) come next. Citations are non-negotiable — every answer links to the source chunks it used. The system is read-only, so blast radius is low: the worst case is a wrong answer, not a deleted database. That framing matters — it justifies a lighter autonomy posture than an agent that takes actions.
Non-functionals as numbers. Classic system design does capacity estimation in requests and bytes; here you do it in tokens and dollars. Assume each query assembles ~3k context tokens (a handful of reranked chunks plus the system prompt) and emits ~300 output tokens.
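The token-and-dollar estimate can be sketched in a few lines. Only the ~3k input and ~300 output tokens come from the text above; the per-token prices and the daily query volume are placeholder assumptions, set a factor of ten apart purely for illustration, and should be replaced with your provider's real numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical list price in dollars per one million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def cost_per_query(price: ModelPrice, input_tokens: int = 3_000,
                   output_tokens: int = 300) -> float:
    """Dollar cost of one query at the token counts assumed in the text."""
    return (input_tokens * price.input_per_mtok
            + output_tokens * price.output_per_mtok) / 1_000_000


def monthly_cost(price: ModelPrice, queries_per_day: int, days: int = 30) -> float:
    """Dollar cost of a month of traffic at a constant daily volume."""
    return cost_per_query(price) * queries_per_day * days


# Placeholder prices and volume (assumptions, not quotes from any provider).
FRONTIER = ModelPrice(input_per_mtok=3.00, output_per_mtok=15.00)
CHEAP = ModelPrice(input_per_mtok=0.30, output_per_mtok=1.50)
QUERIES_PER_DAY = 50_000

if __name__ == "__main__":
    for name, price in (("frontier", FRONTIER), ("cheap", CHEAP)):
        print(f"{name}: ${cost_per_query(price):.5f}/query, "
              f"${monthly_cost(price, QUERIES_PER_DAY):,.0f}/month")
```

Under these assumed prices the input tokens dominate the bill, which is why context size is the first number to pin down.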
The killer clarifying question: what is the cost of a wrong answer versus the cost of an abstention? For an HR-policy or security-runbook assistant, a confident wrong answer is a real liability and abstention is cheap — so you tune hard for faithfulness and accept lower coverage. For a brainstorming aid, the calculus flips. This one question sets the entire guardrail and validation posture downstream; ask it before you draw a single box.
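One way to turn that question into a number is a simple expected-cost rule. This is an illustrative formalization, not something the prompt gives you: it assumes a correct answer costs nothing, a wrong answer costs `cost_wrong`, an abstention costs `cost_abstain`, and the system has a calibrated probability that its answer is correct. Answering is then worthwhile only when `(1 - p) * cost_wrong <= cost_abstain`.

```python
def min_confidence_to_answer(cost_wrong: float, cost_abstain: float) -> float:
    """Smallest probability of being correct at which answering beats abstaining.

    Derived from (1 - p) * cost_wrong <= cost_abstain, clamped to [0, 1].
    The cost units are arbitrary; only their ratio matters.
    """
    if cost_wrong <= 0:
        raise ValueError("cost_wrong must be positive")
    return min(1.0, max(0.0, 1.0 - cost_abstain / cost_wrong))


# Hypothetical cost ratios for the two cases discussed in the text.
hr_policy_threshold = min_confidence_to_answer(cost_wrong=100.0, cost_abstain=1.0)
brainstorm_threshold = min_confidence_to_answer(cost_wrong=1.0, cost_abstain=2.0)
```

With the assumed ratios, the HR-policy assistant should answer only when it is about 99% sure, while the brainstorming aid should always answer.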
The opening minutes set your level more than any other stretch. A strong candidate turns a vague prompt into a measured problem; a weak one starts drawing boxes around an unquantified void.
- Every component choice later cites a number from this phase.
- The cost-of-wrong-answer question decides how aggressive your abstention and citation-validation gates are.
- Read-only scope is a gift — say so, and design proportionally.
Phase D — Design the knowledge layer
This is the heart of a RAG problem, and where depth earns the most credit. The model knows nothing about the company; retrieval quality is the dominant lever on answer quality. Walk the pipeline.
Chunking
recursive / semantic, parent-child
Split on structure (headings, sections) with recursive fallback, not blind fixed-size windows. Use parent-child: index small, precise child chunks for retrieval, but feed the larger parent section to the model so it has surrounding context. Store chunk → parent → document lineage for citations.
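A minimal sketch of parent-child chunking, assuming the document has already been split into sections on its headings. The `Chunk` type, the ID scheme, and the 200-character child size are illustrative choices, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str   # child ID, used in citations
    parent_id: str  # section the child came from
    doc_id: str     # source document
    text: str


def split_parent_child(doc_id: str, sections: list[str], max_chars: int = 200):
    """Return (parents, children) for a document pre-split on headings."""
    parents: dict[str, str] = {}
    children: list[Chunk] = []
    for s_idx, section in enumerate(sections):
        parent_id = f"{doc_id}#s{s_idx}"
        parents[parent_id] = section
        pieces: list[str] = []
        # Recursive fallback: paragraphs first, fixed windows only if too long.
        for para in section.split("\n\n"):
            para = para.strip()
            if not para:
                continue
            if len(para) <= max_chars:
                pieces.append(para)
            else:
                pieces.extend(para[i:i + max_chars]
                              for i in range(0, len(para), max_chars))
        for c_idx, piece in enumerate(pieces):
            children.append(Chunk(f"{parent_id}.c{c_idx}", parent_id, doc_id, piece))
    return parents, children


def expand_to_parents(hits: list[Chunk], parents: dict[str, str]) -> list[str]:
    """Swap retrieved children for their parent sections, deduplicated in order."""
    seen: list[str] = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.append(hit.parent_id)
    return [parents[p] for p in seen]
```

Children go into the index; `expand_to_parents` runs after retrieval so the model sees the whole section, and the child IDs give you the chunk, parent, and document lineage for citations.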
Embeddings
and version them
Pick a strong general embedding model and pin its version. The non-negotiable: version embeddings so you can re-index when you upgrade — a model swap silently invalidates the whole index otherwise. Embedding choice is reversible; treating it as permanent is the trap.
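One way to make versioning hard to forget is to bake the model and version into the index name, so vectors from two models can never share an index. The naming scheme and the model name below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    model: str    # hypothetical model name
    version: str  # pinned version string
    dim: int      # vector dimensionality

    @property
    def index_name(self) -> str:
        """Index name that changes whenever the model or version changes."""
        raw = f"kb__{self.model}__{self.version}"
        return raw.replace("-", "_").replace(".", "_")


def needs_reindex(stored: EmbeddingSpec, active: EmbeddingSpec) -> bool:
    """True when the vectors on disk were not produced by the active spec."""
    return stored != active


old = EmbeddingSpec(model="example-embed", version="2024.06", dim=768)
new = EmbeddingSpec(model="example-embed", version="2025.01", dim=1024)
```

An upgrade then means building the new index alongside the old one and switching queries over once it is full, rather than mixing incompatible vectors in place.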
Vector store
fit to scale
At ~5M docs you are in the zone where the choice is real — see the matrix below. Whatever you pick must do metadata-filtered search natively so ACL filtering happens in the query, not after.
Hybrid retrieval
BM25 + dense, fused
Dense vectors catch paraphrase and intent; sparse BM25 catches exact terms — error codes, ticket IDs, acronyms, product SKUs that embeddings smear together. Run both, merge with reciprocal rank fusion. Dense-only RAG quietly fails on the exact-match queries enterprises ask most.
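Reciprocal rank fusion needs only the ranked ID lists from each retriever, not their raw scores, which is why it merges BM25 and dense results cleanly. A minimal sketch follows; `k = 60` is the constant commonly used for RRF, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["ticket-4821", "vpn-guide", "sso-faq"]          # exact-term matches
dense_hits = ["vpn-guide", "remote-access", "ticket-4821"]   # paraphrase matches
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents that both retrievers rank well rise to the top, while a document found by only one retriever still survives into the candidate set for reranking.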
Reranking
precision before the window
Retrieve a wide top-k (say 50), then have a cross-encoder reranker score each candidate against the query and keep the best ~5. The retriever optimizes recall cheaply; the reranker buys precision before you spend context tokens. This is often the highest-leverage quality fix in a RAG system.
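The retrieve-wide, rerank-narrow shape can be sketched with a pluggable scoring function. The `token_overlap` scorer below is only a stand-in so the example runs without a model; in a real system `score_fn` would call a cross-encoder.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], keep: int = 5) -> list[str]:
    """Score every candidate against the query and keep the best `keep`."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:keep]


def token_overlap(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query tokens that appear in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


candidates = [
    "Holiday calendar for 2024",
    "How to reset your VPN password",
    "VPN client install guide",
]
top = rerank("reset vpn password", candidates, token_overlap, keep=2)
```

The interface is the point: the reranker sees the query and passage together, which a bi-encoder retriever never does.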
ACL-aware retrieval
filter at retrieval, never after
Tenant and permission filters are applied at query time as metadata constraints, so a user can never retrieve a chunk they cannot read. Post-hoc filtering of results is a data leak waiting to happen: forbidden chunks have already competed for top-k slots and travelled through the pipeline, so a single missed filter exposes them.
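The difference between filtering in the query and filtering afterwards is easiest to see in a toy in-memory search. This is a brute-force sketch with made-up group names; a real vector store would express the same constraint as a metadata filter on the search call.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    vector: tuple[float, ...]
    allowed_groups: frozenset[str]  # groups permitted to read this chunk


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index: list[IndexedChunk], query_vec, user_groups: set[str],
           top_k: int = 5) -> list[str]:
    """Apply the ACL first, then rank only what the user may read."""
    visible = [c for c in index if c.allowed_groups & user_groups]
    visible.sort(key=lambda c: cosine(c.vector, query_vec), reverse=True)
    return [c.chunk_id for c in visible[:top_k]]


index = [
    IndexedChunk("hr-salary#0", (1.0, 0.0), frozenset({"hr"})),
    IndexedChunk("eng-onboard#0", (0.6, 0.8), frozenset({"eng", "hr"})),
]
```

An engineer querying with the vector closest to the HR chunk still gets only the engineering chunk, because the forbidden chunk never enters the candidate list.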
| Criterion | pgvector | Qdrant / Weaviate / Milvus |
|---|---|---|
| Already on Postgres | yes | separate service |
| Comfortable under ~10M vectors | yes | yes |
| Scales well past 10M+ | no | yes |
| Native hybrid + filtered search | partial | yes |
| Operational simplicity | yes | new infra to run |
Freshness. The 5-minute SLA rules out nightly batch re-indexing. Stand up an incremental indexing pipeline: document changes emit events, a worker chunks and embeds only the delta and upserts into the store. Deletes and ACL changes must propagate just as fast — a stale chunk for a revoked document is both a wrong answer and a security problem.
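The event-driven indexer can be sketched as a single `apply` method over three event kinds. The event shape and the in-memory store are hypothetical, and the embedding call is left out; a real worker would embed each new chunk before the upsert.

```python
from dataclasses import dataclass


@dataclass
class ChangeEvent:
    kind: str  # "upsert", "delete", or "acl"
    doc_id: str
    text: str = ""
    allowed_groups: frozenset = frozenset()


class InMemoryIndex:
    """Toy stand-in for a vector store, keyed by chunk ID."""

    def __init__(self) -> None:
        self.chunks: dict[str, dict] = {}

    def _drop(self, doc_id: str) -> None:
        stale = [cid for cid, rec in self.chunks.items() if rec["doc_id"] == doc_id]
        for cid in stale:
            del self.chunks[cid]

    def apply(self, event: ChangeEvent) -> None:
        if event.kind == "delete":
            self._drop(event.doc_id)
        elif event.kind == "acl":
            # Permission changes touch metadata only; no re-embedding needed.
            for rec in self.chunks.values():
                if rec["doc_id"] == event.doc_id:
                    rec["allowed_groups"] = event.allowed_groups
        elif event.kind == "upsert":
            self._drop(event.doc_id)  # remove chunks from the old version first
            paragraphs = [p.strip() for p in event.text.split("\n\n") if p.strip()]
            for i, para in enumerate(paragraphs):
                self.chunks[f"{event.doc_id}#{i}"] = {
                    "doc_id": event.doc_id,
                    "text": para,
                    "allowed_groups": event.allowed_groups,
                }
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
```

Dropping a document's old chunks before inserting the new ones is what keeps a shortened document from leaving orphaned chunks behind.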
Retrieval comes before generation in your mental model. Two tradeoffs worth saying out loud:
State the metrics you'll track from day one: recall@k (does the right chunk get retrieved at all), context precision (how much of the retrieved context is actually relevant), and faithfulness (does the answer stay grounded in it). These map cleanly onto the context hierarchy and context assembly — retrieval is just the dynamic tier of a context budget you allocate deliberately.
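The two retrieval metrics are simple set arithmetic over chunk IDs. The sketch below uses the plain, unweighted definitions; some evaluation libraries use rank-weighted variants of context precision. Faithfulness is not shown because it needs a judge, human or model.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)


retrieved = ["a", "b", "c", "d"]   # ranked chunk IDs returned for one query
relevant = {"a", "d", "z"}         # labeled source chunks from the golden set
```

Here recall@2 is 1/3, recall@4 is 2/3, and context precision is 0.5: chunk `z` was never retrieved, and half of what was retrieved is noise.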
Phase E — Engineer the loop
For v1 this is a workflow, not an agent — and you should say so and defend it. The control flow is fixed and known: retrieve → assemble context under budget → generate with citations → validate. There is no open-ended planning, no unpredictable tool use, no reason to hand the model the steering wheel. A deterministic workflow is cheaper, faster, easier to evaluate, and easier to debug. Reaching for an agent here is over-engineering, and a good interviewer is listening for whether you know the difference.
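The fixed control flow fits in one function. In this sketch `retrieve` and `generate` are injected callables standing in for the retrieval stack and the model call, the word count is a crude proxy for tokens, and the bracketed citation format is an assumption.

```python
import re
from typing import Callable

ABSTAIN = "I could not find a supported answer in the knowledge base."
CITATION = re.compile(r"\[(\w[\w#.-]*)\]")  # assumed format: [chunk-id]


def validate_citations(answer: str, context_ids: set[str]) -> bool:
    """Require at least one citation, all pointing at chunks in the context."""
    cited = set(CITATION.findall(answer))
    return bool(cited) and cited <= context_ids


def answer_question(question: str,
                    retrieve: Callable[[str], list[tuple[str, str]]],
                    generate: Callable[[str, list[tuple[str, str]]], str],
                    token_budget: int = 3_000) -> str:
    # 1. Retrieve ranked (chunk_id, text) pairs.
    chunks = retrieve(question)
    # 2. Assemble context under the budget, best chunks first.
    context, used = [], 0
    for chunk_id, text in chunks:
        cost = len(text.split())
        if used + cost > token_budget:
            break
        context.append((chunk_id, text))
        used += cost
    # 3. Generate with citations.
    answer = generate(question, context)
    # 4. Validate: abstain rather than return an unsupported answer.
    if not validate_citations(answer, {cid for cid, _ in context}):
        return ABSTAIN
    return answer
```

Every step is an ordinary function call, so each one can be tested and timed on its own.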
When it graduates to agentic. Two triggers: multi-hop questions that need iterative retrieval ("compare last quarter's policy to this quarter's" → retrieve, read, formulate a second retrieval), and live tool use (query a ticketing system or a dashboard rather than indexed text). At that point you adopt an agent loop — but only then, and you say what you'd watch (cost, latency, loop bounds) before flipping the switch.
Model routing. Route by query difficulty: a cheap model handles simple lookups and the query-rewrite hop; a frontier model handles synthesis and multi-source reasoning. A lightweight classifier (or a confidence/complexity heuristic) picks the tier. On timeout or provider error, fall back to a secondary model rather than failing the request. This is what closes the 10× cost gap from the Phase A arithmetic.
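A routing heuristic and a fallback wrapper can each be a few lines. The cue words, the length cutoff, and the tier names below are illustrative assumptions; a trained classifier would replace `pick_tier` once you have labeled traffic.

```python
import re
from typing import Callable

# Hypothetical cue words suggesting multi-source synthesis.
SYNTHESIS_CUES = {"compare", "versus", "difference", "tradeoff", "summarize"}


def pick_tier(question: str, long_question_words: int = 25) -> str:
    """Return "frontier" for synthesis-style questions, "cheap" otherwise."""
    words = re.findall(r"[a-z]+", question.lower())
    if SYNTHESIS_CUES & set(words) or len(words) > long_question_words:
        return "frontier"
    return "cheap"


def call_with_fallback(prompt: str,
                       primary: Callable[[str], str],
                       secondary: Callable[[str], str]) -> str:
    """Try the primary model; on timeout or connection error use the secondary."""
    try:
        return primary(prompt)
    except (TimeoutError, ConnectionError):
        return secondary(prompt)
```

Only transport-level failures trigger the fallback here; other exceptions propagate so that real bugs are not silently masked by the secondary model.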
Phase P — Protect and optimize
The defining security fact of RAG: retrieved content is untrusted input. The moment a document lands in the context window, anything written in it can try to hijack the model.
See prompt injection for why the input/instruction boundary cannot be patched with prompt wording alone.
Cost and latency levers. Prompt caching the stable system prefix cuts input cost and time to first token (TTFT) on every call: the system prompt and tool definitions are identical across queries, so cache them. Semantic caching of repeated or near-duplicate questions ("how do I reset my password?" gets asked hundreds of times) serves a cached answer for a fraction of the cost, with a freshness-aware TTL so cached answers don't outlive the docs behind them. Streaming hides latency: a first token in a few hundred ms reads as fast even when full synthesis takes longer, which is how you live under the p95 ≤ 2s bar.
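A semantic cache with a TTL can be sketched as a list of (query vector, answer, timestamp) entries. The 0.95 similarity threshold is an assumed starting point to tune, and the 300-second TTL simply mirrors the 5-minute freshness target; the linear scan stands in for a real vector lookup.

```python
import math
import time
from typing import Callable, Optional


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = threshold
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: list[tuple[tuple, str, float]] = []

    def put(self, query_vec, answer: str) -> None:
        self._entries.append((tuple(query_vec), answer, self._clock()))

    def get(self, query_vec) -> Optional[str]:
        now = self._clock()
        # Evict expired entries so answers cannot outlive the TTL.
        self._entries = [e for e in self._entries if now - e[2] < self._ttl]
        best = max(self._entries, key=lambda e: cosine(e[0], query_vec), default=None)
        if best is not None and cosine(best[0], query_vec) >= self._threshold:
            return best[1]
        return None
```

This sketch ignores permissions. Since answers are built from ACL-filtered chunks, a production cache would presumably also need to key entries on the caller's access scope so one user's answer is never served to another with narrower rights.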
Phase T — Test and evolve
This is the phase that separates levels — and the one candidates most often skip. You cannot unit-test a probabilistic system; you evaluate it. The core move is to separate retrieval metrics from generation metrics, because they fail for different reasons and you fix them in different places.
Start with a golden set of ~100–500 question / answer / source triples, drawn from real query logs and labeled by people who know the domain. This is the asset the whole hill-climb runs on.
LLM-as-judge scores faithfulness and relevance at scale — but only after you've validated the judge against human labels (binary verdicts agree more reliably than 1–5 scores). Online, collect thumbs up/down, log every low-confidence answer and abstention, and promote production failures into the golden set so the eval suite grows toward where the system actually breaks. Wire a regression gate into CI: no prompt, model, embedding, or chunking change ships without clearing the golden-set bar — a chunking tweak that helps one query class silently regresses another, and only the gate catches it.
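Both checks in this paragraph reduce to small comparisons. The sketch below measures judge-versus-human agreement on binary verdicts and flags any metric that drops past a tolerance; the metric names, scores, and the 0.01 tolerance are made-up examples.

```python
def judge_agreement(judge: list[bool], human: list[bool]) -> float:
    """Fraction of golden-set items where the judge matches the human label."""
    if len(judge) != len(human) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def regression_gate(baseline: dict[str, float], candidate: dict[str, float],
                    tolerance: float = 0.01) -> list[str]:
    """Return the metrics that regressed by more than `tolerance`.

    A metric missing from the candidate run counts as a regression.
    """
    return sorted(name for name, base in baseline.items()
                  if candidate.get(name, 0.0) < base - tolerance)


baseline = {"recall@5": 0.82, "faithfulness": 0.91}
candidate = {"recall@5": 0.85, "faithfulness": 0.88}
failures = regression_gate(baseline, candidate)
```

In CI the build fails when `failures` is non-empty. Keying the metrics per query class (for example `"policy/recall@5"`) is one way to catch a change that helps one class while hurting another.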
This applies the "metrics that matter" and replay-and-evals loops to RAG: the team's job is to operate this hill-climb, not to ship once.
Common mistakes
Next, run the framework on a system that takes actions instead of just answering: the Customer Support Agent.