Every great interview-prep system gives you one thing above all: a script you can fall back on when your mind goes blank. HelloInterview has its delivery framework; Grokking has RESHADED. The agentic system-design interview needs its own, because the classic ones optimize for the wrong things: they think in requests, bytes, and consistency, while an LLM system lives or dies on tokens, non-determinism, retrieval quality, and a correctness you cannot unit-test.
ADEPT is that script. Five phases, roughly forty-five minutes, each mapping to a cluster of Foundations. The name is the trait it signals: run it well and you look adept; the interviewer sees someone who has done this before.
Classic system design asks "how do you handle ten million requests." Agentic system design asks "how do you keep quality high and cost bounded when every call is probabilistic, your knowledge is stale the moment you index it, and a single poisoned document can exfiltrate your data." Different question. Different script.
A · Align on the problem (~5–8 min)
Pin the use case and state the quality bar as a number. Functional scope: single- vs multi-turn, tools, citations, autonomy level. Then the non-functionals unique to LLM systems — latency (TTFT + tokens/sec, streaming?), cost per query (the token budget), faithfulness target, hallucination tolerance, freshness, scale (QPS), multi-tenancy and ACLs. Do a token-budget estimate the way classic design does capacity estimation.
tokens/req × $/token → $/query · × queries/day → $/day
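A minimal sketch of the token-budget estimate, written as back-of-envelope arithmetic. The prices per million tokens and the 1,500/500 input/output split are placeholder assumptions for illustration, not any provider's real rates; the roughly 2k tokens per request and 300 QPS are the figures used in the strong-opening example later in this piece.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    input_tokens: int         # prompt tokens per request
    output_tokens: int        # generated tokens per request
    qps: float                # average queries per second
    usd_per_m_input: float    # hypothetical price per 1M input tokens
    usd_per_m_output: float   # hypothetical price per 1M output tokens

    def cost_per_query(self) -> float:
        # Input and output tokens are priced separately, so sum them separately.
        total = (self.input_tokens * self.usd_per_m_input
                 + self.output_tokens * self.usd_per_m_output)
        return total / 1_000_000

    def cost_per_day(self) -> float:
        # 86,400 seconds in a day; assumes the QPS figure is a daily average.
        return self.cost_per_query() * self.qps * 86_400


budget = TokenBudget(
    input_tokens=1_500,
    output_tokens=500,
    qps=300,
    usd_per_m_input=3.0,    # assumption, not a quoted price
    usd_per_m_output=15.0,  # assumption, not a quoted price
)
print(f"$/query: {budget.cost_per_query():.4f}")
print(f"$/day:   {budget.cost_per_day():,.0f}")
```

Under these assumed prices the estimate comes to about $0.012 per query and about $311,000 per day, which is the kind of number worth saying out loud before any architecture is drawn.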
D · Design the knowledge layer (~8–10 min)
What must the model know that it doesn't? Choose the adaptation strategy on the prompt → RAG → fine-tune ladder; default to RAG and justify anything heavier. For RAG: chunking, embeddings, vector store, hybrid retrieval (BM25 + dense), reranking, freshness, ACL-aware retrieval. Say how you'll measure retrieval quality — recall@k, context precision, faithfulness.
recall@k · faithfulness
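One way to make "hybrid retrieval" and "recall@k" concrete is the sketch below. It fuses a BM25 ranking and a dense ranking with reciprocal rank fusion (RRF), which is one common fusion choice and an assumption here, since the text only says BM25 + dense. The constant k=60 is a conventional default, and the document IDs and relevance labels are made up.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Hypothetical rankings for one query from the two retrievers.
bm25_ranking = ["d1", "d4", "d2"]
dense_ranking = ["d2", "d1", "d9"]
relevant_docs = {"d1", "d2"}  # hand-labeled ground truth (made up)

fused = rrf_fuse([bm25_ranking, dense_ranking])
print("fused:", fused)
print("BM25 recall@2: ", recall_at_k(bm25_ranking, relevant_docs, 2))
print("fused recall@2:", recall_at_k(fused, relevant_docs, 2))
```

In this toy case BM25 alone finds one of the two relevant documents in its top 2, while the fused list finds both. In practice the same function would be averaged over a labeled query set rather than a single query.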
E · Engineer the agent loop (~10–12 min)
The control flow, and the heart of the round. Decide workflow vs agent — and defend choosing the simpler one. Lay out the loop: tool/function calling, planning, memory (short and long), orchestration (orchestrator-worker when subtasks are unpredictable), error recovery, human-in-the-loop gates. Add model selection and routing — cheap model for easy turns, frontier for hard — with fallbacks.
workflow vs agent · routing
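The routing idea (cheap model for easy turns, frontier for hard, with fallbacks) can be sketched as follows. Everything here is hypothetical: the model clients are stubs, `ModelError` is an invented exception type, and the `looks_hard` heuristic is a placeholder that a real system would likely replace with a classifier or a confidence signal.

```python
from typing import Callable, Optional

ModelFn = Callable[[str], str]


class ModelError(Exception):
    """Raised by a (hypothetical) model client on timeout or rate limit."""


def looks_hard(prompt: str) -> bool:
    # Placeholder heuristic (an assumption): long or multi-step prompts are hard.
    return len(prompt.split()) > 50 or "step by step" in prompt.lower()


def route(prompt: str, cheap: ModelFn, frontier: ModelFn,
          is_hard: Callable[[str], bool] = looks_hard) -> str:
    """Try the preferred model first, then fall back to the other one."""
    order = [frontier, cheap] if is_hard(prompt) else [cheap, frontier]
    last_error: Optional[ModelError] = None
    for model in order:
        try:
            return model(prompt)
        except ModelError as exc:
            last_error = exc  # remember the failure and try the next model
    raise ModelError("all models failed") from last_error


# Stub clients so the sketch runs without network access.
def cheap_model(prompt: str) -> str:
    return f"[cheap] {prompt[:24]}"


def frontier_model(prompt: str) -> str:
    return f"[frontier] {prompt[:24]}"


def flaky_cheap_model(prompt: str) -> str:
    raise ModelError("rate limited")


print(route("What is our refund window?", cheap_model, frontier_model))
print(route("What is our refund window?", flaky_cheap_model, frontier_model))
print(route("Plan the migration step by step.", cheap_model, frontier_model))
```

Falling back from the frontier model to the cheap one on a hard turn is a degraded-service choice; whether that is acceptable depends on the quality bar set in phase A.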
P · Protect & optimize (~7–9 min)
Make it safe and affordable. Guardrails: prompt injection and the lethal trifecta, PII, content moderation, output and schema validation, sandboxed tool execution. Cost and latency levers: prompt / KV / semantic caching, batching, context trimming, streaming for perceived speed.
guardrails · caching
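Of the guardrails listed, output and schema validation is the easiest to show in a few lines. The sketch below uses only the standard library; the two-field schema (`answer`, `citations`) is an invented example, and a production system might use a schema library instead.

```python
import json

# Hypothetical response schema: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "citations": list}


class OutputValidationError(ValueError):
    """The model's raw output did not match the expected schema."""


def validate_output(raw: str) -> dict:
    """Parse model output as JSON and check it against REQUIRED_FIELDS."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutputValidationError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise OutputValidationError("top-level value must be an object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise OutputValidationError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise OutputValidationError(f"wrong type for field: {field}")
    if not all(isinstance(c, str) for c in data["citations"]):
        raise OutputValidationError("citations must be a list of strings")
    return data


good = '{"answer": "30 days", "citations": ["policy.md#refunds"]}'
bad = '{"answer": "30 days"}'

print(validate_output(good)["answer"])
try:
    validate_output(bad)
except OutputValidationError as exc:
    print("rejected:", exc)
```

A rejected output would typically trigger a bounded retry or a safe fallback response rather than being passed to a tool or shown to the user.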
T · Test & evolve (~7–9 min)
How you know it works and keep it working — the phase that separates levels. Eval strategy: offline golden sets, LLM-as-judge (binary, validated against humans), online/production evals, regression gates wired into CI. Observability: tracing, token accounting, cost per successful outcome. Close by scaling 10×/100× and naming where it breaks first.
evals · tracing · scale
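A regression gate and "cost per successful outcome" can both be computed from the same golden-set run. The sketch below assumes each case already has a binary pass/fail verdict and a recorded cost; the `EvalResult` shape, the sample numbers, and the 2-point allowed drop are illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalResult:
    case_id: str
    passed: bool      # binary verdict from the judge or an exact-match check
    cost_usd: float   # total spend for this case, from token accounting


def pass_rate(results: list[EvalResult]) -> float:
    if not results:
        raise ValueError("no eval results")
    return sum(r.passed for r in results) / len(results)


def cost_per_success(results: list[EvalResult]) -> float:
    """Total spend divided by the number of successful cases."""
    successes = sum(r.passed for r in results)
    total = sum(r.cost_usd for r in results)
    return total / successes if successes else math.inf


def regression_gate(results: list[EvalResult], baseline_pass_rate: float,
                    max_drop: float = 0.02) -> bool:
    """True if the candidate's pass rate is within max_drop of the baseline."""
    return pass_rate(results) >= baseline_pass_rate - max_drop


# Made-up golden-set run for a candidate prompt or model change.
run = [
    EvalResult("q1", True, 0.01),
    EvalResult("q2", True, 0.01),
    EvalResult("q3", False, 0.02),
    EvalResult("q4", True, 0.02),
]

print(f"pass rate:        {pass_rate(run):.2f}")
print(f"cost per success: ${cost_per_success(run):.3f}")
print("gate vs 0.80 baseline:", regression_gate(run, baseline_pass_rate=0.80))
```

In CI, a failing gate would block the merge; the failed cases still count toward spend, which is why cost per success is higher than the average cost per case.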
ADEPT — Align, Design the knowledge layer, Engineer the agent loop, Protect & optimize, Test & evolve. Announce the structure at the start (“let me clarify requirements, then knowledge, then the agent, then safety and cost, then evals”) so the interviewer can follow — and so you get credit for the parts you don't reach.
If you already know classic system design, you don't throw it away — you overlay the parts that are genuinely new. These are the axes a classic loop never asks about, and the ones an agentic interviewer is listening for:
The new axes: what classic SD never asks
Token throughput: You estimate in tokens/sec and $/query, not just RPS. Input and output tokens are priced differently; output length drives latency.
Context budget: The prompt window is a finite, shared resource you allocate across system prompt, tools, retrieved context, and history, and more is not better.
Retrieval quality: A first-class, measured metric (recall@k, faithfulness), not an afterthought. Bad retrieval is the most common cause of a bad RAG system.
Non-determinism: Outputs are probabilistic, so you cannot assert correctness; you evaluate it. Evals replace unit tests as the proof the system works.
Guardrails as a surface: Prompt injection is a security boundary with no reliable prompt-only fix. Untrusted content in the context window is an attack vector.
Cost per outcome: The metric that matters is dollars per successful task, not raw spend. Optimizing tokens alone just makes a worse system cheaper.
Name these unprompted and you signal production experience. Miss them and you sound like someone who has read about LLMs but not operated one.
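The context-budget axis is the one that most benefits from a concrete picture. The sketch below allocates a fixed window across system prompt, tools, retrieved chunks, and history, dropping the lowest-ranked chunks and oldest turns first. The whitespace word count is a crude stand-in for a real tokenizer, and the 50% retrieval share is an arbitrary assumption.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in (an assumption): one token per whitespace-separated word.
    return len(text.split())


def pack(items: list[str], budget: int) -> list[str]:
    """Keep items in order until the next one would exceed the budget."""
    kept, used = [], 0
    for item in items:
        cost = count_tokens(item)
        if used + cost > budget:
            break
        kept.append(item)
        used += cost
    return kept


def build_context(system: str, tools: str, retrieved: list[str],
                  history: list[str], window: int, reserve_output: int,
                  retrieval_share: float = 0.5) -> dict:
    """Split the window between retrieved chunks and conversation history."""
    available = window - reserve_output - count_tokens(system) - count_tokens(tools)
    if available < 0:
        raise ValueError("system prompt and tools alone exceed the window")
    # Retrieved chunks arrive best-first, so packing keeps the top-ranked ones.
    chunks = pack(retrieved, int(available * retrieval_share))
    history_budget = available - sum(count_tokens(c) for c in chunks)
    # Walk history newest-first so the oldest turns are dropped first.
    turns = pack(history[::-1], history_budget)[::-1]
    return {"system": system, "tools": tools, "retrieved": chunks, "history": turns}


ctx = build_context(
    system="a b",
    tools="c",
    retrieved=["w1 w2 w3", "w4 w5 w6", "w7 w8"],
    history=["h1 h2 h3 h4", "h5 h6 h7", "h8 h9"],
    window=20,
    reserve_output=5,
)
print(ctx["retrieved"])
print(ctx["history"])
```

With a 20-token window and 5 tokens reserved for output, only two of the three chunks and the two most recent turns fit, which makes the "more is not better" tradeoff visible.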
The opening five minutes set your level more than any other stretch. Weak candidates jump to architecture; strong ones quantify the problem first. The difference, concretely:
✓ Strong opening
What is the quality bar — and how is it measured?
p95 latency target? Is streaming acceptable?
Budget per query? ~2k tokens × 300 QPS ≈ what $/day?
How fresh must answers be — seconds, hours, days?
Multi-tenant? Are there per-user ACLs on the data?
What is the cost of a wrong answer vs. an "I don't know"?
✕ Weak opening
So I'll use a vector database and an LLM…
Let me draw the boxes first
We'll use GPT-4 for everything
Scale is probably fine, let's move on
Accuracy should be high (no number given)
I'll add evals at the end if there's time
The strong column is a candidate establishing the constraints that will drive every later tradeoff. The weak column is a candidate designing in a vacuum — every decision they make afterward is ungrounded, and the interviewer knows it.
The same answer can be a pass at mid-level and a flag at senior. Interviewers grade you on a small number of dimensions that scale with the level you're targeting — adapted here from HelloInterview's breadth/depth/proactiveness model for the agentic context.
Leveling rubric: the bar rises with the role

| Dimension | Mid | Senior | Staff | Principal+ |
| --- | --- | --- | --- | --- |
| Breadth | Covers the main phases with prompting | Hits all five ADEPT phases unprompted | Connects phases, e.g. how retrieval choice changes eval | Frames the whole problem space, names what is out of scope |
| Depth | Academic understanding of each piece | Goes deep in ~2 areas from real experience | Deep across multiple areas, cites failure modes | Teaches the interviewer something |
| Proactivity | Leads early, interviewer drives deep dives | Spots what is uniquely hard early | Drives the whole interview, preempts concerns | Reframes the problem to what actually matters |
| Eval discipline | Mentions evals when asked | Proposes a concrete eval strategy | Distinguishes retrieval vs generation metrics, validates the judge | Designs the hill-climb loop the team will operate |
Notice the fourth row. Eval discipline is the dimension that most separates AI-engineer candidates — and the one most under-prepared. A senior+ candidate treats “how do you know it works” as the most important question in the room, not the last.
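"Validates the judge" has a concrete meaning: compare the LLM judge's binary verdicts against human labels on the same cases. A minimal sketch follows, using raw agreement and Cohen's kappa (agreement corrected for chance). The choice of kappa and the sample labels are assumptions for illustration.

```python
def agreement(judge: list[bool], human: list[bool]) -> float:
    """Fraction of cases where the judge and the human gave the same verdict."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    p_o = agreement(judge, human)
    p_judge = sum(judge) / len(judge)   # judge's pass rate
    p_human = sum(human) / len(human)   # human's pass rate
    # Probability both say pass by chance plus both say fail by chance.
    p_e = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if p_e == 1.0:
        return 1.0  # both raters gave one identical constant label
    return (p_o - p_e) / (1 - p_e)


# Made-up verdicts on eight golden-set cases (True = pass).
human_labels = [True, True, False, False, True, False, True, True]
judge_labels = [True, True, False, True, True, False, False, True]

print(f"agreement: {agreement(judge_labels, human_labels):.2f}")
print(f"kappa:     {cohens_kappa(judge_labels, human_labels):.2f}")
```

Here the judge agrees with humans on 6 of 8 cases, but kappa is only about 0.47, a reminder that raw agreement can flatter a judge when most cases pass anyway.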
The framework is a default, not a cage. If the interviewer steers hard into retrieval, follow them — depth where they're curious beats marching through your phases. The point of ADEPT is that when you're not being steered, you always have the next move, and you never forget the phase that wins offers (T).
The framework's real job
A framework's value is not that it produces the perfect design. It is that it keeps you moving, in a defensible order, when forty-five minutes and a watching interviewer make it hard to think.
Announce the structure up front so partial progress still reads as competence.
Quantify in phase A — every later tradeoff should trace back to a number you established.
Always reach phase T. Budget time so you are never cut off before evals; if you are running long, compress E, not T.
This rubric adapts HelloInterview's breadth/depth/proactiveness leveling model. The core insight transfers directly: the same behavior is a pass at one level and a flag at the next, because what shifts is not whether you cover a topic but how much the interviewer has to pull it out of you.