The ADEPT Framework

Every great interview-prep system gives you one thing above all: a script you can fall back on when your mind goes blank. HelloInterview has its delivery framework; Grokking has RESHADED. The agentic system-design interview needs its own, because the classic ones optimize for the wrong things: they think in requests, bytes, and consistency, whereas an LLM system lives or dies on tokens, non-determinism, retrieval quality, and a correctness you cannot unit-test.

ADEPT is that script. Five phases, roughly forty-five minutes, each mapping to a cluster of Foundations. The name is the trait it signals: run it well and you look adept; the interviewer sees someone who has done this before.

Classic system design asks "how do you handle ten million requests." Agentic system design asks "how do you keep quality high and cost bounded when every call is probabilistic, your knowledge is stale the moment you index it, and a single poisoned document can exfiltrate your data." Different question. Different script.

~5–8 min

Align on the problem

Pin the use case and state the quality bar as a number. Functional scope: single- vs multi-turn, tools, citations, autonomy level. Then the non-functionals unique to LLM systems — latency (TTFT + tokens/sec, streaming?), cost per query (the token budget), faithfulness target, hallucination tolerance, freshness, scale (QPS), multi-tenancy and ACLs. Do a token-budget estimate the way classic design does capacity estimation.

tokens/req × $/token → $/query · × QPS → $/day
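The token-budget estimate for this phase can be sketched in a few lines. This is a minimal sketch, not a pricing reference: the `TokenBudget` class, the 1,500/500 input/output split, and both per-million-token prices are hypothetical placeholders chosen only to make the arithmetic concrete.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class TokenBudget:
    input_tokens: int        # prompt tokens per request (system + context + history)
    output_tokens: int       # generated tokens per request
    usd_per_m_input: float   # hypothetical price per 1M input tokens
    usd_per_m_output: float  # hypothetical price per 1M output tokens

    def cost_per_query(self) -> float:
        """Input and output tokens are priced separately, then summed."""
        total = (self.input_tokens * self.usd_per_m_input
                 + self.output_tokens * self.usd_per_m_output)
        return total / 1_000_000

    def cost_per_day(self, qps: float) -> float:
        """Scale the per-query cost by sustained queries per second."""
        return self.cost_per_query() * qps * SECONDS_PER_DAY


# Assumed numbers: ~2k tokens per request at 300 QPS, with made-up prices.
budget = TokenBudget(input_tokens=1_500, output_tokens=500,
                     usd_per_m_input=1.0, usd_per_m_output=4.0)
print(f"${budget.cost_per_query():.4f} per query")               # $0.0035 per query
print(f"${budget.cost_per_day(qps=300):,.0f} per day at 300 QPS")  # $90,720 per day at 300 QPS
```

Swapping in the real prices and traffic for the problem at hand turns this into the number every later tradeoff can be traced back to.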

~8–10 min

Design the knowledge layer

What must the model know that it doesn't? Choose the adaptation strategy on the prompt → RAG → fine-tune ladder; default to RAG and justify anything heavier. For RAG: chunking, embeddings, vector store, hybrid retrieval (BM25 + dense), reranking, freshness, ACL-aware retrieval. Say how you'll measure retrieval quality — recall@k, context precision, faithfulness.

recall@k · faithfulness
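As a rough illustration of the retrieval metric named above, recall@k is the fraction of the known-relevant documents that appear in the top k retrieved results. The function name and the document IDs below are illustrative assumptions, not part of any particular library.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top-k retrieved results."""
    if not relevant:
        raise ValueError("need at least one relevant document to compute recall")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical ranking from a retriever and a hand-labelled relevant set.
retrieved = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d1", "d2", "d4"}

print(round(recall_at_k(retrieved, relevant, k=3), 3))  # 0.333 (only d1 is in the top 3)
print(round(recall_at_k(retrieved, relevant, k=5), 3))  # 0.667 (d1 and d2; d4 is never retrieved)
```

Averaging this over a labelled query set gives a single retrieval number that can be tracked separately from generation quality.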

~10–12 min

Engineer the agent loop

The control flow, and the heart of the round. Decide workflow vs agent — and defend choosing the simpler one. Lay out the loop: tool/function calling, planning, memory (short and long), orchestration (orchestrator-worker when subtasks are unpredictable), error recovery, human-in-the-loop gates. Add model selection and routing — cheap model for easy turns, frontier for hard — with fallbacks.

workflow vs agent · routing
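One way to picture the routing-with-fallback idea is the sketch below. Everything in it is a stand-in: the two model functions are stubs, `looks_hard` is a toy heuristic, and a real router would more likely use a classifier or a confidence signal.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def route(prompt: str, is_hard: Callable[[str], bool],
          cheap: Model, frontier: Model) -> str:
    """Send easy turns to the cheap model and hard turns to the frontier model.

    If the chosen model raises, fall back to the other one.
    """
    primary, fallback = (frontier, cheap) if is_hard(prompt) else (cheap, frontier)
    try:
        return primary(prompt)
    except Exception:
        # Assumption: any exception (timeout, rate limit, outage) triggers the fallback.
        return fallback(prompt)


# Stub models and a toy difficulty heuristic, for illustration only.
def cheap_model(prompt: str) -> str:
    return f"cheap: {prompt}"


def frontier_model(prompt: str) -> str:
    return f"frontier: {prompt}"


def looks_hard(prompt: str) -> bool:
    return len(prompt.split()) > 20 or "plan" in prompt.lower()


print(route("What are your opening hours?", looks_hard, cheap_model, frontier_model))
print(route("Plan a three-step data migration", looks_hard, cheap_model, frontier_model))
```

The design point worth saying aloud in the interview is that the router itself is cheap and testable, so a routing mistake costs one extra call rather than a failed turn.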

~7–9 min

Protect & optimize

Make it safe and affordable. Guardrails: prompt injection and the lethal trifecta, PII, content moderation, output and schema validation, sandboxed tool execution. Cost and latency levers: prompt / KV / semantic caching, batching, context trimming, streaming for perceived speed.

guardrails · caching
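Output and schema validation is the easiest of these guardrails to show concretely. The sketch below assumes the model was asked to return a JSON object with an `answer` string and a non-empty `citations` list; that schema and the `validate_output` name are assumptions made for illustration.

```python
import json

# Hypothetical output contract: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "citations": list}


def validate_output(raw: str) -> dict:
    """Parse model output and reject anything that violates the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"field {field!r} is missing or not a {expected_type.__name__}")
    if not data["citations"]:
        raise ValueError("answer must cite at least one source")
    return data


good = '{"answer": "Refunds take 5 days.", "citations": ["policy.md#refunds"]}'
print(validate_output(good)["citations"])

try:
    validate_output('{"answer": "Refunds take 5 days.", "citations": []}')
except ValueError as err:
    print(f"rejected: {err}")
```

A rejected output can then be retried, repaired, or escalated, which is safer than passing unvalidated text to a downstream tool.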

~7–9 min

Test & evolve

How you know it works and keep it working — the phase that separates levels. Eval strategy: offline golden sets, LLM-as-judge (binary, validated against humans), online/production evals, regression gates wired into CI. Observability: tracing, token accounting, cost per successful outcome. Close by scaling 10×/100× and naming where it breaks first.

evals · tracing · scale
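A regression gate over a golden set can be as small as the sketch below. It assumes each golden example has already been graded pass/fail (by exact match, a rubric, or a validated judge); the function names and the two-point tolerance are illustrative choices, not a standard.

```python
def pass_rate(results: list[bool]) -> float:
    """Share of golden-set examples the candidate system got right."""
    if not results:
        raise ValueError("golden set is empty")
    return sum(results) / len(results)


def passes_gate(candidate_results: list[bool], baseline_rate: float,
                max_drop: float = 0.02) -> bool:
    """Allow the change only if quality has not dropped more than max_drop below baseline."""
    return pass_rate(candidate_results) >= baseline_rate - max_drop


# Hypothetical run: 46 of 50 golden examples pass (92%).
candidate = [True] * 46 + [False] * 4

print(passes_gate(candidate, baseline_rate=0.93))  # True: 0.92 is within 0.02 of 0.93
print(passes_gate(candidate, baseline_rate=0.96))  # False: 0.92 is below the 0.94 floor
```

In CI, the boolean would typically become the job's exit status, so a prompt or model change that regresses quality cannot merge silently.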

ADEPT — Align, Design the knowledge layer, Engineer the agent loop, Protect & optimize, Test & evolve. Announce the structure at the start (“let me clarify requirements, then knowledge, then the agent, then safety and cost, then evals”) so the interviewer can follow — and so you get credit for the parts you don't reach.

How it differs from classic system design

If you already know classic system design, you don't throw it away — you overlay the parts that are genuinely new. These are the axes a classic loop never asks about, and the ones an agentic interviewer is listening for:

The new axes: what classic SD never asks

Token throughput: You estimate in tokens/sec and $/query, not just RPS. Input and output tokens are priced differently; output length drives latency.

Context budget: The prompt window is a finite, shared resource you allocate across system prompt, tools, retrieved context, and history, and more is not better.

Retrieval quality: A first-class, measured metric (recall@k, faithfulness), not an afterthought. Bad retrieval is the most common cause of a bad RAG system.

Non-determinism: Outputs are probabilistic, so you cannot assert correctness; you evaluate it. Evals replace unit tests as the proof the system works.

Guardrails as a surface: Prompt injection is a security boundary with no reliable prompt-only fix. Untrusted content in the context window is an attack vector.

Cost per outcome: The metric that matters is dollars per successful task, not raw spend. Optimizing tokens alone just makes a worse system cheaper.
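To make the context-budget axis concrete, here is a minimal allocation sketch. The window size, the caps, and the `allocate_context` helper are all assumed values for illustration; the point is only that fixed costs come off the top and that retrieval and history are capped rather than allowed to fill whatever space remains.

```python
def allocate_context(window: int, reserved_output: int, system: int, tools: int,
                     history_cap: int, retrieval_cap: int) -> dict[str, int]:
    """Split a context window across its competing consumers, in tokens."""
    available = window - reserved_output - system - tools
    if available <= 0:
        raise ValueError("fixed costs already exceed the context window")
    history = min(history_cap, available)
    retrieved = min(retrieval_cap, available - history)
    return {
        "system": system,
        "tools": tools,
        "history": history,
        "retrieved": retrieved,
        "output": reserved_output,
        "unused": available - history - retrieved,  # deliberate headroom
    }


# Hypothetical 16k window with capped history and retrieval.
plan = allocate_context(window=16_000, reserved_output=1_000, system=500,
                        tools=1_500, history_cap=3_000, retrieval_cap=6_000)
print(plan)
# {'system': 500, 'tools': 1500, 'history': 3000, 'retrieved': 6000, 'output': 1000, 'unused': 4000}
```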

Name these unprompted and you signal production experience. Miss them and you sound like someone who has read about LLMs but not operated one.
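The cost-per-outcome axis is easiest to defend with a small worked comparison. The two configurations and their numbers below are invented for illustration; only the formula (total spend divided by successful tasks) comes from the definition above.

```python
def cost_per_success(total_cost_usd: float, tasks: int, success_rate: float) -> float:
    """Dollars spent per successfully completed task."""
    successes = tasks * success_rate
    if successes <= 0:
        raise ValueError("no successful tasks, cost per success is undefined")
    return total_cost_usd / successes


# Hypothetical configurations over the same 10,000 tasks.
trimmed = cost_per_success(total_cost_usd=100.0, tasks=10_000, success_rate=0.40)
fuller = cost_per_success(total_cost_usd=150.0, tasks=10_000, success_rate=0.80)

print(f"trimmed prompts: ${trimmed:.5f} per success")  # $0.02500 per success
print(f"fuller context:  ${fuller:.5f} per success")   # $0.01875 per success
```

In this made-up case the configuration with the lower raw spend is the more expensive one per successful task, which is the tradeoff the metric is meant to expose.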

Phase A in depth: the questions that earn credit

The opening five minutes set your level more than any other stretch. Weak candidates jump to architecture; strong ones quantify the problem first. The difference, concretely:

✓ Strong opening

What is the quality bar — and how is it measured?

p95 latency target? Is streaming acceptable?

Budget per query? ~2k tokens × 300 QPS ≈ what $/day?

How fresh must answers be — seconds, hours, days?

Multi-tenant? Are there per-user ACLs on the data?

What is the cost of a wrong answer vs. an "I don't know"?

✕ Weak opening

So I'll use a vector database and an LLM…

Let me draw the boxes first

We'll use GPT-4 for everything

Scale is probably fine, let's move on

Accuracy should be high (no number given)

I'll add evals at the end if there's time

The strong column is a candidate establishing the constraints that will drive every later tradeoff. The weak column is a candidate designing in a vacuum — every decision they make afterward is ungrounded, and the interviewer knows it.

What interviewers grade: the leveling rubric

The same answer can be a pass at mid-level and a flag at senior. Interviewers grade you on a small number of dimensions that scale with the level you're targeting — adapted here from HelloInterview's breadth/depth/proactiveness model for the agentic context.

Leveling rubric: the bar rises with the role

Dimension	Mid	Senior	Staff	Principal+
Breadth	Covers the main phases with prompting	Hits all five ADEPT phases unprompted	Connects phases — how retrieval choice changes eval	Frames the whole problem space, names what is out of scope
Depth	Academic understanding of each piece	Goes deep in ~2 areas from real experience	Deep across multiple areas, cites failure modes	Teaches the interviewer something
Proactivity	Leads early, interviewer drives deep dives	Spots what is uniquely hard early	Drives the whole interview, preempts concerns	Reframes the problem to what actually matters
Eval discipline	Mentions evals when asked	Proposes a concrete eval strategy	Distinguishes retrieval vs generation metrics, validates the judge	Designs the hill-climb loop the team will operate

Notice the fourth row. Eval discipline is the dimension that most separates AI-engineer candidates — and the one most under-prepared. A senior+ candidate treats “how do you know it works” as the most important question in the room, not the last.
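Validating a binary LLM judge against human labels, as the Staff column of the rubric expects, can be sketched as follows. Raw agreement comes straight from the idea of checking the judge against humans; Cohen's kappa is added here as one common chance-corrected choice and is an assumption of this sketch, as are the function names and the sample labels.

```python
def agreement(judge: list[bool], human: list[bool]) -> float:
    """Fraction of examples where the judge and the human label match."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Agreement corrected for what two raters would match on by chance."""
    observed = agreement(judge, human)
    n = len(judge)
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail verdicts on the same eight outputs.
judge_labels = [True, True, False, True, False, False, True, True]
human_labels = [True, True, False, False, False, False, True, True]

print(agreement(judge_labels, human_labels))     # 0.875
print(cohens_kappa(judge_labels, human_labels))  # 0.75
```

If agreement is low, the judge prompt is what needs fixing first, since every downstream eval number inherits its errors.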

Run the script, but read the room

The framework is a default, not a cage. If the interviewer steers hard into retrieval, follow them — depth where they're curious beats marching through your phases. The point of ADEPT is that when you're not being steered, you always have the next move, and you never forget the phase that wins offers (T).

The framework's real job

A framework's value is not that it produces the perfect design. It is that it keeps you moving, in a defensible order, when forty-five minutes and a watching interviewer make it hard to think.

Announce the structure up front so partial progress still reads as competence.
Quantify in phase A — every later tradeoff should trace back to a number you established.
Always reach phase T. Budget time so you are never cut off before evals; if you are running long, compress E, not T.

The System Design Interview: What is Expected at Each Level

HelloInterview · 2025 · Guide

The breadth/depth/proactiveness leveling model this rubric adapts. The core insight transfers directly: the same behavior is a pass at one level and a flag at the next, because what shifts is not whether you cover a topic but how much the interviewer has to pull it out of you.

Next: the Foundations that each ADEPT phase draws on — or jump to a worked design problem to see the framework run end to end.
