Cheat Sheet

The whole guide, compressed to what fits on one screen the night before. Skim it, don't study it — by now the reasoning is yours; this is just the index to it. Every figure below links back to where the full argument lives.

If you remember nothing else: quantify before you architect, treat retrieval quality as a measured number, reach for the simplest control flow that works, never trust content you didn't author, and prove it with evals — not vibes. That sentence is the interview.

ADEPT in one screen

Align

Align on the problem

Pin the use case and state the quality bar as a number. Estimate the token budget the way classic system design estimates capacity.

tokens/req × QPS → $/query

Design

Design the knowledge layer

What must the model know that it doesn't? Walk the prompt → RAG → fine-tune ladder; default to RAG and measure retrieval.

recall@k · faithfulness

Engineer

Engineer the agent loop

The control flow. Decide workflow vs agent and defend the simpler one. Tools, memory, orchestration, routing, recovery.

workflow vs agent · routing

Protect

Protect & optimize

Make it safe and affordable. Guardrails for injection and the lethal trifecta; caching, batching, context trimming, streaming.

guardrails · caching

Test

Test & evolve

Prove it works and keep it working. Golden sets, validated LLM-as-judge, regression gates in CI, tracing, cost per outcome.

evals · tracing · scale

The five-phase script in full lives in The ADEPT Framework. Announce the structure up front so you get credit for the phases you don't reach.
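The Align-phase estimate (tokens/req × QPS → $/query) can be sketched in a few lines. This is a back-of-envelope illustration only: the `TokenBudget` class, its field names, and every price and traffic figure below are assumptions made up for the example, not numbers from the guide or from any provider.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class TokenBudget:
    """Back-of-envelope inputs; every value is an illustrative assumption."""
    input_tokens_per_req: int    # system prompt + retrieved context + history
    output_tokens_per_req: int   # generated answer
    qps: float                   # average queries per second
    usd_per_m_input: float       # hypothetical price per 1M input tokens
    usd_per_m_output: float      # hypothetical price per 1M output tokens

    def usd_per_query(self) -> float:
        # Input and output tokens are priced separately, then summed.
        total = (self.input_tokens_per_req * self.usd_per_m_input
                 + self.output_tokens_per_req * self.usd_per_m_output)
        return total / 1_000_000

    def tokens_per_second(self) -> float:
        # Aggregate throughput the serving layer has to sustain.
        return (self.input_tokens_per_req + self.output_tokens_per_req) * self.qps

    def usd_per_day(self) -> float:
        return self.usd_per_query() * self.qps * SECONDS_PER_DAY


# Hypothetical workload: 4,000 input + 500 output tokens per request at 10 QPS.
budget = TokenBudget(
    input_tokens_per_req=4_000,
    output_tokens_per_req=500,
    qps=10,
    usd_per_m_input=3.0,
    usd_per_m_output=15.0,
)
print(f"$/query:  {budget.usd_per_query():.4f}")
print(f"tokens/s: {budget.tokens_per_second():,.0f}")
print(f"$/day:    {budget.usd_per_day():,.0f}")
```

Stating the three outputs out loud (dollars per query, tokens per second, dollars per day) is usually enough to show the quantify-first habit before any architecture is drawn.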

Red flags by phase

What sinks candidates, grouped by ADEPT phase

A: Designing before quantifying. No quality bar stated as a number: "good answers" instead of a target metric.

D: Retrieval quality as an afterthought. "Just use a bigger context window" instead of fixing recall and precision.

E: Reaching for an agent (or multi-agent) when a deterministic workflow would do the job more cheaply and reliably.

P: Trusting retrieved or user content (injection). Granting full autonomy on irreversible actions with no human gate.

T: "It looked good." 1–5 judge scores. Never validating the judge against humans. Reporting infra metrics only.

Each of these is a level-down signal on its own. Name the opposite unprompted to climb.
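The Protect-phase red flag (no human gate on irreversible actions, combined with the lethal trifecta) can be turned into a small policy check. The sketch below is one possible shape, not a standard API: `ToolCall`, its boolean flags, and `needs_human_approval` are hypothetical names, and it assumes each tool has been hand-labeled with the capabilities it exposes.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical, hand-labeled description of what a tool can do."""
    name: str
    reads_untrusted_content: bool   # web pages, emails, retrieved docs
    touches_private_data: bool      # CRM, files, internal databases
    can_send_externally: bool       # email, HTTP POST, anything that can exfiltrate
    irreversible: bool              # deletes, payments, sends


def needs_human_approval(history: Sequence[ToolCall], proposed: ToolCall) -> bool:
    """Gate a call if it is irreversible or if it completes the lethal trifecta."""
    calls = [*history, proposed]
    trifecta = (
        any(c.reads_untrusted_content for c in calls)
        and any(c.touches_private_data for c in calls)
        and any(c.can_send_externally for c in calls)
    )
    return proposed.irreversible or trifecta


# Illustrative tools for one agent session.
fetch_url = ToolCall("fetch_url", True, False, False, False)
read_crm = ToolCall("read_crm", False, True, False, False)
send_email = ToolCall("send_email", False, False, True, False)
delete_record = ToolCall("delete_record", False, True, False, True)

print(needs_human_approval([fetch_url], read_crm))              # two legs only
print(needs_human_approval([fetch_url, read_crm], send_email))  # third leg arrives
print(needs_human_approval([], delete_record))                  # irreversible
```

The design choice worth saying aloud is that the check is session-wide: no single tool is dangerous alone, but the gate closes once all three capabilities have appeared in the same session.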

The decision ladders

Prompt → RAG → fine-tune

Better prompt

Capability already in the model — clarify the instruction first. Cheapest, fastest to iterate.

RAG

Fresh or proprietary FACTS the model lacks. Default for knowledge gaps; measure retrieval.

Fine-tune (LoRA first)

Consistent BEHAVIOR / FORMAT, or a cheaper specialized model. Reach for it last, and try LoRA before a full fine-tune.

Control-flow choices

Workflow vs agent

Deterministic, predictable subtasks → workflow. Model must direct itself over unpredictable steps → agent. Pick the simplest that works.

Single vs multi-agent

Default single. Go multi only when the task genuinely decomposes AND the value covers the ~15× token cost.

Model routing

Cheap model for easy turns, frontier for hard, with fallbacks. Don’t pay frontier prices for trivial calls.

Three ladders, one rule: justify every step up. The default sits at the cheap end; heavier choices need a reason out loud.
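Model routing with fallbacks, the third control-flow choice, might look roughly like the sketch below. Everything in it is an assumption for illustration: the model functions are stubs rather than real provider clients, `ModelError` is a made-up exception, and `looks_hard` is a placeholder heuristic where a production system would more likely use a trained classifier or a confidence signal.

```python
from typing import Callable, List, Tuple


class ModelError(Exception):
    """Raised by a model call that fails (timeout, rate limit, bad output)."""


ModelFn = Callable[[str], str]
Chain = List[Tuple[str, ModelFn]]


def call_with_fallbacks(prompt: str, chain: Chain) -> Tuple[str, str]:
    """Try each model in order; return (model_name, answer) from the first success."""
    last_error = None
    for name, call in chain:
        try:
            return name, call(prompt)
        except ModelError as err:
            last_error = err  # remember the failure and try the next model
    raise RuntimeError("every model in the chain failed") from last_error


def looks_hard(prompt: str) -> bool:
    # Placeholder difficulty heuristic, assumed for the example only.
    return len(prompt) > 200 or "step by step" in prompt.lower()


def route(prompt: str, cheap_chain: Chain, frontier_chain: Chain,
          is_hard: Callable[[str], bool] = looks_hard) -> Tuple[str, str]:
    """Send easy turns to the cheap chain and hard turns to the frontier chain."""
    chain = frontier_chain if is_hard(prompt) else cheap_chain
    return call_with_fallbacks(prompt, chain)


# Stub models standing in for real API clients.
def small_model(prompt: str) -> str:
    return f"small: {prompt}"


def frontier_model(prompt: str) -> str:
    return f"frontier: {prompt}"


def flaky_model(prompt: str) -> str:
    raise ModelError("simulated timeout")


cheap_chain: Chain = [("small", small_model), ("frontier", frontier_model)]
frontier_chain: Chain = [("frontier", frontier_model), ("small", small_model)]

print(route("What are your opening hours?", cheap_chain, frontier_chain))
print(route("Plan the database migration step by step", cheap_chain, frontier_chain))
```

Returning the model name alongside the answer keeps the routing decision visible in traces, which is what lets you later measure how often the cheap path was actually good enough.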

Component cheat sheet

Tools to name, with the right tradeoff: say the name AND why

Vector DBs (platform)

pgvector · Pinecone · Qdrant · Weaviate · Milvus

pgvector if already on Postgres (under ~10M vectors); Pinecone for zero-ops managed (pricier at scale); Qdrant for best filtered search and easy self-host; Weaviate for native hybrid; Milvus for billion-scale.

Serving (runtime)

vLLM · TGI · TensorRT-LLM

vLLM is the default (PagedAttention + continuous batching); TGI for simple ops; TensorRT-LLM for max throughput, NVIDIA-compiled.

Eval / observability (process)

Ragas · LangSmith · Langfuse · Braintrust · Arize Phoenix · OTel GenAI

Ragas for RAG metrics; LangSmith for LangChain stacks; Langfuse for open-source self-host; Braintrust for CI eval gates; Phoenix is OTel-native; OpenTelemetry GenAI is the standard.

Orchestration (platform)

LangGraph · LlamaIndex · CrewAI · AutoGen · Temporal

LangGraph for stateful graphs; LlamaIndex for RAG; CrewAI for role-based multi-agent; AutoGen for conversational multi-agent; Temporal for durable execution.

Retrieval (runtime)

BM25 · dense embeddings · RRF · cross-encoder rerank

BM25 for sparse/lexical; dense embeddings for semantic; reciprocal rank fusion to merge them; cross-encoder rerank for final precision.

Naming a tool earns nothing; naming the tradeoff that picks it earns the point.
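The fusion step in the Retrieval row (merging a BM25 ranking with a dense ranking) is small enough to write on a whiteboard. A minimal sketch of reciprocal rank fusion follows; the document ids are invented, and `k=60` is a commonly used default rather than a value taken from this guide.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of doc ids by summing 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a lexical and a dense retriever.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Only ranks are used, never the raw BM25 or cosine scores, which is why the two retrievers need no score calibration. A cross-encoder reranker would then rescore the top of `fused` for final precision.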

Numbers worth memorizing

≈90% / 85%

Prompt caching: up to ≈90% cost and ≈85% latency saved on long cached prompts; cache reads ≈ 10% of input price (Anthropic)

~15×

Multi-agent burns ~15× the tokens of a single chat; token usage explained ~80% of performance variance (Anthropic)

~100:1

Agents run a ~100:1 input:output token ratio in production (Manus)

90.2%

Multi-agent lift over single-agent on a research eval (Anthropic)

O(n²)

Attention cost is quadratic in sequence length — why context is finite and expensive

Real, cited figures — drop one precisely and you sound like you’ve operated these systems, not just read about them.
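The prompt-caching figure is easier to quote correctly once the arithmetic is visible. The sketch below assumes the ≈10% cache-read price from the list above and a hypothetical $3 per 1M input tokens; it deliberately ignores any cache-write premium and all output-token cost, so treat it as a rough upper bound on savings rather than a billing model.

```python
def effective_input_cost(total_input_tokens: int, cached_tokens: int,
                         usd_per_m_input: float,
                         cache_read_multiplier: float = 0.10) -> float:
    """Input cost in USD when part of the prompt is served from cache.

    cache_read_multiplier=0.10 mirrors the "cache reads ~10% of input price"
    figure; cache-write premiums are ignored in this simplified sketch.
    """
    if not 0 <= cached_tokens <= total_input_tokens:
        raise ValueError("cached_tokens must be between 0 and total_input_tokens")
    fresh_tokens = total_input_tokens - cached_tokens
    billable = fresh_tokens + cached_tokens * cache_read_multiplier
    return billable * usd_per_m_input / 1_000_000


# Hypothetical request: 10,000 input tokens, 9,000 of them a cached prefix.
uncached = effective_input_cost(10_000, 0, usd_per_m_input=3.0)
cached = effective_input_cost(10_000, 9_000, usd_per_m_input=3.0)
saving = 1 - cached / uncached

print(f"uncached: ${uncached:.4f}  cached: ${cached:.4f}  saving: {saving:.0%}")
```

With 90% of the prompt cached the input bill drops by about 81%, and the saving approaches 90% only as the cached share approaches the whole prompt. That is the reason to keep the stable prefix (system prompt, tool definitions) at the front and the variable part at the end.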

Metrics that matter

The metric stack: what to optimize, by layer

Retrieval (process)

recall@k · context precision · MRR / NDCG

Did the right context make it into the window, and was it ranked well? Bad retrieval is the top cause of a bad RAG system.

Generation (process)

faithfulness / groundedness · answer relevance · citation accuracy

Is the answer supported by the retrieved context, on-topic, and correctly cited? This is your hallucination defense.

Product (runtime)

task completion · escalation precision · CSAT

The only metrics the business actually feels. A faithful answer that fails the task still failed.

Cost (cost)

cost per SUCCESSFUL outcome · cache hit rate

Dollars per successful task, not raw spend — optimizing tokens alone just makes a worse system cheaper.

Latency (runtime)

TTFT · TPOT · p95

Time-to-first-token drives perceived speed (stream it); time-per-output-token drives total wait. Always quote the p95.

Lead with cost per successful outcome and a validated faithfulness number — those two phrases carry the whole eval story.
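Three of the metrics in the stack reduce to a few lines each, which helps when an interviewer asks what a number actually means. The sketch below uses plain Python with invented document ids and costs; the function names are illustrative, and eval libraries may define edge cases (for example, queries with no relevant documents) differently.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k retrieved."""
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant docs")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def cost_per_successful_outcome(total_cost_usd: float, successes: int) -> float:
    """Total spend divided by tasks that actually succeeded."""
    if successes == 0:
        return float("inf")  # all spend, no value delivered
    return total_cost_usd / successes


# One hypothetical query from a golden set.
retrieved = ["a", "b", "c", "d"]
relevant = {"b", "d", "z"}

print(recall_at_k(retrieved, relevant, k=3))       # 1 of 3 relevant docs in top 3
print(recall_at_k(retrieved, relevant, k=4))       # 2 of 3 relevant docs in top 4
print(reciprocal_rank(retrieved, relevant))        # first hit at rank 2
print(cost_per_successful_outcome(42.0, 120))      # $42 spent, 120 tasks resolved
```

MRR is the mean of `reciprocal_rank` over every query in the golden set. Note that the cost function divides by successes, not by requests, so a cheaper model that fails more tasks can still come out more expensive.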

One-line glossary

Terms an interviewer might drop: one line each

RAG: Retrieve relevant docs, inject them into the prompt, ground the answer in them; fixes stale and proprietary knowledge.

Hybrid search: Combine sparse (BM25) and dense (embedding) retrieval to catch both exact terms and semantic matches.

RRF: Reciprocal rank fusion; merge multiple ranked lists by summing 1/(k+rank), no score calibration needed.

Reranking: A cross-encoder rescores the top-k candidates for precision before they enter the context.

ReAct: Reason + act loop: the model interleaves thoughts, tool calls, and observations until it answers.

Orchestrator-worker: A lead agent decomposes a task and delegates subtasks to worker agents, then synthesizes.

Context window: The finite token budget shared across system prompt, tools, retrieved context, and history; more is not better.

TTFT / TPOT: Time-to-first-token (perceived speed) and time-per-output-token (streaming throughput).

KV cache: Cached attention keys/values for prior tokens so generation doesn’t recompute the whole prefix each step.

PagedAttention: vLLM’s paged KV-cache memory management; packs more concurrent requests onto a GPU.

Continuous batching: Swap finished sequences out and new ones in mid-batch to keep the GPU saturated.

Prompt caching: Reuse a cached prompt prefix across calls; large cost and latency savings on repeated context.

Context rot: Quality degrades as the window fills with stale or irrelevant tokens; trim aggressively.

LLM-as-judge: Use a model to score outputs; prefer binary rubrics and validate against human labels.

Golden set: A curated, labeled eval set you run offline as the regression bar for every change.

Lethal trifecta: Untrusted input + access to private data + ability to exfiltrate; together they enable data theft.

Prompt injection: Malicious instructions in input. Direct = user-typed; indirect = hidden in retrieved/fetched content.

LoRA / PEFT: Parameter-efficient fine-tuning; train small adapter weights instead of the full model.

Distillation: Train a small student model to mimic a larger teacher for a cheaper specialized deployment.

MCP: Model Context Protocol, an open standard for connecting models to tools and data sources.

Quantization: Lower-precision weights (INT8/INT4) to shrink memory and speed inference, trading some accuracy.

Faithfulness: Whether the answer is actually supported by the provided context: groundedness, not correctness.

If a term comes up and you can give the one-liner plus when it matters, you keep momentum instead of stalling.
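"Validate against human labels" in the LLM-as-judge entry usually means measuring agreement on a labeled sample before trusting the judge at scale. One common way to do that, sketched here with invented labels, is raw agreement plus Cohen's kappa for a binary pass/fail rubric. The function names are illustrative, and the acceptance threshold you apply to kappa is a team decision this sketch does not prescribe.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[int], human: Sequence[int]) -> float:
    """Share of items where the judge's 0/1 label matches the human's."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty label lists")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[int], human: Sequence[int]) -> float:
    """Agreement corrected for chance, for binary (0/1) labels."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    p_judge = sum(judge) / n   # how often the judge says "pass"
    p_human = sum(human) / n   # how often the human says "pass"
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Both raters gave one identical constant label; kappa is formally
        # undefined, so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail labels on eight golden-set answers.
judge_labels = [1, 1, 0, 1, 0, 0, 1, 1]
human_labels = [1, 1, 0, 0, 0, 0, 1, 1]

print(agreement_rate(judge_labels, human_labels))  # 0.875
print(cohens_kappa(judge_labels, human_labels))    # 0.75
```

Kappa matters because raw agreement flatters a judge when one label dominates: a judge that always says "pass" scores high agreement on a mostly passing set while carrying no signal.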

The three signals that win offers

What gets you the offer

Interviewers are buying one thing: confidence that you'll build something that works in production and keeps working. Three signals carry that. Show them and the level takes care of itself.

Production judgment over algorithmic polish — you talk in tokens, dollars, and failure modes, not just architecture diagrams.
Evals as a reflex — every design choice comes with how you’d measure it, and you validate the judge before trusting it.
Knowing when NOT to use an agent — you reach for the simplest control flow that works and defend it out loud.

Good luck. Reread the overview for the why, and keep The ADEPT Framework open as your in-room script. You’ve got this.
