
Cheat Sheet

The whole guide, compressed to what fits on one screen the night before. Skim it, don't study it — by now the reasoning is yours; this is just the index to it. Every figure below links back to where the full argument lives.

If you remember nothing else: quantify before you architect, treat retrieval quality as a measured number, reach for the simplest control flow that works, never trust content you didn't author, and prove it with evals — not vibes. That sentence is the interview.

ADEPT in one screen

A
Align
Align on the problem
Pin the use case and state the quality bar as a number. Estimate the token budget the way classic system design estimates capacity.
tokens/req × QPS → $/query
D
Design
Design the knowledge layer
What must the model know that it doesn't? Walk the prompt → RAG → fine-tune ladder; default to RAG and measure retrieval.
recall@k · faithfulness
E
Engineer
Engineer the agent loop
The control flow. Decide workflow vs agent and defend the simpler one. Tools, memory, orchestration, routing, recovery.
workflow vs agent · routing
P
Protect
Protect & optimize
Make it safe and affordable. Guardrails for injection and the lethal trifecta; caching, batching, context trimming, streaming.
guardrails · caching
T
Test
Test & evolve
Prove it works and keep it working. Golden sets, validated LLM-as-judge, regression gates in CI, tracing, cost per outcome.
evals · tracing · scale
The five-phase script in full lives in The ADEPT Framework. Announce the structure up front so you get credit for the phases you don't reach.
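The Align-phase arithmetic (tokens/req × QPS → $/query) can be sketched in a few lines. This is a minimal illustration, not a pricing model: the `TokenBudget` class, the token counts, and the per-million-token prices are all hypothetical placeholders to be replaced with real numbers for the model under discussion.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Back-of-envelope cost model. All values below are illustrative assumptions."""
    input_tokens: int             # prompt + retrieved context + history, per request
    output_tokens: int            # generated tokens per request
    input_price_per_mtok: float   # USD per 1M input tokens (hypothetical)
    output_price_per_mtok: float  # USD per 1M output tokens (hypothetical)

    def cost_per_query(self) -> float:
        """Dollars for a single request."""
        return (
            self.input_tokens * self.input_price_per_mtok
            + self.output_tokens * self.output_price_per_mtok
        ) / 1_000_000

    def cost_per_day(self, qps: float) -> float:
        """Dollars per day at a sustained average QPS (86,400 seconds per day)."""
        return self.cost_per_query() * qps * 86_400


budget = TokenBudget(
    input_tokens=6_000,
    output_tokens=400,
    input_price_per_mtok=3.0,
    output_price_per_mtok=15.0,
)
print(f"${budget.cost_per_query():.4f} per query")
print(f"${budget.cost_per_day(qps=5):,.0f} per day at 5 QPS")
```

Stating the per-query and per-day figures out loud before drawing any architecture is what "quantify before you architect" looks like in practice.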

Red flags by phase

What sinks candidates, grouped by ADEPT phase
A: Designing before quantifying. No quality bar stated as a number: "good answers" instead of a target metric.
D: Retrieval quality as an afterthought. "Just use a bigger context window" instead of fixing recall and precision.
E: Reaching for an agent (or multi-agent) when a deterministic workflow would do the job more cheaply and reliably.
P: Trusting retrieved or user content (injection). Granting full autonomy on irreversible actions with no human gate.
T: "It looked good." 1–5 judge scores. Never validating the judge against humans. Reporting infra metrics only.
Each of these is a level-down signal on its own. Name the opposite unprompted to climb.

The decision ladders

Prompt → RAG → fine-tune
Better prompt
Capability already in the model — clarify the instruction first. Cheapest, fastest to iterate.
RAG
Fresh or proprietary FACTS the model lacks. Default for knowledge gaps; measure retrieval.
Fine-tune (LoRA first)
Consistent BEHAVIOR / FORMAT, or a cheaper specialized model. Reach last, LoRA before full.
Control-flow choices
Workflow vs agent
Deterministic, predictable subtasks → workflow. Model must direct itself over unpredictable steps → agent. Pick the simplest that works.
Single vs multi-agent
Default single. Go multi only when the task genuinely decomposes AND the value covers the ~15× token cost.
Model routing
Cheap model for easy turns, frontier for hard, with fallbacks. Don’t pay frontier prices for trivial calls.
Three ladders, one rule: justify every step up. The default sits at the cheap end; heavier choices need a reason out loud.
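The model-routing rung (cheap model for easy turns, frontier for hard, with fallbacks) might look like the sketch below. Everything here is an assumption for illustration: the `route` function, the stub models, and especially the word-count difficulty heuristic, which stands in for whatever classifier or rule a real system would use.

```python
from typing import Callable, Sequence

ModelFn = Callable[[str], str]


def route(
    prompt: str,
    is_hard: Callable[[str], bool],
    cheap: Sequence[ModelFn],
    frontier: Sequence[ModelFn],
) -> str:
    """Try the tier picked by `is_hard` first, then fall back to the other tier."""
    first, second = (frontier, cheap) if is_hard(prompt) else (cheap, frontier)
    last_error = None
    for model in [*first, *second]:
        try:
            return model(prompt)
        except Exception as exc:  # a real system would catch narrower error types
            last_error = exc
    raise RuntimeError("all models failed") from last_error


# Hypothetical stand-ins for real model clients.
def cheap_model(prompt: str) -> str:
    return f"cheap:{prompt}"


def frontier_model(prompt: str) -> str:
    return f"frontier:{prompt}"


def flaky_cheap_model(prompt: str) -> str:
    raise TimeoutError("cheap tier unavailable")


def looks_hard(prompt: str) -> bool:
    # Toy heuristic (assumption): long prompts count as hard.
    return len(prompt.split()) > 20


easy = route("reset my password", looks_hard, [cheap_model], [frontier_model])
fallback = route("reset my password", looks_hard, [flaky_cheap_model], [frontier_model])
print(easy)
print(fallback)
```

The design choice worth defending is the default: easy traffic never touches the frontier tier unless the cheap tier fails.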

Component cheat sheet

Tools to name, with the right tradeoff: say the name AND why
Vector DBs (platform)
pgvector · Pinecone · Qdrant · Weaviate · Milvus
pgvector if already on Postgres (under ~10M vectors); Pinecone for zero-ops managed (pricier at scale); Qdrant for best filtered search and easy self-host; Weaviate for native hybrid; Milvus for billion-scale.
Serving (runtime)
vLLM · TGI · TensorRT-LLM
vLLM is the default (PagedAttention + continuous batching); TGI for simple ops; TensorRT-LLM for max throughput, NVIDIA-compiled.
Eval / observability (process)
Ragas · LangSmith · Langfuse · Braintrust · Arize Phoenix · OTel GenAI
Ragas for RAG metrics; LangSmith for LangChain stacks; Langfuse for open-source self-host; Braintrust for CI eval gates; Phoenix is OTel-native; OpenTelemetry GenAI is the standard.
Orchestration (platform)
LangGraph · LlamaIndex · CrewAI · AutoGen · Temporal
LangGraph for stateful graphs; LlamaIndex for RAG; CrewAI for role-based multi-agent; AutoGen for conversational multi-agent; Temporal for durable execution.
Retrieval (runtime)
BM25 · dense embeddings · RRF · cross-encoder rerank
BM25 for sparse/lexical; dense embeddings for semantic; reciprocal rank fusion to merge them; cross-encoder rerank for final precision.
Naming a tool earns nothing; naming the tradeoff that picks it earns the point.
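Reciprocal rank fusion, named in the Retrieval row above, is small enough to write on a whiteboard. The sketch below sums 1/(k + rank) across ranked lists; the function name and document IDs are illustrative, and k = 60 is the commonly used default rather than anything this guide prescribes.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of doc IDs by summing 1 / (k + rank), with rank starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["d1", "d2", "d3"]    # sparse / lexical ranking (made-up IDs)
dense_hits = ["d3", "d1", "d4"]   # dense / semantic ranking (made-up IDs)
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Because only ranks are used, BM25 scores and cosine similarities never need to be put on a common scale. In a full pipeline, a cross-encoder would then rerank the top of `fused`.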

Numbers worth memorizing

≈90% / 85%
Prompt caching: up to ~90% cost / ~85% latency saved on long prompts; cache reads ≈ 10% of input price (Anthropic)
~15×
Multi-agent burns ~15× the tokens of a single chat; token usage explained ~80% of performance variance (Anthropic)
~100:1
Agents run a ~100:1 input:output token ratio in production (Manus)
90.2%
Multi-agent lift over single-agent on a research eval (Anthropic)
O(n²)
Attention cost is quadratic in sequence length — why context is finite and expensive
Real, cited figures — drop one precisely and you sound like you’ve operated these systems, not just read about them.
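To see how the "cache reads ≈ 10% of input price" figure turns into a saving, here is a rough worked calculation. It is a simplification under stated assumptions: the function name, token counts, price, and hit rate are hypothetical, and any cache-write surcharge is ignored.

```python
def effective_input_cost(
    prefix_tokens: int,
    fresh_tokens: int,
    price_per_mtok: float,
    hit_rate: float,
    read_multiplier: float = 0.10,
) -> float:
    """Expected input cost per request in USD.

    Assumes cached prefix reads cost `read_multiplier` x the normal input price
    and ignores cache-write surcharges (a simplifying assumption).
    """
    full_prefix = prefix_tokens * price_per_mtok
    cached_prefix = full_prefix * read_multiplier
    expected_prefix = hit_rate * cached_prefix + (1.0 - hit_rate) * full_prefix
    return (expected_prefix + fresh_tokens * price_per_mtok) / 1_000_000


# Hypothetical workload: 10k-token shared prefix, 500 fresh tokens, $3 per 1M input tokens.
no_cache = effective_input_cost(10_000, 500, 3.0, hit_rate=0.0)
with_cache = effective_input_cost(10_000, 500, 3.0, hit_rate=0.8)
saving = 1.0 - with_cache / no_cache
print(f"no cache:   ${no_cache:.4f}")
print(f"80% hits:   ${with_cache:.4f}")
print(f"saving:     {saving:.0%}")
```

The saving approaches the headline 90% only as the hit rate nears 100% and the cached prefix dominates the request, which is why cache hit rate belongs on the cost dashboard.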

Metrics that matter

The metric stack: what to optimize, by layer
Retrieval (process)
recall@k · context precision · MRR / NDCG
Did the right context make it into the window, and was it ranked well? Bad retrieval is the top cause of a bad RAG system.
Generation (process)
faithfulness / groundedness · answer relevance · citation accuracy
Is the answer supported by the retrieved context, on-topic, and correctly cited? This is your hallucination defense.
Product (runtime)
task completion · escalation precision · CSAT
The only metrics the business actually feels. A faithful answer that fails the task still failed.
Cost (cost)
cost per SUCCESSFUL outcome · cache hit rate
Dollars per successful task, not raw spend — optimizing tokens alone just makes a worse system cheaper.
Latency (runtime)
TTFT · TPOT · p95
Time-to-first-token drives perceived speed (stream it); time-per-output-token drives total wait. Always quote the p95.
Lead with cost per successful outcome and a validated faithfulness number — those two phrases carry the whole eval story.
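Three of the metrics above reduce to a few lines each. The sketch below uses common textbook definitions of recall@k and MRR plus the cost-per-successful-outcome ratio; the function names and sample data are illustrative assumptions, not any particular eval library's API.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of 1/rank of the first relevant doc per query (0 if none is retrieved)."""
    reciprocal_ranks: List[float] = []
    for retrieved, relevant in runs:
        score = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                score = 1.0 / rank
                break
        reciprocal_ranks.append(score)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0


def cost_per_successful_outcome(total_spend_usd: float, successes: int) -> float:
    """Dollars per task that actually succeeded; infinite if nothing succeeded."""
    return float("inf") if successes == 0 else total_spend_usd / successes


r3 = recall_at_k(["a", "b", "c", "d"], {"b", "d", "z"}, k=3)
mrr = mean_reciprocal_rank([(["a", "b"], {"b"}), (["c", "d"], {"c"})])
cpso = cost_per_successful_outcome(120.0, 80)
print(round(r3, 3), mrr, cpso)
```

Dividing spend by successes rather than by requests is the point: a cheaper system that completes fewer tasks can show a worse number here.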

One-line glossary

Terms an interviewer might drop, one line each
RAG: Retrieve relevant docs, inject them into the prompt, ground the answer in them; fixes stale and proprietary knowledge.
Hybrid search: Combine sparse (BM25) and dense (embedding) retrieval to catch both exact terms and semantic matches.
RRF: Reciprocal rank fusion; merge multiple ranked lists by summing 1/(k+rank), no score calibration needed.
Reranking: A cross-encoder rescores the top-k candidates for precision before they enter the context.
ReAct: Reason + act loop: the model interleaves thoughts, tool calls, and observations until it answers.
Orchestrator-worker: A lead agent decomposes a task and delegates subtasks to worker agents, then synthesizes.
Context window: The finite token budget shared across system prompt, tools, retrieved context, and history; more is not better.
TTFT / TPOT: Time-to-first-token (perceived speed) and time-per-output-token (streaming throughput).
KV cache: Cached attention keys/values for prior tokens so generation doesn’t recompute the whole prefix each step.
PagedAttention: vLLM’s paged KV-cache memory management; packs more concurrent requests onto a GPU.
Continuous batching: Swap finished sequences out and new ones in mid-batch to keep the GPU saturated.
Prompt caching: Reuse a cached prompt prefix across calls; large cost and latency savings on repeated context.
Context rot: Quality degrades as the window fills with stale or irrelevant tokens; trim aggressively.
LLM-as-judge: Use a model to score outputs; prefer binary rubrics and validate against human labels.
Golden set: A curated, labeled eval set you run offline as the regression bar for every change.
Lethal trifecta: Untrusted input + access to private data + ability to exfiltrate; together they enable data theft.
Prompt injection: Malicious instructions in input. Direct = user-typed; indirect = hidden in retrieved/fetched content.
LoRA / PEFT: Parameter-efficient fine-tuning; train small adapter weights instead of the full model.
Distillation: Train a small student model to mimic a larger teacher for a cheaper specialized deployment.
MCP: Model Context Protocol, an open standard for connecting models to tools and data sources.
Quantization: Lower-precision weights (INT8/INT4) to shrink memory and speed inference, trading some accuracy.
Faithfulness: Whether the answer is actually supported by the provided context; groundedness, not correctness.
If a term comes up and you can give the one-liner plus when it matters, you keep momentum instead of stalling.
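The LLM-as-judge entry says to validate the judge against human labels. One simple way to do that with a binary rubric is raw agreement plus Cohen's kappa, sketched below; the function name and the eight sample labels are made up for illustration, and what counts as "good enough" agreement is a judgment call this sketch does not settle.

```python
from typing import Sequence, Tuple


def judge_agreement(judge: Sequence[int], human: Sequence[int]) -> Tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for binary labels (1 = pass, 0 = fail)."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n   # share of items the judge passed
    p_human = sum(human) / n   # share of items the human passed
    # Agreement expected by chance if the two raters were independent.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    return observed, kappa


# Made-up labels on the same eight outputs.
judge_labels = [1, 1, 0, 0, 1, 0, 1, 1]
human_labels = [1, 1, 0, 1, 1, 0, 0, 1]
agreement, kappa = judge_agreement(judge_labels, human_labels)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

Kappa corrects for the agreement two raters would reach by chance, which matters when most outputs pass and raw agreement looks flattering.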

The three signals that win offers

What gets you the offer

Interviewers are buying one thing: confidence that you'll build something that works in production and keeps working. Three signals carry that. Show them and the level takes care of itself.

  • Production judgment over algorithmic polish — you talk in tokens, dollars, and failure modes, not just architecture diagrams.
  • Evals as a reflex — every design choice comes with how you’d measure it, and you validate the judge before trusting it.
  • Knowing when NOT to use an agent — you reach for the simplest control flow that works and defend it out loud.

Good luck. Reread the overview for the why, and keep The ADEPT Framework open as your in-room script. You’ve got this.