Foundations
Nine concept clusters separate the candidate who has shipped agents from the one who has read about them. This chapter is the map: a dense, scannable reference to each cluster, with a tight definition, the tradeoffs interviewers actually probe, the key terms you must use without hesitating, and Go deep links to the deep-dives on this site that teach it in full and to an authoritative external source. Skim it to find your gaps; click through to close them.
The interview is not a vocabulary test. It is a test of judgment under tradeoffs. Every cluster below is really a decision: bigger context or tighter context, RAG or fine-tune, workflow or agent, judge or assertion. Speak the terms, but win on the tradeoff.
1. LLM fundamentals
The model is a probabilistic next-token machine, and every cost, latency, and reliability property falls out of that. Tokenization (BPE) splits text into sub-word tokens — it drives both price and how much fits in the window. The context window is a finite token budget shared by input and output. Decoding is sampling: temperature scales logit sharpness (low = peaked/deterministic, high = flat/creative), while top-p/top-k truncate the candidate set. Embeddings map text to vectors for similarity; function/tool calling plus structured output (JSON schema, constrained decoding) make the model programmable rather than just conversational.
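The decoding knobs above can be made concrete with a few lines of arithmetic. The following is a minimal sketch, not any provider's actual sampler: the three-token vocabulary and its logits are made up for illustration, and the function names are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of most-likely tokens whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.0]  # toy logits for a hypothetical three-token vocabulary
cold = softmax_with_temperature(logits, 0.5)   # peaked: top token dominates
hot = softmax_with_temperature(logits, 2.0)    # flatter: more mass on the tail
nucleus = top_p_filter(softmax_with_temperature(logits, 1.0), 0.9)
```

With these toy logits the top token's probability is higher at temperature 0.5 than at 2.0, and the top-p filter at 0.9 drops the least likely token before sampling.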
Tradeoffs probed: low temp = repeatable but bland vs high temp = creative but risky; bigger context window does not mean better answers. Know that latency splits into TTFT (time-to-first-token) and TPOT (time-per-output-token), that generation time after the first token tracks output length rather than input size, and that input and output tokens are priced differently, so the cheap fix is often "make it answer shorter," not "make the prompt smaller."
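A back-of-the-envelope model shows why shortening the answer usually beats shortening the prompt. This is a sketch under stated assumptions: the TTFT, TPOT, and per-million-token prices below are hypothetical, and TTFT is held constant even though in practice it grows somewhat with prompt length.

```python
def estimate_latency_s(ttft_s, tpot_s, output_tokens):
    """End-to-end latency: wait for the first token, then one TPOT per remaining token."""
    return ttft_s + tpot_s * max(output_tokens - 1, 0)

def estimate_cost_usd(input_tokens, output_tokens, in_price_per_mtok, out_price_per_mtok):
    """Input and output tokens are billed at different per-million-token rates."""
    return (input_tokens * in_price_per_mtok + output_tokens * out_price_per_mtok) / 1_000_000

# Hypothetical numbers, chosen only to illustrate the shape of the tradeoff.
TTFT_S, TPOT_S = 0.5, 0.02
IN_PRICE, OUT_PRICE = 3.0, 15.0  # USD per million tokens

baseline_cost = estimate_cost_usd(2000, 800, IN_PRICE, OUT_PRICE)
shorter_prompt_cost = estimate_cost_usd(1000, 800, IN_PRICE, OUT_PRICE)
shorter_answer_cost = estimate_cost_usd(2000, 400, IN_PRICE, OUT_PRICE)

baseline_latency = estimate_latency_s(TTFT_S, TPOT_S, 800)
shorter_answer_latency = estimate_latency_s(TTFT_S, TPOT_S, 400)
```

Under these assumed prices, halving the answer saves more money than halving the prompt, and it roughly halves latency as well.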
BPE · context window · temperature · top-p/top-k · embeddings · tool calling · structured output · TTFT · TPOT
Go deep: Context assembly · Chip Huyen, AI Engineering
2. Prompt and context engineering
Prompt engineering is crafting the instructions for one call; context engineering is curating everything in the window across an agent's entire run. This matters because of context rot — as the token count grows, recall degrades, surfacing as context poisoning (a bad fact persists), distraction (irrelevant tokens crowd out signal), and confusion (contradictory context). Mitigate with compaction, retrieval over stuffing, and disciplined session management.
Tradeoffs probed: few-shot vs zero-shot; the value of decomposing a "God object" prompt into single-purpose, testable prompts. Know prompt caching cold: a stable prefix is reused so cache reads cost ~10% of normal input price, for up to ~90% cost and ~85% latency savings, but the cache matches only an exact prefix, so a single-token change (say, a timestamp at the top) invalidates everything after it. Put volatile content at the end.
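The prefix rule is easy to see in a toy cost model. The sketch below assumes token-level prefix matching and a 10% cache-read price, and it ignores details that real providers add (cache-write premiums, minimum cacheable lengths, block-level granularity, expiry). The function names and token lists are hypothetical.

```python
def cached_prefix_len(previous_tokens, current_tokens):
    """Number of leading tokens shared with the previous request (the reusable prefix)."""
    n = 0
    for a, b in zip(previous_tokens, current_tokens):
        if a != b:
            break
        n += 1
    return n

def input_cost_units(previous_tokens, current_tokens, cache_read_price=0.10):
    """Cost in 'full-price token' units: cached tokens are billed at cache_read_price."""
    hit = cached_prefix_len(previous_tokens, current_tokens)
    miss = len(current_tokens) - hit
    return hit * cache_read_price + miss

system_prompt = ["sys"] * 1000  # stand-in for a long, stable system prompt

# Volatile content at the end: the whole system prompt is a cache hit.
stable_cost = input_cost_units(system_prompt + ["q1"], system_prompt + ["q2"])

# Timestamp at the top: the very first token differs, so nothing is reused.
volatile_cost = input_cost_units(
    ["ts=1"] + system_prompt + ["q1"],
    ["ts=2"] + system_prompt + ["q1"],
)
```

In this toy setup the stable layout costs about a tenth of the timestamp-first layout for the same prompt content.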
context rot · poisoning / distraction / confusion · compaction · prefix caching · few-shot vs zero-shot
Go deep: Compaction · Prompt caching · Anthropic — effective context engineering
3. Retrieval-augmented generation (RAG)
RAG retrieves external chunks and injects them as context to ground generation in current, proprietary facts. The pipeline is chunk → embed → vector index → retrieve → (rerank) → generate. Chunking strategy matters — fixed, recursive, semantic, or parent-child (retrieve a small chunk, feed the large parent). Hybrid search — BM25 (sparse/lexical) plus dense embeddings, merged with reciprocal rank fusion — is the production default; a cross-encoder reranker rescores the top-k for precision; GraphRAG suits highly connected corpora.
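Reciprocal rank fusion is small enough to write out. The sketch below merges a lexical and a dense ranking by summing 1 / (k + rank) per document; k = 60 is the commonly used constant, and the document IDs and rankings are invented for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs; each list contributes 1 / (k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; break ties by doc ID for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # hypothetical sparse/lexical results
dense_ranking = ["doc_b", "doc_d", "doc_a"]  # hypothetical embedding results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because RRF uses only ranks, it needs no score normalization between BM25 and cosine similarity, which is a large part of why it is a convenient default.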
Tradeoffs probed: RAG beats fine-tuning for knowledge and freshness; long context will not replace RAG (cost plus "lost in the middle" recall sag); a bi-encoder is fast and cacheable while a cross-encoder is accurate but expensive. Name the metrics: context precision (share of retrieved that is relevant), context recall (share of relevant that is retrieved), faithfulness, plus recall@k, MRR, NDCG.
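The retrieval metrics reduce to a few set and rank operations. This sketch uses the simple set-based definitions given above; evaluation libraries often use rank-aware variants of context precision, so treat these functions as illustrative rather than as any library's implementation. The document IDs are made up.

```python
def context_precision(retrieved, relevant):
    """Share of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)

def recall_at_k(retrieved, relevant, k):
    """Share of relevant documents that appear in the top k retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant hit; MRR is the mean of this over many queries."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d7", "d2"]  # hypothetical ranked retrieval result
relevant = {"d1", "d2", "d5"}         # hypothetical ground-truth relevant set

precision = context_precision(retrieved, relevant)
recall_4 = recall_at_k(retrieved, relevant, 4)
first_hit = reciprocal_rank(retrieved, relevant)
```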
chunk → embed → index → retrieve → rerank → generate · hybrid search · RRF · cross-encoder · GraphRAG · lost in the middle
Go deep: The context hierarchy · Chip Huyen, AI Engineering (Ch. 6)
4. Agents
Lilian Weng's decomposition is the canonical frame: an agent is an LLM as brain wired to Planning + Memory + Tool use. Planning spans CoT, Tree-of-Thoughts, ReAct (interleave reasoning and acting), and Reflexion (learn from failed attempts). Memory is short-term/working (in-context) plus long-term (external vector store with fast retrieval). The runtime is the agent loop: observe → think → act → observe.
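The loop itself is only a few lines once the model call is abstracted away. In this minimal sketch the "model" is a scripted stand-in function, not a real LLM call; the decision format, tool registry, and step budget are all hypothetical choices made for illustration.

```python
def run_agent(task, policy, tools, max_steps=5):
    """Observe -> think -> act -> observe, bounded by a step budget."""
    observations = [task]                    # observe: start from the task itself
    for _ in range(max_steps):
        decision = policy(observations)      # think: the model picks the next action
        if decision["action"] == "finish":
            return decision["answer"]
        tool = tools[decision["action"]]
        result = tool(**decision["args"])    # act: call the chosen tool
        observations.append(result)          # observe: feed the result back in
    raise RuntimeError("step budget exhausted without an answer")

def scripted_policy(observations):
    """Stand-in for an LLM: call the add tool once, then finish with its result."""
    if len(observations) == 1:
        return {"action": "add", "args": {"a": 2, "b": 3}}
    return {"action": "finish", "answer": f"The sum is {observations[-1]}"}

tools = {"add": lambda a, b: a + b}
answer = run_agent("What is 2 + 3?", scripted_policy, tools)
```

The step budget matters in practice: without it, a model that never emits a finish action loops until the context or the bill runs out.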
Anthropic draws the load-bearing line between workflows (LLMs orchestrated through predefined code paths) and agents (the LLM directs its own process). The five workflow patterns — chaining, routing, parallelization, orchestrator-worker, evaluator-optimizer — cover most real systems.
Tradeoffs probed: prefer deterministic workflows for reliability and go agentic only when subtasks are genuinely unpredictable. Moving from a single agent to multiple agents trades simplicity for parallelism and adds coordination cost; a multi-agent system can burn ~15× the tokens of a chat (Anthropic). "Why not just a workflow here?" is a question you must always be ready to answer.
Planning + Memory + Tool use · ReAct · Reflexion · observe → think → act · workflow vs agent · orchestrator-worker
Go deep: The agent loop · Sub-agents · Anthropic — building effective agents
5. Evaluation
Eval-driven development (Hamel Husain) inverts the usual instinct: do not start by writing evals, start with error analysis on real traces. Then write assertion/code-based evals for deterministic failures (cheap, fast, unambiguous) and reach for LLM-as-judge only for subjective cases. Make judges binary pass/fail, never a vague 1-5, and validate the judge against human labels before you trust it. Offline evals (golden sets, regression suites) are fast and cheap; online evals (A/B tests, production traces) are real but slow.
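Both halves of that advice fit in a short sketch: a code-based assertion for a deterministic failure, and a check of a binary judge against human labels before trusting it. The function names, required keys, and label lists below are hypothetical; the true-positive and true-negative rates are one reasonable way to summarize agreement, not the only one.

```python
import json

def passes_json_contract(output, required_keys):
    """Assertion-style eval: output must be a JSON object containing the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(required_keys) <= set(data)

def judge_agreement(human_labels, judge_labels):
    """Compare binary pass/fail judge verdicts against human labels on the same traces."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)
    tn = sum(1 for h, j in pairs if not h and not j)
    positives = sum(1 for h in human_labels if h)
    negatives = len(human_labels) - positives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "true_positive_rate": tp / positives if positives else None,
        "true_negative_rate": tn / negatives if negatives else None,
    }

human = [True, True, False, False, True]   # hypothetical human pass/fail labels
judge = [True, False, False, True, True]   # hypothetical LLM-judge verdicts
report = judge_agreement(human, judge)
```

Reporting the two rates separately guards against a judge that looks accurate only because most traces pass.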
Tradeoffs probed: over-optimizing a single metric quietly degrades real performance. Know the RAG triad (faithfulness, answer relevance, context relevance) and the judge failure modes: position bias and length bias, controlled with pairwise comparison run in both answer orders and by allowing ties.
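One common way to control position bias is to run each pairwise comparison in both orders and accept a winner only when the two verdicts agree. The sketch below assumes a judge callable that returns "first", "second", or "tie"; both judges here are hypothetical stubs standing in for an LLM call.

```python
def debiased_pairwise(judge, answer_a, answer_b):
    """Judge (a, b) and (b, a); declare a winner only if both orderings agree."""
    forward = judge(answer_a, answer_b)
    backward = judge(answer_b, answer_a)
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    return "tie"  # disagreement or explicit ties collapse to a tie

def position_biased_judge(first, second):
    """Stub judge that always prefers whichever answer is shown first."""
    return "first"

def content_judge(first, second):
    """Stub judge that prefers the answer containing the expected value '42'."""
    if ("42" in first) == ("42" in second):
        return "tie"
    return "first" if "42" in first else "second"

biased_result = debiased_pairwise(position_biased_judge, "answer one", "answer two")
fair_result = debiased_pairwise(content_judge, "The answer is 42", "No idea")
```

A purely position-driven judge is neutralized into a tie, while a judge that responds to content still produces a winner.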
error analysis first · assertion vs LLM-judge · binary judges · validate against humans · offline vs online · position / length bias
Go deep: Metrics that matter · Replay and evals · Hamel Husain — Your AI product needs evals
6. Guardrails and safety
Prompt injection — especially the indirect kind, smuggled in via retrieved documents or tool output — is the central agentic threat. Simon Willison's lethal trifecta names the failure precisely: an agent with (1) access to private data, (2) exposure to untrusted content, and (3) the ability to communicate externally can be made to exfiltrate that data by a single poisoned input — no code vulnerability required. There is no reliable prompt-only fix; the defense is to break one leg of the trifecta.
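The trifecta lends itself to a simple configuration check at deploy time. This is a sketch with a hypothetical capability model, not a security control in itself: the class and field names are invented, and deciding what counts as private data, untrusted content, or external communication is the hard part that the code takes as given.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AgentCapabilities:
    reads_private_data: bool
    sees_untrusted_content: bool
    can_communicate_externally: bool

def has_lethal_trifecta(caps):
    """True only when all three legs are present in the same agent."""
    return (
        caps.reads_private_data
        and caps.sees_untrusted_content
        and caps.can_communicate_externally
    )

# Hypothetical email assistant: reads the inbox, ingests inbound mail, can send mail.
email_assistant = AgentCapabilities(True, True, True)

# Break one leg: same agent, but outbound communication is removed.
contained_assistant = replace(email_assistant, can_communicate_externally=False)

risky = has_lethal_trifecta(email_assistant)
safe = has_lethal_trifecta(contained_assistant)
```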
Tradeoffs probed: guardrails add latency, cost, and false positives; sandboxing trades developer velocity for containment. Know the layered defenses: jailbreak detection, PII redaction, content moderation, output/schema validation, and sandboxing with least privilege and egress control.
prompt injection (indirect) · lethal trifecta · break one leg · PII redaction · schema validation · sandboxing / egress control
Go deep: Prompt injection · Sandboxing · Simon Willison — the lethal trifecta
7. Serving and inference
You should be able to explain why vLLM is fast. PagedAttention stores the KV cache in non-contiguous paged blocks — like an OS managing virtual memory — eliminating fragmentation for roughly 2-4× throughput, and continuous batching slots new requests in as earlier generations finish rather than waiting for a whole batch. The KV cache itself exists to avoid recomputing attention over prior tokens. Quantization (INT8/INT4, FP8, GPTQ, AWQ) shrinks memory at a small accuracy cost. Alternatives: TGI (Hugging Face, simple ops) and TensorRT-LLM (NVIDIA, compiled, max throughput).
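The fragmentation argument can be shown with slot counting alone. The sketch below is a toy comparison, not vLLM's allocator: it assumes a naive baseline that reserves a hypothetical 2048-token maximum per request, a hypothetical block size of 16 tokens, and made-up sequence lengths.

```python
import math

def contiguous_slots(seq_lens, max_len):
    """Naive allocator: reserve max_len KV slots per request up front."""
    return len(seq_lens) * max_len

def paged_slots(seq_lens, block_size):
    """Paged allocator: fixed-size blocks on demand; only the last block is partly empty."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

seq_lens = [100, 900, 37]            # hypothetical tokens held per active request
used = sum(seq_lens)                 # slots that actually hold KV entries

naive = contiguous_slots(seq_lens, max_len=2048)
paged = paged_slots(seq_lens, block_size=16)

naive_waste = naive - used
paged_waste = paged - used           # bounded by (block_size - 1) per request
```

In this toy case the naive scheme reserves several times more slots than it uses, while paging wastes less than one block per request; the reclaimed memory is what lets more requests share a batch.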
Tradeoffs probed: prefill is compute-bound and parallel while decode is memory-bound and sequential, and attention is quadratic in sequence length — so long prompts hurt. Throughput trades against latency (batching lifts throughput but can hurt TTFT); quantization saves memory for a little accuracy; self-hosting is justified only by privacy or scale-economics, otherwise call an API. A model gateway/router picks the right model per request.
PagedAttention · continuous batching · KV cache · quantization (INT8/INT4, FP8, GPTQ, AWQ) · prefill vs decode · quadratic attention
Go deep: Model gateway · vLLM — PagedAttention and continuous batching
8. Observability and cost
You cannot improve what you cannot trace. The OpenTelemetry GenAI semantic conventions standardize agent telemetry through gen_ai.* span attributes — request.model, usage.input_tokens/output_tokens, and dedicated tool-call spans — so traces are portable across vendors.
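As a rough illustration of what those attributes look like on a span, the sketch below builds plain attribute dictionaries rather than calling the OpenTelemetry SDK. The attribute keys follow the GenAI semantic conventions as I understand them; the conventions are still evolving, so check the current specification before relying on exact names. The helper functions, model name, and tool name are hypothetical.

```python
def chat_span_attributes(model, input_tokens, output_tokens):
    """Attributes for a span covering one model call."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

def tool_span_attributes(tool_name):
    """Attributes for a span covering one tool execution."""
    return {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": tool_name,
    }

def total_tokens(spans):
    """Sum token usage across all spans in a run; tool spans contribute nothing."""
    return sum(
        s.get("gen_ai.usage.input_tokens", 0) + s.get("gen_ai.usage.output_tokens", 0)
        for s in spans
    )

run_spans = [
    chat_span_attributes("example-model", 1200, 80),   # hypothetical model name
    tool_span_attributes("search_docs"),               # hypothetical tool name
    chat_span_attributes("example-model", 1500, 240),
]
run_tokens = total_tokens(run_spans)
```

Because every vendor emits the same keys, a roll-up like this works unchanged across providers and tracing backends.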
Tradeoffs probed: track cost per successful outcome, not raw token cost: a cheap call that fails and retries five times is expensive. Trace the full multi-step run, not isolated calls, and review production traces on a regular cadence (that review doubles as the input to error analysis from cluster 5).
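The metric is simple to compute once every attempt is logged with its cost and outcome. The sketch below uses invented per-call costs for a hypothetical cheap model that needs retries and a hypothetical stronger model that succeeds first time.

```python
def cost_per_success(attempts):
    """attempts: list of (cost_usd, succeeded). Total spend divided by successful outcomes."""
    total_cost = sum(cost for cost, _ in attempts)
    successes = sum(1 for _, ok in attempts if ok)
    return total_cost / successes if successes else float("inf")

# Hypothetical: the cheap model fails five times, then succeeds on the sixth call.
cheap_model = [(0.01, False)] * 5 + [(0.01, True)]

# Hypothetical: the stronger model costs 4x per call but succeeds immediately.
strong_model = [(0.04, True)]

cheap_cps = cost_per_success(cheap_model)
strong_cps = cost_per_success(strong_model)
```

Per call the cheap model looks four times cheaper; per successful outcome it is the more expensive option in this example.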
OpenTelemetry GenAI · gen_ai.* spans · cost per successful outcome · full-run tracing
Go deep: Tracing · Cost accounting · applied-llms.org
9. Fine-tuning vs RAG vs prompting
There is a decision ladder, and the order is the whole answer: prompt → RAG → fine-tune, in that sequence, escalating only when the cheaper rung plateaus. Parameter-efficient fine-tuning (PEFT) methods such as LoRA and QLoRA train small low-rank adapters, cheap and swappable, instead of full weights; distillation compresses a teacher model into a smaller student. The heuristic: need fresh or proprietary facts → RAG; need consistent behavior/format or a cheaper specialized model → fine-tune (LoRA first); just need more capability → write a better prompt.
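The "cheap" in low-rank adapters is a matter of parameter counting. For a weight matrix of shape d_out × d_in, LoRA trains two factors of rank r with r × (d_in + d_out) parameters in total, instead of all d_in × d_out weights. The dimensions and rank below are illustrative choices, not taken from any particular model.

```python
def full_finetune_params(d_in, d_out):
    """Trainable parameters when updating the whole weight matrix."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable parameters for a rank-r adapter: B is d_out x r, A is r x d_in."""
    return rank * (d_in + d_out)

d_in = d_out = 4096   # illustrative projection size
rank = 8              # illustrative adapter rank

full = full_finetune_params(d_in, d_out)
adapter = lora_params(d_in, d_out, rank)
fraction = adapter / full
```

For this one matrix the adapter trains well under one percent of the weights, which is why adapters are cheap to train and small enough to swap per task.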
Tradeoffs probed: fine-tune only after prompting plateaus around 90% and you expect to iterate repeatedly; "no GPUs before PMF"; and remember the recurring tax — every base-model upgrade can force you to redo the fine-tune.
prompt → RAG → fine-tune · LoRA/QLoRA/PEFT · distillation · facts vs behavior vs capability · no GPUs before PMF
Go deep: Evaluation and quality · applied-llms.org
Red flags
The fastest way to lose a senior signal is to over-engineer. Strong candidates reach for the simplest mechanism that works, escalate only on evidence, and can always name the tradeoff they are accepting.
- Reaching for fine-tuning to add knowledge — that is RAG's job; fine-tuning shapes behavior.
- Claiming a long context window removes the need for RAG — it raises cost and loses the middle.
- Treating prompt injection as a prompt-engineering problem — the fix is architectural: break the lethal trifecta.
- Defaulting to a multi-agent system when a deterministic workflow would be more reliable and far cheaper.
- Quoting raw token cost as the efficiency metric instead of cost per successful outcome.
- Using a vague 1-5 LLM judge that was never validated against human labels.
What to study next: pick the two clusters where you hesitated, click their Go deep links, and come back able to defend the tradeoff out loud. Then move on to the patterns chapter, where these foundations get assembled into systems.