Foundations

Nine concept clusters separate the candidate who has shipped agents from the one who has read about them. This chapter is the map — a dense, scannable reference to each cluster: a tight definition, the tradeoffs interviewers actually probe, the key terms you must use without hesitating, and two links per cluster — one to the deep-dive on this site that teaches it in full, one to the authoritative external source. Skim it to find your gaps; click through to close them.

The interview is not a vocabulary test. It is a test of judgment under tradeoffs. Every cluster below is really a decision: bigger context or tighter context, RAG or fine-tune, workflow or agent, judge or assertion. Speak the terms, but win on the tradeoff.

Numbers worth knowing cold, the kind an interviewer drops to see if you flinch:

  • ~10%: cache-read price vs fresh input tokens
  • ~90% / ~85%: cost / latency cut from prompt caching
  • ~15×: tokens a multi-agent system can burn

1. LLM fundamentals

The model is a probabilistic next-token machine, and every cost, latency, and reliability property falls out of that. Tokenization (BPE) splits text into sub-word tokens — it drives both price and how much fits in the window. The context window is a finite token budget shared by input and output. Decoding is sampling: temperature scales logit sharpness (low = peaked/deterministic, high = flat/creative), while top-p/top-k truncate the candidate set. Embeddings map text to vectors for similarity; function/tool calling plus structured output (JSON schema, constrained decoding) make the model programmable rather than just conversational.
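The decoding knobs above can be sketched in a few lines. This is a minimal illustration, not any provider's actual sampler: the logits are made-up scores for three candidate tokens, and the function names are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p, then renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, temperature=0.2)  # peaked: top token dominates
hot = softmax_with_temperature(logits, temperature=2.0)   # flat: mass spreads out
nucleus = top_p_filter(softmax_with_temperature(logits), top_p=0.9)  # truncated candidate set
```

With these inputs the top token takes almost all of the mass at temperature 0.2 and only about half at 2.0, and the top-p filter drops the least likely token before renormalizing.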

Tradeoffs probed: low temp = repeatable but bland vs high temp = creative but risky; bigger context window does not mean better answers. Know that latency splits into TTFT (time-to-first-token) and TPOT (time-per-output-token); that input size mostly moves TTFT, while total generation time tracks output length; and that input and output tokens are priced differently. So the cheap fix is often "make it answer shorter," not "make the prompt smaller."
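A back-of-envelope model makes the "answer shorter" point concrete. Every number below (TTFT, TPOT, per-million-token prices, token counts) is an assumption chosen for illustration, not a quote from any provider.

```python
def estimate_latency_s(ttft_s, tpot_s, output_tokens):
    """The first token arrives after TTFT; each remaining token adds one TPOT."""
    return ttft_s + tpot_s * max(output_tokens - 1, 0)

def estimate_cost_usd(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Input and output tokens are billed at different per-million rates."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000

# Hypothetical serving profile and prices (assumptions, not real quotes).
TTFT_S, TPOT_S = 0.5, 0.02
PRICE_IN, PRICE_OUT = 3.0, 15.0

# Same 4,000-token prompt, two answer lengths.
long_answer = (estimate_latency_s(TTFT_S, TPOT_S, 800),
               estimate_cost_usd(4000, 800, PRICE_IN, PRICE_OUT))
short_answer = (estimate_latency_s(TTFT_S, TPOT_S, 200),
                estimate_cost_usd(4000, 200, PRICE_IN, PRICE_OUT))
```

Under these assumptions, cutting the answer from 800 to 200 tokens takes latency from roughly 16.5 s to 4.5 s and cost from about $0.024 to $0.015, without touching the prompt.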

BPE · context window · temperature · top-p/top-k · embeddings · tool calling · structured output · TTFT · TPOT

Go deep: Context assembly · Chip Huyen, AI Engineering

2. Prompt and context engineering

Prompt engineering is crafting the instructions for one call; context engineering is curating everything in the window across an agent's entire run. This matters because of context rot — as the token count grows, recall degrades, surfacing as context poisoning (a bad fact persists), distraction (irrelevant tokens crowd out signal), and confusion (contradictory context). Mitigate with compaction, retrieval over stuffing, and disciplined session management.

Tradeoffs probed: few-shot vs zero-shot; the value of decomposing a "God object" prompt into single-purpose, testable prompts. Know prompt caching cold: a stable prefix is reused so cache reads cost ~10% of normal input price — up to ~90% cost and ~85% latency savings — but a single-token change to the prefix (say, a timestamp at the top) invalidates the whole cache. Put volatile content at the end.
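The prefix rule is easy to demonstrate. The sketch below uses character-level prefix matching as a stand-in for token-level matching, takes the ~10% cache-read multiplier from the text, and ignores any cache-write surcharge a provider may apply; the prompt strings and function names are hypothetical.

```python
def shared_prefix_len(a, b):
    """Length of the common leading run of two prompts (a proxy for the cacheable prefix)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def input_cost_units(prefix_tokens, suffix_tokens, cache_hit, cache_read_multiplier=0.10):
    """Input cost in 'fresh token' units; cached prefix tokens are billed at the read multiplier."""
    prefix_cost = prefix_tokens * (cache_read_multiplier if cache_hit else 1.0)
    return prefix_cost + suffix_tokens

SYSTEM = "You are a support agent. Follow the refund policy below.\n<policy text>"

# Volatile content first: the prompts diverge almost immediately.
bad_a = "Now: 09:00\n" + SYSTEM
bad_b = "Now: 09:01\n" + SYSTEM

# Volatile content last: the whole system prompt stays a reusable prefix.
good_a = SYSTEM + "\nNow: 09:00"
good_b = SYSTEM + "\nNow: 09:01"

bad_shared = shared_prefix_len(bad_a, bad_b)
good_shared = shared_prefix_len(good_a, good_b)

miss = input_cost_units(10_000, 200, cache_hit=False)
hit = input_cost_units(10_000, 200, cache_hit=True)
```

With the timestamp on top, the two prompts share only nine characters. With it at the end, the shared prefix covers the entire system prompt, and a 10,000-token prefix plus a 200-token suffix costs about 1,200 units instead of 10,200.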

context rot · poisoning / distraction / confusion · compaction · prefix caching · few-shot vs zero-shot

Go deep: Compaction · Prompt caching · Anthropic — effective context engineering

3. Retrieval-augmented generation (RAG)

RAG retrieves external chunks and injects them as context to ground generation in current, proprietary facts. The pipeline is chunk → embed → vector index → retrieve → (rerank) → generate. Chunking strategy matters: fixed, recursive, semantic, or parent-child (retrieve a small chunk, feed the large parent). Hybrid search, which combines BM25 (sparse/lexical) with dense embeddings and merges the two rankings with reciprocal rank fusion, is the production default; a cross-encoder reranker rescores the top-k for precision; GraphRAG suits highly connected corpora.
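Reciprocal rank fusion is small enough to write on a whiteboard. A minimal sketch, assuming each retriever returns an ordered list of document ids; the ids are made up and k=60 is the commonly used smoothing constant, not something the text specifies.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Score each document as the sum over rankings of 1 / (k + rank), then sort by score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_hits = ["d3", "d1", "d7"]    # hypothetical sparse/lexical ranking
dense_hits = ["d1", "d9", "d3"]   # hypothetical dense-embedding ranking

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# fused == ["d1", "d3", "d9", "d7"]
```

RRF only needs ranks, not raw scores, which is why it can merge BM25 and cosine-similarity results without any score normalization.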

Tradeoffs probed: RAG beats fine-tuning for knowledge and freshness; long context will not replace RAG (cost plus "lost in the middle" recall sag); a bi-encoder is fast and cacheable while a cross-encoder is accurate but expensive. Name the metrics: context precision (share of retrieved that is relevant), context recall (share of relevant that is retrieved), faithfulness, plus recall@k, MRR, NDCG.
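The retrieval metrics reduce to a few lines once you have labeled relevance. This sketch follows the set-based definitions given above (some eval libraries use rank-weighted variants of context precision); the document ids and function names are illustrative.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if d in relevant) / len(top)

def recall_at_k(retrieved, relevant, k):
    """Share of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mean_reciprocal_rank(runs):
    """Average of 1 / rank of the first relevant hit per query (0 if none is found)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, d in enumerate(retrieved, start=1):
            if d in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

retrieved = ["a", "x", "b", "y"]
relevant = {"a", "b", "c"}

p = precision_at_k(retrieved, relevant, k=4)   # 2 of 4 retrieved are relevant
r = recall_at_k(retrieved, relevant, k=4)      # 2 of 3 relevant were retrieved
mrr = mean_reciprocal_rank([(["a", "x"], {"a"}), (["x", "b"], {"b"})])
```

Here precision is 0.5, recall is 2/3, and MRR is 0.75 (first hit at rank 1 for one query and rank 2 for the other).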

chunk → embed → index → retrieve → rerank → generate · hybrid search · RRF · cross-encoder · GraphRAG · lost in the middle

Go deep: The context hierarchy · Chip Huyen, AI Engineering (Ch. 6)

4. Agents

Lilian Weng's decomposition is the canonical frame: an agent is an LLM as brain wired to Planning + Memory + Tool use. Planning spans CoT, Tree-of-Thoughts, ReAct (interleave reasoning and acting), and Reflexion (learn from failed attempts). Memory is short-term/working (in-context) plus long-term (external vector store with fast retrieval). The runtime is the agent loop: observe → think → act → observe.
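The agent loop itself is a short control structure. Below is a minimal sketch with the model replaced by a scripted `decide` function so that it runs offline; the action dictionary shape, the tool registry, and the step budget are all assumptions made for illustration.

```python
def run_agent(task, decide, tools, max_steps=5):
    """observe -> think -> act -> observe, with a hard step budget as the safety stop."""
    observations = [task]                                   # initial observation
    for _ in range(max_steps):
        action = decide(observations)                       # think: pick the next action
        if action["type"] == "finish":
            return action["answer"]
        result = tools[action["tool"]](**action["args"])    # act: call the chosen tool
        observations.append(f"{action['tool']} -> {result}")  # observe the result
    raise RuntimeError("step budget exhausted")

def scripted_decide(observations):
    """Stand-in for the LLM: call a tool once, then finish with what it observed."""
    if len(observations) == 1:
        return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "finish", "answer": observations[-1]}

tools = {"add": lambda a, b: a + b}  # hypothetical tool registry
answer = run_agent("What is 2 + 3?", scripted_decide, tools)
# answer == "add -> 5"
```

The step budget matters as much as the loop: without it, a model that never emits a finish action runs, and bills, indefinitely.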

Anthropic draws the load-bearing line between workflows (LLMs orchestrated through predefined code paths) and agents (the LLM directs its own process). The five workflow patterns — chaining, routing, parallelization, orchestrator-worker, evaluator-optimizer — cover most real systems.

Tradeoffs probed: prefer deterministic workflows for reliability and go agentic only when subtasks are genuinely unpredictable; single-agent vs multi-agent trades simplicity for parallelism but adds coordination cost — a multi-agent system can burn ~15× the tokens of a chat (Anthropic). "Why not just a workflow here?" is a question you must always be ready to answer.

LLM Powered Autonomous Agents
The reference decomposition of an agent into planning, memory, and tool use with the LLM as the controlling brain — the vocabulary nearly every interviewer expects.

Planning + Memory + Tool use · ReAct · Reflexion · observe → think → act · workflow vs agent · orchestrator-worker

Go deep: The agent loop · Sub-agents · Anthropic — building effective agents

5. Evaluation

Eval-driven development (Hamel Husain) inverts the usual instinct: do not start by writing evals; start with error analysis on real traces. Then write assertion/code-based evals for deterministic failures (cheap, fast, unambiguous) and reach for LLM-as-judge only for subjective cases. Make judges binary pass/fail, never a vague 1-5, and validate the judge against human labels before you trust it. Offline evals (golden sets, regression suites) are fast and cheap; online evals (A/B tests, production traces) are real but slow.
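Both halves of that advice fit in a short sketch: a code-based assertion for a deterministic failure, and an agreement check for a binary judge against human labels. The function names, required keys, and label lists are hypothetical.

```python
import json

def passes_schema_assertion(output, required_keys):
    """Code-based eval: the output must be a JSON object containing the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(required_keys).issubset(data)

def judge_agreement(judge_labels, human_labels):
    """Compare binary judge verdicts with human pass/fail labels on the same examples."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    pairs = list(zip(judge_labels, human_labels))
    human_pass = [j for j, h in pairs if h]
    human_fail = [j for j, h in pairs if not h]
    return {
        "agreement": sum(j == h for j, h in pairs) / len(pairs),
        # How often the judge passes what humans passed, and fails what humans failed.
        "true_positive_rate": sum(human_pass) / len(human_pass) if human_pass else None,
        "true_negative_rate": sum(not j for j in human_fail) / len(human_fail) if human_fail else None,
    }

ok = passes_schema_assertion('{"intent": "refund", "order_id": 42}', ["intent", "order_id"])
bad = passes_schema_assertion("Sure! Here is the JSON you asked for", ["intent"])

report = judge_agreement(
    judge_labels=[True, True, False, False, True],
    human_labels=[True, False, False, False, True],
)
```

In this toy set the judge agrees with humans on 4 of 5 examples but only catches 2 of the 3 human-labeled failures, which is the kind of gap raw agreement alone would hide.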

Tradeoffs probed: over-optimizing a single metric quietly degrades real performance. Know the RAG triad — faithfulness, answer relevance, context relevance — and the judge failure modes: position bias and length bias, controlled with pairwise comparison and by allowing ties.

error analysis first · assertion vs LLM-judge · binary judges · validate against humans · offline vs online · position / length bias

Go deep: Metrics that matter · Replay and evals · Hamel Husain — Your AI product needs evals

6. Guardrails and safety

Prompt injection — especially the indirect kind, smuggled in via retrieved documents or tool output — is the central agentic threat. Simon Willison's lethal trifecta names the failure precisely: an agent with (1) access to private data, (2) exposure to untrusted content, and (3) the ability to communicate externally can be made to exfiltrate that data by a single poisoned input — no code vulnerability required. There is no reliable prompt-only fix; the defense is to break one leg of the trifecta.
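"Break one leg" can be enforced as a configuration check rather than left to prompt wording. A minimal sketch, assuming you can describe each agent deployment with three boolean capabilities; the class and field names are hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AgentCapabilities:
    reads_private_data: bool
    ingests_untrusted_content: bool
    can_communicate_externally: bool

def has_lethal_trifecta(caps):
    """True only when all three legs are present in the same agent."""
    return (caps.reads_private_data
            and caps.ingests_untrusted_content
            and caps.can_communicate_externally)

# Hypothetical email assistant: reads the inbox, parses inbound mail, can send mail.
email_assistant = AgentCapabilities(True, True, True)

# Break one leg: same agent, but outbound communication is blocked.
contained = replace(email_assistant, can_communicate_externally=False)

risky = has_lethal_trifecta(email_assistant)   # True
safe = has_lethal_trifecta(contained)          # False
```

A check like this belongs in deployment review or CI, where it can block a risky combination before any prompt is ever written.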

Tradeoffs probed: guardrails add latency, cost, and false positives; sandboxing trades developer velocity for containment. Know the layered defenses: jailbreak detection, PII redaction, content moderation, output/schema validation, and sandboxing with least privilege and egress control.

The lethal trifecta for AI agents
Private data + untrusted content + external communication = data exfiltration with no code exploit. The defense is architectural: remove one leg, because the prompt layer cannot be trusted to hold.

prompt injection (indirect) · lethal trifecta · break one leg · PII redaction · schema validation · sandboxing / egress control

Go deep: Prompt injection · Sandboxing · Simon Willison — the lethal trifecta

7. Serving and inference

You should be able to explain why vLLM is fast. PagedAttention stores the KV cache in non-contiguous paged blocks, much as an OS manages virtual memory, which nearly eliminates memory fragmentation and is credited with roughly 2-4× throughput. Continuous batching slots new requests in as earlier generations finish rather than waiting for a whole batch. The KV cache itself exists to avoid recomputing attention over prior tokens. Quantization (INT8/INT4, FP8, GPTQ, AWQ) shrinks memory at a small accuracy cost. Alternatives: TGI (Hugging Face, simple ops) and TensorRT-LLM (NVIDIA, compiled, max throughput).
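Two pieces of arithmetic explain most of the memory story. The sketch below sizes the KV cache for a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and compares contiguous preallocation with block-granular allocation; the 16-token block size is an assumption for illustration.

```python
import math

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """One key and one value vector per KV head per layer, hence the leading factor of 2."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def paged_slots(seq_len, block_size=16):
    """Token slots reserved when memory is handed out in fixed-size blocks on demand."""
    return math.ceil(seq_len / block_size) * block_size

# Hypothetical model shape (assumption, not a specific released model).
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)  # 131072 bytes = 128 KiB
full_window = per_token * 8192                                           # 1 GiB for an 8,192-token sequence

# A 1,000-token request served with an 8,192-token maximum length.
contiguous_reserved = 8192                   # preallocate the maximum up front
paged_reserved = paged_slots(1000)           # 63 blocks of 16 = 1008 slots
wasted_contiguous = contiguous_reserved - 1000
wasted_paged = paged_reserved - 1000
```

Under these assumptions, contiguous preallocation strands 7,192 token slots for this request, while block-granular allocation strands 8. That freed memory is what lets more sequences share a batch.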

Tradeoffs probed: prefill is compute-bound and parallel while decode is memory-bound and sequential, and attention is quadratic in sequence length — so long prompts hurt. Throughput trades against latency (batching lifts throughput but can hurt TTFT); quantization saves memory for a little accuracy; self-hosting is justified only by privacy or scale-economics, otherwise call an API. A model gateway/router picks the right model per request.

PagedAttention · continuous batching · KV cache · quantization (INT8/INT4, FP8, GPTQ, AWQ) · prefill vs decode · quadratic attention

Go deep: Model gateway · vLLM — PagedAttention and continuous batching

8. Observability and cost

You cannot improve what you cannot trace. The OpenTelemetry GenAI semantic conventions standardize agent telemetry through gen_ai.* span attributes — request.model, usage.input_tokens/output_tokens, and dedicated tool-call spans — so traces are portable across vendors.

Tradeoffs probed: track cost per successful outcome, not raw token cost; a cheap call that fails and retries five times is expensive. Trace the full multi-step run, not isolated calls, and review production traces on a regular cadence (that review doubles as the input to error analysis from cluster 5).
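Cost per successful outcome is a one-line formula once each run records its total spend and whether it succeeded. The run records and dollar figures below are invented for illustration.

```python
def cost_per_successful_outcome(runs):
    """Total spend across all runs (retries and failures included) divided by successes."""
    total = sum(r["cost_usd"] for r in runs)
    successes = sum(1 for r in runs if r["succeeded"])
    if successes == 0:
        return float("inf")  # money spent, nothing delivered
    return total / successes

# Hypothetical cheap model: $0.01 per attempt, six attempts per task, one task still fails.
cheap_model_runs = [
    {"cost_usd": 0.06, "succeeded": True},
    {"cost_usd": 0.06, "succeeded": False},
]

# Hypothetical stronger model: $0.04 per attempt, succeeds first time.
strong_model_runs = [
    {"cost_usd": 0.04, "succeeded": True},
    {"cost_usd": 0.04, "succeeded": True},
]

cheap = cost_per_successful_outcome(cheap_model_runs)
strong = cost_per_successful_outcome(strong_model_runs)
```

In this example the model that is four times cheaper per call ends up three times more expensive per success ($0.12 vs $0.04).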

What to instrument first, and the reason an interviewer wants to hear:

  • Cost per successful outcome (cost): total spend divided by tasks that actually succeeded. Raw token cost hides retries and failed runs; this is the number that maps to unit economics.
  • Full-run trace (runtime): one trace spanning every model call, tool call, and retry in a task. Single-call metrics cannot explain where a multi-step agent went wrong.
  • gen_ai.* attributes (platform): OTel-standard span fields for model, tokens, and tool calls. Portable telemetry that any backend can ingest without custom glue.

OpenTelemetry GenAI · gen_ai.* spans · cost per successful outcome · full-run tracing

Go deep: Tracing · Cost accounting · applied-llms.org

9. Fine-tuning vs RAG vs prompting

There is a decision ladder, and the order is the whole answer: prompt → RAG → fine-tune, in that sequence, escalating only when the cheaper rung plateaus. LoRA/QLoRA/PEFT train small low-rank adapters — cheap and swappable — instead of full weights; distillation compresses a teacher model into a smaller student. The heuristic: need fresh or proprietary facts → RAG; need consistent behavior/format or a cheaper specialized model → fine-tune (LoRA first); just need more capability → write a better prompt.
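The "cheap and swappable" claim for LoRA comes down to parameter counts. A minimal sketch, assuming a single 4096 × 4096 weight matrix and rank 8; the dimensions are illustrative, not taken from a specific model.

```python
def full_finetune_params(d_in, d_out):
    """Full fine-tuning updates every entry of the weight matrix."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """LoRA trains two low-rank factors, A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

full = full_finetune_params(4096, 4096)     # 16,777,216 trainable values
adapter = lora_params(4096, 4096, rank=8)   # 65,536 trainable values
fraction = adapter / full                   # 1/256, about 0.39%
```

Because the adapter is a separate, small set of weights, several of them can be trained for different tasks and swapped over the same frozen base model.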

Tradeoffs probed: fine-tune only after prompting plateaus around 90% and you expect to iterate repeatedly; "no GPUs before PMF"; and remember the recurring tax — every base-model upgrade can force you to redo the fine-tune.

The decision an interviewer almost always asks: RAG, fine-tune, or just a better prompt?

Reach for RAG when…

  • You need fresh or proprietary facts: knowledge changes faster than you can retrain; ground it at query time.
  • Answers must be attributable: retrieved chunks give citations and an audit trail.
  • The corpus is large or volatile: re-index cheaply instead of re-training weights.

Reach for fine-tuning when…

  • You need consistent behavior or format: bake the style or schema into the weights, not every prompt.
  • You want a cheaper specialized model: a small fine-tuned model can beat a big prompted one on a narrow task.
  • Prompting has plateaued at ~90%: and you will iterate enough to amortize the cost. Start with LoRA.

prompt → RAG → fine-tune · LoRA/QLoRA/PEFT · distillation · facts vs behavior vs capability · no GPUs before PMF

Go deep: Evaluation and quality · applied-llms.org

Red flags

Red flags interviewers listen for

The fastest way to lose a senior signal is to over-engineer. Strong candidates reach for the simplest mechanism that works, escalate only on evidence, and can always name the tradeoff they are accepting.

  • Reaching for fine-tuning to add knowledge — that is RAG's job; fine-tuning shapes behavior.
  • Claiming a long context window removes the need for RAG — it raises cost and loses the middle.
  • Treating prompt injection as a prompt-engineering problem — the fix is architectural: break the lethal trifecta.
  • Defaulting to a multi-agent system when a deterministic workflow would be more reliable and far cheaper.
  • Quoting raw token cost as the efficiency metric instead of cost per successful outcome.
  • Using a vague 1-5 LLM judge that was never validated against human labels.

What to study next: pick the two clusters where you hesitated, click their Go deep links, and come back able to defend the tradeoff out loud. Then move on to the patterns chapter, where these foundations get assembled into systems.