Design an LLM Eval & Monitoring System
The prompt. "Design the system that tells us whether our LLM product is actually good — before we ship a change, and while it runs in production."
This round increasingly carries the most weight at frontier labs. Anthropic's Applied AI team reportedly centers its system-design interview on eval harnesses rather than RAG architecture. The reasoning: anyone can wire up a vector store, but knowing whether the thing works, and proving it while the model, the prompt, and the data all shift underneath you, is the skill that separates senior AI engineers from people who have only read about LLMs. The ADEPT framework still applies; the twist is that the system you're designing is the eval and observability platform itself.
Phase A — Align
Before any pipeline, answer the only question that matters: what does "good" mean for this product, and who decides? An eval system encodes a definition of quality — if that definition is vague or owned by no one, everything downstream is theater.
Phase D — Design the data layer
The eval system is only as good as its golden dataset, and the single biggest mistake is inventing inputs. You build the set from real production traces via error analysis — read what actually broke, cluster the failures, and curate from reality. Then you version it like code and grow it forever.
- Build from failures (curate, don’t invent): Mine real production traces, do error analysis, cluster what broke. Start from observed failures, not inputs you imagined a user might type.
- Version it (the set is code): Each example is an input / expected-output / context triple, versioned in source control. A change to the set is a reviewed diff, not a silent edit.
- Grow it (ratchet, never shrink): Every escaped production failure is promoted into the set as a new case. The set only grows, so a fixed bug can never silently regress.
See replay and evals for turning captured traces into reproducible cases.
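A minimal sketch of the golden set as versioned, append-only data. The `GoldenExample` class, the `promote_failure` and `to_jsonl` helpers, and the shape of the `trace` dict are all hypothetical names chosen for illustration; the only parts taken from the text are the input / expected-output / context triple and the rule that the set only grows.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class GoldenExample:
    """One golden-set case: the input / expected-output / context triple."""
    case_id: str
    input: str
    expected_output: str
    context: str


def promote_failure(golden_set, trace, expected_output):
    """Return a new golden set with an escaped production failure appended.

    `trace` is a hypothetical dict with "trace_id", "input" and "context" keys.
    Existing cases are never removed or rewritten, so the set only grows.
    """
    if any(ex.case_id == trace["trace_id"] for ex in golden_set):
        return list(golden_set)  # already promoted; nothing to add
    new_case = GoldenExample(
        case_id=trace["trace_id"],
        input=trace["input"],
        expected_output=expected_output,
        context=trace["context"],
    )
    return list(golden_set) + [new_case]


def to_jsonl(golden_set):
    """Serialize one case per line so a change shows up as a reviewable diff."""
    return "\n".join(json.dumps(asdict(ex), sort_keys=True) for ex in golden_set)
```

Writing one JSON object per line is a design choice for this sketch: adding a case then appears as a single added line in source control, which keeps the review small.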
Phase E — Engineer the eval pipeline
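The pipeline runs in three layers, cheapest and most deterministic first. This follows Hamel Husain's framing, which orders evaluation as unit-test-style assertions, then human and model-graded evaluation, then A/B testing. The art is in the middle layer: an LLM-as-judge is only trustworthy if you have validated the judge against human labels.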
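The cheapest-first ordering can be sketched as a short-circuiting function. Everything here is an assumption made for illustration: the specific checks in `deterministic_checks`, the 2000-character budget, the `judge` callable (which would wrap an LLM call in practice), and modelling the most expensive layer as a human-review queue.

```python
def deterministic_checks(output):
    """Layer 1: cheap, deterministic assertions. Returns a list of failure reasons."""
    failures = []
    if not output.strip():
        failures.append("empty output")
    if len(output) > 2000:  # hypothetical length budget
        failures.append("output too long")
    return failures


def evaluate(output, judge, needs_human_review):
    """Run layers in cost order and stop at the first layer that fails.

    `judge` is any callable returning True (pass) or False (fail).
    `needs_human_review` flags cases routed to the most expensive layer.
    """
    failures = deterministic_checks(output)
    if failures:
        # The judge is never called, so a cheap failure costs no tokens.
        return {"verdict": "fail", "layer": "deterministic", "reasons": failures}
    if not judge(output):
        return {"verdict": "fail", "layer": "judge", "reasons": ["judge rejected"]}
    if needs_human_review:
        return {"verdict": "pending", "layer": "human", "reasons": []}
    return {"verdict": "pass", "layer": "judge", "reasons": []}
```

The point of the ordering is that an output failing a deterministic assertion never reaches the judge, so most failures are caught at near-zero cost.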
An LLM judge is itself a model that can be wrong. If you have not measured how often it agrees with a human, you are not evaluating your product — you are trusting one un-evaluated model to grade another.
- Binary pass/fail beats a 1-5 score: a number nobody can define consistently is noise dressed as signal.
- Measure judge-vs-human agreement on a labeled sample before trusting the judge — an unvalidated judge is a confident liar.
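- Counter position bias by running each pairwise comparison in both orders and offering an explicit "tie" option; check for length bias separately, since judges tend to favor longer answers.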
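As a rough illustration of the checks in the list above, the sketch below measures judge-vs-human agreement on binary pass/fail labels and adds an order-swapped pairwise comparison. The function names are hypothetical. Cohen's kappa is brought in here as one common chance-corrected agreement statistic, not something the text prescribes, and the order swap addresses position bias only.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of cases where the judge's pass/fail label matches the human's."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def cohens_kappa(judge_labels, human_labels):
    """Agreement corrected for chance, for binary True/False labels."""
    n = len(human_labels)
    observed = agreement_rate(judge_labels, human_labels)
    judge_pass = sum(judge_labels) / n
    human_pass = sum(human_labels) / n
    expected = judge_pass * human_pass + (1 - judge_pass) * (1 - human_pass)
    if expected == 1.0:
        # Both raters gave the same constant label; kappa is undefined,
        # and this sketch reports it as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def debiased_preference(compare, answer_a, answer_b):
    """Ask the judge twice with the order swapped; any disagreement is a tie.

    `compare(first, second)` returns "first", "second" or "tie".
    """
    forward = compare(answer_a, answer_b)
    backward = compare(answer_b, answer_a)
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    return "tie"
```

A judge that always prefers whichever answer it sees first yields "tie" under `debiased_preference`, so its positional preference cannot decide the outcome.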
The whole pipeline wires into CI as a regression gate: a change that drops the golden-set score doesn't merge. This is the same worker/checker separation as in verification — the thing that produces output never gets to grade itself.
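A regression gate of this kind could look roughly like the following. The names `pass_rate`, `regression_gate`, and `gate_exit_code` are hypothetical, as is the `tolerance` parameter, which is an assumption added to absorb judge noise; set it to zero for a strict gate.

```python
def pass_rate(results):
    """Fraction of golden-set cases that passed; `results` is a list of booleans."""
    if not results:
        raise ValueError("golden set produced no results")
    return sum(results) / len(results)


def regression_gate(candidate_results, baseline_rate, tolerance=0.0):
    """Return True if the candidate's pass rate has not dropped below the baseline."""
    return pass_rate(candidate_results) >= baseline_rate - tolerance


def gate_exit_code(candidate_results, baseline_rate, tolerance=0.0):
    """Exit code for a CI step: 0 lets the merge proceed, 1 blocks it."""
    return 0 if regression_gate(candidate_results, baseline_rate, tolerance) else 1
```

In a CI job the script would end with `sys.exit(gate_exit_code(results, baseline))`, since most CI systems treat a non-zero exit code as a failed step.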
Phase P — Protect & optimize
Two threats. The mundane one is eval cost — evals burn tokens, so sample rather than score everything, use a cheap judge model, and cache results on unchanged inputs. The dangerous one is the meta-failure: optimizing a single metric until it games reality.
The point of an eval is to track reality, not to produce a number that goes up. The moment a metric becomes the target, it stops measuring the thing you cared about.
- A single headline number gets gamed: optimize needle-in-haystack and you overfit to needles, not usefulness.
- Separate retrieval metrics from generation metrics so a strong one can’t mask a weak one.
- Sample, use cheap judge models, and cache — an eval suite too expensive to run is an eval suite nobody runs.
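The cost controls named above, caching on unchanged inputs and sampling, can be sketched as follows. `CachedJudge`, `cache_key`, and `sample_traces` are hypothetical names, and keying the cache on prompt version plus input plus output is an assumption about what "unchanged" should mean for a given product.

```python
import hashlib
import random


def cache_key(prompt_version, case_input, model_output):
    """Stable key: identical prompt version, input, and output reuse the old verdict."""
    payload = "\x1f".join([prompt_version, case_input, model_output])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedJudge:
    """Wraps a judge callable so unchanged cases are never re-scored."""

    def __init__(self, judge):
        self.judge = judge  # callable(case_input, model_output) -> bool
        self.cache = {}
        self.calls = 0      # number of real (paid) judge invocations

    def score(self, prompt_version, case_input, model_output):
        key = cache_key(prompt_version, case_input, model_output)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.judge(case_input, model_output)
        return self.cache[key]


def sample_traces(traces, rate, seed=0):
    """Keep each trace with probability `rate`; seeded so a run is reproducible."""
    rng = random.Random(seed)
    return [t for t in traces if rng.random() < rate]
```

In a real deployment the dictionary would be replaced by persistent storage so cached verdicts survive between CI runs.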
Phase T — Test & evolve
This phase is the production half of the platform — the online timescale. You trace every run, account for every token, and watch the one metric that matters: cost per successful outcome, not raw spend.
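To make the metric concrete, here is a small sketch of cost per successful outcome. The run records, with `cost_usd` and `success` fields, are a hypothetical trace schema, and the figures in the usage note are invented purely to show the arithmetic.

```python
def cost_per_successful_outcome(runs):
    """Total spend divided by the number of successful runs.

    Each run is a dict with "cost_usd" (float) and "success" (bool).
    Failed runs still count toward spend, which is the point of the metric.
    """
    total_cost = sum(r["cost_usd"] for r in runs)
    successes = sum(1 for r in runs if r["success"])
    if successes == 0:
        return float("inf")  # money spent, nothing achieved
    return total_cost / successes
```

With made-up numbers: ten runs at $0.01 each with five successes cost about $0.02 per success, while ten runs at $0.015 each with nine successes cost about $0.0167 per success. The second configuration has the higher raw spend but is the cheaper one on the metric that matters.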
Then operate the hill-climb loop: query production, form one hypothesis, change exactly one thing, re-run the golden set, and keep the change only if the score rises. That discipline — plus tracing, cost accounting, and metrics that matter — is what turns an eval suite into a system that compounds.
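The loop can be sketched as below, under the assumption that a configuration is a flat dict and each candidate change is a single `(key, value)` pair. `hill_climb` and `run_golden_set` are hypothetical names; `run_golden_set` stands in for whatever re-runs the golden set and returns a score where higher is better.

```python
def hill_climb(baseline_config, candidate_changes, run_golden_set):
    """Try one change at a time; keep it only if the golden-set score rises.

    `candidate_changes` is a list of (key, value) pairs, each a single change.
    `run_golden_set(config)` returns a score where higher is better.
    """
    best_config = dict(baseline_config)
    best_score = run_golden_set(best_config)
    history = []
    for key, value in candidate_changes:
        trial = dict(best_config)
        trial[key] = value          # exactly one thing changes per trial
        score = run_golden_set(trial)
        kept = score > best_score   # strict: a tie is not an improvement
        if kept:
            best_config, best_score = trial, score
        history.append((key, value, score, kept))
    return best_config, best_score, history
```

Returning the full `history` keeps the rejected hypotheses on record, so the same failed change is not retried later without a reason.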
Next: revisit the Foundations each ADEPT phase draws on, or move to the coding rounds where you implement the harness these systems run on.