Design an LLM Eval & Monitoring System

The prompt. "Design the system that tells us whether our LLM product is actually good — before we ship a change, and while it runs in production."

This round increasingly carries the most weight at frontier labs. Anthropic's Applied AI team reportedly centers its system-design interview on eval harnesses rather than RAG architecture. The reasoning: anyone can wire a vector store, but knowing whether the thing works, and proving it as the model, the prompt, and the data all shift underneath you, is the skill that separates senior AI engineers from people who have only read about LLMs. The ADEPT framework still applies; the twist is that the system you're designing is the eval and observability platform itself.

Offline: pre-merge regression (catch it before ship).

Online: production quality (catch it after ship).

Loop: promote escaped failures back into the set.

Two timescales, one loop. Offline evals gate changes; online evals watch production; every production miss becomes an offline test. Designing only one half is the most common way to fail this round.

Phase A — Align

Before any pipeline, answer the only question that matters: what does "good" mean for this product, and who decides? An eval system encodes a definition of quality — if that definition is vague or owned by no one, everything downstream is theater.

Align on the platform: functional vs non-functional

Definition of good: a measurable quality bar, owned by a named person, not "the answers seem fine."

Offline eval: pre-merge regression. Does this prompt/model/code change make quality go down?

Online eval: production quality tracking. Is the live system still good, right now, on real traffic?

The feedback loop: the wire between them. Every production failure flows back to strengthen offline.

Eval cost & latency: evals burn tokens too; a judge that costs more than the feature is its own failure.

Coverage: which behaviors are actually exercised. An eval set that misses the failure mode is blind.

Trust: an uncalibrated judge is worse than no judge. It gives false confidence and you ship on it.

The platform’s job is to make quality a number a team can defend — and to keep that number honest as the model and data drift.

Phase D — Design the data layer

The eval system is only as good as its golden dataset, and the single biggest mistake is inventing inputs. You build the set from real production traces via error analysis — read what actually broke, cluster the failures, and curate from reality. Then you version it like code and grow it forever.

Build from failures: curate, don’t invent

Mine real production traces, do error analysis, cluster what broke. Start from observed failures — not inputs you imagined a user might type.

Version it: the set is code

Each example is an input / expected-output / context triple, versioned in source control. A change to the set is a reviewed diff, not a silent edit.

Grow it: ratchet, never shrink

Every escaped production failure is promoted into the set as a new case. The set only grows, so a fixed bug can never silently regress.

Hamel Husain’s discipline: the golden set is a living artifact grown from real failures, not a static fixture written once. See replay and evals for how captured traces become reproducible test cases.
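The triple-plus-ratchet idea can be sketched in a few lines. This is a minimal illustration, not a prescribed format: the JSONL layout, the `GoldenExample` field names, and the `promote_failure` helper are all assumptions made for the example. The file is meant to live in source control so that every promotion shows up as a reviewed diff.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class GoldenExample:
    """One golden-set case: the input / expected-output / context triple."""
    case_id: str
    input_text: str
    expected_output: str
    context: str
    source_trace_id: str = ""  # hypothetical link back to the production trace


def load_golden_set(path: Path) -> list[GoldenExample]:
    """Read one JSON object per line; a missing file is treated as an empty set."""
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [GoldenExample(**json.loads(line)) for line in lines if line.strip()]


def promote_failure(path: Path, example: GoldenExample) -> bool:
    """Append an escaped production failure to the set.

    Returns False if the case id is already present. There is deliberately
    no delete function: the set is append-only, which is the ratchet.
    """
    existing_ids = {ex.case_id for ex in load_golden_set(path)}
    if example.case_id in existing_ids:
        return False
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(example), sort_keys=True) + "\n")
    return True
```

One line per case keeps diffs readable in review: a promoted failure is exactly one added line.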

Phase E — Engineer the eval pipeline

The pipeline runs in three layers, cheapest and most deterministic first — Hamel Husain's framing. The art is in the middle layer: an LLM-as-judge is only trustworthy if you have validated the judge against human labels.

Layer 1: assertion / code-based evals (cheap, deterministic, runs first)

Deterministic checks for the failures you can express in code: schema valid, no PII leaked, required field present, latency budget met. Free, fast, zero ambiguity. Run these before you spend a token on a judge.

Output: pass / fail

Layer 2: LLM-as-judge (subjective quality)

For quality you can’t assert in code. Make the judge binary pass/fail, not a 1-5 score; hand it the rubric; and validate it against human labels by measuring judge-vs-human agreement before you trust it. Counter position and length bias with pairwise comparison, run in both orders, plus an explicit tie option.

Output: binary + validated

Layer 3: human review (ground truth)

A sampled slice reviewed by people, both to catch what the judge misses and to keep recalibrating the judge. The smallest layer by volume, the anchor for everything above it.

Output: sampled labels

Three layers, run cheapest first. The recurring failure is skipping layer-2 validation — shipping a judge nobody checked against humans.
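As a rough sketch of the first layer, the checks named above (valid schema, required fields, no PII, latency budget) could look like the following. The response schema, the field names, the latency budget, and the two regexes are assumptions chosen for illustration; real PII detection needs far more than two patterns.

```python
import json
import re

# Deliberately simple, illustrative patterns (assumption: not production-grade PII detection).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
US_SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

REQUIRED_FIELDS = ("answer", "sources")  # hypothetical response schema
LATENCY_BUDGET_MS = 2000                 # hypothetical budget


def run_assertions(raw_output: str, latency_ms: float) -> dict[str, bool]:
    """Return a pass/fail verdict for each deterministic check."""
    results = {"valid_json": False, "required_fields": False}
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        parsed = None
    if isinstance(parsed, dict):
        results["valid_json"] = True
        results["required_fields"] = all(k in parsed for k in REQUIRED_FIELDS)
    has_pii = bool(EMAIL_RE.search(raw_output) or US_SSN_RE.search(raw_output))
    results["no_pii"] = not has_pii
    results["within_latency"] = latency_ms <= LATENCY_BUDGET_MS
    return results


def passes_layer_one(raw_output: str, latency_ms: float) -> bool:
    """Only outputs that pass every assertion go on to the (paid) judge."""
    return all(run_assertions(raw_output, latency_ms).values())
```

Returning a verdict per check, rather than a single boolean, keeps failures attributable: a case that fails `no_pii` is a different bug from one that fails `within_latency`.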

Who validates the validators

An LLM judge is itself a model that can be wrong. If you have not measured how often it agrees with a human, you are not evaluating your product — you are trusting one un-evaluated model to grade another.

Binary pass/fail beats a 1-5 score: a number nobody can define consistently is noise dressed as signal.
Measure judge-vs-human agreement on a labeled sample before trusting the judge — an unvalidated judge is a confident liar.
Counter position and length bias with pairwise comparison, run in both orders, and an explicit "tie" option.
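Measuring judge-vs-human agreement can be as simple as the sketch below, which compares binary judge verdicts with human labels on the same examples. The function name and the choice of reported statistics are assumptions. Raw agreement alone can look high when most examples pass, so the sketch also reports Cohen's kappa and the per-class rates.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare binary judge verdicts with human labels (True = pass)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)              # both say pass
    tn = sum((not h) and (not j) for h, j in pairs)  # both say fail
    fp = sum((not h) and j for h, j in pairs)        # judge passes a human-failed case
    fn = sum(h and (not j) for h, j in pairs)        # judge fails a human-passed case

    observed = (tp + tn) / n
    # Agreement expected by chance, from each rater's pass rate.
    p_human, p_judge = (tp + fn) / n, (tp + fp) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    # Kappa is undefined when chance agreement is already 1 (one label only).
    kappa = (observed - expected) / (1 - expected) if expected < 1 else float("nan")
    return {
        "agreement": observed,
        "cohen_kappa": kappa,
        "true_positive_rate": tp / (tp + fn) if tp + fn else float("nan"),
        "true_negative_rate": tn / (tn + fp) if tn + fp else float("nan"),
    }
```

The true negative rate is the one to watch: a judge that passes cases humans failed is exactly the false confidence described above.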

The whole pipeline wires into CI as a regression gate: a change that drops the golden-set score doesn't merge. This is the same worker/checker separation as in verification — the thing that produces output never gets to grade itself.
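A regression gate of this kind might be sketched as follows, assuming each run produces a pass/fail verdict per golden-set case id. The function names, the `max_drop` tolerance, and the rule that a missing case blocks the merge are illustrative choices, not a fixed recipe.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of golden-set cases that passed; an empty run scores 0."""
    return sum(results.values()) / len(results) if results else 0.0


def regression_gate(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
    max_drop: float = 0.0,
) -> tuple[bool, list[str]]:
    """Decide whether a change may merge.

    Returns (ok, newly_failing_case_ids). The gate blocks when the golden-set
    pass rate drops by more than max_drop, or when the candidate run skipped
    a case that the baseline ran.
    """
    missing = set(baseline) - set(candidate)
    newly_failing = sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )
    drop = pass_rate(baseline) - pass_rate(candidate)
    ok = not missing and drop <= max_drop
    return ok, newly_failing
```

In CI, the boolean would map to the process exit code (nonzero blocks the merge), and the list of newly failing cases goes into the build log so the author sees which behaviors regressed rather than only a lower score.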

Your AI Product Needs Evals

Hamel Husain · 2024 · Blog

The three-layer model this pipeline follows — assertions, then LLM-as-judge, then human review — built from real production failures via error analysis. The core discipline: evals are software you grow from observed failures, not a benchmark you run once.

Creating a LLM-as-a-Judge That Drives Business Results

Hamel Husain · 2024 · Blog

Make judges binary, give them a rubric, and validate them against human labels before trusting them. An unvalidated judge gives false confidence — and a team that ships on false confidence is worse off than one with no eval at all.

Phase P — Protect & optimize

Two threats. The mundane one is eval cost: evals burn tokens, so sample rather than score everything, use a cheap judge model, and cache results on unchanged inputs. The dangerous one is the meta-failure: optimizing a single metric until the number keeps rising while real quality does not.

Don’t let a metric eat the system

The point of an eval is to track reality, not to produce a number that goes up. The moment a metric becomes the target, it stops measuring the thing you cared about.

A single headline number gets gamed: optimize needle-in-haystack and you overfit to needles, not usefulness.
Separate retrieval metrics from generation metrics so a strong one can’t mask a weak one.
Sample, use cheap judge models, and cache — an eval suite too expensive to run is an eval suite nobody runs.
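The two cost levers, caching and sampling, can be sketched as below. Everything here is an assumption for illustration: `judge_fn` stands in for whatever call reaches the judge model, the cache is in-memory only, and caching presumes the verdict for an unchanged input is stable enough to reuse.

```python
import hashlib
import json
import random
from typing import Callable


def cache_key(judge_model: str, rubric_version: str, case_input: str, output: str) -> str:
    """Stable key: changing the judge, rubric, input, or output invalidates the entry."""
    payload = json.dumps([judge_model, rubric_version, case_input, output])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedJudge:
    """Wraps any judge callable and skips calls whose inputs have not changed."""

    def __init__(self, judge_fn: Callable[[str, str], bool],
                 judge_model: str, rubric_version: str) -> None:
        self._judge_fn = judge_fn
        self._judge_model = judge_model
        self._rubric_version = rubric_version
        self._cache: dict[str, bool] = {}
        self.judge_calls = 0  # counts real (paid) judge invocations

    def score(self, case_input: str, output: str) -> bool:
        key = cache_key(self._judge_model, self._rubric_version, case_input, output)
        if key not in self._cache:
            self.judge_calls += 1
            self._cache[key] = self._judge_fn(case_input, output)
        return self._cache[key]


def sample_traces(trace_ids: list[str], rate: float, seed: int = 0) -> list[str]:
    """Keep each trace with probability `rate`; seeded so a run is reproducible."""
    rng = random.Random(seed)
    return [t for t in trace_ids if rng.random() < rate]
```

Including the rubric version in the key matters: a rubric edit changes what "pass" means, so old verdicts should not be served from cache.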

Phase T — Test & evolve

This phase is the production half of the platform — the online timescale. You trace every run, account for every token, and watch the one metric that matters: cost per successful outcome, not raw spend.

Observe production

Tracing (platform)

OpenTelemetry GenAI conventions — gen_ai.* spans capturing prompts, tool calls, token usage.

Without a trace you can’t do error analysis, and without error analysis the golden set stops growing.

Token accounting (cost)

Input and output tokens attributed per run, per feature, per tenant.

The raw input to cost — and to catching a prompt change that quietly tripled spend.

Cost per successful outcome (cost)

Dollars divided by tasks that actually succeeded — not dollars per call.

The north star: cutting raw cost while success drops is just a cheaper bad product.

Online quality score (runtime)

Sampled live traffic scored continuously by the same judge pipeline.

Catches drift the offline set never saw — the production half of the loop.

Regression alerting (process)

Alerts on quality, cost, and latency moving the wrong way.

Quality degrades silently; infra dashboards stay green while answers rot.

Online evals sample live traffic and score it continuously; alerts fire on quality, cost, or latency regressions; and every caught failure is promoted back into the golden set. See tracing, cost accounting, and metrics that matter.
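The cost-per-successful-outcome metric from the list above is plain arithmetic, sketched here under stated assumptions: the `RunRecord` fields are hypothetical, and the per-token prices are made-up placeholders rather than any provider's rates.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens (placeholders, not real rates).
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00


@dataclass
class RunRecord:
    feature: str
    input_tokens: int
    output_tokens: int
    succeeded: bool  # verdict from the online eval pipeline


def run_cost(run: RunRecord) -> float:
    """Dollar cost of one run from its token counts."""
    return (run.input_tokens * PRICE_PER_M_INPUT
            + run.output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000


def cost_per_success(runs: list[RunRecord]) -> float:
    """Total spend divided by runs that succeeded, not by calls made."""
    total = sum(run_cost(r) for r in runs)
    successes = sum(r.succeeded for r in runs)
    return total / successes if successes else float("inf")
```

With these placeholder prices, four runs at 1,000 input and 200 output tokens cost $0.006 each; if three succeed, cost per success is $0.008. A "cheaper" prompt at 500 input and 100 output tokens costs $0.003 per run, but if only one of four succeeds, cost per success rises to $0.012 even though raw spend halved.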

Then operate the hill-climb loop: query production, form one hypothesis, change exactly one thing, re-run the golden set, and keep the change only if the score rises. That discipline — plus tracing, cost accounting, and metrics that matter — is what turns an eval suite into a system that compounds.
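The hill-climb loop can be sketched as a function that tries one named change at a time and keeps it only if the golden-set score rises. The config shape, the change functions, and `score_fn` (standing in for a full golden-set run) are assumptions. A real loop would also need to account for run-to-run noise before trusting a small improvement.

```python
from typing import Callable

Config = dict[str, object]
Change = tuple[str, Callable[[Config], Config]]


def hill_climb(
    config: Config,
    changes: list[Change],
    score_fn: Callable[[Config], float],
) -> tuple[Config, float, list[str]]:
    """Apply changes one at a time; keep each only if the score strictly rises.

    Returns (best_config, best_score, names_of_kept_changes).
    """
    best, best_score = dict(config), score_fn(config)
    kept: list[str] = []
    for name, apply_change in changes:
        # Copy first so a rejected change cannot mutate the current best.
        trial = apply_change(dict(best))
        trial_score = score_fn(trial)
        if trial_score > best_score:
            best, best_score = trial, trial_score
            kept.append(name)
    return best, best_score, kept
```

Because each trial differs from the current best by exactly one change, a score movement can be attributed to that change alone, which is the point of changing one thing at a time.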

Common mistakes

✓ What earns the offer

A defined, owned, measurable bar for "good" — built from real failures

Binary, rubric-driven judges validated against human labels

Both timescales — offline regression gate AND online quality tracking

Cost per successful outcome as the north-star metric

A loop: every escaped production failure becomes an offline test

✕ What flags you

"It looked good to me" passing for an eval

1-5 judge scores nobody can apply consistently

A judge never checked against humans — false confidence at scale

Tracking only latency and errors while quality silently degrades

Optimizing raw cost — a cheaper, worse product — instead of cost-per-success

Offline OR online, never both, with no loop between them

The throughline: an eval system that isn’t validated, isn’t closed-loop, or optimizes the wrong number is one that tells you you’re fine right up until you ship the regression.

Next: revisit the Foundations each ADEPT phase draws on, or move to the coding rounds where you implement the harness these systems run on.
