Design an LLM Eval & Monitoring System

The prompt. "Design the system that tells us whether our LLM product is actually good — before we ship a change, and while it runs in production."

This is the round that increasingly carries the most weight at frontier labs. Anthropic's Applied AI team reportedly centers its system-design interview on eval harnesses rather than RAG architecture, because anyone can wire a vector store, but knowing whether the thing works — and proving it as the model, the prompt, and the data all shift underneath you — is the skill that separates senior AI engineers from people who have read about LLMs. The ADEPT framework still applies; the twist is that the system you're designing is the eval and observability platform itself.

  • Offline: pre-merge regression. Catch it before ship.
  • Online: production quality. Catch it after ship.
  • Loop: promote escaped failures back into the set.

Two timescales, one loop. Offline evals gate changes; online evals watch production; every production miss becomes an offline test. Designing only one half is the most common way to fail this round.

Phase A — Align

Before any pipeline, answer the only question that matters: what does "good" mean for this product, and who decides? An eval system encodes a definition of quality — if that definition is vague or owned by no one, everything downstream is theater.

Align on the platform (functional vs non-functional)

  • Definition of good: A measurable quality bar, owned by a named person, not "the answers seem fine."
  • Offline eval: Pre-merge regression: does this prompt/model/code change make quality go down?
  • Online eval: Production quality tracking: is the live system still good, right now, on real traffic?
  • The feedback loop: The wire between them: every production failure flows back to strengthen offline.
  • Eval cost & latency: Evals burn tokens too; a judge that costs more than the feature is its own failure.
  • Coverage: Which behaviors are actually exercised; an eval set that misses the failure mode is blind.
  • Trust: An uncalibrated judge is worse than no judge: it gives false confidence and you ship on it.

The platform’s job is to make quality a number a team can defend, and to keep that number honest as the model and data drift.

Phase D — Design the data layer

The eval system is only as good as its golden dataset, and the single biggest mistake is inventing inputs. You build the set from real production traces via error analysis — read what actually broke, cluster the failures, and curate from reality. Then you version it like code and grow it forever.

1. Build from failures (curate, don’t invent)

Mine real production traces, do error analysis, cluster what broke. Start from observed failures, not inputs you imagined a user might type.

2. Version it (the set is code)

Each example is an input / expected-output / context triple, versioned in source control. A change to the set is a reviewed diff, not a silent edit.

3. Grow it (ratchet, never shrink)

Every escaped production failure is promoted into the set as a new case. The set only grows, so a fixed bug can never silently regress.

Hamel Husain’s discipline: the golden set is a living artifact grown from real failures, not a static fixture written once. See replay-and-evals for how traces become test cases.

Each example is an input / expected-output / context triple, and the set is a ratchet: it only grows, so a bug you fixed can never quietly come back. See replay and evals for turning captured traces into reproducible cases.
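The triple-plus-ratchet idea above can be made concrete with a small sketch. Everything named here is an assumption for illustration: the `GoldenExample` fields beyond the input / expected-output / context triple (`case_id`, `source_trace_id`), the JSONL layout, and the `promote` helper are hypothetical, not a format the source prescribes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GoldenExample:
    case_id: str          # hypothetical stable ID so diffs are reviewable
    input: str            # what the user sent
    expected_output: str  # reference answer or rubric target
    context: str          # retrieved documents or other grounding
    source_trace_id: str  # hypothetical link back to the production trace


def dump_jsonl(examples: list[GoldenExample]) -> str:
    """Serialize one case per line, sorted, so source-control diffs stay small."""
    ordered = sorted(examples, key=lambda e: e.case_id)
    return "".join(json.dumps(asdict(e), sort_keys=True) + "\n" for e in ordered)


def load_jsonl(text: str) -> list[GoldenExample]:
    """Parse the JSONL text back into examples, skipping blank lines."""
    return [GoldenExample(**json.loads(line)) for line in text.splitlines() if line.strip()]


def promote(golden: list[GoldenExample], new_case: GoldenExample) -> list[GoldenExample]:
    """Ratchet: append a new case; never overwrite or drop an existing one."""
    if any(e.case_id == new_case.case_id for e in golden):
        raise ValueError(f"case {new_case.case_id!r} already exists")
    return golden + [new_case]
```

Keeping the file sorted and one case per line is a design choice that makes "a change to the set is a reviewed diff" cheap to enforce: a promoted failure shows up as a single added line.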


Phase E — Engineer the eval pipeline

The pipeline runs in three layers, cheapest and most deterministic first — Hamel Husain's framing. The art is in the middle layer: an LLM-as-judge is only trustworthy if you have validated the judge against human labels.

1. Assertion / code-based evals (cheap · deterministic · runs first) → pass / fail

Deterministic checks for the failures you can express in code: schema valid, no PII leaked, required field present, latency budget met. Free, fast, zero ambiguity. Run these before you spend a token on a judge.

2. LLM-as-judge (subjective quality) → binary + validated

For quality you can’t assert in code. Make the judge binary pass/fail, not a 1-5 score; hand it the rubric; and validate it against human labels: measure judge-vs-human agreement before you trust it. Control position and length bias with pairwise comparison plus ties.

3. Human review (ground truth) → sampled labels

A sampled slice reviewed by people, both to catch what the judge misses and to keep recalibrating the judge. The smallest layer by volume, the anchor for everything above it.

Three layers, run cheapest first. The recurring failure is skipping layer-2 validation: shipping a judge nobody checked against humans.

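A minimal sketch of the first layer, assuming the model returns a JSON object. The required field names, the email-only PII pattern, and the 2000 ms default budget are all hypothetical placeholders; a real product would substitute its own schema and a proper PII detector.

```python
import json
import re

# Hypothetical schema: the product is assumed to return these top-level keys.
REQUIRED_FIELDS = ("answer", "citations")
# Deliberately narrow stand-in for PII detection: email addresses only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def run_assertions(raw_output: str, latency_ms: float, budget_ms: float = 2000.0) -> dict[str, bool]:
    """Run every deterministic check and report each result by name."""
    checks: dict[str, bool] = {}
    try:
        parsed = json.loads(raw_output)
        checks["schema_valid"] = isinstance(parsed, dict)
    except json.JSONDecodeError:
        parsed = None
        checks["schema_valid"] = False
    # Only look for fields when the output parsed into an object.
    checks["required_fields"] = checks["schema_valid"] and all(
        field in parsed for field in REQUIRED_FIELDS
    )
    checks["no_pii"] = EMAIL_RE.search(raw_output) is None
    checks["latency_ok"] = latency_ms <= budget_ms
    return checks


def passes(checks: dict[str, bool]) -> bool:
    """Binary verdict: a case passes only if every assertion passes."""
    return all(checks.values())
```

Returning the per-check dictionary rather than a single boolean keeps failures attributable, which helps when clustering what broke during error analysis.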
Who validates the validators

An LLM judge is itself a model that can be wrong. If you have not measured how often it agrees with a human, you are not evaluating your product — you are trusting one un-evaluated model to grade another.

  • Binary pass/fail beats a 1-5 score: a number nobody can define consistently is noise dressed as signal.
  • Measure judge-vs-human agreement on a labeled sample before trusting the judge — an unvalidated judge is a confident liar.
  • Counter position and length bias with pairwise comparison and an explicit "tie" option.
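One way to put a number on judge-vs-human agreement is raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. Kappa is a choice made for this sketch, not a metric the source names, and returning 1.0 when both raters give a single identical label throughout is a convention assumed here (kappa is formally undefined in that case).

```python
def agreement_stats(judge: list[bool], human: list[bool]) -> dict[str, float]:
    """Compare binary judge verdicts with human labels on the same cases."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    # Fraction of cases where judge and human gave the same verdict.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement, from each rater's own pass rate.
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        kappa = 1.0  # assumed convention: both raters used one identical label
    else:
        kappa = (observed - expected) / (1 - expected)
    return {"agreement": observed, "kappa": kappa}
```

Raw agreement alone can flatter a judge when most cases pass, since a judge that says "pass" every time would agree often; the chance correction exposes that.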

The whole pipeline wires into CI as a regression gate: a change that drops the golden-set score doesn't merge. This is the same worker/checker separation as in verification — the thing that produces output never gets to grade itself.
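The regression gate can be sketched as a function that turns golden-set results into a process exit code, which is how most CI systems decide whether a step failed. The `tolerance` parameter and the stored baseline pass rate are assumptions of this sketch; the source only says a change that drops the score does not merge.

```python
def golden_set_pass_rate(results: list[bool]) -> float:
    """Fraction of golden-set cases the candidate change passed."""
    if not results:
        raise ValueError("golden set produced no results")
    return sum(results) / len(results)


def ci_exit_code(baseline_pass_rate: float, results: list[bool], tolerance: float = 0.0) -> int:
    """Return 0 (merge allowed) or 1 (blocked) for use as a CI exit status.

    `tolerance` is a hypothetical allowance for judge noise; 0.0 means any
    drop below the baseline blocks the merge.
    """
    candidate = golden_set_pass_rate(results)
    return 0 if candidate >= baseline_pass_rate - tolerance else 1
```

In a CI job this would typically end with `sys.exit(ci_exit_code(baseline, results))`, where `results` comes from a separate checker process rather than from the code under test.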

Your AI Product Needs Evals
The three-layer model this pipeline follows — assertions, then LLM-as-judge, then human review — built from real production failures via error analysis. The core discipline: evals are software you grow from observed failures, not a benchmark you run once.
Creating a LLM-as-a-Judge That Drives Business Results
Make judges binary, give them a rubric, and validate them against human labels before trusting them. An unvalidated judge gives false confidence — and a team that ships on false confidence is worse off than one with no eval at all.
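The pairwise-plus-ties control for position bias mentioned earlier in this phase can be sketched as follows. The `judge` callable is a hypothetical stand-in for an LLM call that sees two candidates in slots A and B and answers "A", "B", or "tie"; the sketch asks twice with the slots swapped and only declares a winner when both orderings agree.

```python
from typing import Callable, Literal

SlotVerdict = Literal["A", "B", "tie"]
# Hypothetical judge signature: (prompt, candidate_in_slot_a, candidate_in_slot_b).
Judge = Callable[[str, str, str], SlotVerdict]


def debiased_pairwise(judge: Judge, prompt: str, out_x: str, out_y: str) -> Literal["x", "y", "tie"]:
    """Compare two outputs in both orders; disagreement collapses to a tie."""
    first = judge(prompt, out_x, out_y)   # x shown in slot A
    second = judge(prompt, out_y, out_x)  # y shown in slot A
    if first == "A" and second == "B":
        return "x"  # x preferred regardless of position
    if first == "B" and second == "A":
        return "y"  # y preferred regardless of position
    return "tie"    # explicit ties, or a verdict that flipped with position
```

Swapping slots addresses position bias only. A judge that simply prefers longer answers will still pick the same winner in both orders, so length bias needs its own control, for example a rubric that penalizes padding.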

Phase P — Protect & optimize

Two threats. The mundane one is eval cost — evals burn tokens, so sample rather than score everything, use a cheap judge model, and cache results on unchanged inputs. The dangerous one is the meta-failure: optimizing a single metric until it games reality.

Don’t let a metric eat the system

The point of an eval is to track reality, not to produce a number that goes up. The moment a metric becomes the target, it stops measuring the thing you cared about.

  • A single headline number gets gamed: optimize needle-in-haystack and you overfit to needles, not usefulness.
  • Separate retrieval metrics from generation metrics so a strong one can’t mask a weak one.
  • Sample, use cheap judge models, and cache — an eval suite too expensive to run is an eval suite nobody runs.
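The cost controls above (sample, then cache on unchanged inputs) might look like this in outline. The `judge_fn` callable, the `rubric_version` cache-key component, and the in-memory dictionary are assumptions of the sketch; a shared deployment would more likely use a persistent store.

```python
import hashlib
import random
from typing import Callable


class CachedJudge:
    """Wrap a judge so identical (rubric, prompt, output) triples are scored once."""

    def __init__(self, judge_fn: Callable[[str, str], bool]) -> None:
        self._judge_fn = judge_fn          # hypothetical judge: (prompt, output) -> pass/fail
        self._cache: dict[str, bool] = {}
        self.calls = 0                     # number of real (paid) judge invocations

    def score(self, rubric_version: str, prompt: str, output: str) -> bool:
        # The rubric version is part of the key so a rubric change invalidates old verdicts.
        raw = "\x1f".join((rubric_version, prompt, output))
        key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._judge_fn(prompt, output)
        return self._cache[key]


def sample_traffic(items: list, rate: float, seed: int = 0) -> list:
    """Keep each item with probability `rate`; seeded so a run is reproducible."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < rate]
```

Counting real judge calls separately from cache hits also gives the eval platform its own cost signal, which matters given that evals burn tokens too.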

Phase T — Test & evolve

This phase is the production half of the platform — the online timescale. You trace every run, account for every token, and watch the one metric that matters: cost per successful outcome, not raw spend.

Observe production

  • Tracing (platform): OpenTelemetry GenAI conventions, with gen_ai.* spans capturing prompts, tool calls, and token usage. Without a trace you can’t do error analysis, and without error analysis the golden set stops growing.
  • Token accounting (cost): Input and output tokens attributed per run, per feature, per tenant. The raw input to cost, and to catching a prompt change that quietly tripled spend.
  • Cost per successful outcome (cost): Dollars divided by tasks that actually succeeded, not dollars per call. The north star: cutting raw cost while success drops is just a cheaper bad product.
  • Online quality score (runtime): Sampled live traffic scored continuously by the same judge pipeline. Catches drift the offline set never saw; this is the production half of the loop.
  • Regression alerting (process): Alerts on quality, cost, and latency moving the wrong way. Quality degrades silently; infra dashboards stay green while answers rot.

Online evals sample live traffic and score it continuously; alerts fire on quality, cost, or latency regressions; and every caught failure is promoted back into the golden set. See tracing, cost-accounting, and metrics-that-matter.
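A small worked sketch of cost per successful outcome, showing how a change can lower cost per call while raising cost per success. The per-million-token prices and the `Run` record are hypothetical; real prices and the definition of "succeeded" come from the product.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute real rates.
PRICE_IN_PER_M = 3.0
PRICE_OUT_PER_M = 15.0


@dataclass
class Run:
    feature: str
    input_tokens: int
    output_tokens: int
    succeeded: bool  # outcome as scored by the eval pipeline


def run_cost(run: Run) -> float:
    """Dollar cost of one run from its token counts."""
    return (run.input_tokens * PRICE_IN_PER_M + run.output_tokens * PRICE_OUT_PER_M) / 1_000_000


def cost_per_call(runs: list[Run]) -> float:
    return sum(run_cost(r) for r in runs) / len(runs)


def cost_per_success(runs: list[Run]) -> float:
    """Total spend divided by successful runs; infinite if nothing succeeded."""
    successes = sum(r.succeeded for r in runs)
    total = sum(run_cost(r) for r in runs)
    return float("inf") if successes == 0 else total / successes


# Baseline prompt: 10 runs, 8 succeed. Trimmed prompt: half the tokens, 3 succeed.
baseline = [Run("search", 1000, 1000, i < 8) for i in range(10)]
trimmed = [Run("search", 500, 500, i < 3) for i in range(10)]
```

With these illustrative numbers the trimmed prompt halves cost per call, yet each successful outcome costs more than before, which is the "cheaper bad product" case the north-star metric is meant to expose.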

Then operate the hill-climb loop: query production, form one hypothesis, change exactly one thing, re-run the golden set, and keep the change only if the score rises. That discipline — plus tracing, cost accounting, and metrics that matter — is what turns an eval suite into a system that compounds.
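The hill-climb loop can be sketched as a function that applies one change at a time and keeps it only if the golden-set score strictly rises. The config dictionary, the `(key, value)` candidate format, and the `evaluate` callable are hypothetical stand-ins for a real harness.

```python
from typing import Any, Callable


def hill_climb(
    baseline: dict[str, Any],
    candidates: list[tuple[str, Any]],
    evaluate: Callable[[dict[str, Any]], float],
) -> tuple[dict[str, Any], float]:
    """Try single-key changes in order; keep each only if the score improves."""
    best = dict(baseline)
    best_score = evaluate(best)  # golden-set score of the current system
    for key, value in candidates:
        trial = {**best, key: value}  # change exactly one thing
        score = evaluate(trial)       # re-run the golden set
        if score > best_score:        # keep only on a strict improvement
            best, best_score = trial, score
    return best, best_score
```

Because each trial differs from the current best in exactly one key, any score movement is attributable to that change, which is the point of the one-hypothesis discipline.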


Common mistakes

What earns the offer

  • A defined, owned, measurable bar for "good", built from real failures
  • Binary, rubric-driven judges validated against human labels
  • Both timescales: offline regression gate AND online quality tracking
  • Cost per successful outcome as the north-star metric
  • A loop: every escaped production failure becomes an offline test

What flags you

  • "It looked good to me" passing for an eval
  • 1-5 judge scores nobody can apply consistently
  • A judge never checked against humans: false confidence at scale
  • Tracking only latency and errors while quality silently degrades
  • Optimizing raw cost (a cheaper, worse product) instead of cost-per-success
  • Offline OR online, never both, with no loop between them

The throughline: an eval system that isn’t validated, isn’t closed-loop, or optimizes the wrong number is one that tells you you’re fine right up until you ship the regression.

Next: revisit the Foundations each ADEPT phase draws on, or move to the coding rounds where you implement the harness these systems run on.