Evaluation & Quality
You cannot improve, or safely scale, what you cannot measure. An evaluation harness is what turns "the agent feels better this week" into a number you can hill-climb — and what catches regressions before they reach production. It's also the answer to the platform's central tension: how to ship faster without shipping more slop.
Every prompt tweak, model swap, and retrieval change is a hypothesis. Without evals, you're shipping vibes — and you'll discover the regression in production, attributed to nobody. With evals, every change is a measured experiment.
These are the feedback and evaluation patterns, applied as platform infrastructure that every workflow plugs into.
Structure
Production failures become new eval cases, so the suite gets stronger every time something breaks. The harness is a gate in the pipeline, not a side project.
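One way to close that loop is sketched below. It assumes the golden set lives in a JSON Lines file. The helper name `add_failure_as_case` and the fields `task_input`, `expected`, and `source` are hypothetical, chosen for illustration rather than taken from any particular platform.

```python
import json
from pathlib import Path


def add_failure_as_case(
    golden_path: Path, task_input: str, corrected_output: str, source: str
) -> bool:
    """Append a production failure to the golden set as a new eval case.

    Returns False (and writes nothing) when a case with the same input
    already exists, so repeated incidents do not bloat the suite.
    """
    existing_inputs = set()
    if golden_path.exists():
        with golden_path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    existing_inputs.add(json.loads(line)["task_input"])

    if task_input in existing_inputs:
        return False

    record = {
        "task_input": task_input,
        "expected": corrected_output,  # the outcome a human confirmed as correct
        "source": source,              # where the failure was observed
    }
    with golden_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return True
```

The deduplication key here is the raw input string, which is the simplest possible choice; a real suite may need a looser notion of "same case."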
What to Build
- A golden case set — representative tasks with known-good outcomes, drawn from real work. This is the foundation; everything else scores against it.
- Automated scoring where possible — tests pass, types check, output matches schema. Cheap, objective, fast. Prefer it wherever the task is verifiable.
- LLM-as-Judge — for subjective quality (is this a good review comment?), a model scores against a rubric. Calibrate it against human judgment so you trust the number.
- Domain metrics — task-specific signals: PR-review precision, test coverage delta, time-to-triage, false-positive rate.
- Human feedback capture — accept/reject/edit signals from engineers, fed back as new cases and as a quality trend.
- Regression gating — wire the suite into CI so a change that drops the score can't ship silently.
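The first, second, and last items above can be sketched together as a minimal harness. Everything named here (`GoldenCase`, `exact_match`, `run_suite`, `gate`) is a hypothetical illustration, and exact string matching stands in for whatever automated check fits the task (tests passing, schema validation, and so on).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    """One representative task with a known-good outcome."""
    case_id: str
    task_input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Simplest automated scorer: 1.0 on a match, 0.0 otherwise."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_suite(
    agent: Callable[[str], str],
    cases: list[GoldenCase],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Return the agent's mean score across the golden set."""
    if not cases:
        raise ValueError("golden set is empty")
    total = sum(scorer(agent(c.task_input), c.expected) for c in cases)
    return total / len(cases)


def gate(candidate_score: float, baseline_score: float, tolerance: float = 0.0) -> bool:
    """Return True when the candidate may ship.

    The candidate passes if it does not fall more than `tolerance`
    below the baseline score.
    """
    return candidate_score >= baseline_score - tolerance
```

In CI, a script could call `run_suite` for the proposed change, compare against the stored baseline with `gate`, and exit non-zero on failure so the pipeline blocks the merge. The `tolerance` parameter is an assumption for suites whose scores are noisy; setting it to zero gives a strict gate.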
Hill-climbing
Evaluation isn't a one-time audit — it's the loop you optimize in. Make one change, measure against the golden set, keep it if the score rises, revert if it falls. Repeat. This is hill-climbing, and it's the difference between a platform that improves predictably and one that drifts on intuition.
The discipline: change one thing at a time, measure, and never let "it feels better" override "the number went down."
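The loop can be written down in a few lines. This is a sketch under assumptions: `hill_climb` is a hypothetical helper, a "change" is modeled as a named function that returns a modified copy of the current configuration, and `evaluate` is whatever runs the golden set and returns a score.

```python
from typing import Callable, Iterable, TypeVar

Config = TypeVar("Config")


def hill_climb(
    baseline: Config,
    changes: Iterable[tuple[str, Callable[[Config], Config]]],
    evaluate: Callable[[Config], float],
) -> tuple[Config, float, list[tuple[str, float, bool]]]:
    """Apply one change at a time and keep it only if the score rises.

    Returns the best configuration, its score, and a log of
    (change name, score, kept) entries for every change tried.
    """
    best, best_score = baseline, evaluate(baseline)
    log = []
    for name, apply_change in changes:
        candidate = apply_change(best)   # exactly one change on top of the current best
        score = evaluate(candidate)
        kept = score > best_score        # equal or lower scores are reverted
        log.append((name, score, kept))
        if kept:
            best, best_score = candidate, score
    return best, best_score, log
```

Because each change is applied on top of the current best, a kept change becomes the new baseline for the next one. If the score is noisy from run to run, a strict `>` comparison can accept changes that are not real improvements, so repeated runs or a minimum margin may be worth adding.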
Key Characteristics
- Measurement precedes scale — never expand an agent's autonomy past what its eval record justifies. Evals are how authority is earned.
- Production is your best test set — every failure that escaped is a case you were missing. Close the loop.
- Judge calibration matters — an LLM judge you haven't checked against humans is just another ungrounded opinion.
- Speed and quality are both measured — track throughput and regression rate, so "faster" never quietly means "worse."
- Evals make accountability concrete — a scored, gated change is an attributable decision, not a vibe deployment.
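Judge calibration can be made concrete by comparing the judge's verdicts with human labels on the same cases. The sketch below computes raw agreement and Cohen's kappa, which corrects agreement for chance. The function names and the `"pass"`/`"fail"` labels are illustrative assumptions, and what counts as "calibrated enough" is a threshold each team has to choose.

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of cases where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = judge_counts.keys() | human_counts.keys()
    # Probability that both raters pick the same label if they labeled independently.
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, with judge labels `["pass", "pass", "fail", "fail"]` and human labels `["pass", "fail", "fail", "fail"]`, raw agreement is 0.75 but kappa is 0.5, because half of that agreement would be expected by chance.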
When to Use
- Always, before scaling any workflow's autonomy or reach.
- When prompt, model, or retrieval changes are merged on intuition and regressions only surface in production.
- When you need to prove the platform helps — to skeptics, to leadership, to yourself.
Pitfalls
- No evals at all — the Vibe Deployment anti-pattern. Shipping changes you can't measure is how slop accumulates invisibly.
- Evals that don't reflect real work — a suite of toy cases gives a green number and a false sense of safety. The Happy Path Mirage.
- Measuring speed only — optimizing throughput without a quality gate is optimizing for faster regressions.