Metrics That Matter
Reliability is an evidence problem. The harness has to expose what the agent actually did and how good it was, in a form that guides the next decision. Without that, agents decide under uncertainty, evaluations become opinions, and retries become blind wandering. This page is about the signals worth collecting — and the ones that quietly let a system get worse while the dashboards stay green.
Code review shows what was written; runtime tracing shows what actually ran. You need both.
Two layers, not one
Most teams instrument the machine and forget the work. A harness needs both: runtime observability answers what the system did, and process observability answers why this change should be accepted. Runtime signals without process signals tell you the app booted but not whether the agent built the right thing; process signals without runtime signals are taste with no evidence underneath. Runtime signals are captured as traces and spans and aggregated across runs by telemetry.
Runtime observability: what did the system do?
System-level signals emitted as the agent runs — the same telemetry you would expect from any production service, captured around the agent loop.
- Lifecycle: did it reach ready, run, shut down cleanly
- Critical-path execution: entry, checkpoints, exit
- Data flow between components
- Resource trend (e.g. monotonically growing memory)
- Full error context, not just the message
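The signals above can be captured with very little machinery. What follows is a minimal sketch, not a definitive implementation: `Span`, `RunTrace`, and `grows_monotonically` are hypothetical names invented for illustration, and a real harness would more likely emit spans through an existing tracing library. It shows lifecycle and critical-path steps recorded as timed spans, full error context preserved on failure, and a crude check for a monotonically growing resource.

```python
import time
import traceback
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    """One timed step on the critical path of an agent run (hypothetical type)."""
    name: str
    start: float
    end: Optional[float] = None
    status: str = "running"  # "running", "ok", or "error"
    attributes: Dict[str, Any] = field(default_factory=dict)
    error: Optional[Dict[str, str]] = None


class RunTrace:
    """Collects the spans of a single run, in the order they started."""

    def __init__(self) -> None:
        self.spans: List[Span] = []

    @contextmanager
    def span(self, name: str, **attributes: Any) -> Iterator[Span]:
        s = Span(name=name, start=time.monotonic(), attributes=dict(attributes))
        self.spans.append(s)
        try:
            yield s
            s.status = "ok"
        except Exception as exc:
            # Keep the full error context, not just the message.
            s.status = "error"
            s.error = {
                "type": type(exc).__name__,
                "message": str(exc),
                "stack": traceback.format_exc(),
            }
            raise
        finally:
            s.end = time.monotonic()


def grows_monotonically(samples: List[float]) -> bool:
    """True when every sample is strictly larger than the one before it."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))


# Illustrative run with made-up step names and memory samples.
trace = RunTrace()
try:
    with trace.span("startup"):
        pass  # lifecycle: reached ready
    with trace.span("apply_patch", file="app.py"):
        raise ValueError("patch did not apply")
except ValueError:
    pass  # the span has already recorded the failure

statuses = [s.status for s in trace.spans]  # ["ok", "error"]
leak_suspected = grows_monotonically([210.0, 215.5, 231.0, 260.2])  # True
```

The failed span keeps the exception type, message, and stack alongside its attributes, so a later session can read why the step failed instead of re-diagnosing it.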
Process observability: why accept this change?
Visibility into the harness’s own decision artifacts — the plan, the contract, the score — so acceptance is reproducible instead of a vibe.
- Sprint contract: scope, verification standards, exclusions
- Evaluator rubric scores with hard thresholds
- Acceptance criteria tied to evidence
- Where the checker’s judgment diverged from a human’s
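One way to make "acceptance criteria tied to evidence" checkable is to store the contract as data. The sketch below is an assumption about shape, not a prescribed schema: `SprintContract`, `Criterion`, and the example strings are all hypothetical. The only rule it encodes is that a criterion with no evidence attached cannot count toward acceptance.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Criterion:
    """A verification standard plus pointers to the evidence that supports it."""
    description: str
    evidence: List[str] = field(default_factory=list)  # e.g. trace IDs, report paths


@dataclass
class SprintContract:
    """Hypothetical contract: scope, verification standards, exclusions."""
    scope: List[str]
    exclusions: List[str]
    verification: List[Criterion]

    def unsupported(self) -> List[str]:
        """Descriptions of criteria that have no evidence attached yet."""
        return [c.description for c in self.verification if not c.evidence]

    def acceptable(self) -> bool:
        """Acceptable only if there is at least one criterion and all have evidence."""
        return bool(self.verification) and not self.unsupported()


# Illustrative contract; every string here is made up.
contract = SprintContract(
    scope=["Add CSV export to the reports page"],
    exclusions=["No changes to the auth flow"],
    verification=[
        Criterion("Export golden journey passes end to end", evidence=["trace:run-17"]),
        Criterion("Type checks pass"),
    ],
)

missing = contract.unsupported()  # ["Type checks pass"]
ready = contract.acceptable()     # False
```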
When this is missing, the failures are systematic, not occasional:
Correct vs. looks-correct
Static review proves the code was written; only runtime evidence proves it ran. Without traces you cannot tell the difference.
Evaluation becomes mysticism
With no rubric, the same output gets graded differently every time. Quality stops being reproducible.
Retries become guesses
An agent that does not know why it failed retries in a random direction, burning tokens and budget on unrelated paths.
Handoff cliff
A new session re-diagnoses from scratch. That redundant diagnosis eats 30–50% of total session time.
Four metrics, all green: latency nominal, error rate zero, throughput up, success rate 94%. No one looked at what "success" meant until week six. It meant "response delivered without a 500." The agent had been confidently completing tasks — wrong tasks — for six weeks. The metric tracked output, not outcome. The fix was one field: verified_completion: boolean in the event schema. The dashboard went red the same day and stayed red until the harness was fixed.
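The gap between output and outcome is easy to see once both are computed from the same events. This is a minimal sketch under assumptions: only the `verified_completion` field comes from the story above, while `TaskEvent`, `http_status`, the two rate functions, and the sample events are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaskEvent:
    task_id: str
    http_status: int           # what the old "success" metric looked at
    verified_completion: bool  # set by a check on the outcome, not the response


def delivery_rate(events: List[TaskEvent]) -> float:
    """Share of events delivered without a 5xx: the misleading 'success rate'."""
    return sum(e.http_status < 500 for e in events) / len(events)


def verified_completion_rate(events: List[TaskEvent]) -> float:
    """Share of events whose outcome was actually verified."""
    return sum(e.verified_completion for e in events) / len(events)


# Made-up sample: four responses delivered, one of them verified.
events = [
    TaskEvent("t1", 200, True),
    TaskEvent("t2", 200, False),
    TaskEvent("t3", 200, False),
    TaskEvent("t4", 200, False),
    TaskEvent("t5", 503, False),
]

delivered = delivery_rate(events)            # 0.8
verified = verified_completion_rate(events)  # 0.2
```

Both numbers come from the same five events; only the second one says anything about whether the right task was done.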
The metrics that matter
A useful metric changes a decision. Vanity metrics — lines generated, agents deployed, tokens consumed — change nothing. Below are the signals that actually move ship/block, retry/stop, and trust/widen decisions, grouped by what question they answer.
The discipline behind the table: never measure speed without quality, and never measure quality without cost. A harness that ships faster also ships slop faster unless the quality and cost signals are first-class next to the latency ones.
The rubric is the instrument
Subjective review does not survive scale — quality ends up depending on which reviewer, on which day. A rubric makes acceptance reproducible: different evaluators reach the same conclusion on the same output, and each dimension carries a hard threshold so a single failing dimension fails the whole change.
| Dimension | A | B | C | D |
|---|---|---|---|---|
| Functionality | Golden journeys pass end to end | Main flow passes, edges thin | Partial — core path breaks | Does not run |
| Correctness | All tests + type checks pass | Main flow tested | Skeleton tests only | Build fails |
| Architecture | Boundaries fully respected | Minor deviations | Obvious deviations | Serious violations |
| Product depth | Real, usable feature | Works but shallow | Presentational only | Stub |
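The hard-threshold rule can be sketched as a small gate. The dimension names and the A to D grades come from the table above; the minimum grade of B for every dimension is an assumption chosen for illustration, as are the names `THRESHOLDS`, `failing_dimensions`, and `accept`.

```python
from typing import Dict, List

# Higher is better; grades follow the rubric columns A through D.
GRADE_ORDER: Dict[str, int] = {"A": 4, "B": 3, "C": 2, "D": 1}

# Assumed thresholds: each dimension must reach at least this grade.
THRESHOLDS: Dict[str, str] = {
    "Functionality": "B",
    "Correctness": "B",
    "Architecture": "B",
    "Product depth": "B",
}


def failing_dimensions(scores: Dict[str, str]) -> List[str]:
    """Dimensions that are unscored or graded below their threshold."""
    failing = []
    for dimension, minimum in THRESHOLDS.items():
        grade = scores.get(dimension)
        if grade is None or GRADE_ORDER[grade] < GRADE_ORDER[minimum]:
            failing.append(dimension)
    return failing


def accept(scores: Dict[str, str]) -> bool:
    """A single failing dimension fails the whole change."""
    return not failing_dimensions(scores)


# Three strong grades do not compensate for one dimension below threshold.
scores = {
    "Functionality": "A",
    "Correctness": "A",
    "Architecture": "C",
    "Product depth": "A",
}

blocked_by = failing_dimensions(scores)  # ["Architecture"]
verdict = accept(scores)                 # False
```

Returning the failing dimensions rather than a bare boolean gives the next retry a direction instead of a guess.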
Cost is a first-class signal
When every change carries a line-item cost, AI-native development stops being a vague budget line and becomes something you profile the way you profile latency. Stitch every agent run's duration, tokens, and dollars into one task trace, then divide by successful outcomes.
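As a rough sketch of that arithmetic: runs are stitched into one trace per task, and total spend, including failed attempts, is divided by the number of tasks that ended in success. `AgentRun`, `TaskTrace`, `stitch`, and the sample figures are hypothetical, and treating a task as successful when any of its runs succeeded is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AgentRun:
    task_id: str
    duration_s: float
    tokens: int
    dollars: float
    succeeded: bool


@dataclass
class TaskTrace:
    """Everything spent on one task, across all of its runs."""
    task_id: str
    duration_s: float = 0.0
    tokens: int = 0
    dollars: float = 0.0
    succeeded: bool = False


def stitch(runs: List[AgentRun]) -> Dict[str, TaskTrace]:
    """Fold individual runs into one trace per task."""
    traces: Dict[str, TaskTrace] = {}
    for run in runs:
        t = traces.setdefault(run.task_id, TaskTrace(run.task_id))
        t.duration_s += run.duration_s
        t.tokens += run.tokens
        t.dollars += run.dollars
        t.succeeded = t.succeeded or run.succeeded  # assumption: any success counts
    return traces


def cost_per_successful_outcome(runs: List[AgentRun]) -> Optional[float]:
    """Total dollars across all runs divided by successful tasks; None if none succeeded."""
    traces = stitch(runs)
    successes = sum(1 for t in traces.values() if t.succeeded)
    if successes == 0:
        return None
    return sum(t.dollars for t in traces.values()) / successes


# Made-up figures: one task succeeds on its second attempt, one never succeeds.
runs = [
    AgentRun("t1", 40.0, 12_000, 0.25, False),
    AgentRun("t1", 55.0, 20_000, 0.50, True),
    AgentRun("t2", 30.0, 9_000, 0.25, False),
]

cost = cost_per_successful_outcome(runs)  # 1.0
```

The single successful run cost 0.50 on its own, but the outcome cost 1.00 once the failed attempts are charged against it.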
See Cost & Token Accounting for metering this at the model gateway, and Prompt Caching for why the cache-hit metric often moves cost more than anything else.
Observability is the hill-climb
The point of all this is not a wall of charts; it is a loop you can climb. Early evaluators in real systems would spot a genuine problem, then talk themselves out of it and approve the work anyway. The fix was not a smarter model; it was reading the checker's own logs, finding the points where its judgment diverged from a human's, and updating the rubric and prompt to close that gap. The checker's logs are themselves a signal you hill-climb on. The payoff is roughly 3×: when the evaluator cites evidence ("contrast is 2.1:1, the standard is 4.5:1"), a fix that once took three or four blind retry cycles lands in one.
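An evidence-citing verdict is mostly a matter of carrying the measured value and the required value together. The sketch below assumes a hypothetical `Finding` type; only the contrast figures come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One evaluator finding that states what was measured and what was required."""
    check: str
    measured: float
    required: float
    unit: str = ""

    def passed(self) -> bool:
        return self.measured >= self.required

    def message(self) -> str:
        verdict = "pass" if self.passed() else "fail"
        return (
            f"{verdict}: {self.check} measured {self.measured}{self.unit}, "
            f"required at least {self.required}{self.unit}"
        )


finding = Finding(check="contrast", measured=2.1, required=4.5, unit=":1")
text = finding.message()
# "fail: contrast measured 2.1:1, required at least 4.5:1"
```

A retry that receives this message knows which value to move and by how much, which a bare "fail" does not tell it.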
That makes the operating loop concrete — the same one behind replay and online evaluation:
Step 1: Query & correlate
Pull the runtime signals for a failed or low-scoring run and line them up against the rubric verdict.
Step 2: Reason & change
Form a hypothesis from evidence, make one change: a prompt, a rubric threshold, a retrieval tweak.
Step 3: Restart & verify
Re-run the same repeatable workload; the score goes up or it does not. Keep it only if it does.
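The three steps amount to a greedy hill-climb: try one change at a time against a fixed workload and keep it only when the score strictly improves. This is a minimal sketch, assuming a deterministic scoring function; `hill_climb`, `toy_score`, and the configuration keys are made up for illustration, and a real score would come from re-running the repeatable workload.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Config = Dict[str, object]


def hill_climb(
    baseline: Config,
    candidates: Iterable[Config],
    score: Callable[[Config], int],
) -> Tuple[Config, int, List[Tuple[Config, int, bool]]]:
    """Apply one candidate change at a time; keep it only if the score goes up."""
    best = dict(baseline)
    best_score = score(best)
    history: List[Tuple[Config, int, bool]] = []
    for change in candidates:
        trial = {**best, **change}  # exactly one change on top of the current best
        trial_score = score(trial)
        kept = trial_score > best_score
        history.append((change, trial_score, kept))
        if kept:
            best, best_score = trial, trial_score
    return best, best_score, history


def toy_score(config: Config) -> int:
    """Stand-in for the workload: pretend number of passing cases out of 10."""
    passing = 5
    if config.get("rubric_cites_evidence"):
        passing += 2
    if float(config.get("temperature", 0.0)) > 0.5:
        passing -= 1
    return passing


baseline = {"temperature": 0.2, "rubric_cites_evidence": False}
candidates = [{"temperature": 0.9}, {"rubric_cites_evidence": True}]

best, best_score, history = hill_climb(baseline, candidates, toy_score)
kept_flags = [kept for _, _, kept in history]  # [False, True]
```

Because each trial differs from the current best by exactly one change, a score movement can be attributed to that change.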
For an internal platform this loop is also how you earn trust: a single confidently-wrong output sets adoption back months, so the fastest way to widen an agent's autonomy is to show the numbers climbing on a workload everyone recognizes.
A metric that never flips a ship/block decision is decoration. Every number on your primary dashboard should have a story about a decision it changed.
- If a metric has never blocked a deploy or reversed a decision, remove it from the primary dashboard.
- Track speed and quality together — speed without quality is a faster way to ship broken things.
- Cost per successful outcome is the only cost number that matters. Raw token spend is overhead accounting.
Pitfalls
- Infra metrics only. Latency and error rate are necessary but not sufficient. Without verified completion rate (VCR), rubric scores, and end-to-end pass rate, the agent can get reliably worse while every operational chart stays green: the Happy Path Mirage at the metrics layer.
- Quality without cost. Optimizing the rubric while ignoring cost per successful outcome is how you ship an excellent agent nobody can afford to run.
- An uncalibrated judge. A rubric scored by a checker you never compared against human verdicts produces a confident, meaningless trend line. Measure evaluator–human agreement before you trust the score.
- Vanity counts. Lines generated, agents deployed, tokens consumed — none of these change a decision. If a metric never flips a ship/block call, stop reporting it.
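For the uncalibrated-judge pitfall, evaluator-human agreement can be measured before the score is trusted. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance, and lists the items where the two verdicts diverge. Using kappa here is a suggestion rather than something the page prescribes, and the function names and sample verdicts are made up.

```python
from collections import Counter
from typing import List, Sequence


def agreement_rate(checker: Sequence[str], human: Sequence[str]) -> float:
    """Share of items on which checker and human gave the same verdict."""
    if len(checker) != len(human) or not checker:
        raise ValueError("need two non-empty verdict lists of equal length")
    return sum(c == h for c, h in zip(checker, human)) / len(checker)


def cohens_kappa(checker: Sequence[str], human: Sequence[str]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    observed = agreement_rate(checker, human)
    n = len(checker)
    checker_counts, human_counts = Counter(checker), Counter(human)
    labels = set(checker_counts) | set(human_counts)
    expected = sum(checker_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


def divergences(checker: Sequence[str], human: Sequence[str]) -> List[int]:
    """Indices where the checker's verdict differs from the human's."""
    return [i for i, (c, h) in enumerate(zip(checker, human)) if c != h]


# Made-up verdicts on four outputs.
checker = ["pass", "pass", "pass", "fail"]
human = ["pass", "fail", "pass", "fail"]

raw = agreement_rate(checker, human)      # 0.75
kappa = cohens_kappa(checker, human)      # 0.5
to_review = divergences(checker, human)   # [1]
```

The divergence indices are the items worth reading in the checker's logs; they are the raw material for the next rubric or prompt update.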