
Metrics That Matter

Reliability is an evidence problem. The harness has to expose what the agent actually did and how good it was, in a form that guides the next decision. Without that, agents decide under uncertainty, evaluations become opinions, and retries become blind wandering. This page is about the signals worth collecting — and the ones that quietly let a system get worse while the dashboards stay green.

Code review shows what was written; runtime tracing shows what actually ran. You need both.

  • 30–50% of session time lost to re-diagnosis when there is no observability
  • 2 layers: runtime (what ran) and process (why accept it)
  • 3× faster iteration when signals are present vs. blind retries
  • 1 feedback loop you actually hill-climb on
Observability is not a dashboard you add at the end. It is the property that turns a nondeterministic agent into a system you can improve.

The Impact of AI on Developer Productivity: Evidence from GitHub Copilot
Developers using Copilot completed a representative coding task 55.8% faster — but speed alone masked a quality dimension. The study measured task completion time and self-reported productivity; it did not measure defect rate or correctness. Speed without a quality signal is exactly half the picture.

Two layers, not one

Most teams instrument the machine and forget the work. A harness needs both: runtime observability answers what the system did, and process observability answers why this change should be accepted. Runtime signals without process signals tell you the app booted but not whether the agent built the right thing; process signals without runtime signals are taste with no evidence underneath. Runtime signals are captured as traces and spans and aggregated across runs by telemetry.

Runtime observability

what did the system do

System-level signals emitted as the agent runs — the same telemetry you would expect from any production service, captured around the agent loop.

  • Lifecycle: did it reach ready, run, shut down cleanly
  • Critical-path execution: entry, checkpoints, exit
  • Data flow between components
  • Resource trend (e.g. monotonically growing memory)
  • Full error context, not just the message

Process observability

why accept this change

Visibility into the harness’s own decision artifacts — the plan, the contract, the score — so acceptance is reproducible instead of a vibe.

  • Sprint contract: scope, verification standards, exclusions
  • Evaluator rubric scores with hard thresholds
  • Acceptance criteria tied to evidence
  • Where the checker’s judgment diverged from a human’s
Layered observability: runtime signals explain behavior, process artifacts explain intent. Design them together — they reinforce each other.

When this is missing, the failures are systematic, not occasional:

1. Correct vs. looks-correct. Static review proves the code was written; only runtime evidence proves it ran. Without traces you cannot tell the difference.

2. Evaluation becomes mysticism. With no rubric, the same output gets graded differently every time. Quality stops being reproducible.

3. Retries become guesses. An agent that does not know why it failed retries in a random direction, burning tokens and budget on unrelated paths.

4. Handoff cliff. A new session re-diagnoses from scratch. That redundant diagnosis eats 30–50% of total session time.


Field note

Four metrics, all green: latency nominal, error rate zero, throughput up, success rate 94%. No one looked at what "success" meant until week six. It meant "response delivered without a 500." The agent had been confidently completing tasks — wrong tasks — for six weeks. The metric tracked output, not outcome. The fix was one field: verified_completion: boolean in the event schema. The dashboard went red the same day and stayed red until the harness was fixed.

six weeks of green dashboards
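
The gap between the two definitions of success in the field note can be sketched in a few lines. This is a minimal illustration, not the team's actual schema: only the `verified_completion` field comes from the text, while `RunEvent`, `task_id`, `http_status`, and the sample events are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunEvent:
    """One finished agent run. Only verified_completion is from the field note."""
    task_id: str               # hypothetical identifier
    http_status: int           # hypothetical: the only thing the old metric checked
    verified_completion: bool  # True only if an executable check confirmed the outcome


def delivery_rate(events: list[RunEvent]) -> float:
    """The old 'success': a response was delivered without a server error."""
    if not events:
        return 0.0
    return sum(e.http_status < 500 for e in events) / len(events)


def verified_rate(events: list[RunEvent]) -> float:
    """The outcome metric: the task was checked and actually passed."""
    if not events:
        return 0.0
    return sum(e.verified_completion for e in events) / len(events)


# Invented sample data: three responses delivered, one of them verified.
events = [
    RunEvent("t1", 200, True),
    RunEvent("t2", 200, False),  # delivered, but the wrong task
    RunEvent("t3", 200, False),
    RunEvent("t4", 500, False),
]

print(f"delivered: {delivery_rate(events):.0%}, verified: {verified_rate(events):.0%}")
```

The two rates are computed from the same events; the only difference is which field counts as success.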

The metrics that matter

A useful metric changes a decision. Vanity metrics — lines generated, agents deployed, tokens consumed — change nothing. Below are the signals that actually move ship/block, retry/stop, and trust/widen decisions, grouped by what question they answer.

The signals worth collecting (four families, all four needed)

Runtime

  • p50 / p95 latency. Per-turn and per-tool wall-clock time, tail included. The tail is where incidents hide; a p95 that doubles after a deploy is the first symptom.
  • Failure & retry count. Tool failures, malformed outputs, and retries per run. A climbing retry rate means a dependency or prompt is silently degrading.
  • Queue depth. Pending runs waiting for a worker slot. This is the backpressure signal: a rising queue means you are throttled or under-provisioned, not faster.
  • Critical-path reach. Did the app reach ready state and execute the golden journey end to end. Distinguishes “the agent says done” from “the path a user takes actually works.”

Process

  • Verified completion rate (VCR). Tasks that pass executable verification ÷ tasks activated. The honest definition of done; gate new work on VCR rather than on the agent’s confidence.
  • End-to-end pass rate. Share of changes that pass the full pipeline, not just unit tests. Component-boundary defects only appear here; unit-green is not system-green.
  • Rubric dimension scores. Each acceptance dimension scored against a fixed rubric with a hard threshold. Turns “feels off” into “contrast is 2.1:1, WCAG AA needs 4.5:1”, which is evidence the generator can act on.
  • Evaluator–human agreement. How often the automated checker matches a human’s verdict on the same output. An uncalibrated judge is just another confident opinion; this is what you tune.

Cost

  • Cost per task. Dollars and tokens in/out for a full task, sub-agents included. A 200-call agent tree is dollars per run; un-instrumented, it is a month-end surprise.
  • Cost per successful outcome. Spend ÷ tasks that actually passed. The only cost number that matters; optimizing raw spend just makes a worse agent cheaper.
  • Cache hit rate. Share of prompt tokens served from the prefix cache. Often the largest single cost lever in a long run; a falling rate means context ordering broke.

Platform

  • Adoption. Share of teams and engineers actively using the workflow. The only number that says the platform is real. A lagging signal of trust, not a vanity count.
  • Time to first value. Minutes from zero to a working golden path. Onboarding cost, measured. The faster this is, the faster adoption compounds.
  • Would-be-missed. Would engineers be upset if it disappeared tomorrow. The truest product signal an internal platform has, so ask it directly.
Runtime, process, cost, platform. Track latency without quality and the agent can get reliably, quickly worse with green dashboards; track quality without cost and you go broke being right.

The discipline behind the table: never measure speed without quality, and never measure quality without cost. A harness that ships faster also ships slop faster unless the quality and cost signals are first-class next to the latency ones.
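
As a rough illustration of keeping speed, quality, and cost side by side, the sketch below computes tail latency, verified completion rate, and cost per successful outcome from one list of task records. The `TaskRecord` shape, the nearest-rank percentile method, and the sample numbers are all assumptions made for the example; the text defines only the ratios themselves.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRecord:
    """Hypothetical per-task record; one entry per activated task."""
    latency_s: float  # wall-clock time for the whole task
    cost_usd: float   # total spend, sub-agents included
    verified: bool    # passed executable verification


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def verified_completion_rate(tasks: list[TaskRecord]) -> float:
    """VCR: tasks that passed verification divided by tasks activated."""
    return sum(t.verified for t in tasks) / len(tasks)


def cost_per_successful_outcome(tasks: list[TaskRecord]) -> float:
    """Total spend divided by tasks that actually passed; infinite if none passed."""
    passed = sum(t.verified for t in tasks)
    if passed == 0:
        return math.inf
    return sum(t.cost_usd for t in tasks) / passed


# Invented sample data for illustration only.
tasks = [
    TaskRecord(10.0, 0.10, True),
    TaskRecord(12.0, 0.20, True),
    TaskRecord(11.0, 0.10, False),
    TaskRecord(60.0, 0.40, False),
    TaskRecord(13.0, 0.20, True),
]

latencies = [t.latency_s for t in tasks]
print("p50:", percentile(latencies, 50), "p95:", percentile(latencies, 95))
print("VCR:", verified_completion_rate(tasks))
print("cost per successful outcome:", round(cost_per_successful_outcome(tasks), 4))
```

Note that failed tasks still contribute to the numerator of the cost figure, which is the point: spend on work that did not pass is charged to the work that did.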


The rubric is the instrument

Subjective review does not survive scale — quality ends up depending on which reviewer, on which day. A rubric makes acceptance reproducible: different evaluators reach the same conclusion on the same output, and each dimension carries a hard threshold so a single failing dimension fails the whole change.

Evaluator rubric (each dimension has a hard floor; any miss fails the change)

| Dimension | A | B | C | D |
|---|---|---|---|---|
| Functionality | Golden journeys pass end to end | Main flow passes, edges thin | Partial: core path breaks | Does not run |
| Correctness | All tests + type checks pass | Main flow tested | Skeleton tests only | Build fails |
| Architecture | Boundaries fully respected | Minor deviations | Obvious deviations | Serious violations |
| Product depth | Real, usable feature | Works but shallow | Presentational only | Stub |
An evaluator rubric: an independent checker interacts with the running app like a user and scores each dimension against a hard floor. Separating the worker from the checker is the single largest quality lever — in role-separation experiments scored on a five-point rubric, output quality climbed from 1.6 to 3.3 to 4.9 as planner and evaluator roles were added.
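
The "any miss fails the change" rule can be sketched as a small gate. The floor values below are hypothetical, since the text says each dimension has a hard floor but does not state where it sits; the dimension names and A–D grades follow the rubric above.

```python
GRADE_ORDER = "DCBA"  # worst to best, matching the A-D columns of the rubric

# Hypothetical floors, chosen only for illustration.
FLOORS = {
    "functionality": "B",
    "correctness": "A",
    "architecture": "B",
    "product_depth": "B",
}


def failing_dimensions(scores: dict[str, str], floors: dict[str, str] = FLOORS) -> list[str]:
    """Return every dimension scored below its floor; a missing score counts as a miss."""
    failed = []
    for dimension, floor in floors.items():
        grade = scores.get(dimension)
        if grade is None or GRADE_ORDER.index(grade) < GRADE_ORDER.index(floor):
            failed.append(dimension)
    return failed


def accept(scores: dict[str, str]) -> bool:
    """A change is accepted only if no dimension falls below its floor."""
    return not failing_dimensions(scores)


# Invented example: strong everywhere except architecture.
scores = {
    "functionality": "A",
    "correctness": "A",
    "architecture": "C",
    "product_depth": "A",
}
print(accept(scores), failing_dimensions(scores))
```

Returning the list of failing dimensions, rather than a bare boolean, gives the generator something specific to act on in the next attempt.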

Cost is a first-class signal

When every change carries a line-item cost, AI-native development stops being a vague budget line and becomes something you profile the way you profile latency. Stitch every agent run's duration, tokens, and dollars into one task trace, then divide by successful outcomes.

Execution trace (one task, two review loops)

Totals: $0.178 total cost · 137k → 10.5k tokens in / out · 6m 58s wall clock · 7 runs, incl. 2 review passes

| Run | Result | Duration | Cost | Tokens in → out |
|---|---|---|---|---|
| Implementation | opened PR | 2m 14s | $0.0842 | 48k → 6k |
| Risk Profile | LOW | 11s | $0.0031 | 9k → 0.3k |
| Review | REQUEST_CHANGES | 1m 02s | $0.0418 | 31k → 2k |
| Implementation | pushed fix | 48s | $0.0226 | 18k → 1k |
| Review | APPROVE | 39s | $0.0203 | 17k → 0.7k |
| Deployment | squash-merged | 1m 30s | $0.0000 | |
| Monitor | clean | 34s | $0.0061 | 14k → 0.5k |
A real task trace: each bar begins where the prior run ended, to scale (the 11-second risk check is widened slightly for legibility). Two review passes, risk scored once, deploy, then a monitor window — about eighteen cents. Attribute cost per run and the expensive path stops hiding.
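
Per-run attribution is simple arithmetic once each run is recorded as a row. The sketch below re-derives the totals from the per-run figures in the trace above and groups cost by stage; the tuple layout and variable names are illustrative choices, not part of any particular tracing tool.

```python
from collections import defaultdict

# (stage, result, duration in seconds, cost in USD), copied from the trace above.
RUNS = [
    ("Implementation", "opened PR", 134, 0.0842),
    ("Risk Profile", "LOW", 11, 0.0031),
    ("Review", "REQUEST_CHANGES", 62, 0.0418),
    ("Implementation", "pushed fix", 48, 0.0226),
    ("Review", "APPROVE", 39, 0.0203),
    ("Deployment", "squash-merged", 90, 0.0000),
    ("Monitor", "clean", 34, 0.0061),
]

total_seconds = sum(seconds for _stage, _result, seconds, _cost in RUNS)
total_cost = sum(cost for _stage, _result, _seconds, cost in RUNS)

# Group spend by stage so repeated stages (two implementations, two reviews) add up.
cost_by_stage: dict[str, float] = defaultdict(float)
for stage, _result, _seconds, cost in RUNS:
    cost_by_stage[stage] += cost

minutes, seconds = divmod(total_seconds, 60)
print(f"{len(RUNS)} runs, {minutes}m {seconds:02d}s, ${total_cost:.3f}")
for stage, cost in sorted(cost_by_stage.items(), key=lambda item: item[1], reverse=True):
    print(f"{stage:<15} ${cost:.4f}  {cost / total_cost:.0%}")
```

Grouped this way, the two implementation runs account for roughly 60% of the spend and the two reviews for roughly 35%, which is the kind of breakdown a single total hides.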

See Cost & Token Accounting for metering this at the model gateway, and Prompt Caching for why the cache-hit metric often moves cost more than anything else.


Observability is the hill-climb

The point of all this is not a wall of charts — it is a loop you can climb. Early evaluators in real systems would spot a genuine problem, then talk themselves out of it and approve the work anyway. The fix was not a smarter model; it was reading the checker's own logs, finding the points where its judgment diverged from a human's, and updating the rubric and prompt to close that gap. The checker's logs are themselves a signal you hill-climb on. The payoff is the 3× in the stat band: when the evaluator cites evidence — "contrast is 2.1:1, the standard is 4.5:1" — a fix that once took three or four blind retry cycles lands in one.
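
Finding the points of divergence starts with pairing each checker verdict with a human verdict on the same output. The sketch below is a minimal version under assumed names (`Verdict`, `checker_pass`, `human_pass`); it reports raw agreement, which is not corrected for chance, and lists the outputs the checker approved that a human rejected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical paired verdict on a single output."""
    output_id: str
    checker_pass: bool
    human_pass: bool


def agreement_rate(verdicts: list[Verdict]) -> float:
    """Share of outputs where the checker and the human reached the same verdict."""
    return sum(v.checker_pass == v.human_pass for v in verdicts) / len(verdicts)


def false_approvals(verdicts: list[Verdict]) -> list[str]:
    """Outputs the checker approved but the human rejected."""
    return [v.output_id for v in verdicts if v.checker_pass and not v.human_pass]


# Invented sample data for illustration only.
verdicts = [
    Verdict("o1", True, True),
    Verdict("o2", False, False),
    Verdict("o3", True, False),  # checker talked itself into approving
    Verdict("o4", True, True),
]

print("agreement:", agreement_rate(verdicts))
print("review these checker logs first:", false_approvals(verdicts))
```

The false-approval list is the reading queue: those are the logs where the rubric or the checker prompt most likely needs to change.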

That makes the operating loop concrete — the same one behind replay and online evaluation:

Step 1. Query & correlate

Pull the runtime signals for a failed or low-scoring run and line them up against the rubric verdict.

Step 2. Reason & change

Form a hypothesis from evidence, then make one change: a prompt, a rubric threshold, a retrieval tweak.

Step 3. Restart & verify

Re-run the same repeatable workload; the score goes up or it does not. Keep it only if it does.

Query and correlate, then reason and change, then restart and verify. Standardize it with one trace per run, one span per turn, one sub-span per verification step, and the loop becomes routine.
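
The "one trace per run, one span per turn, one sub-span per verification step" convention can be sketched with plain data classes. This is a toy structure for illustration, not the API of any tracing library; `Span`, `kind`, and the attribute names are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical span node: the root is the run, its children are turns."""
    name: str
    kind: str  # "run", "turn", or "verification"
    attributes: dict[str, object] = field(default_factory=dict)
    children: list[Span] = field(default_factory=list)

    def child(self, name: str, kind: str, **attributes: object) -> Span:
        """Create a nested span and attach it under this one."""
        span = Span(name, kind, dict(attributes))
        self.children.append(span)
        return span

    def walk(self):
        """Yield this span and every descendant, depth first."""
        yield self
        for span in self.children:
            yield from span.walk()


# One trace per run, one span per turn, one sub-span per verification step.
trace = Span("task-123", "run")
turn_1 = trace.child("turn-1", "turn", tool="edit_file")
turn_1.child("unit-tests", "verification", passed=True)
turn_2 = trace.child("turn-2", "turn", tool="run_app")
turn_2.child("golden-journey", "verification", passed=False)

# Step 1 of the loop: query the trace for the verification steps that failed.
failed = [
    span.name
    for span in trace.walk()
    if span.kind == "verification" and not span.attributes["passed"]
]
print(failed)
```

Because every run has the same shape, the query in step 1 is the same few lines for every failure, which is what makes the loop routine.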

For an internal platform this loop is also how you earn trust: a single confidently-wrong output sets adoption back months, so the fastest way to widen an agent's autonomy is to show the numbers climbing on a workload everyone recognizes.


Metric discipline

A metric that never flips a ship/block decision is decoration. Every number on your primary dashboard should have a story about a decision it changed.

  • If a metric has never blocked a deploy or reversed a decision, remove it from the primary dashboard.
  • Track speed and quality together — speed without quality is a faster way to ship broken things.
  • Cost per successful outcome is the only cost number that matters. Raw token spend is overhead accounting.

Pitfalls

  • Infra metrics only. Latency and error rate are necessary but not sufficient. Without VCR, rubric scores, and end-to-end pass rate, the agent can get reliably worse while every operational chart stays green — the Happy Path Mirage at the metrics layer.
  • Quality without cost. Optimizing the rubric while ignoring cost per successful outcome is how you ship an excellent agent nobody can afford to run.
  • An uncalibrated judge. A rubric scored by a checker you never compared against human verdicts produces a confident, meaningless trend line. Measure evaluator–human agreement before you trust the score.
  • Vanity counts. Lines generated, agents deployed, tokens consumed — none of these change a decision. If a metric never flips a ship/block call, stop reporting it.