Metrics That Matter

Reliability is an evidence problem. The harness has to expose what the agent actually did and how good it was, in a form that guides the next decision. Without that, agents decide under uncertainty, evaluations become opinions, and retries become blind wandering. This page is about the signals worth collecting — and the ones that quietly let a system get worse while the dashboards stay green.

Code review shows what was written; runtime tracing shows what actually ran. You need both.

30–50% of session time lost to re-diagnosis when there is no observability

2 layers: runtime (what ran) and process (why accept it)

3× faster iteration when signals are present vs. blind retries

1 feedback loop you actually hill-climb on

Observability is not a dashboard you add at the end. It is the property that turns a nondeterministic agent into a system you can improve.

The Impact of AI on Developer Productivity: Evidence from GitHub Copilot

Peng, Kalliamvakou, Cihon, Demirer · 2023 · arXiv preprint

Developers using Copilot completed a representative coding task 55.8% faster — but speed alone masked a quality dimension. The study measured task completion time and self-reported productivity; it did not measure defect rate or correctness. Speed without a quality signal is exactly half the picture.

Two layers, not one

Most teams instrument the machine and forget the work. A harness needs both: runtime observability answers what the system did, and process observability answers why this change should be accepted. Runtime signals without process signals tell you the app booted but not whether the agent built the right thing; process signals without runtime signals are taste with no evidence underneath. Runtime signals are captured as traces and spans and aggregated across runs by telemetry.

Runtime observability

what did the system do

System-level signals emitted as the agent runs — the same telemetry you would expect from any production service, captured around the agent loop.

Lifecycle: did it reach ready, run, shut down cleanly
Critical-path execution: entry, checkpoints, exit
Data flow between components
Resource trend (e.g. monotonically growing memory)
Full error context, not just the message

Process observability

why accept this change

Visibility into the harness’s own decision artifacts — the plan, the contract, the score — so acceptance is reproducible instead of a vibe.

Sprint contract: scope, verification standards, exclusions
Evaluator rubric scores with hard thresholds
Acceptance criteria tied to evidence
Where the checker’s judgment diverged from a human’s

Layered observability: runtime signals explain behavior, process artifacts explain intent. Design them together — they reinforce each other.

When this is missing, the failures are systematic, not occasional:

Correct vs. looks-correct

Static review proves the code was written; only runtime evidence proves it ran. Without traces you cannot tell the difference.

Evaluation becomes mysticism

With no rubric, the same output gets graded differently every time. Quality stops being reproducible.

Retries become guesses

An agent that does not know why it failed retries in a random direction, burning tokens and budget on unrelated paths.

Handoff cliff

A new session re-diagnoses from scratch. That redundant diagnosis eats 30–50% of total session time.

Field note

Four metrics, all green: latency nominal, error rate zero, throughput up, success rate 94%. No one looked at what "success" meant until week six. It meant "response delivered without a 500." The agent had been confidently completing tasks — wrong tasks — for six weeks. The metric tracked output, not outcome. The fix was one field: verified_completion: boolean in the event schema. The dashboard went red the same day and stayed red until the harness was fixed.

— six weeks of green dashboards
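The gap in the field note can be sketched in a few lines. This is a minimal illustration, not the team's actual schema: only the `verified_completion` field comes from the note, while `TaskEvent`, `http_status`, `delivered_rate`, and `verified_rate` are hypothetical names, and the sample events are invented to show the two rates diverging.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskEvent:
    """One record per finished task (hypothetical schema)."""
    task_id: str
    http_status: int           # what the old "success" metric looked at
    verified_completion: bool  # did an executable check confirm the outcome?


def delivered_rate(events: list[TaskEvent]) -> float:
    """Output metric: share of tasks that returned without a server error."""
    if not events:
        return 0.0
    return sum(e.http_status < 500 for e in events) / len(events)


def verified_rate(events: list[TaskEvent]) -> float:
    """Outcome metric: share of tasks whose result was actually verified."""
    if not events:
        return 0.0
    return sum(e.verified_completion for e in events) / len(events)


# Invented sample: three responses delivered, only one verified.
events = [
    TaskEvent("t1", 200, True),
    TaskEvent("t2", 200, False),  # delivered, but the wrong task
    TaskEvent("t3", 200, False),
    TaskEvent("t4", 500, False),
]
print(f"delivered: {delivered_rate(events):.0%}, verified: {verified_rate(events):.0%}")
```

Both numbers come from the same events; the only difference is which field the dashboard counts.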

The metrics that matter

A useful metric changes a decision. Vanity metrics — lines generated, agents deployed, tokens consumed — change nothing. Below are the signals that actually move ship/block, retry/stop, and trust/widen decisions, grouped by what question they answer.

The signals worth collecting: four families · all four needed

Metric	Family	What it measures	Why it matters
p50 / p95 latency	runtime	Per-turn and per-tool wall-clock time, tail included.	The tail is where incidents hide; a p95 that doubles after a deploy is the first symptom.
Failure & retry count	runtime	Tool failures, malformed outputs, and retries per run.	A climbing retry rate means a dependency or prompt is silently degrading.
Queue depth	runtime	Pending runs waiting for a worker slot.	Backpressure signal: a rising queue means you are throttled or under-provisioned, not faster.
Critical-path reach	runtime	Did the app reach ready state and execute the golden journey end to end?	Distinguishes “the agent says done” from “the path a user takes actually works.”
Verified completion rate	process	Tasks that pass executable verification ÷ tasks activated (VCR).	The honest definition of done; gate new work on VCR rather than on the agent’s confidence.
End-to-end pass rate	process	Share of changes that pass the full pipeline, not just unit tests.	Component-boundary defects only appear here; unit-green is not system-green.
Rubric dimension scores	process	Each acceptance dimension scored against a fixed rubric with a hard threshold.	Turns “feels off” into “contrast is 2.1:1, WCAG AA needs 4.5:1”, which is evidence the generator can act on.
Evaluator–human agreement	process	How often the automated checker matches a human’s verdict on the same output.	An uncalibrated judge is just another confident opinion; this is what you tune.
Cost per task	cost	Dollars and tokens in/out for a full task, sub-agents included.	A 200-call agent tree is dollars per run; un-instrumented, it is a month-end surprise.
Cost per successful outcome	cost	Spend ÷ tasks that actually passed.	The only cost number that matters; optimizing raw spend just makes a worse agent cheaper.
Cache hit rate	cost	Share of prompt tokens served from the prefix cache.	Often the largest single cost lever in a long run; a falling rate means context ordering broke.
Adoption	platform	Share of teams and engineers actively using the workflow.	The only number that says the platform is real. A lagging signal of trust, not a vanity count.
Time to first value	platform	Minutes from zero to a working golden path.	Onboarding cost, measured. The faster this is, the faster adoption compounds.
Would-be-missed	platform	Would engineers be upset if it disappeared tomorrow?	The truest product signal an internal platform has; ask it directly.

Runtime, process, cost, platform. Track latency without quality and the agent can get reliably, quickly worse with green dashboards; track quality without cost and you go broke being right.

The discipline behind the table: never measure speed without quality, and never measure quality without cost. A harness that ships faster also ships slop faster unless the quality and cost signals are first-class next to the latency ones.
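As a rough sketch of how a few of these signals can be computed side by side from per-task records: the `RunRecord` fields, the nearest-rank percentile method, and the sample values are all assumptions made for illustration. Only the ratios themselves (VCR as verified tasks ÷ tasks activated, cost per successful outcome as spend ÷ tasks that passed) follow the definitions above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One finished task (hypothetical record shape)."""
    latency_s: float  # wall-clock time for the task
    cost_usd: float   # total spend, sub-agents included
    verified: bool    # passed executable verification


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Speed, quality, and cost reported together, never one alone."""
    latencies = [r.latency_s for r in runs]
    verified = sum(r.verified for r in runs)
    spend = sum(r.cost_usd for r in runs)
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "vcr": verified / len(runs),
        "cost_per_task": spend / len(runs),
        # No verified task means the cost of a success is unbounded.
        "cost_per_success": spend / verified if verified else math.inf,
    }


# Invented sample: two verified tasks, two that did not pass.
runs = [
    RunRecord(10.0, 0.10, True),
    RunRecord(12.0, 0.20, True),
    RunRecord(11.0, 0.10, False),
    RunRecord(60.0, 0.40, False),
]
print(summarize(runs))
```

In this sample the cost per task is about $0.20, but the cost per successful outcome is about $0.40, because half the spend bought nothing that passed verification.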

The rubric is the instrument

Subjective review does not survive scale — quality ends up depending on which reviewer, on which day. A rubric makes acceptance reproducible: different evaluators reach the same conclusion on the same output, and each dimension carries a hard threshold so a single failing dimension fails the whole change.

Evaluator rubric: each dimension has a hard floor; any miss fails the change

Dimension	A	B	C	D
Functionality	Golden journeys pass end to end	Main flow passes, edges thin	Partial — core path breaks	Does not run
Correctness	All tests + type checks pass	Main flow tested	Skeleton tests only	Build fails
Architecture	Boundaries fully respected	Minor deviations	Obvious deviations	Serious violations
Product depth	Real, usable feature	Works but shallow	Presentational only	Stub

An evaluator rubric: an independent checker interacts with the running app like a user and scores each dimension against a hard floor. Separating the worker from the checker is the single largest quality lever — in role-separation experiments scored on a five-point rubric, output quality climbed from 1.6 to 3.3 to 4.9 as planner and evaluator roles were added.
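The "any miss fails the change" rule can be sketched as a small gate. The grades A to D and the four dimensions come from the rubric above; where each floor sits is not stated there, so the `FLOORS` values below are placeholders, and `failing_dimensions` and `accept` are hypothetical helper names.

```python
GRADE_RANK = {"D": 0, "C": 1, "B": 2, "A": 3}

# Placeholder floors: the rubric says each dimension has a hard floor,
# but not which grade it is. Adjust these to your own thresholds.
FLOORS = {
    "functionality": "B",
    "correctness": "A",
    "architecture": "B",
    "product_depth": "B",
}


def failing_dimensions(scores: dict[str, str], floors: dict[str, str] = FLOORS) -> list[str]:
    """Return every dimension graded below its floor.

    A missing or unrecognized grade counts as a miss, so an evaluator
    cannot pass a change by skipping a dimension.
    """
    misses = []
    for dimension, floor in floors.items():
        rank = GRADE_RANK.get(scores.get(dimension), -1)
        if rank < GRADE_RANK[floor]:
            misses.append(dimension)
    return misses


def accept(scores: dict[str, str]) -> bool:
    """A single failing dimension fails the whole change."""
    return not failing_dimensions(scores)


scores = {
    "functionality": "A",
    "correctness": "A",
    "architecture": "C",  # obvious deviations: below the placeholder floor
    "product_depth": "A",
}
print(accept(scores), failing_dimensions(scores))
```

Returning the list of failing dimensions, and not just a boolean, gives the generator something specific to act on in the next attempt.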

Cost is a first-class signal

When every change carries a line-item cost, AI-native development stops being a vague budget line and becomes something you profile the way you profile latency. Stitch every agent run's duration, tokens, and dollars into one task trace, then divide by successful outcomes.

Execution trace: one task · two review loops

Total cost	Tokens in → out	Wall clock	Runs
$0.178	137k → 10.5k	6m 58s	7 (incl. 2 review passes)

Run	Outcome	Duration	Cost	Tokens in → out
Implementation	opened PR	2m 14s	$0.0842	48k → 6k
Risk Profile	LOW	11s	$0.0031	9k → 0.3k
Review	REQUEST_CHANGES	1m 02s	$0.0418	31k → 2k
Implementation	pushed fix	48s	$0.0226	18k → 1k
Review	APPROVE	39s	$0.0203	17k → 0.7k
Deployment	squash-merged	1m 30s	$0.0000	—
Monitor	clean	34s	$0.0061	14k → 0.5k

A real task trace: each bar begins where the prior run ended, to scale (the 11-second risk check is widened slightly for legibility). Two review passes, risk scored once, deploy, then a monitor window — about eighteen cents. Attribute cost per run and the expensive path stops hiding.
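The roll-up behind those totals is plain arithmetic, sketched here with the figures from the trace above. The `Run` record and the helper names are assumptions for illustration. The deployment step reports no token counts, so it is entered as zero.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One agent run inside a task trace (hypothetical record shape)."""
    stage: str
    outcome: str
    seconds: int
    cost_usd: float
    tokens_in: int = 0
    tokens_out: int = 0


TRACE = [
    Run("Implementation", "opened PR", 134, 0.0842, 48_000, 6_000),
    Run("Risk Profile", "LOW", 11, 0.0031, 9_000, 300),
    Run("Review", "REQUEST_CHANGES", 62, 0.0418, 31_000, 2_000),
    Run("Implementation", "pushed fix", 48, 0.0226, 18_000, 1_000),
    Run("Review", "APPROVE", 39, 0.0203, 17_000, 700),
    Run("Deployment", "squash-merged", 90, 0.0),  # no tokens reported
    Run("Monitor", "clean", 34, 0.0061, 14_000, 500),
]


def totals(trace: list[Run]) -> dict[str, float]:
    """Stitch every run's duration, tokens, and dollars into one task total."""
    return {
        "seconds": sum(r.seconds for r in trace),
        "cost_usd": round(sum(r.cost_usd for r in trace), 4),
        "tokens_in": sum(r.tokens_in for r in trace),
        "tokens_out": sum(r.tokens_out for r in trace),
    }


def cost_by_stage(trace: list[Run]) -> dict[str, float]:
    """Attribute spend per stage so the expensive path is visible."""
    spend: dict[str, float] = defaultdict(float)
    for run in trace:
        spend[run.stage] += run.cost_usd
    return {stage: round(cost, 4) for stage, cost in spend.items()}


print(totals(TRACE))
print(cost_by_stage(TRACE))
```

Grouped by stage, the two implementation runs account for about $0.107 of the $0.178 and the two review passes for about $0.062, which is the kind of attribution the caption describes.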

See Cost & Token Accounting for metering this at the model gateway, and Prompt Caching for why the cache-hit metric often moves cost more than anything else.

Observability is the hill-climb

The point of all this is not a wall of charts — it is a loop you can climb. Early evaluators in real systems would spot a genuine problem, then talk themselves out of it and approve the work anyway. The fix was not a smarter model; it was reading the checker's own logs, finding the points where its judgment diverged from a human's, and updating the rubric and prompt to close that gap. The checker's logs are themselves a signal you hill-climb on. The payoff is the 3× in the stat band: when the evaluator cites evidence — "contrast is 2.1:1, the standard is 4.5:1" — a fix that once took three or four blind retry cycles lands in one.

That makes the operating loop concrete — the same one behind replay and online evaluation:

Step 1: Query & correlate. Pull the runtime signals for a failed or low-scoring run and line them up against the rubric verdict.
Step 2: Reason & change. Form a hypothesis from evidence, then make one change: a prompt, a rubric threshold, a retrieval tweak.
Step 3: Restart & verify. Re-run the same repeatable workload; the score goes up or it does not. Keep the change only if it does.

Query and correlate, then reason and change, then restart and verify. Standardize it with one trace per run, one span per turn, one sub-span per verification step, and the loop becomes routine.
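The "keep it only if the score goes up" rule at the end of the loop can be sketched as a single comparison. Everything here is illustrative: `hill_climb_step` is a hypothetical helper, the `score` callable stands in for re-running the repeatable workload and reading the rubric result, and the scores in `FAKE_WORKLOAD_SCORES` are made up.

```python
from typing import Callable, TypeVar

C = TypeVar("C")


def hill_climb_step(
    current: C,
    candidate: C,
    score: Callable[[C], float],
) -> tuple[C, float, bool]:
    """Score both variants on the same workload.

    Keep the candidate only if it is strictly better; otherwise stay
    on the current variant. Returns (kept variant, its score, accepted?).
    """
    baseline = score(current)
    trial = score(candidate)
    if trial > baseline:
        return candidate, trial, True
    return current, baseline, False


# Made-up scores standing in for a real, repeatable evaluation run.
FAKE_WORKLOAD_SCORES = {"prompt-v1": 0.62, "prompt-v2": 0.71, "prompt-v3": 0.66}
score = FAKE_WORKLOAD_SCORES.__getitem__

kept, kept_score, accepted = hill_climb_step("prompt-v1", "prompt-v2", score)
print(kept, kept_score, accepted)  # the change is kept

kept, kept_score, accepted = hill_climb_step(kept, "prompt-v3", score)
print(kept, kept_score, accepted)  # the change is discarded
```

The comparison only means something if the workload is the same on both sides and only one thing changed between the two variants, which is why the loop insists on a single change per iteration.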

For an internal platform this loop is also how you earn trust: a single confidently wrong output sets adoption back months, so the fastest way to widen an agent's autonomy is to show the numbers climbing on a workload everyone recognizes.

Metric discipline

A metric that never flips a ship/block decision is decoration. Every number on your primary dashboard should have a story about a decision it changed.

If a metric has never blocked a deploy or reversed a decision, remove it from the primary dashboard.
Track speed and quality together — speed without quality is a faster way to ship broken things.
Cost per successful outcome is the only cost number that matters. Raw token spend is overhead accounting.

Pitfalls

Infra metrics only. Latency and error rate are necessary but not sufficient. Without VCR, rubric scores, and end-to-end pass rate, the agent can get reliably worse while every operational chart stays green — the Happy Path Mirage at the metrics layer.
Quality without cost. Optimizing the rubric while ignoring cost per successful outcome is how you ship an excellent agent nobody can afford to run.
An uncalibrated judge. A rubric scored by a checker you never compared against human verdicts produces a confident, meaningless trend line. Measure evaluator–human agreement before you trust the score.
Vanity counts. Lines generated, agents deployed, tokens consumed — none of these change a decision. If a metric never flips a ship/block call, stop reporting it.
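For the uncalibrated-judge pitfall, a minimal calibration check might look like the sketch below. Raw agreement is the metric named on this page. Cohen's kappa is an addition that the page does not mention, included because raw agreement looks flattering when nearly every verdict is a pass. The function names and the sample verdicts are invented for illustration.

```python
def agreement(checker: list[bool], human: list[bool]) -> float:
    """Share of outputs where the checker and the human gave the same verdict."""
    if not checker or len(checker) != len(human):
        raise ValueError("need two non-empty verdict lists of equal length")
    return sum(c == h for c, h in zip(checker, human)) / len(checker)


def cohen_kappa(checker: list[bool], human: list[bool]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement(checker, human)
    n = len(checker)
    p_checker = sum(checker) / n
    p_human = sum(human) / n
    expected = p_checker * p_human + (1 - p_checker) * (1 - p_human)
    if expected == 1.0:
        return 1.0  # both raters gave one identical constant verdict
    return (observed - expected) / (1 - expected)


def divergences(checker: list[bool], human: list[bool]) -> list[int]:
    """Indices worth reading by hand: where the two verdicts differ."""
    return [i for i, (c, h) in enumerate(zip(checker, human)) if c != h]


# Invented verdicts (True = approve) on the same eight outputs.
checker = [True, True, True, True, False, True, True, False]
human = [True, False, True, True, False, False, True, False]

print(agreement(checker, human), cohen_kappa(checker, human), divergences(checker, human))
```

In this sample the checker approves two outputs the human rejected. Those two indices are the logs to read first when tuning the rubric and prompt.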
