Telemetry & Aggregation

Traces explain a single run; metrics explain the system. Telemetry aggregates signals across all runs — success rates, latencies, tool usage, halt reasons, retry frequency — so you can see trends, catch regressions, and answer "is the agent getting better or worse?" without reading individual transcripts.

Where Metrics That Matter decides which signals are worth collecting, this page is about aggregating them — the pipeline, rollups, alerting, and slicing that turn raw per-run events into a trend you can act on.

One trace tells you why a run failed. A thousand traces, aggregated, tell you that tool X started timing out after Tuesday's deploy, or that one team's runs quietly cost twice what they did last month. Aggregation is how you notice problems before users report them.


Structure

Every run emits structured signals; the pipeline aggregates them into outcome, performance, usage, quality, and cost views, with alerts on regressions.


How It Works

  1. Emit per run — every run reports the signals chosen in Metrics That Matter — outcome, latency, tokens, retries, halt reason — as structured events with a consistent schema. The OpenTelemetry GenAI semantic conventions are the emerging standard for that schema — emit in their shape and aggregation rides existing observability pipelines instead of a bespoke one.
  2. Aggregate — roll events into rates and distributions: success rate, p50/p95 latency, turns per run, tool-failure frequency, model mix.
  3. Track quality over time — fold in eval scores and human accept/reject signals so quality is a trend line, not a vibe.
  4. Alert on regressions — set thresholds on the signals that matter (success rate drop, latency spike, cost surge) and page when they break.
  5. Slice for diagnosis — break metrics down by model, tool, task type, or version to localize a regression to its cause.
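
The first, second, and fourth steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference pipeline: the `RunEvent` fields, the `rollup` and `regressions` helpers, and the threshold defaults are all hypothetical names chosen for the example, and the field names are not the OpenTelemetry GenAI attribute names. Percentiles use the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RunEvent:
    """Hypothetical per-run event; every run emits the same shape."""
    run_id: str
    success: bool
    latency_ms: float
    tokens: int
    retries: int
    halt_reason: str
    model: str


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def rollup(events):
    """Roll per-run events into rates and distribution summaries."""
    if not events:
        raise ValueError("cannot aggregate an empty window")
    n = len(events)
    latencies = [e.latency_ms for e in events]
    return {
        "runs": n,
        "success_rate": sum(e.success for e in events) / n,
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
        "retries_per_run": sum(e.retries for e in events) / n,
    }


def regressions(baseline, current, max_success_drop=0.05, max_latency_ratio=1.5):
    """Compare two rollups and name the thresholds that were crossed.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    alerts = []
    if baseline["success_rate"] - current["success_rate"] > max_success_drop:
        alerts.append("success_rate_drop")
    if current["p95_latency_ms"] > baseline["p95_latency_ms"] * max_latency_ratio:
        alerts.append("p95_latency_spike")
    return alerts
```

In this sketch, `rollup` would run once per time window (say, per hour or per deploy), and `regressions` would compare the newest window against a trailing baseline. Comparing windows rather than checking absolute values is what turns a snapshot into a trend.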

Key Characteristics

  • System view, not run view — metrics answer questions about behavior in aggregate that no single trace can.
  • Aggregation, not collection — the work here is rolling thousands of per-run events into rates and distributions; which signals to emit is the separate question Metrics That Matter answers.
  • Trends beat snapshots — the value is in the derivative: what changed, when, and after what.
  • Sliceable by dimension — aggregate numbers say something regressed; slicing by model/tool/version says what.
  • Quality must be measured, not assumed — without eval signals in the metrics, "better" is an opinion.
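
To make "sliceable by dimension" concrete, here is a small Python sketch that breaks a success rate down by one dimension and picks out the worst slice. It assumes events arrive as plain dictionaries with a boolean `success` key and string dimension keys such as `tool` or `model`; those key names and the helpers `success_rate_by` and `worst_slice` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def success_rate_by(events, dimension):
    """Return {dimension value: success rate} for a list of event dicts."""
    totals = defaultdict(lambda: [0, 0])  # value -> [successes, runs]
    for event in events:
        bucket = totals[event[dimension]]
        bucket[0] += 1 if event["success"] else 0
        bucket[1] += 1
    return {value: ok / runs for value, (ok, runs) in totals.items()}


def worst_slice(events, dimension):
    """Return the (value, success rate) pair with the lowest success rate."""
    rates = success_rate_by(events, dimension)
    return min(rates.items(), key=lambda item: item[1])
```

A global rate computed over the same events would only say that something regressed. Calling `worst_slice(events, "tool")` and then `worst_slice(events, "model")` is one simple way to narrow that down to a likely cause.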

Pitfalls

  • Unstructured events — emitting per-run data in inconsistent shapes makes systematic aggregation impossible. Standardize the event schema before you build the dashboards on top of it.
  • No alerting — metrics nobody watches catch regressions only in the postmortem.
  • Unsliceable aggregates — a single global success-rate number tells you there's a problem but never where it is.
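
One way to guard against the unstructured-events pitfall is to validate each event against a fixed schema before it enters the pipeline. The sketch below is an assumption-laden example rather than a prescribed schema: `REQUIRED_FIELDS` and `validate_event` are hypothetical names, and the field list is only a sample of what a real schema would carry.

```python
# Hypothetical minimal schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "run_id": str,
    "success": bool,
    "latency_ms": (int, float),
    "halt_reason": str,
}


def validate_event(event):
    """Return a list of schema problems; an empty list means the event is valid."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in event:
            problems.append(f"missing field: {name}")
            continue
        value = event[name]
        # bool is a subclass of int, so reject it explicitly for numeric fields.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if is_stray_bool or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    return problems
```

Rejecting or quarantining events that fail this check at ingestion keeps malformed data out of the rollups, so dashboards built on top do not silently average over inconsistent shapes.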