Tracing & Spans

An agent run is a tree of decisions: turns, model calls, tool dispatches, sub-agents, retries. Tracing records that tree as structured spans so you can see exactly what happened, in what order, and why. Without it, debugging an agent is reading tea leaves; with it, every run is a transcript you can walk.

When an agent does something baffling, the only question that matters is "what did it actually see and do?" A trace answers it. Agents are nondeterministic enough that observability is not optional — it's the difference between fixing a problem and re-rolling the dice.


Structure

Each unit of work is a span, nested under its parent. A run's trace is the whole tree — turns, model calls, tools, sub-agents, and retries — with inputs, outputs, and timing on each.

This span schema doesn't have to be invented per-harness — there's a converging open standard for it.

Semantic conventions for generative AI systems
The OpenTelemetry GenAI semantic conventions define an emerging standard span schema covering model calls, token usage, tool calls, and agent spans. Observability vendors (Datadog among them) have added support for it, and frameworks like LangChain can emit it natively or through instrumentation libraries, so a harness that traces in this shape interoperates with existing APM stacks instead of living in a bespoke silo.
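As an illustration, a single model call might be described with attributes like the ones below. The attribute keys follow the GenAI conventions as published at the time of writing; the conventions are still evolving, so check the current specification before relying on any name. The `model_call_attributes` helper, the model name, and the values are made up for this sketch.

```python
from __future__ import annotations


def model_call_attributes(model: str, input_tokens: int, output_tokens: int,
                          finish_reason: str) -> dict[str, object]:
    """Build span attributes for one chat-style model call (hypothetical helper)."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": [finish_reason],
    }


attrs = model_call_attributes("example-model", 1200, 85, "stop")

# The conventions suggest naming the span "<operation> <model>".
span_name = f'{attrs["gen_ai.operation.name"]} {attrs["gen_ai.request.model"]}'
```

Plain dictionaries are used here to keep the sketch dependency-free; in a real harness these would typically be set on a span created through an OpenTelemetry SDK.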

How It Works

  1. Span every unit of work — open a span for the run, each turn, each model call, each tool dispatch, each sub-agent.
  2. Capture inputs and outputs — record the assembled prompt, the model's raw output, tool arguments and results, and the halt reason — the actual data, not just timing.
  3. Preserve the hierarchy — nest spans so a sub-agent's trace sits under the turn that spawned it, keyed to the session id.
  4. Attach metadata — tokens, cost, model, latency, and error class on each span, feeding metrics and cost accounting.
  5. Make it navigable — surface the trace as a walkable tree so a human can drill from "the run failed" to "this tool call returned the wrong thing on turn 7."
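The five steps above can be sketched with a small context-manager tracer. This is one possible shape under stated assumptions, not a reference implementation: the `span` and `render` functions, the `Span` fields, and the sample prompt, tool, and model names are all hypothetical.

```python
from __future__ import annotations

import contextvars
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator

# Tracks the innermost open span so new spans know their parent.
_current = contextvars.ContextVar("current_span", default=None)


@dataclass
class Span:
    name: str
    session_id: str
    inputs: dict[str, Any] = field(default_factory=dict)
    outputs: dict[str, Any] = field(default_factory=dict)
    meta: dict[str, Any] = field(default_factory=dict)
    children: list[Span] = field(default_factory=list)


@contextmanager
def span(name: str, session_id: str | None = None, **inputs: Any) -> Iterator[Span]:
    """Step 1: open a span for a unit of work; step 2: keep its inputs."""
    parent = _current.get()
    if parent is None and session_id is None:
        raise ValueError("the root span needs a session_id")
    s = Span(name, session_id or parent.session_id, inputs=inputs)
    if parent is not None:
        parent.children.append(s)  # step 3: nest under the parent
    token = _current.set(s)
    started = time.perf_counter()
    try:
        yield s
    except Exception as exc:
        s.meta["error_class"] = type(exc).__name__  # step 4: error metadata
        raise
    finally:
        s.meta["latency_s"] = time.perf_counter() - started  # step 4: latency
        _current.reset(token)


def render(s: Span, depth: int = 0) -> list[str]:
    """Step 5: flatten the tree into indented lines a human can scan."""
    line = "  " * depth + s.name
    if "error_class" in s.meta:
        line += f" [error: {s.meta['error_class']}]"
    lines = [line]
    for c in s.children:
        lines.extend(render(c, depth + 1))
    return lines


with span("run", session_id="sess-1", task="look up order") as run:
    with span("turn 1"):
        with span("model call", prompt="Find order 42") as call:
            call.outputs["text"] = "call lookup_order(42)"
            call.meta.update(model="example-model", input_tokens=30, output_tokens=8)
        with span("tool: lookup_order", order_id=42) as tool:
            tool.outputs["result"] = {"status": "shipped"}
    run.outputs["halt_reason"] = "final_answer"

print("\n".join(render(run)))
```

Using a context variable for the current span means call sites never pass a parent explicitly, and child spans inherit the session id from the root.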

Key Characteristics

  • The trace is the ground truth — what the agent actually saw and did, not what you assume it did. Nondeterminism makes this indispensable.
  • Inputs and outputs, not just spans — timing alone tells you it was slow; the captured prompt and output tell you why it was wrong.
  • Hierarchy mirrors execution — nested spans let you follow delegation and retries instead of staring at a flat log.
  • Correlated by session — a stable session/run id ties traces, logs, costs, and the durable record together.
  • The foundation for everything else in this group — metrics, cost, and replay all derive from well-structured traces.
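To make the last point concrete: once each span carries token counts, run-level metrics are a fold over the tree. The sketch below assumes a nested-dict trace shape and a made-up price table; neither comes from a real provider or harness.

```python
from __future__ import annotations

from typing import Any

# Hypothetical prices in USD per 1,000 tokens; real rates vary by provider and model.
PRICE_PER_1K = {"example-model": {"input": 0.50, "output": 1.50}}


def rollup(span: dict[str, Any]) -> dict[str, float]:
    """Sum tokens and estimated cost over a span and all of its descendants."""
    totals = {
        "input_tokens": span.get("input_tokens", 0),
        "output_tokens": span.get("output_tokens", 0),
        "cost_usd": 0.0,
    }
    price = PRICE_PER_1K.get(span.get("model"))
    if price is not None:  # cost is only estimated for models with a known price
        totals["cost_usd"] = (
            totals["input_tokens"] * price["input"]
            + totals["output_tokens"] * price["output"]
        ) / 1000
    for child in span.get("children", []):
        for key, value in rollup(child).items():
            totals[key] += value
    return totals


trace = {
    "name": "run",
    "children": [
        {"name": "turn 1", "children": [
            {"name": "model call", "model": "example-model",
             "input_tokens": 1000, "output_tokens": 200},
            {"name": "tool: search"},
        ]},
        {"name": "turn 2", "children": [
            {"name": "model call", "model": "example-model",
             "input_tokens": 2000, "output_tokens": 400},
        ]},
    ],
}

totals = rollup(trace)
```

Because `rollup` works on any subtree, the same function answers "what did this run cost?" and "what did turn 2 cost?".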

Pitfalls

  • Logging only the final answer — without per-turn traces, a wrong result is unexplainable and unfixable.
  • Timing without payloads — knowing a model call took 4s doesn't help when the bug is in what you put in the prompt.
  • Flat, uncorrelated logs — lines with no span hierarchy or run id can't be reassembled into what actually happened.
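The last pitfall can be shown from the other direction: a flat list of records can be rebuilt into a tree only when each record carries a run id and a parent span id. A minimal sketch, assuming hypothetical `run_id`, `span_id`, and `parent_id` fields:

```python
from __future__ import annotations

from typing import Any


def build_tree(records: list[dict[str, Any]], run_id: str) -> list[dict[str, Any]]:
    """Return the root spans of one run, each with a nested 'children' list."""
    # Records without a matching run id or without a span id cannot be placed.
    nodes = {
        r["span_id"]: {**r, "children": []}
        for r in records
        if r.get("run_id") == run_id and "span_id" in r
    }
    roots = []
    for node in nodes.values():
        parent = nodes.get(node.get("parent_id"))
        if parent is not None:
            parent["children"].append(node)
        else:
            roots.append(node)
    return roots


records = [
    {"run_id": "r1", "span_id": "a", "parent_id": None, "name": "run"},
    {"run_id": "r1", "span_id": "c", "parent_id": "b", "name": "tool: search"},
    {"run_id": "r1", "span_id": "b", "parent_id": "a", "name": "turn 7"},
    {"run_id": "r2", "span_id": "z", "parent_id": None, "name": "other run"},
    {"name": "tool returned 0 results"},  # no ids: belongs to no tree
]

roots = build_tree(records, "r1")
```

The records arrive out of order and interleaved with another run, yet the ids are enough to recover the hierarchy. The final record has no ids, so nothing ties it to a run or a turn.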