Tracing & Spans
An agent run is a tree of decisions: turns, model calls, tool dispatches, sub-agents, retries. Tracing records that tree as structured spans so you can see exactly what happened, in what order, and why. Without it, debugging an agent is reading tea leaves; with it, every run is a transcript you can walk.
When an agent does something baffling, the only question that matters is "what did it actually see and do?" A trace answers it. Agents are nondeterministic enough that observability is not optional — it's the difference between fixing a problem and re-rolling the dice.
Structure
Each unit of work is a span, nested under its parent. A run's trace is the whole tree — turns, model calls, tools, sub-agents, and retries — with inputs, outputs, and timing on each.
This span schema doesn't have to be invented per-harness — there's a converging open standard for it.
How It Works
- Span every unit of work — open a span for the run, each turn, each model call, each tool dispatch, each sub-agent.
- Capture inputs and outputs — record the assembled prompt, the model's raw output, tool arguments and results, and the halt reason — the actual data, not just timing.
- Preserve the hierarchy — nest spans so a sub-agent's trace sits under the turn that spawned it, keyed to the session id.
- Attach metadata — tokens, cost, model, latency, and error class on each span, feeding metrics and cost accounting.
- Make it navigable — surface the trace as a walkable tree so a human can drill from "the run failed" to "this tool call returned the wrong thing on turn 7."
Key Characteristics
- The trace is the ground truth — what the agent actually saw and did, not what you assume it did. Nondeterminism makes this indispensable.
- Inputs and outputs, not just spans — timing alone tells you it was slow; the captured prompt and output tell you why it was wrong.
- Hierarchy mirrors execution — nested spans let you follow delegation and retries instead of staring at a flat log.
- Correlated by session — a stable session/run id ties traces, logs, costs, and the durable record together.
- The foundation for everything else in this group — metrics, cost, and replay all derive from well-structured traces.
Pitfalls
- Logging only the final answer — without per-turn traces, a wrong result is unexplainable and unfixable.
- Timing without payloads — knowing a model call took 4s doesn't help when the bug is in what you put in the prompt.
- Flat, uncorrelated logs — lines with no span hierarchy or run id can't be reassembled into what actually happened.