Replay & Online Evaluation

The final loop: recorded runs become the input to improving the system. Replay re-runs captured traces against a changed harness to catch regressions before they ship; online evaluation scores live runs continuously so quality is a tracked signal, not a guess.

Every production run is a free test case you already paid to generate. Replay lets you ask "would my change have broken this real run?" before deploying it. Online evals let you ask "is the agent good right now?" continuously. Together they close the loop from observation back to improvement.


Structure

Captured runs feed both an offline regression gate (replay before shipping) and a live quality signal (online scoring), and notable live runs become new replay cases.
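One way to picture this structure in code is the minimal sketch below. Everything in it is an assumption made for illustration: the `CapturedRun` schema, the `EvalLoop` class, the `sample_rate` parameter, and the 0.5 "notable" threshold are hypothetical names and values, not part of any specific tool.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class CapturedRun:
    """One recorded production run (hypothetical schema)."""
    run_id: str
    inputs: dict
    outcome: str                   # e.g. "success" or "failure"
    score: Optional[float] = None  # filled in if the run is sampled for scoring


@dataclass
class EvalLoop:
    """Routes captured runs to the live quality signal and the replay set."""
    sample_rate: float = 0.1       # fraction of live runs to score (assumed)
    notable_below: float = 0.5     # score threshold for "notable" (assumed)
    golden_set: list = field(default_factory=list)
    live_scores: list = field(default_factory=list)

    def observe(self, run: CapturedRun,
                scorer: Callable[[CapturedRun], float],
                rng: random.Random) -> None:
        # Online path: score a random sample of live runs.
        if rng.random() < self.sample_rate:
            run.score = scorer(run)
            self.live_scores.append(run.score)
        # Feedback path: failures and low-scoring runs become replay cases.
        low_score = run.score is not None and run.score < self.notable_below
        if run.outcome == "failure" or low_score:
            self.golden_set.append(run)
```

The `golden_set` list is what an offline replay gate would consume; `live_scores` is the raw series behind a quality trend line.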


How It Works

  1. Capture runs as cases — persist traces with their inputs and outcomes; the interesting ones (failures, edge cases) become a golden set.
  2. Replay against changes — when the prompt, model, or harness changes, re-run captured cases and compare outcomes to the baseline.
  3. Gate on regressions — wire replay into the pipeline so a change that breaks previously-good runs can't ship silently.
  4. Score live runs — sample production runs and score them with automated checks and an LLM-as-Judge, so quality is observed continuously, not just pre-merge.
  5. Feed the loop — promote notable live runs into the golden set, so the suite hardens every time something interesting happens.
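Steps 2 and 3 can be sketched as a small replay function plus a gate that maps regressions to a process exit code. This is an illustrative sketch under stated assumptions: `Case`, `replay`, and `gate` are hypothetical names, the harness is modeled as a plain callable from inputs to output, and "outcome" is reduced to a pass/fail check.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """A captured run promoted to a replay case (hypothetical schema)."""
    case_id: str
    inputs: dict
    baseline_passed: bool  # outcome recorded when the run was captured


def replay(cases: List[Case],
           harness: Callable[[dict], str],
           check: Callable[[Case, str], bool]) -> List[str]:
    """Re-run each case on the candidate harness; return ids that regressed."""
    regressions = []
    for case in cases:
        output = harness(case.inputs)
        passed_now = check(case, output)
        # A regression is a case that passed at baseline and fails now.
        if case.baseline_passed and not passed_now:
            regressions.append(case.case_id)
    return regressions


def gate(regressions: List[str]) -> int:
    """Exit code for a pipeline step: nonzero blocks the change."""
    return 1 if regressions else 0
```

In a pipeline, the script would end with something like `sys.exit(gate(replay(cases, harness, check)))`, so a change that breaks a previously good run fails the build instead of shipping silently. Cases that already failed at baseline are deliberately not counted as regressions here; tracking them as "fixed" or "still failing" is a separate report.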

Key Characteristics

  • Production is your test set — real captured runs are higher-signal cases than anything you'd invent. Mining them is nearly free.
  • Replay needs determinism — comparing a replayed run to its baseline requires controlled inputs (for example a pinned model version and recorded tool results), the same property durability depends on.
  • Two timescales — replay gates before shipping; online evals watch after. You need both — one prevents known regressions, the other catches unknown drift.
  • Hill-climb, don't guess — replay turns "this prompt feels better" into a measured before/after, the discipline that lets a harness improve predictably.
  • The loop hardens itself — every escaped failure promoted into the suite becomes a case the replay gate checks from then on, so the same regression is far less likely to ship twice.
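The determinism requirement is the one most easily shown in code. A common approach, sketched here under assumptions, is to serve tool results from the recording during replay instead of calling live tools. `RecordedTools` and the recording format (a list of `name` / `args` / `result` entries) are hypothetical, and the sketch assumes each tool returns the same result for the same arguments within a run.

```python
import json
from typing import Any, Dict, List


class RecordedTools:
    """Serves tool results from a captured trace so replays see fixed inputs."""

    def __init__(self, recording: List[Dict[str, Any]]):
        # Index recorded results by tool name plus canonicalized arguments.
        # Assumption: within one run, identical calls returned identical results.
        self._results = {
            self._key(entry["name"], entry["args"]): entry["result"]
            for entry in recording
        }

    @staticmethod
    def _key(name: str, args: Dict[str, Any]) -> str:
        # sort_keys makes the key independent of argument order.
        return name + ":" + json.dumps(args, sort_keys=True)

    def call(self, name: str, **args: Any) -> Any:
        key = self._key(name, args)
        if key not in self._results:
            # The changed harness made a call the original run never made.
            # Fail loudly rather than fall through to a live tool.
            raise KeyError(f"no recorded result for {key}")
        return self._results[key]
```

Raising on an unrecorded call is a design choice: it turns "the new prompt takes a different path" into an explicit signal, which the replay report can surface separately from a wrong final answer.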

For designing the suite itself — graders, tasks, environments — Anthropic's "Demystifying evals for AI agents" is the reference on building agent evals that measure what you think they measure.


Pitfalls

  • Shipping harness changes without replay — a prompt or model tweak that silently breaks a class of runs is the Vibe Deployment anti-pattern at the harness level.
  • Online evals that don't reflect real quality — a judge no one calibrated against humans produces a trend line nobody should trust.
  • Capturing traces but never mining them — sitting on a goldmine of real cases while testing against toy inputs is the Happy Path Mirage.
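For the second pitfall, one simple calibration check is to have humans label a sample of runs, run the judge on the same sample, and measure agreement corrected for chance. The sketch below computes Cohen's kappa by hand; the function name and the pass/fail labels are illustrative, and what counts as "good enough" agreement is a judgment call the sketch does not make.

```python
from collections import Counter
from typing import List


def cohens_kappa(human: List[str], judge: List[str]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Observed agreement: fraction of items where both gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if both labeled independently at their own base rates.
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 means the judge tracks the human labels closely; a value near 0 means it agrees no more often than chance would, in which case its trend line says little about real quality. Raw percent agreement alone can look high when almost every run passes, which is why the chance correction matters.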