Skip to main content

Error Recovery

Things fail mid-run: the model emits malformed JSON, a tool times out, an API returns 503, a schema doesn't validate. Error recovery is the harness logic that keeps the loop alive through these without derailing it — deciding what to retry, what to surface to the model, and what is fatal.

The difference between a demo and a production harness is almost entirely error handling. The happy path is easy; the value is in turning a transient 503 into a retried success and a malformed tool call into a corrected one — invisibly.


Structure

Errors are classified, not treated uniformly. Transient errors retry, malformed output is handed back to the model to self-correct, and fatal errors abort with a reason.


How It Works

  1. Validate everything — tool arguments against schemas, tool calls against the registry, model output against the expected shape. Catch errors at the boundary, not three turns later.
  2. Classify the failure — transient (network, rate limit, timeout), correctable (malformed args, schema miss), or fatal (auth failure, missing capability).
  3. Retry transient with backoff — exponential backoff plus jitter, capped at a retry budget. Idempotency matters: don't re-run a non-idempotent side effect blindly.
  4. Reflect correctable errors — feed the validation error back to the model as an observation so it can fix the call. This is Reflection applied to tool use.
  5. Abort fatal errors cleanly — stop, record the cause, return partial progress. Don't loop on something that can't succeed.

Key Characteristics

  • Classification drives behavior — the same retry logic for every error is wrong: retrying an auth failure wastes budget, aborting on a transient blip wastes a run.
  • Model-fixable vs. harness-fixable — malformed output goes back to the model; infrastructure failures are the harness's job. Don't ask the model to fix a network partition.
  • Retries consume the budget — every retry counts against the run's budget. Recovery is not free, and infinite retry is just an infinite loop in disguise.
  • Circuit breakers stop cascades — when a dependency is clearly down, fail fast for a cooldown window instead of retrying every call into the void.
  • Idempotency is a precondition — safe retries require that re-running an action is harmless, or that you track what already happened.

There's also a subtler failure than a crashed loop: recovering into a false success. Anthropic's work on harnesses for long-running agents describes a later agent instance that looks around a half-finished workspace, sees that progress was made, and declares the job done. Nothing failed, nothing threw, the classifier above never fires — the run simply resumed from ambiguous state and took the optimistic reading. The guard is the same discipline that makes retries safe: record what actually happened. When completion is a written artifact rather than an inference from the wreckage, a recovering agent can't mistake "someone got partway" for "finished."


Pitfalls

  • Blind retry of side effects — retrying a "send email" or "deploy" tool without idempotency keys sends twice, deploys twice.
  • Treating malformed output as fatal — aborting a run because the model returned slightly-off JSON throws away a run that one reflective retry would have saved.
  • Unbounded retries — backoff without a cap turns a flaky dependency into a hung, expensive run.