Error Recovery
Things fail mid-run: the model emits malformed JSON, a tool times out, an API returns 503, a schema doesn't validate. Error recovery is the harness logic that keeps the loop alive through these without derailing it — deciding what to retry, what to surface to the model, and what is fatal.
The difference between a demo and a production harness is almost entirely error handling. The happy path is easy; the value is in turning a transient 503 into a retried success and a malformed tool call into a corrected one — invisibly.
Structure
Errors are classified, not treated uniformly. Transient errors retry, malformed output is handed back to the model to self-correct, and fatal errors abort with a reason.
How It Works
- Validate everything — tool arguments against schemas, tool calls against the registry, model output against the expected shape. Catch errors at the boundary, not three turns later.
- Classify the failure — transient (network, rate limit, timeout), correctable (malformed args, schema miss), or fatal (auth failure, missing capability).
- Retry transient with backoff — exponential backoff plus jitter, capped at a retry budget. Idempotency matters: don't re-run a non-idempotent side effect blindly.
- Reflect correctable errors — feed the validation error back to the model as an observation so it can fix the call. This is Reflection applied to tool use.
- Abort fatal errors cleanly — stop, record the cause, return partial progress. Don't loop on something that can't succeed.
Key Characteristics
- Classification drives behavior — the same retry logic for every error is wrong: retrying an auth failure wastes budget, aborting on a transient blip wastes a run.
- Model-fixable vs. harness-fixable — malformed output goes back to the model; infrastructure failures are the harness's job. Don't ask the model to fix a network partition.
- Retries consume the budget — every retry counts against the run's budget. Recovery is not free, and infinite retry is just an infinite loop in disguise.
- Circuit breakers stop cascades — when a dependency is clearly down, fail fast for a cooldown window instead of retrying every call into the void.
- Idempotency is a precondition — safe retries require that re-running an action is harmless, or that you track what already happened.
There's also a subtler failure than a crashed loop: recovering into a false success. Anthropic's work on harnesses for long-running agents describes a later agent instance that looks around a half-finished workspace, sees that progress was made, and declares the job done. Nothing failed, nothing threw, the classifier above never fires — the run simply resumed from ambiguous state and took the optimistic reading. The guard is the same discipline that makes retries safe: record what actually happened. When completion is a written artifact rather than an inference from the wreckage, a recovering agent can't mistake "someone got partway" for "finished."
Pitfalls
- Blind retry of side effects — retrying a "send email" or "deploy" tool without idempotency keys sends twice, deploys twice.
- Treating malformed output as fatal — aborting a run because the model returned slightly-off JSON throws away a run that one reflective retry would have saved.
- Unbounded retries — backoff without a cap turns a flaky dependency into a hung, expensive run.