Durability & Resume
Long-running agents outlive the processes that start them. A run can take minutes or hours, span a deploy, or pause for a human approval that arrives tomorrow. Durability is how the harness survives that: checkpointing progress so a run can pause, crash, and resume exactly where it left off — without redoing work.
A stateless agent that crashes after step 40 of 50 starts over and pays for steps 1–40 twice — if it can recover at all. A durable one reloads its journal, skips the 40 completed steps instantly, and continues from 41. For anything long or expensive, this is the difference between viable and not.
Structure
Each completed step is journaled. On resume, the harness replays the journal — completed steps return cached results instantly — and only the first incomplete step onward runs live.
How It Works
- Checkpoint at step boundaries — after each completed step (a turn, a tool call, a sub-agent), append its result to a durable journal keyed to the session.
- Make steps addressable — each step has a stable identity so a replay can match a completed step to its cached result.
- Resume by replay — on restart, walk the journal: return cached results for completed steps, then run live from the first incomplete step onward.
- Pause as a first-class state — a run waiting on human approval or an external event suspends durably and wakes when the event arrives, holding no process.
- Handle determinism — replay assumes steps are reproducible; isolate nondeterministic inputs (time, randomness) so a resumed run matches the original.
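The checkpoint, addressing, and replay steps above can be sketched in a few lines. This is a minimal illustration, not a production design: the `Journal` class, the JSON-lines file format, the `step-N` identifiers, and the toy `run_workflow` are all assumptions made for the example. It also assumes results are JSON-serializable, that there is a single writer per session, and that every journal line was written completely (torn writes are not handled).

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict, List, Optional


class Journal:
    """Append-only JSON-lines journal for one session (hypothetical design)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.results: Dict[str, Any] = {}
        # On startup, reload every completed step so replay can skip it.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    entry = json.loads(line)
                    self.results[entry["step_id"]] = entry["result"]

    def run_step(self, step_id: str, fn: Callable[[], Any]) -> Any:
        """Return the cached result if step_id is journaled, else run fn live."""
        if step_id in self.results:
            return self.results[step_id]
        result = fn()  # must be JSON-serializable in this sketch
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"step_id": step_id, "result": result}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # checkpoint at the step boundary
        self.results[step_id] = result
        return result


def run_workflow(journal: Journal, executed: List[str],
                 crash_before: Optional[int] = None) -> int:
    """Five toy steps; `executed` records which ones actually ran live."""
    total = 0
    for i in range(1, 6):
        if crash_before == i:
            raise RuntimeError("simulated crash")

        def step(i: int = i) -> int:
            executed.append(f"step-{i}")
            return i * 10

        # The stable step id is what lets a replay match cached results.
        total += journal.run_step(f"step-{i}", step)
    return total


def demo() -> tuple:
    path = os.path.join(tempfile.mkdtemp(), "session.jsonl")
    first: List[str] = []
    try:
        run_workflow(Journal(path), first, crash_before=4)
    except RuntimeError:
        pass  # the "process" died after three completed steps
    second: List[str] = []
    total = run_workflow(Journal(path), second)  # a fresh Journal reloads the file
    return first, second, total
```

In this sketch, `demo()` returns `(['step-1', 'step-2', 'step-3'], ['step-4', 'step-5'], 150)`: the second run executes only steps 4 and 5, while steps 1 to 3 come back from the journal.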
Key Characteristics
- Journaled steps, not periodic snapshots — recording each step's result lets resume skip exactly the completed work, not "the last save point."
- Resume is replay — completed steps return instantly from the journal; only new work executes. Same inputs, same prefix, zero rework.
- Pause without holding resources — a durable suspend means a run waiting a day for approval consumes nothing while it waits.
- Determinism is a precondition — uncontrolled time/randomness in steps breaks replay; pass them in or stamp them so the prefix stays stable.
- Durability underpins long-horizon agents — without it, run length is capped by process uptime.
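One way to picture "pause without holding resources" is a run whose entire state lives in storage, so that nothing needs to stay in memory while it waits. The sketch below is an illustration under assumptions: the `advance` function, the JSON state file, the three-stage draft/approval/publish flow, and the event shape (`name`, `payload`) are all hypothetical. A real harness would more likely use a database or a durable-execution engine.

```python
import json
import os
import tempfile
from typing import Any, Dict, Optional


def load_state(path: str) -> Dict[str, Any]:
    if not os.path.exists(path):
        return {"status": "running", "waiting_for": None, "steps": {}, "events": {}}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_state(path: str, state: Dict[str, Any]) -> None:
    # Write to a temp file, then rename, so the state file is replaced atomically.
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def advance(path: str, event: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Run as far as possible, then persist. Any process can call this later."""
    state = load_state(path)
    if state["status"] == "suspended":
        if event is None or event["name"] != state["waiting_for"]:
            return state  # still waiting; nothing runs and nothing is held
        state["events"][event["name"]] = event["payload"]
        state["status"] = "running"
        state["waiting_for"] = None
    if state["status"] == "running":
        if "draft" not in state["steps"]:
            state["steps"]["draft"] = "draft v1"
        if "approval" not in state["events"]:
            # Suspend durably: record what the run is waiting for, then exit.
            state["status"] = "suspended"
            state["waiting_for"] = "approval"
        else:
            approved = state["events"]["approval"]
            state["steps"]["publish"] = (
                f"published {state['steps']['draft']} (approved={approved})"
            )
            state["status"] = "done"
    save_state(path, state)
    return state


def demo() -> tuple:
    path = os.path.join(tempfile.mkdtemp(), "run.json")
    a = advance(path)                                          # drafts, then suspends
    b = advance(path)                                          # no event yet
    c = advance(path, {"name": "approval", "payload": True})   # wakes and finishes
    return a["status"], b["status"], c["status"], c["steps"]["publish"]
```

Between the first and third calls no process has to stay alive. The approval could arrive a day later and be delivered by a different worker, because everything needed to continue is in the state file.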
None of this machinery needs inventing. Temporal is the canonical durable-execution engine — event-sourced workflow replay, exactly the journal-and-resume model above — and is widely used for resumable long-running jobs; an agent harness can sit on it rather than rebuild it. Anthropic reports the same lesson from running agents in production: long-running agents need durable execution that resumes from where errors occurred rather than restarting, and they use rainbow deployments so an update never breaks agents already in flight.
Pitfalls
- No checkpoints — a crash loses the entire run and reruns everything, doubling cost or losing the work outright.
- Nondeterministic steps — calling now() or a random source mid-step makes replay diverge from the original run.
- Snapshot-only persistence — coarse snapshots force re-execution of everything since the last one; journal at step granularity instead.
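A common way to avoid the nondeterminism pitfall is to record time and random values the first time they are read, and to return the recorded values on replay. The sketch below assumes a hypothetical `ReplayContext` whose in-memory `recorded` dict stands in for the durable journal. The key names and the `plan_step` function are illustrative only.

```python
import random
import time
from typing import Any, Callable, Dict, Optional


class ReplayContext:
    """Captures nondeterministic inputs once, then replays them (hypothetical)."""

    def __init__(self, recorded: Optional[Dict[str, Any]] = None) -> None:
        # In a real harness these values would be loaded from the journal.
        self.recorded: Dict[str, Any] = dict(recorded or {})

    def capture(self, key: str, produce: Callable[[], Any]) -> Any:
        # First run: call the real source and stamp the value.
        # Replay: return the stamped value without calling the source.
        if key not in self.recorded:
            self.recorded[key] = produce()
        return self.recorded[key]

    def now(self, key: str) -> float:
        return self.capture(key, time.time)

    def random(self, key: str) -> float:
        return self.capture(key, random.random)


def plan_step(ctx: ReplayContext) -> Dict[str, float]:
    """A step that needs the time and a random jitter, read only via ctx."""
    started = ctx.now("plan.started_at")
    jitter = ctx.random("plan.jitter")
    return {"started_at": started, "retry_delay": round(1.0 + jitter, 6)}


def demo() -> bool:
    original = ReplayContext()
    first = plan_step(original)
    # Simulate a resume: a new context seeded with the recorded values.
    resumed = ReplayContext(original.recorded)
    second = plan_step(resumed)
    return first == second
```

Here `demo()` returns `True`, because the resumed run sees the same timestamp and jitter as the original. Calling `time.time()` or `random.random()` directly inside `plan_step` would give different values on replay.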