
Streaming & Steering

A long run that a human can only watch finish is a black box, and black boxes don't get trusted or corrected. The harness has to make a running agent observable and steerable while it runs — emitting what it's doing as it does it, accepting a stop without corrupting state, and taking mid-course guidance without throwing away the work so far. The principle underneath all three: humans steer, agents execute — and the harness's job is to widen the bandwidth of that steering.

The user is a node in the agentic loop, not a spectator outside it. Streaming is how the loop talks to them; interruption and steering are how they talk back — at the loop's natural boundaries, without tearing it down.

events, not tokens
stream semantic milestones; tokens are just for liveness
cooperative
past the loop boundary, only the tool can stop itself
interrupt = checkpoint
a clean stop leaves a state you can resume or steer from
undo = approval
what can’t be undone is what must pause for a human

Stream events, not just tokens

Raw token deltas prove the agent is alive, but they bury the human in noise. A good harness layers a small, typed event vocabulary over the token stream so the UI can show structure, and so that multiple clients (CLI, IDE, web) can consume one stable protocol instead of reverse-engineering the internal loop.


Deltas

liveness

Token-by-token text and reasoning, and partial tool arguments as they generate. Shows the agent is working; not the thing to build UX on.

Lifecycle events

structure

Typed milestones: turn started, tool called, tool output, plan updated, running diff updated, active agent changed on handoff. This is what users actually read.

Terminal events

authoritative truth

A single completed / failed / interrupted event per turn, carrying the final state so the UI never has to reconstruct truth from deltas.

Deltas for liveness, lifecycle events for structure, exactly one terminal event for truth. Streaming the running diff and plan status is what lets a human supervise a long turn live instead of waiting for it to land.
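
The three layers can be sketched as a small typed vocabulary plus a client-side reducer. This is a minimal illustration, not any shipping harness's protocol: the class names (`Delta`, `Lifecycle`, `Terminal`, `TurnView`) and the field names are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional, Union


@dataclass(frozen=True)
class Delta:
    """Liveness only: a fragment of text, reasoning, or partial tool arguments."""
    turn_id: str
    text: str


@dataclass(frozen=True)
class Lifecycle:
    """A typed milestone the UI can render as structure."""
    turn_id: str
    kind: Literal["turn_started", "tool_called", "tool_output",
                  "plan_updated", "diff_updated", "agent_changed"]
    payload: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Terminal:
    """Exactly one per turn; carries the final state."""
    turn_id: str
    status: Literal["completed", "failed", "interrupted"]
    final_state: dict = field(default_factory=dict)


Event = Union[Delta, Lifecycle, Terminal]


class TurnView:
    """Hypothetical client-side view of one turn."""

    def __init__(self) -> None:
        self.preview = ""                 # built from deltas, disposable
        self.milestones: list[str] = []   # built from lifecycle events
        self.terminal: Optional[Terminal] = None

    def apply(self, event: Event) -> None:
        if self.terminal is not None:
            # Anything after the terminal event is a protocol violation.
            raise ValueError("event received after terminal event")
        if isinstance(event, Delta):
            self.preview += event.text
        elif isinstance(event, Lifecycle):
            self.milestones.append(event.kind)
        else:
            self.terminal = event         # truth comes from here, not from deltas
```

A client would read `view.terminal.final_state` for the result and treat `view.preview` as a progress indicator only.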

Two design rules make this robust. Treat the event schema as a product boundary — a small, versioned vocabulary, not a firehose of internal events — so it can stay stable while the loop underneath changes. And complete the stream server-side even on disconnect: consume it to the end so persistence and session state finalize, then let a reload recover cleanly, distinguishing in-progress from complete.
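
The second rule can be sketched as a pump that keeps consuming the event stream after the client has gone away. This is a simplified illustration under assumptions: events are plain dicts, `send` raises `ConnectionError` on disconnect, and `persist` stands in for whatever durable store the harness uses; all three names are hypothetical.

```python
from typing import Callable, Iterable


def pump(events: Iterable[dict],
         send: Callable[[dict], None],
         persist: Callable[[dict], None]) -> str:
    """Drain `events` to the end, persisting each one even after a disconnect.

    Returns the terminal status, or "in_progress" if no terminal event was seen,
    so a reloading client can tell a finished turn from an unfinished one.
    """
    connected = True
    status = "in_progress"
    for event in events:
        persist(event)                      # persistence never depends on the client
        if event.get("type") == "terminal":
            status = event["status"]
        if connected:
            try:
                send(event)
            except ConnectionError:
                connected = False           # stop sending, keep consuming
    return status
```

The design choice is that the consumer loop is owned by the server, so a dropped socket changes only who is listening, not whether the turn finalizes.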

This layering is how the shipping harnesses work. Both Claude Code and OpenAI Codex stream model output token-by-token for liveness, but surface tool calls as discrete events the user can read — and interrupt — rather than burying them in the text.


Interrupt cleanly

A stop has to leave the system in a state you can trust. The harness checks for an interrupt automatically at the boundary between tool calls. Once a tool callback is already running, though, cancellation is cooperative: the harness can't safely tear a running tool down from the outside, so the tool must forward the cancellation signal to its own in-flight work and stop itself. Anything that mutates state needs an idempotency key so a retry after cancel doesn't double-fire.
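
Both halves of that paragraph can be sketched in a few lines. The example assumes the cancellation signal is a `threading.Event` and that the side-effect target deduplicates by key; `copy_rows` and `Ledger` are hypothetical names invented for the illustration.

```python
import threading


def copy_rows(rows, out, cancel: threading.Event) -> str:
    """A tool that checks the cancel signal between units of work and stops itself."""
    for row in rows:
        if cancel.is_set():
            return "interrupted"   # stop at a point where `out` is still consistent
        out.append(row)
    return "completed"


class Ledger:
    """Hypothetical side-effect target that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self.applied: dict[str, int] = {}

    def charge(self, key: str, amount: int) -> int:
        # The first call with a given key wins; a retry with the same key is a no-op.
        return self.applied.setdefault(key, amount)
```

A retry after a cancelled call would reuse the same key, so `Ledger.charge` records the charge once no matter how many times it is attempted.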

However the stop arrives, the turn reaches exactly one terminal state and leaves a checkpoint behind — an append-only transcript plus a snapshot of what changed. That recoverable state is what makes interruption cheap: you're not discarding the run, you're pausing it somewhere you can pick back up. Interruptibility falls out of designing for context loss in the first place.

This is the model Claude Code ships: Esc interrupts mid-turn, and the cancellation is cooperative. The running tool gets the signal and stops itself, rather than being killed out from under the turn. Codex makes the same guarantee at a different distance: tasks run asynchronously in a cloud sandbox, and a run in flight can still be steered or cancelled.


Steer without restarting

The most valuable interaction isn't stopping — it's redirecting without losing progress. Harnesses offer two speeds, and the difference is whether the in-flight tool call survives:

Two speeds of steering: queue vs. interrupt

  • Queue and redirect: Send guidance without stopping; the agent reads it at the next loop boundary, after the current tool action completes, and adjusts its next decision. The in-flight work is unharmed; the cost is that the agent may do a little wrong-direction work before it reads you.
  • Interrupt and redirect: Hard-stop the active turn, then steer from the resulting checkpoint. Costs the in-flight tool call, buys immediacy when the agent is clearly heading the wrong way.

Steering is a continuation of the conversation, not an exception path — append input into the active turn (guarded against steering a stale one), and no restart is needed.
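
The queue-and-redirect path can be sketched as a loop that drains an inbox only at its boundary. This is a toy model under stated assumptions: a "plan" is just a list of step names, guidance replaces the remaining plan outright, and the `Turn` class and its `during_step` hook are hypothetical.

```python
from collections import deque


class Turn:
    """Toy agent loop: runs steps in order, reads guidance only between steps."""

    def __init__(self, steps):
        self.steps = deque(steps)
        self.inbox = deque()   # guidance queued by the human mid-run
        self.log = []

    def steer(self, new_steps):
        """Queue guidance without stopping anything."""
        self.inbox.append(list(new_steps))

    def run(self, during_step=None):
        while self.steps:
            step = self.steps.popleft()
            if during_step:
                during_step(self, step)      # simulates input arriving mid-step
            self.log.append(f"ran {step}")   # the in-flight step always finishes
            # Loop boundary: apply any queued guidance before choosing the next step.
            while self.inbox:
                self.steps = deque(self.inbox.popleft())
                self.log.append("steered")
        return self.log
```

Guidance sent while the first step is running takes effect only after that step has been logged as finished, which is the trade the queue path makes.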

Steerability is gated on architecture: it requires a streaming-input session — a long-lived loop you can feed messages into mid-run — not a one-shot call. And because a resumed run replays its current step from the top, side effects before a steer/interrupt point must be idempotent, or moved after it. When you do need to back up further than the last boundary, checkpoint history gives you rewind (replay from an earlier checkpoint) and fork (branch a new line from that point, leaving the original intact).
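
Rewind and fork can be sketched over a list of checkpoints. This is one possible reading, not a description of any particular harness: `History` is a hypothetical class, checkpoints are plain dicts, and rewind is modelled as truncating the current line back to the chosen checkpoint.

```python
import copy


class History:
    """Hypothetical checkpoint history for one line of work."""

    def __init__(self, checkpoints=None):
        self.checkpoints = list(checkpoints or [])

    def commit(self, state: dict) -> int:
        """Store a snapshot and return its index."""
        self.checkpoints.append(copy.deepcopy(state))
        return len(self.checkpoints) - 1

    def rewind(self, index: int) -> dict:
        """Back this line up to `index` and return the state to replay from."""
        del self.checkpoints[index + 1:]
        return copy.deepcopy(self.checkpoints[index])

    def fork(self, index: int) -> "History":
        """Branch a new line from `index`; the original history is left intact."""
        return History(copy.deepcopy(self.checkpoints[: index + 1]))
```

Deep copies keep a branch from mutating state it shares with its parent.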


Approval is steering's special case

A pause for human approval is just a harness-initiated interrupt: the loop hits a consequential action, emits an approval request (with the command or diff preview and a reason), and durably waits — holding no process — until the human answers. The boundary for which actions pause has a clean rule: what can't be undone is what must be approved. File edits can be checkpointed and reverted, so they can run and be rolled back; a deploy, a payment, an outbound message touches the world beyond the harness and can't be un-rung — so it stops for a human first. The undo boundary defines the approval boundary.
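
The undo rule reduces to a small gate. The sketch assumes each action declares whether it is reversible and that `pending` stands in for a durable store of approval requests; `Action`, `gate`, and the record fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    preview: str       # the command or diff shown to the human
    reversible: bool   # True if a checkpoint can undo it


def gate(action: Action, pending: list) -> str:
    """Run reversible actions; park irreversible ones as approval requests."""
    if action.reversible:
        return "run"
    # Record the request and return; nothing blocks in-process while the human decides.
    pending.append({
        "action": action.name,
        "preview": action.preview,
        "reason": "cannot be undone",
    })
    return "awaiting_approval"
```

A file edit would pass straight through, while a deploy would land in `pending` until a human answers.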


Pitfalls

  • Streaming the firehose — piping every internal event to the user is as useless as streaming nothing. Curate a small lifecycle vocabulary; tokens are liveness, not UX.
  • Assuming hard cancellation — believing the harness can kill a running tool mid-execution. Past the loop boundary it can't; build cooperative cancellation and idempotency in from the start.
  • Non-idempotent pre-interrupt side effects — since resume replays the current step, a side effect that ran before the pause runs again. Make it idempotent or move it after the interrupt point.
  • Steering a stale turn — applying guidance to a turn that already advanced. Guard mid-run input with a compare-and-swap against the active turn id.
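
The stale-turn guard from the last pitfall can be sketched as a compare-then-append under a lock, which makes the same check a compare-and-swap would. The `Session` class and its method names are assumptions for illustration only.

```python
import threading


class Session:
    """Hypothetical session that accepts mid-run input only for the active turn."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.active_turn_id = None
        self.inbox: list[str] = []

    def start_turn(self, turn_id: str) -> None:
        with self._lock:
            self.active_turn_id = turn_id
            self.inbox = []

    def steer(self, expected_turn_id: str, message: str) -> bool:
        """Append guidance only if the caller's turn id is still the active one."""
        with self._lock:
            if self.active_turn_id != expected_turn_id:
                return False   # the turn already advanced; reject instead of misapplying
            self.inbox.append(message)
            return True
```

A client that gets `False` back would refresh its view of the session and decide whether the guidance still applies to the new turn.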