Streaming Output
Tokens are delivered to the client incrementally as they are generated, rather than waiting for the complete response. This reduces perceived latency (the user sees output immediately) and enables progressive rendering, early termination, and real-time status updates during agent execution.
Structure
The stream carries multiple event types: raw tokens (for progressive text rendering), tool call notifications (agent is using a tool), step completions (agent finished a phase), and the final result.
How It Works
Token streaming:
- Start generation — agent begins producing output
- Stream tokens — each token is sent to the client as it's generated
- Render progressively — client displays text as it arrives (the "typing" effect)
- Complete — final token signals end of generation
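The four steps above can be sketched with a plain Python generator standing in for the model; `generate_tokens` and `render_progressively` are hypothetical names, and the hard-coded token list is a stand-in for real model output:

```python
from typing import Iterator

def generate_tokens(prompt: str) -> Iterator[str]:
    # Stand-in for a model: yields one token at a time as it is "generated".
    for token in ["Streaming", " keeps", " perceived", " latency", " low."]:
        yield token

def render_progressively(prompt: str) -> str:
    # Client side: display each token as soon as it arrives (the "typing" effect).
    buffer = []
    for token in generate_tokens(prompt):
        print(token, end="", flush=True)  # progressive render
        buffer.append(token)
    print()  # generator exhaustion signals end of generation
    return "".join(buffer)
```

In a real client the generator would be replaced by an HTTP or SSE response iterator, but the consume-and-append loop is the same.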
Agent event streaming:
- Subscribe — client subscribes to the agent's event stream
- Receive events — tool calls, observations, step completions arrive as structured events
- Update UI — client shows agent progress (thinking, searching, writing...)
- Final result — last event contains the complete response
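A minimal sketch of the event-stream side, assuming events are dicts with a `type` field (the event names and `agent_events`/`consume` helpers are illustrative, not any particular framework's API):

```python
from typing import Iterator

def agent_events() -> Iterator[dict]:
    # Stand-in agent run emitting structured events in order.
    yield {"type": "tool_call", "tool": "search", "args": {"q": "streaming"}}
    yield {"type": "observation", "content": "3 results found"}
    yield {"type": "step_complete", "step": 1}
    yield {"type": "final", "content": "Here is what I found..."}

def consume(events: Iterator[dict]) -> str:
    # Client side: update the UI per event type; the last event carries the result.
    final = ""
    for event in events:
        if event["type"] == "tool_call":
            print(f"[agent] using tool: {event['tool']}")
        elif event["type"] == "step_complete":
            print(f"[agent] finished step {event['step']}")
        elif event["type"] == "final":
            final = event["content"]
    return final
```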
Transport:
- Server-Sent Events (SSE) — the de facto standard for LLM API streaming; unidirectional, runs over plain HTTP
- WebSockets — bidirectional, suited to interactive agents that accept input mid-stream
- HTTP chunked transfer — simplest, least overhead, but no built-in event framing or reconnection
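To make the SSE option concrete, here is a minimal parser for the SSE wire format, where each event is a run of `field: value` lines terminated by a blank line. This sketch handles only the `event:` and `data:` fields; a production parser would also handle `id:`, `retry:`, comment lines, and reconnection:

```python
from typing import Iterator, Tuple

def parse_sse(raw: str) -> Iterator[Tuple[str, str]]:
    # Yields (event_type, data) pairs from an SSE-formatted text stream.
    # Per the SSE format, a blank line marks the end of one event,
    # and multiple data: lines within an event are joined with newlines.
    event, data = "message", []
    for line in raw.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "":
            if data:
                yield event, "\n".join(data)
            event, data = "message", []

# Example wire payload, e.g. a token event followed by a done marker.
stream = "data: hello\n\nevent: done\ndata: [DONE]\n\n"
events = list(parse_sse(stream))
```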
Key Characteristics
- Low perceived latency — user sees output in milliseconds, not seconds
- Early termination — client can cancel if the output is going off-track
- Progress visibility — users see what the agent is doing in real time
- Complexity — streaming adds client-side buffering and event handling logic
- Structured output challenge — streaming partial JSON is tricky (needs buffering until valid)
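The structured-output point can be shown directly: accumulate chunks in a buffer and re-attempt a parse after each one, succeeding only once the JSON document is complete. The chunk boundaries here are invented for illustration:

```python
import json
from typing import Optional

def try_parse(buffer: str) -> Optional[dict]:
    # Attempt to parse the accumulated stream as JSON.
    # Returns the object once the buffer is complete, None while partial.
    try:
        return json.loads(buffer)
    except json.JSONDecodeError:
        return None

# Simulated token stream carrying one JSON object split mid-key and mid-string.
chunks = ['{"name": "a', 'gent", "steps"', ': 3}']
buffer, result = "", None
for chunk in chunks:
    buffer += chunk
    result = try_parse(buffer)  # stays None until the closing brace arrives
```

Smarter approaches exist (incremental parsers that emit partial objects), but buffer-until-valid is the simplest correct behavior.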
When to Use
- User-facing applications where perceived speed matters
- Long-running agent tasks where users need progress updates
- Interactive sessions where the user may want to interrupt or redirect
- Chat interfaces — streaming is expected UX for AI conversations
- Agent dashboards that show tool calls and reasoning in real time