Conversation Summarization

Uses an LLM to compress older conversation history into a running summary, freeing context window space for new messages. Recent turns are kept verbatim while older turns are replaced by their compressed form. This solves the core failure mode of buffer memory — long conversations exceeding the context window.


Structure

When the conversation approaches the context limit, older messages are compressed into a summary by a separate LLM call. The summary replaces the raw messages, and new turns continue to accumulate until the next compression cycle.
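The resulting context layout can be sketched as a summary block followed by verbatim recent turns. This is a minimal illustration, not a specific library's API; the message shape and the framing of the summary as a system message are assumptions.

```python
# Assemble the context window: running summary first, then raw recent turns.
# The message format ({"role": ..., "content": ...}) is an assumed convention.

def build_prompt(summary: str, recent: list[dict]) -> list[dict]:
    """Return the messages to send: summary (if any) plus verbatim recent turns."""
    messages = []
    if summary:
        messages.append({
            "role": "system",
            "content": f"Summary of earlier conversation:\n{summary}",
        })
    messages.extend(recent)
    return messages

prompt = build_prompt(
    "User is debugging a Python web app; agreed to use logging over print.",
    [{"role": "user", "content": "Where should the log config live?"}],
)
```

Each compression cycle rewrites only the summary element; the recent turns always pass through untouched.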


Mechanism

  • New messages are appended normally to the conversation buffer
  • When the buffer approaches the context limit, a summarization pass runs
  • Older messages are compressed into a condensed summary paragraph
  • Summary replaces the raw messages — originals are discarded or archived
  • Summarization can be triggered by token count, turn count, or time
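The steps above can be sketched as a small class. This is a hedged illustration: the class name is hypothetical, the token counter is a crude word-split stand-in for a real tokenizer, and `summarize` is a placeholder for the extra LLM call.

```python
# Minimal sketch of a token-count-triggered compression cycle.
# `summarize` is a callable (old_summary, old_messages) -> new_summary,
# standing in for the separate summarization LLM call.

class SummaryMemory:
    def __init__(self, summarize, max_tokens=3000, keep_recent=4):
        self.summarize = summarize
        self.max_tokens = max_tokens    # compression trigger threshold
        self.keep_recent = keep_recent  # turns always kept verbatim
        self.summary = ""
        self.messages = []

    def _tokens(self, text):
        return len(text.split())  # crude stand-in for a real tokenizer

    def add(self, role, content):
        """Append a new turn; compress older turns if over the threshold."""
        self.messages.append({"role": role, "content": content})
        total = self._tokens(self.summary) + sum(
            self._tokens(m["content"]) for m in self.messages)
        if total > self.max_tokens:
            self._compress()

    def _compress(self):
        """Fold all but the most recent turns into the running summary."""
        old = self.messages[:-self.keep_recent]
        if old:
            self.summary = self.summarize(self.summary, old)
            self.messages = self.messages[-self.keep_recent:]
```

Passing the previous summary into `summarize` lets each cycle produce a cumulative summary rather than summarizing only the latest slice; swapping the token trigger for a turn count or timer only changes the condition in `add`.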

Key Characteristics

  • Extends conversation length — conversations can run far beyond the context window
  • Lossy compression — detail is permanently lost during summarization
  • Added latency — summarization requires an extra LLM call
  • Quality depends on summarizer — bad summaries compound over time
  • Cost trade-off — extra LLM calls for summarization vs. fewer input tokens per turn
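The cost trade-off in the last point can be made concrete with some back-of-the-envelope arithmetic. All figures here (turn length, summary cap, summarization cadence) are assumed for illustration, not measurements.

```python
# Illustrative input-token arithmetic: buffer-only vs. summary + recent buffer.
# Every number below is an assumption chosen to show the shape of the trade-off.

TURNS = 50
TOKENS_PER_TURN = 200  # assumed average message length

# Buffer-only: each turn re-sends the entire prior history as input.
buffer_input = sum(i * TOKENS_PER_TURN for i in range(1, TURNS + 1))

# Summarized: each turn sends a ~300-token summary plus 4 verbatim turns,
# with an extra summarizer call every 10 turns reading ~2000 tokens.
SUMMARY_TOKENS = 300
RECENT_TURNS = 4
summary_input = TURNS * (SUMMARY_TOKENS + RECENT_TURNS * TOKENS_PER_TURN)
summary_extra = (TURNS // 10) * 2000  # input tokens consumed by summarizer calls

print(buffer_input, summary_input + summary_extra)
```

Under these assumptions the summarized approach reads a few times fewer input tokens overall, and the gap widens as conversations get longer, since buffer-only input grows quadratically with turn count while the summarized variant grows roughly linearly.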

When to Use

  • Long-running conversations that will exceed the context window
  • The exact wording of older messages doesn't matter, only the gist
  • You need to maintain context over hours or days of interaction
  • You want to avoid hard failures when conversations get long
  • Combined with buffer memory for recent turns (summary + recent verbatim buffer is the standard approach)