
Context Assembly

Every turn, the harness builds the exact context the model sees: system prompt, instructions, conversation history, retrieved knowledge, tool definitions, and the current observation. Context assembly is the engineering of that construction — what to include, in what order, and how to fit it in a finite window. It is the single biggest lever on agent quality.

The model has no memory and no access to anything you don't put in front of it. Its entire world each turn is the context you assembled. Garbage in, garbage out is not a cliché here — it's the whole game.


Structure

Assembly is a budget-constrained packing problem: fit the most useful information into a fixed window, in the order that helps the model most.


How It Works

  1. Gather candidates — system prompt, authored grounding (the instruction files and briefings of the context hierarchy), recent history, retrieved documents, tool schemas, and the latest observation.
  2. Prioritize — rank by importance: instructions and the current task are non-negotiable; older history and marginal retrieval are first to be cut.
  3. Budget — measure token cost and fit candidates within the window, reserving headroom for the model's response.
  4. Order deliberately — placement matters. Stable content (system prompt, tool defs) goes first so it stays cacheable; the most relevant retrieval and the current observation go last, nearest the model's attention.
  5. Format — render into the message structure the model expects, with clear delimiters between sections.
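
The five steps above can be sketched as a small packing routine. This is a minimal illustration, not a reference implementation: the `Candidate` type, the `assemble` function, the four-characters-per-token estimate, and the tag-style delimiters are all assumptions made for the example, and a real harness would use the model's own tokenizer and message format.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str       # hypothetical label, assumed unique per candidate
    text: str
    priority: int   # 0 = non-negotiable; larger numbers are cut first
    stable: bool    # True for content that rarely changes between turns


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


def assemble(candidates, window, reserve_output):
    # Budget: leave headroom for the model's response.
    budget = window - reserve_output

    # Prioritize: admit the most important candidates first.
    kept, used = set(), 0
    for c in sorted(candidates, key=lambda c: c.priority):
        cost = estimate_tokens(c.text)
        if c.priority == 0 or used + cost <= budget:
            kept.add(c.name)
            used += cost
    if used > budget:
        raise ValueError("non-negotiable content alone exceeds the input budget")

    # Order: stable content first, volatile content last, each in the
    # order the caller supplied (the current observation is supplied last).
    ordered = [c for c in candidates if c.name in kept and c.stable]
    ordered += [c for c in candidates if c.name in kept and not c.stable]

    # Format: clear delimiters between sections.
    prompt = "\n\n".join(f"<{c.name}>\n{c.text}\n</{c.name}>" for c in ordered)
    return prompt, used


cands = [
    Candidate("system", "You are a helpful agent.", 0, True),
    Candidate("tools", "search(query)", 0, True),
    Candidate("old_history", "x" * 4000, 3, False),
    Candidate("retrieval", "Relevant doc text.", 1, False),
    Candidate("observation", "User asks about caching.", 0, False),
]
prompt, used = assemble(cands, window=200, reserve_output=100)
```

With a 200-token window and 100 tokens reserved for output, the oversized `old_history` candidate is dropped while the system prompt stays at the front and the observation stays at the end.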
Manus: Context Engineering
KV-cache hit rate is the single most important metric for a production agent

After rebuilding their agent framework several times, Yichao “Peak” Ji’s team at Manus concluded that assembly choices dominate production economics. A typical Manus run sees an input-to-output token ratio around 100:1 — the agent is overwhelmingly input-heavy, so whether that input hits the cache decides both latency and cost. Their rules follow directly. Keep the prompt prefix stable: a single-token difference, like a timestamp at the top of the system prompt, invalidates the cache from that point onward. Make context append-only and never modify earlier turns. And when the tool set must change mid-task, mask tools rather than remove them — removal busts the cache and confuses the model. Their last rule is the escape hatch for the packing problem itself: use the filesystem as the ultimate context, externalizing what doesn’t need to be in the window.
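
The effect of a volatile prefix can be illustrated by measuring how many leading tokens two consecutive turns share. The sketch below is a toy model under stated assumptions: whitespace splitting stands in for a real tokenizer, `shared_prefix_len` is a hypothetical helper, and actual KV caches may match prefixes at a coarser granularity.

```python
def tokens(text):
    # Stand-in for a real tokenizer (assumption for illustration only).
    return text.split()


def shared_prefix_len(prev, curr):
    # Count leading tokens that are identical in both sequences.
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


SYSTEM = "You are an agent. Follow the instructions."

# Timestamp at the top of the prompt: the prefix changes every turn.
turn1_bad = tokens("Time: 10:00:01 " + SYSTEM + " user: hi")
turn2_bad = tokens("Time: 10:00:02 " + SYSTEM + " user: hi assistant: hello user: next")

# Stable prefix, append-only history: turn 2 extends turn 1 exactly.
turn1_good = tokens(SYSTEM + " user: hi")
turn2_good = tokens(SYSTEM + " user: hi assistant: hello user: next")

bad_reuse = shared_prefix_len(turn1_bad, turn2_bad)
good_reuse = shared_prefix_len(turn1_good, turn2_good)
```

In the timestamped variant only the literal `Time:` token is shared, so nearly everything must be recomputed. In the append-only variant the whole of turn 1 is a prefix of turn 2 and can be reused.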

With Claude Sonnet at the time, cached input tokens cost $0.30/MTok versus $3/MTok uncached — a 10× difference that assembly decisions directly control.
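
As a rough worked example of that difference, the sketch below blends the two quoted prices by cache hit rate. The `input_cost` function is hypothetical, and the model is deliberately simplified: it ignores output tokens and any surcharge a provider may apply for writing to the cache.

```python
CACHED_PER_MTOK = 0.30    # USD per million cached input tokens (figure quoted above)
UNCACHED_PER_MTOK = 3.00  # USD per million uncached input tokens (figure quoted above)


def input_cost(input_tokens, hit_rate):
    # hit_rate is the fraction of input tokens served from the cache.
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    cached = input_tokens * hit_rate
    uncached = input_tokens - cached
    return (cached * CACHED_PER_MTOK + uncached * UNCACHED_PER_MTOK) / 1_000_000


no_cache = input_cost(1_000_000, 0.0)
mostly_cached = input_cost(1_000_000, 0.9)
fully_cached = input_cost(1_000_000, 1.0)
```

Under these assumptions, one million input tokens cost about $3.00 with no cache hits, about $0.57 at a 90% hit rate, and about $0.30 when fully cached.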

Key Characteristics

  • Assembly is a packing problem under a hard limit — there is always more potentially-useful context than window. The art is choosing what to drop.
  • Order affects both quality and cost — recent and final positions get the most attention; stable prefixes enable prompt caching. Put fixed content first, volatile content last.
  • Reserve output headroom — assembling right up to the limit leaves no room to answer. Always budget for the response.
  • Relevance beats volume — more context is not better context. Anthropic's context engineering guidance frames attention as a finite budget that degrades as the window fills — "context rot." Precise retrieval beats dumping the whole knowledge base, which dilutes attention.
  • Tool definitions are context too — large tool schemas consume real budget; see Tool Registry for loading them on demand.
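
To make the cost of tool definitions visible, a harness can estimate how much of the window the schemas consume before anything else is added. The sketch below is illustrative only: the two tool schemas are invented, `tool_budget_report` is a hypothetical helper, and the four-characters-per-token estimate is an assumption that a real tokenizer should replace.

```python
import json


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def tool_budget_report(tools, window):
    # Estimate the cost of each schema as it would be serialized into context.
    costs = {t["name"]: estimate_tokens(json.dumps(t)) for t in tools}
    total = sum(costs.values())
    return costs, total, total / window


# Hypothetical tool schemas, invented for this example.
tools = [
    {
        "name": "search_docs",
        "description": "Search the internal documentation and return matching passages.",
        "parameters": {
            "query": {"type": "string", "description": "Free-text search query."},
            "limit": {"type": "integer", "description": "Maximum number of results."},
        },
    },
    {"name": "get_time", "description": "Return the current time.", "parameters": {}},
]

costs, total, share = tool_budget_report(tools, window=8000)
```

A report like this shows which schemas dominate the budget and are therefore the best candidates for on-demand loading.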

Pitfalls

  • Dumping everything in — stuffing the window dilutes attention and raises cost without improving answers. The Prompt Monolith and Tool Junk Drawer both live here.
  • Volatile content in the prefix — putting a timestamp or per-turn data before the stable system prompt busts the cache every turn.
  • No headroom — filling the window so full the model can't respond, or silently truncating the most recent (most important) turn.
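
One way to avoid the last pitfall is to trim history from the oldest end and fail loudly rather than truncate the latest turn. The sketch below assumes a hypothetical `trim_history` helper and the same rough four-characters-per-token estimate; it is one possible policy, not a prescribed one.

```python
def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def trim_history(turns, budget):
    # Walk from the newest turn backwards, keeping turns while they fit.
    kept, used = [], 0
    for turn in reversed(turns):
        cost = estimate_tokens(turn)
        if used + cost > budget:
            if not kept:
                # Never silently cut the most recent turn.
                raise ValueError("latest turn alone exceeds the history budget")
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


turns = ["a" * 400, "b" * 400, "c" * 40]  # oldest first
trimmed = trim_history(turns, budget=120)
```

Here the oldest turn is dropped and the two most recent survive. Note that dropping early turns changes the prompt prefix, so a trim like this costs a cache miss on the turn where it happens.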