
Prompt Caching

Across the turns of an agent run, the front of the context — system prompt, tool definitions, early history — barely changes. Prompt caching lets the model provider reuse the compute for that stable prefix instead of reprocessing it every call, cutting both latency and cost dramatically. Exploiting it is a harness responsibility: it depends entirely on how you order and mutate context.

The same tokens are sent at the top of every turn. Without caching, you pay full price to process the entire system prompt and tool schemas on turn 1, turn 2, and turn 40. With it, you pay full price once and a discounted cached rate on later turns, which is often the largest single cost reduction in a long-running agent.

Prompt caching with Claude
Prompt caching reduces cost by up to 90% and latency by up to 85% for long prompts. Cached input tokens are priced at roughly one-tenth of uncached input — $0.30 per million tokens cached versus $3 per million uncached on the Sonnet model of the time.
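
To make the pricing concrete, here is a rough sketch that estimates what a fixed prefix costs over a multi-turn run, with and without caching. The $3 and $0.30 rates come from the figures above. The 1.25x surcharge for writing the cache on the first turn, the function name, and the example sizes are assumptions for illustration. The model also assumes every later turn hits a warm cache, and it ignores growing history and output tokens.

```python
UNCACHED_PER_MTOK = 3.00   # USD per million uncached input tokens (from the text)
CACHED_PER_MTOK = 0.30     # USD per million cached input tokens (from the text)
WRITE_MULTIPLIER = 1.25    # ASSUMPTION: surcharge for writing the cache on turn 1


def prefix_cost(prefix_tokens: int, turns: int, cached: bool) -> float:
    """Estimated USD spent processing a fixed prefix over `turns` calls."""
    if turns <= 0:
        return 0.0
    if not cached:
        # Every turn reprocesses the whole prefix at the full rate.
        return prefix_tokens * turns * UNCACHED_PER_MTOK / 1_000_000
    # Turn 1 writes the cache; the remaining turns read it at the cached rate.
    write = prefix_tokens * UNCACHED_PER_MTOK * WRITE_MULTIPLIER / 1_000_000
    reads = prefix_tokens * (turns - 1) * CACHED_PER_MTOK / 1_000_000
    return write + reads
```

Under these assumptions, a 20,000-token prefix over 40 turns comes to about $2.40 uncached and about $0.31 cached. A single call is slightly more expensive with caching because of the assumed write surcharge, so the benefit depends on the prefix actually being reused.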

Structure

The cacheable prefix must stay byte-stable across turns; only the suffix changes. A single edit to the prefix invalidates the cache from that point on.
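
A simplified way to picture this is to compare the token sequences of two consecutive turns and measure how much of the front they share. The sketch below uses plain integer token ids and a hypothetical helper name. Real providers typically match at block or breakpoint granularity rather than token by token, so treat it as a mental model only.

```python
def shared_prefix_len(previous: list[int], current: list[int]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break  # the first difference ends the reusable prefix
        n += 1
    return n


turn_1 = [1, 2, 3, 4, 5]
appended = [1, 2, 3, 4, 5, 6, 7]   # new content added at the end only
edited = [1, 9, 3, 4, 5, 6, 7]     # one early token changed

reusable_after_append = shared_prefix_len(turn_1, appended)  # all of turn_1
reusable_after_edit = shared_prefix_len(turn_1, edited)      # only the first token
```

Appending leaves the whole previous turn reusable, while a single early edit cuts the reusable span to whatever precedes the edit.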


How It Works

  1. Order for stability — put the most stable content first (context assembly): system prompt, then tool definitions, then settled history, with volatile per-turn content last.
  2. Hold the prefix byte-stable — the prefix must be identical across turns to hit the cache. A changed timestamp, reordered tool, or rewritten early message breaks it.
  3. Mark cache breakpoints — where the provider supports it, signal where the reusable prefix ends so the cache boundary is explicit.
  4. Compact at boundaries — when compaction must rewrite history, do it at a cache boundary to minimize how much cached prefix is invalidated.
  5. Mind the TTL — caches expire after a short window; pacing of turns and wakeups affects whether you keep hitting a warm cache.
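
The first three steps can be sketched as a small request builder. This is a minimal illustration, not a definitive implementation: `build_request` and its parameters are hypothetical names, and the `cache_control` marker follows the shape used by Anthropic's Messages API as an assumption, so check the provider's documentation before relying on it. Required request fields such as the model name are omitted.

```python
import copy


def build_request(system_prompt, tools, settled_history, volatile_turn):
    """Assemble a request with stable content first and volatile content last."""
    # Sort a copy of the tools by name so the serialized order is deterministic
    # and the caller's list is never mutated.
    stable_tools = sorted(copy.deepcopy(tools), key=lambda t: t["name"])
    if stable_tools:
        # ASSUMPTION: marking the last tool signals the end of the tool prefix.
        stable_tools[-1]["cache_control"] = {"type": "ephemeral"}
    return {
        "system": [
            {
                "type": "text",
                "text": system_prompt,  # must contain no per-turn values
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "tools": stable_tools,
        # Settled history is only appended to; the volatile turn always goes last.
        "messages": [*settled_history, volatile_turn],
    }
```

Sorting tools by name is one possible way to get a deterministic order. Any fixed order works, as long as it is the same on every turn.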

The Manus team learned this in production. Their context-engineering writeup calls KV-cache hit rate "the single most important metric for a production-stage AI agent." Their agents run at roughly a 100:1 input-to-output token ratio, and a single-token change in the prefix, such as a timestamp in the system prompt, invalidates the cache from that point onward.


Key Characteristics

  • Stable prefix, volatile suffix — the entire benefit rests on keeping the front of the context unchanged turn to turn.
  • Any prefix edit invalidates downstream — change one token in the middle and everything after it must be reprocessed. Edits are expensive; appends are cheap.
  • Ordering is a cost decision — where you place content during context assembly directly determines cache hit rate.
  • Caches are short-lived — a long sleep between turns can cost a cache miss; pace work to the TTL when it matters.
  • Compaction and caching trade off — rewriting history reclaims window but burns cache; do it deliberately, at boundaries.
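
The point about pacing can be sketched as a small check a harness might run before sleeping. The 5-minute TTL, the idea that the window restarts on each request, the safety margin, and both function names are assumptions for illustration. Actual TTLs and refresh rules vary by provider.

```python
DEFAULT_TTL_SECONDS = 300.0  # ASSUMPTION: 5-minute TTL, restarted by each request


def cache_is_warm(last_request_at: float, now: float,
                  ttl_seconds: float = DEFAULT_TTL_SECONDS) -> bool:
    """True if the prefix cached at `last_request_at` should still be live."""
    return (now - last_request_at) < ttl_seconds


def max_safe_sleep(last_request_at: float, now: float,
                   ttl_seconds: float = DEFAULT_TTL_SECONDS,
                   safety_margin: float = 10.0) -> float:
    """Longest sleep, in seconds, that still leaves a margin before expiry."""
    remaining = ttl_seconds - (now - last_request_at) - safety_margin
    return max(0.0, remaining)
```

Both functions take timestamps as arguments, so a caller could pass values from `time.monotonic()`. Whether it is worth waking up early to keep the cache warm depends on the prefix size and how often the agent actually resumes.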

Pitfalls

  • Volatile data in the prefix — a per-turn timestamp or per-request id placed ahead of the stable content busts the cache on every turn.
  • Reordering tools or messages — non-deterministic ordering of the prefix means it never matches and never caches.
  • Compacting mid-prefix every turn — constant history rewrites guarantee perpetual cache misses; the cost accounting will show it.
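
One way to catch the first two pitfalls early is to fingerprint the serialized prefix and compare it across turns. The sketch below is a hypothetical guard, assuming the harness sends exactly the canonical string it hashes. `canonical_prefix`, `fingerprint`, and `PrefixGuard` are illustrative names, not part of any library.

```python
import hashlib
import json


def canonical_prefix(system_prompt: str, tools: list[dict]) -> str:
    """Serialize the prefix deterministically: sorted tools, sorted keys."""
    ordered = sorted(tools, key=lambda t: t["name"])
    return json.dumps(
        {"system": system_prompt, "tools": ordered},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )


def fingerprint(text: str) -> str:
    """SHA-256 hex digest of the UTF-8 bytes of `text`."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class PrefixGuard:
    """Remembers the first prefix fingerprint and flags any later drift."""

    def __init__(self) -> None:
        self._expected = None

    def check(self, prefix: str) -> bool:
        fp = fingerprint(prefix)
        if self._expected is None:
            self._expected = fp  # first turn defines the expected prefix
            return True
        return fp == self._expected
```

A harness could call `check` on every turn and log a warning when it returns `False`, which would point at a timestamp, id, or reordering that slipped into the prefix. A guard like this only detects drift; it does not explain which field changed.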