Maintaining the Harness

Everything in this section is software, and software rots. Instructions go stale, rules contradict each other, components that earned their place under last year's model become dead weight under this year's. A harness that is only ever added to becomes the very thing it was built to prevent: a noisy, contradictory environment the agent can't work in reliably. Pay down harness debt the way you pay down technical debt — on a schedule, with measurements.

"Add a rule" is short-term pain relief and long-term poison. Every instruction, gate, and component you add is a liability with a maintenance cost — and the only way to know it's still earning its keep is to measure what happens when you take it away.

Three questions per instruction: source, applicability, expiry.

Ablation: remove one component, re-run the benchmark, compare.

Shift, not shrink: what model progress does to the valuable harness combinations.

Monthly: a reasonable audit cadence for a living harness.

Audit instructions like dependencies

Instruction files grow by reflex — every incident adds a rule, no rule ever leaves. The failure modes compound quietly: a bloated entry file eats context budget, critical rules sink into the middle where attention is weakest, stale rules contradict new ones, and the agent picks between contradictions at random.

The discipline is to manage every instruction with three pieces of metadata: source (why it was added: which incident, which review), applicability (when it matters: which tasks load it), and expiry (what would make it removable). An instruction that can't answer the third question is permanent by accident. Audit on a cadence: delete what's stale, merge what's redundant, and move what survives out of the always-loaded entry file, so that it is routed to the tasks that need it rather than broadcast to all of them.
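The three metadata questions can be made checkable by storing them next to each rule. The sketch below is one possible shape, not a prescribed format: the `Instruction` fields, the `audit` function, the finding names, and the 30-day review window (standing in for a monthly cadence) are all illustrative assumptions, and the sample rules are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Instruction:
    text: str
    source: str               # why it was added: incident or review reference
    applies_to: list[str]     # task types that should load this rule
    expiry: Optional[str]     # condition that would make the rule removable
    last_reviewed: date


def audit(instructions: list[Instruction], today: date,
          max_age_days: int = 30) -> dict[str, list[Instruction]]:
    """Group rules by audit finding; one rule can appear under several."""
    findings: dict[str, list[Instruction]] = {
        "no_expiry": [],        # permanent by accident
        "overdue_review": [],   # not looked at within the audit window
        "broadcast": [],        # no applicability, so every task loads it
    }
    for rule in instructions:
        if not rule.expiry:
            findings["no_expiry"].append(rule)
        if (today - rule.last_reviewed).days > max_age_days:
            findings["overdue_review"].append(rule)
        if not rule.applies_to:
            findings["broadcast"].append(rule)
    return findings


# Hypothetical rules, for illustration only.
rules = [
    Instruction("Run migrations inside a transaction", "incident review",
                ["db-migration"], "tooling enforces transactions",
                date(2025, 1, 10)),
    Instruction("Never edit generated files", "", [], None, date(2024, 6, 1)),
]
report = audit(rules, today=date(2025, 2, 1))
for finding, flagged in report.items():
    print(finding, [rule.text for rule in flagged])
```

A flagged rule is a prompt for a decision (delete, merge, or re-route), not an automatic deletion.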

Ablate to find what's earning its keep

Intuition is a bad judge of which harness components matter. The honest instrument is an ablation test: hold the model and task set fixed, remove one component at a time, and re-run a repeatable benchmark. The component whose removal hurts most is your highest-marginal-value piece; the one whose removal changes nothing is a candidate for deletion.

One caveat the measurement-minded will appreciate: ablation ranks marginal value, not bottlenecks. It tells you how much the score moves when a component is removed, not where the remaining failures originate. A component can be individually removable while the failure pattern points somewhere else entirely, so finding the bottleneck still requires failure records and root-cause attribution, not just score deltas.
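The ablation loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `run_benchmark` is a hypothetical callable that takes the set of enabled component names and returns a score where higher is better, with the model and task set held fixed inside it. The `repeats` parameter is also an assumption, added because a single agent run is usually too noisy to compare.

```python
from typing import Callable

# Hypothetical benchmark signature: enabled component names in, score out.
Benchmark = Callable[[frozenset[str]], float]


def ablate(components: list[str], run_benchmark: Benchmark,
           repeats: int = 3) -> list[tuple[str, float]]:
    """Return (component, score drop when removed), largest drop first."""

    def mean_score(enabled: frozenset[str]) -> float:
        # Average several runs so one lucky or unlucky run does not decide.
        return sum(run_benchmark(enabled) for _ in range(repeats)) / repeats

    full = frozenset(components)
    baseline = mean_score(full)
    # Remove exactly one component at a time; everything else stays enabled.
    drops = [(name, baseline - mean_score(full - {name})) for name in components]
    return sorted(drops, key=lambda item: item[1], reverse=True)
```

The top of the returned list is the highest-marginal-value component; entries with a drop near zero are deletion candidates. As the caveat says, the ranking says nothing about where the remaining failures come from.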

Simplify as models improve

The strangest maintenance duty: deleting things that work. Harness components are compensations for model weaknesses, and model weaknesses move. A sprint-splitting mechanism built because last year's model drifted on long builds becomes pure overhead when this year's model decomposes natively; meanwhile the independent checker keeps adding value near the model's capability boundary, where it still catches stubbed implementations and missing functionality.

Building Effective Agents (Schluntz and Zhang, Anthropic Engineering, 2024)

Find the simplest pattern that works, and add complexity only when it measurably improves outcomes. The same test that gates additions also gates what stays: a component that no longer moves the measurement no longer belongs in the harness.

The audit sorts what's left into three buckets:

Retire compensations (model caught up)

Scaffolding that worked around a weakness the model no longer has is now cost without benefit: latency, tokens, and complexity. Ablate and remove.

Keep the boundary guards (still earning)

Components that operate at the capability frontier (verification, evidence gates, budget enforcement) keep their value as the frontier moves with them.

Reinvest the headroom (new combinations)

As models improve, the interesting harness combinations don't shrink; they shift. Capability freed from babysitting goes into longer autonomy, wider scopes, harder tasks.

A harness tuned for one model generation is mis-tuned for the next. Re-benchmark on every model upgrade — in both directions: what broke, and what became unnecessary.
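Checking both directions on a model upgrade can be sketched as a comparison of two result sets. Everything named here is an assumption made for illustration: `old_results` and `new_results` map task IDs to pass/fail under each model, `new_model_drops` holds per-component score drops from an ablation run on the new model, and `noise` is a threshold you would calibrate from run-to-run variance.

```python
def compare_upgrade(old_results: dict[str, bool],
                    new_results: dict[str, bool],
                    new_model_drops: dict[str, float],
                    noise: float = 0.02) -> tuple[list[str], list[str]]:
    """Return (tasks that broke, components that may now be unnecessary)."""
    # Direction 1: tasks that passed under the old model but fail (or are
    # missing) under the new one.
    broke = sorted(
        task for task, passed in old_results.items()
        if passed and not new_results.get(task, False)
    )
    # Direction 2: components whose removal no longer costs more than noise.
    retire_candidates = sorted(
        name for name, drop in new_model_drops.items() if drop <= noise
    )
    return broke, retire_candidates
```

Components that still show a clear drop under the new model are the ones to keep; the retire candidates deserve a confirming ablation before deletion.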

The deeper habit underneath all three: the harness itself deserves the treatment this section prescribes for everything else. It has metrics (does the benchmark score move), a replay suite (the captured runs it's tested against), and a definition of done (the audit checklist). A harness you measure, prune, and re-tune is infrastructure. One you only ever add to is sediment.
