Maintaining the Harness
Everything in this section is software, and software rots. Instructions go stale, rules contradict each other, components that earned their place under last year's model become dead weight under this year's. A harness that is only ever added to becomes the very thing it was built to prevent: a noisy, contradictory environment the agent can't work in reliably. Pay down harness debt the way you pay down technical debt — on a schedule, with measurements.
"Add a rule" is short-term pain relief and long-term poison. Every instruction, gate, and component you add is a liability with a maintenance cost — and the only way to know it's still earning its keep is to measure what happens when you take it away.
Audit instructions like dependencies
Instruction files grow by reflex — every incident adds a rule, no rule ever leaves. The failure modes compound quietly: a bloated entry file eats context budget, critical rules sink into the middle where attention is weakest, stale rules contradict new ones, and the agent picks between contradictions at random.
The discipline is to manage every instruction with three pieces of metadata: source (why it was added: which incident, which review), applicability (when it matters: which tasks load it), and expiry (what would make it removable). An instruction that can't answer the third question is permanent by accident. Audit on a cadence: delete what's stale, merge what's redundant, and move what survives to a place where it is routed to the tasks that need it rather than broadcast to every run.
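The three pieces of metadata can be sketched as a small record plus an audit pass. This is a minimal illustration, not a prescribed format: the `Instruction` class, its field names, the `audit` function, the 90-day cadence, and the two sample rules are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Instruction:
    text: str
    source: str                  # why it was added: incident or review reference
    applies_to: tuple[str, ...]  # task types that load it; empty means broadcast
    expiry: Optional[str]        # condition that would make it removable
    last_reviewed: date


def audit(instructions, today, max_age_days=90):
    """Return (instruction, reason) pairs that need attention at this audit."""
    findings = []
    for ins in instructions:
        if not ins.expiry:
            findings.append((ins, "no expiry condition: permanent by accident"))
        if not ins.applies_to:
            findings.append((ins, "broadcast to every task: consider routing"))
        if today - ins.last_reviewed > timedelta(days=max_age_days):
            findings.append((ins, "not reviewed within the audit cadence"))
    return findings


# Hypothetical sample data for illustration only
rules = [
    Instruction(
        text="Run the linter before committing",
        source="incident review (hypothetical)",
        applies_to=("code-change",),
        expiry="linter runs in a pre-commit hook",
        last_reviewed=date(2025, 1, 10),
    ),
    Instruction(
        text="Never touch the legacy folder",
        source="unknown",
        applies_to=(),
        expiry=None,
        last_reviewed=date(2024, 3, 1),
    ),
]

for ins, reason in audit(rules, today=date(2025, 2, 1)):
    print(f"{ins.text!r}: {reason}")
```

The first rule passes cleanly; the second is flagged three times, which is the signal to delete it, merge it, or give it a source, a scope, and an exit condition.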
Ablate to find what's earning its keep
Intuition is a bad judge of which harness components matter. The honest instrument is an ablation test: hold the model and task set fixed, remove one component at a time, and re-run a repeatable benchmark. The component whose removal hurts most is your highest-marginal-value piece; the one whose removal changes nothing is a candidate for deletion.
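A leave-one-out ablation loop might look like the sketch below. It assumes you can supply a `run_benchmark` callable that takes the set of enabled components and returns a score for a fixed model and task set; that callable, the component names, and the `fake_benchmark` stand-in are hypothetical.

```python
from statistics import mean
from typing import Callable, Iterable


def ablate(
    components: Iterable[str],
    run_benchmark: Callable[[frozenset], float],
    repeats: int = 3,
) -> list[tuple[str, float]]:
    """Rank components by the score lost when each is removed on its own."""
    full = frozenset(components)

    def score(enabled: frozenset) -> float:
        # Average several runs, because agent benchmarks are noisy
        return mean(run_benchmark(enabled) for _ in range(repeats))

    baseline = score(full)
    deltas = [(name, baseline - score(full - {name})) for name in sorted(full)]
    # Largest loss first: the top entry has the highest marginal value
    return sorted(deltas, key=lambda pair: pair[1], reverse=True)


def fake_benchmark(enabled: frozenset) -> float:
    # Deterministic stand-in for a real benchmark run (made-up weights)
    weights = {"checker": 0.25, "evidence_gate": 0.125, "sprint_splitter": 0.0}
    return 0.5 + sum(weights[name] for name in enabled)


ranking = ablate(["checker", "evidence_gate", "sprint_splitter"], fake_benchmark)
for name, delta in ranking:
    print(f"{name}: {delta:+.3f}")
```

With a real benchmark, a delta is only meaningful if it exceeds run-to-run noise, so it helps to raise `repeats` until the baseline is stable before trusting the ranking.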
One caveat the measurement-minded will appreciate: ablation ranks marginal value, not bottlenecks. A component can be cheap to remove while the failures you actually observe originate somewhere else entirely. Finding the bottleneck still requires failure records and root-cause attribution, not just deltas.
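Root-cause attribution can start as a simple tally over failure records. The sketch below assumes each record carries a `root_cause` label assigned during triage; the record shape, the labels, and the `bottlenecks` helper are hypothetical.

```python
from collections import Counter

# Hypothetical failure records; root_cause is assigned by a reviewer during triage
failures = [
    {"run": "r1", "root_cause": "context_overflow"},
    {"run": "r2", "root_cause": "stale_instruction"},
    {"run": "r3", "root_cause": "context_overflow"},
    {"run": "r4", "root_cause": "missing_test"},
]


def bottlenecks(failure_records, top=3):
    """Return (cause, count, share of all failures) for the most common causes."""
    counts = Counter(rec["root_cause"] for rec in failure_records)
    total = sum(counts.values())
    return [(cause, n, n / total) for cause, n in counts.most_common(top)]


for cause, n, share in bottlenecks(failures):
    print(f"{cause}: {n} failures ({share:.0%})")
```

Read this tally next to the ablation ranking: the two answer different questions, and a cause that dominates the failure records may not correspond to any component you can ablate.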
Simplify as models improve
The strangest maintenance duty: deleting things that work. Harness components are compensations for model weaknesses, and model weaknesses move. A sprint-splitting mechanism built because last year's model drifted on long builds becomes pure overhead when this year's model decomposes natively; meanwhile the independent checker keeps adding value near the model's capability boundary, where it still catches stubbed implementations and missing functionality.
The audit sorts what's left into three buckets:
Retire compensations (model caught up). Scaffolding that worked around a weakness the model no longer has is now cost without benefit: latency, tokens, and complexity. Ablate and remove.
Keep the boundary guards (still earning). Components that operate at the capability frontier (verification, evidence gates, budget enforcement) keep their value as the frontier moves with them.
Reinvest the headroom (new combinations). As models improve, the interesting harness combinations don't shrink; they shift. Capability freed from babysitting goes into longer autonomy, wider scopes, harder tasks.
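The first two buckets can be approximated by comparing ablation deltas measured under the old model and the new one. This is a sketch under stated assumptions: the `sort_components` function, the `noise_floor` threshold, the extra "review" bucket, and all the numbers are hypothetical. The third bucket is a planning decision rather than a per-component label, so it is not modeled here.

```python
def sort_components(old_deltas, new_deltas, noise_floor=0.02):
    """Bucket components by whether removing them still costs score.

    old_deltas / new_deltas map component name -> score lost on removal,
    measured under the previous and the current model respectively.
    """
    buckets = {"retire": [], "keep": [], "review": []}
    for name in sorted(new_deltas):
        new = new_deltas[name]
        old = old_deltas.get(name, 0.0)
        if new > noise_floor:
            buckets["keep"].append(name)    # still earning under the new model
        elif old > noise_floor:
            buckets["retire"].append(name)  # mattered before; model caught up
        else:
            buckets["review"].append(name)  # never measurably mattered; check by hand
    return buckets


# Made-up deltas for illustration only
old = {"sprint_splitter": 0.15, "checker": 0.20, "banner": 0.0}
new = {"sprint_splitter": 0.0, "checker": 0.18, "banner": 0.0}

print(sort_components(old, new))
```

The noise floor should come from the measured run-to-run variance of the benchmark, not from a guess; a threshold set too low keeps dead weight, and one set too high retires a guard that is still catching failures.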
The deeper habit underneath all three: the harness itself deserves the treatment this section prescribes for everything else. It has metrics (does the benchmark score move), a replay suite (the captured runs it's tested against), and a definition of done (the audit checklist). A harness you measure, prune, and re-tune is infrastructure. One you only ever add to is sediment.