Scope & Feature Lists

Agents overreach and under-finish. Given a feature to build, an unconstrained agent activates everything adjacent — "let me fix this too while I'm here" — diluting finite attention across five half-done changes, none of which passes end to end. The fix is not a smarter prompt; it is a harness-enforced boundary: one task in flight, tracked in a machine-readable feature list, with executable completion evidence required before anything new starts.

Do less but finish always beats do more but leave half-done. Lines of code generated and features completed are not just uncorrelated — in agent work they tend to run opposite. The constraint that forces finishing is the harness's job, because the model will never impose it on itself.

WIP = 1

one task active at a time — the safe default

(behavior, check, state)

the triple every feature entry carries

VCR

verified ÷ activated — gate new work on it

irreversible

passing is set by the verifier, never the agent

Scope discipline in four mechanisms. Attention is finite: capacity spread across k tasks gives each one C/k — below the threshold where any of them finishes.

The feature list is a primitive, not a memo

A status note like "shopping cart mostly done" is for humans, and agents ignore it the way humans do. A feature list is something stronger: a machine-readable data structure the harness itself runs on. The scheduler picks the next item from it, the verifier gates state transitions on it, the handoff reports from it, and progress metrics tally it. Documents can be ignored; primitives can't be bypassed.

Every entry is a triple — a behavior, an executable check, and a state:

feature_list.json
{
  "id": "rate-limit-login",
  "behavior": "Sixth login attempt from one IP within 60s returns 429",
  "verification": "pytest tests/test_ratelimit.py::test_sixth_attempt_429",
  "state": "active",
  "evidence": "—"
}

The states form a small machine, and the harness — not the agent — controls the transitions:

The only path into passing is the verification command actually succeeding — and the transition is irreversible, recorded with evidence (the commit, the test output). The agent can request verification; it cannot grade itself. This is the same worker/checker separation that anchors Verification & Done, enforced at the data-structure level — a database constraint, not an application-layer suggestion.

WIP = 1, enforced

With the list in place, the scope rule is one line in the harness: only one entry may be active at a time, and nothing new activates until the active one passes. Track the ratio of verified to activated tasks — the verified completion rate from Metrics That Matter — and block activation whenever it dips below 1.0. The count of not-yet-passing entries is back-pressure: the agent always knows exactly what "done" still owes.

Calibrating granularityeach entry: completable in one session

Too broad"Implement the shopping cart" — spans sessions, so it can never reach passing cleanly and squats in active forever.

Right-sized"User can add an item to the cart" — one behavior, one verification command, finishable inside a single session's budget.

Too narrow"Create the name field on the Cart model" — an implementation detail, not a behavior; the verification command would test nothing a user can see.

Scope each entry to one verifiable behavior that fits in one session. The verification command is the test of granularity: if you can't write one, the entry is wrong-sized.

In controlled comparisons, the constrained setup wins decisively: WIP=1 runs finish most of their feature list with everything passing end to end, while unconstrained runs touch three times the files, write four times the code, and leave most features failing the full pipeline. Little's Law says the same thing queueing theory has always said: more work in flight means longer lead time per item — and for agents, items that never finish at all.

Anthropic ran this experiment at full scale and published the wreckage and the fix:

AnthropicLong-Running Agentssource ↗

Right-sizing sessions: one feature per session beats one-shotting the app

Anthropic's work on harnesses for long-running coding agents documents the unscoped failure mode exactly: the agent tries to one-shot the entire application, runs out of context mid-feature, and the next session inherits an undocumented half-built feature — a later instance sees apparent progress and falsely declares the work done. The fix was structural. An initializer agent sets up the environment, git repo, init scripts, a structured feature list, and a progress log; a coding agent then works exactly one feature per session, runs end-to-end tests, commits clean, and updates the log before stopping. Compaction alone was insufficient to carry continuity — the feature list and one-per-session discipline were what made progress accumulate instead of churn.

Scope control turned multi-session progress from regression-prone to monotonic — every session ends with tested, committed, recorded progress the next one can trust.

Pitfalls

The agent edits its own states — if the model can write "state": "passing", it will. State transitions belong to the verifier; the agent only submits requests.
A list with no verification commands — entries that are just behavior descriptions degrade back into a memo. Every entry carries its executable check or it doesn't go on the list.
Letting "while I'm here" through — adjacent refactors feel free and cost the active task its attention. New work goes on the list as not_started; the WIP gate decides when it runs.

The feature list is a primitive, not a memo​

WIP = 1, enforced​

Pitfalls​

The feature list is a primitive, not a memo

WIP = 1, enforced

Pitfalls