Verification & Definition of Done
The most common agent failure is not bad code — it's a confident "I'm done" when it isn't. Models are systematically overconfident about their own work, and the gap between the agent's confidence and actual correctness — the verification gap — is where unfinished work ships. The fix is structural: a written, command-verifiable Definition of Done, checked in layers, by something other than the model that did the work.
An agent must never be the sole judge of its own "done." The same model that generated the work grades it generously — worst of all on anything subjective. The solution is to separate the worker from the checker.
Three layers, each gating the next
Termination criteria live in the harness as an ordered gate. Each layer is cheaper than the one after it, and failing any layer stops the climb — there is no point end-to-end testing code that doesn't type-check.
Layer 1 — static. Lint, type checks, build. Cheapest, catches the mechanical.
Layer 2 — runtime. The test suite passes, the app reaches ready state, the critical path executes. This is the core completion evidence — code that compiles but never ran proves nothing.
Layer 3 — end-to-end. The full pipeline, wired together. This layer exists because of component-boundary defects: module A and module B each pass their mocked unit tests and still break when connected — interface mismatches, state that doesn't propagate, resources that leak, errors that vanish between layers. Only a full run can prove their absence. In one instructive case, all five defects in a feature were caught by end-to-end tests and zero by unit tests.
Knowing the work will be verified end to end changes how the agent writes it — it starts considering interactions and error paths before the gate, not after. The gate is also a behavioral instrument, not just a filter.
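A minimal sketch of the ordered gate described above, assuming each layer can be expressed as a single command whose exit code signals pass or fail. The `run_gate` helper, the `LayerResult` type, and the layer commands are hypothetical; the commands are stand-ins that only set an exit code, so the example runs without any project tooling.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class LayerResult:
    name: str
    passed: bool
    output: str


def run_gate(layers):
    """Run layers in order, cheapest first, and stop at the first failure.

    `layers` is a list of (name, argv) pairs.
    """
    results = []
    for name, argv in layers:
        proc = subprocess.run(argv, capture_output=True, text=True)
        passed = proc.returncode == 0
        results.append(LayerResult(name, passed, proc.stdout + proc.stderr))
        if not passed:
            break  # later, more expensive layers are not run
    return results


# Hypothetical stand-in commands: each one only sets an exit code.
# In a real harness these would be the project's own lint, test, and
# end-to-end commands.
LAYERS = [
    ("static", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("runtime", [sys.executable, "-c", "raise SystemExit(1)"]),
    ("end-to-end", [sys.executable, "-c", "raise SystemExit(0)"]),
]

if __name__ == "__main__":
    for result in run_gate(LAYERS):
        print(f"{result.name}: {'pass' if result.passed else 'FAIL'}")
```

Because the runtime stand-in exits non-zero, the end-to-end layer never starts, which is the short-circuit behavior the ordering is meant to buy.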
GitHub Copilot Workspace (2024) made a deliberate structural choice: natural language task → specification → plan → code → review, with a hard stop and human review point between each phase. The agent never jumps from "I understand the task" to "I've written the code." Planning is its own phase, reviewable before implementation begins. The architecture treats every transition between phases as a potential correction point: the human can steer the plan before any code is written, or the code before any tests are run. This is the worker/checker separation expressed as product UX rather than harness policy, and the effect is the same: the model cannot talk itself past a bad plan into a confident wrong implementation.
Separate the worker from the checker
Self-evaluation is structurally biased: the same weights that produced the work assess it, and they are generous. The harness fix is an independent evaluator — a separate agent (or deterministic script) whose only job is to try the work like a user would: drive the running app, hit the endpoints, check the database state, and score the result against a rubric with hard floors.
- Worker (generates). Implements against the feature list entry and its verification command. Submits a claim of done, never a verdict.
- Checker (verifies). Independent and deliberately nitpicky. Runs the three layers, interacts with the running system, scores against the rubric, cites evidence for every objection.
- Harness (decides). Owns the state transition. Only a passing verdict moves the feature to passing, with the evidence recorded.
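One way the three roles might be wired together, as a sketch rather than a reference design. The `Claim`, `Verdict`, and `Harness` names are hypothetical, and `stub_checker` is a deterministic stand-in for the independent evaluator: it runs nothing and only checks that a verification command was supplied.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Claim:
    """What the worker submits: a claim of done, not a verdict."""
    feature_id: str
    verify_command: str  # the command-verifiable done condition


@dataclass
class Verdict:
    """What the checker returns, with evidence for its decision."""
    passed: bool
    evidence: list[str]


@dataclass
class Harness:
    """Owns the state transition; the worker never sets status itself."""
    checker: Callable[[Claim], Verdict]
    status: dict[str, str] = field(default_factory=dict)
    evidence_log: dict[str, list[str]] = field(default_factory=dict)

    def submit(self, claim: Claim) -> str:
        verdict = self.checker(claim)
        # Evidence is recorded whether the verdict passes or fails.
        self.evidence_log[claim.feature_id] = verdict.evidence
        self.status[claim.feature_id] = "passing" if verdict.passed else "failing"
        return self.status[claim.feature_id]


def stub_checker(claim: Claim) -> Verdict:
    # Stand-in only: a real checker would run the layers and drive the app.
    if not claim.verify_command:
        return Verdict(False, ["no verification command supplied"])
    return Verdict(True, [f"would run: {claim.verify_command}"])
```

The point of the shape is that `status` is written in exactly one place, inside `Harness.submit`, so a worker's confidence has no path to the feature state except through a checker verdict.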
We gave the evaluator a pass/fail rubric and it started failing everything — correctly, for a while. Then it started approving things too. We read its logs and found it was citing "unclear requirements" for anything ambiguous and marking it pass rather than escalating. The rubric needed a third state: cannot determine — escalate to human. A binary checker in an ambiguous domain will drift toward whichever answer is less work to justify.
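The third state can be made explicit in the rubric's return type. The sketch below assumes a rubric of numeric scores with per-criterion hard floors, where a score of `None` means the checker could not determine that criterion. The `Outcome` enum, the `score` function, and the criterion names in the usage example are illustrative assumptions.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    ESCALATE = "cannot determine: escalate to human"


def score(scores, floors):
    """Apply hard floors to rubric scores.

    scores: criterion -> number, or None / missing when undetermined.
    floors: criterion -> minimum acceptable score.
    Returns (outcome, criteria responsible for that outcome).
    """
    missing = [c for c in floors if scores.get(c) is None]
    below = [c for c in floors
             if scores.get(c) is not None and scores[c] < floors[c]]
    if below:
        return Outcome.FAIL, below       # a known failure outranks ambiguity
    if missing:
        return Outcome.ESCALATE, missing  # ambiguity is never a silent pass
    return Outcome.PASS, []
```

For example, with floors `{"functionality": 3, "error_handling": 2}`, a result that scores functionality at 4 but leaves error handling undetermined returns `Outcome.ESCALATE` rather than drifting to a pass.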
Two disciplines keep the gate honest:
- Function first. Verify core behavior before performance, performance before style — and no refactoring until the core path passes. "While I'm here" improvements before verification are how a passing change becomes a broken one.
- Errors that teach. A bare "Test failed" sends the agent guessing. An agent-oriented failure message states what failed, why, and where to look ("POST /api/reset-password returned 500; check the email-service config in env; the template belongs at templates/reset-email.html"), turning each gate failure into a self-correcting recovery loop instead of a blind retry.
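An agent-oriented failure message can be given a fixed shape so that no gate emits a bare failure. The sketch below is one possible structure; the `GateFailure` type and its field names are assumptions, and the `why` text in the example is a hypothetical diagnosis added for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GateFailure:
    what: str                                        # the observable failure
    why: str                                         # the most likely cause
    where: list[str] = field(default_factory=list)   # places to look next

    def render(self) -> str:
        lines = [f"FAILED: {self.what}", f"Likely cause: {self.why}"]
        lines += [f"Look at: {hint}" for hint in self.where]
        return "\n".join(lines)


failure = GateFailure(
    what="POST /api/reset-password returned 500",
    why="email service is misconfigured (hypothetical diagnosis)",
    where=[
        "the email-service config in env",
        "templates/reset-email.html",
    ],
)

if __name__ == "__main__":
    print(failure.render())
```

Requiring all three fields at construction time means a check author has to supply a cause and a pointer, not only a status.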
Promote review feedback into the gate
The gate should grow. When a human reviewer catches the same class of issue twice, that feedback becomes a rule the harness checks forever: an architectural boundary turned into a lint ("the renderer must not touch the filesystem" as an executable check, not a doc), a recurring review comment turned into a test. Capture taste once, enforce it continuously — the same loop that hardens the eval suite hardens the done-gate.
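As an illustration of a boundary turned into an executable check, the sketch below flags filesystem access in a renderer module's source. It assumes the renderer is written in Python and that "touching the filesystem" means importing `os`, `pathlib`, or `shutil`, or calling the built-in `open`. Both the forbidden list and the `filesystem_violations` name are assumptions, and a real rule would likely need a broader list.

```python
import ast

# Assumed definition of "touching the filesystem" for this sketch.
FORBIDDEN_MODULES = {"os", "pathlib", "shutil"}


def filesystem_violations(source: str) -> list[tuple[int, str]]:
    """Return (line number, message) pairs, sorted by line number."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        elif (isinstance(node, ast.Call)
              and isinstance(node.func, ast.Name)
              and node.func.id == "open"):
            violations.append((node.lineno, "call to open()"))
            continue
        else:
            continue
        for name in names:
            if name.split(".")[0] in FORBIDDEN_MODULES:
                violations.append((node.lineno, f"import of {name}"))
    return sorted(violations)


BAD_RENDERER = (
    "import os.path\n"
    "from pathlib import Path\n"
    "\n"
    "def render(template):\n"
    "    return open(template).read()\n"
)

if __name__ == "__main__":
    for lineno, message in filesystem_violations(BAD_RENDERER):
        print(f"line {lineno}: {message}")
```

A check like this would sit in the static layer and fail the gate on any non-empty result, so the rule is enforced on every change instead of being rediscovered in review.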
The principle behind all of it: enforce invariants; don't micromanage implementation. The harness pins down what must be true — parsed at the boundary, no cross-layer imports, all three layers green — and leaves the model free in how it gets there.
An agent that cannot verify its own output has no definition of done — only a feeling. The gap between confidence and correctness is where unfinished work ships.
- Every done-condition must be command-verifiable — not a description, a command you can run.
- The checker must be structurally independent: different process, different prompt, different perspective.
- Verification gates are behavioral instruments, not just filters — knowing the gate exists changes how the agent builds toward it.
Pitfalls
- "Done" by vibes — no written Definition of Done means the agent substitutes its own, and its own is "looks right." Write the conditions as commands; see the Happy Path Mirage.
- Unit-green as system-green — skipping Layer 3 because Layers 1–2 pass is exactly the gap component-boundary defects live in.
- A checker that wants to say yes — evaluators drift agreeable: they spot a real issue, then talk themselves out of it. Tune the checker by reading its logs against human verdicts and tightening the rubric where they diverge.
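Reading checker logs against human verdicts can start with a simple tally. The sketch below assumes each log record reduces to an item id, the checker's verdict, and a human verdict on the same item. The record shape, the verdict strings, and both function names are assumptions.

```python
from collections import Counter


def divergences(records):
    """Tally (checker, human) verdict pairs and list the disagreements.

    records: iterable of (item_id, checker_verdict, human_verdict).
    """
    counts = Counter()
    disagreements = []
    for item_id, checker, human in records:
        counts[(checker, human)] += 1
        if checker != human:
            disagreements.append(item_id)
    return counts, disagreements


def false_pass_rate(counts):
    """Share of human-failed items that the checker passed."""
    human_fail = sum(n for (_, human), n in counts.items() if human == "fail")
    return counts[("pass", "fail")] / human_fail if human_fail else 0.0
```

The false-pass rate is the number to watch for an agreeable checker, and the disagreement list gives the specific logs to read when tightening the rubric.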