
Verification & Definition of Done

The most common agent failure is not bad code — it's a confident "I'm done" when it isn't. Models are systematically overconfident about their own work, and the gap between the agent's confidence and actual correctness — the verification gap — is where unfinished work ships. The fix is structural: a written, command-verifiable Definition of Done, checked in layers, by something other than the model that did the work.

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Frontier models resolve 1–12% of real GitHub issues with basic scaffolding. Purpose-built agentic harnesses with structured verification push verified completion rates into the 40–70% range on the same benchmark — a 5–10× gap driven almost entirely by how the harness structures the work, not the underlying model.

An agent must never be the sole judge of its own "done." The same model that generated the work grades it generously — worst of all on anything subjective. The solution is to separate the worker from the checker.

  • 3 gated layers: static, runtime, end-to-end
  • Command-verifiable: every done-condition is something you can run
  • Worker ≠ checker: an independent evaluator grades the work
  • 1.6 → 4.9: quality out of 5 as checker and planner roles are added

Without a written Definition of Done, the agent invents its own, and its version is "the code looks right."

Three layers, each gating the next

Termination criteria live in the harness as an ordered gate. Each layer is cheaper than the one after it, and failing any layer stops the climb — there is no point end-to-end testing code that doesn't type-check.

Layer 1 (static). Lint, type checks, build. Cheapest, catches the mechanical.

Layer 2 (runtime). The test suite passes, the app reaches ready state, the critical path executes. This is the core completion evidence: code that compiles but never ran proves nothing.

Layer 3 (end-to-end). The full pipeline, wired together. This layer exists because of component-boundary defects: module A and module B each pass their mocked unit tests and still break when connected. Typical causes are interface mismatches, state that doesn't propagate, resources that leak, and errors that vanish between layers. Mocked tests cannot surface these defects; only a full run can. In one instructive case, all five defects in a feature were caught by end-to-end tests and zero by unit tests.
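
The ordered gate can be sketched in a few lines of Python. This is a minimal illustration, not a reference harness: `run_gate`, `LayerResult`, and `LAYERS` are hypothetical names, and the commands are stand-ins that only print or exit so the sketch runs anywhere. In a real project each entry would be the project's own lint/type-check/build, test, and end-to-end command.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class LayerResult:
    layer: str
    passed: bool
    output: str


def run_gate(layers):
    """Run each (name, command) pair in order and stop at the first failure."""
    results = []
    for name, command in layers:
        proc = subprocess.run(command, capture_output=True, text=True)
        passed = proc.returncode == 0
        results.append(LayerResult(name, passed, proc.stdout + proc.stderr))
        if not passed:
            break  # failing any layer stops the climb
    return results


# Placeholder commands (assumptions for illustration only).
LAYERS = [
    ("static", [sys.executable, "-c", "print('lint ok')"]),
    ("runtime", [sys.executable, "-c", "import sys; sys.exit(1)"]),
    ("end-to-end", [sys.executable, "-c", "print('e2e ok')"]),
]

for result in run_gate(LAYERS):
    print(result.layer, "PASS" if result.passed else "FAIL")
```

With these placeholders the runtime layer fails, so the end-to-end command is never started: the more expensive layer is only paid for once the cheaper ones are green.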

Knowing the work will be verified end to end changes how the agent writes it — it starts considering interactions and error paths before the gate, not after. The gate is also a behavioral instrument, not just a filter.


Copilot Workspace: structural separation of planning, coding, and review

GitHub Copilot Workspace (2024) made a deliberate structural choice: natural language task → specification → plan → code → review, with a hard stop and a human review point between phases. The agent never jumps from "I understand the task" to "I've written the code." Planning is its own phase, reviewable before implementation begins. The architecture treats every transition between phases as a potential correction point: the human can steer the plan before any code is written, or the code before any tests are run. This is the worker/checker separation expressed as product UX rather than harness policy, and the effect is the same: the model cannot talk itself past a bad plan into a confident wrong implementation.

Key finding: making each step reviewable and correctable before the next one begins is the architectural move — not better prompts.

Separate the worker from the checker

Self-evaluation is structurally biased: the same weights that produced the work assess it, and they are generous. The harness fix is an independent evaluator, a separate agent (or deterministic script) whose only job is to exercise the work as a user would. It drives the running app, hits the endpoints, checks the database state, and scores the result against a rubric with hard floors.

Worker (generates). Implements against the feature list entry and its verification command. Submits a claim of done, never a verdict.

Checker (verifies). Independent and deliberately nitpicky. Runs the three layers, interacts with the running system, scores against the rubric, cites evidence for every objection.

Harness (decides). Owns the state transition. Only a passing verdict moves the feature to passing, with the evidence recorded.
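
One way to make the three roles concrete is to let the harness own the only code path that can set a feature to passing. The sketch below is an assumption-laden illustration: `Feature`, `CheckerVerdict`, `worker_claim`, and `harness_apply` are hypothetical names, and a real checker would produce the verdict by actually running the gate rather than being handed one.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    CLAIMED = "claimed"
    PASSING = "passing"


@dataclass
class CheckerVerdict:
    passed: bool
    evidence: list  # e.g. command output or log excerpts cited by the checker


@dataclass
class Feature:
    name: str
    status: Status = Status.PENDING
    evidence: list = field(default_factory=list)


def worker_claim(feature):
    """The worker may only claim done; it never sets PASSING itself."""
    feature.status = Status.CLAIMED


def harness_apply(feature, verdict):
    """The harness owns the transition and requires evidence for a pass."""
    if feature.status is not Status.CLAIMED:
        raise ValueError("no claim of done to verify")
    if verdict.passed and verdict.evidence:
        feature.status = Status.PASSING
        feature.evidence = list(verdict.evidence)
    else:
        feature.status = Status.PENDING  # back to the worker
    return feature.status
```

A passing verdict with no evidence is treated as a failure here, which is one possible reading of "with the evidence recorded".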

Adding role separation is the single largest measured quality lever — in role-separation experiments scored on a five-point rubric, output climbed from 1.6 (one agent does everything) to 3.3 (worker + checker) to 4.9 (planner + worker + checker).
Field note

We gave the evaluator a pass/fail rubric and it started failing everything — correctly, for a while. Then it started approving things too. We read its logs and found it was citing "unclear requirements" for anything ambiguous and marking it pass rather than escalating. The rubric needed a third state: cannot determine — escalate to human. A binary checker in an ambiguous domain will drift toward whichever answer is less work to justify.

post-mortem, Q3 review
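
The third state from the field note might look like the following sketch. `Outcome` and `grade` are hypothetical names, and the convention that each rubric item is recorded as `True`, `False`, or `None` (could not determine) is an assumption made for illustration.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    ESCALATE = "cannot determine, escalate to human"


def grade(observations):
    """Combine per-item rubric results into one verdict.

    Each observation is True (met), False (not met), or None (the checker
    could not determine the answer, for example because the requirement
    is ambiguous).
    """
    if not observations:
        return Outcome.ESCALATE  # nothing was checked, so nothing is proven
    if any(item is False for item in observations):
        return Outcome.FAIL  # a definite failure outranks uncertainty
    if any(item is None for item in observations):
        return Outcome.ESCALATE  # ambiguity goes to a human, never to pass
    return Outcome.PASS
```

The point of the ordering is that ambiguity can no longer be absorbed into either binary answer: an unclear requirement has exactly one place to go.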

Two disciplines keep the gate honest:

  • Function first. Verify core behavior before performance, performance before style — and no refactoring until the core path passes. "While I'm here" improvements before verification are how a passing change becomes a broken one.
  • Errors that teach. A bare Test failed sends the agent guessing. An agent-oriented failure message states what failed, why, and where to look — "POST /api/reset-password returned 500; check the email-service config in env; the template belongs at templates/reset-email.html" — turning each gate failure into a self-correcting recovery loop instead of a blind retry.
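
A failure message that teaches can be enforced by construction. In this sketch, `GateFailure` is a hypothetical type that refuses to exist without all three parts; the field values are adapted from the reset-password example above, and the output labels are arbitrary choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateFailure:
    what: str   # what failed
    why: str    # the likely cause
    where: str  # where to look

    def __post_init__(self):
        # Reject bare failures: every part must carry real content.
        for name in ("what", "why", "where"):
            if not getattr(self, name).strip():
                raise ValueError(f"gate failure is missing '{name}'")

    def render(self):
        return (
            f"FAILED: {self.what}\n"
            f"LIKELY CAUSE: {self.why}\n"
            f"LOOK AT: {self.where}"
        )


failure = GateFailure(
    what="POST /api/reset-password returned 500",
    why="email-service config in env",
    where="templates/reset-email.html",
)
print(failure.render())
```

Constructing `GateFailure("Test failed", "", "")` raises a `ValueError`, so a bare "Test failed" cannot reach the agent through this path.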

Promote review feedback into the gate

The gate should grow. When a human reviewer catches the same class of issue twice, that feedback becomes a rule the harness checks forever: an architectural boundary turned into a lint ("the renderer must not touch the filesystem" as an executable check, not a doc), a recurring review comment turned into a test. Capture taste once, enforce it continuously — the same loop that hardens the eval suite hardens the done-gate.
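
As an illustration of a boundary turned into a lint, the sketch below checks Python source for filesystem access using the standard `ast` module. It assumes the renderer is written in Python, and both the `FORBIDDEN_MODULES` set and the function name `filesystem_violations` are hypothetical. It is deliberately shallow: it catches direct imports and bare `open()` calls, not every possible route to the filesystem.

```python
import ast

# Modules treated as filesystem access for this rule (an assumed list).
FORBIDDEN_MODULES = {"os", "pathlib", "shutil", "tempfile"}


def filesystem_violations(source, filename="<renderer>"):
    """Return sorted (line, message) pairs for filesystem use in source."""
    violations = []
    tree = ast.parse(source, filename=filename)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in FORBIDDEN_MODULES:
                    violations.append((node.lineno, f"import {alias.name}"))
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"
            if node.module and node.module.split(".")[0] in FORBIDDEN_MODULES:
                violations.append((node.lineno, f"from {node.module} import"))
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id == "open":
                violations.append((node.lineno, "call to open()"))
    return sorted(violations)


renderer_source = "import os\n\ndef render(path):\n    return open(path).read()\n"
for line, message in filesystem_violations(renderer_source):
    print(f"line {line}: {message}")
```

Run as part of Layer 1, a check like this turns the review comment into something that fails the gate instead of something a reviewer has to remember.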

The principle behind all of it: enforce invariants; don't micromanage implementation. The harness pins down what must be true — parsed at the boundary, no cross-layer imports, all three layers green — and leaves the model free in how it gets there.


The verification gap

An agent that cannot verify its own output has no definition of done — only a feeling. The gap between confidence and correctness is where unfinished work ships.

  • Every done-condition must be command-verifiable — not a description, a command you can run.
  • The checker must be structurally independent: different process, different prompt, different perspective.
  • Verification gates are behavioral instruments, not just filters — knowing the gate exists changes how the agent builds toward it.

Pitfalls

  • "Done" by vibes — no written Definition of Done means the agent substitutes its own, and its own is "looks right." Write the conditions as commands; see the Happy Path Mirage.
  • Unit-green as system-green — skipping Layer 3 because Layers 1–2 pass is exactly the gap component-boundary defects live in.
  • A checker that wants to say yes — evaluators drift agreeable: they spot a real issue, then talk themselves out of it. Tune the checker by reading its logs against human verdicts and tightening the rubric where they diverge.
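
Reading checker logs against human verdicts can start as a simple comparison. The sketch below assumes both sets of verdicts are available as dictionaries mapping a feature name to "pass" or "fail"; the `divergence` function and the sample data are hypothetical.

```python
def divergence(checker, human):
    """Compare checker verdicts with human verdicts on shared features."""
    shared = sorted(checker.keys() & human.keys())
    # The agreeable-checker failure mode: checker passed, human failed.
    false_pass = [k for k in shared if checker[k] == "pass" and human[k] == "fail"]
    false_fail = [k for k in shared if checker[k] == "fail" and human[k] == "pass"]
    agreed = len(shared) - len(false_pass) - len(false_fail)
    return {
        "agreement": agreed / len(shared) if shared else None,
        "false_pass": false_pass,
        "false_fail": false_fail,
    }


# Illustrative data only, not measured results.
checker_log = {"login": "pass", "reset": "pass", "export": "fail", "search": "pass"}
human_review = {"login": "pass", "reset": "fail", "export": "fail", "search": "pass"}

report = divergence(checker_log, human_review)
print(report)
```

The `false_pass` list is the one to read first: each entry is a case where the rubric let the checker say yes, and a candidate for a tighter rule.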