Test-Driven Evaluation

Use executable tests — unit tests, integration tests, property checks, or query execution — as the evaluation signal for agent output. The agent generates an artifact, the system executes it against a test harness, and pass/fail results provide a ground-truth correctness signal. No opinions, no bias — the code either works or it doesn't.

Where it applies, this is the most reliable evaluation pattern available.


Structure

The agent produces output (code, SQL, config). The test runner executes it against existing or generated tests. Failures are fed back to the agent with error details for self-repair.
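As a minimal sketch of that structure, the runner below loads an agent-generated artifact, executes a list of test assertions against it, and collects error messages that could be fed back for self-repair. The function name `run_tests` and the in-process `exec`-based harness are illustrative assumptions; a production harness would execute in an isolated environment:

```python
def run_tests(artifact_code: str, tests: list[str]) -> list[str]:
    """Execute an agent-generated artifact against a list of test assertions.

    Returns error messages for failing tests; an empty list means all passed.
    (Hypothetical harness: runs in-process via exec for brevity.)
    """
    namespace: dict = {}
    failures: list[str] = []
    try:
        exec(artifact_code, namespace)  # load the artifact's definitions
    except Exception as exc:
        return [f"artifact failed to load: {exc!r}"]
    for test in tests:
        try:
            exec(test, namespace)  # each test asserts against the artifact
        except AssertionError as exc:
            failures.append(f"FAIL {test}: {exc}")
        except Exception as exc:
            failures.append(f"ERROR {test}: {exc!r}")
    return failures

# A buggy artifact fails its test; the failure text becomes agent feedback.
bad = "def add(a, b):\n    return a - b"
print(run_tests(bad, ["assert add(2, 3) == 5"]))
```

The key design point is that the return value is not a score but a list of concrete error messages, which is exactly what a repair prompt needs.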


How It Works

  1. Generate — agent produces an executable artifact (code, query, configuration)
  2. Execute — artifact runs in a test environment against a test suite
  3. Evaluate — test results produce concrete pass/fail signals with error messages
  4. Feedback — failures are fed back to the agent with stack traces and assertion errors
  5. Iterate — agent revises the artifact based on concrete error signals
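The five steps above can be sketched as a repair loop. Here a scripted list of candidate artifacts stands in for successive agent revisions; in a real system each failure message would be sent back to the model as a repair prompt. The names `evaluate` and `repair_loop` are assumptions for illustration:

```python
def evaluate(code: str, test: str):
    """Execute artifact then test; return an error string, or None on success."""
    ns: dict = {}
    try:
        exec(code, ns)
        exec(test, ns)
        return None
    except Exception as exc:
        return repr(exc)

def repair_loop(candidates, test):
    """Generate -> execute -> evaluate -> feedback -> iterate.

    `candidates` stands in for the agent's successive revisions; each
    collected error is the feedback the agent would revise against.
    """
    history = []
    for code in candidates:
        error = evaluate(code, test)
        if error is None:
            return code, history  # passing artifact plus feedback trail
        history.append(error)
    return None, history  # budget exhausted without a passing artifact

drafts = [
    "def slug(s):\n    return s.replace(' ', '-')",         # misses lowercasing
    "def slug(s):\n    return s.lower().replace(' ', '-')", # revised draft
]
passing, errors = repair_loop(drafts, "assert slug('Hello World') == 'hello-world'")
```

Bounding the number of candidates (here, the length of the list) is what keeps the loop from iterating forever on an unfixable artifact.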

Test sources:

  • Existing tests — run the project's test suite against the agent's changes (the SWE-bench approach)
  • Generated tests — agent writes tests first, then writes the implementation
  • Property-based — define invariants that must hold (output is valid JSON, query returns rows)
  • Execution-based — does the code run without errors? Does the SQL return results?
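The last two sources can be as simple as a pair of invariant checks. This sketch uses the two examples named above: output parses as valid JSON, and a generated query executes and returns rows. The function names and the in-memory SQLite fixture are assumptions for illustration:

```python
import json
import sqlite3

def json_invariant(output: str) -> bool:
    """Property check: the agent's output parses as valid JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def query_returns_rows(sql: str) -> bool:
    """Execution check: the generated query runs and returns at least one row."""
    conn = sqlite3.connect(":memory:")  # hypothetical fixture schema
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'ada')")
    try:
        return len(conn.execute(sql).fetchall()) > 0
    except sqlite3.Error:
        return False
```

Checks like these are weaker than a full test suite, but they apply even when no reference implementation or expected output exists.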

Key Characteristics

  • Ground truth — tests produce binary correctness signals, not opinions
  • Self-debugging — error messages give the agent concrete feedback to act on
  • Limited scope — only works for executable artifacts (code, queries, configs)
  • Test quality matters — bad tests give false confidence
  • Infrastructure required — needs a sandboxed execution environment
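On the infrastructure point, the minimal isolation step is to run generated code in a separate interpreter process with a wall-clock limit rather than in-process. The sketch below assumes that shape; a real harness would add OS-level isolation (containers, seccomp, resource limits) on top:

```python
import os
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout: float = 5.0):
    """Execute generated code in a child interpreter with a timeout.

    Returns (returncode, stderr); returncode is None on timeout.
    Process separation plus a timeout is a starting point, not a sandbox:
    real isolation needs containers or OS-level resource limits.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode, proc.stderr
    except subprocess.TimeoutExpired:
        return None, f"timed out after {timeout}s"
    finally:
        os.unlink(path)
```

Capturing stderr matters as much as the exit code: the stack trace is the feedback signal the agent repairs against.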

When to Use

  • The agent generates executable output (code, SQL, API calls, data transformations)
  • Correctness is verifiable by execution (not just by reading)
  • You want the highest-confidence evaluation signal available
  • The agent should self-repair based on concrete error messages
  • You are building coding agents, data pipeline agents, or infrastructure automation agents