Test-Driven Evaluation

Use executable tests — unit tests, integration tests, property checks, or query execution — as the evaluation signal for agent output. The agent generates an artifact, the system executes it against a test harness, and pass/fail results provide a ground-truth correctness signal. No opinions, no bias — the code either works or it doesn't.

Where it applies, this is the most reliable evaluation pattern available.


Structure

The agent produces output (code, SQL, config). The test runner executes it against existing or generated tests. Failures are fed back to the agent with error details for self-repair.
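As a minimal sketch of that structure, the runner below loads an agent-generated artifact, executes a list of test assertions against it, and collects error messages that could be fed back for self-repair. The function name `run_tests` and the in-process `exec`-based harness are illustrative assumptions; a production harness would execute in an isolated environment:

```python
def run_tests(artifact_code: str, tests: list[str]) -> list[str]:
    """Execute an agent-generated artifact against a list of test assertions.

    Returns error messages for failing tests; an empty list means all passed.
    (Hypothetical harness: runs in-process via exec for brevity.)
    """
    namespace: dict = {}
    failures: list[str] = []
    try:
        exec(artifact_code, namespace)  # load the artifact's definitions
    except Exception as exc:
        return [f"artifact failed to load: {exc!r}"]
    for test in tests:
        try:
            exec(test, namespace)  # each test asserts against the artifact
        except AssertionError as exc:
            failures.append(f"FAIL {test}: {exc}")
        except Exception as exc:
            failures.append(f"ERROR {test}: {exc!r}")
    return failures

# A buggy artifact fails its test; the failure text becomes agent feedback.
bad = "def add(a, b):\n    return a - b"
print(run_tests(bad, ["assert add(2, 3) == 5"]))
```

The key design point is that the return value is not a score but a list of concrete error messages, which is exactly what a repair prompt needs.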


How It Works

  1. Generate — agent produces an executable artifact (code, query, configuration)
  2. Execute — artifact runs in a test environment against a test suite
  3. Evaluate — test results produce concrete pass/fail signals with error messages
  4. Feedback — failures are fed back to the agent with stack traces and assertion errors
  5. Iterate — agent revises the artifact based on concrete error signals
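The five steps above can be sketched as a repair loop. Here a scripted list of candidate artifacts stands in for successive agent revisions; in a real system each failure message would be sent back to the model as a repair prompt. The names `evaluate` and `repair_loop` are assumptions for illustration:

```python
def evaluate(code: str, test: str):
    """Execute artifact then test; return an error string, or None on success."""
    ns: dict = {}
    try:
        exec(code, ns)
        exec(test, ns)
        return None
    except Exception as exc:
        return repr(exc)

def repair_loop(candidates, test):
    """Generate -> execute -> evaluate -> feedback -> iterate.

    `candidates` stands in for the agent's successive revisions; each
    collected error is the feedback the agent would revise against.
    """
    history = []
    for code in candidates:
        error = evaluate(code, test)
        if error is None:
            return code, history  # passing artifact plus feedback trail
        history.append(error)
    return None, history  # budget exhausted without a passing artifact

drafts = [
    "def slug(s):\n    return s.replace(' ', '-')",         # misses lowercasing
    "def slug(s):\n    return s.lower().replace(' ', '-')", # revised draft
]
passing, errors = repair_loop(drafts, "assert slug('Hello World') == 'hello-world'")
```

Bounding the number of candidates (here, the length of the list) is what keeps the loop from iterating forever on an unfixable artifact.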

Test sources:

  • Existing tests — run the project's test suite against the agent's changes (the SWE-bench approach)
  • Generated tests — agent writes tests first, then writes the implementation
  • Property-based — define invariants that must hold (output is valid JSON, query returns rows)
  • Execution-based — does the code run without errors? Does the SQL return results?
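The last two sources can be as simple as a pair of invariant checks. This sketch uses the two examples named above: output parses as valid JSON, and a generated query executes and returns rows. The function names and the in-memory SQLite fixture are assumptions for illustration:

```python
import json
import sqlite3

def json_invariant(output: str) -> bool:
    """Property check: the agent's output parses as valid JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def query_returns_rows(sql: str) -> bool:
    """Execution check: the generated query runs and returns at least one row."""
    conn = sqlite3.connect(":memory:")  # hypothetical fixture schema
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'ada')")
    try:
        return len(conn.execute(sql).fetchall()) > 0
    except sqlite3.Error:
        return False
```

Checks like these are weaker than a full test suite, but they apply even when no reference implementation or expected output exists.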

Key Characteristics

  • Ground truth — tests produce binary correctness signals, not opinions
  • Self-debugging — error messages give the agent concrete feedback to act on
  • Limited scope — only works for executable artifacts (code, queries, configs)
  • Test quality matters — bad tests give false confidence
  • Infrastructure required — needs a sandboxed execution environment
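On the infrastructure point, the minimal isolation step is to run generated code in a separate interpreter process with a wall-clock limit rather than in-process. The sketch below assumes that shape; a real harness would add OS-level isolation (containers, seccomp, resource limits) on top:

```python
import os
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout: float = 5.0):
    """Execute generated code in a child interpreter with a timeout.

    Returns (returncode, stderr); returncode is None on timeout.
    Process separation plus a timeout is a starting point, not a sandbox:
    real isolation needs containers or OS-level resource limits.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode, proc.stderr
    except subprocess.TimeoutExpired:
        return None, f"timed out after {timeout}s"
    finally:
        os.unlink(path)
```

Capturing stderr matters as much as the exit code: the stack trace is the feedback signal the agent repairs against.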

When to Use

  • The agent generates executable output (code, SQL, API calls, data transformations)
  • Correctness is verifiable by execution (not just by reading)
  • You want the highest-confidence evaluation signal available
  • The agent should self-repair based on concrete error messages
  • You are building coding agents, data pipeline agents, or infrastructure automation agents