Evaluation Patterns

These patterns address a hard question: how do you know if an agent's output is good?

LLM outputs are non-deterministic. The same prompt can produce different results, and "correct" is often subjective. Evaluation patterns give you systematic ways to measure quality — at runtime (during agent execution) and offline (during development and CI/CD).

Without evaluation, you're shipping on vibes. With it, you can gate deployments, catch regressions, and continuously improve.


Patterns

| Pattern | Signal Type | Runtime / Offline |
| --- | --- | --- |
| LLM-as-Judge | LLM scores another LLM's output | Both |
| Test-Driven Evaluation | Executable tests produce pass/fail | Both |
| Domain Metrics | Composite scoring across quality dimensions | Both |
| Eval Suite | Regression testing against curated datasets | Offline |
| Human Feedback | Human ratings and corrections on live output | Collection: runtime, Use: offline |

How to Choose

Start with Test-Driven Evaluation if your agent produces executable artifacts (code, SQL, API calls). Tests give you ground-truth correctness signals — no opinions, no bias.
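
A minimal sketch of the idea: execute the agent's generated code, then run assertions against it, and treat any exception as a failure. The `slugify` snippet standing in for agent output is hypothetical.

```python
def run_case(generated_code: str, test_code: str) -> bool:
    """Execute agent-generated code, then its tests; True means all assertions pass."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # define the agent's artifact
        exec(test_code, namespace)       # assertions are the ground-truth grader
        return True
    except Exception:
        return False

# Hypothetical agent output and its executable test:
agent_output = "def slugify(s):\n    return s.strip().lower().replace(' ', '-')\n"
tests = "assert slugify(' Hello World ') == 'hello-world'\n"

print(run_case(agent_output, tests))  # True
```

In a real harness you would run untrusted generated code in a sandbox (subprocess, container) rather than `exec` in-process; the pass/fail signal is the same.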

Add LLM-as-Judge for anything that can't be tested programmatically — writing quality, helpfulness, tone, reasoning quality. This is the most flexible evaluation method but requires calibration.

Add Domain Metrics when you need multi-dimensional quality measurement (faithfulness + relevance + groundedness for RAG, coherence + consistency + fluency for summarization).
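
Composite scoring usually reduces to a weighted average over dimension scores. A minimal sketch, assuming each dimension has already been scored in [0, 1] by some upstream evaluator; the dimension names and weights are illustrative:

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average across quality dimensions; each score in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight

# Hypothetical RAG evaluation, weighting faithfulness most heavily:
rag = composite_score(
    {"faithfulness": 0.9, "relevance": 0.8, "groundedness": 0.7},
    {"faithfulness": 2.0, "relevance": 1.0, "groundedness": 1.0},
)
print(round(rag, 3))  # 0.825
```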

Build an Eval Suite when you're ready for production. Curate test cases, run them in CI/CD, gate deployments on quality thresholds. This is how you stop regressions.
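
The CI/CD gate can be as simple as a pass-rate threshold over the curated cases. A sketch, assuming `evaluate` is any per-case checker (test harness, judge, or metric) and the cases and threshold are illustrative:

```python
def run_suite(cases: list[dict], evaluate, threshold: float = 0.9) -> bool:
    """Run every curated case; return True only if the pass rate meets the gate."""
    passed = sum(1 for case in cases if evaluate(case))
    rate = passed / len(cases)
    print(f"pass rate: {rate:.0%} (gate: {threshold:.0%})")
    return rate >= threshold

# Hypothetical curated dataset and checker:
cases = [{"input": "2+2", "expected": "4"}, {"input": "3*3", "expected": "9"}]
ok = run_suite(cases, lambda c: str(eval(c["input"])) == c["expected"])
# In CI: sys.exit(0 if ok else 1) so a failing run blocks the deployment.
```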

Collect Human Feedback to calibrate your automated evaluations. Human ratings are the ground truth that all other evaluation methods approximate.


Relationship to Other Patterns

Several evaluation-adjacent patterns live in other categories:

  • Reflection (Reasoning) — agent critiques its own output as a reasoning strategy
  • Evaluator-Optimizer (Orchestration) — generator + evaluator in a refinement loop
  • Guardrails (Output) — safety/policy validation on output

The patterns here focus on measuring quality rather than improving it inline. They produce scores, not revisions.