Evaluation Patterns

These patterns address a hard question: how do you know if an agent's output is good?

LLM outputs are non-deterministic. The same prompt can produce different results, and "correct" is often subjective. Evaluation patterns give you systematic ways to measure quality — at runtime (during agent execution) and offline (during development and CI/CD).

Without evaluation, you're shipping on vibes. With it, you can gate deployments, catch regressions, and continuously improve.


Patterns

| Pattern | Signal Type | Runtime / Offline |
| --- | --- | --- |
| LLM-as-Judge | LLM scores another LLM's output | Both |
| Test-Driven Evaluation | Executable tests produce pass/fail | Both |
| Domain Metrics | Composite scoring across quality dimensions | Both |
| Eval Suite | Regression testing against curated datasets | Offline |
| Human Feedback | Human ratings and corrections on live output | Collection: runtime, Use: offline |

How to Choose

Start with Test-Driven Evaluation if your agent produces executable artifacts (code, SQL, API calls). Tests give you ground-truth correctness signals — no opinions, no bias.
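
A minimal sketch of the idea: execute the agent's generated code, then run assertions against it, and treat any exception as a failure. The `slugify` snippet standing in for agent output is hypothetical.

```python
def run_case(generated_code: str, test_code: str) -> bool:
    """Execute agent-generated code, then its tests; True means all assertions pass."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # define the agent's artifact
        exec(test_code, namespace)       # assertions are the ground-truth grader
        return True
    except Exception:
        return False

# Hypothetical agent output and its executable test:
agent_output = "def slugify(s):\n    return s.strip().lower().replace(' ', '-')\n"
tests = "assert slugify(' Hello World ') == 'hello-world'\n"

print(run_case(agent_output, tests))  # True
```

In a real harness you would run untrusted generated code in a sandbox (subprocess, container) rather than `exec` in-process; the pass/fail signal is the same.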

Add LLM-as-Judge for anything that can't be tested programmatically — writing quality, helpfulness, tone, reasoning quality. This is the most flexible evaluation method but requires calibration.

Add Domain Metrics when you need multi-dimensional quality measurement (faithfulness + relevance + groundedness for RAG, coherence + consistency + fluency for summarization).
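
Composite scoring usually reduces to a weighted average over dimension scores. A minimal sketch, assuming each dimension has already been scored in [0, 1] by some upstream evaluator; the dimension names and weights are illustrative:

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average across quality dimensions; each score in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight

# Hypothetical RAG evaluation, weighting faithfulness most heavily:
rag = composite_score(
    {"faithfulness": 0.9, "relevance": 0.8, "groundedness": 0.7},
    {"faithfulness": 2.0, "relevance": 1.0, "groundedness": 1.0},
)
print(round(rag, 3))  # 0.825
```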

Build an Eval Suite when you're ready for production. Curate test cases, run them in CI/CD, gate deployments on quality thresholds. This is how you stop regressions.
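
The CI/CD gate can be as simple as a pass-rate threshold over the curated cases. A sketch, assuming `evaluate` is any per-case checker (test harness, judge, or metric) and the cases and threshold are illustrative:

```python
def run_suite(cases: list[dict], evaluate, threshold: float = 0.9) -> bool:
    """Run every curated case; return True only if the pass rate meets the gate."""
    passed = sum(1 for case in cases if evaluate(case))
    rate = passed / len(cases)
    print(f"pass rate: {rate:.0%} (gate: {threshold:.0%})")
    return rate >= threshold

# Hypothetical curated dataset and checker:
cases = [{"input": "2+2", "expected": "4"}, {"input": "3*3", "expected": "9"}]
ok = run_suite(cases, lambda c: str(eval(c["input"])) == c["expected"])
# In CI: sys.exit(0 if ok else 1) so a failing run blocks the deployment.
```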

Collect Human Feedback to calibrate your automated evaluations. Human ratings are the ground truth that all other evaluation methods approximate.


Relationship to Other Patterns

Several evaluation-adjacent patterns live in other categories:

  • Reflection (Reasoning) — agent critiques its own output as a reasoning strategy
  • Evaluator-Optimizer (Orchestration) — generator + evaluator in a refinement loop
  • Guardrails (Output) — safety/policy validation on output

The patterns here focus on measuring quality rather than improving it inline. They produce scores, not revisions.