Domain Metrics
Evaluate agent output using a composite of domain-specific metrics tailored to the task type. Rather than producing a single score, this pattern measures quality along several distinct dimensions, such as faithfulness, relevance, groundedness, and coherence. The result is a quality profile rather than one number.
RAGAS, a metric suite for evaluating retrieval-augmented generation (RAG), is the canonical example.
Structure
Each metric evaluates a different quality dimension independently. The composite profile reveals where quality is strong and where it breaks down — a response can be highly relevant but poorly grounded.
How It Works
- Define dimensions — identify the quality axes that matter for your domain
- Implement metrics — each dimension gets its own scoring function (LLM-based, programmatic, or hybrid)
- Score independently — run each metric on the output
- Compose profile — aggregate into a multi-dimensional quality profile
- Set thresholds — define minimum acceptable scores per dimension
Common metric sets:
- RAG: faithfulness, answer relevance, context precision, context recall (RAGAS)
- Summarization: coherence, consistency, fluency, relevance
- Code generation: correctness, efficiency, readability, test coverage
- Agents: task completion rate, tool use efficiency, step count, cost
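The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the RAGAS implementation: the `overlap` helper, the two dimension names, and the threshold values are assumptions made for the example, and token overlap is only a crude programmatic stand-in for the LLM-based scoring a real metric would use.

```python
from typing import Callable, Dict, List, Set

# A metric maps (question, context, answer) to a score in [0, 1].
Metric = Callable[[str, str, str], float]


def _tokens(text: str) -> Set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    stripped = (w.strip(".,;:!?").lower() for w in text.split())
    return {w for w in stripped if w}


def overlap(source: str, target: str) -> float:
    """Fraction of target tokens that also appear in source (crude proxy)."""
    tgt = _tokens(target)
    if not tgt:
        return 0.0
    return len(tgt & _tokens(source)) / len(tgt)


# Steps 1-2: name the dimensions and give each its own scoring function.
# Both functions are hypothetical placeholders for real metrics.
METRICS: Dict[str, Metric] = {
    # How much of the answer is supported by the retrieved context?
    "groundedness": lambda q, ctx, ans: overlap(ctx, ans),
    # How much of the question does the answer address?
    "relevance": lambda q, ctx, ans: overlap(ans, q),
}

# Step 5: minimum acceptable score per dimension (illustrative values).
THRESHOLDS: Dict[str, float] = {"groundedness": 0.7, "relevance": 0.5}


def score_profile(question: str, context: str, answer: str) -> Dict[str, float]:
    """Steps 3-4: run every metric independently and collect the profile."""
    return {name: fn(question, context, answer) for name, fn in METRICS.items()}


def failing_dimensions(profile: Dict[str, float],
                       thresholds: Dict[str, float]) -> List[str]:
    """Names of the dimensions that fall below their threshold."""
    return [name for name, minimum in thresholds.items() if profile[name] < minimum]


question = "When was the bridge opened?"
context = "The bridge was opened in 1932 after six years of construction."
answer = "The bridge was opened in 1950 by the mayor."

profile = score_profile(question, context, answer)
print(profile)                                  # groundedness 0.625, relevance 0.8
print(failing_dimensions(profile, THRESHOLDS))  # ['groundedness']
```

The answer addresses the question but states a date the context does not contain, so the profile shows acceptable relevance and failing groundedness. A single blended score would not make that distinction visible.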
Key Characteristics
- Multi-dimensional — reveals where quality breaks down, not just "good or bad"
- Domain-specific — metrics must be designed for each task type
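- Actionable — each low score points to a likely cause. In a RAG pipeline, low faithfulness means the answer makes claims the retrieved context does not support (a generation problem, or retrieval that missed the needed passages); low context precision or recall points to the retriever itself; low answer relevance means the response does not address the question that was asked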
- Setup cost — designing and validating metric sets requires domain expertise
- Composability — metrics can be mixed and matched across use cases
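One way to get the composability described above is to keep metrics in a shared registry and define each use case as a list of metric names. The sketch below assumes this design; the registry, the decorator, the three toy metrics, and the two metric sets are all hypothetical and chosen only to keep the example runnable without an LLM.

```python
from typing import Callable, Dict, List

# A metric maps (source, output) to a score in [0, 1].
Metric = Callable[[str, str], float]

REGISTRY: Dict[str, Metric] = {}


def register(name: str) -> Callable[[Metric], Metric]:
    """Decorator that adds a scoring function to the shared registry."""
    def decorator(fn: Metric) -> Metric:
        REGISTRY[name] = fn
        return fn
    return decorator


@register("non_empty")
def non_empty(source: str, output: str) -> float:
    return 1.0 if output.strip() else 0.0


@register("conciseness")
def conciseness(source: str, output: str) -> float:
    # 1.0 when the output is no longer than the source, lower otherwise.
    return min(1.0, len(source.split()) / max(1, len(output.split())))


@register("vocabulary_overlap")
def vocabulary_overlap(source: str, output: str) -> float:
    # Fraction of output words that also occur in the source.
    out_words = {w.lower() for w in output.split()}
    if not out_words:
        return 0.0
    src_words = {w.lower() for w in source.split()}
    return len(out_words & src_words) / len(out_words)


# Each use case picks the metrics it needs; "vocabulary_overlap" is shared.
METRIC_SETS: Dict[str, List[str]] = {
    "summarization": ["conciseness", "vocabulary_overlap"],
    "rag": ["vocabulary_overlap", "non_empty"],
}


def evaluate(task_type: str, source: str, output: str) -> Dict[str, float]:
    """Score the output with the metric set registered for this task type."""
    return {name: REGISTRY[name](source, output) for name in METRIC_SETS[task_type]}


source = "The cat sat on the mat near the door"
output = "The cat sat on the mat"
print(evaluate("summarization", source, output))
print(evaluate("rag", source, output))
```

Adding a new use case then means adding one entry to `METRIC_SETS`, and a metric validated once can be reused wherever it applies.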
When to Use
- You need to understand why output quality is low, not just that it's low
- Building RAG systems where faithfulness and groundedness matter independently
- Single-number scoring hides important quality distinctions
- You want to set per-dimension quality thresholds (block unfaithful but allow imperfect fluency)
- Debugging and improving agent pipelines — pinpoint which dimension is the bottleneck
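To make the contrast with single-number scoring concrete, the following sketch compares an averaged score with a per-dimension gate. The policy table, its threshold values, the "block" and "warn" actions, and the two example profiles are assumptions for illustration; real scores would come from your own metrics.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical policy: a "block" dimension rejects the output when it falls
# below its minimum; a "warn" dimension is only reported.
POLICY: Dict[str, Tuple[float, str]] = {
    "faithfulness": (0.8, "block"),
    "relevance": (0.6, "block"),
    "fluency": (0.7, "warn"),
}


def gate(profile: Dict[str, float]) -> Tuple[bool, List[str], List[str]]:
    """Return (accepted, blocking dimensions, warning dimensions)."""
    blocked: List[str] = []
    warned: List[str] = []
    for name, (minimum, action) in POLICY.items():
        if profile[name] < minimum:
            (blocked if action == "block" else warned).append(name)
    return (not blocked, blocked, warned)


# Two made-up profiles with the same average but very different failure modes.
fluent_but_unfaithful = {"faithfulness": 0.40, "relevance": 0.95, "fluency": 0.95}
faithful_but_clumsy = {"faithfulness": 0.95, "relevance": 0.90, "fluency": 0.45}

for label, profile in [("fluent_but_unfaithful", fluent_but_unfaithful),
                       ("faithful_but_clumsy", faithful_but_clumsy)]:
    accepted, blocked, warned = gate(profile)
    print(label, "mean:", round(mean(profile.values()), 2),
          "accepted:", accepted, "blocked:", blocked, "warned:", warned)
```

Both profiles average to roughly 0.77, so a single-number threshold would treat them the same. The per-dimension gate rejects the unfaithful output and accepts the clumsy one with a fluency warning.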