Domain Metrics

Evaluate agent output using a composite of domain-specific metrics tailored to the task type. Rather than a single score, this pattern measures quality across multiple orthogonal dimensions — faithfulness, relevance, groundedness, coherence — giving you a quality profile, not just a number.

RAGAS, a metric suite for RAG evaluation, is the canonical example.


Structure

Each metric evaluates a different quality dimension independently. The composite profile reveals where quality is strong and where it breaks down — a response can be highly relevant but poorly grounded.


How It Works

  1. Define dimensions — identify the quality axes that matter for your domain
  2. Implement metrics — each dimension gets its own scoring function (LLM-based, programmatic, or hybrid)
  3. Score independently — run each metric on the output
  4. Compose profile — aggregate into a multi-dimensional quality profile
  5. Set thresholds — define minimum acceptable scores per dimension
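The five steps above can be sketched as a small pipeline. The metric functions, metric names, and threshold values below are illustrative toy stand-ins (real metrics would be LLM-based or more carefully designed programmatic scorers), not any particular library's API:

```python
from typing import Callable

# Steps 1-2: define dimensions and give each its own scoring function.
# These crude lexical-overlap scorers are placeholders for real
# LLM-based, programmatic, or hybrid metrics; each returns [0, 1].
def groundedness(output: dict) -> float:
    # fraction of answer tokens that appear in the retrieved context
    answer = set(output["answer"].lower().split())
    context = set(output["context"].lower().split())
    return len(answer & context) / len(answer) if answer else 0.0

def relevance(output: dict) -> float:
    # lexical overlap between question and answer
    q = set(output["question"].lower().split())
    a = set(output["answer"].lower().split())
    return len(q & a) / len(q) if q else 0.0

METRICS: dict[str, Callable[[dict], float]] = {
    "groundedness": groundedness,
    "relevance": relevance,
}

# Step 5: minimum acceptable score per dimension (example values).
THRESHOLDS = {"groundedness": 0.5, "relevance": 0.2}

def score_profile(output: dict) -> dict[str, float]:
    # Steps 3-4: score each metric independently, compose a profile.
    return {name: metric(output) for name, metric in METRICS.items()}

def passes(profile: dict[str, float]) -> bool:
    # Every dimension must clear its own threshold.
    return all(profile[dim] >= t for dim, t in THRESHOLDS.items())

output = {
    "question": "What year was the Eiffel Tower completed?",
    "context": "The Eiffel Tower was completed in 1889 for the World's Fair.",
    "answer": "The Eiffel Tower was completed in 1889.",
}
profile = score_profile(output)
```

The point of the structure, rather than any individual scorer, is that `profile` keeps each dimension separate: a failing `passes` call tells you which dimension fell below its threshold, not just that the output is bad.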

Common metric sets:

  • RAG: faithfulness, answer relevance, context precision, context recall (RAGAS)
  • Summarization: coherence, consistency, fluency, relevance
  • Code generation: correctness, efficiency, readability, test coverage
  • Agents: task completion rate, tool use efficiency, step count, cost

Key Characteristics

  • Multi-dimensional — reveals where quality breaks down, not just "good or bad"
  • Domain-specific — metrics must be designed for each task type
  • Actionable — low faithfulness points to generation problems (hallucination); low context recall points to retrieval problems; low answer relevance points to query or prompt problems
  • Setup cost — designing and validating metric sets requires domain expertise
  • Composability — metrics can be mixed and matched across use cases

When to Use

  • You need to understand why output quality is low, not just that it's low
  • Building RAG systems where faithfulness and groundedness matter independently
  • Single-number scoring hides important quality distinctions
  • You want to set per-dimension quality thresholds (block unfaithful but allow imperfect fluency)
  • Debugging and improving agent pipelines — pinpoint which dimension is the bottleneck
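For the debugging use case, aggregating profiles across an eval set pinpoints the bottleneck dimension. The profiles and dimension names below are made-up illustrative data; in practice they would come from running the metric set over real outputs:

```python
from statistics import mean

# Per-output quality profiles from an eval run (illustrative values).
profiles = [
    {"faithfulness": 0.9, "relevance": 0.80, "fluency": 0.95},
    {"faithfulness": 0.4, "relevance": 0.85, "fluency": 0.90},
    {"faithfulness": 0.5, "relevance": 0.75, "fluency": 0.92},
]

# Average each dimension independently across the set.
dimension_means = {
    dim: mean(p[dim] for p in profiles) for dim in profiles[0]
}

# The weakest dimension is where to focus improvement effort.
bottleneck = min(dimension_means, key=dimension_means.get)
# bottleneck == "faithfulness" (mean 0.6), so debug generation first
```

A single averaged score over the same data would sit around 0.8 and hide the problem; the per-dimension view makes the failure mode visible.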