LLM-as-Judge

A separate LLM scores or grades another model's output against specified criteria, producing a numerical score and structured rationale. Unlike Evaluator-Optimizer (which feeds critique back into a revision loop), LLM-as-Judge is a one-shot scoring pass — it produces a quality signal, not a revision.

This is the most widely used evaluation method for anything that can't be checked programmatically.


Structure

The judge receives the agent's output (and optionally the input, reference answer, or retrieved context) along with explicit scoring criteria. It returns a score and explanation.


How It Works

  1. Define criteria — specify what the judge should evaluate (accuracy, helpfulness, tone, completeness)
  2. Construct judge prompt — include the output to evaluate, any reference material, and a scoring rubric
  3. Score — judge LLM produces a numerical score (e.g., 1-5 or 1-10) and a rationale explaining the score
  4. Aggregate — collect scores across multiple outputs for trends and averages

Judge modes:

  • Reference-free — judge scores output without a "correct" answer (helpfulness, tone)
  • Reference-based — judge compares output against a gold-standard answer (factual accuracy)
  • Pairwise — judge compares two outputs and picks the better one (A/B testing)
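A pairwise judge is also where position bias bites hardest, since the judge sees both candidates in one prompt. One common mitigation is to randomize which output appears first and map the verdict back. The sketch below assumes a `call_judge` callable standing in for your LLM client, and an illustrative prompt format:

```python
import random

def pairwise_judge(call_judge, question: str, output_a: str, output_b: str,
                   rng=random) -> str:
    """Ask a judge which of two outputs is better, randomizing position
    to counter position bias. `call_judge` (an assumed stand-in for your
    LLM call) must return the string "1" or "2". Returns "A" or "B"."""
    swapped = rng.random() < 0.5
    first, second = (output_b, output_a) if swapped else (output_a, output_b)
    prompt = (f"Question:\n{question}\n\n"
              f"Response 1:\n{first}\n\n"
              f"Response 2:\n{second}\n\n"
              'Which response is better? Reply with exactly "1" or "2".')
    verdict = call_judge(prompt).strip()
    if verdict not in ("1", "2"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # Undo the randomized ordering so the caller always gets A/B labels.
    if swapped:
        return "B" if verdict == "1" else "A"
    return "A" if verdict == "1" else "B"
```

For stronger debiasing, some setups run the comparison in both orders and only count it when the verdicts agree; that doubles cost per comparison.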

Key Characteristics

  • Flexible — can evaluate anything expressible in natural language criteria
  • Scalable — can score thousands of outputs without human reviewers
  • Biased — position bias (prefers the first item shown), verbosity bias (prefers longer outputs), self-preference bias (rates outputs from its own model family higher)
  • Calibration required — judge scores must be validated against human ratings
  • Cost — every evaluation is an additional LLM call

When to Use

  • You need to evaluate subjective quality (writing, helpfulness, reasoning)
  • Ground-truth answers don't exist or are expensive to create
  • You're scoring production outputs at scale (monitoring, quality dashboards)
  • Programmatic metrics don't capture what you care about
  • You want to automate evaluation that would otherwise require human reviewers