LLM-as-Judge

A separate LLM scores or grades another model's output against specified criteria, producing a numerical score and structured rationale. Unlike Evaluator-Optimizer (which feeds critique back into a revision loop), LLM-as-Judge is a one-shot scoring pass — it produces a quality signal, not a revision.

This is the most widely used evaluation method for anything that can't be checked programmatically.


Structure

The judge receives the agent's output (and optionally the input, reference answer, or retrieved context) along with explicit scoring criteria. It returns a score and explanation.


How It Works

  1. Define criteria — specify what the judge should evaluate (accuracy, helpfulness, tone, completeness)
  2. Construct judge prompt — include the output to evaluate, any reference material, and a scoring rubric
  3. Score — judge LLM produces a numerical score (e.g., 1-5 or 1-10) and a rationale explaining the score
  4. Aggregate — collect scores across multiple outputs for trends and averages

Judge modes:

  • Reference-free — judge scores output without a "correct" answer (helpfulness, tone)
  • Reference-based — judge compares output against a gold-standard answer (factual accuracy)
  • Pairwise — judge compares two outputs and picks the better one (A/B testing)
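A pairwise judge is also where position bias bites hardest, since the judge sees both candidates in one prompt. One common mitigation is to randomize which output appears first and map the verdict back. The sketch below assumes a `call_judge` callable standing in for your LLM client, and an illustrative prompt format:

```python
import random

def pairwise_judge(call_judge, question: str, output_a: str, output_b: str,
                   rng=random) -> str:
    """Ask a judge which of two outputs is better, randomizing position
    to counter position bias. `call_judge` (an assumed stand-in for your
    LLM call) must return the string "1" or "2". Returns "A" or "B"."""
    swapped = rng.random() < 0.5
    first, second = (output_b, output_a) if swapped else (output_a, output_b)
    prompt = (f"Question:\n{question}\n\n"
              f"Response 1:\n{first}\n\n"
              f"Response 2:\n{second}\n\n"
              'Which response is better? Reply with exactly "1" or "2".')
    verdict = call_judge(prompt).strip()
    if verdict not in ("1", "2"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # Undo the randomized ordering so the caller always gets A/B labels.
    if swapped:
        return "B" if verdict == "1" else "A"
    return "A" if verdict == "1" else "B"
```

For stronger debiasing, some setups run the comparison in both orders and only count it when the verdicts agree; that doubles cost per comparison.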

Key Characteristics

  • Flexible — can evaluate anything expressible in natural language criteria
  • Scalable — can score thousands of outputs without human reviewers
  • Biased — position bias (prefers the first item shown), verbosity bias (prefers longer outputs), self-preference bias (rates outputs from its own model family higher)
  • Calibration required — judge scores must be validated against human ratings
  • Cost — every evaluation is an additional LLM call

When to Use

  • You need to evaluate subjective quality (writing, helpfulness, reasoning)
  • Ground-truth answers don't exist or are expensive to create
  • You're scoring production outputs at scale (monitoring, quality dashboards)
  • Programmatic metrics don't capture what you care about
  • You want to automate evaluation that would otherwise require human reviewers