Human Feedback
Collect explicit human judgments — thumbs up/down, star ratings, preference selections, free-text corrections — on agent outputs during production use. These signals serve as ground truth for calibrating automated evaluations, curating eval datasets, and driving model improvements.
Every other evaluation pattern approximates what humans think. This one asks them directly.
Structure
Feedback is collected passively (thumbs up/down on every response) or actively (annotation queues where reviewers score sampled outputs). The collected signals feed back into the system through multiple channels: prompt updates, fine-tuning data, and eval datasets.
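A minimal sketch of what one feedback event might look like in storage. The schema and field names here are illustrative assumptions, not a prescribed format; the key property is that the human signal is stored alongside the full input/output context so it can later be replayed, aggregated, or turned into an eval example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record schema: one row per feedback event, keeping the
# full input/output context next to the human signal.
@dataclass
class FeedbackRecord:
    trace_id: str                    # links back to the logged agent run
    user_input: str
    agent_output: str
    source: str                      # "passive" (in-product widget) or "active" (annotation queue)
    rating: Optional[int] = None     # e.g. 1 = thumbs up, 0 = thumbs down
    correction: Optional[str] = None # free-text correction, if supplied

record = FeedbackRecord(
    trace_id="run-123",
    user_input="What is our refund policy?",
    agent_output="Refunds are available within 30 days.",
    source="passive",
    rating=1,
)
```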
How It Works
- Instrument — add feedback UI to agent outputs (rating buttons, correction fields)
- Collect — capture human signals alongside the full input/output context
- Aggregate — track quality trends over time, by topic, by user segment
- Calibrate — compare human ratings against automated evaluations to find gaps
- Improve — use feedback to update prompts, fine-tune models, or expand eval datasets
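The aggregate step above can be sketched in a few lines. This is a toy illustration, assuming binary feedback events tagged with a topic; a real pipeline would read from logs and also slice by time window and user segment.

```python
from collections import defaultdict

# Aggregate binary feedback (1 = thumbs up, 0 = thumbs down) into a
# thumbs-up rate per topic.
def aggregate_by_topic(events):
    totals = defaultdict(lambda: [0, 0])  # topic -> [ups, count]
    for topic, rating in events:
        totals[topic][0] += rating
        totals[topic][1] += 1
    return {topic: ups / count for topic, (ups, count) in totals.items()}

events = [
    ("billing", 1), ("billing", 0), ("billing", 1), ("billing", 1),
    ("shipping", 0), ("shipping", 0), ("shipping", 1),
]
rates = aggregate_by_topic(events)
# billing: 3/4 = 0.75, shipping: 1/3
```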
Feedback types:
- Binary — thumbs up/down (lowest friction, most volume)
- Scalar — 1-5 star ratings (more signal, less volume)
- Preference — "which response is better?" (A/B comparison)
- Correction — user provides the correct answer (highest signal, lowest volume)
- Annotation — trained reviewers score against rubrics (highest quality, most expensive)
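To aggregate across these heterogeneous types, it can help to project each one onto a common quality scale. The mapping below is one plausible convention, not a standard: binary and scalar signals normalize directly, and a correction is treated as evidence the output was wrong.

```python
# Hypothetical normalization: map different feedback types onto a
# common [0, 1] quality score so they can be aggregated together.
def normalize(kind, value):
    if kind == "binary":      # thumbs up/down
        return 1.0 if value else 0.0
    if kind == "scalar":      # 1-5 stars
        return (value - 1) / 4
    if kind == "correction":  # a supplied correction implies the output was wrong
        return 0.0
    raise ValueError(f"unhandled feedback kind: {kind}")

normalize("scalar", 5)  # -> 1.0
normalize("scalar", 3)  # -> 0.5
```

Preference and annotation signals are deliberately omitted here: pairwise comparisons need a ranking model (e.g. Bradley-Terry) rather than a per-item score, and rubric annotations usually keep their rubric's own scale.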
Key Characteristics
- Ground truth — human judgment is the ultimate quality signal
- Expensive — human time is the scarcest resource
- Noisy — individual ratings vary; need volume for reliability
- Selection bias — users who rate are not representative of all users
- Feedback loop — feedback-driven improvements change future outputs, which in turn shifts the feedback you collect
When to Use
- You need ground truth to calibrate automated evaluations (LLM-as-Judge, domain metrics)
- Automated metrics don't capture what users actually care about
- You're building an eval dataset and need real-world examples of good and bad outputs
- User satisfaction is the ultimate metric (not just technical correctness)
- You want to identify failure modes that automated evaluation misses
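The calibration use case above can be sketched simply: score the same sampled outputs with both humans and the automated judge, measure agreement, and treat disagreements as the highest-value traces to review and add to the eval dataset. The data below is made up for illustration.

```python
# Hypothetical calibration check: agreement between human thumbs
# (1 = up, 0 = down) and an automated judge's pass/fail verdicts
# on the same sampled outputs.
def agreement(human, judge):
    assert len(human) == len(judge)
    return sum(h == j for h, j in zip(human, judge)) / len(human)

human = [1, 1, 0, 1, 0, 1, 1, 0]
judge = [1, 1, 0, 0, 0, 1, 1, 1]

rate = agreement(human, judge)  # 6/8 = 0.75

# Disagreements are where the automated eval misses what users care
# about; these traces are prime candidates for the eval dataset.
disagreements = [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
```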