Human Feedback
Collect explicit human judgments — thumbs up/down, star ratings, preference selections, free-text corrections — on agent outputs during production use. These signals serve as ground truth for calibrating automated evaluations, curating eval datasets, and driving model improvements.
Every other evaluation pattern approximates what humans think. This one asks them directly.
Structure
Feedback is collected passively (thumbs up/down on every response) or actively (annotation queues where reviewers score sampled outputs). The collected signals feed back into the system through multiple channels: prompt updates, fine-tuning data, and eval datasets.
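A minimal sketch of what one feedback event might look like in storage. The schema and field names here are illustrative assumptions, not a prescribed format; the key property is that the human signal is stored alongside the full input/output context so it can later be replayed, aggregated, or turned into an eval example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record schema: one row per feedback event, keeping the
# full input/output context next to the human signal.
@dataclass
class FeedbackRecord:
    trace_id: str                    # links back to the logged agent run
    user_input: str
    agent_output: str
    source: str                      # "passive" (in-product widget) or "active" (annotation queue)
    rating: Optional[int] = None     # e.g. 1 = thumbs up, 0 = thumbs down
    correction: Optional[str] = None # free-text correction, if supplied

record = FeedbackRecord(
    trace_id="run-123",
    user_input="What is our refund policy?",
    agent_output="Refunds are available within 30 days.",
    source="passive",
    rating=1,
)
```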
How It Works
- Instrument — add feedback UI to agent outputs (rating buttons, correction fields)
- Collect — capture human signals alongside the full input/output context
- Aggregate — track quality trends over time, by topic, by user segment
- Calibrate — compare human ratings against automated evaluations to find gaps
- Improve — use feedback to update prompts, fine-tune models, or expand eval datasets
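The aggregate step above can be sketched in a few lines. This is a toy illustration, assuming binary feedback events tagged with a topic; a real pipeline would read from logs and also slice by time window and user segment.

```python
from collections import defaultdict

# Aggregate binary feedback (1 = thumbs up, 0 = thumbs down) into a
# thumbs-up rate per topic.
def aggregate_by_topic(events):
    totals = defaultdict(lambda: [0, 0])  # topic -> [ups, count]
    for topic, rating in events:
        totals[topic][0] += rating
        totals[topic][1] += 1
    return {topic: ups / count for topic, (ups, count) in totals.items()}

events = [
    ("billing", 1), ("billing", 0), ("billing", 1), ("billing", 1),
    ("shipping", 0), ("shipping", 0), ("shipping", 1),
]
rates = aggregate_by_topic(events)
# billing: 3/4 = 0.75, shipping: 1/3
```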
Feedback types:
- Binary — thumbs up/down (lowest friction, most volume)
- Scalar — 1-5 star ratings (more signal, less volume)
- Preference — "which response is better?" (A/B comparison)
- Correction — user provides the correct answer (highest signal, lowest volume)
- Annotation — trained reviewers score against rubrics (highest quality, most expensive)
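To aggregate across these heterogeneous types, it can help to project each one onto a common quality scale. The mapping below is one plausible convention, not a standard: binary and scalar signals normalize directly, and a correction is treated as evidence the output was wrong.

```python
# Hypothetical normalization: map different feedback types onto a
# common [0, 1] quality score so they can be aggregated together.
def normalize(kind, value):
    if kind == "binary":      # thumbs up/down
        return 1.0 if value else 0.0
    if kind == "scalar":      # 1-5 stars
        return (value - 1) / 4
    if kind == "correction":  # a supplied correction implies the output was wrong
        return 0.0
    raise ValueError(f"unhandled feedback kind: {kind}")

normalize("scalar", 5)  # -> 1.0
normalize("scalar", 3)  # -> 0.5
```

Preference and annotation signals are deliberately omitted here: pairwise comparisons need a ranking model (e.g. Bradley-Terry) rather than a per-item score, and rubric annotations usually keep their rubric's own scale.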
Key Characteristics
- Ground truth — human judgment is the ultimate quality signal
- Expensive — human time is the scarcest resource
- Noisy — individual ratings vary; need volume for reliability
- Selection bias — users who rate are not representative of all users
- Feedback loop — feedback-driven improvements change future outputs, which in turn shifts the feedback you collect
When to Use
- You need ground truth to calibrate automated evaluations (LLM-as-Judge, domain metrics)
- Automated metrics don't capture what users actually care about
- You're building an eval dataset and need real-world examples of good and bad outputs
- User satisfaction is the ultimate metric (not just technical correctness)
- You want to identify failure modes that automated evaluation misses
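The calibration use case above can be sketched simply: score the same sampled outputs with both humans and the automated judge, measure agreement, and treat disagreements as the highest-value traces to review and add to the eval dataset. The data below is made up for illustration.

```python
# Hypothetical calibration check: agreement between human thumbs
# (1 = up, 0 = down) and an automated judge's pass/fail verdicts
# on the same sampled outputs.
def agreement(human, judge):
    assert len(human) == len(judge)
    return sum(h == j for h, j in zip(human, judge)) / len(human)

human = [1, 1, 0, 1, 0, 1, 1, 0]
judge = [1, 1, 0, 0, 0, 1, 1, 1]

rate = agreement(human, judge)  # 6/8 = 0.75

# Disagreements are where the automated eval misses what users care
# about; these traces are prime candidates for the eval dataset.
disagreements = [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
```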