Eval Suite

Maintain a curated dataset of test cases — input, expected behavior, and evaluation criteria — and run the agent against it on every change. Compare scores to a baseline. Flag regressions. Gate deployments on quality thresholds. This is the LLM equivalent of a test suite in software engineering.

Without this, you're deploying prompt changes on vibes.


Structure

The eval suite runs automatically in CI/CD. Each test case is scored using LLM-as-Judge, domain metrics, or programmatic checks. Results are compared against the previous baseline to detect regressions.
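One minimal sketch of what a test case and a programmatic check might look like. The schema and names here (`EvalCase`, `exact_match`) are illustrative assumptions, not a standard API:

```python
from dataclasses import dataclass, field

# Hypothetical schema for a single eval case. Each case names its
# evaluator so the runner knows how to score it.
@dataclass
class EvalCase:
    case_id: str
    input: str
    expected: str                 # expected behavior / reference answer
    evaluator: str                # "llm_judge", "metric", or "assertion"
    tags: list = field(default_factory=list)

def exact_match(output: str, expected: str) -> float:
    """Programmatic check: 1.0 if the output matches exactly, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0
```

LLM-as-Judge and domain-metric evaluators would plug in the same way: a function from (output, expected) to a score, selected per case.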


How It Works

  1. Curate dataset — collect representative input-output pairs covering key scenarios and edge cases
  2. Define evaluators — choose scoring methods per test case (LLM judge, metrics, assertions)
  3. Run baseline — score the current agent version to establish baseline scores
  4. Run on change — every PR or prompt change triggers the eval suite
  5. Compare — flag test cases that regressed, improved, or stayed the same
  6. Gate — block deployment if regressions exceed a threshold
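Steps 5 and 6 can be sketched as a compare-and-gate function. This assumes scores are dicts mapping case IDs to values in [0, 1]; the names and the per-case tolerance are illustrative, not a fixed interface:

```python
def compare(baseline: dict, current: dict, tolerance: float = 0.0) -> dict:
    """Classify each shared test case as regressed, improved, or unchanged."""
    report = {"regressed": [], "improved": [], "unchanged": []}
    for case_id, base_score in baseline.items():
        cur = current.get(case_id)
        if cur is None:
            continue  # case no longer in the suite; skip it
        if cur < base_score - tolerance:
            report["regressed"].append(case_id)
        elif cur > base_score + tolerance:
            report["improved"].append(case_id)
        else:
            report["unchanged"].append(case_id)
    return report

def gate(report: dict, max_regressions: int = 0) -> bool:
    """Return True if the change is allowed to ship."""
    return len(report["regressed"]) <= max_regressions
```

In CI, a `False` from `gate` would fail the pipeline and block the deploy; the `tolerance` keeps noisy LLM-judge scores from flagging spurious regressions.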

Dataset management:

  • Seed from production — sample real user queries as test cases
  • Add failure cases — every bug becomes a new test case
  • Cover edge cases — intentionally adversarial, ambiguous, or tricky inputs
  • Version the dataset — track changes to the eval set alongside code changes
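One way to get both "add failure cases" and "version the dataset" cheaply: keep the eval set as a JSONL file in the repo, so dataset changes land in the same diff as prompt changes. The file layout and helper below are an assumed convention, not a prescribed format:

```python
import json

def add_failure_case(path: str, case_id: str, user_input: str,
                     expected: str, tag: str = "bug") -> None:
    """Append a new test case, e.g. after triaging a production failure."""
    case = {
        "case_id": case_id,
        "input": user_input,
        "expected": expected,
        "tags": [tag],
    }
    with open(path, "a") as f:
        f.write(json.dumps(case) + "\n")
```

Seeding from production works the same way: sample real queries, attach expected behavior, and append them as ordinary cases.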

Key Characteristics

  • Regression protection — catches quality drops before they reach production
  • Reproducible — same dataset, same evaluation, comparable scores across runs
  • Living dataset — eval suite grows as you discover new failure modes
  • Maintenance cost — dataset must be curated, updated, and kept relevant
  • Slow feedback — full suite runs can take minutes to hours depending on size

When to Use

  • You're making regular changes to prompts, models, or agent logic
  • Quality regressions have reached production and you need to prevent recurrence
  • Multiple people are changing the agent and you need shared quality standards
  • You want to gate deployments on measurable quality thresholds
  • You're ready to treat agent quality with the same rigor as software testing