Vibe-Based Deployment
Shipping prompt changes, model swaps, or tool updates based on manual spot-checking and "it seems to work" rather than systematic evaluation. You modify the agent, try a few queries, and deploy.
As Hamel Husain puts it: "Most AI teams invest weeks building complex systems but can't tell if their changes are helping or hurting."
Why It Happens
- Building evaluation infrastructure feels like a distraction from building features
- LLM outputs are hard to evaluate automatically
- The system "seems fine" in demos
- Teams rationalize that the problem space is too open-ended for tests
- Writing evals requires understanding failure modes, which requires looking at data, which teams skip
What Goes Wrong
- Silent regressions — a prompt change that fixes case A breaks cases B, C, D
- No baseline — without recorded results from previous versions, you can't tell whether a change made things better or worse
- Compounding errors — each "seems fine" deployment stacks uncertainty
- False confidence — the 5 queries you tested don't represent the thousands in production
- No learning — without data on failures, you can't systematically improve
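The compounding-errors point is easy to quantify. As an illustration only, assume (hypothetically) that each unverified deploy carries an independent 10% chance of introducing a silent regression:

```python
# Illustrative arithmetic: if each "seems fine" deploy independently has a
# 10% chance of a silent regression, what are the odds after n deploys?
p_regression = 0.10  # assumed per-deploy regression rate (hypothetical)

for n in (1, 5, 10, 20):
    p_at_least_one = 1 - (1 - p_regression) ** n
    print(f"after {n:2d} deploys: {p_at_least_one:.0%} chance of >=1 regression")
```

Even with these generous assumptions, ten "seems fine" deploys leave roughly a 65% chance that at least one shipped a regression.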
What to Do Instead
- Error analysis first — manually review 50-100 production traces before building anything
- Binary evals — start with simple pass/fail judgments, not complex scoring
- Curate an eval set — every bug becomes a test case (see Eval Suite)
- Run evals on every change — compare before and after, flag regressions
- Spend 60-80% of time on error analysis rather than infrastructure
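The middle three steps can be sketched as a minimal harness. This is a hedged illustration, not a prescribed implementation: `run_agent`, the eval cases, and the baseline file are hypothetical stand-ins for your own agent, your curated set, and wherever you store prior results:

```python
# Minimal binary-eval harness: every case is pass/fail, and each run is
# compared against the last saved baseline so regressions are flagged
# before deploy. `run_agent` and EVAL_CASES are placeholder stand-ins.
import json
from pathlib import Path

BASELINE = Path("eval_baseline.json")

def run_agent(query: str) -> str:
    # Placeholder for the real agent call.
    return "Our refund window is 30 days."

# Curated eval set: each production bug becomes a case with a binary check.
EVAL_CASES = [
    {"id": "refund-window",
     "query": "How long do I have to return an item?",
     "passed": lambda out: "30 days" in out},
    {"id": "no-hallucinated-discount",
     "query": "Do you offer a student discount?",
     "passed": lambda out: "50% off" not in out},
]

def run_evals() -> dict:
    return {c["id"]: bool(c["passed"](run_agent(c["query"]))) for c in EVAL_CASES}

def check_regressions(results: dict) -> list:
    baseline = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
    # A regression is any case that passed before and fails now.
    return [cid for cid, ok in results.items() if baseline.get(cid) and not ok]

results = run_evals()
regressions = check_regressions(results)
if regressions:
    print("REGRESSIONS:", regressions)   # block the deploy
else:
    BASELINE.write_text(json.dumps(results))  # promote new baseline
```

Run on every change: a non-empty regression list blocks the deploy, a clean run becomes the new baseline, and every future bug gets appended to the eval set as another pass/fail case.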
Signs You Have This
- You can't name your agent's top 5 failure modes
- Prompt changes are deployed the same day they're written
- "I tried it and it works" is the evaluation methodology
- There's no eval dataset, no scoring, no baseline
- Quality problems are discovered by users, not by you