Vibe-Based Deployment
Shipping prompt changes, model swaps, or tool updates based on manual spot-checking and "it seems to work" rather than systematic evaluation. You modify the agent, try a few queries, and deploy.
As Hamel Husain puts it: "Most AI teams invest weeks building complex systems but can't tell if their changes are helping or hurting."
Why It Happens
- Building evaluation infrastructure feels like a distraction from building features
- LLM outputs are hard to evaluate automatically
- The system "seems fine" in demos
- Teams rationalize that the problem space is too open-ended for tests
- Writing evals requires understanding failure modes, which requires looking at data, which teams skip
What Goes Wrong
- Silent regressions — a prompt change that fixes case A breaks cases B, C, D
- No baseline — without recorded results from previous versions, you can't tell whether a change made things better or worse
- Compounding errors — each "seems fine" deployment stacks uncertainty
- False confidence — the 5 queries you tested don't represent the thousands in production
- No learning — without data on failures, you can't systematically improve
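The compounding-errors point is easy to quantify. As an illustration only, assume (hypothetically) that each unverified deploy carries an independent 10% chance of introducing a silent regression:

```python
# Illustrative arithmetic: if each "seems fine" deploy independently has a
# 10% chance of a silent regression, what are the odds after n deploys?
p_regression = 0.10  # assumed per-deploy regression rate (hypothetical)

for n in (1, 5, 10, 20):
    p_at_least_one = 1 - (1 - p_regression) ** n
    print(f"after {n:2d} deploys: {p_at_least_one:.0%} chance of >=1 regression")
```

Even with these generous assumptions, ten "seems fine" deploys leave roughly a 65% chance that at least one shipped a regression.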
What to Do Instead
- Error analysis first — manually review 50-100 production traces before building anything
- Binary evals — start with simple pass/fail judgments, not complex scoring
- Curate an eval set — every bug becomes a test case (see Eval Suite)
- Run evals on every change — compare before and after, flag regressions
- Spend 60-80% of time on error analysis rather than infrastructure
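The middle three steps can be sketched as a minimal harness. This is a hedged illustration, not a prescribed implementation: `run_agent`, the eval cases, and the baseline file are hypothetical stand-ins for your own agent, your curated set, and wherever you store prior results:

```python
# Minimal binary-eval harness: every case is pass/fail, and each run is
# compared against the last saved baseline so regressions are flagged
# before deploy. `run_agent` and EVAL_CASES are placeholder stand-ins.
import json
from pathlib import Path

BASELINE = Path("eval_baseline.json")

def run_agent(query: str) -> str:
    # Placeholder for the real agent call.
    return "Our refund window is 30 days."

# Curated eval set: each production bug becomes a case with a binary check.
EVAL_CASES = [
    {"id": "refund-window",
     "query": "How long do I have to return an item?",
     "passed": lambda out: "30 days" in out},
    {"id": "no-hallucinated-discount",
     "query": "Do you offer a student discount?",
     "passed": lambda out: "50% off" not in out},
]

def run_evals() -> dict:
    return {c["id"]: bool(c["passed"](run_agent(c["query"]))) for c in EVAL_CASES}

def check_regressions(results: dict) -> list:
    baseline = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
    # A regression is any case that passed before and fails now.
    return [cid for cid, ok in results.items() if baseline.get(cid) and not ok]

results = run_evals()
regressions = check_regressions(results)
if regressions:
    print("REGRESSIONS:", regressions)   # block the deploy
else:
    BASELINE.write_text(json.dumps(results))  # promote new baseline
```

Run on every change: a non-empty regression list blocks the deploy, a clean run becomes the new baseline, and every future bug gets appended to the eval set as another pass/fail case.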
Signs You Have This
- You can't name your agent's top 5 failure modes
- Prompt changes are deployed the same day they're written
- "I tried it and it works" is the evaluation methodology
- There's no eval dataset, no scoring, no baseline
- Quality problems are discovered by users, not by you