Vibe-Based Deployment

Shipping prompt changes, model swaps, or tool updates based on manual spot-checking and "it seems to work" rather than systematic evaluation. You modify the agent, try a few queries, and deploy.

As Hamel Husain puts it: "Most AI teams invest weeks building complex systems but can't tell if their changes are helping or hurting."


Why It Happens

  • Building evaluation infrastructure feels like a distraction from building features
  • LLM outputs are hard to evaluate automatically
  • The system "seems fine" in demos
  • Teams rationalize that the problem space is too open-ended for tests
  • Writing evals requires understanding failure modes, which requires looking at data, which teams skip

What Goes Wrong

  • Silent regressions — a prompt change that fixes case A breaks cases B, C, D
  • No baseline — you can't compare because you have nothing to compare against
  • Compounding errors — each "seems fine" deployment adds untested behavior on top of the last
  • False confidence — the 5 queries you tested don't represent the thousands in production
  • No learning — without data on failures, you can't systematically improve

What to Do Instead

  • Error analysis first — manually review 50-100 production traces before building anything
  • Binary evals — start with simple pass/fail judgments, not complex scoring
  • Curate an eval set — every bug becomes a test case (see Eval Suite)
  • Run evals on every change — compare before and after, flag regressions
  • Spend 60-80% of time on error analysis rather than infrastructure
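The eval-and-compare loop above can be sketched as a minimal harness. This is a hedged illustration, not a prescribed implementation: the agent stub, case IDs, and the substring-based pass/fail check are placeholder assumptions (in practice the check might be an LLM judge or an exact-match assertion), but the shape — binary judgments, a baseline capture, and a regression diff on every change — is the point.

```python
import json

def run_agent(query: str) -> str:
    # Hypothetical stand-in for your real agent call.
    return "refund issued" if "refund" in query else "escalated to human"

# Eval set curated from production: every bug becomes a case.
# Each case is a query plus a binary pass/fail check.
EVAL_SET = [
    {"id": "refund-basic", "query": "I want a refund", "must_contain": "refund"},
    {"id": "escalate-angry", "query": "This is unacceptable!", "must_contain": "human"},
]

def run_evals() -> dict:
    """Return {case_id: True/False} — binary pass/fail, no fuzzy scores."""
    return {
        case["id"]: case["must_contain"] in run_agent(case["query"])
        for case in EVAL_SET
    }

def regressions(current: dict, baseline: dict) -> list:
    """Cases that passed at baseline but fail now — the silent breakage."""
    return [cid for cid, passed in current.items()
            if baseline.get(cid, False) and not passed]

if __name__ == "__main__":
    baseline = run_evals()   # capture BEFORE the prompt/model change
    # ... apply the change, then re-run ...
    current = run_evals()
    print(json.dumps({
        "pass_rate": sum(current.values()) / len(current),
        "regressions": regressions(current, baseline),
    }))
```

Binary checks keep the loop cheap enough to run on every change; a non-empty `regressions` list is the signal to stop a deploy, which is exactly what "I tried it and it works" can never give you.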

Signs You Have This

  • You can't name your agent's top 5 failure modes
  • Prompt changes are deployed the same day they're written
  • "I tried it and it works" is the evaluation methodology
  • There's no eval dataset, no scoring, no baseline
  • Quality problems are discovered by users, not by you