The Happy Path Mirage
Building and evaluating agents exclusively against clean, well-formed inputs and expected workflows. Then being surprised when production traffic — with its ambiguity, adversarial inputs, malformed data, and edge cases — causes failures.
Why It Happens
- Clean demos are convincing to stakeholders
- Edge cases are tedious to enumerate
- Security concerns feel paranoid when the agent "just answers questions"
- The gap between demo quality and production quality is not visible until deployment
- Teams want to ship fast and iterate later
What Goes Wrong
- Prompt injection — untrusted content manipulates agent behavior
- The lethal trifecta — private data access + untrusted content + exfiltration capability (found in GitHub MCP, Notion, Supabase)
- Messy reality — production databases aren't cleanly documented, user queries aren't well-formed
- Cascading failures — one unexpected input corrupts the agent's state for subsequent turns
- False confidence — demo pass rate ≠ production pass rate
What to Do Instead
- Test adversarial inputs from day one — angry users, nonsense, out-of-scope requests, manipulation attempts
- Remove exfiltration vectors — if untrusted data enters the context, ensure the agent can't leak it (no URL fetching, no email sending without approval)
- Safety in infrastructure, not prompts — move validation, PII filtering, and access control into code, not system prompt instructions
- Human-in-the-loop for high-stakes actions — require approval before send, delete, publish, deploy
- Fuzz test — throw random, malformed, and boundary inputs at the agent systematically
Signs You Have This
- Your test suite only has well-formed, polite queries
- You've never tested what happens with adversarial or nonsensical input
- The agent has access to sensitive data and can also call external APIs
- Safety rules are enforced in the prompt, not in code
- You found out about a failure mode from a user, not from testing