Skip to main content

The Happy Path Mirage

Building and evaluating agents exclusively against clean, well-formed inputs and expected workflows. Then being surprised when production traffic — with its ambiguity, adversarial inputs, malformed data, and edge cases — causes failures.


Why It Happens

  • Clean demos are convincing to stakeholders
  • Edge cases are tedious to enumerate
  • Security concerns feel paranoid when the agent "just answers questions"
  • The gap between demo quality and production quality is not visible until deployment
  • Teams want to ship fast and iterate later

What Goes Wrong

  • Prompt injection — untrusted content manipulates agent behavior
  • The lethal trifecta — private data access + untrusted content + exfiltration capability (found in GitHub MCP, Notion, Supabase)
  • Messy reality — production databases aren't cleanly documented, user queries aren't well-formed
  • Cascading failures — one unexpected input corrupts the agent's state for subsequent turns
  • False confidence — demo pass rate ≠ production pass rate

What to Do Instead

  • Test adversarial inputs from day one — angry users, nonsense, out-of-scope requests, manipulation attempts
  • Remove exfiltration vectors — if untrusted data enters the context, ensure the agent can't leak it (no URL fetching, no email sending without approval)
  • Safety in infrastructure, not prompts — move validation, PII filtering, and access control into code, not system prompt instructions
  • Human-in-the-loop for high-stakes actions — require approval before send, delete, publish, deploy
  • Fuzz test — throw random, malformed, and boundary inputs at the agent systematically

Signs You Have This

  • Your test suite only has well-formed, polite queries
  • You've never tested what happens with adversarial or nonsensical input
  • The agent has access to sensitive data and can also call external APIs
  • Safety rules are enforced in the prompt, not in code
  • You found out about a failure mode from a user, not from testing