Agent QA Audit — is your AI agent actually tested?

Everyone is shipping AI agents. Almost nobody is testing them. The Agent QA Audit is a 90-second browser toy that forces the question out into the open: when your agent hallucinates a tool call, ignores its system prompt halfway through a session, or burns the entire token budget answering "hi" — what actually happens? Does a test catch it, does a guardrail soften it, or does it ship untested while you refresh the dashboard and hope?

The rules are short. You get eight face-down cards, each hiding a real AI-agent failure mode — hallucinated tool arguments, infinite retry loops on a 404, leaked secrets in logs, the wrong tool picked entirely. Flip a card, read the failure mode, and drop it into one of three lanes: CAUGHT IN QA (a test catches it first), SHIPPED (it reaches prod but you have a guardrail), or YOLO IN PROD (it ships untested and you pray). There are no wrong answers here — only confessions. Each failure mode carries a severity weight, and your lane choices tally into an Agent Trust Score from 0 to 100.
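For the curious, here is one way a severity-weighted tally like that could work. This is a minimal sketch, not the audit's published formula: the `Triage` shape, the `LANE_CREDIT` values (full credit for a test, half for a guardrail, none for prayer), and the `agentTrustScore` name are all assumptions made for illustration.

```typescript
// Hypothetical sketch: names, weights, and lane credits are assumptions,
// not the audit's actual scoring rules.
type Lane = "CAUGHT_IN_QA" | "SHIPPED" | "YOLO_IN_PROD";

interface Triage {
  failureMode: string;
  severity: number; // assumed non-negative weight; higher means worse
  lane: Lane;
}

// Assumed credit per lane: a test earns full credit, a guardrail half, nothing earns none.
const LANE_CREDIT: Record<Lane, number> = {
  CAUGHT_IN_QA: 1,
  SHIPPED: 0.5,
  YOLO_IN_PROD: 0,
};

// Severity-weighted average of lane credit, scaled to 0..100 and rounded.
function agentTrustScore(cards: Triage[]): number {
  const total = cards.reduce((sum, c) => sum + c.severity, 0);
  if (total === 0) return 0; // no cards (or all zero weight): nothing to trust yet
  const earned = cards.reduce(
    (sum, c) => sum + c.severity * LANE_CREDIT[c.lane],
    0,
  );
  return Math.round((earned / total) * 100);
}
```

Under these assumed credits, leaving a high-severity card in YOLO IN PROD costs more than leaving a low-severity one there, which is the point of weighting by severity rather than counting cards.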

Then the dial spins. The needle sweeps from red to green, the score counts up, and a rarity-tier badge flips into place: Legendary if you are a Battle-Hardened Agent Wrangler, Rare for a Cautious Deployer, Common for an Optimistic Shipper, or Cursed if the audit declares you a Chaos Goblin. A savage one-liner types out underneath, tuned to your specific triage pattern, and the whole verdict — dial, badge, and quip — renders as a screenshot you can post to X or LinkedIn. The screenshot is the flex. Or the confession.
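Mapping a score to a badge could be as simple as a threshold ladder. The sketch below assumes cutoffs of 85, 60, and 30; the tier names and titles come from the description above, but the thresholds and the `verdictFor` function are hypothetical.

```typescript
// Hypothetical sketch: the cutoffs (85 / 60 / 30) are assumed for illustration;
// only the tier names and titles come from the audit's description.
type Tier = "Legendary" | "Rare" | "Common" | "Cursed";

interface Verdict {
  tier: Tier;
  title: string;
}

// Walk the ladder from the top tier down and return the first tier the score reaches.
function verdictFor(score: number): Verdict {
  if (score >= 85) return { tier: "Legendary", title: "Battle-Hardened Agent Wrangler" };
  if (score >= 60) return { tier: "Rare", title: "Cautious Deployer" };
  if (score >= 30) return { tier: "Common", title: "Optimistic Shipper" };
  return { tier: "Cursed", title: "Chaos Goblin" };
}
```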

The premise is not that agents are bad. The premise is that agentic QA (testing how an agent behaves when tools fail, inputs go null, and prompts drift) is a different discipline from testing a conventional app, and most teams have not built it yet. The audit is a way to laugh at the gap, screenshot the result, and, if you want agent testing infrastructure that catches these failures before prod does, wander over to the assessment next door.

Related reading: Vibe-QA assessment · Vibe-Coder Bug Roulette · Flaky or Fixable triage game · Vibe coding and the QA gap