Why 'Vibe Checking' Your Code is Replacing Traditional QA Gates: A Post-Agile Manifesto
The QA-gate playbook was built for quarterly releases — here is the testing stack indie hackers actually ship with at 3 AM

Picture a solo founder at 11 PM with a feature half-shipped, three open Stripe webhooks, and a tab counting down to a TestFlight build deadline. The QA playbook says: write unit tests for every branch, integration tests for every endpoint, run them through a staging environment, hold a release review. The founder closes the playbook and ships anyway. Not because they don't care about quality — because the QA-gate model was built for quarterly releases at companies with dedicated QA teams, and they have neither.
What they have instead is vibe checking: a tight loop of smoke tests, LLM judges, and eyeball diffs that catches the regressions a paying user would notice without auditing every code path. It is not slop. It is a deliberate prioritization — and as the post-Agile reality of solo and small teams matures, it is becoming the default. This post is a manifesto, but it is also a buildable stack. We will walk through three production-ready code examples — a Playwright golden path, an LLM judge with cost guards, and a CI workflow — plus the gotchas that bite when you wire them into CI.
The QA Gate Theater Indie Hackers Cannot Afford
Traditional QA gates assume a release is a discrete, expensive event. They are a toll booth before deployment: feature freeze, test pass, sign-off, deploy. That model fits monthly releases at organizations where bugs in production cost six figures and QA engineers are a separate department.
For an indie hacker shipping daily, the same gate is theater. Every gate adds latency between "I wrote it" and "real users see it." And the longer that latency, the more code you write on top of unverified assumptions. Google's DORA State of DevOps Report has consistently shown that elite performers deploy multiple times per day, and that the leading indicator is small batch size with fast feedback rather than exhaustive pre-deploy testing. The 2024 Stack Overflow Developer Survey ranks Jest and Cypress among the most-used JavaScript testing frameworks, but it does not measure how often a solo founder bypasses them under deadline pressure. Anecdotally, from working with indie builders across Barcelona, Madrid, and Valencia: that number is "basically always."
Vibe checking accepts this reality and engineers around it. Instead of asking "is the code correct in every dimension?" it asks "would a paying user notice if this broke?" The answer comes from three layers, each cheap to set up and cheap to run.
What Is Vibe Checking, Really?
Answer capsule: Vibe checking pairs smoke tests, LLM judges, and eyeball diffs to catch the regressions users notice in 10% of the setup time of a full QA gate.
Vibe checking is the deliberate practice of testing the user-perceptible surface of your product — and consciously skipping everything else. The three layers, in order of return-on-effort:
- Layer 1 — Golden path smoke tests. One or two Playwright tests that drive the most-used flow end-to-end (signup, the core action, checkout). They prove the app boots and the money path works.
- Layer 2 — LLM judges. For probabilistic features (chat, generation, search), a small LLM call decides whether the candidate output is acceptable vs a baseline. Replaces snapshot tests that flake every time the model breathes.
- Layer 3 — CI vibe-check workflow. A GitHub Action that wires the layers together, runs on every PR, posts to Slack, and respects a budget. Twelve minutes max, fail fast.
Notice what is missing: no exhaustive unit tests, no integration matrix, no QA sign-off. That is the trade. You accept that some logic bugs will reach production — but only ones that do not affect the golden path or the AI features users actually exercise. Sentry catches the rest in production, and you fix forward.
Vibe Checks vs Traditional QA Gates: The Comparison Table
Before we drop into code, here is how the two models stack up across the metrics that matter when you are a one-person product team:
| Dimension | Traditional QA Gates | Vibe Checking |
|---|---|---|
| Setup time | Days to weeks (test infra, harness, fixtures) | Hours (one Playwright spec + one judge) |
| CI runtime | 15–45 min suites, parallel shards required | 3–10 min, single runner |
| Maintenance cost | High — every refactor breaks N tests | Low — only golden path + judges to update |
| Coverage of AI / LLM features | Poor — snapshots flake, mocks lie | Good — judge evaluates intent, not bytes |
| Detection of edge-case logic bugs | Excellent (when tests exist) | Weak — relies on Sentry post-deploy |
| Suitable team size | 5+ engineers, dedicated QA | 1–4 engineers, no QA |
| Cost per CI run | $0.50–$5 (CI minutes) | $0.05–$0.50 (CI + LLM tokens) |
Layer 1: The Golden-Path Smoke Test
The golden path is the one flow that, if it broke, you would hear about within an hour. For most B2B SaaS: signup → core action → upgrade. For a content product: landing → generate → save. Pick one. Write one Playwright test. Do not pick five.
Here is a production-ready Playwright spec that exercises the flow, stubs the non-deterministic AI call, handles a Stripe redirect race condition, and captures console errors so you do not miss a silent JS regression:
```typescript
// vibe-check/golden-path.spec.ts
import { test, expect } from '@playwright/test';
const BASE_URL = process.env.VIBE_CHECK_URL ?? 'http://localhost:3000';
const TEST_EMAIL = `vibe-${Date.now()}@example.test`;
test.describe.configure({ mode: 'serial', retries: 2 });
test('signup → first action → paywall renders', async ({ page }) => {
// Surface console errors as test failures — a silent JS error is a vibe failure.
const consoleErrors: string[] = [];
page.on('pageerror', (err) => consoleErrors.push(err.message));
page.on('console', (msg) => {
if (msg.type() === 'error') consoleErrors.push(msg.text());
});
await page.goto(`${BASE_URL}/signup`, {
waitUntil: 'networkidle',
timeout: 30_000,
});
await page.fill('input[name="email"]', TEST_EMAIL);
await page.fill('input[name="password"]', 'vibe-check-only');
// Wait for navigation AND submit at the same time — avoids the race where
// the click resolves before the form posts.
await Promise.all([
page.waitForURL(/\/onboarding/, { timeout: 15_000 }),
page.click('button[type="submit"]'),
]);
// Stub the AI call so the test isn't gated on model latency or cost.
await page.route('**/api/generate', async (route) => {
await route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ id: 'fake-1', body: 'stubbed output' }),
});
});
await page.click('[data-testid="generate"]');
await expect(page.getByText('stubbed output')).toBeVisible({ timeout: 10_000 });
// Edge case: Stripe redirect can race with our paywall fallback render.
// Whichever lands first wins — but one of them MUST land.
  const paywall = page
    .getByTestId('paywall')
    .waitFor({ state: 'visible' })
    .then(() => 'paywall' as const)
    // If Stripe wins the race, this pending waitFor would otherwise reject
    // after the test ends and surface as an unhandled rejection.
    .catch(() => null);
const stripe = page
.waitForURL(/checkout\.stripe\.com/, { timeout: 8_000 })
.then(() => 'stripe' as const)
.catch(() => null);
await page.click('[data-testid="upgrade"]');
const winner = await Promise.race([paywall, stripe]);
expect(winner, 'neither paywall nor Stripe redirect happened').not.toBeNull();
// Fail the test if any console errors leaked through, even on green path.
expect(consoleErrors, consoleErrors.join('\n')).toHaveLength(0);
});
```

Why this works: it tests the experience, not the implementation. The `page.route` stub means the test passes even if your AI provider is down — because that is a separate failure mode, monitored separately. The console-error capture catches React hydration bugs and uncaught promise rejections that visual assertions miss. The redirect race handles the real-world non-determinism of Stripe's 302 vs your client-side fallback render.
Gotcha — `networkidle` is a lie on apps with long-poll or SSE. If your app uses Server-Sent Events or a heartbeat WebSocket, `waitUntil: 'networkidle'` will time out. Switch to `'domcontentloaded'` and assert on a specific selector instead.
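A minimal sketch of that switch, assuming an SSE-backed dashboard and an illustrative `app-ready` test id:

```typescript
// vibe-check sketch: navigation that tolerates an always-busy network.
import { test, expect } from '@playwright/test';

test('dashboard loads despite an open SSE stream', async ({ page }) => {
  // A heartbeat stream keeps the network busy forever, so never wait for idle.
  await page.goto('http://localhost:3000/dashboard', {
    waitUntil: 'domcontentloaded',
  });
  // Assert on a selector the app renders only once it is interactive.
  await expect(page.getByTestId('app-ready')).toBeVisible({ timeout: 15_000 });
});
```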
Layer 2: The LLM Judge for Probabilistic Features
Snapshot tests do not work for LLM outputs. The model breathes and the bytes change; you either freeze the snapshot (and stop catching regressions) or accept constant flake. The fix is to test intent, not bytes — by asking another LLM whether the candidate output is meaningfully equivalent to a known-good baseline.
Here is a judge with the four things production code needs and toy examples skip: a cost budget, retry-with-backoff on rate limits, schema validation of the verdict, and a short-circuit for byte-identical outputs:
```typescript
// vibe-check/llm-judge.ts
import Anthropic from '@anthropic-ai/sdk';
import { z } from 'zod';
const VerdictSchema = z.object({
verdict: z.enum(['pass', 'fail', 'flaky']),
reasoning: z.string().min(20).max(400),
severity: z.enum(['low', 'medium', 'high']).optional(),
});
type Verdict = z.infer<typeof VerdictSchema>;
const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const MAX_RETRIES = 3;
const COST_BUDGET_USD = Number(process.env.VIBE_BUDGET_USD ?? '1.00');
let spentUsd = 0;
// Verify pricing in the Anthropic dashboard before relying on these constants.
const INPUT_USD_PER_M = 3;
const OUTPUT_USD_PER_M = 15;
export async function judgeOutput(args: {
prompt: string;
baseline: string;
candidate: string;
rubric: string;
}): Promise<Verdict> {
const { prompt, baseline, candidate, rubric } = args;
if (spentUsd >= COST_BUDGET_USD) {
throw new Error(
`Vibe check budget exceeded: $${spentUsd.toFixed(3)} >= $${COST_BUDGET_USD}`,
);
}
// Edge case: identical strings — skip the API call entirely.
if (baseline.trim() === candidate.trim()) {
return { verdict: 'pass', reasoning: 'Outputs are byte-identical.' };
}
const judgePrompt = [
'You are a regression judge. Decide if CANDIDATE is acceptable vs BASELINE.',
`Rubric: ${rubric}`,
`User prompt: ${prompt}`,
`BASELINE:\n${baseline}`,
`CANDIDATE:\n${candidate}`,
'Reply ONLY as JSON: {"verdict":"pass|fail|flaky","reasoning":"...","severity":"low|medium|high"}',
].join('\n\n');
let lastErr: unknown;
for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
try {
const res = await client.messages.create({
model: 'claude-sonnet-4-6',
max_tokens: 400,
// temperature=0 is non-negotiable for a judge — you want stable verdicts.
temperature: 0,
messages: [{ role: 'user', content: judgePrompt }],
});
spentUsd +=
(res.usage.input_tokens * INPUT_USD_PER_M +
res.usage.output_tokens * OUTPUT_USD_PER_M) /
1_000_000;
const text = res.content
.filter((b): b is Extract<typeof b, { type: 'text' }> => b.type === 'text')
.map((b) => b.text)
.join('');
// Be forgiving of preamble — locate the JSON envelope rather than parsing whole text.
const start = text.indexOf('{');
const end = text.lastIndexOf('}');
if (start === -1 || end === -1) {
throw new SyntaxError(`Judge did not return JSON: ${text.slice(0, 120)}`);
}
return VerdictSchema.parse(JSON.parse(text.slice(start, end + 1)));
} catch (err) {
lastErr = err;
// Rate-limit or overloaded — exponential backoff and retry.
if (
err instanceof Anthropic.APIError &&
(err.status === 429 || err.status === 529)
) {
await new Promise((r) => setTimeout(r, 2 ** attempt * 1000));
continue;
}
// Schema or JSON failure on the last try → report 'flaky' rather than 'fail',
// so a single bad parse does not block a deploy.
if (attempt === MAX_RETRIES - 1) {
if (err instanceof z.ZodError || err instanceof SyntaxError) {
return {
verdict: 'flaky',
reasoning: `Judge produced unparseable output: ${String(err).slice(0, 200)}`,
};
}
throw err;
}
}
}
throw lastErr ?? new Error('Judge exhausted retries');
}
```

Three subtleties worth calling out. First, `temperature: 0` on the judge is non-negotiable — you want the same verdict on the same inputs. Second, the budget guard is a module-scoped variable, which is fine for a single CI run but breaks if you reuse the module across Playwright workers; in that case, persist spend to a file and read it on init (a sketch follows). Third, treating a parse error as `'flaky'` rather than `'fail'` is a deliberate choice: a mis-formatted JSON envelope is a judge problem, not a candidate problem, and should not single-handedly block a deploy.
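Here is what that file-backed budget might look like — a minimal sketch, assuming a single runner and an illustrative `.vibe-spend.json` path; the writes are not atomic across workers, so treat it as a soft cap:

```typescript
// vibe-check/spend.ts (hypothetical): share one judge budget across workers.
import { existsSync, readFileSync, writeFileSync } from 'node:fs';

const SPEND_FILE = process.env.VIBE_SPEND_FILE ?? '.vibe-spend.json';

export function loadSpendUsd(): number {
  if (!existsSync(SPEND_FILE)) return 0;
  try {
    return JSON.parse(readFileSync(SPEND_FILE, 'utf8')).usd ?? 0;
  } catch {
    return 0; // a corrupt spend file should not crash the run
  }
}

export function recordSpendUsd(deltaUsd: number): number {
  const total = loadSpendUsd() + deltaUsd;
  writeFileSync(SPEND_FILE, JSON.stringify({ usd: total }));
  return total;
}
```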
Layer 3: The CI Vibe-Check Workflow
Now we wire it together. Here is a GitHub Actions workflow that runs the Playwright golden path on every PR, runs the LLM judge only when prompt files change (saving cost), enforces a 12-minute timeout, cancels stale runs, and posts a Slack message regardless of outcome:
```yaml
# .github/workflows/vibe-check.yml
name: vibe-check
on:
pull_request:
workflow_dispatch:
# Cancel an in-flight vibe-check when a new commit lands on the same PR.
concurrency:
group: vibe-check-${{ github.ref }}
cancel-in-progress: true
jobs:
check:
runs-on: ubuntu-latest
timeout-minutes: 12
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0 # need history for the prompt-change diff below
- uses: actions/setup-node@v4
with:
node-version: 20
cache: npm
- run: npm ci
# Layer 0 — type check. Fast, free, catches dumb stuff first.
- name: Type check
run: npx tsc --noEmit
- name: Install Playwright browsers
run: npx playwright install --with-deps chromium
# Boot the app in the background, wait for it to answer, then test it.
- name: Boot app
run: |
npm run start &
echo $! > app.pid
npx wait-on http://localhost:3000/api/health --timeout 60000
# Layer 1 — golden path smoke test.
- name: Playwright golden path
run: npx playwright test vibe-check/
env:
VIBE_CHECK_URL: http://localhost:3000
# Layer 2 — LLM judge. Only runs when something prompt-related changed,
# otherwise we burn money on every typo PR.
- name: Detect prompt changes
id: prompts
run: |
if git diff --name-only origin/${{ github.base_ref }}...HEAD \
| grep -E '^(prompts|vibe-check/baselines)/'; then
echo "changed=true" >> "$GITHUB_OUTPUT"
else
echo "changed=false" >> "$GITHUB_OUTPUT"
fi
- name: LLM judge
if: steps.prompts.outputs.changed == 'true'
run: node vibe-check/judge-runner.mjs
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
VIBE_BUDGET_USD: '0.50'
# Always upload Playwright trace, even on success — debugging flake later
# is much cheaper if you have the trace.
- uses: actions/upload-artifact@v4
if: always()
with:
name: playwright-trace-${{ github.run_id }}
path: test-results/
retention-days: 7
- name: Tear down app
if: always()
run: kill "$(cat app.pid)" || true
- name: Slack vibe report
if: always()
uses: slackapi/slack-github-action@v1.27.0
with:
channel-id: ${{ secrets.VIBE_SLACK_CHANNEL }}
slack-message: |
${{ job.status == 'success' && '✅' || '❌' }} vibe check on ${{ github.head_ref }}
run: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
env:
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
```

The workflow embodies the philosophy: type-check first because it is free, run the golden path because it covers the most users, run the judge only when prompts change because it costs money, and always post to Slack because the founder is not watching the GitHub UI. The `concurrency` block is a small detail that matters a lot — without it, three quick pushes spawn three full runs, blowing through your free CI minutes by lunch.
How Do You Keep Vibe Checks From Going Flaky?
Answer capsule: Stub LLM and payment APIs in golden paths, pin judge temperature to 0, and seed timestamps — flake disappears once randomness is bounded.
Flake is the silent killer of vibe checking. A flaky golden path that fails on one in five runs trains you to hit "re-run" instead of debug — which means the next real regression sails through. The fix is identifying the sources of non-determinism and bounding each one:
- Time. If your test relies on `Date.now()`, it will pass today and fail on a daylight-savings boundary. Mock time, or make assertions time-relative (see the sketch after this list).
- Network. Stub all external HTTP in the golden path with `page.route`. The exception: your own API, which is what you are testing.
- Concurrency. Do not race two assertions if you only need one. Use `Promise.race` only when both outcomes are valid (like the Stripe redirect example above).
- Animation. Set `prefers-reduced-motion: reduce` in the Playwright config. CSS animations are a top source of click-before-handler-attached bugs.
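Both the time and animation fixes fit in a few lines — a minimal sketch, assuming Playwright 1.45+ for the clock API; the frozen date is illustrative:

```typescript
// vibe-check sketch: bound the randomness sources in one place.
import { test } from '@playwright/test';

// reducedMotion disables CSS animations; timezoneId pins DST-sensitive logic.
test.use({ reducedMotion: 'reduce', timezoneId: 'UTC' });

test.beforeEach(async ({ page }) => {
  // Freeze Date.now() inside the page to a known instant.
  await page.clock.install({ time: new Date('2025-01-15T12:00:00Z') });
});
```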
Troubleshooting: When Your Vibe Check Lies to You
Five failure modes you will hit and how to triage them. Save this section.
1. The judge keeps returning "flaky" on identical inputs. First, confirm `temperature: 0`. Then check the rubric — overly strict rubrics like "the answer must be word-for-word identical" force the model into hairsplitting. Loosen the rubric to capture intent: "the candidate must convey the same factual claims as the baseline." If the verdict still oscillates, switch to a deterministic diff (e.g., `json-diff`) for that field and let the LLM judge the rest.
2. Playwright passes locally, fails in CI. Almost always a timing issue. The CI runner is slower than your laptop; bump the first-load timeout to 30 seconds. If the test fails with a screenshot showing a spinner, the app booted slower than `wait-on` believed. Add a `/api/health` endpoint that returns 200 only when the database connection is live, and wait on that — not the root URL (a sketch of such an endpoint follows this list).
3. The CI workflow burns budget mysteriously. Check that the prompt-change detector actually short-circuits — a misconfigured `github.base_ref` can make every diff look like it touched prompts. Echo `steps.prompts.outputs.changed` in the log so you can see what it computed, and use `fetch-depth: 0` on checkout so the diff has history to walk against.
4. Sentry shows a crash that the golden path did not catch. That is the design, not a bug. The vibe check covers the path users hit; Sentry covers everything else. If the crash is on a path more than 1% of users hit, promote it to the golden path. If it is rarer, fix forward and move on.
5. The Anthropic SDK throws an `overloaded_error` instead of a 429. That is real and surfaces as `err.status === 529`. The judge above already treats 429 and 529 identically — back off and retry. If it persists, your provider is having a bad afternoon; failing the run open (warning, not block) is a reasonable response, since the judge is not the only signal.
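The health endpoint from item 2 might look like this — a sketch assuming a Next.js app-router project and node-postgres; swap the client for whatever your stack uses:

```typescript
// app/api/health/route.ts (hypothetical): return 200 only when the DB
// answers, so wait-on in CI reflects real readiness, not just a booted server.
import { NextResponse } from 'next/server';
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function GET() {
  try {
    await pool.query('SELECT 1'); // cheap probe over a live connection
    return NextResponse.json({ ok: true });
  } catch {
    return NextResponse.json({ ok: false }, { status: 503 });
  }
}
```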
Edge Cases and Gotchas
- Authentication walls. If your golden path requires a real OAuth flow (Google, GitHub), use a long-lived test account with a refresh token stored as a CI secret. Do not try to drive the OAuth consent screen — providers actively block automated browsers.
- Database state pollution. Each golden-path run creates a real user. Either use a tear-down step that deletes by email pattern (the `vibe-*@example.test` prefix above is designed for exactly this — a sketch follows this list), or run against a database that resets between runs (Supabase branch databases, Neon branches, or a Docker Compose Postgres that gets nuked at end of job).
- LLM judge cost on PR storms. If a contributor opens 12 PRs in an hour, you will burn through `VIBE_BUDGET_USD` twelve times. Add a per-day cap by tagging spend in your billing dashboard and circuit-breaking on the workflow side, or move the judge to a nightly cron rather than per-PR.
- Visual regressions. Vibe checking does not catch CSS-only bugs. If your product is design-sensitive, add `page.screenshot()` and compare via Percy or a manual eyeball diff on PRs. Two minutes of human review beats ten minutes of pixel-perfect snapshot maintenance.
- Multi-region apps. A golden path against your default region will not catch a Cloudflare Worker that breaks in eu-west. Add a second job that runs the same spec against the production URL with a regional `CF-IPCountry` header, on a nightly schedule rather than on every PR.
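The teardown step from the first bullet is a dozen lines — a sketch assuming a Postgres `users` table and node-postgres; table and column names are illustrative:

```typescript
// vibe-check/teardown.ts (hypothetical): sweep up users each run created.
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function cleanupVibeUsers(): Promise<number> {
  // The narrow LIKE pattern means a typo cannot touch real customers.
  const res = await pool.query(
    "DELETE FROM users WHERE email LIKE 'vibe-%@example.test'",
  );
  return res.rowCount ?? 0;
}
```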
The Post-Agile Manifesto: Pick Your Layers, Ship Daily
Agile gave us iteration; CI gave us automation; QA gates gave us safety. The post-Agile reality for indie hackers and small teams is that we do not need the safety of a 200-test integration suite — we need the safety of catching the regressions a paying user would notice, in under twelve minutes, without burning an LLM budget on every typo PR.
That is vibe checking. Three layers: a golden path that proves the app boots, an LLM judge that handles the probabilistic surface, and a CI workflow that ties them together with a budget. A fraction of the maintenance cost of a traditional QA gate, most of the regression coverage that actually matters at your stage. When the regressions you are missing start costing more than a sprint of revenue, add a layer back. Until then: ship daily, vibe check on every PR, and let Sentry handle the long tail.
The QA-gate playbook is not wrong — it is just optimized for a different company. You are not that company yet. Build for the company you are.
Ready to ship your next project faster?
Desplega.ai helps indie hackers and solopreneurs build and ship faster — from Barcelona to Madrid, Valencia to Malaga. Skip the QA-gate theater and get a vibe-check stack tuned to your actual product.
Get Started

Frequently Asked Questions
Is vibe checking the same as not testing?
No. Vibe checking is selective testing — golden-path smoke tests, LLM judges for AI features, and type checks. You skip the 80% of unit tests that rarely catch real regressions in solo projects.
How do I keep an LLM judge from blocking deploys with false positives?
Set temperature to zero, validate the JSON verdict with Zod, re-run the judge once on disagreement, and require two failures before blocking. Treat a single failure as a warning, not a gate.
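As a sketch, the two-failure gate is a thin wrapper over the Layer 2 judge — `judgeWithRetry` is a hypothetical name:

```typescript
// One 'fail' is a warning; only two consecutive fails block the deploy.
import { judgeOutput } from './llm-judge';

export async function judgeWithRetry(
  args: Parameters<typeof judgeOutput>[0],
) {
  const first = await judgeOutput(args);
  if (first.verdict !== 'fail') return first;
  // Disagreement check: re-run once before treating the failure as real.
  const second = await judgeOutput(args);
  return second.verdict === 'fail'
    ? second
    : { ...second, verdict: 'flaky' as const };
}
```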
Can vibe checks replace full integration tests for a B2B SaaS?
Up to roughly $50K MRR, in our experience yes. Past that, regulated buyers and uptime SLAs justify deeper integration tests. Until then, golden paths plus monitoring catch most regressions.
What's the minimum vibe-check stack for a solo founder?
TypeScript strict mode, one Playwright golden path per critical flow, and Sentry. That trio runs in under three minutes on every PR and catches the regressions that actually lose customers.
When should I add traditional QA gates back into my workflow?
When a single regression costs more than a sprint of revenue, or when an enterprise buyer asks for SOC 2. Add QA gates one layer at a time — never wholesale, never preemptively.