From Flaky to Rock Solid: Eliminating Non-Deterministic Failures in Distributed Test Environments
If your distributed tests only pass after a lucky rerun, your tooling is lying to you. Here is how to make the whole system tell the truth.

The vibe-coder version of test infrastructure sounds harmless: spin up a preview app, run some browser tests, rerun failures once, and move on. That works until your product becomes distributed. The moment a single user action fans out across an API, a queue, a worker, a cache, and a read model, the old strategy collapses. Tests stop failing for useful reasons. They fail because one worker reused someone else's state, because a queue drained slower on one runner, or because your assertion raced a projection that was still catching up.
This is where teams either level up or stay trapped in superstition. Beginner tooling treats nondeterministic failures as something you absorb with reruns. Professional tooling treats them as an engineering signal: something in the system is uncontrolled, unobservable, or incorrectly shared. The fix is rarely "add a bigger sleep." The fix is designing your tests and environments so timing, state, and causality become explicit.
The scale of the problem is real. In A Study on the Lifecycle of Flaky Tests (ICSE 2020), Microsoft researchers cited Google reporting that 1.5% of all test runs were flaky and nearly 16% of 4.2 million individual tests failed independently of code changes. In The Effects of Computational Resources on Flaky Tests (ICST 2024), researchers found that 46.5% of flaky tests in their dataset were resource-availability flaky tests. That is the punchline: a large share of "product bugs" in CI are really environment bugs, scheduling bugs, or observability bugs.
This post is for developers moving from "it usually passes on my branch" to professional-grade distributed testing. We will focus on three upgrades that reliably move a suite from flaky to trustworthy: isolate state per worker, wait on contracts instead of wall-clock time, and make every failure traceable across services.
Why do distributed tests go flaky even when your code is correct?
Distributed flake is usually deterministic behavior plus missing context: shared state, weak readiness, queue lag, or CPU starvation.
In a monolith, a test typically makes one call and asserts one result. In a distributed system, the same test crosses process boundaries, clocks, schedulers, and storage models. Every layer adds a new source of nondeterminism:
- Shared mutable state. Two workers write to the same tenant, topic, S3 prefix, Redis key, or email inbox and create false coupling.
- Eventual consistency. The API acknowledges a write before the read model or search index catches up.
- Weak readiness checks. A container answers `200 OK` while migrations, consumers, or sidecars are still cold.
- Resource jitter. CPU steal, noisy neighbors, or low memory change timing enough to expose races.
- Opaque failures. Without correlation IDs and artifacts, the same symptom gets triaged as five different bugs.
The Level Up mindset shift
Stop asking, "How do I make this test eventually pass?" Start asking, "What part of the system is currently uncontrolled?" That shift turns flaky tests from an annoyance into architecture feedback.
| Vibe-coder pattern | Professional pattern | Why it wins |
|---|---|---|
| `await sleep(5000)` | Poll a contract with a deadline and diagnostics | Adapts to slow and fast paths without hiding timeouts |
| One shared staging database | Worker-scoped namespaces or ephemeral databases | Removes cross-test coupling under parallel load |
| Retry and forget | Retry with trace IDs, artifacts, and classification | Turns reruns into evidence instead of denial |
| `/healthz` only | Dependency-aware readiness and schema checks | Catches half-booted environments before tests start |
How do you make an eventually consistent system testable without hiding bugs?
Don't wait longer. Wait smarter: assert the causal contract, keep a hard deadline, and capture enough context to explain every timeout.
The trick is to preserve reality while removing ambiguity. Your test should still exercise queues, caches, and async workers. But it should observe them through explicit contracts, not through wishful timing. The next three examples are the core of a production-ready approach.
1. Isolate every test worker so parallelism stops creating fake bugs
Shared staging environments are the fastest route to flake. If one test suite creates `acme@example.com` and another deletes it, both tests are technically correct and still interfere with each other. The fix is not serial execution. The fix is deterministic ownership of data.
```ts
// playwright/fixtures/test-env.ts
import { test as base, request, expect } from '@playwright/test'
import crypto from 'node:crypto'

type TestEnv = {
  tenantId: string
  traceId: string
}

async function cleanupTenant(apiBaseUrl: string, tenantId: string, traceId: string) {
  const api = await request.newContext({
    baseURL: apiBaseUrl,
    extraHTTPHeaders: { 'x-trace-id': traceId },
  })
  try {
    const response = await api.delete(`/internal/test-tenants/${tenantId}`)
    // A 404 means a background reaper already removed the tenant; that is not a failure.
    if (!response.ok() && response.status() !== 404) {
      throw new Error(`Cleanup failed for ${tenantId}: ${response.status()} ${await response.text()}`)
    }
  } finally {
    await api.dispose()
  }
}

export const test = base.extend<{ env: TestEnv }>({
  env: [async ({}, use, testInfo) => {
    // parallelIndex ties the tenant to a worker slot; the random suffix keeps retries apart.
    const tenantId = `e2e-${testInfo.parallelIndex}-${crypto.randomUUID().slice(0, 8)}`
    const traceId = crypto.randomUUID()
    const apiBaseUrl = process.env.API_BASE_URL
    if (!apiBaseUrl) {
      throw new Error('API_BASE_URL is required')
    }
    const api = await request.newContext({
      baseURL: apiBaseUrl,
      extraHTTPHeaders: { 'x-trace-id': traceId },
    })
    const create = await api.post('/internal/test-tenants', {
      data: { tenantId, seedPlan: 'starter', region: 'eu-west-1' },
    })
    if (!create.ok()) {
      throw new Error(`Tenant bootstrap failed: ${create.status()} ${await create.text()}`)
    }
    await api.dispose()
    try {
      await use({ tenantId, traceId })
    } finally {
      await cleanupTenant(apiBaseUrl, tenantId, traceId)
    }
  }, { scope: 'test' }],
})

export { expect }
```

```ts
// usage in a spec
import { test, expect } from './fixtures/test-env'

test('checkout flow survives duplicate webhook delivery', async ({ page, env }) => {
  await page.goto(`/login?tenant=${env.tenantId}`)
  await page.getByLabel('Email').fill(`owner+${env.tenantId}@example.com`)
  await page.getByRole('button', { name: 'Start trial' }).click()
  await expect(page.getByText('Workspace created')).toBeVisible()
})
```

Why this works: giving each test its own worker-keyed tenant turns parallel execution from a race into independent experiments. The edge case here is cleanup failure. Notice the `404` branch during teardown. In distributed systems, cleanup can be partially complete because a background reaper already removed the tenant. Treating that as fatal creates new flake.
Another important detail is `x-trace-id`. Even when the test fails in the browser, the same trace ID can be propagated into API logs, queue consumers, and job runners. That turns a failing E2E test into a debuggable distributed trace instead of a screenshot and a shrug.
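If you also want browser-originated requests to carry that ID, a minimal sketch (assuming the fixture above is in scope and your services log `x-trace-id`) is to stamp it onto the page and attach it to the report:

```ts
// test-support/trace-hooks.ts — a sketch; the import path matches the fixture file above.
import { test } from '../fixtures/test-env'

test.beforeEach(async ({ page, env }) => {
  // Every request the browser makes now carries the same trace ID the fixture
  // used for tenant bootstrap, so UI failures line up with API and worker logs.
  await page.setExtraHTTPHeaders({ 'x-trace-id': env.traceId })
})

test.afterEach(async ({ env }, testInfo) => {
  // Surface the trace ID in the test report so a red run links straight to the distributed trace.
  testInfo.annotations.push({ type: 'trace-id', description: env.traceId })
})
```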
2. Replace readiness theater with a real dependency-aware probe
The classic `docker compose up && sleep 15` setup is fine for demos and terrible for CI. A service can bind a port before its schema migration finishes, before the Kafka consumer joins its group, or before the cache warmer populates reference data. If your tests start during that window, you manufactured flake before the first assertion ran.
```ts
// scripts/wait-for-system.ts
type HealthPayload = {
  status: 'ok' | 'degraded' | 'down'
  checks: Record<string, { status: 'ok' | 'down'; details?: string }>
  buildSha?: string
}

function sleep(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms))
}

export async function waitForSystemReady(url: string, expectedSha?: string, timeoutMs = 60_000) {
  const startedAt = Date.now()
  let attempt = 0
  let lastBody = ''
  while (Date.now() - startedAt < timeoutMs) {
    attempt += 1
    try {
      const response = await fetch(url, { headers: { accept: 'application/json' } })
      lastBody = await response.text()
      // fetch exposes ok as a property, not a method
      if (!response.ok) {
        throw new Error(`health endpoint returned ${response.status}`)
      }
      const payload = JSON.parse(lastBody) as HealthPayload
      const failedChecks = Object.entries(payload.checks).filter(([, check]) => check.status !== 'ok')
      if (payload.status === 'ok' && failedChecks.length === 0) {
        if (expectedSha && payload.buildSha && payload.buildSha !== expectedSha) {
          throw new Error(`wrong build deployed: expected ${expectedSha}, got ${payload.buildSha}`)
        }
        return payload
      }
      // Not ready yet: throw so the backoff path below records which checks are still failing.
      throw new Error(`not ready: status=${payload.status} failing=[${failedChecks.map(([name]) => name).join(', ')}]`)
    } catch (error) {
      const delay = Math.min(250 * 2 ** attempt, 5_000)
      if (Date.now() - startedAt + delay >= timeoutMs) {
        const reason = error instanceof Error ? error.message : String(error)
        throw new Error(
          `System did not become ready within ${timeoutMs}ms after ${attempt} attempts. Last error: ${reason}. Last body: ${lastBody}`
        )
      }
      await sleep(delay)
    }
  }
  throw new Error(`Timed out waiting for ${url}`)
}
```

```ts
// node --env-file=.env scripts/wait-for-system.ts
const expectedSha = process.env.GITHUB_SHA
await waitForSystemReady(process.env.READINESS_URL ?? 'http://localhost:8080/ready', expectedSha)
```

The key improvement is that readiness now means something. The probe verifies downstream checks and can reject the wrong build SHA, an edge case that becomes common in preview platforms when stale environments are reused. Exponential backoff prevents the check itself from becoming the load spike that delays startup.
Professional test infrastructure treats environment boot as part of the product under test. If the environment cannot describe its own readiness honestly, your pipeline cannot trust its own results.
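What that probe expects on the other side is a readiness endpoint that actually interrogates its dependencies. Here is a sketch of one, not a drop-in implementation; the three probe functions are placeholders for whatever your stack exposes (migration tables, consumer-group assignments, cache warmup markers):

```ts
// src/ready.ts — a sketch of a dependency-aware readiness handler, not a drop-in implementation.
import http from 'node:http'

type Check = { status: 'ok' | 'down'; details?: string }

// Placeholder probes: these are assumptions, wire them to your real dependencies.
async function checkDatabaseMigrations(): Promise<void> { /* e.g. query the migrations table */ }
async function checkConsumerJoined(): Promise<void> { /* e.g. verify the consumer group has assignments */ }
async function checkReferenceData(): Promise<void> { /* e.g. confirm a known warm cache key exists */ }

async function runCheck(name: string, probe: () => Promise<void>): Promise<[string, Check]> {
  try {
    await probe()
    return [name, { status: 'ok' }]
  } catch (error) {
    return [name, { status: 'down', details: error instanceof Error ? error.message : String(error) }]
  }
}

async function readinessPayload() {
  const checks = Object.fromEntries(
    await Promise.all([
      runCheck('database', checkDatabaseMigrations),
      runCheck('kafka-consumer', checkConsumerJoined),
      runCheck('cache-warmup', checkReferenceData),
    ])
  )
  const allOk = Object.values(checks).every((check) => check.status === 'ok')
  return { status: allOk ? 'ok' : 'degraded', checks, buildSha: process.env.BUILD_SHA }
}

http
  .createServer(async (req, res) => {
    if (req.url === '/ready') {
      const payload = await readinessPayload()
      res.writeHead(payload.status === 'ok' ? 200 : 503, { 'content-type': 'application/json' })
      res.end(JSON.stringify(payload))
      return
    }
    res.writeHead(404)
    res.end()
  })
  .listen(8080)
```

Returning 503 while any dependency is down keeps load balancers, orchestrators, and the test probe agreeing on what ready means.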
3. Wait for causal completion, not arbitrary time
The hardest distributed tests are usually not request-response flows. They are workflows where write-side success is immediate but read-side correctness appears later. Think order placement, billing sync, search indexing, analytics pipelines, or email delivery. If you solve those tests with `sleep(10000)`, you created a suite that is both slow and unreliable.
```ts
// test-support/wait-for-projection.ts
type ProjectionState = {
  orderId: string
  status: 'pending' | 'confirmed' | 'failed'
  version: number
  lastProcessedEventId?: string
}

type WaitOptions = {
  apiBaseUrl: string
  orderId: string
  minimumVersion: number
  traceId: string
  timeoutMs?: number
}

function delay(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms))
}

export async function waitForOrderProjection({
  apiBaseUrl,
  orderId,
  minimumVersion,
  traceId,
  timeoutMs = 20_000,
}: WaitOptions): Promise<ProjectionState> {
  const deadline = Date.now() + timeoutMs
  let attempts = 0
  let lastState: ProjectionState | undefined
  while (Date.now() < deadline) {
    attempts += 1
    const response = await fetch(`${apiBaseUrl}/internal/test-projections/orders/${orderId}`, {
      headers: { 'x-trace-id': traceId, accept: 'application/json' },
    })
    // The projection may not exist yet right after the write side acknowledges the command.
    if (response.status === 404) {
      await delay(200)
      continue
    }
    if (!response.ok) {
      throw new Error(`Projection lookup failed with ${response.status}: ${await response.text()}`)
    }
    const state = (await response.json()) as ProjectionState
    lastState = state
    // A terminal failure state will never reach the target version; fail fast with context.
    if (state.status === 'failed') {
      throw new Error(`Projection entered failed state for order ${orderId}`)
    }
    if (state.version >= minimumVersion) {
      return state
    }
    await delay(Math.min(200 * attempts, 1_000))
  }
  throw new Error(
    `Timed out waiting for projection. orderId=${orderId} minVersion=${minimumVersion} lastState=${JSON.stringify(lastState)} traceId=${traceId}`
  )
}
```

This is fundamentally different from a fixed sleep. The helper waits for a domain contract: a versioned projection reaching at least the expected state. It handles the `404` edge case that appears before the projection exists, and it fails fast if the projection enters a known terminal error state. Your test remains honest because it still times out. It just times out with evidence.
That evidence matters because the same symptom can have very different causes. A timeout might mean queue lag, a poisoned event, a stuck consumer rebalance, or a projection bug. Returning `lastState` and `traceId` makes the distinction visible in one failure instead of three hours of guessing.
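In a spec, the helper slots in immediately after the write-side action. A sketch, assuming `baseURL` is configured and that the checkout endpoint returns the order ID and the version of the event it appended (adjust to whatever your API actually exposes):

```ts
import { test, expect } from './fixtures/test-env'
import { waitForOrderProjection } from '../test-support/wait-for-projection'

test('order appears in the read model after checkout', async ({ request, env }) => {
  // Write side: place the order through the public API. The response shape
  // (orderId plus version) is an assumption about your checkout endpoint.
  const response = await request.post('/api/orders', {
    headers: { 'x-trace-id': env.traceId },
    data: { tenantId: env.tenantId, sku: 'starter-plan', quantity: 1 },
  })
  expect(response.ok()).toBeTruthy()
  const { orderId, version } = await response.json()

  // Read side: wait for the projection to catch up to at least that version.
  const projection = await waitForOrderProjection({
    apiBaseUrl: process.env.API_BASE_URL!,
    orderId,
    minimumVersion: version,
    traceId: env.traceId,
  })
  expect(projection.status).toBe('confirmed')
})
```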
What the internals are teaching you
At a protocol level, distributed flake usually comes from violating one of three assumptions:
- Causality assumption: the test assumes the read side is current immediately after the write side acknowledges work.
- Isolation assumption: the test assumes no other worker can mutate the same resource.
- Scheduling assumption: the test assumes CPU, I/O, and network latency stay inside a narrow band.
That is why flaky tests often cluster around queues, search indexing, payment webhooks, cache invalidation, and UI assertions built on asynchronous rendering. The UI study An Empirical Analysis of UI-based Flaky Tests analyzed 235 flaky UI test samples across 62 projects, which should kill the myth that UI flake is just user error. These are systems problems, and they need systems answers.
Common gotchas teams miss on the first stabilization pass
- Clock drift and fake timers. Freezing time in one process but not another creates impossible states in token expiry, scheduled jobs, and signed URLs.
- Random seeds only in the app. If tests, fixtures, and workers generate random values differently, reproducing the same run is still impossible.
- Idempotency ignored. Duplicate webhook or event delivery is not an exotic case in distributed systems. Your tests should model it explicitly (see the sketch after this list).
- Caches not namespaced. Teams isolate databases but forget Redis, CDN prefixes, blob storage, or search indexes.
- Retries without classification. If a rerun passes, you still need to record whether the original failure was infrastructure, product, or unknown.
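Here is what modeling duplicate delivery can look like in practice. A minimal sketch, assuming a hypothetical `/webhooks/payments` endpoint that deduplicates on the provider event ID and a test-only ledger projection route:

```ts
import { test, expect } from './fixtures/test-env'
import crypto from 'node:crypto'

test('duplicate payment webhooks do not double-charge', async ({ request, env }) => {
  const eventId = crypto.randomUUID()
  const payload = {
    id: eventId, // the provider-side event ID doubles as the idempotency key
    type: 'payment.succeeded',
    tenantId: env.tenantId,
    amountCents: 4900,
  }

  // Deliver the same event twice, as an at-least-once queue or a webhook retry would.
  const first = await request.post('/webhooks/payments', { headers: { 'x-trace-id': env.traceId }, data: payload })
  const second = await request.post('/webhooks/payments', { headers: { 'x-trace-id': env.traceId }, data: payload })
  expect(first.ok()).toBeTruthy()
  expect(second.ok()).toBeTruthy()

  // The read model must record exactly one charge regardless of delivery count.
  const ledger = await request.get(`/internal/test-projections/ledger/${env.tenantId}`, {
    headers: { 'x-trace-id': env.traceId },
  })
  const entries = await ledger.json()
  expect(entries.filter((entry: { eventId: string }) => entry.eventId === eventId)).toHaveLength(1)
})
```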
Troubleshooting nondeterministic failures in the real world
When a test fails only in CI, do not start by adding sleep. Start by reducing uncertainty. The fastest path is a short, consistent triage loop:
- Capture the run fingerprint: git SHA, image tag, worker ID, seed, tenant ID, trace ID, and region (a capture helper is sketched after this list).
- Compare first failure and rerun outcome. Did state differ, or only timing?
- Inspect dependency readiness logs before the first assertion, not just after the failure.
- Check queue lag, consumer rebalance events, and resource throttling on the failing worker.
- Re-run the single test against a fresh namespace. If it stabilizes, you likely have leaked shared state.
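Capturing that fingerprint is cheap to automate. A sketch that attaches it to every test result, assuming the environment variable names your CI exports (rename them to match yours):

```ts
// test-support/fingerprint.ts — env var names are assumptions; use whatever your CI exports.
import { test } from '../fixtures/test-env'

test.afterEach(async ({ env }, testInfo) => {
  const fingerprint = {
    gitSha: process.env.GITHUB_SHA,
    imageTag: process.env.IMAGE_TAG,
    workerIndex: testInfo.workerIndex,
    parallelIndex: testInfo.parallelIndex,
    seed: process.env.TEST_SEED,
    tenantId: env.tenantId,
    traceId: env.traceId,
    region: process.env.AWS_REGION,
    retry: testInfo.retry,
    status: testInfo.status,
  }
  // The attached JSON shows up next to screenshots and traces in the HTML report.
  await testInfo.attach('run-fingerprint', {
    body: JSON.stringify(fingerprint, null, 2),
    contentType: 'application/json',
  })
})
```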
Failure mode map
- Fails only under high parallelism: suspect shared state, rate limits, or resource starvation.
- Fails only on preview deployments: suspect stale images, wrong build SHA, or incomplete startup hooks.
- Fails with 404 before eventually passing: suspect eventual consistency and replace sleep with contract polling.
- Fails with different symptoms on each rerun: suspect missing trace IDs, unseeded randomness, or multiple root causes hiding behind one test name.
A practical migration path for indie teams
You do not need a platform team to start. If you are moving from beginner tools to professional ones, implement the upgrades in this order:
- Add worker-scoped namespaces for every mutable dependency: database, cache, object storage, and third-party inboxes.
- Replace all fixed sleeps with polling helpers that assert business contracts and emit debug context on timeout.
- Make readiness dependency-aware and verify the deployed build identity before any test starts.
- Propagate a trace ID from the test runner through HTTP requests and background jobs.
- Keep retries, but only as a classified signal with artifacts, never as the final fix.
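For that last step, retries stay enabled but every pass-on-rerun gets labeled instead of silently absorbed. A minimal sketch using Playwright's retry metadata; the classification value is a placeholder that a triage job can later upgrade to infrastructure or product:

```ts
// Keep retries: 2 in playwright.config.ts; this hook records what a rerun actually proved.
import { test } from './fixtures/test-env'

test.afterEach(async ({}, testInfo) => {
  // Passed only on a rerun: the original failure was nondeterministic.
  if (testInfo.retry > 0 && testInfo.status === 'passed') {
    // Default to 'unknown'; upgrade once the trace ID and fingerprint have been inspected.
    testInfo.annotations.push({ type: 'flake-classification', description: 'unknown' })
  }
})
```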
The real win is not a greener dashboard. It is trust. Once your suite stops failing for mysterious reasons, every red build becomes worth investigating. That changes developer behavior. People stop ignoring failures because failures stop behaving like noise.
That is the professional jump. Not more tests. Better test physics. Distributed systems are allowed to be asynchronous, concurrent, and failure-prone. Your infrastructure just has to model that reality honestly. When it does, flake stops feeling random and starts looking like ordinary engineering again.
Ready to level up your dev toolkit?
Desplega.ai helps developers transition to professional tools smoothly, from vibe-coded prototypes to reliable CI, observability, and test infrastructure.
Frequently Asked Questions
Should I fix flaky tests before adding more parallelism?
Yes. More workers amplify hidden shared-state bugs, timing races, and bad waits. Stabilize the suite first, then scale parallelism while measuring queue time, retries, and failure clustering.
Are retries always bad in distributed test environments?
No. Retries are useful for proving a failure is nondeterministic and for collecting artifacts, but they are diagnostic tools. Keeping them as the fix hides root causes and slows every pipeline.
What is the fastest way to reduce flakiness in an event-driven stack?
Start by isolating data per test run, replacing sleep-based waits with contract polling, and attaching correlation IDs to every request and event. Those three changes remove most blind spots fast.
How do I know whether a failure is infrastructure flake or a real bug?
Compare artifacts across repeated runs: trace ID, seed, worker ID, image tag, queue lag, and event timeline. If identical inputs diverge, investigate environment or timing before product code.