May 14, 2026

Escape Integration Test Purgatory: Implementing a Quarantine Service to Stabilize Your CI Pipeline

If one flaky integration test can freeze your whole release train, you do not need more retries; you need a control plane.

Quarantine service control plane separating known flaky failures from actionable CI failures

Every team hits this moment. A merge request is ready, product is waiting, and CI is red again because one integration test decided today was the day to become philosophical. You rerun the job. It passes. Nobody trusts the green build, but everybody wants to merge anyway. That is integration test purgatory: the suite is too important to ignore and too noisy to obey.

Vibe-coder tooling usually answers this with more retries, bigger sleeps, or a quick `test.skip()` on the worst offender. Professional tooling treats it as an infrastructure problem. The real question is not "how do I hide this failure?" It is "how do I keep known noise from blocking the pipeline while still preserving signal, ownership, and evidence?"

A quarantine service is the missing control plane. Instead of encoding flaky-test policy inside each test framework, you centralize it: CI reports failures to a service, the service decides whether the failure is already quarantined, and the pipeline downgrades only the known noise while keeping everything visible. Microsoft describes exactly this style of mitigation: run all tests, suppress failures from quarantined tests, and automatically remove them once the noise stops. That is the mindset to copy.

The scale argument is not theoretical. Google wrote that about 84% of observed pass-to-fail transitions in their post-submit system involved a flaky test. Microsoft says its internal flaky-test management system is used by more than 100 product teams, has identified roughly 49,000 flaky tests, and helped pass 160,000 sessions that would otherwise have failed. If you are still handling this with ad hoc `skip` tags, you are solving a control-plane problem with source-code graffiti.

What does a quarantine service actually do?

It keeps suspect failures visible, owned, and expiring while preventing already-known flakes from blocking every merge in day-to-day delivery.

A quarantine service sits between your test runner and your CI verdict. It usually has four jobs:

  • Store quarantine rules keyed by a stable fingerprint such as `suite + file + test title + project`.
  • Track owner, reason, creation time, expiry time, and optional issue URL so every quarantine has accountability.
  • Accept execution events from CI and decide whether a given failure is actionable, already quarantined, or suspiciously changed.
  • Emit a verdict that preserves visibility: pass, fail, neutral, or warning, plus summary artifacts for GitHub Checks, Slack, or dashboards.

The subtle benefit is emotional as much as technical. Developers stop negotiating with a red build on every PR. Reviewers stop asking whether a failure is "real enough" to block. Release managers stop normalizing the ritual of "rerun until green." The service creates a shared definition of trustworthy signal, and that definition is what makes the pipeline usable again.
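
To make those four jobs concrete, here is a minimal sketch of the records the service might exchange with CI. The field names are illustrative assumptions that mirror the responsibilities above, not a fixed API.

// quarantine-types.ts - illustrative shapes for rules, events, and verdicts
export type QuarantineRule = {
  fingerprint: string;   // stable identity: suite + file + test title + project
  ownerEmail: string;    // accountability: who re-triages this rule
  reason: string;
  issueUrl?: string;
  createdAt: string;     // ISO timestamps
  expiresAt: string;     // every rule expires; no forever-quarantines
};

export type ExecutionEvent = {
  fingerprint: string;
  commitSha: string;
  branchName: string;
  runId: string;
  outcome: 'passed' | 'failed' | 'flaky';
  errorSignature?: string;
};

export type Verdict = {
  classification: 'actionable' | 'quarantined' | 'expired';
  summary: string;       // rendered into GitHub Checks, Slack, or dashboards
};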

Control-plane rule

Quarantine should downgrade failures, not delete them. If you stop running quarantined tests, you cannot tell whether the test recovered, regressed, or started failing for a brand-new reason.

Approach | What happens in CI | Failure mode | Professional verdict
Retry only | CI burns time until something turns green | Real bugs and flakes become indistinguishable | Useful for detection, not policy
`test.skip()` in code | The noisy test disappears entirely | No evidence, no expiry, no owner pressure | Fast but dangerous
Central quarantine service | All tests run, known flakes are downgraded | Requires stable fingerprints and governance | The right long-term design

Why not just retry and move on?

Because retries can detect flakiness, but they cannot own it, expire it, or explain it to the rest of your delivery system.

Even Playwright treats retries as classification, not absolution. In its docs, a test that fails first and passes on retry is reported as "flaky", and the framework can even fail the run if any test is flagged that way. That is an important distinction: retries are evidence that non-determinism exists. They are not a substitute for policy.
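
If you want that behavior explicit in config, a minimal sketch looks like this. The custom reporter path is an assumption pointing at Code Example 3 further down, and the fail-on-flaky option only exists in recent Playwright versions, so treat both as illustrations rather than drop-in settings.

// playwright.config.ts - retries as detection, not policy (illustrative)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // A test that fails and then passes on retry is reported as "flaky", not quietly "passed".
  retries: process.env.CI ? 2 : 0,
  // Recent Playwright versions can fail the whole run when any test is flagged flaky:
  // failOnFlakyTests: true,
  reporter: [
    ['list'],
    ['./playwright-quarantine-reporter.ts'],   // the reporter sketched in Code Example 3
  ],
});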

This is where a quarantine service earns its keep. You can infer flakiness with retries, just like Microsoft does in rolling sessions on main, but store the outcome in a service that understands TTLs, issue links, owners, and CI presentation. Google Research also notes the sheer volume of the problem: one study evaluated flaky culprit-finding across 13,000+ test breakages. Once the problem is that common, it stops being a test-case tweak and becomes platform engineering.

The minimum architecture that actually works

Keep the first version small. You do not need machine learning, distributed tracing, or a custom dashboard on day one. You need a stable fingerprint, a durable store, and a clear CI contract.

  • PostgreSQL for rules and execution events.
  • A tiny HTTP service that exposes `upsertRule`, `getRulesForCommit`, and `classifyRun`.
  • A CI adapter that uploads test results and renders the summary into GitHub Checks or job summaries.
  • A stable fingerprint strategy so renamed tests do not silently inherit the wrong quarantine.

Fingerprints deserve special attention. If you key only on test title, a rename can orphan the old rule. If you key only on file path, parameterized tests can collide. A practical compromise is to hash the tuple {project, file, fullTitle, runner} and optionally attach a source-line hint for debugging. If the fingerprint changes, treat it as a new test until a human confirms otherwise.
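
As a sketch, with hypothetical helper names: hash only the stable tuple, and carry the source line as a human-readable hint instead of folding it into identity.

// fingerprint.ts - identity is a contract; hash the stable tuple only (illustrative helper)
import crypto from 'node:crypto';

export type TestIdentity = {
  project: string;
  file: string;       // repo-relative path, so CI and laptops agree
  fullTitle: string;  // full title path joined deterministically
  runner: string;     // 'playwright', 'jest', 'pytest', ...
};

export function buildFingerprint(id: TestIdentity, sourceLine?: number) {
  const hash = crypto
    .createHash('sha256')
    .update(JSON.stringify([id.project, id.file, id.fullTitle, id.runner]))
    .digest('hex');

  // The line number is a debugging hint only; keeping it out of the hash means
  // unrelated edits above the test do not silently change its identity.
  return { fingerprint: hash, hint: sourceLine ? `${id.file}:${sourceLine}` : id.file };
}

If a stored rule's fingerprint stops matching anything in the latest run, surface it for a human to confirm the rename instead of silently carrying the old rule forward.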

This is one of the sharpest differences between beginner and professional tooling. Vibe coding tends to treat identity as whatever string happens to be easy to reach. Production infrastructure treats identity as a contract. If your quarantine key is unstable, your policy layer will randomly forgive the wrong failure or fail the wrong test. That is not a small bug; it is the system failing at its one job.

Code Example 1: A production-safe PostgreSQL schema

The schema below does three things beginners usually miss: enforces TTLs, prevents multiple active rules for the same fingerprint, and preserves execution history for later debugging.

-- migration.sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE TABLE quarantine_rules (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  fingerprint TEXT NOT NULL,
  test_name TEXT NOT NULL,
  suite_name TEXT NOT NULL,
  runner TEXT NOT NULL,
  owner_email TEXT NOT NULL,
  reason TEXT NOT NULL,
  issue_url TEXT,
  active BOOLEAN NOT NULL DEFAULT TRUE,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  expires_at TIMESTAMPTZ NOT NULL,
  resolved_at TIMESTAMPTZ,
  metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
  CHECK (expires_at > created_at),
  CHECK (resolved_at IS NULL OR resolved_at >= created_at)
);

-- Prevent two active quarantine rules for the same fingerprint.
CREATE UNIQUE INDEX quarantine_rules_one_active_rule
  ON quarantine_rules (fingerprint)
  WHERE active = TRUE;

CREATE INDEX quarantine_rules_expiry_lookup
  ON quarantine_rules (active, expires_at);

CREATE TABLE quarantine_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  fingerprint TEXT NOT NULL,
  commit_sha TEXT NOT NULL,
  branch_name TEXT NOT NULL,
  run_id TEXT NOT NULL,
  outcome TEXT NOT NULL CHECK (outcome IN ('passed', 'failed', 'flaky', 'quarantined')),
  error_signature TEXT,
  observed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  payload JSONB NOT NULL DEFAULT '{}'::jsonb
);

CREATE INDEX quarantine_events_run_lookup
  ON quarantine_events (run_id, fingerprint);

CREATE INDEX quarantine_events_recent_failures
  ON quarantine_events (fingerprint, observed_at DESC)
  WHERE outcome IN ('failed', 'flaky', 'quarantined');

-- Edge case: immediately expire stale rules after long outages.
UPDATE quarantine_rules
SET active = FALSE, resolved_at = now()
WHERE active = TRUE
  AND expires_at < now();

Why start here? Because service logic is only as good as the invariants below it. A partial unique index stops duplicate active rules during double-clicks in an admin UI. The expiry index gives you a cheap janitor query. Execution events let you answer the question your team will ask within a week: "Was this failure already noisy last Tuesday, or is this a new regression wearing the same test name?"

Code Example 2: A TypeScript service that classifies failures safely

This example uses transaction-level advisory locks so two CI jobs classifying the same fingerprint cannot race each other into conflicting rule state. PostgreSQL explicitly documents that session-level advisory locks survive rollbacks, which is exactly why you should prefer transaction-level locks here.

// quarantine-service.ts
import crypto from 'node:crypto';
import { Pool, PoolClient } from 'pg';

type Classification = 'actionable' | 'quarantined' | 'expired';

type FailureEvent = {
  fingerprint: string;
  testName: string;
  suiteName: string;
  runner: string;
  branchName: string;
  commitSha: string;
  runId: string;
  errorSignature?: string;
};

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,
});

function advisoryKey(input: string): bigint {
  // Fold a sha256 prefix into the signed 64-bit range pg_advisory_xact_lock expects;
  // without the asIntN fold, values above 2^63 - 1 are rejected by the bigint parameter.
  const hex = crypto.createHash('sha256').update(input).digest('hex').slice(0, 16);
  return BigInt.asIntN(64, BigInt('0x' + hex));
}

async function withClient<T>(fn: (client: PoolClient) => Promise<T>): Promise<T> {
  const client = await pool.connect();
  try {
    return await fn(client);
  } finally {
    client.release();
  }
}

export async function classifyFailure(event: FailureEvent): Promise<Classification> {
  return withClient(async (client) => {
    await client.query('BEGIN');
    try {
      await client.query('SELECT pg_advisory_xact_lock($1)', [advisoryKey(event.fingerprint).toString()]);

      const activeRule = await client.query(
        `SELECT id, expires_at
         FROM quarantine_rules
         WHERE fingerprint = $1 AND active = TRUE
         LIMIT 1`,
        [event.fingerprint],
      );

      let classification: Classification = 'actionable';

      if (activeRule.rowCount === 1) {
        const expiresAt = new Date(activeRule.rows[0].expires_at);
        if (expiresAt > new Date()) {
          classification = 'quarantined';
        } else {
          classification = 'expired';
          await client.query(
            `UPDATE quarantine_rules
             SET active = FALSE, resolved_at = now()
             WHERE id = $1 AND active = TRUE`,
            [activeRule.rows[0].id],
          );
        }
      }

      await client.query(
        `INSERT INTO quarantine_events
           (fingerprint, commit_sha, branch_name, run_id, outcome, error_signature, payload)
         VALUES ($1, $2, $3, $4, $5, $6, $7::jsonb)`,
        [
          event.fingerprint,
          event.commitSha,
          event.branchName,
          event.runId,
          classification === 'quarantined' ? 'quarantined' : 'failed',
          event.errorSignature ?? null,
          JSON.stringify({
            testName: event.testName,
            suiteName: event.suiteName,
            runner: event.runner,
          }),
        ],
      );

      await client.query('COMMIT');
      return classification;
    } catch (error) {
      await client.query('ROLLBACK');
      console.error('Failed to classify quarantine event', {
        runId: event.runId,
        fingerprint: event.fingerprint,
        error,
      });
      throw error;
    }
  });
}

// Edge case: missing fingerprint is a contract bug, not a quarantinable failure.
export function assertFingerprint(input: string | undefined): string {
  if (!input || input.trim().length < 12) {
    throw new Error('Missing or unstable test fingerprint');
  }
  return input;
}

The non-obvious part is the lock choice. Session-level advisory locks are not released by `ROLLBACK`, so a request that errors out mid-classification can keep holding the lock until its connection dies. Under load, that turns your quarantine service into a distributed shrug. Transaction-scoped locks align the lock's lifecycle with the classification decision.

Also notice the refusal to quarantine events with missing fingerprints. Teams often try to be helpful here by falling back to `test.title`. Do not. A weak fallback turns a data quality problem into a policy corruption problem. If the runner cannot identify a test stably, fail that branch of the integration and fix the contract before you trust the quarantine verdict again.

Code Example 3: A Playwright reporter that downgrades only known flakes

Keep runner integration thin. The reporter should upload structured results, ask the service for classification, and then fail the build only for actionable problems. It should not own quarantine policy itself.

// playwright-quarantine-reporter.ts
import crypto from 'node:crypto';
import type {
  FullConfig,
  FullResult,
  Reporter,
  Suite,
  TestCase,
  TestResult,
} from '@playwright/test/reporter';

type RemoteResult = {
  classification: 'actionable' | 'quarantined' | 'expired';
};

function fingerprint(test: TestCase): string {
  // Hash only the stable tuple. The source line is deliberately excluded so that
  // unrelated edits above the test do not change its identity; consider relativizing
  // the file path as well so local runs and CI agree.
  return crypto
    .createHash('sha256')
    .update(JSON.stringify({
      file: test.location.file,
      project: test.parent.project()?.name ?? 'default',
      title: test.titlePath(),
      runner: 'playwright',
    }))
    .digest('hex');
}

async function classify(test: TestCase, result: TestResult): Promise<RemoteResult> {
  const response = await fetch(process.env.QUARANTINE_SERVICE_URL + '/classify', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({
      fingerprint: fingerprint(test),
      testName: test.title,
      suiteName: test.parent.title,
      runner: 'playwright',
      branchName: process.env.GITHUB_REF_NAME,
      commitSha: process.env.GITHUB_SHA,
      runId: process.env.GITHUB_RUN_ID,
      errorSignature: result.error?.message?.slice(0, 500) ?? null,
    }),
  });

  if (!response.ok) {
    throw new Error(`Quarantine service returned ${response.status}`);
  }

  return (await response.json()) as RemoteResult;
}

class QuarantineReporter implements Reporter {
  private actionableFailures = 0;
  private quarantinedFailures = 0;
  private pending: Promise<void>[] = [];

  onTestEnd(test: TestCase, result: TestResult) {
    // Timeouts are failures too, and they are often the flakiest ones.
    if (result.status !== 'failed' && result.status !== 'timedOut') return;

    // Queue the classification and settle it in onEnd, so the final verdict
    // never races an in-flight service call.
    this.pending.push(this.recordFailure(test, result));
  }

  private async recordFailure(test: TestCase, result: TestResult) {
    try {
      const remote = await classify(test, result);

      if (remote.classification === 'quarantined') {
        this.quarantinedFailures += 1;
        console.log('[quarantine] downgraded known flaky failure:', test.titlePath().join(' > '));
        return;
      }

      // Edge case: expired quarantine should fail loudly so the owner re-triages it.
      this.actionableFailures += 1;
      console.error('[quarantine] actionable failure:', {
        classification: remote.classification,
        test: test.titlePath(),
      });
    } catch (error) {
      // If the service is down, fail closed. Silent degradation hides real regressions.
      this.actionableFailures += 1;
      console.error('[quarantine] classification error', {
        test: test.titlePath(),
        error,
      });
    }
  }

  async onEnd(_: FullResult) {
    await Promise.all(this.pending);

    if (this.quarantinedFailures > 0) {
      console.log(`[quarantine] downgraded ${this.quarantinedFailures} known flaky failure(s)`);
    }

    if (this.actionableFailures > 0) {
      process.exitCode = 1;
    }
  }
}

export default QuarantineReporter;

Notice the deliberate failure mode. If the service is down, the reporter fails closed. That is usually the right default for protected branches because a silently unavailable control plane is worse than a temporarily red build. If you need softer behavior on feature branches, make the fallback policy environment-specific and explicit.
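
One way to keep that fallback explicit is a single policy function the reporter consults in its catch block. This is a sketch, and the `QUARANTINE_FAIL_MODE` variable is a hypothetical override, not something the examples above already read.

// fallback-policy.ts - explicit behavior when the quarantine service is unreachable (sketch)
type FailMode = 'closed' | 'open';

export function failModeForBranch(branchName: string | undefined): FailMode {
  // Hypothetical explicit override, e.g. QUARANTINE_FAIL_MODE=open on experimental pipelines.
  const override = process.env.QUARANTINE_FAIL_MODE;
  if (override === 'open' || override === 'closed') return override;

  // Protected branches fail closed: an unreachable control plane blocks the merge.
  if (branchName === 'main' || branchName?.startsWith('release/')) return 'closed';

  // Feature branches fail open, but the outage should still be logged loudly.
  return 'open';
}

// Usage sketch inside the reporter's catch block:
//   if (failModeForBranch(process.env.GITHUB_REF_NAME) === 'closed') this.actionableFailures += 1;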

Code Example 4: A GitHub Actions workflow that preserves signal

GitHub Actions can fan out a large matrix, but the docs cap a matrix at 256 jobs per workflow run. That matters because a quarantine system should aggregate results back into one verdict instead of letting every shard invent its own policy.

# .github/workflows/integration.yml
name: integration

on:
  pull_request:
  push:
    branches: [main]

jobs:
  tests:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm ci
      - run: npx playwright test --shard=${{ matrix.shard }}/4
        env:
          QUARANTINE_SERVICE_URL: ${{ secrets.QUARANTINE_SERVICE_URL }}
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: playwright-results-${{ matrix.shard }}
          path: test-results/

  summarize:
    needs: tests
    if: always()
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
      - name: Build quarantine summary
        run: node scripts/build-quarantine-summary.mjs
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          QUARANTINE_SERVICE_URL: ${{ secrets.QUARANTINE_SERVICE_URL }}

      # Edge case: do not publish "success" if summary generation itself failed.
      - name: Fail when summary step marked actionable regressions
        run: test -f .quarantine-green

The summarizer job is where you write one coherent story back to GitHub: new actionable failures, downgraded quarantined failures, expired rules, and the direct links the owner needs to fix them. If you are using the Checks API, GitHub’s docs show you can attach annotations to the check run, which makes the quarantine summary visible without asking developers to hunt through logs.

That summary layer is not decoration. It is how you stop quarantine from becoming invisible. A good summary answers four questions in under ten seconds: Which failures were downgraded? Which rules expired? Which tests are failing for the first time? Who owns the noisy ones? If a developer still needs to open four artifacts and scroll through raw JSON, the system is technically correct and operationally useless.
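
A sketch of what that summarizer might do, assuming each shard also uploaded a small `quarantine-verdict.json` file (an assumption, not something the reporter above writes yet) and using `GITHUB_STEP_SUMMARY`, the standard mechanism for rendering markdown on the run page. The workflow above calls an `.mjs` equivalent of this.

// scripts/build-quarantine-summary.ts - aggregation sketch behind the summarize job
import { appendFileSync, readdirSync, readFileSync, writeFileSync } from 'node:fs';
import path from 'node:path';

// Assumed per-shard payload: { actionable: string[], quarantined: string[], expired: string[] }.
// ARTIFACT_ROOT is a hypothetical override; download-artifact@v4 unpacks into the working directory by default.
const artifactRoot = process.env.ARTIFACT_ROOT ?? '.';
const verdicts = readdirSync(artifactRoot, { recursive: true })
  .map(String)
  .filter((p) => p.endsWith('quarantine-verdict.json'))
  .map((p) => JSON.parse(readFileSync(path.join(artifactRoot, p), 'utf8')));

const actionable = verdicts.flatMap((v) => v.actionable ?? []);
const quarantined = verdicts.flatMap((v) => v.quarantined ?? []);
const expired = verdicts.flatMap((v) => v.expired ?? []);

const summary = [
  '## Quarantine summary',
  `- New actionable failures: ${actionable.length}`,
  `- Downgraded (quarantined) failures: ${quarantined.length}`,
  `- Expired rules needing re-triage: ${expired.length}`,
  ...actionable.map((t) => `  - FAIL ${t}`),
].join('\n');

// GITHUB_STEP_SUMMARY renders markdown directly on the workflow run page.
if (process.env.GITHUB_STEP_SUMMARY) {
  appendFileSync(process.env.GITHUB_STEP_SUMMARY, summary + '\n');
}

// The last workflow step checks for this sentinel with `test -f .quarantine-green`.
if (actionable.length === 0) {
  writeFileSync('.quarantine-green', 'ok\n');
} else {
  process.exitCode = 1;
}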

A useful verification trick

Firefox's test verification system reruns modified tests aggressively: ten runs, then five more in fresh browser instances, then repeats under chaos mode. You do not need to copy that exactly, but you should steal the principle: quarantine decisions should be based on repeated evidence, not one irritated rerun from a developer laptop.
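
You can encode that principle as a gate in front of rule creation. A sketch against the `quarantine_events` table from Code Example 1, with illustrative thresholds: only propose quarantine once a fingerprint has been noisy repeatedly across enough distinct runs.

// propose-quarantine.ts - require repeated evidence before proposing a rule (sketch)
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Illustrative thresholds: at least 3 noisy outcomes across 10+ distinct runs in the last 7 days.
const MIN_NOISY_OUTCOMES = 3;
const MIN_DISTINCT_RUNS = 10;

export async function shouldProposeQuarantine(fingerprint: string): Promise<boolean> {
  const { rows } = await pool.query(
    `SELECT
       count(*) FILTER (WHERE outcome IN ('failed', 'flaky'))::int AS noisy,
       count(DISTINCT run_id)::int AS runs
     FROM quarantine_events
     WHERE fingerprint = $1
       AND observed_at > now() - interval '7 days'`,
    [fingerprint],
  );

  const { noisy, runs } = rows[0];
  // One irritated rerun from a laptop is not evidence; a week of repeated noise is.
  return runs >= MIN_DISTINCT_RUNS && noisy >= MIN_NOISY_OUTCOMES;
}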

How to roll this out without training the team to game it

The first rollout mistake is turning the service on for every branch, every suite, and every rule path at once. Start smaller. Pick one flaky integration suite, require owner and expiry on each rule, and ship a read-only summary before you let the service downgrade anything. That gives the team time to inspect classifications before policy starts changing merge outcomes.

The second mistake is measuring the wrong thing. Your goal is not "fewer red builds" in isolation. A terrible quarantine policy can make the dashboard greener while hiding product regressions. Measure these instead (a query sketch for the schema-backed ones follows the list):

  • First-attempt green rate: are developers getting a trustworthy result before manual reruns?
  • Active quarantine count: if it rises forever, your service is collecting debt rather than reducing it.
  • Mean quarantine age: old rules reveal a broken ownership loop even when CI looks healthier.
  • Expired-rule failure rate: this tells you whether rules are healing or simply timing out and returning as real blockers.
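
The last three map directly onto the schema from Code Example 1; first-attempt green rate has to come from your CI run data. A query sketch:

// quarantine-metrics.ts - schema-backed metrics from Code Example 1 (sketch)
export const activeQuarantineCount = `
  SELECT count(*) AS active_rules
  FROM quarantine_rules
  WHERE active = TRUE;
`;

export const meanQuarantineAgeDays = `
  SELECT avg(extract(epoch FROM (now() - created_at)) / 86400.0) AS mean_age_days
  FROM quarantine_rules
  WHERE active = TRUE;
`;

// Of rules resolved by expiry in the last 30 days, how many fingerprints kept failing afterwards?
export const expiredRuleFailureRate = `
  SELECT
    count(DISTINCT e.fingerprint)::float
      / NULLIF(count(DISTINCT r.fingerprint), 0) AS still_failing_ratio
  FROM quarantine_rules r
  LEFT JOIN quarantine_events e
    ON e.fingerprint = r.fingerprint
   AND e.outcome IN ('failed', 'flaky')
   AND e.observed_at > r.resolved_at
  WHERE r.active = FALSE
    AND r.resolved_at > now() - interval '30 days'
    AND r.resolved_at >= r.expires_at;
`;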

In our experience, the cleanest rollout policy is this: on feature branches, show quarantine warnings and let the PR proceed if the only failures are already quarantined; on `main`, continue running quarantined tests but alert loudly on expired rules and on any failure whose signature changed. That keeps velocity high while protecting trunk from becoming a museum of stale assumptions.

A rule worth enforcing

Nobody should be able to create a quarantine without an owner, a reason, and an expiry date. If your UI or API allows anonymous forever-quarantines, the system is not unfinished. It is already broken.

Troubleshooting and debugging the quarantine layer

Once you add a service, the flake can move. That is normal. The goal is to move it into a place you can reason about.

  • Symptom: a known flaky test still fails the build. Usually the fingerprint changed. Compare the stored fingerprint with the reporter payload. Renames, parametrized titles, and different project names are the common culprits.
  • Symptom: a brand-new regression gets downgraded. Your fingerprint is too coarse. Add project, line number, or suite path so unrelated tests cannot collide.
  • Symptom: duplicate active quarantine rules. Missing partial unique index or no transactional lock around rule creation. Fix the data model before patching the UI.
  • Symptom: the service intermittently deadlocks. Check whether you accidentally use session-level advisory locks or mixed lock ordering across endpoints.
  • Symptom: CI turns green but nobody fixes the test. That is a governance bug. Add TTLs, owner fields, and an "expired quarantine fails closed" rule.

In our experience, the best debugging artifact is a single JSON payload per failing test containing `fingerprint`, `branch`, `commit`, `runner`, `errorSignature`, and the raw test title path. With that one object you can replay the classification logic locally without waiting for another flaky run.

If you want one more practical debugging move, log the service verdict next to the raw runner status in the same line-oriented artifact. For example: failed -> quarantined, failed -> actionable, or failed -> expired. That tiny breadcrumb saves a surprising amount of time because you can tell immediately whether you are debugging the application, the test, or the policy service itself.
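
As a concrete illustration, with made-up values, one entry in that artifact might look like this:

// debug-line.ts - one line-oriented artifact entry per failing test (values are illustrative)
const debugLine = {
  fingerprint: '9f2c41d0a4b36e51c8d7aa1209f3be77',
  branch: 'feature/checkout-retries',
  commit: '4e8d21c',
  runner: 'playwright',
  errorSignature: 'TimeoutError: locator.click: Timeout 30000ms exceeded',
  titlePath: ['checkout.spec.ts', 'checkout', 'applies discount code'],
  rawStatus: 'failed',
  verdict: 'quarantined', // failed -> quarantined | actionable | expired
};

// One JSON object per line keeps the artifact grep-able and replayable locally.
console.log(JSON.stringify(debugLine));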

Edge cases and gotchas that usually bite on week two

  • Monorepos: include package or workspace identity in the fingerprint, or identical filenames across apps will collide.
  • Branch-specific noise: a rule inferred on `main` should usually not auto-apply to a long-lived migration branch unless you opt into it.
  • Real bugs that look flaky: if a test fails with two signatures, do not assume infrastructure noise. Non-deterministic app code is still app code.
  • Silent expiry: never let TTL cleanup just delete a rule. Mark it resolved and keep the audit trail, or you will lose context during re-triage.
  • Service outages: decide early whether protected branches fail closed and feature branches fail open. Ambiguity here creates policy drift fast.

The professional upgrade is not the service, it is the policy

The service is just the mechanism. The upgrade is the discipline around it: quarantined tests still run, every rule has an owner, every rule expires, and fresh failures always win over stale assumptions. That is how you stop a flaky integration suite from training your team to ignore red builds.

If you are moving from beginner tooling to professional infrastructure, this is one of the clearest leveling-up moments you can make. You stop treating CI as a slot machine and start treating it as a decision system. Once that happens, green means something again.

That is the real payoff. A quarantine service does not magically fix flaky tests. It fixes the operating model around flaky tests, which is why teams can finally address the root causes in a controlled order instead of firefighting whichever ghost happened to appear in the latest pipeline run.

References

  1. Flaky Tests at Google and How We Mitigate Them (Google Testing Blog)
  2. Improving developer productivity via flaky test management (Engineering at Microsoft)
  3. Flake Aware Culprit Finding (Google Research)
  4. Test Verification (Firefox Source Docs)
  5. Retries (Playwright Docs)
  6. Workflow syntax for GitHub Actions (GitHub Docs)
  7. Explicit Locking (PostgreSQL Docs)

Ready to level up your dev toolkit?

Desplega.ai helps developers transition to professional tools smoothly, from fragile CI habits to reliable test infrastructure and observability.

Get Started

Frequently Asked Questions

Should quarantined tests still run in CI?

Yes. Run them and downgrade only their known failures. If you skip them entirely, you lose evidence, cannot measure recovery, and never learn whether the underlying flake disappeared.

How long should a quarantine rule live?

Give every rule a TTL and an owner. Seven to fourteen days is a practical starting range because it forces triage quickly without letting a noisy incident block every merge that week.

Can a quarantine service work without Playwright?

Absolutely. The service should be runner-agnostic. Feed it stable test fingerprints from JUnit, pytest, Jest, Cypress, or custom harnesses, then apply the same policy at CI decision time.

What is the biggest mistake teams make with quarantine?

Turning quarantine into a graveyard. Silent skips, no owner, and no expiry date convert a short-term safety valve into permanent test debt that slowly erodes trust in the suite again.