April 23, 2026

Predicting Flaky Tests: Using Machine Learning to Proactively Identify Test Instability

Stop chasing intermittent failures after the fact — train a model on your test history to flag unstable tests before they break your CI pipeline.

[Image: Machine learning dashboard predicting flaky tests in a CI/CD pipeline]

Your CI pipeline just failed for the third time today. Same commit, same config, same tests — different outcome. The engineer on call sighs, clicks “Retry”, and merges when the dice land green. That's a flaky test, and according to Google's engineering blog, roughly 16% of test failures in their internal infrastructure fall into this category — not real bugs, just non-deterministic noise.

The traditional response is reactive: wait for tests to flake, quarantine them, patch them individually. But what if you could predict which tests will go flaky before they disrupt your pipeline? That's what this Foundation post walks through — a concrete ML workflow that scores your existing test suite for flakiness risk, using data your CI already collects.

What Is a Flaky Test and Why Does It Matter?

Quick Answer: A flaky test passes and fails intermittently on identical code. It poisons trust in CI, costs engineer time on false alarms, and silently hides real regressions beneath the noise.

The JetBrains 2023 Developer Ecosystem Survey reported that 59% of developers encounter flaky tests at least weekly. The cost is not just wasted retries. When a suite flakes 2–3% of runs, engineers learn to treat red builds as “probably fine, rerun it.” That learned helplessness is exactly how genuine regressions slip into production.
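To make the failure mode concrete, here is a minimal self-contained sketch (standard library only, all names invented for illustration) of the classic timing flake: a fixed timeout racing variable latency. On a fast runner the assertion passes; on a cold or loaded runner the identical code fails.

```python
import threading
import time


def fetch_result(delay_s: float, timeout_s: float = 0.2):
    """Simulate a test awaiting a background operation with a fixed timeout."""
    result: dict[str, int] = {}

    def worker() -> None:
        time.sleep(delay_s)  # stand-in for network or disk latency
        result["value"] = 42

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout=timeout_s)  # the flakiness surface: fixed deadline, variable latency
    return result.get("value")  # None whenever the worker misses the deadline


# On a fast runner the "test" passes...
assert fetch_result(delay_s=0.02) == 42
# ...on a slow runner the same code, same assertion, fails
assert fetch_result(delay_s=1.0) is None
```

Nothing about the code changed between the two calls; only the environment did. That is exactly the signal the features below are designed to capture.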

QA teams from Barcelona to Madrid to Valencia to Malaga have all hit the same wall: the longer you tolerate flakiness, the more expensive it becomes to root out. Post-hoc quarantine is the wrong unit of economics. Prediction is the right one — and modern ML tooling makes it cheap.

What Features Are Most Predictive of Test Flakiness?

Quick Answer: The top five signals are pass-rate variance, execution-time standard deviation, async operation density, network call frequency, and how often the test file itself changes. Together they explain most flakiness variance in well-labelled datasets.

You do not need exotic features. Plain CI telemetry, once you put a proper schema on it, carries a surprising amount of signal. The table below maps the features we extract to the root causes they most commonly correlate with — a grounding artefact for any team starting from scratch.

| Feature | Typical Root Cause | Why It Predicts Flakiness |
| --- | --- | --- |
| Pass-rate variance (30-run window) | Inconsistent outcomes on identical code | Direct measurement of instability |
| Execution time stddev | Race conditions, thread contention | Highly variable duration = timing bug |
| Async operation count | Missing awaits, unbounded timeouts | Every await is a flakiness surface |
| Network call count | Third-party latency, DNS jitter | External dependencies are non-deterministic |
| Test file change rate (last 90d) | Churn reveals unclear requirements | High churn correlates with future instability |
| Shared fixture count | Cross-test state bleed | Coupling → order dependence → flakes |
| Test execution parallelism | Worker contention, port collisions | Parallel runs expose hidden shared state |

Traditional Quarantine vs. ML-Based Prediction

Most teams are stuck in what we call reactive quarantine: wait for a test to flake three times, mark it skip, file a ticket. This pattern has served us for a decade, but it scales poorly past a few hundred specs. ML-based prediction shifts the cost curve.

| Dimension | Reactive Quarantine | ML-Based Prediction |
| --- | --- | --- |
| Detection lag | 3–10 failures before action | Pre-run score on every commit |
| Developer friction | Red build → retry ritual | Risky tests isolated before they run |
| Coverage | Only tests that have already flaked | Every test scored, even new ones |
| Investment | Engineer hours per quarantined test | One-time pipeline + monthly retraining |
| Scales to | ~500 tests before drowning | 10,000+ tests with the same infra |

Step 1: Collect Historical Test Run Data from CI

Every CI vendor emits structured run metadata — GitHub Actions, GitLab CI, CircleCI, Jenkins. The first job is to normalize it into a single row-per-test-per-run shape. The Python collector below reads GitHub Actions check runs via the REST API and appends to a Parquet file for later feature engineering.

# scripts/collect_runs.py
import os
from datetime import datetime, timedelta

import pandas as pd
import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
REPO = "desplega-ai/landing"  # owner/repo
WINDOW_DAYS = 90


def fetch_workflow_runs(repo: str, since: datetime) -> list[dict]:
    headers = {"Authorization": f"Bearer {GITHUB_TOKEN}"}
    url = f"https://api.github.com/repos/{repo}/actions/runs"
    params = {"per_page": 100, "created": f">{since.isoformat()}"}
    resp = requests.get(url, headers=headers, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["workflow_runs"]


def fetch_jobs(repo: str, run_id: int) -> list[dict]:
    headers = {"Authorization": f"Bearer {GITHUB_TOKEN}"}
    url = f"https://api.github.com/repos/{repo}/actions/runs/{run_id}/jobs"
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json()["jobs"]


def to_rows(run: dict, jobs: list[dict]) -> list[dict]:
    rows = []
    for job in jobs:
        for step in job.get("steps", []):
            if not step["name"].startswith("Test:"):
                continue
            rows.append({
                "run_id": run["id"],
                "commit": run["head_sha"],
                "test_name": step["name"],
                "status": step["conclusion"],
                "started_at": step["started_at"],
                "completed_at": step["completed_at"],
                "runner": job["runner_name"],
            })
    return rows


if __name__ == "__main__":
    since = datetime.utcnow() - timedelta(days=WINDOW_DAYS)
    rows: list[dict] = []
    for run in fetch_workflow_runs(REPO, since):
        rows.extend(to_rows(run, fetch_jobs(REPO, run["id"])))
    df = pd.DataFrame(rows)
    df["started_at"] = pd.to_datetime(df["started_at"])
    df["completed_at"] = pd.to_datetime(df["completed_at"])
    df["duration_s"] = (df["completed_at"] - df["started_at"]).dt.total_seconds()
    df.to_parquet("data/runs.parquet", index=False)
    print(f"Wrote {len(df):,} rows to data/runs.parquet")

Run this as a nightly GitHub Action — the API calls themselves need only read access to Actions; a write-scoped PAT is required only if you commit the Parquet to a data branch rather than shipping it to S3. The schema is deliberately narrow — name, run, status, duration — because every downstream feature is derived.
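One caveat: fetch_workflow_runs above reads only the first page of results, and GitHub paginates list endpoints at 100 items. A busy repo needs to follow the Link header. A hedged sketch of that loop, using the same auth pattern as the collector (requests exposes the parsed Link header via resp.links):

```python
import os

import requests

GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN", "")


def fetch_all_runs(url: str, params: dict) -> list[dict]:
    """Accumulate workflow runs across every page of a GitHub list endpoint."""
    headers = {"Authorization": f"Bearer {GITHUB_TOKEN}"}
    items: list[dict] = []
    while url:
        resp = requests.get(url, headers=headers, params=params, timeout=30)
        resp.raise_for_status()
        items.extend(resp.json()["workflow_runs"])
        # requests parses the Link header for us; a missing "next" ends the loop
        url = resp.links.get("next", {}).get("url")
        params = {}  # the next-page URL already embeds the query string
    return items
```

Swap this in for the single-page fetch once your window covers more than 100 runs.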

Step 2: Engineer Features from Raw Run Data

Feature engineering is where most ML projects win or lose. For flakiness, the rule is simple: aggregate per test over a rolling window. A 30-run window hits the sweet spot between recency (recent code matters) and sample size (variance estimates need data).

# scripts/build_features.py
from pathlib import Path

import pandas as pd


def build_features(runs: pd.DataFrame, ast_stats: pd.DataFrame) -> pd.DataFrame:
    runs = runs.sort_values(["test_name", "started_at"])
    grouped = runs.groupby("test_name")

    features = grouped.agg(
        runs_seen=("run_id", "nunique"),
        pass_rate=("status", lambda s: (s == "success").mean()),
        pass_rate_var=("status", lambda s: (s == "success").astype(int).rolling(30, min_periods=5).var().mean()),  # min_periods keeps short histories from going all-NaN
        duration_mean=("duration_s", "mean"),
        duration_stddev=("duration_s", "std"),
        duration_p95=("duration_s", lambda s: s.quantile(0.95)),
        distinct_runners=("runner", "nunique"),
    ).reset_index()

    # Attach static code signals from AST parsing pass (separate collector)
    features = features.merge(ast_stats, on="test_name", how="left")

    # Label: flaky if pass-rate between 10% and 90% with >=5 runs seen
    features["is_flaky"] = (
        (features["pass_rate"] > 0.1)
        & (features["pass_rate"] < 0.9)
        & (features["runs_seen"] >= 5)
    ).astype(int)
    return features


if __name__ == "__main__":
    runs = pd.read_parquet("data/runs.parquet")
    ast_stats = pd.read_parquet("data/ast_stats.parquet")
    features = build_features(runs, ast_stats)
    Path("data").mkdir(exist_ok=True)
    features.to_parquet("data/features.parquet", index=False)
    print(f"Labeled {int(features['is_flaky'].sum()):,} flaky tests out of {len(features):,}")

The ast_stats dataframe captures static signals per test file — async call count, network mock usage, fixture fan-in — produced by a separate AST traversal over the spec files. Combining behavioural and structural features is what lifts precision above the naive pass-rate baseline.

Step 3: Train a scikit-learn Classifier

For flakiness detection, start with gradient boosting. It handles mixed numeric and categorical features, gracefully tolerates missing data, and produces calibrated probabilities you can threshold. A LightGBM or XGBoost classifier trained on as few as a few thousand rows routinely beats handcrafted heuristics.

# scripts/train_model.py
import joblib
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import classification_report, precision_recall_curve
from sklearn.model_selection import train_test_split

FEATURES = [
    "pass_rate_var",
    "duration_stddev",
    "duration_p95",
    "distinct_runners",
    "async_calls",
    "network_calls",
    "file_change_count",
    "shared_fixture_count",
]


def train(features_df: pd.DataFrame) -> lgb.LGBMClassifier:
    X = features_df[FEATURES].fillna(0)
    y = features_df["is_flaky"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    model = lgb.LGBMClassifier(
        n_estimators=300,
        learning_rate=0.05,
        num_leaves=31,
        class_weight="balanced",  # flaky tests are the minority
        random_state=42,
    )
    model.fit(X_train, y_train)

    # Pick a threshold that maximises precision at >= 80%
    probs = model.predict_proba(X_test)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_test, probs)
    good = [(p, r, t) for p, r, t in zip(precision, recall, thresholds) if p >= 0.80]
    chosen = max(good, key=lambda x: x[1]) if good else (precision[-1], recall[-1], 0.5)
    print(f"Threshold={chosen[2]:.3f}  precision={chosen[0]:.3f}  recall={chosen[1]:.3f}")
    print(classification_report(y_test, probs >= chosen[2]))
    return model


if __name__ == "__main__":
    from pathlib import Path

    features_df = pd.read_parquet("data/features.parquet")
    model = train(features_df)
    Path("artifacts").mkdir(exist_ok=True)  # joblib.dump does not create directories
    joblib.dump(model, "artifacts/flakiness_model.joblib")

The key engineering choice is the threshold. A reckless 0.5 threshold floods your team with false positives. Anchor to precision ≥ 0.80, then accept whatever recall that implies. It's better to miss some flaky tests than to cry wolf on healthy ones — trust in the system is the real currency.
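The reporter in the next step consumes artifacts/flakiness_scores.json, so the pipeline needs one more glue script to produce it. A hypothetical sketch (the script name and function are assumptions; in practice model comes from joblib.load on the Step 3 artifact and features_df from the Step 2 Parquet):

```python
# scripts/export_scores.py — hypothetical glue producing the reporter's input
import json
from pathlib import Path


def export_scores(model, features_df, feature_cols, threshold, out_path):
    """Score every test and write the JSON shape the reporter expects."""
    probs = model.predict_proba(features_df[feature_cols].fillna(0))[:, 1]
    payload = [
        {"testName": name, "probability": float(p), "threshold": threshold}
        for name, p in zip(features_df["test_name"], probs)
    ]
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    Path(out_path).write_text(json.dumps(payload, indent=2))
    return payload
```

Keeping the JSON keys aligned with the reporter's FlakinessScore interface is the whole contract between the Python and TypeScript halves.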

Step 4: Wire the Model Into Your Playwright Reporter

Inference needs to happen at runtime, not out-of-band. The cleanest integration point for Playwright is a custom reporter: it receives every test result and can annotate the CI output with a flakiness score in real time. The same shape works for Cypress and Selenium — only the reporter protocol differs.

// reporters/flakiness-reporter.ts
import * as fs from 'node:fs';
import type {
  FullConfig,
  FullResult,
  Reporter,
  Suite,
  TestCase,
  TestResult,
} from '@playwright/test/reporter';

interface FlakinessScore {
  testName: string;
  probability: number;
  threshold: number;
}

export default class FlakinessReporter implements Reporter {
  private readonly threshold: number;
  private readonly scores: Map<string, number> = new Map();

  constructor(opts: { scoresFile?: string; threshold?: number } = {}) {
    this.threshold = opts.threshold ?? 0.8;
    // Load prebuilt scores JSON produced by the Python pipeline
    const path = opts.scoresFile ?? 'artifacts/flakiness_scores.json';
    const raw = JSON.parse(fs.readFileSync(path, 'utf-8')) as FlakinessScore[];
    raw.forEach((s) => this.scores.set(s.testName, s.probability));
  }

  onBegin(_config: FullConfig, suite: Suite): void {
    const atRisk = Array.from(this.scores.entries()).filter(
      ([, p]) => p >= this.threshold,
    );
    console.log(
      `[flakiness] ${suite.allTests().length} tests · ${atRisk.length} flagged above ${this.threshold}`,
    );
  }

  onTestEnd(test: TestCase, result: TestResult): void {
    const prob = this.scores.get(test.title);
    if (prob === undefined) return;

    if (result.status === 'failed' && prob >= this.threshold) {
      console.warn(`[flakiness] ${test.title} failed; predicted risk ${prob.toFixed(2)} — consider retry`);
      // Mark as 'warning' so CI doesn't fail the build on predicted flake
      (result as TestResult & { _flakinessWarning: boolean })._flakinessWarning = true;
    }
  }

  onEnd(_result: FullResult): Promise<void> | void {
    return;
  }
}

Register the reporter in your playwright.config.ts alongside your default reporter. The reporter does not replace your retry policy — it augments it. Retries remain your last line of defence; the predictor is the first filter.

Troubleshooting: When the Model Gets It Wrong

Scenario 1 — Model flags too many false positives

Symptom: engineers complain the reporter warns on tests that never flake. Fix: raise the threshold, re-evaluate precision on the last 30 days of data, and audit for label leakage (tests labelled flaky purely because of infrastructure outages).

Scenario 2 — Model misses new flaky tests

Symptom: a new test goes flaky but the model gives it a low score. Fix: add a “newness” feature (days since first seen) and ensure your retraining cadence is at least monthly — ideally triggered whenever a flake is quarantined.

Scenario 3 — Features look stale

Symptom: pass-rate variance is zero for most tests. Fix: confirm your rolling window is computed per test, not per run. A common bug is computing variance across the entire dataset instead of per-group.
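To see the bug concretely, here is a minimal sketch contrasting the two computations — the tiny inline dataset is fabricated for illustration, and column names follow the Step 1 schema:

```python
import pandas as pd

runs = pd.DataFrame({
    "test_name": ["login"] * 6 + ["search"] * 6,
    "passed":    [1, 0, 1, 1, 0, 1] + [1, 1, 1, 1, 1, 1],
})

# Wrong: one variance over the whole dataset mixes stable and unstable tests
global_var = runs["passed"].var()

# Right: rolling variance computed within each test's own history
per_test = runs.groupby("test_name")["passed"].apply(
    lambda s: s.rolling(3, min_periods=2).var().mean()
)

assert float(per_test["search"]) == 0.0  # a fully stable test shows zero variance
assert float(per_test["login"]) > 0.0    # the flaky one does not
```

If every test's variance comes out identical (or zero), you are almost certainly in the "wrong" branch above.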

Scenario 4 — Production drift degrades scores

Symptom: model was 85% precise last month, now 65%. Fix: monitor population-shift via simple Jensen-Shannon divergence on feature distributions and alert when drift crosses a fixed threshold. Retrain immediately when alerted.
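A minimal sketch of that drift monitor using plain numpy (the 20-bin histogram and the sample sizes are illustrative choices; pick the alert threshold from your own baseline):

```python
import numpy as np


def js_distance(p_counts: np.ndarray, q_counts: np.ndarray) -> float:
    """Jensen-Shannon distance (base 2, bounded in [0, 1]) between two histograms."""
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        mask = a > 0  # 0 * log(0) contributes nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return float(np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m)))


def drift_score(train_col: np.ndarray, live_col: np.ndarray, bins: int = 20) -> float:
    """Histogram both samples on shared edges, then compare the distributions."""
    edges = np.histogram_bin_edges(np.concatenate([train_col, live_col]), bins=bins)
    p, _ = np.histogram(train_col, bins=edges)
    q, _ = np.histogram(live_col, bins=edges)
    return js_distance(p, q)


rng = np.random.default_rng(42)
stable = drift_score(rng.normal(1, 0.2, 5000), rng.normal(1, 0.2, 5000))
drifted = drift_score(rng.normal(1, 0.2, 5000), rng.normal(2, 0.5, 5000))
assert stable < 0.1 < drifted  # alert when the score crosses your chosen threshold
```

Run this per feature column against the training-time snapshot; any column crossing the threshold is your retraining trigger.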

Edge Cases and Gotchas

  • Parallel execution bias: If you run tests in parallel with workers=4, duration stddev balloons compared to sequential runs. Either stratify training by worker count, or collect separate scores for sequential and parallel pipelines.
  • Cold-start runners: GitHub Actions runners occasionally spin up on an image with an empty Docker cache. The first test on a cold runner is 3–5x slower and looks flaky. Add a runner_cold_start boolean feature or exclude runs that hit it.
  • Platform-specific flakes: A test is 100% stable on Linux but flakes on Windows due to path separators. Train per-platform models if you care about platform parity, or include runner_os as a feature.
  • Infrastructure flakes masquerading as test flakes: Half of “flaky” failures on many teams are really CI instability — runner disk pressure, DNS hiccups, container pulls timing out. Separate these with a failure reason taxonomy before labelling.
  • Label drift: Your labels (what counts as flaky) change as the team's tolerance evolves. Version your labelling function in git and retrain whenever it changes — otherwise you are comparing apples to oranges month over month.
  • Cypress retries hide flakiness: test retries are off by default, but once you enable them (retries: 2), failed attempts are retried silently and only the final outcome is reported. Surface per-attempt results (for example via the Module API or a JSON reporter), otherwise your pass-rate variance will look artificially clean.
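For the cold-start case above, one cheap way to derive a runner_cold_start flag from the Step 1 data alone. The 3x-median multiplier is an assumption to tune for your infra, and the heuristic only looks at the first test of each run:

```python
import pandas as pd


def flag_cold_starts(runs: pd.DataFrame) -> pd.DataFrame:
    """Mark the first test of each CI run whose duration is an outlier for that test."""
    runs = runs.sort_values(["run_id", "started_at"]).copy()
    # Per-test median duration, broadcast back onto every row
    medians = runs.groupby("test_name")["duration_s"].transform("median")
    # First row of each run_id after sorting = the first test on that runner
    is_first = ~runs["run_id"].duplicated()
    runs["runner_cold_start"] = is_first & (runs["duration_s"] > 3 * medians)
    return runs
```

Feed the flag into the feature builder (or use it to exclude those rows from labelling) so cold-cache slowness stops masquerading as flakiness.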

Rolling It Out: A Pragmatic Adoption Plan

You do not need a data platform to ship this. A scrappy pipeline can prove value in two weeks:

  1. Week 1 — Observe. Ship the collector. Point it at the last 30 days of CI runs. Compute naive pass-rate per test. This baseline alone typically surfaces the worst 5% offenders.
  2. Week 2 — Engineer features and train. Run the feature builder and the trainer. Pick a high-precision threshold. Commit the model artifact behind a feature flag.
  3. Week 3 — Wire the reporter. Start in “warn only” mode — no CI behaviour changes, just visible scores. Watch how often the predictions match engineer intuition.
  4. Week 4 — Enable soft quarantine. When a test fails and its score is above threshold, mark it non-blocking instead of failing the build. Track how many real regressions leak in (ideally zero at precision ≥ 0.8).
  5. Month 2 — Automate retraining. A cron job that runs the full pipeline nightly and commits the updated scores JSON is enough for most teams. Full MLOps with model registry is nice-to-have, not a prerequisite.

Conclusion: Move Flaky Tests from Chore to Signal

Flaky tests are not an immutable cost of doing business. They are a measurable, predictable property of your suite — and ML gives you a cheap way to price that property. The pipeline in this post is intentionally boring: pull CI data, engineer a handful of features, train a gradient boosted model, thread it into your reporter.

The payoff is cultural as much as technical. Engineers stop treating red builds as a coin flip. QA leads stop losing hours to manual quarantine triage. Management gets a real lever to pull on CI reliability. That's the Foundation series thesis: boring, well-built tooling wins more fights than exotic heroics.

Want help putting this into your stack? The Desplega.ai team has shipped flaky-test prediction for QA groups across Barcelona, Madrid, Valencia, and Malaga — we can take you from zero to production in a sprint.


Frequently Asked Questions

What is a flaky test and why does it matter?

A flaky test randomly passes or fails on identical code. It erodes CI trust, wastes engineer time on false alarms, and masks real regressions hiding behind noise.

What features are most predictive of test flakiness?

Historical pass rate variance, execution time standard deviation, async operation count, network call frequency, and test file change rate are the strongest flakiness signals.

How much historical test data do I need to train a flakiness model?

Aim for at least 30 CI runs per test for minimal signal. Models trained on 100+ runs per test achieve meaningfully higher precision — especially for rare edge-case patterns.

Can I use this approach with Cypress or Selenium instead of Playwright?

Yes — the ML pipeline is framework-agnostic. Collect structured run results from any framework into CSV, then train the same scikit-learn model on the extracted features.

How do I avoid quarantining stable tests with false positives?

Set a high precision threshold (0.80+), require 3 consecutive positive predictions before quarantine, and build a manual override review step for all flagged tests.