Predicting Flaky Tests: Using Machine Learning to Proactively Identify Test Instability
Stop chasing intermittent failures after the fact — train a model on your test history to flag unstable tests before they break your CI pipeline.

Your CI pipeline just failed for the third time today. Same commit, same config, same tests — different outcome. The engineer on call sighs, clicks “Retry”, and merges when the dice land green. That's a flaky test, and according to Google's engineering blog, almost 16% of their tests show some level of flakiness — not real bugs, just non-deterministic noise.
The traditional response is reactive: wait for tests to flake, quarantine them, patch them individually. But what if you could predict which tests will go flaky before they disrupt your pipeline? That's what this Foundation post walks through — a concrete ML workflow that scores your existing test suite for flakiness risk, using data your CI already collects.
What Is a Flaky Test and Why Does It Matter?
Quick Answer: A flaky test passes and fails intermittently on identical code. It poisons trust in CI, costs engineer time on false alarms, and silently hides real regressions beneath the noise.
The JetBrains 2023 Developer Ecosystem Survey reported that 59% of developers encounter flaky tests at least weekly. The cost is not just wasted retries. When a suite flakes on 2–3% of runs, engineers learn to treat red builds as “probably fine, rerun it.” That learned helplessness is exactly how genuine regressions slip into production.
QA teams from Barcelona to Madrid to Valencia to Malaga have all hit the same wall: the longer you tolerate flakiness, the more expensive it becomes to root out. Post-hoc quarantine is the wrong unit of economics. Prediction is the right one — and modern ML tooling makes it cheap.
What Features Are Most Predictive of Test Flakiness?
Quick Answer: The top five signals are pass-rate variance, execution-time standard deviation, async operation density, network call frequency, and how often the test file itself changes. Together they explain most flakiness variance in well-labelled datasets.
You do not need exotic features. Plain CI telemetry, once you put a proper schema on it, carries a surprising amount of signal. The table below maps the features we extract to the root causes they most commonly correlate with — a grounding artefact for any team starting from scratch.
| Feature | Typical Root Cause | Why It Predicts Flakiness |
|---|---|---|
| Pass-rate variance (30-run window) | Inconsistent outcomes on identical code | Direct measurement of instability |
| Execution time stddev | Race conditions, thread contention | Highly variable duration = timing bug |
| Async operation count | Missing awaits, unbounded timeouts | Every await is a flakiness surface |
| Network call count | Third-party latency, DNS jitter | External dependencies are non-deterministic |
| Test file change rate (last 90d) | Churn reveals unclear requirements | High churn correlates with future churn |
| Shared fixture count | Cross-test state bleed | Coupling → order dependence → flakes |
| Test execution parallelism | Worker contention, port collisions | Parallel runs expose hidden shared state |
Traditional Quarantine vs. ML-Based Prediction
Most teams are stuck in what we call reactive quarantine: wait for a test to flake three times, mark it skip, file a ticket. This pattern has served us for a decade, but it scales poorly past a few hundred specs. ML-based prediction shifts the cost curve.
| Dimension | Reactive Quarantine | ML-Based Prediction |
|---|---|---|
| Detection lag | 3–10 failures before action | Pre-run score on every commit |
| Developer friction | Red build → retry ritual | Risky tests isolated before they run |
| Coverage | Only tests that have already flaked | Every test scored, even new ones |
| Investment | Engineer hours per quarantined test | One-time pipeline + monthly retraining |
| Scales to | ~500 tests before drowning | 10,000+ tests with the same infra |
Step 1: Collect Historical Test Run Data from CI
Every CI vendor emits structured run metadata — GitHub Actions, GitLab CI, CircleCI, Jenkins. The first job is to normalize it into a single row-per-test-per-run shape. The Python collector below reads GitHub Actions check runs via the REST API and appends to a Parquet file for later feature engineering.
```python
# scripts/collect_runs.py
import os
from datetime import datetime, timedelta, timezone
from pathlib import Path

import pandas as pd
import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
REPO = "desplega-ai/landing"  # owner/repo
WINDOW_DAYS = 90


def fetch_workflow_runs(repo: str, since: datetime) -> list[dict]:
    headers = {"Authorization": f"Bearer {GITHUB_TOKEN}"}
    url = f"https://api.github.com/repos/{repo}/actions/runs"
    # Only the first page (100 runs) is fetched; paginate for larger windows
    params = {"per_page": 100, "created": f">{since.date().isoformat()}"}
    resp = requests.get(url, headers=headers, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["workflow_runs"]


def fetch_jobs(repo: str, run_id: int) -> list[dict]:
    headers = {"Authorization": f"Bearer {GITHUB_TOKEN}"}
    url = f"https://api.github.com/repos/{repo}/actions/runs/{run_id}/jobs"
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json()["jobs"]


def to_rows(run: dict, jobs: list[dict]) -> list[dict]:
    rows = []
    for job in jobs:
        for step in job.get("steps", []):
            # Convention: test steps are named "Test: <spec>" in the workflow
            if not step["name"].startswith("Test:"):
                continue
            rows.append({
                "run_id": run["id"],
                "commit": run["head_sha"],
                "test_name": step["name"],
                "status": step["conclusion"],
                "started_at": step["started_at"],
                "completed_at": step["completed_at"],
                "runner": job["runner_name"],
            })
    return rows


if __name__ == "__main__":
    since = datetime.now(timezone.utc) - timedelta(days=WINDOW_DAYS)
    rows: list[dict] = []
    for run in fetch_workflow_runs(REPO, since):
        rows.extend(to_rows(run, fetch_jobs(REPO, run["id"])))
    df = pd.DataFrame(rows)
    df["started_at"] = pd.to_datetime(df["started_at"])
    df["completed_at"] = pd.to_datetime(df["completed_at"])
    df["duration_s"] = (df["completed_at"] - df["started_at"]).dt.total_seconds()
    Path("data").mkdir(exist_ok=True)
    df.to_parquet("data/runs.parquet", index=False)
    print(f"Wrote {len(df):,} rows to data/runs.parquet")
```

Run this as a nightly GitHub Action with a write-access PAT, then either commit the Parquet to a data branch or ship it to S3. The schema is deliberately narrow — name, run, status, duration — because every downstream feature is derived.
Step 2: Engineer Features from Raw Run Data
Feature engineering is where most ML projects win or lose. For flakiness, the rule is simple: aggregate per test over a rolling window. A 30-run window strikes the balance between recency (recent code matters) and sample size (variance estimates need data).
```python
# scripts/build_features.py
from pathlib import Path

import pandas as pd


def build_features(runs: pd.DataFrame, ast_stats: pd.DataFrame) -> pd.DataFrame:
    runs = runs.sort_values(["test_name", "started_at"])
    grouped = runs.groupby("test_name")
    features = grouped.agg(
        runs_seen=("run_id", "nunique"),
        pass_rate=("status", lambda s: (s == "success").mean()),
        # NaN for tests with fewer than 30 runs; filled with 0 at train time
        pass_rate_var=("status", lambda s: (s == "success").astype(int).rolling(30).var().mean()),
        duration_mean=("duration_s", "mean"),
        duration_stddev=("duration_s", "std"),
        duration_p95=("duration_s", lambda s: s.quantile(0.95)),
        distinct_runners=("runner", "nunique"),
    ).reset_index()
    # Attach static code signals from the AST parsing pass (separate collector)
    features = features.merge(ast_stats, on="test_name", how="left")
    # Label: flaky if pass rate is between 10% and 90% with >= 5 runs seen
    features["is_flaky"] = (
        (features["pass_rate"] > 0.1)
        & (features["pass_rate"] < 0.9)
        & (features["runs_seen"] >= 5)
    ).astype(int)
    return features


if __name__ == "__main__":
    runs = pd.read_parquet("data/runs.parquet")
    ast_stats = pd.read_parquet("data/ast_stats.parquet")
    features = build_features(runs, ast_stats)
    Path("data").mkdir(exist_ok=True)
    features.to_parquet("data/features.parquet", index=False)
    print(f"Labeled {int(features['is_flaky'].sum()):,} flaky tests out of {len(features):,}")
```

The ast_stats dataframe captures static signals per test file — async call count, network mock usage, fixture fan-in — produced by a separate AST traversal over the spec files. Combining behavioural and structural features is what lifts precision above the naive pass-rate baseline.
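The AST collector itself is beyond the scope of this post, but the core idea fits in a dozen lines with Python's built-in `ast` module. A minimal sketch, with an illustrative signal set and function name (a real collector would also count fixtures and mocks):

```python
# Sketch of a static-signal extractor; count_static_signals and the
# network-call name heuristic are illustrative, not part of the pipeline above.
import ast


def count_static_signals(source: str) -> dict:
    """Count per-file static signals: await expressions and calls whose
    attribute name looks like network activity. Heuristic, not exhaustive."""
    tree = ast.parse(source)
    async_calls = sum(isinstance(n, ast.Await) for n in ast.walk(tree))
    network_calls = sum(
        isinstance(n, ast.Call)
        and isinstance(n.func, ast.Attribute)
        and n.func.attr in {"get", "post", "request", "fetch"}
        for n in ast.walk(tree)
    )
    return {"async_calls": async_calls, "network_calls": network_calls}
```

Run this over every spec file, write one row per test file, and save the result as `data/ast_stats.parquet` so the merge above can pick it up.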
Step 3: Train a scikit-learn Classifier
For flakiness detection, start with gradient boosting. It handles mixed numeric and categorical features, gracefully tolerates missing data, and produces calibrated probabilities you can threshold. A LightGBM or XGBoost classifier trained on a few thousand rows routinely beats handcrafted heuristics.
```python
# scripts/train_model.py
from pathlib import Path

import joblib
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import classification_report, precision_recall_curve
from sklearn.model_selection import train_test_split

FEATURES = [
    "pass_rate_var",
    "duration_stddev",
    "duration_p95",
    "distinct_runners",
    "async_calls",
    "network_calls",
    "file_change_count",
    "shared_fixture_count",
]


def train(features_df: pd.DataFrame) -> lgb.LGBMClassifier:
    X = features_df[FEATURES].fillna(0)
    y = features_df["is_flaky"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    model = lgb.LGBMClassifier(
        n_estimators=300,
        learning_rate=0.05,
        num_leaves=31,
        class_weight="balanced",  # flaky tests are the minority class
        random_state=42,
    )
    model.fit(X_train, y_train)
    # Pick the threshold that maximises recall subject to precision >= 0.80
    probs = model.predict_proba(X_test)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_test, probs)
    good = [(p, r, t) for p, r, t in zip(precision, recall, thresholds) if p >= 0.80]
    chosen = max(good, key=lambda x: x[1]) if good else (precision[-1], recall[-1], 0.5)
    print(f"Threshold={chosen[2]:.3f} precision={chosen[0]:.3f} recall={chosen[1]:.3f}")
    print(classification_report(y_test, probs >= chosen[2]))
    return model


if __name__ == "__main__":
    features_df = pd.read_parquet("data/features.parquet")
    model = train(features_df)
    Path("artifacts").mkdir(exist_ok=True)
    joblib.dump(model, "artifacts/flakiness_model.joblib")
```

The key engineering choice is the threshold. A reckless 0.5 threshold floods your team with false positives. Anchor to precision ≥ 0.80, then accept whatever recall that implies. It's better to miss some flaky tests than to cry wolf on healthy ones — trust in the system is the real currency.
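The reporter in Step 4 reads `artifacts/flakiness_scores.json` rather than calling the model directly, so the last training step is exporting per-test probabilities in that shape. A minimal sketch (the `export_scores` helper name is illustrative):

```python
# Bridge from model output to the JSON file the Playwright reporter loads.
import json
from pathlib import Path


def export_scores(test_names, probabilities, threshold: float, path: str) -> None:
    """Write {testName, probability, threshold} records, one per test,
    matching the FlakinessScore interface in the reporter."""
    records = [
        {"testName": name, "probability": round(float(p), 4), "threshold": threshold}
        for name, p in zip(test_names, probabilities)
    ]
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    Path(path).write_text(json.dumps(records, indent=2))
```

Call it with `features_df["test_name"]` and `model.predict_proba(X)[:, 1]` after training, and commit the JSON alongside the model artifact.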
Step 4: Wire the Model Into Your Playwright Reporter
Inference needs to happen at runtime, not out-of-band. The cleanest integration point for Playwright is a custom reporter: it receives every test result and can annotate the CI output with a flakiness score in real time. The same shape works for Cypress and Selenium — only the reporter protocol differs.
```typescript
// reporters/flakiness-reporter.ts
import fs from 'node:fs';
import type {
  FullConfig,
  FullResult,
  Reporter,
  Suite,
  TestCase,
  TestResult,
} from '@playwright/test/reporter';

interface FlakinessScore {
  testName: string;
  probability: number;
  threshold: number;
}

export default class FlakinessReporter implements Reporter {
  private readonly threshold: number;
  private readonly scores: Map<string, number> = new Map();

  constructor(opts: { scoresFile?: string; threshold?: number } = {}) {
    this.threshold = opts.threshold ?? 0.8;
    // Load prebuilt scores JSON produced by the Python pipeline
    const path = opts.scoresFile ?? 'artifacts/flakiness_scores.json';
    const raw = JSON.parse(fs.readFileSync(path, 'utf-8')) as FlakinessScore[];
    raw.forEach((s) => this.scores.set(s.testName, s.probability));
  }

  onBegin(_config: FullConfig, suite: Suite): void {
    const atRisk = Array.from(this.scores.entries()).filter(
      ([, p]) => p >= this.threshold,
    );
    console.log(
      `[flakiness] ${suite.allTests().length} tests · ${atRisk.length} flagged above ${this.threshold}`,
    );
  }

  onTestEnd(test: TestCase, result: TestResult): void {
    const prob = this.scores.get(test.title);
    if (prob === undefined) return;
    if (result.status === 'failed' && prob >= this.threshold) {
      console.warn(
        `[flakiness] ${test.title} failed; predicted risk ${prob.toFixed(2)} — consider retry`,
      );
      // Tag the result so a downstream step can treat it as a warning
      // instead of failing the build on a predicted flake
      (result as TestResult & { _flakinessWarning: boolean })._flakinessWarning = true;
    }
  }

  onEnd(_result: FullResult): Promise<void> | void {
    return;
  }
}
```

Register the reporter in your playwright.config.ts alongside your default reporter. The reporter does not replace your retry policy — it augments it. Retries remain your last line of defence; the predictor is the first filter.
Troubleshooting: When the Model Gets It Wrong
Scenario 1 — Model flags too many false positives
Symptom: engineers complain the reporter warns on tests that never flake. Fix: raise the threshold, re-evaluate precision on the last 30 days of data, and audit for label leakage (tests labelled flaky purely because of infrastructure outages).
Scenario 2 — Model misses new flaky tests
Symptom: a new test goes flaky but the model gives it a low score. Fix: add a “newness” feature (days since first seen) and ensure your retraining cadence is at least monthly — ideally triggered whenever a flake is quarantined.
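The newness feature falls straight out of the runs table built in Step 1. A sketch, assuming that table is in memory (the `add_newness` helper name is illustrative):

```python
# Derive "days since first seen" per test from the Step 1 runs table.
import pandas as pd


def add_newness(features: pd.DataFrame, runs: pd.DataFrame) -> pd.DataFrame:
    """Attach days_since_first_seen so the model can learn that
    brand-new tests carry extra uncertainty."""
    first_seen = (
        runs.groupby("test_name")["started_at"].min()
        .rename("first_seen")
        .reset_index()
    )
    out = features.merge(first_seen, on="test_name", how="left")
    now = runs["started_at"].max()  # anchor to the latest observed run
    out["days_since_first_seen"] = (now - out["first_seen"]).dt.days
    return out.drop(columns="first_seen")
```

Remember to append `days_since_first_seen` to the FEATURES list before retraining, or the model never sees it.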
Scenario 3 — Features look stale
Symptom: pass-rate variance is zero for most tests. Fix: confirm your rolling window is computed per test, not per run. A common bug is computing variance across the entire dataset instead of per-group.
Scenario 4 — Production drift degrades scores
Symptom: model was 85% precise last month, now 65%. Fix: monitor population-shift via simple Jensen-Shannon divergence on feature distributions and alert when drift crosses a fixed threshold. Retrain immediately when alerted.
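The divergence check is a few lines with SciPy, assuming you keep a snapshot of the training-time feature values (the `feature_drift` helper name is illustrative):

```python
# Drift monitor sketch: compare the live distribution of one feature
# against the distribution the model was trained on.
import numpy as np
from scipy.spatial.distance import jensenshannon


def feature_drift(train_values, live_values, bins: int = 20) -> float:
    """Jensen-Shannon distance (base 2) between two samples of a feature:
    0 means identical distributions, 1 means fully disjoint."""
    values = np.concatenate([train_values, live_values])
    edges = np.linspace(values.min(), values.max(), bins + 1)
    p, _ = np.histogram(train_values, bins=edges)
    q, _ = np.histogram(live_values, bins=edges)
    # jensenshannon normalises the counts to probabilities internally
    return float(jensenshannon(p, q, base=2))
```

Run it per feature in the nightly pipeline and alert when any distance crosses your chosen threshold (0.1–0.2 is a common starting point, tuned to your suite).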
Edge Cases and Gotchas
- Parallel execution bias: If you run tests in parallel with `workers=4`, duration stddev balloons compared to sequential runs. Either stratify training by worker count, or collect separate scores for sequential and parallel pipelines.
- Cold-start runners: GitHub Actions runners occasionally spin up on an image with an empty Docker cache. The first test on a cold runner is 3–5x slower and looks flaky. Add a `runner_cold_start` boolean feature or exclude runs that hit it.
- Platform-specific flakes: A test can be 100% stable on Linux but flake on Windows due to path separators. Train per-platform models if you care about platform parity, or include `runner_os` as a feature.
- Infrastructure flakes masquerading as test flakes: Half of “flaky” failures on many teams are really CI instability — runner disk pressure, DNS hiccups, container pulls timing out. Separate these with a failure-reason taxonomy before labelling.
- Label drift: Your labels (what counts as flaky) change as the team's tolerance evolves. Version your labelling function in git and retrain whenever it changes — otherwise you are comparing apples to oranges month over month.
- Cypress auto-retries hide flakiness: With test retries enabled (`retries: 2`), a test that passes on its second attempt is reported as a plain pass. Surface per-attempt results (for example via the Module API's per-test attempt data or a custom reporter), otherwise your pass-rate variance will look artificially clean.
Rolling It Out: A Pragmatic Adoption Plan
You do not need a data platform to ship this. A scrappy pipeline can prove value in two weeks:
- Week 1 — Observe. Ship the collector. Point it at the last 30 days of CI runs. Compute naive pass-rate per test. This baseline alone typically surfaces the worst 5% offenders.
- Week 2 — Engineer features and train. Run the feature builder and the trainer. Pick a high-precision threshold. Commit the model artifact behind a feature flag.
- Week 3 — Wire the reporter. Start in “warn only” mode — no CI behaviour changes, just visible scores. Watch how often the predictions match engineer intuition.
- Week 4 — Enable soft quarantine. When a test fails and its score is above threshold, mark it non-blocking instead of failing the build. Track how many real regressions leak in (ideally zero at precision ≥ 0.8).
- Month 2 — Automate retraining. A cron job that runs the full pipeline nightly and commits the updated scores JSON is enough for most teams. Full MLOps with model registry is nice-to-have, not a prerequisite.
Conclusion: Move Flaky Tests from Chore to Signal
Flaky tests are not an immutable cost of doing business. They are a measurable, predictable property of your suite — and ML gives you a cheap way to price that property. The pipeline in this post is intentionally boring: pull CI data, engineer a handful of features, train a gradient boosted model, thread it into your reporter.
The payoff is cultural as much as technical. Engineers stop treating red builds as a coin flip. QA leads stop losing hours to manual quarantine triage. Management gets a real lever to pull on CI reliability. That's the Foundation series thesis: boring, well-built tooling wins more fights than exotic heroics.
Want help putting this into your stack? The Desplega.ai team has shipped flaky-test prediction for QA groups across Barcelona, Madrid, Valencia, and Malaga — we can take you from zero to production in a sprint.
Ready to strengthen your test automation?
Desplega.ai helps QA teams build robust test automation frameworks that scale with your product.
Get Started
Frequently Asked Questions
What is a flaky test and why does it matter?
A flaky test randomly passes or fails on identical code. It erodes CI trust, wastes engineer time on false alarms, and masks real regressions hiding behind noise.
What features are most predictive of test flakiness?
Historical pass rate variance, execution time standard deviation, async operation count, network call frequency, and test file change rate are the strongest flakiness signals.
How much historical test data do I need to train a flakiness model?
Aim for at least 30 CI runs per test for minimal signal. Models trained on 100+ runs per test achieve meaningfully higher precision — especially for rare edge-case patterns.
Can I use this approach with Cypress or Selenium instead of Playwright?
Yes — the ML pipeline is framework-agnostic. Collect structured run results from any framework into CSV, then train the same scikit-learn model on the extracted features.
How do I avoid quarantining stable tests with false positives?
Set a high precision threshold (0.80+), require 3 consecutive positive predictions before quarantine, and build a manual override review step for all flagged tests.
Related Posts
Hot Module Replacement: Why Your Dev Server Restarts Are Killing Your Flow State | desplega.ai
Stop losing 2-3 hours daily to dev server restarts. Master HMR configuration in Vite and Next.js to maintain flow state, preserve component state, and boost coding velocity by 80%.
The Flaky Test Tax: Why Your Engineering Team is Secretly Burning Cash | desplega.ai
Discover how flaky tests create a hidden operational tax that costs CTOs millions in wasted compute, developer time, and delayed releases. Calculate your flakiness cost today.
The QA Death Spiral: When Your Test Suite Becomes Your Product | desplega.ai
An executive guide to recognizing when quality initiatives consume engineering capacity. Learn to identify test suite bloat, balance coverage vs velocity, and implement pragmatic quality gates.