Test Wars – Episode V: The Non-Determinism Menace
"In an AI-driven galaxy, randomness surrounds us and binds our bugs together."
TL;DR: As AI and distributed systems push deeper into our pipelines, traditional deterministic tests are losing the battle against flakiness. These random failures hide real regressions, waste developer cycles, and erode team trust. This post breaks down the new sources of non-determinism, what today's tooling can (and can't) do, and provides a practical framework for deciding when random is too random.
A Disturbance in the Force
Roughly 8% of your team's time is spent just fixing flaky tests. Flakiness isn't just an annoyance; it's a significant drain on resources, a tax on your team's velocity, and AI is only exacerbating it. Why?
Generative AI introduces a powerful new variable: probabilistic outputs. When you integrate features based on Large Language Models (LLMs), you are intentionally injecting randomness. Every token-sampling decision can create subtle variations that ripple through downstream assertions. This isn't a bug; it's a feature that our deterministic test harnesses were never designed for.
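To make that concrete, here is a minimal Jest-style sketch of the problem; the `summarize` function is a hypothetical wrapper around an LLM call, stubbed so the snippet runs, whereas in production its wording would vary from run to run.

```typescript
// Hypothetical wrapper around an LLM call. Stubbed here so the example runs;
// in production, token sampling makes the exact wording vary between runs.
async function summarize(_ticket: string): Promise<string> {
  return 'The customer reports that they are unable to reset their password.';
}

it('summarizes a support ticket', async () => {
  const summary = await summarize('Customer cannot reset their password.');

  // Brittle: any sampling variation in the model output breaks this assertion.
  // expect(summary).toBe('The customer is unable to reset their password.');

  // More resilient: assert properties of a good summary, not its exact text.
  expect(summary.toLowerCase()).toContain('password');
  expect(summary.length).toBeLessThan(200);
});
```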
While AI promises huge productivity gains, the latest industry data reveals a trade-off. The 2024 DORA report found that while 75.9% of developers are adopting AI, this adoption is linked to a 7.2% decrease in delivery stability. We are, as an industry, "moving fast and breaking things".
Mapping the Flakiness Constellations
Martin Fowler's classic taxonomy of non-determinism—time, concurrency, networking, and randomness—still forms the core of our galaxy map. But modern stacks have added new, volatile nebulae:
| Flakiness Vector | Modern Twist | Mitigation |
| --- | --- | --- |
| Async orchestration | Thousands of micro-services scheduled by K8s | Inject virtual-time controls |
| Heisen-mocks | AI-generated tests mutate faster than infra | Golden-dataset pinning + snapshot governance |
| Data skews | Non-IID inputs from RLHF or RAG pipelines | Property-based fuzzing |
| Infra noise | GKE pre-emptible nodes, spot VMs | Chaos-monkey style soak runners |
- Async Orchestration: With thousands of microservices scheduled by platforms like Kubernetes, the exact order of operations is never guaranteed, creating timing issues that are nearly impossible to replicate locally (see the virtual-time sketch after this list).
- Heisen-mocks: AI-powered code-generation tools create and mutate tests faster than infrastructure can be updated to support them, leading to tests that fail due to environmental drift.
- Data Skews: Non-IID (Not Independent and Identically Distributed) inputs from Reinforcement Learning from Human Feedback (RLHF) or Retrieval-Augmented Generation (RAG) pipelines make golden-dataset testing insufficient.
- Infrastructure Noise: Ephemeral resources like GKE pre-emptible nodes and spot VMs introduce a layer of hardware-level unpredictability that can terminate a test runner mid-execution.
- The browser front: GUI tests are notoriously flaky. Chrome engineers devote an entire "FlakyBot" pipeline to detect and auto-disable unstable test cases to protect the integrity of the main branch. This reinforces the wisdom of Fowler's Practical Test Pyramid: high-level UI tests are the most expensive to run and maintain and should be the thinnest layer of your test suite.
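To illustrate the virtual-time mitigation mentioned above, here is a minimal Jest sketch; `scheduleRetry` is a hypothetical helper used only to show the pattern of advancing a fake clock instead of sleeping.

```typescript
// Hypothetical helper that schedules a retry after a delay.
function scheduleRetry(fn: () => void, delayMs: number): void {
  setTimeout(fn, delayMs);
}

describe('scheduleRetry', () => {
  beforeEach(() => jest.useFakeTimers()); // swap the real clock for a virtual one
  afterEach(() => jest.useRealTimers());

  it('fires exactly once after the configured delay', () => {
    const callback = jest.fn();
    scheduleRetry(callback, 5_000);

    // Nothing should happen until virtual time advances.
    expect(callback).not.toHaveBeenCalled();

    // Advance the virtual clock instead of sleeping for 5 real seconds.
    jest.advanceTimersByTime(5_000);
    expect(callback).toHaveBeenCalledTimes(1);
  });
});
```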
To manage this at scale, Meta's engineering team formalized a Probabilistic Flakiness Score (PFS) to make data-driven decisions about when a test should be quarantined versus fixed immediately (see the sketch below).
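Meta has not published the exact formula, but a rough sketch of the idea (estimate how often a test fails on revisions where it should pass, then triage on that estimate) might look like this; the `FlakeRecord` shape and the thresholds are assumptions for illustration only.

```typescript
// Hypothetical record of a test's recent runs against known-good revisions.
interface FlakeRecord {
  testId: string;
  runs: boolean[]; // true = pass, false = fail, all on unchanged code
}

// Rough stand-in for a probabilistic flakiness score: the observed
// failure rate on revisions where the test "should" pass.
function flakinessScore(record: FlakeRecord): number {
  if (record.runs.length === 0) return 0;
  const failures = record.runs.filter((passed) => !passed).length;
  return failures / record.runs.length;
}

// Assumed policy thresholds; Meta's real cut-offs are not public.
function triage(record: FlakeRecord): 'healthy' | 'quarantine' | 'fix-now' {
  const score = flakinessScore(record);
  if (score === 0) return 'healthy';
  return score < 0.05 ? 'quarantine' : 'fix-now';
}
```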
When Is Random Too Random? — A Decision Matrix
Not all flaky tests are created equal. Before investing engineering hours in stabilizing a test, apply a strategic lens by evaluating four factors:
- Severity: Does a failure block a production release or critical path?
- Masking Risk: Could this test's flakiness be hiding a real, subtle regression?
- Cost to Stabilize: What is the engineering effort required for isolation, retries, or deterministic seeding?
- Signal Value: If you remove the randomness, does the test still provide a valuable signal about user-facing quality?
Use this strategic calculation: invest in making a test deterministic only when the risk outweighs the cost.
(Severity × Masking Risk) > (Cost to Stabilize ÷ Signal Value)
Otherwise, quarantine or delete it. Remember: tests are code, and dead code is technical debt.
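As a minimal sketch of that rule of thumb in code (the 1-to-5 scales and the example values are assumptions, not a standard):

```typescript
interface FlakyTestAssessment {
  severity: number;        // 1-5: how badly a failure blocks a release
  maskingRisk: number;     // 1-5: likelihood the noise hides a real regression
  costToStabilize: number; // 1-5: engineering effort to make it deterministic
  signalValue: number;     // 1-5: how much user-facing quality it still covers
}

// (Severity × Masking Risk) > (Cost to Stabilize ÷ Signal Value)
function shouldStabilize(a: FlakyTestAssessment): boolean {
  return a.severity * a.maskingRisk > a.costToStabilize / a.signalValue;
}

// Example: high-severity, likely-masking test that is cheap to fix → stabilize.
shouldStabilize({ severity: 4, maskingRisk: 3, costToStabilize: 2, signalValue: 4 }); // true
```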
Battle-Tested Tactics
Here are proven strategies for managing non-determinism across your test suites:
- Statistical Retry Budget: Implement a threshold for retries based on a Probabilistic Flakiness Score, as pioneered by Meta. This is ideal for large, inherently flaky test suites where individual fixes are impractical.
- Deterministic Seeding: For randomness-heavy code (e.g., simulations, complex algorithms), force a consistent outcome by passing a fixed seed, like the `--seed` flag in Jest that the NestJS community used to debug its non-deterministic e2e failures (see the seeding sketch after this list).
- Quarantine Buckets: Automatically isolate and flag unstable tests using GitHub labels and CI filters. This prevents flaky tests from blocking releases while still tracking them for future analysis.
- Property-Based Testing: Instead of testing for one specific outcome, define the properties of a correct outcome and let a framework generate hundreds of random inputs that try to violate those properties. This is highly effective for testing AI/RAG pipelines (a fast-check sketch follows the table below).
- Ownership Rotation: Assign clear ownership for test health. The Chromium team uses an "OWNERS fix your flakes" bot that automatically assigns bugs to the relevant team, fostering org-wide accountability.
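Here is a minimal sketch of the seeding tactic (not the NestJS fix itself): wire Jest's seed into your own pseudo-random generator so a failing run can be replayed exactly. The `mulberry32` helper is a small PRNG included for illustration, and `jest.getSeed()` assumes Jest 29.2 or newer.

```typescript
// Tiny deterministic PRNG (mulberry32); any seedable generator would do.
function mulberry32(seed: number): () => number {
  let a = seed;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

describe('retry backoff jitter', () => {
  it('stays within the allowed window', () => {
    // jest.getSeed() returns the value passed via `--seed` (Jest >= 29.2),
    // so a failing run can be replayed with the exact same "random" inputs.
    const random = mulberry32(jest.getSeed());

    const jitterMs = Math.floor(random() * 100);
    expect(jitterMs).toBeGreaterThanOrEqual(0);
    expect(jitterMs).toBeLessThan(100);
  });
});
```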
| Tactic | Example / Tool | Best for |
| --- | --- | --- |
| Statistical retry budget | Meta's Probabilistic Flakiness Score | Large flaky suites |
| Deterministic seeding | Jest `--seed` in the NestJS repo | RNG-heavy code |
| Quarantine buckets | GitHub labels + CI filters | Isolating unstable shards |
| Property-based tests | AWS "Beyond Traditional Testing" | AI / RAG pipelines |
| Ownership rotation | Chromium "OWNERS fix your flakes" | Org-wide accountability |
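And to make the property-based row concrete, here is a minimal sketch using the fast-check library; `normalizeWhitespace` is a hypothetical stand-in for a pre-processing step in a RAG ingestion pipeline.

```typescript
import fc from 'fast-check';

// Hypothetical pre-processing step from a RAG ingestion pipeline.
function normalizeWhitespace(input: string): string {
  return input.trim().replace(/\s+/g, ' ');
}

describe('normalizeWhitespace', () => {
  it('is idempotent and never grows the input', () => {
    fc.assert(
      fc.property(fc.string(), (s) => {
        const once = normalizeWhitespace(s);
        // Property 1: applying it twice changes nothing further.
        expect(normalizeWhitespace(once)).toBe(once);
        // Property 2: the output is never longer than the input.
        expect(once.length).toBeLessThanOrEqual(s.length);
      }),
    );
  });
});
```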
Toward Probabilistic Quality
The reality of an AI-driven galaxy is that we won't eliminate randomness; we'll learn to manage it.
The future of quality assurance is less about forcing perfect determinism and more about quantifying uncertainty. The goal is to build systems that surface distribution shifts early, tighten confidence intervals on quality, and empower product owners to make informed, data-driven decisions that balance risk with velocity.

References
- NestJS — "E2E tests are non-deterministic" Issue #15239, GitHub, 2025
- Martin Fowler — "Eradicating Non-Determinism in Tests"
- Meta Engineering — "Probabilistic flakiness: reliable test signals at scale"
- AWS / dev.to — "Beyond Traditional Testing: Addressing the Challenges of Non-Deterministic Software"
- Chrome Developers — "The Chromium Chronicle #2: Fighting Test Flakiness"
- Chromium Infrastructure — "Flaky Web Tests Documentation"
- DORA — "2024 Accelerate State of DevOps Report"
- LambdaTest — "Future of Quality Assurance Report"
Ready to tackle non-determinism in your test suite? Let's discuss your flakiness challenges.