Fixing Flaky CI Tests: Ownership, Pruning, and Smarter PR Gates
Flaky tests are the number one reason developers stop trusting CI pipelines. When the build fails for reasons unrelated to your change, you learn to ignore failures. Here is how to fix that.
“Teams with flaky test rates above 5% see a 3x increase in time-to-merge for pull requests.”
Google Engineering Practices, Testing on the Toilet
1. Why Flaky Tests Destroy Developer Trust
A flaky test is one that passes and fails on the same code without any changes. The causes vary: timing dependencies, shared state between tests, non-deterministic data, network calls to external services, or browser rendering inconsistencies. The root causes are technical, but the damage is organizational.
When a developer opens a pull request and the CI pipeline fails on a test that has nothing to do with their change, they have two options: investigate the flaky test (which could take hours) or re-run the pipeline and hope it passes. Almost everyone chooses the re-run. After this happens three or four times, developers learn that CI failures are not meaningful signals. They start merging with failing tests. They stop writing new tests because they assume the suite is unreliable.
Google has published extensively on this problem. Their internal research found that a test with a 1% flake rate will cause a false failure on roughly one in four CI runs for a typical team; at a 5% flake rate, false failures happen on nearly every run. The compounding effect is severe: a suite of 500 tests, each with an independent 1% flake rate, passes cleanly on fewer than 1% of runs, because the probability that all 500 pass is 0.99^500.
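The compounding math is easy to verify yourself. A minimal sketch in Python, using the 1% and 500-test figures from above and assuming each test flakes independently:

```python
def clean_pass_probability(num_tests: int, flake_rate: float) -> float:
    """Probability that a single CI run has zero false failures,
    assuming each test flakes independently at the given rate."""
    return (1 - flake_rate) ** num_tests

# One test at a 1% flake rate: 99% of runs are clean.
print(f"{clean_pass_probability(1, 0.01):.4f}")    # 0.9900

# 500 tests, each at a 1% flake rate: well under 1% of runs are clean.
print(f"{clean_pass_probability(500, 0.01):.4f}")  # 0.0066
```

The independence assumption is generous; shared infrastructure failures tend to correlate, but the qualitative conclusion holds either way.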
2. Flake Ownership: Assign It or Delete It
The most effective organizational intervention for flaky tests is mandatory ownership with a fix-or-delete deadline. When a test is identified as flaky (through CI analytics, developer reports, or automated detection), it gets assigned to a specific person with a clear deadline: fix the flakiness within two weeks, or the test gets deleted.
This sounds aggressive, and it is. The alternative is worse. An unowned flaky test will sit in your suite for months, causing false failures, eroding trust, and wasting developer time on every single PR. Deletion is not ideal, but a deleted test causes zero false failures. You can always rewrite it correctly later.
The ownership model works best with automated flake detection. Tools like BuildPulse, Trunk Flaky Tests, and Datadog CI Visibility can track which tests flake, how often, and on which branches. When a test crosses a flake threshold (say, more than 2 failures in 100 runs with no code changes), it automatically gets flagged and assigned.
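The threshold logic those tools implement reduces to a simple rate check over a test's history on unchanged code. A sketch, using the "more than 2 failures in 100 runs" threshold mentioned above (the test names and data shape are illustrative assumptions, not any vendor's schema):

```python
from dataclasses import dataclass

@dataclass
class TestHistory:
    name: str
    runs: int      # executions on unchanged code
    failures: int  # failures among those runs

def is_flaky(h: TestHistory, threshold: float = 2 / 100, min_runs: int = 100) -> bool:
    """Flag a test whose failure rate on unchanged code exceeds the
    threshold (here: more than 2 failures per 100 runs)."""
    if h.runs < min_runs:
        return False  # too little data to judge
    return h.failures / h.runs > threshold

suite = [
    TestHistory("test_checkout_total", runs=120, failures=5),
    TestHistory("test_login_redirect", runs=200, failures=1),
]
flagged = [t.name for t in suite if is_flaky(t)]
print(flagged)  # ['test_checkout_total']
```

Once a test is flagged, the assignment step is just a lookup against your code-ownership data (CODEOWNERS or equivalent).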
Some teams quarantine flaky tests instead of deleting them. The quarantined tests still run in CI, but their results do not block the pipeline. This preserves the test coverage while removing the developer trust damage. The quarantine should still have an expiration date. If nobody fixes the test within the deadline, it gets deleted from the suite entirely.
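In a pytest suite, quarantine can be implemented with a custom marker and pytest's built-in `xfail` mechanism: marked tests still run and report, but their failures no longer break the build. The `quarantine` marker name and its `expires` argument are conventions assumed for this sketch, not pytest built-ins:

```python
# conftest.py -- a minimal quarantine sketch for pytest.
# Tests marked @pytest.mark.quarantine still execute, but a failure is
# recorded as "xfailed" instead of failing the pipeline.
import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "quarantine(expires): known-flaky test; does not block CI"
    )

def pytest_collection_modifyitems(config, items):
    for item in items:
        if item.get_closest_marker("quarantine"):
            # strict=False: a pass is fine, a failure is non-blocking.
            item.add_marker(
                pytest.mark.xfail(reason="quarantined flaky test", strict=False)
            )
```

A quarantined test would then look like `@pytest.mark.quarantine(expires="2025-07-01")` above its definition; enforcing the expiration date is left to a separate scheduled check.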
3. File-Change-Based Test Path Analysis
Running your entire test suite on every pull request is a brute-force approach that does not scale. As your suite grows, CI times increase, flake probability compounds, and developers wait longer for feedback. The alternative is test path analysis: determine which tests are affected by the files changed in a PR, and only run those.
The simplest version of this is file-based matching. If a PR changes files in the /checkout directory, only run tests tagged with the checkout feature. This requires disciplined test organization (tagging tests by feature area) but dramatically reduces CI run times and flake exposure.
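File-based matching can be expressed as a small lookup from changed paths to test tags. A sketch, where the directory names and tag names are illustrative assumptions about your repo layout:

```python
# Map feature directories to test tags; illustrative assumptions only.
FEATURE_TAGS = {
    "checkout/": "checkout",
    "auth/": "auth",
    "search/": "search",
}

def tags_for_changed_files(changed_files: list[str]) -> set[str]:
    """Return the set of test tags to run for a PR's changed files.
    Any file outside the mapped areas falls back to running everything,
    so an unmapped change never slips through untested."""
    tags: set[str] = set()
    for path in changed_files:
        for prefix, tag in FEATURE_TAGS.items():
            if path.startswith(prefix):
                tags.add(tag)
                break
        else:
            return {"all"}  # unknown area: run the full suite
    return tags

print(tags_for_changed_files(["checkout/cart.py", "checkout/tax.py"]))
# {'checkout'}
```

The resulting tags can be passed to your runner's filter flag, for example `pytest -m checkout` if the tags are pytest markers.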
More sophisticated approaches use dependency analysis. Tools like Nx, Turborepo, and Bazel understand the dependency graph of your codebase and can determine exactly which test files are affected by a given source change. This is more accurate than tag-based matching but requires a build system that supports it.
For end-to-end tests specifically, the challenge is that a small backend change might affect any number of user flows. Tools like Assrt that generate tests from discovered user flows can help here by mapping which flows touch which API endpoints. When a backend file changes, you can determine which user flows might be affected and run only those end-to-end tests. Launchable and Gradle's Predictive Test Selection offer similar capabilities through machine learning models trained on your test history.
4. Separating Signal from Noise in CI
Beyond flake management and test selection, there are structural changes to your CI pipeline that improve signal quality. The most important is separating your test suite into tiers with different blocking behaviors.
Tier 1 (blocking): Fast, deterministic tests that must pass before a PR can merge. Unit tests, type checks, linting, and a small set of critical-path end-to-end tests. This tier should run in under 5 minutes and have a flake rate as close to zero as possible.
Tier 2 (advisory): Broader test coverage that runs on every PR but does not block merging. Results are reported as comments or status checks that developers can review. Integration tests, broader end-to-end coverage, and performance benchmarks belong here.
Tier 3 (scheduled): Comprehensive test runs that happen on a schedule (nightly, or on main branch merges) rather than per-PR. Full cross-browser testing, visual regression suites, and exploratory test generation belong here. If a Tier 3 failure is traced to a specific PR, the author is notified after the fact.
This tiered approach means that developers get fast, reliable feedback on their PR (Tier 1) without being blocked by slower or less reliable tests (Tiers 2 and 3). The full suite still runs, but it runs in a context where flakiness does not block anyone.
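The tiering above boils down to a small routing table: each tier declares whether it blocks merging and when it runs. The table shape below is an illustrative assumption, not any particular CI system's schema; the tier contents and triggers are the ones described above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    blocking: bool  # does a failure block merging?
    trigger: str    # when this tier runs

TIERS = [
    Tier("unit+lint+critical-e2e", blocking=True,  trigger="every PR"),
    Tier("integration+broad-e2e",  blocking=False, trigger="every PR"),
    Tier("cross-browser+visual",   blocking=False, trigger="nightly"),
]

def merge_blocked(failed_tiers: set[str]) -> bool:
    """A PR is blocked only if a blocking tier failed."""
    return any(t.blocking and t.name in failed_tiers for t in TIERS)

print(merge_blocked({"integration+broad-e2e"}))   # False: advisory only
print(merge_blocked({"unit+lint+critical-e2e"}))  # True: Tier 1 blocks
```

In practice this table lives in your CI configuration rather than application code, but the decision rule is the same.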
5. Rebuilding Trust in Your Test Suite
If your team has already lost trust in CI, recovering takes deliberate effort. The technical fixes (flake detection, quarantine, test selection) are necessary but not sufficient. You also need to rebuild the cultural expectation that CI failures are meaningful.
Start with a flake audit. Run your entire suite 10 times on the same commit with no changes. Every test that fails at least once is flaky. Quarantine or delete all of them immediately. This is painful if you have hundreds of flaky tests, but it is the fastest path to a clean baseline. You can write new, reliable tests to replace the deleted coverage later.
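The audit itself is a set computation: any test that fails on any of the identical runs is flaky by definition, since the code never changed. A sketch over already-collected per-run results (how you invoke the runner ten times is up to your CI; the test names below are illustrative):

```python
def find_flaky(run_results: list[dict[str, bool]]) -> set[str]:
    """Given pass/fail results from N runs of the same commit, return
    every test that failed at least once. Because the code did not
    change between runs, any failure is flakiness."""
    flaky: set[str] = set()
    for run in run_results:
        flaky.update(name for name, passed in run.items() if not passed)
    return flaky

# Three (of, say, ten) runs against the same commit.
runs = [
    {"test_cart": True,  "test_login": True},
    {"test_cart": False, "test_login": True},  # test_cart flaked here
    {"test_cart": True,  "test_login": True},
]
print(find_flaky(runs))  # {'test_cart'}
```

Everything this function returns goes straight into quarantine or deletion, giving you the clean baseline described above.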
Next, establish a "green main" policy. The main branch test suite should always pass. When it fails, fixing it is the highest priority. No new features ship until main is green. This sounds extreme, but it sets the cultural standard that test results matter.
Finally, make test reliability a tracked metric. Measure flake rate weekly, track it on a dashboard, and celebrate when it goes down. Tools like BuildPulse, Trunk, and even simple scripts that parse CI logs can provide this data. When the team sees flake rate dropping from 8% to 2% to 0.5%, they start trusting the pipeline again.
Automated test generation tools (Assrt, Playwright codegen, QA Wolf) can accelerate the recovery by generating fresh, reliable tests to replace the deleted flaky ones. The generated tests start from a clean slate, without the accumulated state dependencies and timing assumptions that made the old tests flaky in the first place.
Ready to automate your testing?
Assrt discovers test scenarios, writes Playwright tests from plain English, and self-heals when your UI changes.