Testing Guide
Flaky Test Detection and CI Optimization: Finding the 12% That Wastes Your Build Time
Flaky test detection is the most underrated CI optimization. Most teams configure automatic retries and move on, never investigating why tests fail intermittently. Systematic tracking reveals just how much build time and engineering confidence these tests silently consume.
“After implementing systematic flaky test tracking, a team discovered 12% of their test suite was consistently flaky. Fixing those tests saved more CI time than any other single optimization.”
CI optimization case study
2. The Retry Trap
The default response to flaky tests is enabling automatic retries. Most CI systems and test frameworks support this natively: if a test fails, re-run it up to N times, and only count it as a failure if all retries fail. This makes the green build appear more often, which satisfies the immediate desire for pipeline stability. Teams enable retries, the build goes green, and everyone moves on to other work.
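In Playwright's test runner, this pattern is a one-line config change, which is part of why it is so tempting. A minimal sketch (the `CI` environment check is a common convention, not a requirement):

```typescript
// playwright.config.ts -- enabling automatic retries
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Re-run a failed test up to 2 more times before reporting failure.
  // Note: Playwright marks tests that pass on retry as "flaky" in its
  // report -- that status is the signal worth tracking, not discarding.
  retries: process.env.CI ? 2 : 0,
});
```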
The problem is that retries mask the symptom without addressing the cause. Every retry consumes compute resources and adds latency to the pipeline. More importantly, retries remove the signal that something is wrong. A test that fails on the first attempt and passes on the second is telling you something valuable: there is a non-deterministic behavior in your test, your application, or the interaction between them. By retrying and ignoring the initial failure, you lose that signal.
Some teams compound the problem by setting high retry counts (3 or even 5 retries) with generous timeouts. A test with a 20% per-attempt failure rate that is given three attempts will show green 99.2% of the time, since the build only reports failure when all three attempts fail (0.2³ = 0.8%), making the flakiness nearly invisible in build results. But the underlying issues remain: the shared database state, the unwaited promise, the assumption about element rendering order. These issues do not get better over time. They get worse as the application grows and more tests inherit the same problematic patterns.
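The arithmetic is worth making explicit, because it shows how quickly retries hide even severe flakiness:

```typescript
// Probability that a flaky test eventually shows green, given a
// per-attempt failure rate and a total number of attempts.
function observedPassRate(failureRate: number, attempts: number): number {
  // The build only reports failure when every attempt fails.
  return 1 - Math.pow(failureRate, attempts);
}

// A test that fails 20% of the time, allowed three attempts:
console.log(observedPassRate(0.2, 3)); // 0.992 -> green 99.2% of the time
// The same test with five attempts:
console.log(observedPassRate(0.2, 5)); // 0.99968 -> effectively invisible
```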
3. Systematic Flaky Test Tracking
The alternative to blind retries is systematic tracking. The approach is simple: log every test execution result, including retried attempts. When a test passes on retry, record it as a "flaky pass" rather than a clean pass. Over time, this data reveals which tests are consistently flaky, how often they fail, and whether the flakiness is getting better or worse.
The implementation varies by CI system. Some platforms (BuildPulse, Launchable, Trunk) offer dedicated flaky test tracking as a service. For teams that prefer to build their own, the approach is straightforward: modify your test reporter to emit structured results (including retry counts) to a database or analytics service, then build a dashboard that shows flakiness rates per test over time. Even a simple spreadsheet updated weekly with "tests that needed retries" provides more visibility than most teams have.
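For the build-your-own route, the core bookkeeping is small. A minimal sketch of the ledger described above (the `TestRecord` and `FlakyLedger` names are illustrative, not a real library; in practice you would feed it from your test reporter's per-test hook and persist the counts somewhere durable):

```typescript
// Record every execution, classify pass-on-retry as a "flaky pass",
// and report per-test flakiness rates, worst offenders first.
type Outcome = 'pass' | 'flaky-pass' | 'fail';

interface TestRecord {
  name: string;
  attempts: number; // total attempts, including retries
  passed: boolean;  // final status after all retries
}

class FlakyLedger {
  private runs = new Map<string, { total: number; flaky: number }>();

  record(r: TestRecord): Outcome {
    const outcome: Outcome =
      !r.passed ? 'fail' : r.attempts > 1 ? 'flaky-pass' : 'pass';
    const entry = this.runs.get(r.name) ?? { total: 0, flaky: 0 };
    entry.total += 1;
    if (outcome === 'flaky-pass') entry.flaky += 1;
    this.runs.set(r.name, entry);
    return outcome;
  }

  // Share of runs that needed a retry, sorted worst-first: this is the
  // list you work through when prioritizing fixes.
  flakiestFirst(): Array<{ name: string; flakinessRate: number }> {
    return [...this.runs.entries()]
      .map(([name, e]) => ({ name, flakinessRate: e.flaky / e.total }))
      .sort((a, b) => b.flakinessRate - a.flakinessRate);
  }
}
```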
The results of systematic tracking are consistently eye-opening. Teams that implement tracking for the first time typically discover that a small number of tests are responsible for the vast majority of retries. In one well-documented case, 12% of the test suite accounted for over 80% of all retry attempts. These tests had been silently draining CI resources for months, with each individual test appearing "mostly green" due to retries.
Once you can see the data, prioritization becomes obvious. Sort tests by flakiness rate, start with the worst offenders, and fix them one by one. Each fix reduces total CI time, speeds up the feedback loop for developers, and improves trust in the test suite. This focused approach delivers measurable ROI quickly, often paying for the tracking investment within the first sprint.
4. Common Root Causes and How to Fix Them
Timing issues are the most common cause of flaky E2E tests. A test clicks a button and immediately asserts that a result appeared, but the result loads asynchronously and is sometimes not rendered within the assertion timeout. Playwright's auto-waiting mitigates this significantly compared to older frameworks, but tests can still be flaky if they wait for the wrong condition. The fix is to wait for the specific element or state that indicates the operation completed, not for an arbitrary timeout to elapse.
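The principle generalizes beyond any one framework. A minimal sketch of condition-based waiting, polling a predicate until it holds or a deadline passes (the helper name and defaults are illustrative):

```typescript
// "Wait for the specific condition, not a fixed timeout": poll a check
// until it returns true, failing only when the overall deadline passes.
async function waitForCondition(
  check: () => boolean | Promise<boolean>,
  { timeoutMs = 5000, intervalMs = 50 } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await check()) return;
    if (Date.now() > deadline) throw new Error('condition not met in time');
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Inside Playwright tests themselves, prefer the built-in web-first assertions such as `await expect(locator).toBeVisible()`, which poll the same way and produce better failure messages.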
Shared state between tests is the second most common cause. Test A creates a user, Test B assumes that user does not exist. If the tests run in a different order (or in parallel), Test B fails. The fix is test isolation: each test should set up its own state and clean up afterward, or use unique identifiers that prevent collisions. Playwright's browser context isolation helps with client-side state, but server-side state (databases, caches, message queues) requires explicit setup and teardown.
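The unique-identifier approach is often the cheapest fix, because it works even when tests run in parallel against a shared backend. A sketch, assuming a hypothetical `makeTestUser` helper that your setup code would feed into whatever API creates fixtures:

```typescript
import { randomUUID } from 'node:crypto';

// Each test builds its own user instead of sharing fixtures, so no test
// can observe or destroy another test's state.
function makeTestUser(testName: string) {
  const id = randomUUID();
  return {
    id,
    // A unique email prevents collisions when tests run in parallel
    // or in an unexpected order.
    email: `user-${id}@example.test`,
    label: `${testName}-${id}`,
  };
}
```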
Network dependencies cause a third category of flakiness. Tests that hit real external APIs (payment processors, email services, third-party integrations) will occasionally fail due to network timeouts or service outages. The fix is mocking external dependencies at the network level using Playwright's route interception or a service like WireMock. This makes tests faster, deterministic, and independent of external service availability.
Resource contention is less obvious but affects teams with large parallel test suites. When 50 tests run simultaneously, they compete for CPU, memory, database connections, and port allocations. Tests that pass reliably in isolation fail intermittently under load. The fix involves a combination of better resource management, reduced parallelism for resource-heavy tests, and ensuring that test infrastructure scales with the suite.
5. The Quarantine Strategy
While you work on fixing flaky tests, quarantining them prevents them from blocking the pipeline. A quarantined test still runs on every build, but its results are tracked separately and do not affect the build status. This preserves the signal (you can still see the test results) while removing the noise (flaky failures no longer block merges or deployments).
The quarantine should have clear ownership and time limits. Each quarantined test should be assigned to a specific engineer with a deadline for fixing or removing it. Without accountability, the quarantine becomes a permanent dumping ground where flaky tests go to be forgotten. A weekly review of quarantined tests ensures they get fixed, and a policy of removing (not just quarantining) tests that have been flaky for more than a defined period keeps the quarantine from growing indefinitely.
Some teams combine quarantine with automatic detection. When the tracking system identifies a test that has failed intermittently more than a threshold number of times in the past week, it automatically quarantines the test and creates a ticket for investigation. This removes the manual step of identifying and quarantining flaky tests, making the system self-maintaining. Tools like Assrt can help by analyzing test execution patterns and flagging tests with high flakiness scores for review.
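The detection rule itself is simple once the execution log exists. A sketch of the threshold check described above (the `RunLog` shape and default limits are illustrative; the ticket-creation side effect is left out):

```typescript
// Flag any test whose retried runs in a trailing window exceed a limit,
// making it a candidate for automatic quarantine.
interface RunLog {
  test: string;
  timestamp: number;  // ms since epoch
  neededRetry: boolean;
}

function testsToQuarantine(
  logs: RunLog[],
  now: number,
  { windowDays = 7, maxFlakyRuns = 3 } = {},
): string[] {
  const windowStart = now - windowDays * 24 * 60 * 60 * 1000;
  const counts = new Map<string, number>();
  for (const log of logs) {
    if (log.timestamp >= windowStart && log.neededRetry) {
      counts.set(log.test, (counts.get(log.test) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .filter(([, n]) => n > maxFlakyRuns)
    .map(([test]) => test);
}
```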
6. Preventing Flakiness Before It Starts
The most effective strategy is preventing flaky tests from entering the suite. This starts with test design patterns: always use explicit waits for specific conditions rather than fixed timeouts, always isolate test state, always mock external dependencies, and always verify that tests pass consistently by running them multiple times before merging.
A pre-merge flakiness check is particularly effective. Before a new test is merged, run it 10 to 20 times in CI. If it fails even once, it is likely flaky and should be fixed before merging. This check catches timing issues and state dependencies that a single run might miss. The additional CI time is a worthwhile investment because fixing a flaky test before it is merged is dramatically cheaper than diagnosing and fixing it after it has been failing intermittently for weeks.
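Playwright's `repeatEach` option makes this check easy to wire up. One possible configuration sketch (the `FLAKE_CHECK` variable and the count of 10 are conventions to adapt, not a prescribed setup):

```typescript
// playwright.config.ts -- opt-in pre-merge flakiness check
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // When FLAKE_CHECK is set (e.g. on PRs that add or change tests),
  // run each selected test 10 times; any single failure fails the check.
  repeatEach: process.env.FLAKE_CHECK ? 10 : 1,
  retries: 0, // no retries here: the point is to see every failure
});
```

A CI job would then run something like `FLAKE_CHECK=1 npx playwright test path/to/new.spec.ts` against only the changed spec files, keeping the extra CI time proportional to the size of the change.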
AI-generated tests from tools like Assrt can help with prevention by generating tests that follow deterministic patterns from the start. Because Assrt analyzes the actual application behavior during discovery, the generated tests use appropriate wait conditions for each interaction rather than guessing at timeouts. Running npx @m13v/assrt discover https://your-app.com produces Playwright tests with self-healing selectors and built-in stability patterns, giving you a flakiness-resistant starting point that you can extend with confidence.
Ready to automate your testing?
Assrt discovers test scenarios, writes Playwright tests from plain English, and self-heals when your UI changes.