Testing Guide
Flaky Test Detection and CI Optimization: Finding the 12% That Wastes Your Build Time
Flaky test detection is the most underrated CI optimization. Most teams configure automatic retries and move on, never investigating why tests fail intermittently. Systematic tracking reveals just how much build time and engineering confidence these tests silently consume.
“After implementing systematic flaky test tracking, a team discovered 12% of their test suite was consistently flaky. Fixing those tests saved more CI time than any other single optimization.”
CI optimization case study
2. The Retry Trap
The default response to flaky tests is enabling automatic retries. Most CI systems and test frameworks support this natively: if a test fails, re-run it up to N times, and only count it as a failure if all retries fail. This makes the green build appear more often, which satisfies the immediate desire for pipeline stability. Teams enable retries, the build goes green, and everyone moves on to other work.
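In Playwright's test runner, this pattern is a one-line config change, which is part of why it is so tempting. A minimal sketch (the `CI` environment check is a common convention, not a requirement):

```typescript
// playwright.config.ts -- enabling automatic retries
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Re-run a failed test up to 2 more times before reporting failure.
  // Note: Playwright marks tests that pass on retry as "flaky" in its
  // report -- that status is the signal worth tracking, not discarding.
  retries: process.env.CI ? 2 : 0,
});
```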
The problem is that retries mask the symptom without addressing the cause. Every retry consumes compute resources and adds latency to the pipeline. More importantly, retries remove the signal that something is wrong. A test that fails on the first attempt and passes on the second is telling you something valuable: there is a non-deterministic behavior in your test, your application, or the interaction between them. By retrying and ignoring the initial failure, you lose that signal.
Some teams compound the problem by setting high retry counts (3 or even 5 retries) with generous timeouts. A test with a 20% per-attempt failure rate that is given three attempts will show green 99.2% of the time, since the build only reports failure when all three attempts fail (0.2³ = 0.8%), making the flakiness nearly invisible in build results. But the underlying issues remain: the shared database state, the unwaited promise, the assumption about element rendering order. These issues do not get better over time. They get worse as the application grows and more tests inherit the same problematic patterns.
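The arithmetic is worth making explicit, because it shows how quickly retries hide even severe flakiness:

```typescript
// Probability that a flaky test eventually shows green, given a
// per-attempt failure rate and a total number of attempts.
function observedPassRate(failureRate: number, attempts: number): number {
  // The build only reports failure when every attempt fails.
  return 1 - Math.pow(failureRate, attempts);
}

// A test that fails 20% of the time, allowed three attempts:
console.log(observedPassRate(0.2, 3)); // 0.992 -> green 99.2% of the time
// The same test with five attempts:
console.log(observedPassRate(0.2, 5)); // 0.99968 -> effectively invisible
```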
3. Systematic Flaky Test Tracking
The alternative to blind retries is systematic tracking. The approach is simple: log every test execution result, including retried attempts. When a test passes on retry, record it as a "flaky pass" rather than a clean pass. Over time, this data reveals which tests are consistently flaky, how often they fail, and whether the flakiness is getting better or worse.
The implementation varies by CI system. Some platforms (BuildPulse, Launchable, Trunk) offer dedicated flaky test tracking as a service. For teams that prefer to build their own, the approach is straightforward: modify your test reporter to emit structured results (including retry counts) to a database or analytics service, then build a dashboard that shows flakiness rates per test over time. Even a simple spreadsheet updated weekly with "tests that needed retries" provides more visibility than most teams have.
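For the build-your-own route, the core bookkeeping is small. A minimal sketch of the ledger described above (the `TestRecord` and `FlakyLedger` names are illustrative, not a real library; in practice you would feed it from your test reporter's per-test hook and persist the counts somewhere durable):

```typescript
// Record every execution, classify pass-on-retry as a "flaky pass",
// and report per-test flakiness rates, worst offenders first.
type Outcome = 'pass' | 'flaky-pass' | 'fail';

interface TestRecord {
  name: string;
  attempts: number; // total attempts, including retries
  passed: boolean;  // final status after all retries
}

class FlakyLedger {
  private runs = new Map<string, { total: number; flaky: number }>();

  record(r: TestRecord): Outcome {
    const outcome: Outcome =
      !r.passed ? 'fail' : r.attempts > 1 ? 'flaky-pass' : 'pass';
    const entry = this.runs.get(r.name) ?? { total: 0, flaky: 0 };
    entry.total += 1;
    if (outcome === 'flaky-pass') entry.flaky += 1;
    this.runs.set(r.name, entry);
    return outcome;
  }

  // Share of runs that needed a retry, sorted worst-first: this is the
  // list you work through when prioritizing fixes.
  flakiestFirst(): Array<{ name: string; flakinessRate: number }> {
    return [...this.runs.entries()]
      .map(([name, e]) => ({ name, flakinessRate: e.flaky / e.total }))
      .sort((a, b) => b.flakinessRate - a.flakinessRate);
  }
}
```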
The results of systematic tracking are consistently eye-opening. Teams that implement tracking for the first time typically discover that a small number of tests are responsible for the vast majority of retries. In one well-documented case, 12% of the test suite accounted for over 80% of all retry attempts. These tests had been silently draining CI resources for months, with each individual test appearing "mostly green" due to retries.
Once you can see the data, prioritization becomes obvious. Sort tests by flakiness rate, start with the worst offenders, and fix them one by one. Each fix reduces total CI time, speeds up the feedback loop for developers, and improves trust in the test suite. This focused approach delivers measurable ROI quickly, often paying for the tracking investment within the first sprint.
4. Common Root Causes and How to Fix Them
Timing issues are the most common cause of flaky E2E tests. A test clicks a button and immediately asserts that a result appeared, but the result loads asynchronously and is sometimes not rendered within the assertion timeout. Playwright's auto-waiting mitigates this significantly compared to older frameworks, but tests can still be flaky if they wait for the wrong condition. The fix is to wait for the specific element or state that indicates the operation completed, not for an arbitrary timeout to elapse.
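The principle generalizes beyond any one framework. A minimal sketch of condition-based waiting, polling a predicate until it holds or a deadline passes (the helper name and defaults are illustrative):

```typescript
// "Wait for the specific condition, not a fixed timeout": poll a check
// until it returns true, failing only when the overall deadline passes.
async function waitForCondition(
  check: () => boolean | Promise<boolean>,
  { timeoutMs = 5000, intervalMs = 50 } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await check()) return;
    if (Date.now() > deadline) throw new Error('condition not met in time');
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Inside Playwright tests themselves, prefer the built-in web-first assertions such as `await expect(locator).toBeVisible()`, which poll the same way and produce better failure messages.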
Shared state between tests is the second most common cause. Test A creates a user, Test B assumes that user does not exist. If the tests run in a different order (or in parallel), Test B fails. The fix is test isolation: each test should set up its own state and clean up afterward, or use unique identifiers that prevent collisions. Playwright's browser context isolation helps with client-side state, but server-side state (databases, caches, message queues) requires explicit setup and teardown.
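The unique-identifier approach is often the cheapest fix, because it works even when tests run in parallel against a shared backend. A sketch, assuming a hypothetical `makeTestUser` helper that your setup code would feed into whatever API creates fixtures:

```typescript
import { randomUUID } from 'node:crypto';

// Each test builds its own user instead of sharing fixtures, so no test
// can observe or destroy another test's state.
function makeTestUser(testName: string) {
  const id = randomUUID();
  return {
    id,
    // A unique email prevents collisions when tests run in parallel
    // or in an unexpected order.
    email: `user-${id}@example.test`,
    label: `${testName}-${id}`,
  };
}
```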
Network dependencies cause a third category of flakiness. Tests that hit real external APIs (payment processors, email services, third-party integrations) will occasionally fail due to network timeouts or service outages. The fix is mocking external dependencies at the network level using Playwright's route interception or a service like WireMock. This makes tests faster, deterministic, and independent of external service availability.
Resource contention is less obvious but affects teams with large parallel test suites. When 50 tests run simultaneously, they compete for CPU, memory, database connections, and port allocations. Tests that pass reliably in isolation fail intermittently under load. The fix involves a combination of better resource management, reduced parallelism for resource-heavy tests, and ensuring that test infrastructure scales with the suite.
5. The Quarantine Strategy
While you work on fixing flaky tests, quarantining them prevents them from blocking the pipeline. A quarantined test still runs on every build, but its results are tracked separately and do not affect the build status. This preserves the signal (you can still see the test results) while removing the noise (flaky failures no longer block merges or deployments).
The quarantine should have clear ownership and time limits. Each quarantined test should be assigned to a specific engineer with a deadline for fixing or removing it. Without accountability, the quarantine becomes a permanent dumping ground where flaky tests go to be forgotten. A weekly review of quarantined tests ensures they get fixed, and a policy of removing (not just quarantining) tests that have been flaky for more than a defined period keeps the quarantine from growing indefinitely.
Some teams combine quarantine with automatic detection. When the tracking system identifies a test that has failed intermittently more than a threshold number of times in the past week, it automatically quarantines the test and creates a ticket for investigation. This removes the manual step of identifying and quarantining flaky tests, making the system self-maintaining. Tools like Assrt can help by analyzing test execution patterns and flagging tests with high flakiness scores for review.
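The detection rule itself is simple once the execution log exists. A sketch of the threshold check described above (the `RunLog` shape and default limits are illustrative; the ticket-creation side effect is left out):

```typescript
// Flag any test whose retried runs in a trailing window exceed a limit,
// making it a candidate for automatic quarantine.
interface RunLog {
  test: string;
  timestamp: number;  // ms since epoch
  neededRetry: boolean;
}

function testsToQuarantine(
  logs: RunLog[],
  now: number,
  { windowDays = 7, maxFlakyRuns = 3 } = {},
): string[] {
  const windowStart = now - windowDays * 24 * 60 * 60 * 1000;
  const counts = new Map<string, number>();
  for (const log of logs) {
    if (log.timestamp >= windowStart && log.neededRetry) {
      counts.set(log.test, (counts.get(log.test) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .filter(([, n]) => n > maxFlakyRuns)
    .map(([test]) => test);
}
```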
6. Preventing Flakiness Before It Starts
The most effective strategy is preventing flaky tests from entering the suite. This starts with test design patterns: always use explicit waits for specific conditions rather than fixed timeouts, always isolate test state, always mock external dependencies, and always verify that tests pass consistently by running them multiple times before merging.
A pre-merge flakiness check is particularly effective. Before a new test is merged, run it 10 to 20 times in CI. If it fails even once, it is likely flaky and should be fixed before merging. This check catches timing issues and state dependencies that a single run might miss. The additional CI time is a worthwhile investment because fixing a flaky test before it is merged is dramatically cheaper than diagnosing and fixing it after it has been failing intermittently for weeks.
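Playwright's `repeatEach` option makes this check easy to wire up. One possible configuration sketch (the `FLAKE_CHECK` variable and the count of 10 are conventions to adapt, not a prescribed setup):

```typescript
// playwright.config.ts -- opt-in pre-merge flakiness check
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // When FLAKE_CHECK is set (e.g. on PRs that add or change tests),
  // run each selected test 10 times; any single failure fails the check.
  repeatEach: process.env.FLAKE_CHECK ? 10 : 1,
  retries: 0, // no retries here: the point is to see every failure
});
```

A CI job would then run something like `FLAKE_CHECK=1 npx playwright test path/to/new.spec.ts` against only the changed spec files, keeping the extra CI time proportional to the size of the change.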
AI-generated tests from tools like Assrt can help with prevention by generating tests that follow deterministic patterns from the start. Because Assrt analyzes the actual application behavior during discovery, the generated tests use appropriate wait conditions for each interaction rather than guessing at timeouts. Running npx @m13v/assrt discover https://your-app.com produces Playwright tests with self-healing selectors and built-in stability patterns, giving you a flakiness-resistant starting point that you can extend with confidence.
Ready to automate your testing?
Assrt discovers test scenarios, writes Playwright tests from plain English, and self-heals when your UI changes.