Test Reliability

Flaky Tests: How to Find and Fix the Root Cause Instead of Just Retrying

Retrying flaky tests is a painkiller, not a cure. Every retry masks a real problem in your test infrastructure, your application, or both. Here is how to diagnose and fix the actual root cause.

1. Synchronization Issues: The Number One Cause

Research from Google's test infrastructure team and multiple industry surveys consistently identify synchronization issues as the leading cause of flaky tests, responsible for 45 to 60 percent of all flakiness. Synchronization flakiness occurs when a test interacts with the application before the application is ready. The test clicks a button that has not rendered yet, reads text from an element that is still loading, or asserts on a value that has not been updated by an asynchronous operation.

The pattern is predictable. A test passes on a developer's fast local machine because the application renders quickly enough. The same test fails intermittently on CI because the shared CI runner is slower, has less memory, or is handling multiple jobs simultaneously. The application takes 200 milliseconds longer to render, and the test's implicit timing assumption breaks.

The root cause is almost always a test that assumes synchronous behavior from an asynchronous system. Web applications are inherently asynchronous: network requests, DOM updates, animations, and JavaScript event loops all introduce variable timing. Any test that does not explicitly wait for the expected state before asserting is a flaky test waiting to happen.

2. Web-First Assertions vs Hardcoded Waits

The naive solution to synchronization flakiness is adding hardcoded waits: await page.waitForTimeout(3000). This is the worst possible fix. It makes the test slower (it always waits the full duration), it is still flaky (the operation might take longer than 3 seconds on a slow CI runner), and it accumulates across the suite. A hundred tests with 3-second waits add 5 minutes of pure waiting time.

The correct approach is web-first assertions, which Playwright supports natively. Instead of waiting a fixed duration and then checking, a web-first assertion polls the condition until it becomes true or a timeout expires. For example, await expect(page.getByText('Order confirmed')).toBeVisible() will repeatedly check for the text's visibility, resolving immediately when it appears. On a fast machine, this takes 50 milliseconds. On a slow CI runner, it might take 2 seconds. But it never waits longer than necessary, and it never fails because of arbitrary timing.

Converting hardcoded waits to web-first assertions is the single highest-impact change you can make to reduce flakiness. Audit your test suite for waitForTimeout, sleep, and setTimeout calls. Replace each one with an explicit wait for the specific condition the test needs. This typically reduces flakiness rates by 50 to 70 percent.
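The mechanism behind a web-first assertion can be sketched in plain TypeScript. This is not Playwright's implementation, just a minimal illustration of polling a condition until it holds instead of sleeping a fixed duration; pollUntil, the simulated readiness flag, and the timings are all invented for the example.

```typescript
// Minimal sketch of the polling idea behind web-first assertions:
// check the condition repeatedly and resolve as soon as it holds,
// or reject once the timeout budget is exhausted.
async function pollUntil(
  condition: () => boolean,
  timeoutMs = 5000,
  intervalMs = 50,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (condition()) return; // resolve immediately when ready
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs}ms`);
}

async function main() {
  // Simulated "application": becomes ready after a variable delay,
  // just like a page whose render time depends on the machine.
  let ready = false;
  setTimeout(() => { ready = true; }, 120);

  // A hardcoded wait would always pay a fixed cost here (e.g. 3000ms).
  // The polled wait resolves shortly after the condition becomes true.
  const start = Date.now();
  await pollUntil(() => ready);
  console.log(`condition met after ~${Date.now() - start}ms`);
}

main();
```

The same poll-or-timeout loop, with richer conditions (visibility, text content, attribute values), is what makes web-first assertions both faster than fixed sleeps on fast machines and more tolerant on slow ones.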

3. Tagging and Quarantining Flaky Tests in CI

While you work on fixing root causes, you need a system to prevent flaky tests from blocking your pipeline. The industry-standard approach is quarantine tagging. When a test is identified as flaky, it receives a @flaky tag or annotation. CI pipelines are configured to run quarantined tests separately: their failures do not block merges, but their results are still recorded and reported.

The quarantine is not a permanent state. It is a triage mechanism with an expiration date. When a test is quarantined, a ticket is created to investigate and fix the root cause within a defined SLA (typically one to two weeks). If the SLA expires without a fix, the team must decide: fix the test, rewrite it, or delete it. Allowing tests to remain quarantined indefinitely defeats the purpose and lets flakiness accumulate.

Implement quarantine at the CI configuration level, not in the test code. Playwright supports test annotations and grep filtering, so you can run npx playwright test --grep-invert @flaky for the blocking pipeline and npx playwright test --grep @flaky for the quarantine report. This separation keeps flaky tests visible without letting them disrupt development velocity.
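One way to express that split directly in playwright.config.ts is with two projects filtered by tag, using Playwright's grep and grepInvert project options. A minimal sketch, assuming tests are tagged @flaky in their titles or via the tag annotation; the project names are illustrative:

```typescript
// playwright.config.ts — sketch of a quarantine split via project filters.
// The "blocking" project runs everything except quarantined tests; the
// "quarantine" project runs only tagged tests. Configure CI so that only
// the blocking project's failures gate merges.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    {
      name: 'blocking',
      grepInvert: /@flaky/, // exclude quarantined tests
    },
    {
      name: 'quarantine',
      grep: /@flaky/, // only quarantined tests
      retries: 2,     // still record their results, with retries
    },
  ],
});
```

CI then invokes npx playwright test --project=blocking for the merge gate and npx playwright test --project=quarantine for the non-blocking report, which is equivalent to the grep flags above but keeps the filtering versioned alongside the tests.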

4. Tracking Retry Rates Per Test

Most CI systems support automatic test retries, and most teams enable them. The problem is that retries hide information. A test that passes on the second attempt looks green in the pipeline, and nobody investigates. Over time, the suite accumulates dozens of tests that "pass" only because they are retried, each one adding latency and masking a real issue.

The solution is per-test retry rate tracking. For every test in your suite, maintain a metric: the percentage of runs where the test required at least one retry to pass. This data transforms flakiness from an invisible problem into a measurable one. You can sort tests by retry rate, identify the worst offenders, and prioritize fixes based on impact. Tools like Allure, Playwright's built-in reporter, and CI analytics platforms (BuildPulse, Trunk) can aggregate this data automatically.
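As a sketch of what per-test retry tracking computes, assuming each CI run yields a record of how many attempts a test needed to pass (the RunRecord shape and the aggregation are invented for the example, not any particular tool's format):

```typescript
// Sketch: compute per-test retry rates from raw CI run records.
// A run "required a retry" if the test passed only on attempt 2 or later.
interface RunRecord {
  testId: string;
  attempts: number; // 1 = passed first try, 2+ = needed retries
}

function retryRates(runs: RunRecord[]): Map<string, number> {
  const totals = new Map<string, { runs: number; retried: number }>();
  for (const r of runs) {
    const t = totals.get(r.testId) ?? { runs: 0, retried: 0 };
    t.runs += 1;
    if (r.attempts > 1) t.retried += 1;
    totals.set(r.testId, t);
  }
  // Rate = fraction of runs that needed at least one retry.
  const rates = new Map<string, number>();
  for (const [id, t] of totals) {
    rates.set(id, t.retried / t.runs);
  }
  return rates;
}
```

Whatever tool produces the raw records, the output is the same: a sortable per-test rate that makes the worst offenders visible.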

Assrt generates tests using Playwright's web-first assertions by default, which means generated tests start with proper synchronization patterns and are less likely to become flaky in the first place. When combined with retry rate tracking, teams can distinguish between flakiness in hand-written tests (which often have synchronization issues) and flakiness in generated tests (which typically indicates an application-level race condition worth investigating).

5. The 10% Retry Threshold for Investigation

Not every test that occasionally fails deserves immediate attention. Infrastructure hiccups, transient network issues, and CI resource contention can cause occasional failures that are not worth investigating. The question is: at what retry rate does a test warrant investigation? Based on data from teams running large Playwright suites, the practical threshold is 10 percent.

A test with a retry rate below 10 percent (fewer than 1 in 10 runs require a retry) is likely experiencing environmental noise. Monitor it, but do not prioritize a fix. A test with a retry rate above 10 percent has a systemic issue: a synchronization problem, a data dependency, a race condition in the application, or flawed test isolation. These tests should be investigated, and the root cause should be fixed or the test should be quarantined until it can be fixed.

Apply this threshold consistently. Run a weekly report that lists all tests above the 10 percent retry threshold, sorted by retry rate. Assign the top five to engineers for investigation. Over the course of a quarter, this practice steadily reduces suite-wide flakiness. Teams that adopt this discipline typically see their overall retry rate drop from 15 to 20 percent down to 2 to 3 percent within three months, which translates directly into faster CI times and higher developer confidence in test results.
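The weekly report described above reduces to a small filter-and-sort. A sketch, assuming retry rates have already been aggregated per test (the TestStat shape is hypothetical):

```typescript
// Sketch of the weekly flakiness report: tests above the retry-rate
// threshold, sorted worst-first, with the top N assigned for investigation.
interface TestStat {
  testId: string;
  retryRate: number; // fraction in [0, 1], e.g. 0.12 = 12%
}

function weeklyReport(
  stats: TestStat[],
  threshold = 0.1, // the 10% investigation threshold
  topN = 5,        // how many to assign to engineers this week
): TestStat[] {
  return stats
    .filter((s) => s.retryRate > threshold)
    .sort((a, b) => b.retryRate - a.retryRate)
    .slice(0, topN);
}
```

Tests below the threshold stay on the monitoring list; tests above it enter the report in priority order, which is what makes the quarterly reduction systematic rather than ad hoc.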
