How to Audit Flaky E2E Tests Without Burning Your Entire Sprint
Flaky tests are the single most common line item that blows up codebase audit estimates. The root causes are scattered across test data management, selector strategy, and missing wait conditions, which means a single "fix flaky tests" ticket can hide three separate workstreams. This guide walks through how to audit flaky E2E tests systematically so you can scope the work accurately and actually fix it.
1. Why Flaky Tests Resist Accurate Estimation
When someone says "we have flaky E2E tests," what they actually have is three or four different problems wearing a trench coat. A test that fails intermittently on CI but passes locally could be a race condition in the application, a missing wait for an API response, stale test data from a previous run, or a CSS animation that finishes at different speeds on different hardware. Each of those has a completely different fix.
This is why "fix flaky tests" as a single line item in a codebase audit is almost always underestimated. The person scoping the work sees 15 flaky tests and estimates two days. What they actually find is that 5 tests have selector problems, 4 have race conditions tied to application state, 3 depend on test data that is shared between parallel runs, and 3 have hardcoded timeouts that only work on the developer's machine. That is four different workstreams, not one ticket.
The audit itself is the most important deliverable. Before anyone writes a single fix, you need a categorized inventory of what is actually wrong. Without that, you are debugging your CI pipeline for free and calling it a codebase audit.
2. The Three Root Causes of E2E Flakiness
Almost every flaky E2E test falls into one of three buckets. Categorizing each failure by root cause is what turns a vague "tests are flaky" complaint into an actionable plan.
Test data management. Tests that share a database, use hardcoded IDs, or depend on data created by other tests will fail when run in parallel or in a different order. The classic symptom: tests pass individually but fail when the full suite runs. The fix is usually test isolation through per-test data seeding and cleanup, or using unique identifiers generated at runtime. This category often takes the most calendar time because it requires understanding the application's data model and identifying every shared dependency.
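One low-effort version of the runtime-unique-identifier fix is a small helper module that every test imports. A minimal sketch, where the helper names and email domain are illustrative rather than from any particular codebase:

```typescript
// Sketch: generate collision-free identifiers at runtime so parallel
// test workers never share rows. The names (uniqueEmail, uniqueSku)
// and the example.test domain are hypothetical.
import { randomUUID } from "crypto";

export function uniqueEmail(prefix = "e2e"): string {
  // randomUUID guarantees uniqueness across parallel workers; the
  // prefix makes leftover rows easy to find and clean up later.
  return `${prefix}-${randomUUID()}@example.test`;
}

export function uniqueSku(prefix = "e2e"): string {
  return `${prefix}-${randomUUID().slice(0, 8)}`;
}
```

Tests that create their records through helpers like these can run in any order and at any parallelism without colliding on hardcoded IDs.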
Selector strategy. Tests that use fragile CSS selectors like .container > div:nth-child(2) > button break whenever the DOM structure changes. But selector flakiness also shows up in subtler ways: an element that matches multiple nodes depending on viewport size, a dynamic class name that changes between builds, or a selector that finds the right element before it is in an interactive state. Moving to getByRole, getByText, or data-testid attributes eliminates most of these issues. This is usually the fastest category to fix because each test can be updated independently.
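As a before/after illustration, here is what the selector migration looks like in a Playwright spec. The page URL, button label, and test id are hypothetical:

```typescript
import { test, expect } from "@playwright/test";

test("submit order", async ({ page }) => {
  await page.goto("/checkout");

  // Fragile: breaks whenever the DOM structure shifts.
  // await page.locator(".container > div:nth-child(2) > button").click();

  // Resilient: tied to what the user sees, not where it sits in the DOM.
  await page.getByRole("button", { name: "Place order" }).click();

  // data-testid as a fallback when an element has no accessible name.
  await expect(page.getByTestId("order-confirmation")).toBeVisible();
});
```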
Missing wait conditions. This is the most common root cause and the most misunderstood. Playwright auto-waits for elements to be actionable before interacting with them, but it cannot guess which application state your assertion depends on. If your test clicks a button and immediately checks for a result that depends on an API call, the assertion will sometimes pass (the API was fast) and sometimes fail (the API was slow). The fix is explicit synchronization: waitForResponse for specific API endpoints, waitForSelector for UI state changes, or Playwright's web-first expect().toBeVisible() assertions, which auto-retry until they pass or time out. Hardcoded waitForTimeout(3000) calls are a red flag that wait conditions were never properly analyzed.
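A sketch of the explicit-wait pattern in Playwright, assuming a hypothetical profile-save flow and /api/profile endpoint:

```typescript
import { test, expect } from "@playwright/test";

test("saving a profile waits on the API, not a timer", async ({ page }) => {
  await page.goto("/settings/profile");

  // Register the wait BEFORE the click so a fast response is not missed.
  const saved = page.waitForResponse(
    (res) => res.url().includes("/api/profile") && res.status() === 200
  );
  await page.getByRole("button", { name: "Save" }).click();
  await saved;

  // The web-first assertion auto-retries until the element appears or
  // the timeout hits, replacing any hardcoded waitForTimeout(3000).
  await expect(page.getByText("Profile updated")).toBeVisible();
});
```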
3. A Repeatable Audit Methodology
A good flaky test audit produces a spreadsheet, not a fix. The output is a categorized list of every flaky test with its root cause, severity, and estimated fix effort. Here is how to build that list.
Step 1: Collect failure data. Pull the last 30 days of CI runs. Most CI systems (GitHub Actions, CircleCI, GitLab CI) let you export test results as JUnit XML. Parse these to find every test that has failed at least once in the last month. Sort by failure frequency. A test that fails 30% of the time is a different problem than one that fails 2% of the time.
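The tallying step can be sketched in a short script. This version uses a naive string split rather than a full XML parser, which is usually enough for standard JUnit output; reports with unusual CDATA content may need a real parser:

```typescript
// Sketch: rank tests by failure frequency from a pile of JUnit XML
// report strings (one per CI run).

interface FlakeStats { runs: number; failures: number; }

export function tallyFailures(reports: string[]): Map<string, FlakeStats> {
  const stats = new Map<string, FlakeStats>();
  for (const xml of reports) {
    // Split the report at each <testcase>; each chunk then holds one
    // testcase's attributes plus any <failure>/<error> children.
    const chunks = xml.split("<testcase").slice(1);
    for (const chunk of chunks) {
      const nameMatch = chunk.match(/\bname="([^"]+)"/);
      if (!nameMatch) continue;
      const entry = stats.get(nameMatch[1]) ?? { runs: 0, failures: 0 };
      entry.runs += 1;
      if (chunk.includes("<failure") || chunk.includes("<error")) {
        entry.failures += 1;
      }
      stats.set(nameMatch[1], entry);
    }
  }
  return stats;
}

// Sort by failure rate so the flakiest tests surface first.
export function rankByFlakiness(stats: Map<string, FlakeStats>) {
  return [...stats.entries()]
    .map(([name, s]) => ({ name, rate: s.failures / s.runs, ...s }))
    .filter((t) => t.failures > 0)
    .sort((a, b) => b.rate - a.rate);
}
```

Feeding in 30 days of reports and printing the ranked list gives you the first column of the audit spreadsheet.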
Step 2: Reproduce locally with parallelism. Run the full suite locally with the same parallelism setting as CI. Many flaky tests only surface under parallel execution because they depend on shared state. If a test passes in isolation but fails in the suite, you have a data isolation problem.
Step 3: Categorize each failure. For each flaky test, read the code and assign it to one of the three categories above: data management, selector strategy, or wait conditions. Some tests will have multiple issues. Flag those separately because the fix order matters: there is no point fixing a selector if the test will still fail due to a data race.
Step 4: Estimate per category, not per test. Selector fixes are usually mechanical, roughly 15 to 30 minutes per test. Wait condition fixes require understanding the application flow, typically 30 minutes to 2 hours per test. Data isolation fixes can require structural changes to test setup and may take a full day per shared dependency. Sum these up by category to get a realistic total.
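The per-category rules of thumb above can be rolled into a quick calculator. The per-test hours below are rough midpoints of the ranges just given, and the test counts reuse the hypothetical 15-test audit from earlier (4 race conditions plus 3 hardcoded timeouts both land in the wait-conditions bucket):

```typescript
// Sketch: roll up audit findings into a per-category estimate. All
// numbers are illustrative midpoints, not measured data.

interface Category { tests: number; hoursPerTest: number; }

const audit: Record<string, Category> = {
  selectors:     { tests: 5, hoursPerTest: 0.4 },  // ~15-30 min each
  waits:         { tests: 7, hoursPerTest: 1.25 }, // ~30 min - 2 h each
  dataIsolation: { tests: 3, hoursPerTest: 8 },    // up to a day per shared dependency
};

export function totalHours(findings: Record<string, Category>): number {
  return Object.values(findings).reduce(
    (sum, c) => sum + c.tests * c.hoursPerTest,
    0
  );
}

// 5*0.4 + 7*1.25 + 3*8 = 2 + 8.75 + 24 = 34.75 hours
```

Even with rough inputs, summing by category like this exposes how much of the total sits in data isolation, which is exactly the workstream to flag separately.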
4. Scoping the Fix: Separate Workstreams, Not One Ticket
The biggest mistake in scoping flaky test fixes is treating them as a single deliverable. If you quote a client (or your own management) a flat number for "fixing flaky tests," you will either underbid and end up working for free, or overbid and lose the engagement. Split it.
Workstream A: Selector migration. Scope this as a batch. Count the number of tests using fragile selectors, estimate 20 minutes per test on average, add 20% buffer. This is predictable work. A junior engineer can do most of it with a style guide and code review.
Workstream B: Wait condition fixes. This requires someone who understands the application's async behavior. Each fix is different. Budget more time per test and expect some yak shaving, because sometimes the test reveals that the application itself has a race condition that was previously hidden by a hardcoded wait.
Workstream C: Test data isolation. This is the one to flag as a follow-up engagement if you are doing contract work. Data isolation often requires changes to the test infrastructure (custom fixtures, database seeding scripts, container-based test environments) that are not scoped as part of a typical codebase audit. Bundling it in leads to scope creep. Call it out as a separate phase with its own estimate.
Each workstream should have clear acceptance criteria. For selector migration: zero tests using positional CSS selectors. For wait conditions: zero waitForTimeout calls in the test suite. For data isolation: full test suite passes 10 consecutive runs with full parallelism.
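These acceptance criteria are mechanically checkable. A minimal sketch of a source scan for the banned patterns (in practice you would read file contents from disk and wire the check into CI):

```typescript
// Sketch: scan test sources for the two patterns the workstreams are
// supposed to eliminate. File names and rule names are illustrative.

const bannedPatterns: Record<string, RegExp> = {
  hardcodedTimeout: /\bwaitForTimeout\s*\(/,  // workstream B criterion
  positionalSelector: /:nth-child\(/,         // workstream A criterion
};

export function violations(
  files: Record<string, string>
): { file: string; rule: string }[] {
  const hits: { file: string; rule: string }[] = [];
  for (const [file, source] of Object.entries(files)) {
    for (const [rule, re] of Object.entries(bannedPatterns)) {
      if (re.test(source)) hits.push({ file, rule });
    }
  }
  return hits;
}
```

A zero-violation run of this scan, plus 10 consecutive green suite runs at full parallelism, is a defensible definition of done.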
5. Tools That Help with Test Audits and Regeneration
Once you have your audit spreadsheet, the fix phase can be partially automated. There are several approaches depending on the size of your test suite and how much of it needs rewriting.
Playwright's built-in tracing. Set trace: 'on-first-retry' in your Playwright config. This captures a full trace (screenshots, network requests, DOM snapshots) for every flaky test failure. Instead of guessing why a test failed, you can open the trace viewer and see exactly what the browser was doing at the moment of failure. This alone can cut your diagnosis time by half.
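Assuming a standard Playwright setup, the config change is small. A sketch of the relevant playwright.config.ts fragment:

```typescript
// playwright.config.ts -- capture a trace only when a test fails and
// is retried, so green runs stay fast while flaky failures become
// fully diagnosable in the trace viewer.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  retries: 2, // a retry is required for the on-first-retry trace to trigger
  use: {
    trace: "on-first-retry", // screenshots, network, DOM snapshots per action
  },
});
```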
AI test generation tools. For tests that need a full rewrite, AI generation tools can produce a solid first draft. Assrt, for example, crawls your application and generates Playwright tests with proper selector strategy and wait conditions built in. The output is standard Playwright code that you can review, modify, and commit. Other tools in this space include Playwright Codegen (records browser interactions and generates code), Testim (visual test builder), and Katalon (record and playback with AI suggestions). The key difference between these tools is whether they output standard framework code or lock you into a proprietary format.
CI analytics platforms. Tools like Buildkite Test Analytics, Datadog CI Visibility, and Trunk Flaky Tests track test flakiness over time and surface trends. If you are doing an audit for a client, pulling flakiness percentages from one of these tools adds concrete data to your report. "Your checkout flow test has a 23% failure rate over the last 30 days" is more compelling than "some tests are flaky."
The important thing is that auditing flaky tests and fixing them are two distinct activities with different skill requirements and different timelines. The audit should produce a clear, categorized inventory. The fixes should be scoped as separate workstreams with their own estimates. Treat them as one blob and you will spend your first week just figuring out what you are dealing with, which is exactly the audit you should have done upfront.