How to Fix Flaky Tests: Causes, Solutions & Prevention
Flaky tests erode trust in your test suite, slow down CI pipelines, and waste engineering hours on false failures. This guide covers the root causes of flakiness, systematic diagnosis techniques, targeted fixes for each cause, and strategies to prevent flakiness from recurring.
“Engineering teams report spending a large share of their sprint time investigating and fixing flaky test failures, with most of that time spent on tests that were not actually detecting bugs.”
1. What Makes a Test Flaky?
A flaky test is a test that produces different results (pass or fail) when run against the same code, without any changes to the application or the test itself. It passes on one run and fails on the next. This non-deterministic behavior is what distinguishes a flaky test from a genuinely broken one.
Flakiness is especially damaging because it is ambiguous. When a test fails intermittently, developers cannot tell whether the failure represents a real bug or a test infrastructure issue. This ambiguity leads to one of two harmful behaviors: either developers spend hours investigating false failures, or they start ignoring all failures from that test. Both outcomes are costly.
Impact on CI/CD Pipelines
In a continuous integration environment, flaky tests create a cascade of problems. A flaky test that fails on a pull request blocks the merge until someone investigates, retries, or skips the test. If the team adopts a "just retry it" culture, CI run times double or triple as builds are re-triggered repeatedly. If the team starts skipping flaky tests, the test suite gradually loses coverage.
Google's internal research found that approximately 16% of their test suite exhibited flaky behavior, and that flaky tests were responsible for a disproportionate share of developer frustration and lost productivity. At scale, even a small percentage of flaky tests can consume enormous engineering resources.
2. The Real Cost of Flaky Tests
The cost of flaky tests extends far beyond the time spent investigating individual failures. Flakiness creates compounding costs across the entire development lifecycle.
Direct Engineering Time
Industry surveys suggest that teams with significant flaky test problems spend 30% to 40% of their sprint time on test maintenance. For a team of ten engineers at an average cost of $180,000 per year, that translates to $540,000 to $720,000 annually in lost productivity. Most of this time is spent on investigation rather than actual fixes, because diagnosing the root cause of intermittent failures is inherently difficult.
Trust Erosion
This is the most insidious cost. When developers lose trust in the test suite, they stop relying on it as a safety net. They merge code with failing tests, skip test runs before deployment, and eventually stop writing tests altogether. The entire investment in test automation depreciates to zero when nobody trusts the results.
Pipeline Slowdown
Retries are the most common short-term response to flakiness, and they have a direct impact on CI throughput. If a pipeline normally takes 15 minutes and flaky tests force an average of half an extra run per build, the effective pipeline time becomes 22.5 minutes. Across 50 builds per day, that is 375 extra minutes of CI compute time, plus the developer wait time.
Escaped Bugs
Perhaps the most dangerous cost: flaky tests can mask real failures. A test that detects a genuine regression may be dismissed as "just flaky" and skipped. The bug then reaches production, where industry studies consistently estimate the cost of fixing it at 10x to 100x the cost of catching it during development.
3. Common Causes of Flaky Tests
Understanding the root causes of flakiness is essential for fixing it. Most flaky tests fall into one of five categories.
Timing and Race Conditions
The most common cause of flakiness. A test clicks a button before the page has finished loading. An assertion checks a value before an API response has been processed. An animation is still running when the test tries to interact with an element. These timing issues are especially prevalent in single-page applications where rendering and data loading happen asynchronously.
// FLAKY: Instant check, no waiting for the element to appear
await page.click('#submit-button');
expect(await page.locator('.success-message').isVisible()).toBe(true);
// This fails intermittently because the success message
// takes 200-500ms to appear after the click
// FIXED: Wait for the specific condition
await page.click('#submit-button');
await expect(page.locator('.success-message')).toBeVisible({
timeout: 5000,
});
Test Data Dependencies
Tests that depend on specific data existing in a database or external service are inherently fragile. If another test modifies the data, if the database is reset between runs, or if the external service returns different results, the test fails. Shared mutable state is one of the most difficult sources of flakiness to eliminate because it often involves interactions between tests that are not obvious from reading any single test in isolation.
Environment Differences
Tests that pass on a developer's machine but fail in CI often suffer from environment-specific issues. Different screen resolutions, browser versions, operating systems, timezone settings, locale configurations, and available system resources all contribute to environment-dependent flakiness. A test that relies on a specific viewport width may pass on a 1920x1080 display but fail in a headless CI environment with a default viewport of 800x600.
Shared State Between Tests
When tests share browser state (cookies, local storage, session data), database records, or global variables, they become order-dependent. Test A may set a cookie that Test B relies on. If the execution order changes (as happens with parallel execution or test shuffling), Test B fails. This is particularly common in end-to-end test suites where each test does not start with a clean browser context.
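The order dependence described above can be reduced to a minimal sketch in plain TypeScript (the test names and shared object are hypothetical, standing in for leaked cookies or database rows):

```typescript
// Two "tests" coupled through a shared mutable object.
// testB passes only if testA has already run.
const session: { token?: string } = {};

function testA_login(): void {
  // Leaks state: the token survives into later tests
  session.token = 'abc123';
}

function testB_viewProfile(): boolean {
  // Hidden dependency on testA having run first
  return session.token !== undefined;
}
```

Run in the written order, testB passes; shuffle or parallelize the suite and it fails, even though neither test changed.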
Network Flakiness
Tests that make real network requests to external services are at the mercy of network latency, service availability, and rate limiting. An API that responds in 100ms during local development may take 2 seconds in CI, or may occasionally return a 429 (rate limited) or 503 (service unavailable) response. Even internal service-to-service calls within a test environment can exhibit variable latency under load.
4. Diagnosis Techniques
Before you can fix a flaky test, you need to understand why it is flaky. These techniques help you identify the root cause efficiently.
Retry Analysis
Run the failing test 50 to 100 times in isolation and record the results. If the test fails consistently when run alone, the problem is within the test itself (usually timing). If it only fails when run as part of the full suite, the problem is likely shared state or test ordering. The failure rate also provides useful signal: a 5% failure rate suggests a subtle timing issue, while a 50% failure rate points to a more fundamental problem.
# Run a single test 100 times and collect results
for i in $(seq 1 100); do
npx playwright test tests/checkout.spec.ts \
--reporter=json 2>/dev/null | \
jq -r '.suites[0].specs[0].tests[0].results[0].status' >> results.txt
done
# Count pass/fail ratio
echo "Pass: $(grep -c 'passed' results.txt)"
echo "Fail: $(grep -c 'failed' results.txt)"
# Playwright has a built-in repeat option
npx playwright test tests/checkout.spec.ts --repeat-each=100
Bisecting Failures
If a test only fails when run with other tests, bisect the test suite to find the interfering test. Divide the suite in half, run each half with the flaky test, and see which half causes the failure. Continue bisecting until you isolate the specific test that creates the problematic shared state.
# Playwright supports running specific test files together
# First, identify which test file causes interference
npx playwright test tests/checkout.spec.ts tests/auth.spec.ts
npx playwright test tests/checkout.spec.ts tests/cart.spec.ts
# Use --shard to split the suite for bisection
npx playwright test --shard=1/2
npx playwright test --shard=2/2
Enhanced Logging
Add detailed logging around the failing assertion. Log the page URL, the DOM state, pending network requests, and any relevant application state at the point of failure. Compare logs from passing and failing runs to identify the difference. Often, the logs reveal that the application was in an unexpected state (such as showing a loading spinner or error message) when the assertion ran.
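One lightweight way to structure this is a snapshot helper that collapses the state you care about into a single greppable line, so passing and failing runs can be diffed side by side. The shape below is a sketch with illustrative field names; in practice you would populate it from Playwright calls such as page.url() and locator visibility checks:

```typescript
// Hypothetical diagnostic snapshot taken just before a fragile assertion.
interface PageSnapshot {
  url: string;
  pendingRequests: string[]; // e.g. tracked via page.on('request') / 'requestfinished'
  spinnerVisible: boolean;   // e.g. await page.locator('.spinner').isVisible()
}

// Formats one snapshot as a single log line for easy diffing
// between passing and failing runs.
function formatSnapshot(label: string, s: PageSnapshot): string {
  return `[${label}] url=${s.url} pending=${s.pendingRequests.length} spinner=${s.spinnerVisible}`;
}
```

Logging one of these lines before each assertion in a flaky test often makes the difference obvious: the failing runs show a pending request or a visible spinner that the passing runs do not.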
Video and Trace Recording
Playwright and other modern frameworks support video recording and trace capture for test runs. Enable these for flaky tests in CI. The visual recording often reveals the issue instantly: you can see the page was still loading, a popup was blocking the element, or the wrong page was displayed. Traces are especially valuable because they capture DOM snapshots, network requests, and console logs alongside the video.
// playwright.config.ts: enable traces for retries
import { defineConfig } from '@playwright/test';
export default defineConfig({
retries: 2,
use: {
// Record trace on first retry for debugging
trace: 'on-first-retry',
// Record video on failure
video: 'retain-on-failure',
// Capture screenshots on failure
screenshot: 'only-on-failure',
},
});
5. Solutions by Cause
Once you have identified the cause, apply the targeted fix. Here are proven solutions for each category of flakiness.
Timing Issues: Use Explicit Waits
Replace all fixed-duration sleeps with explicit condition waits. Instead of await page.waitForTimeout(2000), wait for the specific condition you need:
// BAD: Fixed sleep (arbitrary, slow, still flaky)
await page.waitForTimeout(3000);
await page.click('.results-list li:first-child');
// GOOD: Wait for the specific condition
await page.waitForSelector('.results-list li', {
state: 'visible',
});
await page.click('.results-list li:first-child');
// BETTER: Wait for network idle + element visible
await page.waitForLoadState('networkidle');
await expect(page.locator('.results-list li')).toHaveCount(10);
await page.click('.results-list li:first-child');
// BEST: Use Playwright's auto-waiting with locators
// Playwright locators automatically wait for elements
const firstResult = page.locator('.results-list li').first();
await firstResult.click(); // Auto-waits for visible + stable
Test Data: Use Data Factories
Each test should create its own data and clean it up afterward. Use factory functions that generate unique test data for every run.
// test-helpers/factories.ts
import { randomUUID } from 'crypto';
export function createTestUser() {
const id = randomUUID().slice(0, 8);
return {
email: `testuser-${id}@example.com`,
password: 'TestPass123!',
name: `Test User ${id}`,
};
}
// In your test
test('user can update profile', async ({ page }) => {
const user = createTestUser();
// Create user via API (fast, no UI needed)
await api.createUser(user);
// Test the actual UI flow
await page.goto('/login');
await page.fill('[name="email"]', user.email);
await page.fill('[name="password"]', user.password);
await page.click('button[type="submit"]');
// ... test continues with isolated data
});
Shared State: Isolate Browser Contexts
Use a fresh browser context for each test. This ensures cookies, local storage, and session data do not leak between tests.
// playwright.config.ts: fresh context per test (default)
export default defineConfig({
use: {
// Each test gets a fresh browser context by default.
// Leave storageState unset so cookies, local storage,
// and session data start clean in every test.
storageState: undefined,
},
});
// For tests that need authentication, use a shared setup
// but still isolate the actual test context
test.describe('authenticated flows', () => {
test.use({ storageState: 'auth-state.json' });
test('can view dashboard', async ({ page }) => {
// Starts with auth cookies but clean local state
await page.goto('/dashboard');
});
});
Network Issues: Mock External Services
For end-to-end tests that interact with third-party APIs, use route interception to return deterministic responses.
// Intercept external API calls with deterministic responses
test('displays weather widget', async ({ page }) => {
// Mock the weather API
await page.route('**/api.weather.com/**', async (route) => {
await route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({
temperature: 72,
conditions: 'sunny',
city: 'San Francisco',
}),
});
});
await page.goto('/dashboard');
await expect(page.locator('.weather-widget')).toContainText('72');
});
Environment Issues: Standardize Configuration
Set explicit viewport sizes, timezones, locales, and browser versions in your configuration. This eliminates environment-dependent behavior.
// playwright.config.ts: standardize environment
export default defineConfig({
use: {
viewport: { width: 1280, height: 720 },
locale: 'en-US',
timezoneId: 'America/New_York',
colorScheme: 'dark',
// Use a specific browser version
channel: 'chrome',
},
// Run tests sequentially if parallel execution causes issues
workers: process.env.CI ? 1 : undefined,
});
6. Prevention Strategies
Fixing flaky tests is important, but preventing them from being written in the first place is far more effective. These strategies reduce the likelihood of introducing flakiness.
Test Independence
Every test should be able to run in complete isolation. It should not depend on any other test having run before it, and it should not leave behind state that affects other tests. This means each test creates its own data, authenticates its own session, and cleans up after itself. The overhead of this independence is small compared to the cost of debugging shared-state flakiness.
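One pattern that keeps cleanup honest is a small tracker that registers an undo action for every resource a test creates, then runs them in reverse order afterwards. This is a sketch, not any particular library's API:

```typescript
// Registers cleanup callbacks as a test creates resources,
// then runs them in reverse creation order, so records created
// last (which may depend on earlier ones) are deleted first.
class ResourceTracker {
  private cleanups: Array<() => void | Promise<void>> = [];

  track(cleanup: () => void | Promise<void>): void {
    this.cleanups.push(cleanup);
  }

  async cleanupAll(): Promise<void> {
    while (this.cleanups.length > 0) {
      await this.cleanups.pop()!();
    }
  }
}
```

In Playwright this pairs naturally with a custom fixture or an afterEach hook, so every test tears down exactly what it built, regardless of whether it passed.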
Deterministic Data
Tests should produce the same result every time they run. This requires deterministic test data. Avoid using real dates (use fixed dates or mock the clock), random values (use seeded random generators), or external data sources (use factories or fixtures). If a test depends on the current time, mock the system clock.
// Mock the system clock for time-dependent tests
test('shows "posted 5 minutes ago"', async ({ page }) => {
// Set a fixed time
await page.clock.install({
time: new Date('2026-03-20T10:00:00Z'),
});
await page.goto('/feed');
// The post was created at 9:55, so "5 minutes ago"
await expect(
page.locator('[data-testid="post-timestamp"]')
).toContainText('5 minutes ago');
});
Proper Assertions
Use assertions that wait for conditions rather than checking instant state. Playwright's expect assertions with auto-retrying are designed for this. They repeatedly check the condition until it is met or the timeout expires, which eliminates most timing-related flakiness.
// BAD: Instant assertion (flaky)
const text = await page.textContent('.status');
expect(text).toBe('Complete');
// GOOD: Auto-retrying assertion (stable)
await expect(page.locator('.status')).toHaveText('Complete');
// BAD: Checking element count at a single point in time
const items = await page.$$('.list-item');
expect(items.length).toBe(5);
// GOOD: Auto-retrying count assertion
await expect(page.locator('.list-item')).toHaveCount(5);
Parallel-Safe Architecture
Design your test suite to run in parallel from the start. This forces good practices: isolated data, independent contexts, and no shared mutable state. Tests that are parallel-safe are almost never flaky due to ordering issues.
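In Playwright, full parallelism is an explicit config switch; a minimal fragment (the CI worker count is an arbitrary example) might look like:

```typescript
// playwright.config.ts: run every test in every file in parallel
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Tests within a single file also run in parallel, which surfaces
  // shared-state problems immediately instead of hiding them
  fullyParallel: true,
  workers: process.env.CI ? 4 : undefined,
});
```

Turning this on early is deliberate: an ordering bug that would lurk for months in a sequential suite fails loudly on the first parallel run.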
7. How AI Tools Help
AI-powered testing tools address flakiness at a fundamental level by removing the brittle assumptions that cause tests to break. Here is how modern AI approaches tackle each category of flakiness.
Self-Healing Selectors
When a selector changes, AI tools detect the change and find the correct element through alternative strategies. This eliminates the entire category of selector-related flakiness. Instead of failing with "element not found," the test adapts and continues. The healing event is logged for team review.
Smart Waits
AI tools observe application behavior patterns to determine the optimal wait strategy. Rather than relying on developers to manually add waits at every interaction point, the AI learns when the application is ready for the next action. It monitors network requests, DOM mutations, and animation states to make intelligent decisions about timing.
The Assrt Approach
Assrt takes flakiness prevention further by using intent-based testing. Instead of writing tests with specific selectors and hardcoded waits, you describe what the test should do in natural language. Assrt interprets the intent at runtime and handles element finding, waiting, and interaction automatically.
// Traditional test: multiple flakiness vectors
test('checkout flow', async ({ page }) => {
await page.goto('/products');
await page.waitForSelector('.product-grid'); // timing issue?
await page.click('.product-card:first-child'); // selector change?
await page.click('#add-to-cart'); // ID removed?
await page.waitForTimeout(1000); // arbitrary wait
await page.click('[data-testid="cart-icon"]'); // test-id renamed?
await page.click('.checkout-btn'); // class changed?
await expect(page.locator('.order-confirm'))
.toBeVisible(); // timing?
});
// Assrt test: zero flakiness vectors
test('checkout flow', async ({ assrt }) => {
await assrt.do('Go to the products page');
await assrt.do('Click the first product');
await assrt.do('Add it to the cart');
await assrt.do('Open the cart');
await assrt.do('Proceed to checkout');
await assrt.expect('Order confirmation is displayed');
});
The Assrt test has no selectors that can break, no hardcoded waits that can be too short, and no assumptions about DOM structure. It describes the user's intent, and Assrt figures out how to execute it against the current state of the application.
8. CI/CD Integration for Flaky Test Detection
Detecting flaky tests early requires integrating flakiness detection into your CI/CD pipeline. Here are the key strategies.
Automatic Retry with Reporting
Configure your test runner to retry failed tests automatically, but report the retry as a flakiness signal rather than silently passing.
// playwright.config.ts
export default defineConfig({
// Retry failed tests twice
retries: process.env.CI ? 2 : 0,
// Custom reporter that flags flaky tests
reporter: [
['html'],
['json', { outputFile: 'test-results.json' }],
],
});
// Post-pipeline script: detect and report flaky tests
// parse-flaky.js
const results = require('./test-results.json');
const flaky = results.suites
.flatMap(s => s.specs)
.filter(spec =>
spec.tests.some(t =>
t.results.length > 1 &&
t.results.at(-1).status === 'passed'
)
);
if (flaky.length > 0) {
console.log('Flaky tests detected:');
flaky.forEach(f => console.log(` - ${f.title}`));
// Send to Slack, create a tracking issue, etc.
}
Flaky Test Quarantine
When a test is identified as flaky beyond a threshold (for example, failing more than 3 times in 7 days without code changes), automatically quarantine it. Quarantined tests still run, but their failures do not block the pipeline. This keeps CI green while preserving visibility into the flaky test. Create an automated issue or ticket for the quarantined test so it gets assigned and fixed.
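One lightweight way to implement quarantine in Playwright is a title tag plus grepInvert; the @quarantine tag below is a team convention, not a built-in:

```typescript
// playwright.config.ts: exclude quarantined tests from the blocking run
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Any test whose title contains @quarantine is skipped here
  grepInvert: /@quarantine/,
});

// In the spec file, tag the flaky test in its title:
// test('checkout applies discount @quarantine', async ({ page }) => { ... });
//
// A separate, non-blocking CI job keeps visibility by running
// ONLY the quarantined tests:
//   npx playwright test --grep "@quarantine"
```

The second, non-blocking job is what distinguishes quarantine from skipping: the test keeps producing results that feed the tracking issue.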
Historical Trend Tracking
Store test results over time and track the flakiness rate per test. A dashboard showing which tests are most flaky, when they started failing, and which commits correlate with the flakiness helps prioritize fixes. Tools like Playwright's HTML reporter, Allure, and custom dashboards built on test result JSON all support this.
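Underneath any such dashboard is simple per-test bookkeeping over stored results. A minimal sketch (the record shape is illustrative, e.g. flattened from the JSON reporter output) might be:

```typescript
// One row per test execution, accumulated across many CI runs.
interface RunRecord {
  test: string;
  status: 'passed' | 'failed';
}

// Returns each test's failure rate over the recorded history,
// the basic signal a flakiness dashboard is built on.
function flakinessRates(history: RunRecord[]): Map<string, number> {
  const totals = new Map<string, { runs: number; fails: number }>();
  for (const r of history) {
    const t = totals.get(r.test) ?? { runs: 0, fails: 0 };
    t.runs += 1;
    if (r.status === 'failed') t.fails += 1;
    totals.set(r.test, t);
  }
  const rates = new Map<string, number>();
  for (const [name, t] of totals) {
    rates.set(name, t.fails / t.runs);
  }
  return rates;
}
```

A test hovering at a 5% failure rate with no correlated commits is a quarantine candidate; a test that jumps from 0% to 50% after a specific commit is probably a real regression.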
Pre-Merge Flakiness Check
Before merging a pull request that adds or modifies tests, run the affected tests multiple times (10 to 20 repetitions) to catch flakiness before it enters the main branch. This is significantly cheaper than detecting and fixing flakiness after the merge.
# GitHub Actions workflow: pre-merge flakiness check
name: Flaky Test Check
on:
pull_request:
paths:
- 'tests/**'
- 'e2e/**'
jobs:
flaky-check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
# Fetch full history so origin/main exists for the diff below
fetch-depth: 0
- uses: actions/setup-node@v4
with:
node-version: 20
- run: npm ci
- run: npx playwright install --with-deps
# Run changed tests 15 times each
- name: Check for flakiness
run: |
CHANGED=$(git diff --name-only origin/main -- tests/ e2e/)
if [ -n "$CHANGED" ]; then
npx playwright test $CHANGED --repeat-each=15
fi
- name: Report
if: failure()
run: echo "New or modified tests appear to be flaky. Fix before merging."
Ready to automate your testing?
Assrt discovers test scenarios, writes Playwright tests from plain English, and self-heals when your UI changes.