
Managing Flaky Tests in CI/CD: Quarantine, Ownership, and the CI Sheriff

Flaky tests erode trust in your entire CI pipeline. When developers learn to ignore test failures, real bugs slip through unnoticed.


1. The real cost of flaky tests

A flaky test is one that passes and fails intermittently without any code change. The failure is not deterministic; running the same test on the same code produces different results. Industry data suggests that 2% to 5% of tests in a typical suite are flaky, and that number grows as suites age.

The direct cost is wasted CI compute. Every flaky failure triggers a pipeline retry. If your pipeline takes 15 minutes and your flaky rate is 5%, roughly one in every 20 runs fails unnecessarily. Over a month, this adds up to hours of wasted compute time and developer attention.
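The back-of-the-envelope math can be sketched as a small helper; all numbers here are illustrative assumptions, not measured data:

```typescript
// Estimate CI minutes wasted on flaky re-runs per month (a sketch).
function wastedCiMinutes(
  runsPerMonth: number,   // total pipeline runs in a month
  flakyFailRate: number,  // fraction of runs that fail only due to flakiness
  pipelineMinutes: number // duration of one full pipeline run
): number {
  // Each flaky failure forces at least one full re-run of the pipeline.
  return Math.round(runsPerMonth * flakyFailRate * pipelineMinutes);
}

// 400 runs/month, 5% flaky failure rate, 15-minute pipeline:
const wasted = wastedCiMinutes(400, 0.05, 15);
console.log(`${wasted} minutes (~${(wasted / 60).toFixed(1)} hours) wasted per month`);
```

This counts only the re-run itself; the context-switch cost for the developer who triggered it comes on top.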

The indirect cost is far worse: erosion of trust. When developers see a red CI pipeline, they check whether it is a "real" failure or "just a flaky test." After a few false alarms, they stop checking entirely. They click "re-run" without investigating, or they merge despite failures. This behavioral change means that real bugs now pass through the same pipeline undetected. The quality gate that was supposed to catch regressions becomes decoration.

Google published research showing that flaky tests cause developers to ignore approximately 15% of genuine test failures. At that scale, flaky tests are not just an annoyance; they are a systemic quality risk that undermines the entire testing strategy.

2. The ownership gap: why nobody fixes flaky tests

Flaky tests persist because of an ownership gap. The developer who encounters a flaky failure did not write the test and is not responsible for the code it covers. They have no context on why the test exists, what it is supposed to verify, or why it might be flaky. Their rational response is to re-run the pipeline and move on.

The developer who originally wrote the test may have moved to a different team or left the company. Even if they are still around, they are working on new features and have no incentive to go back and fix a test they wrote six months ago. The QA team, if one exists, is often focused on new test development rather than maintenance of existing tests.

This creates a tragedy of the commons. Everyone suffers from flaky tests, but nobody is specifically responsible for fixing them. The cost is distributed across the entire team (slower pipelines, eroded trust, missed bugs), but the effort to fix a flaky test falls on a single person who gets no recognition for doing maintenance work.

The solution requires making flaky test ownership explicit and visible. This means either assigning ownership to individuals or teams, or creating a rotation where someone is always responsible for test health. Both approaches work, and the right choice depends on your team size and structure.


3. Quarantine strategies that actually work

Quarantining a flaky test means moving it out of the critical path so it no longer blocks merges, while keeping it running separately so the flakiness is tracked and eventually fixed. The key principle is that quarantine is a temporary state, not a permanent exile. Every quarantined test should have a deadline and an owner.

The simplest quarantine implementation uses test tags or annotations. Mark flaky tests with a tag like @flaky, or use test.skip with a link to a tracking issue. Configure your CI pipeline to run two jobs: one for stable tests (blocking) and one for quarantined tests (non-blocking). The quarantine job still runs the flagged tests and reports their results, but a failure there does not block the merge.
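A minimal sketch of the tag-based split using Playwright's convention of tags in test titles; the route, copy, and issue ID (FLAKY-123) are hypothetical:

```typescript
import { test, expect } from '@playwright/test';

// Tag the flaky test in its title and link a tracking issue so the
// quarantine stays temporary (FLAKY-123 is a made-up issue ID).
test('checkout applies discount code @flaky FLAKY-123', async ({ page }) => {
  await page.goto('/checkout');
  await expect(page.getByText('Discount applied')).toBeVisible();
});

// CI then runs two jobs from the same suite using title filtering:
//   blocking job:     npx playwright test --grep-invert "@flaky"
//   non-blocking job: npx playwright test --grep "@flaky"
```

Because both jobs share one suite, moving a test in or out of quarantine is a one-line title change rather than a file move.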

A more sophisticated approach uses automatic flaky detection. Run each test multiple times (3 to 5 repetitions) on each CI run. If a test passes on some runs and fails on others within the same commit, it is automatically flagged as flaky and moved to the quarantine bucket. This removes the need for manual triage and catches new flaky tests as soon as they appear.
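The detection rule is simple to state in code: on identical source, all passes means stable, all failures means a real break, and mixed results mean flaky. A sketch of that classification step:

```typescript
type Outcome = 'pass' | 'fail';

// Classify a test from repeated runs on the same commit (a sketch).
// All pass -> stable; all fail -> broken (a real failure that should
// block the merge); mixed -> flaky, destined for the quarantine bucket.
function classify(outcomes: Outcome[]): 'stable' | 'broken' | 'flaky' {
  const passes = outcomes.filter(o => o === 'pass').length;
  if (passes === outcomes.length) return 'stable';
  if (passes === 0) return 'broken';
  return 'flaky';
}
```

The important property is that "broken" still blocks: repetition must never let a genuine regression through just because it failed consistently.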

The quarantine must have an exit strategy. Set a maximum quarantine period (e.g., two weeks). If a test is not fixed within that period, it is deleted. This creates urgency. Deleting tests sounds extreme, but a flaky test that nobody fixes provides negative value: it consumes CI resources, clutters reports, and teaches the team that test quality does not matter.
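The deadline check itself is a few lines; this is a sketch, and the record shape and 14-day limit are assumptions, not a prescribed format:

```typescript
// A quarantine record (field names are illustrative assumptions).
interface QuarantinedTest {
  name: string;
  owner: string;        // every quarantined test needs an owner
  quarantinedAt: Date;  // ...and a clock that started ticking
}

const MAX_QUARANTINE_DAYS = 14; // e.g. the two-week limit

// Return tests past the deadline: fix them now or delete them.
function expired(queue: QuarantinedTest[], now: Date): QuarantinedTest[] {
  const msPerDay = 24 * 60 * 60 * 1000;
  return queue.filter(
    t => (now.getTime() - t.quarantinedAt.getTime()) / msPerDay > MAX_QUARANTINE_DAYS
  );
}
```

Running this daily in CI and failing the build (or pinging the owner) when the list is non-empty is what turns the deadline from a stated policy into an enforced one.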

4. The CI sheriff rotation model

The CI sheriff (sometimes called the "build cop" or "green keeper") is a rotating role where one team member is responsible for CI health for a defined period, typically one week. The sheriff's job is to investigate every CI failure, determine whether it is a real failure or a flaky test, and take action: either notify the responsible developer or quarantine the flaky test.

The rotation works because it distributes the burden fairly and builds empathy. Every team member experiences the pain of flaky tests firsthand during their sheriff week. This creates internal pressure to write reliable tests. A developer who has spent a week investigating flaky failures is much less likely to write flaky tests themselves.

During their rotation, the sheriff has authority to quarantine any test without needing approval. They also have a budget of time (typically 2 to 4 hours per day) dedicated to CI health. This time is not available for feature work. Making the cost explicit helps managers understand the investment required to maintain a healthy CI pipeline.

Some teams enhance the sheriff role with tooling. A Slack bot monitors CI failures and notifies the current sheriff. A dashboard shows the flaky test rate, quarantine queue size, and mean time to investigate. Tools like Assrt can reduce the sheriff's burden by generating stable, self-healing tests that are less likely to become flaky in the first place, since the selectors adapt to UI changes automatically.

5. Preventing flaky tests from entering the suite

Prevention is more effective than cure. The most common causes of flaky E2E tests are timing issues (not waiting for the right condition), shared state (tests depending on data from previous tests), and environmental differences (tests that pass locally but fail in CI due to different resource constraints).

For timing issues, use Playwright's auto-wait and avoid any fixed timeouts. If you must wait for a condition, wait for a specific observable event: a network response, an element becoming visible, or a URL change. Never use page.waitForTimeout() in a test that will run in CI.
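The contrast looks like this in practice; a sketch using Playwright's web-first assertions, with a made-up route, button, and API path:

```typescript
import { test, expect } from '@playwright/test';

test('order confirmation appears', async ({ page }) => {
  await page.goto('/checkout'); // hypothetical route

  // Bad: a fixed sleep is either too short (flaky) or too long (slow).
  // await page.waitForTimeout(3000);

  await page.getByRole('button', { name: 'Place order' }).click();

  // Good: wait for an observable condition. expect().toBeVisible()
  // retries until the element appears or the assertion timeout elapses.
  await expect(page.getByText('Order confirmed')).toBeVisible();

  // Also good: wait for the specific network response you depend on.
  // await page.waitForResponse(r => r.url().includes('/api/orders') && r.ok());
});
```

The common thread: every wait names the condition the test actually cares about, so it finishes the moment that condition holds instead of gambling on a duration.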

For shared state, ensure each test creates its own data and cleans up after itself. Playwright already isolates browser state per test; adding test.describe.configure({ mode: 'parallel' }) runs the tests in a file concurrently, which quickly exposes any hidden dependencies on shared data. If tests share a database, use unique identifiers per test (timestamps, UUIDs) to prevent data collisions.
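A small helper for the unique-identifier approach; the prefix and email format are illustrative:

```typescript
import { randomUUID } from 'node:crypto';

// Give each test its own data namespace so parallel runs never collide.
// Combines a timestamp (readable, roughly sortable in test data) with a
// random UUID fragment (guards against same-millisecond collisions).
function uniqueId(prefix: string): string {
  return `${prefix}-${Date.now()}-${randomUUID().slice(0, 8)}`;
}

// Each test creates its own user instead of sharing a fixture row:
const email = `${uniqueId('user')}@example.test`;
```

Cleanup then becomes safe too: a test can delete everything matching its own prefix without touching another test's rows.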

Add a flaky detection step to your CI pipeline for new tests. When a pull request adds or modifies tests, run those specific tests 5 to 10 times. If any run fails, the PR is blocked until the flakiness is resolved. This prevents new flaky tests from ever entering the suite, which is far easier than fixing them after they have been merged and forgotten.
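One way to wire this burn-in step, assuming Playwright; the BURN_IN variable and the mechanism for selecting changed files are illustrative, not prescribed:

```typescript
// playwright.config.ts (a sketch): repeat tests when the CI job asks for
// a burn-in pass, e.g.  BURN_IN=10 npx playwright test tests/changed.spec.ts
// (the --repeat-each CLI flag achieves the same thing).
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Run each selected test N times on the same commit.
  repeatEach: process.env.BURN_IN ? Number(process.env.BURN_IN) : 1,
  // No retries during burn-in: a single failure out of N means flaky,
  // and the PR is blocked until it is resolved.
  retries: 0,
});
```

Restricting the burn-in to the files the PR touched keeps the cost proportional to the change rather than to the size of the whole suite.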
