Why Flaky Tests Happen and How Senior QA Engineers Fix Them
If you have spent six or seven years in QA automation, you have fought flaky tests more times than you can count. This guide breaks down the real root causes behind test flakiness, the selector strategies that prevent it, and how self-healing locators are changing the game for senior SDETs.
“In large CI pipelines, a significant share of test failures are caused by flaky tests rather than actual product defects.”
1. What Actually Makes a Test Flaky
A flaky test is one that passes and fails on the same code without any changes. It is the single most destructive force in a test suite, because it erodes trust. When engineers see a test fail intermittently, they start ignoring all failures. Once that habit sets in, real bugs slip through unnoticed.
After years of debugging flaky tests across dozens of codebases, the root causes fall into a surprisingly small number of categories. Timing issues account for roughly half of all flakiness. Brittle selectors cause another quarter. The remaining quarter splits between test data pollution, environment differences, and non-deterministic third-party dependencies.
Understanding these categories matters because each one requires a different fix. Retrying a test that fails due to a race condition might make it pass, but it does not solve the problem. The underlying race condition still exists and will surface again at the worst possible moment, usually during a release.
Senior SDETs are expected not only to identify flaky tests but also to categorize them, prioritize them, and implement systemic fixes. The sections below walk through each category with practical solutions.
2. Timing and Race Conditions
The most common source of flakiness is timing. A test clicks a button before the page has finished rendering. An assertion runs before an API response has returned. A form submission fires before the validation logic has initialized. These are all race conditions between your test code and the application under test.
The worst solution is sleep(). Hard-coded waits are the hallmark of junior automation code. They are either too long (wasting pipeline time on every run) or too short (still flaky on a slow environment), the right duration varies across machines, and they hide the real problem. If you see sleep(3000) in a test file, treat it as a bug.
The correct approach is explicit waits tied to observable state changes. In Playwright, this means using waitForSelector, waitForResponse, or assertion-based waits like expect(locator).toBeVisible(). Playwright's auto-waiting mechanism handles most cases automatically, but you need to understand when it applies and when it does not.
Network-dependent tests require special attention. If your test relies on an API call completing, wait for that specific network response rather than waiting for a UI element to appear. The UI might render optimistically before the API returns, giving you a false positive.
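As a concrete sketch of both patterns (the route, endpoint, and button names below are hypothetical, for illustration only):

```typescript
import { test, expect } from '@playwright/test';

test('order confirmation waits on real signals, not sleep()', async ({ page }) => {
  await page.goto('/checkout');

  // Wait for the specific API response the UI depends on,
  // not for an element that might render optimistically.
  const orderResponse = page.waitForResponse(
    (res) => res.url().includes('/api/orders') && res.ok()
  );
  await page.getByRole('button', { name: 'Place order' }).click();
  await orderResponse;

  // Assertion-based wait: retries until visible or times out.
  await expect(page.getByText('Order confirmed')).toBeVisible();
});
```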
Animation and transition timing is another subtle source of flakiness. A button might be visible but not yet clickable because a CSS transition is still running. Playwright's actionability checks handle most of these cases, but custom animations or third-party component libraries can still cause issues. The fix is to either disable animations in your test environment or wait for the specific CSS property to stabilize.
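One low-effort way to stabilize animation timing is to have test browsers report prefers-reduced-motion. This config sketch assumes your app and its component libraries honor that media query; if they do not, injecting CSS that disables transitions is the fallback.

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Browsers emulate "prefers-reduced-motion: reduce", so
    // transitions that respect the media query are skipped.
    reducedMotion: 'reduce',
  },
});
```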
3. Selector Strategies That Survive Refactors
Brittle selectors are the second leading cause of test flakiness. When a test targets an element by its CSS class, XPath position, or auto-generated attribute, any front-end refactor can break dozens of tests simultaneously. This is not flakiness in the traditional sense (the tests fail consistently), but it creates the same trust erosion when it happens repeatedly.
The hierarchy of selector reliability, from most stable to least stable, looks like this:

1. data-testid attributes — most resilient; they exist solely for testing and are unlikely to change during refactors.
2. ARIA roles and labels — the next best option; they serve double duty for accessibility and testing.
3. Text content — works well for user-facing text that rarely changes.
4. CSS selectors — fragile; classes change during styling updates.
5. XPath selectors — the most brittle; they depend on the exact DOM structure.
Playwright encourages a user-centric approach to selectors. Instead of targeting implementation details, target what the user sees and interacts with. The getByRole, getByText, and getByLabel locators align tests with user behavior, making them both more readable and more resilient.
Selector Examples: Fragile vs. Resilient
// Fragile: depends on CSS class names
await page.click('.btn-primary.submit-form');
// Fragile: depends on DOM structure
await page.click('div > form > div:nth-child(3) > button');
// Resilient: uses role and accessible name
await page.getByRole('button', { name: 'Submit' }).click();
// Resilient: uses test ID
await page.getByTestId('checkout-submit').click();
// Resilient: uses label text
await page.getByLabel('Email address').fill('user@test.com');

A practical strategy for large codebases: adopt a selector policy and enforce it through code review. Require that all new tests use role-based or test-id selectors. Gradually migrate existing tests during maintenance windows. This incremental approach is more realistic than attempting a full migration at once.
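Such a policy can also be enforced mechanically. A minimal sketch of a CI lint step (the forbidden patterns here are illustrative assumptions, not a standard tool):

```typescript
// Flag raw CSS/XPath locators in test source; role- and test-id-based
// locators pass. Returns the 1-based line numbers of violations.
const FORBIDDEN = [
  /page\.click\(['"`]/,   // page.click('.some-css') style calls
  /locator\(['"`]\/\//,   // XPath locators: locator('//div...')
  /locator\(['"`][.#]/,   // class/id CSS locators
];

function violations(testSource: string): number[] {
  return testSource
    .split('\n')
    .flatMap((line, i) =>
      FORBIDDEN.some((re) => re.test(line)) ? [i + 1] : []
    );
}
```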
4. Self-Healing Locators Explained
Even with the best selector strategies, UI changes will eventually break some tests. This is where self-healing locators come in. The concept is straightforward: when a primary selector fails, the system automatically tries alternative selectors to find the same element, then updates the test to use the new selector.
Self-healing works by recording multiple attributes of each target element: its role, text content, nearby labels, position relative to other elements, and visual appearance. When the primary selector fails, the system scores candidate elements against these stored attributes and selects the best match. If the confidence score is high enough, the test continues. If not, it fails cleanly with a helpful error message.
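The scoring step can be sketched in a few lines. This illustrates the general idea only, not any specific tool's algorithm; the attributes and weights are assumptions:

```typescript
// Each target element is recorded as a snapshot of several attributes.
interface ElementSnapshot {
  role: string;
  text: string;
  testId?: string;
  nearbyLabel?: string;
}

// Score a candidate against the stored snapshot; weights are illustrative.
function score(stored: ElementSnapshot, candidate: ElementSnapshot): number {
  let s = 0;
  if (stored.role === candidate.role) s += 0.3;
  if (stored.text === candidate.text) s += 0.4;
  if (stored.testId && stored.testId === candidate.testId) s += 0.2;
  if (stored.nearbyLabel && stored.nearbyLabel === candidate.nearbyLabel) s += 0.1;
  return s;
}

// Pick the best-scoring candidate, or fail cleanly below the threshold.
function heal(
  stored: ElementSnapshot,
  candidates: ElementSnapshot[],
  threshold = 0.6
): ElementSnapshot | null {
  const best = candidates
    .map((c) => ({ c, s: score(stored, c) }))
    .sort((a, b) => b.s - a.s)[0];
  return best && best.s >= threshold ? best.c : null;
}
```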
Several tools implement self-healing in different ways. Proprietary solutions like Momentic use YAML-based test definitions and Chrome-only execution, which limits flexibility. Manual Playwright tests have no self-healing capability at all, requiring manual updates for every broken selector. Open-source tools like Assrt take a different approach: they generate standard Playwright code with self-healing capabilities built in, and when selectors break, they automatically submit pull requests with the fix. Because the output is standard Playwright TypeScript, there is no vendor lock-in and no proprietary runtime to maintain.
The key advantage of self-healing is not eliminating maintenance entirely. It is reducing the maintenance burden to a manageable level. Instead of spending hours each week fixing broken selectors after a front-end release, your team reviews auto-generated fix PRs and merges them in minutes.
For senior SDET interviews, understanding self-healing is increasingly important. Interviewers want to know that you understand both the promise and the limitations. Self-healing cannot fix tests that are fundamentally wrong, tests that assert the wrong behavior, or tests that rely on specific data states. It excels at handling the mechanical problem of selectors drifting as the UI evolves.
5. Environment and Test Data Isolation
The third major category of flakiness is test data and environment pollution. When tests share state, they become order-dependent. Test A creates a user, test B assumes that user exists, and test C deletes it. Run them in order and everything passes. Run them in parallel or shuffle the order and you get random failures.
The golden rule of test data isolation is simple: every test must create its own data and clean up after itself. Never rely on pre-existing data in the database. Never assume a specific application state at the start of a test. Treat every test as if it is running against a fresh environment.
In practice, this means each test should set up its preconditions explicitly. If a test needs a logged-in user, it should create that user (or use a dedicated test account) and log in as the first step. If a test needs specific data in the database, it should seed that data through an API or a database fixture, not through the UI.
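With Playwright fixtures, per-test seeding can be expressed directly. A sketch, where /api/test/users is a hypothetical seeding endpoint:

```typescript
import { test as base, expect } from '@playwright/test';

// Each test gets its own freshly seeded user via the API,
// and the fixture cleans up after the test finishes.
const test = base.extend<{ user: { email: string } }>({
  user: async ({ request }, use) => {
    const email = `user-${Date.now()}@test.com`;
    await request.post('/api/test/users', { data: { email } });
    await use({ email });
    // Tear down so the next test starts from a known state.
    await request.delete(`/api/test/users/${email}`);
  },
});

test('profile shows the seeded user', async ({ page, user }) => {
  await page.goto('/profile');
  await expect(page.getByText(user.email)).toBeVisible();
});
```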
Environment differences between local development and CI are another common source of flakiness. Tests pass on your machine because you have a fast SSD and plenty of RAM, but fail in CI because the runner is slower and the timeouts are tighter. The fix is to run tests in CI-like environments during development. Docker containers, or even just matching the CI runner's resource limits locally, can surface these issues early.
For teams running Playwright, the test.describe.configure method lets you set the execution mode at the suite level. Running a describe block in parallel mode is a good way to surface hidden order dependencies early; reserve serial mode for flows that are genuinely order-dependent.
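A minimal usage sketch:

```typescript
import { test } from '@playwright/test';

// Run this file's tests in parallel to flush out hidden
// order dependencies; use { mode: 'serial' } only for
// genuinely order-dependent flows.
test.describe.configure({ mode: 'parallel' });

test.describe('cart', () => {
  test('adds an item', async ({ page }) => {
    // ...
  });
  test('removes an item', async ({ page }) => {
    // ...
  });
});
```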
6. CI-Specific Flakiness and How to Debug It
Some flaky tests only manifest in CI. They pass every time on your machine, but fail intermittently in the pipeline. This is one of the most frustrating debugging experiences in QA automation.
The most common CI-specific causes are resource constraints (slower CPUs, less memory, shared runners), network latency (CI runners may have different network characteristics), display configuration (headless mode behaves differently from headed mode for certain interactions), and font rendering differences (text measurements vary across operating systems, which can affect visual tests).
Playwright's trace viewer is your best debugging tool for CI-specific failures. Enable tracing in your CI configuration with trace: 'on-first-retry' and upload the trace files as CI artifacts. When a test fails, you can open the trace locally and see exactly what happened: every network request, every DOM snapshot, every action and its timing.
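A typical configuration sketch:

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Retry only in CI so local failures stay loud.
  retries: process.env.CI ? 2 : 0,
  use: {
    // Record a trace only when a failed test is retried,
    // keeping CI artifact size manageable.
    trace: 'on-first-retry',
  },
});
```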
A practical strategy for persistent CI flakiness: add retry logic at the CI level (not the test level) and track retry rates over time. If a specific test is retried more than 10% of the time, flag it for investigation. This approach lets you keep the pipeline green while building a backlog of tests that need structural fixes.
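The bookkeeping behind that threshold can be sketched as follows, assuming CI history is available as one record per test per pipeline run (the input shape is an assumption):

```typescript
interface RunRecord {
  testId: string;
  attempts: number; // 1 = passed on the first try
}

// Flag any test that needed a retry in more than `threshold`
// (default 10%) of its recorded runs.
function flagFlaky(history: RunRecord[], threshold = 0.1): string[] {
  const totals = new Map<string, { runs: number; retried: number }>();
  for (const r of history) {
    const t = totals.get(r.testId) ?? { runs: 0, retried: 0 };
    t.runs += 1;
    if (r.attempts > 1) t.retried += 1;
    totals.set(r.testId, t);
  }
  return [...totals.entries()]
    .filter(([, t]) => t.retried / t.runs > threshold)
    .map(([id]) => id);
}
```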
7. The Senior SDET Interview Perspective
For SDET candidates with six to seven years of experience, interviewers expect more than textbook answers about flaky tests. They want to hear about your systematic approach to identifying, categorizing, and resolving flakiness at scale.
A strong interview answer demonstrates awareness of the full spectrum: timing issues, selector fragility, data pollution, environment differences, and third-party dependencies. It also shows pragmatism. You should be able to explain when to fix a flaky test, when to quarantine it, and when to delete it entirely.
Interviewers increasingly ask about tooling decisions. Be prepared to discuss why you chose Playwright over Cypress or Selenium, how you evaluate new tools like AI-powered test generators, and what tradeoffs you consider when adopting self-healing locators. The ability to articulate tradeoffs (not just preferences) is what separates senior candidates from mid-level ones.
Another common interview topic is metrics. Senior SDETs should track flakiness rate (percentage of test runs that include at least one flaky failure), mean time to detect flakiness, mean time to resolve, and the cost of flakiness in terms of delayed releases and wasted CI minutes. These metrics help you make the business case for investing in test infrastructure improvements.
Finally, be ready to discuss automation architecture at scale. How do you organize tests across hundreds of microservices? How do you handle test execution across multiple browsers and devices? How do you balance speed and coverage in your CI pipeline? These architectural questions reveal whether you can lead a QA organization, not just write test scripts.