
Headless Chrome test flakiness is a static-script problem

Almost every article on this topic teaches you patterns to add to a handwritten Playwright file: auto-retry assertions, web-first locators, never use waitForTimeout, isolate test data. Those are all real, and they all treat the same script you keep editing as the place the fix has to live. This guide takes the other side. The flake is not in Chrome. It is in the script. Two specific bugs cause most of it: locators that aged out between the millisecond they were resolved and the millisecond they fired, and waits keyed on a clock instead of on actual DOM behaviour. We will show the exact code Assrt injects to replace both, why it is a runtime decision and not a syntax decision, and how a runner that has no static script has nothing left to keep flaky.

Matthew Diakonov, Written with AI

Published April 26, 2026 · 11 min read

The script is the flake

Two bugs cause most of it. One file fixes both.

Stale locators: a handle resolved milliseconds ago, then re-rendered

Clock waits: waitForTimeout is decoupled from what the page is doing

Assrt re-snapshots the a11y tree on every action

And waits on a real MutationObserver, not a sleep

agent.ts lines 962-994. Read it. It is 30 lines.


4.9 from Assrt MCP users

Real Playwright code, not proprietary YAML

MIT-licensed, self-hosted, no cloud dependency

Tests are yours to keep — zero vendor lock-in

The two bugs that cause most CI flakes

Pull up the last twenty intermittent failures from your CI. Read the stack traces. The vast majority will fall into one of two buckets: either the locator the script captured no longer points where it thought (the React tree re-mounted, a Suspense boundary swapped, a modal stole focus), or the wait that preceded the action returned before the page actually finished doing what the wait was supposed to be waiting for. Both bugs come from the same place: the test script is a snapshot of an assumption made at write time, and the page in CI is operating on its own schedule. Everything else, the retries, the locator best practices, the auto-waiting helpers, is downstream of trying to keep that snapshot synchronised with reality.

Same scenario, two abstractions

// The flaky pattern almost every guide tells you to "fix" with retries.
import { test, expect } from "@playwright/test";

test("dashboard loads", async ({ page }) => {
  await page.goto("https://staging.app/dashboard");

  // Hardcoded sleep. Works on a laptop, fails on a 2 vCPU CI runner.
  await page.waitForTimeout(2000);

  // Locator resolved here. By the time we click, a refetch may
  // have replaced the node and this handle points at a detached element.
  const card = page.locator('[data-testid="metric-card"]').first();
  await card.click();

  // Same problem on the assertion. The element is "found" but stale.
  await expect(card.locator(".value")).toContainText("42");
});

60% lines you maintain

The 30-line file that replaces every waitForTimeout

The fix for the second bug is not a different sleep duration. It is to stop using a sleep at all. Inside the Assrt MCP runner there is a tool called wait_for_stable that injects a real MutationObserver into the page via the underlying Playwright browser_evaluate tool, increments a counter on every batch of mutations, polls the counter every 500 milliseconds, and proceeds only after the counter has been unchanged for the configured stable window (two seconds by default). If the page never settles, the wait gives up at the configured timeout (30 seconds by default, hard cap one minute). Either way, it then disconnects the observer and removes the globals so the next call starts clean. The whole implementation lives at /Users/matthewdi/assrt-mcp/src/core/agent.ts on lines 962 through 994.
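As a rough illustration of the technique described above, here is a minimal sketch of a MutationObserver-based stability wait. It is not the code from agent.ts: the names `PageLike`, `StabilityTracker`, `waitForDomStable`, `__stableCount`, and `__stableObserver` are hypothetical, and the defaults simply mirror the numbers quoted in this article. `PageLike` is a structural stand-in that a Playwright `Page` should satisfy.

```typescript
// Hypothetical sketch, not Assrt's implementation.
// Structural stand-in for Playwright's Page, so the sketch has no dependencies.
interface PageLike {
  evaluate<T>(pageFunction: () => T): Promise<T>;
}

type Verdict = "pending" | "stable" | "timeout";

// Pure decision logic: feed it (mutation count, current time) on every poll.
class StabilityTracker {
  private lastCount = 0;
  private lastChangeAt: number;
  private readonly startedAt: number;
  private readonly stableMs: number;
  private readonly timeoutMs: number;

  constructor(startedAt: number, stableMs: number, timeoutMs: number) {
    this.startedAt = startedAt;
    this.lastChangeAt = startedAt;
    this.stableMs = stableMs;
    this.timeoutMs = timeoutMs;
  }

  update(count: number, now: number): Verdict {
    if (count !== this.lastCount) {
      // The page mutated since the last poll: restart the quiet window.
      this.lastCount = count;
      this.lastChangeAt = now;
    }
    if (now - this.lastChangeAt >= this.stableMs) return "stable";
    if (now - this.startedAt >= this.timeoutMs) return "timeout";
    return "pending";
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function waitForDomStable(
  page: PageLike,
  { stableMs = 2000, timeoutMs = 30000, pollMs = 500 } = {},
): Promise<Verdict> {
  // Install the observer inside the page; it only bumps a counter per batch.
  await page.evaluate(() => {
    const w = globalThis as any; // globalThis is `window` inside the page
    w.__stableCount = 0;
    w.__stableObserver = new w.MutationObserver(() => {
      w.__stableCount += 1;
    });
    w.__stableObserver.observe(w.document.body, {
      childList: true,
      subtree: true,
      characterData: true,
    });
  });

  const tracker = new StabilityTracker(Date.now(), stableMs, timeoutMs);
  try {
    for (;;) {
      await sleep(pollMs);
      const count = await page.evaluate(() => (globalThis as any).__stableCount as number);
      const verdict = tracker.update(count, Date.now());
      if (verdict !== "pending") return verdict;
    }
  } finally {
    // Always clean up, even after a timeout, so the next call starts from zero.
    await page.evaluate(() => {
      const w = globalThis as any;
      w.__stableObserver?.disconnect();
      delete w.__stableCount;
      delete w.__stableObserver;
    });
  }
}
```

In a test this would replace a sleep, for example `await waitForDomStable(page)` after a click. Keeping the decision logic in a separate class is a design choice for the sketch: the timing rules can be unit-tested with made-up timestamps, without launching a browser.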

Anchor fact

wait_for_stable observes document.body with childList, subtree, and characterData set, so it counts every node insertion, node removal, and text edit anywhere under the body. Attribute-only changes are not counted, because a MutationObserver reports those only when the attributes option is also set. The signal it acts on is the absence of those events for the configured stable window (2,000 ms by default) of wall-clock time. If your app never stops mutating (a heartbeat clock in the corner, an analytics queue), the wait correctly times out at the configured maximum and the log line tells you which page is guilty. You no longer have to guess.

agent.ts:962-994

What happens between an action and its wait

The agent loop is small enough to draw. The agent decides what to do, the runner asks the page what it currently looks like, the runner performs the action, the runner watches mutations until the page settles, and only then does the loop continue. There is no place inside that loop where a stale handle survives or where the wait fires before the work is done.
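The cycle described above can be sketched in a few lines. This is an illustrative model, not the loop from agent.ts: the `Browser` interface, the `Decide` callback, the `Action` type, and `runLoop` are hypothetical names, and only a click action is modelled. What the sketch tries to show is that no ref survives from one iteration to the next, and that a failed action leads to a fresh snapshot rather than a retry of the same ref.

```typescript
// Hypothetical sketch of a snapshot -> decide -> act -> settle loop.
type Action = { kind: "click"; ref: string } | { kind: "done" };

interface Browser {
  snapshot(): Promise<string>;       // current accessibility tree, with [ref=eN] labels
  click(ref: string): Promise<void>; // throws if the ref no longer exists
  waitForStable(): Promise<void>;    // e.g. a MutationObserver-based wait
}

// The decision step sees only the current tree and the last error, if any.
type Decide = (tree: string, lastError?: string) => Promise<Action>;

async function runLoop(browser: Browser, decide: Decide, maxSteps = 20): Promise<number> {
  let lastError: string | undefined;
  for (let step = 0; step < maxSteps; step++) {
    // 1. Ask the page what it looks like right now.
    const tree = await browser.snapshot();
    // 2. Choose the next action from that tree, never from an older one.
    const action = await decide(tree, lastError);
    if (action.kind === "done") return step;
    try {
      // 3. Act, then 4. wait until the page settles.
      await browser.click(action.ref);
      await browser.waitForStable();
      lastError = undefined;
    } catch (err) {
      // Do not retry the stale ref: loop around, re-snapshot, and replan.
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  throw new Error(`gave up after ${maxSteps} steps`);
}
```

The return value is the number of iterations used, which makes it easy to see in a log how many re-snapshots a scenario needed.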

Agent loop, one full cycle

The diagram is the loop in one screenful. Source: agent.ts lines 693 through 1058.

Six specific things that cause headless flake

In rough order of frequency. The first two are bugs in the test script. The third is a bug in choosing the wrong wait primitive. The last three are real environmental bugs that you can fix once. None of them are a reason to retry.

Stale locators

The locator was resolved milliseconds ago, the framework re-rendered, and the handle now points at a detached node. Each `[ref=eN]` label in the accessibility tree from snapshot() is bound to that snapshot, never reused, and resolved fresh per action.

Clock-based waits

page.waitForTimeout(2000) is the second-most-common cause of flake. wait_for_stable replaces it with a MutationObserver count that only proceeds when the page actually stops changing for stable_seconds.

networkidle on chatty apps

Persistent WebSockets and analytics heartbeats mean networkidle never fires. DOM-stability is the right question: did my UI stop changing, not did every TCP socket go quiet.

Headed vs headless rAF

Headless Chrome on CI without a real GPU schedules requestAnimationFrame callbacks differently. Animation-tied assertions race. Stability waits adapt because the mutation count drops to zero regardless of the rAF cadence.

Browser profile leftovers

SingletonLock symlinks from a crashed previous run block the next launch. The launcher at browser.ts:326-342 evicts SingletonLock, SingletonSocket, and SingletonCookie before every spawn.

60s SDK timeout cliff

A contended CI navigation that takes 90s aborts at 60 and looks like a flake. browser.ts:381 raises TOOL_TIMEOUT_MS to 120s so the log records the real number instead of an artificial cutoff.
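For the browser-profile item above, a minimal sketch of the eviction step might look like the following. The three file names come from this article; the function name `evictSingletons` and the way the profile directory is passed in are assumptions, and this is not the code from browser.ts. The reason to prefer `lstatSync` is that Chrome's SingletonLock is a symlink whose target usually does not exist as a file, and `existsSync` follows the link and reports a dangling one as missing.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// File names as listed in the article; everything else here is hypothetical.
const SINGLETON_FILES = ["SingletonLock", "SingletonSocket", "SingletonCookie"];

function evictSingletons(profileDir: string): string[] {
  const removed: string[] = [];
  for (const name of SINGLETON_FILES) {
    const target = path.join(profileDir, name);
    try {
      // lstatSync inspects the link itself, so a dangling symlink still counts.
      fs.lstatSync(target);
    } catch {
      continue; // nothing at this path, nothing to evict
    }
    // unlinkSync removes the link, socket, or file without following it.
    fs.unlinkSync(target);
    removed.push(name);
  }
  return removed;
}
```

Calling `evictSingletons(profileDir)` immediately before launching the browser, and logging the returned names, makes it visible in CI when a previous run crashed and left a lock behind.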

What the run actually looks like

A real Assrt run prints every Playwright MCP tool call on a single line with the elapsed millisecond count. There is no hidden retry layer; there is no proprietary log format. If a step takes 92 seconds, the log says 92000ms, and the next step starts on a fresh snapshot.
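One way to get that kind of log in any runner is to wrap each call in a helper that measures wall-clock time and applies a generous ceiling. The sketch below is an assumption about how this could be done, not Assrt's code: `timedCall` is a hypothetical name, the log format only imitates the `(92000ms)` style mentioned in this article, and `TOOL_TIMEOUT_MS` is set to the 120 s value the article cites.

```typescript
// Hypothetical helper: run one tool call, log its real duration, cap it generously.
const TOOL_TIMEOUT_MS = 120_000;

async function timedCall<T>(
  name: string,
  call: () => Promise<T>,
  timeoutMs: number = TOOL_TIMEOUT_MS,
  log: (line: string) => void = console.log,
): Promise<T> {
  const started = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${name} timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    // Whichever settles first wins: the real call or the ceiling.
    const result = await Promise.race([call(), timeout]);
    log(`${name} (${Date.now() - started}ms)`);
    return result;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    log(`${name} FAILED (${Date.now() - started}ms): ${reason}`);
    throw err;
  } finally {
    clearTimeout(timer); // never leave the timer holding the process open
  }
}
```

A call such as `await timedCall("browser_navigate", () => page.goto(url))` then records how long the navigation really took, whether it succeeded or not. Note that losing the race does not cancel the underlying call; it only stops waiting for it.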

assrt_test on staging.app

Side by side: where the flake hides

| Feature | Hand-written Playwright | Assrt MCP |
| --- | --- | --- |
| How a selector is resolved | Locator captured ahead of time and reused; goes stale on re-render | Fresh accessibility tree per action; ref bound to current snapshot (agent.ts:218-219) |
| How a wait works | waitForTimeout(N) or networkidle; both decoupled from actual DOM stability | Real MutationObserver injected via browser_evaluate; counts mutations until stable_seconds of zero (agent.ts:962-994) |
| Failure recovery | test.retry(N) hides the cause and burns CI minutes | Auto re-snapshot and retry on tool failure (agent.ts:1014-1020); the agent rewrites its plan from current DOM |
| Per-action timeout | 60s SDK default; contended runs report timeouts that look like flakes | 120s (TOOL_TIMEOUT_MS, browser.ts:381); slow CI navigations complete and log the real duration |
| Test artifact format | Proprietary YAML or DSL; locks you into the vendor | Markdown #Case scenarios + real Playwright MCP tool calls |
| Pricing | Closed cloud tools up to 7,500 per month | Free, open source, self-hosted; pay LLM tokens only |

The habits worth adopting whether or not you ever run Assrt

Re-resolve every selector at the moment of action. Never reuse an element handle across an await boundary.
Replace every waitForTimeout with either waitForFunction over a precise predicate or a MutationObserver-based stability wait.
Treat networkidle as a hint, not a contract. SaaS pages with persistent sockets break it permanently.
Log the real action duration. A 92000ms navigation is information; a 60000ms 'timed out' is noise.
Evict SingletonLock, SingletonSocket, and SingletonCookie before every browser launch in CI. lstatSync, not existsSync.
Stop writing tests in vendor YAML. Real Playwright tool calls and Markdown scenarios survive a vendor switch.
Treat repeated retries as a code smell. Retries hide stale-selector and waitForTimeout bugs that you could just delete.

The list above works in any Playwright codebase. Adopting it before you switch runners is the cheapest way to cut your CI flake rate. Adopting all of it leaves you with a script that already behaves most of the way like the agent does, which makes the migration trivial.


“The right wait is not a sleep. It is a question the page can answer for itself.”

agent.ts:962-994 (the wait_for_stable tool)

The numbers that actually matter in this code

These are the load-bearing constants in the runner. Every one of them is a deliberate choice that shows up in the failure mode you do or do not get to see in CI.

500 ms stability poll interval

2,000 ms default stable window

60 s wait_for_stable hard cap

120 s MCP per-call timeout

Bring your flakiest scenario

We will run it through the Assrt MCP runner on a live call and show you the diff between the failure mode you see today and an a11y-tree-fresh-per-action loop.

Frequently asked questions

Why is my Playwright test passing locally but flaking in headless CI?

Three reasons in roughly this order. First, your locator was written against the DOM at record time, and the DOM in CI is one Suspense boundary or one feature flag away from being slightly different, so the locator either resolves to nothing or resolves to the wrong node. Second, your wait is a `page.waitForTimeout(2000)` that worked locally because your machine is fast and a 2 GHz CI runner needs 2.4s for the same animation. Third, headless Chrome on CI lacks a real GPU, which makes `requestAnimationFrame` callbacks fire on a different schedule, which makes elements you assumed had finished moving still be in transit. The first two are bugs in your script. The third is a real Chromium quirk, and it is much rarer than the first two combined.

Does waitForLoadState('networkidle') fix flakiness?

It papers over it for a class of flakes (XHR-driven content loads) and creates a different class of flake on apps with persistent connections. SaaS dashboards routinely keep an open WebSocket, a Sentry session, a PostHog autocapture queue, and an analytics heartbeat going. There is no `networkidle` on those pages: there is always a request in flight. Tests built on networkidle then time out at the worst time, which is usually right before an assertion you needed to run. A DOM-mutation-based wait is more honest because it asks the question you actually have, which is `has the part of the page I care about stopped changing`. That is what `wait_for_stable` does at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 956 to 1009.

What does Assrt actually do differently when an element is not yet on the page?

It calls `snapshot()`, which returns the accessibility tree of the page in its current state, with each interactive element annotated with a fresh ref like `[ref=e7]`. Then it picks the element by ref. If the action fails because the ref is stale (the snapshot was taken a fraction of a second ago and the DOM moved), the runner catches the failure and re-snapshots automatically — see the catch at agent.ts:1014-1020 which builds a new accessibility tree and feeds it back to the agent. There is no element handle to keep alive across renders, because the abstraction the agent uses is not a handle, it is a label that gets re-resolved every step.

How is the MutationObserver wait different from page.waitForFunction?

Mechanically they overlap, but the failure modes differ. `waitForFunction` lets you write any predicate and re-evaluates it on every requestAnimationFrame by default, or on a fixed polling interval if you set one. The trap is that you have to design the predicate yourself, and most teams either write `() => document.querySelector('.spinner') === null` (which fires the moment the spinner unmounts, before the next render flushes) or `() => document.body.innerText.includes('Loaded')` (which fails the first time the loaded copy is conditional). The MutationObserver-based wait at agent.ts:962-994 sidesteps this by treating absence of mutation as the signal: it counts mutations on `document.body` with `childList`, `subtree`, and `characterData` set, and considers the page stable only after `stable_seconds` of unchanged count. You do not write a predicate; you let the page tell you when it is done.

Will a 2-second stable window catch infinite-loop flakes?

It will not, by design. If your app has a setInterval that fires every 1.5 seconds and updates a clock in the corner, `wait_for_stable` will time out at the configured maximum (default 30s, capped at 60s) because mutations never fall below the threshold. That is correct behavior: a page that never stops mutating has no stable point to assert against. The fix is in the app, not the test, and the timeout in the log is the signal that tells you which it is. The cleanup at agent.ts:990-994 disconnects the observer and deletes `window.__assrt_mutations` and `window.__assrt_observer` so the next `wait_for_stable` call gets a fresh count, even if the previous one timed out.

Does this remove the need for retry logic in CI?

Mostly. Retry logic in CI exists as a band-aid for two specific kinds of flake: stale-selector failures and timing failures. If both are gone, the only flakes left are real bugs (the app actually broke under load) and infrastructure failures (the runner died, the network dropped). Both deserve a real failure, not a retry. That said, the per-call MCP timeout at /Users/matthewdi/assrt-mcp/src/core/browser.ts line 381 is set to 120 seconds (vs the SDK default of 60), specifically so a contended-CI navigation that takes 90s completes and shows up in the log as `(92000ms)` rather than dying as `Request timed out`. The point is to surface the real number, not to retry around it.

Why a Markdown #Case format instead of a YAML test plan?

Because YAML test plans cannot be reused. The moment you switch vendors, every line of test logic is captured in their proprietary syntax and has to be rewritten. Real Playwright code, expressed as named scenarios in plain English `#Case` blocks that the agent compiles to `browser_navigate` / `browser_click` / `browser_snapshot` calls at runtime, is portable: the scenarios document intent, the agent picks the actions, and the artifacts are the actual MCP calls a Playwright engineer can read. The parser for the `#Case` format is at agent.ts:620-631 and is two regexes long.
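To make the portability point concrete, here is a small sketch of what parsing such a file could look like. The exact `#Case` syntax is not shown in this article, so the header shape assumed here (`#Case 1: name`, with the number and colon optional) is a guess, and `parseCases` and `Scenario` are hypothetical names rather than the parser at agent.ts:620-631.

```typescript
// Hypothetical parser for an ASSUMED "#Case N: name" header format.
interface Scenario {
  name: string;
  steps: string[]; // plain-English lines under the header
}

function parseCases(markdown: string): Scenario[] {
  // One or more '#', the word "Case", an optional number, an optional separator.
  const header = /^#+\s*Case\s*\d*\s*[:.-]?\s*(.*)$/i;
  const scenarios: Scenario[] = [];
  let current: Scenario | undefined;
  for (const line of markdown.split(/\r?\n/)) {
    const match = header.exec(line);
    if (match) {
      current = { name: (match[1] ?? "").trim(), steps: [] };
      scenarios.push(current);
    } else if (current && line.trim() !== "") {
      // Every non-blank line belongs to the most recent header.
      current.steps.push(line.trim());
    }
  }
  return scenarios;
}
```

Under that assumed format, a file with two `#Case` headers yields two scenarios, each carrying its steps as plain strings. Nothing in the result is tied to a particular runner, which is the property the answer above is arguing for.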

How would I reproduce the stale-selector flake locally to convince myself it is real?

Take any React app with a list that loads from an API. Write a Playwright test that captures an element handle with `const item = await page.$('text=ItemA')`, waits with `await new Promise(r => setTimeout(r, 50))`, and then calls `await item.click()`. A plain `page.locator('text=ItemA').click()` will not reproduce it, because Playwright locators re-resolve at action time; the flake needs a handle that outlives a render. Now make the list re-render asynchronously (a refetch on focus, a websocket update, anything). Run it 50 times. On some runs the handle will still point at the old node, the new render will have replaced that node, and the click will fail against a detached element. The `[ref=eN]` model in the accessibility tree avoids this because the ref is bound to the snapshot, not the DOM, and a fresh snapshot is one tool call away.

What is the smallest change I can make to my existing Playwright suite to start moving toward this model?

Two things, in order. First, replace every `waitForTimeout` with either `waitForFunction` (if you have a precise predicate) or with the equivalent of `wait_for_stable` (a MutationObserver-based wait, which is about 30 lines of helper code; the implementation at agent.ts:962-994 is a working reference). Second, stop reusing element handles across awaits — re-resolve the locator at the moment of each action. Both moves remove flake without changing your test runner. They are also the prerequisites for moving to an agent-driven runner later, because the agent is doing the same two things automatically.

Why is this open source and free when other tools cost thousands a month?

Because the runner is one repository you can read in an afternoon: /Users/matthewdi/assrt-mcp/src for the MCP server and /Users/matthewdi/assrt for the web app. The expensive parts of test infrastructure are not the orchestrator; they are the LLM tokens, the CI minutes, and the engineer hours spent on triage. Assrt charges nothing for the orchestrator and lets you bring your own model and your own CI. Closed cloud vendors charge for the orchestrator (often 7,500 per month), make you write tests in their syntax, and own your test artifacts. When you switch off them you start over. With Assrt, the artifacts are real Playwright MCP tool calls and Markdown scenarios; switching means deleting one npm dependency.