Notes from a Reddit thread on streaming-AI test flake

Playwright auto-retry assertions, plus the one wait the docs do not give you

Every guide on this topic walks through the same three tools: web-first assertions, expect.poll, and expect.toPass. They are correct and they cover most cases. They also stop helping the moment you point Playwright at a streaming chat UI, an async dashboard, or any page where the assertion target is unknown ahead of time. This page covers the standard three, then walks through the fourth primitive most projects end up writing themselves: a MutationObserver-backed waitForStable, taken straight from Assrt's open-source agent.

Assrt Engineering · 12 min read
Default web-first assertion timeout: 5000 ms
expect.poll default intervals: 100, 250, 500, 1000 ms
wait_for_stable default quiet period: 2 seconds
Implementation lives at agent.ts:956-1009 in assrt-mcp, MIT-licensed

The three primitives the Playwright docs hand you

You know these. They are the right answer most of the time. The short version, then we move on to where they stop being enough.

  1. Web-first auto-retry assertions. About 30 matchers (toBeVisible, toContainText, toHaveAttribute, toHaveURL, toBeChecked, and so on). Each polls a locator on the live page until the matcher passes or the assertion timeout (default 5000 ms) elapses. This is what people mean by "auto-retry assertions" in 90 percent of conversations.
  2. expect.poll. Wraps an arbitrary async callback in the same retry harness. Useful when the truth is not on the page: an HTTP response, a database row count, a value in localStorage. Default polling intervals back off through 100, 250, 500, 1000 ms so a fast answer comes back quickly and a slow one does not hammer.
  3. expect.toPass. Wraps a block of code (which itself contains assertions) in a retry. Use when one assertion is not enough to express the wait, for example "the row count became 3 AND the first row says Submitted, atomically." Default timeout is 0 ms, so you almost always pass an explicit timeout.
tests/orders.spec.ts
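The spec listing under this label did not survive the export of this page. Below is a sketch of what such a file could look like, exercising all three primitives in one test; the route, endpoint, and test id are illustrative, not from the original file.

```typescript
import { test, expect } from '@playwright/test';

test('order list settles to three submitted rows', async ({ page }) => {
  await page.goto('/orders'); // hypothetical route

  // 1. Web-first assertion: re-queries the locator until the matcher
  //    passes or the 5000 ms assertion timeout elapses.
  await expect(page.getByRole('heading', { name: 'Orders' })).toBeVisible();

  // 2. expect.poll: the truth lives in an HTTP response, not the DOM.
  await expect
    .poll(async () => {
      const res = await page.request.get('/api/orders'); // hypothetical endpoint
      return (await res.json()).length;
    })
    .toBe(3);

  // 3. expect.toPass: retry a multi-assertion block as a single unit,
  //    with an explicit timeout (the default is 0 ms).
  await expect(async () => {
    const rows = page.getByTestId('order-row'); // hypothetical test id
    await expect(rows).toHaveCount(3);
    await expect(rows.first()).toContainText('Submitted');
  }).toPass({ timeout: 10_000 });
});
```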

Where the standard three stop helping

All three primitives assume you can name what you are waiting for: a locator and a matcher value, a callback returning a known result, or a block of assertions you wrote ahead of time. That assumption breaks under three increasingly common conditions.

Streaming output
The text is non-deterministic

You cannot toHaveText against an LLM's answer. You do not know it. You also do not know when it stops streaming, because the locator only exists once the bubble is appended.

Heartbeat traffic
networkidle never fires

A WebSocket pinging every 5 seconds, or a polling /events fetch, keeps the network just busy enough that idle is unreachable. Your wait runs to the navigation timeout, then fails.

AI-driven test
The locator is not known yet

Agentic test runners read the live accessibility tree between actions and pick the next element by role and label. There is no upfront locator string to assert against, only an instruction to wait until things settle.

Numbers from the implementation

These are the constants in the open-source code, not invented benchmarks. Every value below maps to a line in agent.ts.

2s
default quiet period
500ms
poll interval
30s
default timeout
54
lines of source code

The MutationObserver primitive, line for line

This is the block at assrt-mcp/src/core/agent.ts:956-1009. It runs inside the Assrt agent every time the model calls the wait_for_stable tool, but the logic is general. Read it once and you can paste the same idea into any Playwright suite.

agent.ts:956-1009
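The listing itself did not survive the export of this page. The sketch below reconstructs the browser-side logic the surrounding text describes (install, read back, clean up) and is not the verbatim agent.ts source: the window.__assrt_mutations name comes from the FAQ at the bottom of this page, while __assrt_observer is a placeholder.

```typescript
// Browser-side halves of the pattern, as passed to page.evaluate().
// A sketch, not the verbatim agent.ts block.

// Install: count every mutation batch on document.body.
const install = () => {
  const w = window as unknown as {
    __assrt_mutations: number;
    __assrt_observer?: MutationObserver;
  };
  w.__assrt_mutations = 0;
  w.__assrt_observer = new MutationObserver(() => {
    w.__assrt_mutations++; // one increment per mutation batch
  });
  w.__assrt_observer.observe(document.body, {
    childList: true,     // element insertions (the chat bubble appearing)
    subtree: true,       // anywhere in the document, not just body's children
    characterData: true, // text nodes changing (tokens streaming in)
  });
};

// Read back: the runner polls this every 500 ms and declares the page
// stable once the value stops moving for the quiet period.
const read = () =>
  (window as unknown as { __assrt_mutations: number }).__assrt_mutations;

// Cleanup: disconnect and delete the globals so a long-lived observer
// cannot leak into the next test on a single-page app.
const cleanup = () => {
  const w = window as unknown as {
    __assrt_observer?: MutationObserver;
    __assrt_mutations?: number;
  };
  w.__assrt_observer?.disconnect();
  delete w.__assrt_observer;
  delete w.__assrt_mutations;
};
```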

Three things to notice. First, the observer is attached to document.body with childList: true, subtree: true, characterData: true, which catches both element insertions and text changes (the two ways streaming output shows up). Second, the counter is read back via evaluate on every poll, so the observer state lives in the page, not in the test runner; that means it survives same-document navigations. Third, the cleanup block disconnects the observer and deletes the window globals, because leaking a long-lived observer on a single-page app is exactly how a clean test fails the next one.

How the four primitives compose in one test

Stack them. Get the page into a known state with waitForStable, then assert with the web-first matchers you already know. Three inputs, one stability gate, three assertion paths.

action -> stability gate -> web-first assertions

Inputs: user action, async work, heartbeat traffic
Gate: waitForStable
Assertion paths: expect(locator).toHaveText, expect.poll, expect.toPass

The swap that ends most flake

The single highest-yield change in a flaky e2e suite is replacing fixed sleeps and networkidle waits with a stability check. Same test, two different waits.

One spec file, two waits

await page.click('button:has-text("Send")');
await page.waitForTimeout(2000);
await expect(page.getByTestId('reply')).toContainText('Done');

  • 2s fixed sleep on every run, even when the reply is instant
  • Random failure on slow CI when the reply takes 2.1s
  • page.waitForLoadState('networkidle') is no better here: heartbeat WebSocket keeps the network alive forever
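The matching "after" snippet was lost in the export. A hedged reconstruction of what the swap looks like, assuming a waitForStable helper along the lines of the one ported later on this page:

```typescript
await page.click('button:has-text("Send")');
await waitForStable(page, { stableSeconds: 2 }); // returns as soon as the DOM goes quiet
await expect(page.getByTestId('reply')).toContainText('Done');
```

Fast reply: the gate opens after roughly one quiet period instead of a fixed 2 s plus the reply time. Slow CI: the gate stays closed for as long as the stream keeps mutating the DOM, so nothing races.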

What the swap looks like in a run log

Same spec, two runs of 20 each, same machine, same conditions on the dev server. The first run flakes 7 times. The second run flakes 0 times and the mean runtime does not move; only the tail collapses.

playwright run

Web-first assertions vs. waitForStable

They are not substitutes; they compose. But it is worth seeing what each tool gives you.

| Feature | web-first / expect.toPass | waitForStable |
| --- | --- | --- |
| What you specify up front | A locator and an expected matcher value | A quiet period (e.g. 2 s of zero DOM mutations) |
| Default timeout | 5000 ms (web-first), 0 ms (toPass) | 30000 ms, capped at 60000 ms |
| Polling interval | Internal, ~30 ms for locators | 500 ms (a single evaluate roundtrip) |
| Returns when | Matcher passes against the locator | MutationObserver counter has not moved for stable_seconds |
| Survives heartbeat WebSocket / SSE | No; networkidle never fires, hard timeout instead | Yes; observes the DOM, not the network |
| Survives unknown locator (AI-streamed text) | No; you cannot expect.toHaveText('?') | Yes; you assert after stability, not during |
| Cost on a fast deterministic page | About one re-query per 30 ms until the matcher passes | About 4 evaluate roundtrips (2 s quiet), then return |
| Where it lives in your project | Built into @playwright/test | ~30 lines in playwright/helpers/wait-for-stable.ts |

The same idea, ported to a Playwright fixture

The Assrt agent is one place this lives. Your Playwright project is another. Here is the same pattern, expressed as a helper you can drop into any playwright/helpers/ folder. No new dependency, just @playwright/test.

playwright/helpers/wait-for-stable.ts

A six-step migration path for an existing suite

You do not have to rewrite anything. Add the helper, swap the worst offenders one at a time, and watch the tail of your test runtime collapse without giving up any of the auto-retry assertions you already trust.

1. Keep your web-first assertions exactly as they are

If a step has a known target and known matcher (toHaveText, toBeVisible, toHaveURL), do not touch it. Web-first auto-retry is the right tool for the boring 90 percent of cases. The whole point of the pattern is to compose with what you already have.

2. Find every page.waitForTimeout in your suite

Hardcoded sleeps are the biggest single source of test flake on async pages. They are slower than they need to be on fast machines and shorter than they need to be on slow CI. Grep for waitForTimeout, list every call, and earmark each for replacement.

3. Find every waitForLoadState('networkidle') that follows a user action

Page-load idle after a navigation is fine. networkidle after a button click on a modern app with WebSockets, polling fetches, or analytics beacons is a future flaky test. Same drill: list them.

4. Drop in waitForStable from the snippet above

Add the helper at playwright/helpers/wait-for-stable.ts. It has no dependency beyond @playwright/test. Re-run your suite once on the same hardware to confirm it does not break anything; the helper only adds at most stable_seconds of latency to a passing test.

5. Replace the sleeps and the networkidle waits one at a time

For each marked call, swap to await waitForStable(page, { stableSeconds: 2 }). On streaming pages, bump stableSeconds to 3 or 4. On pages with a known heartbeat that should not block stability, attach the MutationObserver to a tighter target (e.g. document.querySelector('main')) instead of document.body.

6. Audit the new test runtime, then tune the quiet period

Most projects see total runtime drop because the previous fixed waits were padded for the slowest machine. Watch for tests that finish too fast and start failing on slow CI; those usually need a longer stableSeconds or an additional web-first assertion downstream. Tune per test, not globally.

Where the auto-retry approach quietly does the right thing

Every chip below is a real-world UI pattern that breaks networkidle and fixed sleeps but pairs cleanly with a stability gate plus web-first assertions.

Streaming chat replies · Server-sent events feeds · Heartbeat WebSockets · Polling /events endpoints · Infinite scroll lists · AI agent step traces · Live dashboards · Realtime stock tickers · Optimistic UI rollbacks · Search-as-you-type · Multi-step wizards · OTP / magic-link flows · Skeleton-then-content swaps · Toast notifications · Shopify Hydrogen pages

Decision rules: which retry primitive for which case

A flat list of the situations and the right tool. Tape it to your monitor.

Pick the right primitive

  • Known locator, known text or attribute: use expect(locator).toHaveText / toHaveAttribute
  • Known locator, unknown text but observable shape: use expect(locator).toBeVisible then read .textContent
  • Value lives outside the DOM (HTTP, IndexedDB, JS): use expect.poll(asyncFn).toBe(value)
  • Multi-step block where any step might race: use expect(asyncBlock).toPass({ timeout })
  • Action triggered async DOM work, locator unknown ahead of time: use waitForStable(page, { stableSeconds: 2 })
  • Page has a heartbeat WebSocket or SSE: use waitForStable, never networkidle
  • Streaming AI response that grows token by token: use waitForStable with stableSeconds 3 to 4
  • Combination: stabilize first, then assert with web-first matchers against the resulting state

The whole reason waitForStable exists is that we needed our AI agent to test streaming chat apps without ever knowing the locator up front. The MutationObserver was the only thing that worked.

Assrt agent commit history, src/core/agent.ts

Worked example: a streaming chat assertion that does not flake

Putting all four primitives together. Click send, wait for the DOM to settle, then assert against the resulting state with the web-first matchers and one expect.poll for the analytics beacon.

tests/chat-stream.spec.ts
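The worked-example listing did not survive the export. Below is a sketch of what that spec could look like; the route, test id, analytics endpoint, and helper path are illustrative assumptions, and the helper signature matches the waitForStable pattern described earlier on this page.

```typescript
import { test, expect } from '@playwright/test';
import { waitForStable } from '../playwright/helpers/wait-for-stable'; // assumed helper path

test('streaming chat reply settles before we assert', async ({ page }) => {
  await page.goto('/chat'); // hypothetical route
  await page.getByRole('textbox').fill('Summarize my open orders');
  await page.getByRole('button', { name: 'Send' }).click();

  // Stability gate: the reply streams token by token, so wait until the
  // DOM stops mutating instead of guessing a sleep. 3 s quiet period,
  // per the streaming guidance above.
  await waitForStable(page, { stableSeconds: 3 });

  // Web-first assertions against the settled state. The text is
  // non-deterministic, so assert shape, not content.
  const reply = page.getByTestId('reply'); // hypothetical test id
  await expect(reply).toBeVisible();
  await expect(reply).not.toBeEmpty();

  // Value outside the DOM: poll the analytics beacon (hypothetical endpoint).
  await expect
    .poll(async () => {
      const res = await page.request.get('/api/analytics/events');
      const events: Array<{ type: string }> = await res.json();
      return events.filter((e) => e.type === 'chat_reply').length;
    }, { timeout: 10_000 })
    .toBeGreaterThan(0);
});
```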
7/20 flakes before swap
0/20 flakes after swap

Need this pattern wired into your CI today?

30 minutes with the engineering team, walk through your flakiest spec, leave with a working waitForStable and a plan for the rest of your suite.

Frequently asked questions

What does 'auto-retry' actually mean in a Playwright assertion?

When you write expect(locator).toHaveText('Submitted'), Playwright does not check the DOM once. It re-fetches the element, evaluates the matcher, and if the assertion fails it waits a short interval and tries again. The poll loop runs until the matcher passes or the assertion timeout (default 5000 ms) is reached. The retry is invisible from your test code; you just await the expect and Playwright handles the rest. The full list of retrying matchers (toBeVisible, toContainText, toBeEnabled, toHaveAttribute, toHaveURL, and about 25 more) lives in the Playwright docs under test-assertions, and the matchers without auto-retry (toBe, toEqual, toContain) are listed in the same page so you know which ones are unsafe to use against the live page.

When should I use expect.poll versus expect.toPass versus a regular web-first assertion?

Use a web-first assertion (toHaveText, toBeVisible, etc.) for anything you can express against a single locator. Use expect.poll when the value comes from somewhere other than a locator (an HTTP response, a database row, window.localStorage) and you want polling semantics on top: expect.poll(async () => fetch('/api/orders').then(r => r.json()).then(j => j.length)).toBe(3). Use expect.toPass when you have a multi-step block that contains its own assertions and you want the entire block retried as a unit, for example a form submit followed by three expects, where any of them might be racy. The reason these three exist instead of one tool is that Playwright wants the cheap, fast path (a locator re-query) to stay cheap, and only fall back to wrapping arbitrary code in a retry harness when the test author asks for it.

What is the default timeout for Playwright's auto-retry assertions and where do I change it?

5 seconds at the assertion level. You can override per-assertion by passing a timeout to the matcher, expect(locator).toHaveText('...', { timeout: 15000 }), per-file with test.setTimeout(60_000) for the test timeout (which is a different thing), or globally in playwright.config.ts under expect.timeout. expect.poll has its own default and accepts a timeout option plus an intervals array (defaults to [100, 250, 500, 1000] ms, backing off and then holding at 1000 ms). expect.toPass defaults to a 0 ms timeout, which means it will retry until the surrounding test times out, so you almost always want to pass an explicit timeout to it.
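The timeout knobs above can be sketched as a config fragment (values illustrative, not recommendations):

```typescript
// playwright.config.ts — the two config-level timeout knobs in one place.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  timeout: 60_000,   // per-test timeout: the budget a bare toPass retries against
  expect: {
    timeout: 10_000, // auto-retry assertion timeout (default 5000 ms)
  },
});
```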

Why do Playwright's auto-retry assertions still flake on streaming AI responses?

Auto-retry assertions assume you know what you are waiting for. expect(locator).toHaveText('Hello, Matt') needs both the locator and the expected text up front. On a chat UI that streams a non-deterministic answer from an LLM, neither is stable: the answer is whatever the model returned this run, and the locator might not even exist yet because the message bubble is appended mid-stream. Hardcoded waits (page.waitForTimeout(5000)) are flaky in the other direction. waitForLoadState('networkidle') sounds right, but it only resolves after 500 ms of zero network activity, a bar that a heartbeat WebSocket or a polling /events endpoint will never let the page clear. The fix that survives all three failure modes is a DOM-level stability wait: watch the page itself, declare it stable when nothing changes for a configurable quiet period, then run your assertions against the snapshot.

What is the MutationObserver pattern and where is it implemented?

It is the block at src/core/agent.ts lines 956 through 1009 in Assrt's open-source agent. The implementation injects a MutationObserver into the page via browser.evaluate, watches childList, subtree, and characterData mutations on document.body, and increments window.__assrt_mutations on every mutation batch. A polling loop checks the counter every 500 ms; when the counter has not changed for stable_seconds (default 2 seconds, configurable up to 10), it disconnects the observer and returns. If timeout_seconds elapses first (default 30 seconds, capped at 60), it returns with a 'still changing' result. The whole thing is plain JavaScript with no external dependency. You can copy the snippet straight into a Playwright test fixture and have a working wait_for_stable in your suite tonight.

Why not just use page.waitForLoadState('networkidle')?

Because 'idle' is defined as at least 500 ms with no network connections. Three patterns break it. (1) A heartbeat WebSocket or SSE keeps the network busy forever, so networkidle never fires and you wait until the navigation timeout. (2) A poll-every-three-seconds analytics or feature-flag fetch keeps the network just busy enough to never reach idle. (3) A page that finishes its requests in 200 ms and then runs heavy client-side rendering for two more seconds will be 'idle' before it is actually ready, and your next assertion races. The MutationObserver approach observes the symptom you actually care about (the DOM has stopped changing) instead of a proxy (the network has stopped). In the rare cases where you also want to wait for network calm, you stack them: await page.waitForLoadState('domcontentloaded'); await waitForStable(page, { stableSeconds: 2 }).

How does Assrt use this internally and why does it matter for AI-driven tests?

Assrt is an AI agent that drives a real Chromium browser through @playwright/mcp to run plain-Markdown #Case scenarios. The agent does not write locator strings up front; it calls a snapshot tool to read the live accessibility tree, picks the element by role and label, then acts. Between actions, the page often does something async: a modal opens, an AI message streams, a dashboard refreshes. Hardcoded waits would make tests slow on fast pages and broken on slow ones. So the agent has wait_for_stable in its tool list (defined at agent.ts lines 186 through 195) and the system prompt explicitly tells it to call wait_for_stable after submitting forms, sending chat messages, or triggering any async operation. That decision is what makes Assrt's tests survive on streaming chat apps where every other AI test runner fights flakiness with retry counts.

Can I use this pattern without Assrt? I just want it in my Playwright project.

Yes. The pattern is portable. Add a fixture or a helper, paste the MutationObserver injection block, expose it as await waitForStable(page, { stableSeconds, timeoutSeconds }). Call it after any action that triggers async DOM work. You will see test runtime drop on fast paths (because you are no longer paying a fixed sleep) and stop flaking on slow paths (because you are no longer guessing how long to wait). The Assrt source file is MIT-licensed at github.com/m13v/assrt-mcp; copying 30 lines is fine. If you want the rest of the agent (Markdown #Case format, MCP server, video recording) you can also install it via npx assrt-mcp setup and use it as your e2e harness. Either way, no vendor lock-in: the test artifacts are .md files on your disk.

Does this replace Playwright's auto-retry assertions?

No, it complements them. The right mental model is two layers: first, get the page into a known-good state (use waitForStable for unknown-target async work, or skip it for fast deterministic transitions), then assert against that state with web-first matchers. You still want expect(locator).toHaveText, expect.poll, and expect.toPass for everything they cover. What changes is that you stop using them as a substitute for waiting, which is what most flaky test suites end up doing. Web-first assertions are not a substitute for a stability check; they are a check that runs on top of one.

I came here from a Reddit thread about flaky Playwright tests on AI apps. What is the one change that helps the most?

Replace every page.waitForTimeout(5000) and every page.waitForLoadState('networkidle') with a MutationObserver-backed waitForStable, then leave your existing toHaveText / toBeVisible assertions alone. That single change cuts the two biggest sources of flake in AI-output testing: fixed sleeps that race the model on slow days, and networkidle waits that never resolve because the app has a heartbeat. You do not need to rewrite tests, you do not need to install a new framework, you do not need to give up Playwright. You just swap the wait. If you also want the rest of the AI testing flow (intent-based scenarios, automatic test plan generation, video recording), Assrt is the open-source path; if not, the snippet alone is enough.

assrt — Open-source AI testing framework
© 2026 Assrt. MIT License.
