Playwright web-first assertions: when the selector is the flaky part

Playwright's web-first matchers auto-retry for 5000 ms by default. That fixes timing. It does not fix the case where an AI wrote getByTestId('submit-btn') and the button was renamed to submit-cta yesterday. Retry cannot rewrite a locator. This is a guide to what to use instead, anchored in the assert tool defined at agent.ts:132-143.

Matthew Diakonov, Written with AI

Published April 20, 2026 · 12 min read

Auto-retry fixes timing, not selectors

What replaces expect(locator).toBeVisible() when an agent writes the test

Playwright's 5s retry only re-queries the same locator

If the class name changed, 5000ms is wasted

Assrt exposes one assert(description, passed, evidence) tool

Verified against a live accessibility tree, not CSS paths

Evidence turns a failure into a diagnosis


4.9 from open source · MIT

runs on @playwright/mcp@0.0.70

three-field assert primitive

wait_for_stable: 30s / 2s DOM-quiet

no .spec.ts to maintain

What web-first assertions actually do

A web-first assertion is any Playwright matcher that re-evaluates its expression on a timer instead of returning an immediate verdict. The canonical call site is await expect(locator).toBeVisible(). Under the hood, Playwright re-queries the locator, checks the condition, and retries until either the condition passes or expect.timeout (default 5000 ms) expires. The family is large. Here are the 23 matchers this guide counts, for reference; recent Playwright releases ship a few more.

toBeVisible, toBeHidden, toBeEnabled, toBeDisabled, toBeChecked, toBeAttached, toBeEmpty, toBeEditable, toBeFocused, toBeInViewport, toHaveText, toContainText, toHaveValue, toHaveCount, toHaveClass, toHaveCSS, toHaveAttribute, toHaveJSProperty, toHaveURL, toHaveTitle, toHaveScreenshot, toHaveAccessibleName, toBeOK

23 web-first matchers share one retry engine. The engine re-queries the same locator; it does not reconsider the selector.

Where auto-retry quietly fails

Two kinds of flakiness show up in a real suite. One is timing: something was not ready yet. Web-first retry fixes that cleanly. The other is selector drift: the thing you pointed at does not exist any more, or exists with a different name. Retry cannot synthesize a new locator, so it spins the full 5000 ms budget and then fails with a message that just says the locator was missing.

This matters more when an AI agent wrote the test. LLMs are good at guessing plausible selectors. They are not infallible. The line between button.primary-cta and button.cta-primary is invisible to a model that has not seen the current DOM. When the assertion fails, the human triaging it gets a timeout, not a hint.
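To make the two failure modes concrete, here is a minimal simulation of a retry loop that keeps re-running one fixed query. It is a sketch under stated assumptions, not Playwright's implementation: the `Dom` type, the `simulateRetry` helper, and the constant 100 ms polling interval are all hypothetical, and time is simulated rather than measured.

```typescript
// Simplified model of a web-first retry loop. This is NOT Playwright's code;
// it only mirrors the behavior described above.
type Dom = Set<string>; // hypothetical: the test ids present on the page

interface RetryResult {
  passed: boolean;
  elapsedMs: number;
  attempts: number;
}

// Re-run the same fixed query at each simulated tick until it passes or the
// budget is spent. The selector never changes between attempts.
function simulateRetry(
  selector: string,
  domAt: (tMs: number) => Dom,
  timeoutMs = 5000,
  intervalMs = 100, // assumed constant interval, for illustration only
): RetryResult {
  let attempts = 0;
  for (let t = 0; t <= timeoutMs; t += intervalMs) {
    attempts += 1;
    if (domAt(t).has(selector)) {
      return { passed: true, elapsedMs: t, attempts };
    }
  }
  return { passed: false, elapsedMs: timeoutMs, attempts };
}

// Timing flakiness: the button appears after 300 ms. Retry fixes this.
const slowDom = (t: number): Dom => new Set(t >= 300 ? ["submit-btn"] : []);
const timing = simulateRetry("submit-btn", slowDom);

// Selector drift: the button is there from t=0, but under a new name.
const renamedDom = (_t: number): Dom => new Set(["submit-cta"]);
const drift = simulateRetry("submit-btn", renamedDom);

console.log(timing); // passes at the 300 ms tick
console.log(drift); // fails only after the whole 5000 ms budget
```

In the drift case every attempt asks the same question and gets the same answer, so the result carries no more information at 5000 ms than it did at 0 ms.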

two ways the same failure is reported

Side by side: classic `expect()` vs scenario.md

Left: a familiar Playwright spec with two web-first assertions, reached through three data-testid locators. Right: the same coverage as a markdown scenario with natural-language observables. The coverage is identical; the coupling is not. The left file breaks the moment someone renames a testid. The right file breaks only when the behavior actually changes.

// login.spec.ts  --  classic Playwright, web-first assertion
import { test, expect } from "@playwright/test";

test("sign in redirects to dashboard", async ({ page }) => {
  await page.goto("/login");
  await page.getByTestId("email-input").fill("matt@example.com");
  await page.getByTestId("password-input").fill("correct-horse");
  await page.getByTestId("submit-btn").click();

  // auto-retries for 5 seconds (expect.timeout default)
  await expect(page.getByRole("heading", { name: "Welcome back" })).toBeVisible();
  await expect(page).toHaveURL(/\/app\/dashboard/);
});

31% fewer lines

The anchor fact: the `assert` tool has three required fields

The whole design sits on one tool definition. Instead of a matcher family that couples a Locator type to a retry engine, Assrt exposes a single MCP tool with exactly three required fields. Everything else, every equivalent of toBeVisible, toHaveText, toContainText, etc., collapses into this one shape. You can read the schema yourself in the source tree.

assrt-mcp/src/core/agent.ts (lines 132-143)

Three fields. That is the whole assertion API.

description
What you are asserting, as a natural-language statement. Read by humans in execution.log and by the agent for self-check.
passed
Boolean. No soft/hard distinction at the tool level; the scenario runner decides aggregate verdict.
evidence
Free-text rationale the agent writes after observing the a11y tree. This is the field that turns a failure into a diagnosis.

Verified in /Users/matthewdi/assrt-mcp/src/core/agent.ts, lines 132-143. Twenty-three Playwright matchers map onto these three fields via English description.
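As a rough illustration of what a three-field contract implies, the sketch below models the payload as a TypeScript interface and checks it with a small validator. Only the three field names and their roles come from the text above; the interface, the `validateAssertCall` helper, and the error wording are assumptions for illustration and are not the code at agent.ts:132-143.

```typescript
// Hypothetical shape of an assert call. Field names come from the article;
// the types and the validator are assumptions, not the Assrt source.
interface AssertCall {
  description: string; // what is being asserted, in plain English
  passed: boolean; // the verdict
  evidence: string; // what the agent observed
}

const REQUIRED: Record<keyof AssertCall, "string" | "boolean"> = {
  description: "string",
  passed: "boolean",
  evidence: "string",
};

// Return the list of problems with a candidate payload; empty means valid.
function validateAssertCall(input: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const [field, expected] of Object.entries(REQUIRED)) {
    if (!(field in input)) {
      problems.push(`missing required field: ${field}`);
    } else if (typeof input[field] !== expected) {
      problems.push(`${field} must be a ${expected}`);
    }
  }
  return problems;
}

const complete = validateAssertCall({
  description: 'Dashboard heading "Welcome back" is visible',
  passed: true,
  evidence: 'Heading with accessible name "Welcome back" found in snapshot',
});
const incomplete = validateAssertCall({
  description: "URL is /app/dashboard",
  passed: false,
});

console.log(complete); // []
console.log(incomplete); // ["missing required field: evidence"]
```

Making evidence required, rather than optional, is what guarantees that a failed verdict always arrives with an explanation attached.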

What replaces `expect.timeout = 5000`

You do not set a matcher timeout because there is no matcher. You get a single wait_for_stable tool that waits until the DOM stops mutating for a stable window, or the timeout fires. Default: 30 seconds budget, 2 seconds of DOM-quiet required. The agent calls this after async actions, then re-snapshots, then asserts.

assrt-mcp/src/core/agent.ts (lines 186-194)

Why DOM quiet instead of a fixed 5 second retry? A slow-hydrating SPA gets the time it actually needs. A snappy page does not idle in a polling loop. The same code handles both.
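The quiet-window rule can be modeled deterministically. The sketch below takes a list of DOM mutation timestamps and computes when a wait with the stated defaults (30 s budget, 2 s of quiet) would return. The `waitForStableModel` function and its inputs are hypothetical, written only to illustrate the semantics described above; the real tool presumably observes a live page rather than a list of numbers.

```typescript
// Deterministic model of "wait until the DOM is quiet". The 30 s / 2 s
// defaults come from the article; the function is an illustrative
// assumption, not the code at agent.ts:186-194.
interface StableResult {
  settledAtMs: number; // when the wait returns, measured from its start
  timedOut: boolean;
}

function waitForStableModel(
  mutationTimesMs: number[], // moments at which the DOM changed
  timeoutMs = 30000,
  stableMs = 2000,
): StableResult {
  let quietSince = 0; // the quiet window starts when the wait starts
  for (const t of [...mutationTimesMs].sort((a, b) => a - b)) {
    if (t >= timeoutMs) break; // past the deadline, irrelevant
    if (t - quietSince >= stableMs) break; // a full quiet window already passed
    quietSince = t; // a mutation restarts the quiet window
  }
  const settledAtMs = quietSince + stableMs;
  return settledAtMs <= timeoutMs
    ? { settledAtMs, timedOut: false }
    : { settledAtMs: timeoutMs, timedOut: true };
}

// A snappy page: last mutation at 120 ms, so the wait returns at 2120 ms.
const snappy = waitForStableModel([50, 120]);
// A slow-hydrating SPA: quiet only begins after the 3500 ms mutation.
const slow = waitForStableModel([500, 2000, 3500, 6000]);
// A page that mutates every second never goes quiet and hits the budget.
const busy = waitForStableModel(Array.from({ length: 41 }, (_, i) => i * 1000));

console.log(snappy, slow, busy);
```

Under this model the wait is bounded below by the quiet window and above by the budget, and everything in between adapts to how long the page actually keeps changing.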

The assertion surface, before and after

Twenty-three matchers, each with its own retry semantics. Every test call site names a specific Locator, typically via getByTestId, getByRole, or getByText. A renamed testid is silently flaky: the 5000 ms retry burns before surfacing a 'locator not found'. Soft and hard modes are separate API entry points.

23 matchers, one retry engine
expect.timeout default: 5000ms
Locator drift = timeout, not diagnosis
expect.soft is an opt-in wrapper
Assumes the test author picked the right selector

The pipeline from scenario step to verdict

Three inputs: the English step, the accessibility snapshot, and the observable fact the agent is looking for. One hub: the assert tool, fed by Claude Haiku 4.5. Four outputs, written to disk under /tmp/assrt/<runId>/.

step → a11y tree → assert → artifacts

Numbers to commit to memory

3 required fields on assert

23 Playwright web-first matchers

5000 ms default expect.timeout

2 s wait_for_stable DOM-quiet

assert: 3 required fields (description, passed, evidence)
wait_for_stable: 30s timeout, 2s DOM-quiet default
wait_for_verification_code: polls every 3s, up to 60s
snapshot runs before every interaction
18 total MCP tools in the agent surface
@playwright/mcp pinned at v0.0.70

What a real run looks like end to end

The same scenario as the markdown above, executed against a Chromium instance driven by @playwright/mcp. Nine tool calls, 4.7 seconds, two assertions, both passed with evidence.

npx assrt run --url demo.app

Six things to know about the assertion surface

One assert tool, not 23 matchers

Playwright ships 23+ web-first matchers, each with its own retry semantics. Assrt exposes a single assert tool with three required fields: description, passed, evidence. Every matcher in the family collapses into natural-language observation plus a verdict.

Default expect.timeout is 5000ms

Playwright's web-first retry budget is 5 seconds per matcher. If the selector is wrong, that 5 seconds is wasted before the test surfaces an error.

wait_for_stable defaults

timeout 30s, stable 2s. The agent waits for DOM quiet instead of polling a locator, so slow SPAs get time and fast pages do not idle. Defined at agent.ts:186-194.

Accessibility tree, not CSS

The agent calls snapshot before every click and type, reads role + accessible name, and interacts via ref=eN IDs. "Click Sign in" resolves to role=button with name="Sign in", never to a class that just got renamed in a refactor.
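A minimal sketch of that lookup, assuming a snapshot is a flat list of nodes with a ref, a role, and an accessible name. The `A11yNode` shape and the `resolveRef` helper are hypothetical; the real snapshot format from @playwright/mcp is richer than this.

```typescript
// Hypothetical snapshot node. The ref=eN naming follows the article; the
// field layout is an assumption for illustration.
interface A11yNode {
  ref: string;
  role: string;
  name: string;
}

// Find the node whose role and accessible name match, ignoring case and
// surrounding whitespace. CSS classes and test ids are never consulted.
function resolveRef(
  snapshot: A11yNode[],
  role: string,
  name: string,
): string | undefined {
  const norm = (s: string) => s.trim().toLowerCase();
  return snapshot.find((n) => n.role === role && norm(n.name) === norm(name))?.ref;
}

const loginSnapshot: A11yNode[] = [
  { ref: "e3", role: "textbox", name: "Email" },
  { ref: "e5", role: "textbox", name: "Password" },
  { ref: "e7", role: "button", name: "Sign in" },
];

console.log(resolveRef(loginSnapshot, "button", "Sign in")); // "e7"
console.log(resolveRef(loginSnapshot, "button", "Log in")); // undefined
```

Because the lookup key is what a user perceives (a button called "Sign in"), a refactor that renames a class or a testid leaves the result unchanged.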

Evidence is a first-class field

Required string on every assert call. A failing test does not just say "not visible"; it says what the agent saw instead. The closest a11y match is usually enough to diagnose.

Soft by default

Every assert records into the run and does not throw. complete_scenario rolls them up. If you want hard-fail, pass a pass_criteria string to assrt_test.

How to migrate one test, not the whole suite

You do not need to delete your Playwright project. Pick one scenario that has been flaky for the wrong reason and move only that one. Ten minutes, four steps.

Move one flaky assertion

Write the observable, not the selector

Skip `getByTestId(...)`. Describe what a user would see: 'Dashboard heading "Welcome back" is visible'. The agent reads the a11y tree at runtime; your plan never ties to a DOM path that could be renamed in the next refactor.

Let wait_for_stable replace the fixed 5 second retry

Where you used to rely on `expect.timeout = 5000`, you now get a 2-second DOM-quiet window inside a 30 second budget. The agent calls it automatically after async actions (agent.ts:186-194). Override via `wait_for_stable({ timeout_seconds, stable_seconds })` when you need more or less.

Trust the assert tool's three-field contract

Every assertion your agent makes records description + passed + evidence. The evidence string becomes your failure triage: 'closest match: button with name "Log in"' is more actionable than 'locator not found'.

Keep Playwright for pixel diffs and perf

Assrt does not try to replace toHaveScreenshot or performance traces. Keep those where they are. Use Assrt where the flaky class is behavioral (visibility, enablement, URL changes, text content) and the selector was the thing breaking.

Side by side: assert vs expect

Not a replacement for Playwright. A replacement for the assertion layer that assumes you already know the right selector.

Feature	Playwright expect	Assrt assert
Primitive	expect(locator).toBeVisible() + 22 siblings	assert(description, passed, evidence)
How flaky tests are handled	Matcher re-polls the same locator for 5000ms	Agent re-snapshots a11y tree; wait_for_stable adapts
When the selector is wrong	Timeout after 5s with generic "not found"	Evidence names the closest a11y match
Timeout surface	expect.timeout: 5000ms (per matcher)	wait_for_stable: timeout 30s, stable 2s
Selector language	CSS, XPath, data-testid, getByRole	Natural-language role + accessible name
Output artifact	.spec.ts you also have to maintain	scenario.md + events.json + webm video
Soft vs hard fail	Hard by default; expect.soft to opt into soft	Soft by default; configurable via pass_criteria
Price	Playwright is free; AI codegen SaaS up to $7,500/mo	Free, MIT-licensed, self-hosted

Assrt runs on Playwright under the hood. The split is about the assertion primitive, not the browser automation engine.

Stop debugging 5-second locator timeouts

15 minutes with the team. We'll look at your flakiest Playwright test and show you exactly what the scenario.md version looks like.

Frequently asked questions

What are Playwright's web-first assertions and how do they actually work under the hood?

A web-first assertion is any Playwright matcher that takes a Locator or Page and re-evaluates the expression on a timer until it passes or the timeout expires. The canonical form is `await expect(page.getByRole('button', { name: 'Sign in' })).toBeVisible()`. Under the hood, Playwright polls the matcher, re-queries the locator against the current DOM, checks the condition, and either resolves or retries. The default expect timeout is 5000 ms (not the 30 second test timeout, which is separate). You configure it project-wide with `expect: { timeout }` in playwright.config.ts or per-call with `.toBeVisible({ timeout: 10_000 })`. The matcher family includes toBeVisible, toBeHidden, toBeEnabled, toBeDisabled, toBeChecked, toBeAttached, toBeEmpty, toBeEditable, toBeFocused, toBeInViewport, toHaveText, toContainText, toHaveValue, toHaveCount, toHaveClass, toHaveCSS, toHaveAttribute, toHaveJSProperty, toHaveURL, toHaveTitle, toHaveScreenshot, and toHaveAccessibleName, plus their response-level siblings like toBeOK. All of them share the same retry engine.
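For reference, this is roughly what the project-wide setting looks like. The 10 second value is an arbitrary example, not a recommendation.

```typescript
// playwright.config.ts
import { defineConfig } from "@playwright/test";

export default defineConfig({
  // Budget for a whole test. Separate from the assertion retry budget.
  timeout: 30_000,
  expect: {
    // Retry budget for web-first assertions (Playwright's default is 5000).
    timeout: 10_000,
  },
});

// Per-call override inside a test, for a single slow element:
//   await expect(locator).toBeVisible({ timeout: 10_000 });
```

Raising this number buys more time for a slow page. It does nothing for a selector that no longer matches anything.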

So what is wrong with web-first assertions? The docs say they fix flakiness.

They fix one kind of flakiness: timing. The agent clicks a button, the UI is in a loading state, the matcher retries for a few hundred milliseconds, the button becomes enabled, the assertion passes. Great. They do not fix the other kind: wrong selector. If the test was written against `button.primary-cta` and someone renamed it to `button.cta-primary`, web-first retry just burns 5000 ms and then fails the same way it would have failed in the first millisecond. The retry engine cannot synthesize a new locator. It re-runs the same query against a moving DOM. When an AI writes the test, this is the common failure mode: the LLM picked a selector that was plausible but not quite right, and the expect call cannot recover. The 5 second budget is lost before the test surfaces a useful error.

How does Assrt's assert primitive differ from expect(locator).toBeVisible()?

Assrt exposes one general-purpose `assert` tool with exactly three required fields: description (what you are asserting), passed (true or false), and evidence (free-text rationale). Defined at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 132 to 143. There is no Locator object, no matcher family, no timeout parameter on the assert call itself. The agent reads the live accessibility tree via the snapshot tool, finds the element whose role and accessible name match the English in the scenario, checks the observable property, and records the verdict with evidence. If the element is not there yet, the agent waits via `wait_for_stable` (default 30 seconds with a 2 second DOM-quiet window, agent.ts:186-194) before asserting. The retry happens at the agent's reasoning layer, not inside a matcher that is re-evaluating a fixed selector.

Is Assrt actually running real Playwright, or is it a separate browser runtime?

Real Playwright. The agent spawns the official Microsoft-maintained @playwright/mcp server, pinned to v0.0.70 in /Users/matthewdi/assrt/src/core/freestyle.ts (line 586, inside the baseImageSetup shell string: `npm install -g @playwright/mcp@0.0.70`). Clicks, types, navigations, screenshots, and the accessibility snapshot all go through the Playwright MCP toolset. What differs is the layer above: Assrt does not call Playwright's JavaScript expect API. It interprets an English plan, resolves a11y refs per step, and calls the assert tool once it has observable evidence. You keep Playwright's browser automation guarantees. You drop the part of the API that assumes you already know the right selector.

What does the 'evidence' field in assert look like in practice?

It is a one-line rationale the agent writes after observing the page. For a successful toBeVisible equivalent: evidence = 'Heading element with role=heading and accessible name "Welcome back" found at ref=e14 after snapshot'. For a failed toHaveURL equivalent: evidence = 'URL is /login after 3.2s wait_for_stable, expected /app/dashboard'. It is free-text, not a diff, because the audit trail is actually the /tmp/assrt/<runId>/events.json file, which logs every tool call with timestamps and payloads. The evidence string is for a human reading execution.log; the structured data is already captured per tool call. When a test fails, the failing assert plus the preceding snapshot is usually enough to diagnose without opening the .webm video. When it is not, the video auto-opens at 5x speed.
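One plausible way to produce a "closest match" style of evidence is to rank same-role nodes by edit distance to the expected name. The sketch below does that with a standard Levenshtein distance. This is an assumption about how such a string could be built; in Assrt the evidence is free text written by the agent, and the `SnapshotNode` type and `buildEvidence` helper here are hypothetical.

```typescript
interface SnapshotNode {
  ref: string;
  role: string;
  name: string;
}

// Classic dynamic-programming edit distance, keeping a single row in memory.
function editDistance(a: string, b: string): number {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0];
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j];
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diag + cost);
      diag = above;
    }
  }
  return row[b.length];
}

// Produce a verdict plus a human-readable evidence string.
function buildEvidence(
  nodes: SnapshotNode[],
  role: string,
  wantedName: string,
): { passed: boolean; evidence: string } {
  const sameRole = nodes.filter((n) => n.role === role);
  const exact = sameRole.find((n) => n.name === wantedName);
  if (exact) {
    return { passed: true, evidence: `${role} with name "${wantedName}" found at ref=${exact.ref}` };
  }
  if (sameRole.length === 0) {
    return { passed: false, evidence: `no ${role} in snapshot` };
  }
  const closest = sameRole.reduce((best, n) =>
    editDistance(n.name, wantedName) < editDistance(best.name, wantedName) ? n : best,
  );
  return {
    passed: false,
    evidence: `closest match: ${role} with name "${closest.name}" at ref=${closest.ref}`,
  };
}

const page: SnapshotNode[] = [
  { ref: "e7", role: "button", name: "Log in" },
  { ref: "e9", role: "button", name: "Forgot password?" },
];
const result = buildEvidence(page, "button", "Sign in");
console.log(result.evidence);
```

A reader who sees "Log in" named in the failure can tell at a glance that the button was relabeled, which a bare timeout never reveals.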

What about the built-in Playwright matchers that do not have an obvious assert equivalent, like toHaveScreenshot or toHaveCSS?

Pixel-level visual regression (toHaveScreenshot) and computed-style assertions (toHaveCSS) are deliberately not in the Assrt model. If you rely on pixel diffs, keep Playwright's visual regression alongside Assrt scenarios. They compose cleanly because Assrt does not monopolise the browser. For things that matter at the user level (element is visible, URL changed, text appeared, input accepts a value, modal dismissed), the single assert tool covers the same ground as the 20+ matcher family because the agent can phrase any of them as a natural-language observation. toBeChecked becomes 'Checkbox labeled "Remember me" is checked'. toHaveCount becomes 'Three product cards are rendered under the Featured heading'. The agent reads the a11y tree, verifies, and writes evidence. The split is: keep Playwright for pixel diffs and performance; use Assrt for behavior.
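The matcher-to-sentence translation is done by whoever writes the scenario, but the mapping itself is mechanical enough to sketch. The `Observable` union and `toObservation` helper below are hypothetical and cover only four matchers; they exist to show the shape of the translation, not any Assrt API.

```typescript
// Hypothetical, partial mapping from a Playwright matcher to the English
// observation that would appear in an assert description.
type Observable =
  | { matcher: "toBeVisible"; role: string; name: string }
  | { matcher: "toBeChecked"; label: string }
  | { matcher: "toHaveURL"; path: string }
  | { matcher: "toHaveCount"; count: number; items: string; under: string };

function toObservation(o: Observable): string {
  switch (o.matcher) {
    case "toBeVisible":
      return `${o.role} "${o.name}" is visible`;
    case "toBeChecked":
      return `Checkbox labeled "${o.label}" is checked`;
    case "toHaveURL":
      return `URL is ${o.path}`;
    case "toHaveCount":
      return `${o.count} ${o.items} are rendered under the ${o.under} heading`;
  }
}

const checked = toObservation({ matcher: "toBeChecked", label: "Remember me" });
const counted = toObservation({
  matcher: "toHaveCount",
  count: 3,
  items: "product cards",
  under: "Featured",
});

console.log(checked); // Checkbox labeled "Remember me" is checked
console.log(counted); // 3 product cards are rendered under the Featured heading
```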

Does the agent flake when the page is still loading? What replaces the 5 second expect retry?

Two mechanisms. First, `wait_for_stable` (agent.ts:186-194): the agent calls this after any async action, and it blocks until no DOM mutations happen for a stable window (default 2 seconds) or the timeout fires (default 30 seconds, configurable per call). This suits pages of different speeds better than a fixed 5 second budget: a slow-hydrating SPA gets the time it actually needs, and a snappy page returns as soon as its quiet window has passed. Second, the agent is reasoning-based, so after a failed observation it can call snapshot again, scroll, wait longer, or try a different path. A Playwright expect retry can only re-query the same locator. The practical effect shows up on the sad path: a drifted selector costs Playwright the full 5 s budget and yields a timeout, while an Assrt run reports evidence of what the agent saw, not just that a locator was missing. On the happy path the two are closer than the numbers suggest, because a passing Playwright matcher returns as soon as its condition holds (5 s is a ceiling, not a floor) and wait_for_stable always waits out its quiet window.

I use soft assertions (expect.soft) today to collect multiple failures per test. Is there an equivalent?

Yes, implicitly. In Playwright, expect.soft keeps the test running after a failed assertion so you collect all failures, then fail the test at the end. In Assrt, every assert call is soft by default: passed=false does not throw, it just records the result in the scenario run. A scenario finishes with complete_scenario (agent.ts:146-155), which takes a summary and a scenario-level passed boolean. If any individual assert was passed=false, the agent reports that in the summary. You can implement hard-fail semantics by having the scenario stop the moment an assert fails, via the pass_criteria string passed to assrt_test; or you can run through the whole plan and surface N failures at once. Both patterns work because assertions are data, not exceptions.
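"Assertions are data, not exceptions" can be sketched in a few lines. The `ScenarioRun` class below is hypothetical: it records every assert without throwing and rolls the records up at the end, in the spirit of complete_scenario. The summary format is an assumption.

```typescript
interface AssertRecord {
  description: string;
  passed: boolean;
  evidence: string;
}

// Hypothetical run recorder. Nothing here is the Assrt implementation.
class ScenarioRun {
  private readonly records: AssertRecord[] = [];

  // Recording never throws, even when passed is false.
  assert(description: string, passed: boolean, evidence: string): void {
    this.records.push({ description, passed, evidence });
  }

  // Roll every recorded assertion up into one scenario-level verdict.
  complete(): { passed: boolean; summary: string; failures: AssertRecord[] } {
    const failures = this.records.filter((r) => !r.passed);
    const total = this.records.length;
    return {
      passed: failures.length === 0,
      summary: `${total - failures.length}/${total} assertions passed`,
      failures,
    };
  }
}

const run = new ScenarioRun();
run.assert('Heading "Welcome back" is visible', true, "found at ref=e14");
run.assert("URL is /app/dashboard", false, "URL is /login");
run.assert('Button "Sign out" is visible', false, "no button with that name");

const verdict = run.complete();
console.log(verdict.summary); // 1/3 assertions passed
console.log(verdict.failures.map((f) => f.evidence));
```

Hard-fail behavior is then a policy on top of the same data: stop the loop at the first record with passed set to false instead of collecting them all.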

Does Assrt generate real Playwright code I can check into my repo, or is it proprietary?

Neither. Assrt deliberately does not emit a .spec.ts file. The plan you check into the repo is a markdown file (scenario.md) with #Case headers and English steps. The assertions that get executed are logged per run into /tmp/assrt/<runId>/events.json, and the final verdict is in /tmp/assrt/<runId>/results.json. No .spec.ts means no code-maintenance debt; no proprietary YAML means no vendor lock-in. If you cancel the vendor, the markdown is still a valid test description a human can read, and any Playwright-literate engineer can translate it back into a .spec.ts by hand in 15 minutes. Competitors that charge $7,500 a month typically store tests in their cloud database; you cannot grep them.

How do I migrate one flaky test from expect().toBeVisible() to Assrt without rewriting my whole suite?

Four steps, about 10 minutes for a single scenario. One: identify the test that is flaking because of selector drift, not because of a real regression. Two: write a scenario.md with one #Case block. The header is the test name, the body is 3 to 5 lines describing the actions and the observable you want to verify. Three: run `npx assrt run --url <your staging URL> --plan-file scenario.md`. The agent opens Chromium via @playwright/mcp@0.0.70, executes each step, and emits an execution.log plus a .webm. Four: if it passes deterministically across 5 runs, delete the flaky .spec.ts; if it fails, read the evidence strings to see whether the a11y role and accessible name are missing from the tree (that is a real accessibility bug) or whether the English step needs rephrasing. You can do this one test at a time. Assrt scenarios live next to .spec.ts files without interfering.

