Playwright web-first assertions: when the selector is the flaky part
Playwright's web-first matchers auto-retry for 5000 ms by default. That fixes timing. It does not fix the case where an AI wrote getByTestId('submit-btn') and the button was renamed to submit-cta yesterday. Retry cannot rewrite a locator. This is a guide to what to use instead, anchored in the assert tool defined at agent.ts:132-143.
What web-first assertions actually do
A web-first assertion is any Playwright matcher that re-evaluates its expression on a timer instead of returning an immediate verdict. The canonical call site is await expect(locator).toBeVisible(). Under the hood, Playwright re-queries the locator, checks the condition, and retries until either the condition passes or expect.timeout (default 5000 ms) expires. The family is large. The list below covers 23 of its matchers, for reference; recent Playwright releases add more, such as toHaveId, toHaveRole, and toHaveValues.
toBeVisible, toBeHidden, toBeEnabled, toBeDisabled, toBeChecked, toBeAttached, toBeEmpty, toBeEditable, toBeFocused, toBeInViewport, toHaveText, toContainText, toHaveValue, toHaveCount, toHaveClass, toHaveCSS, toHaveAttribute, toHaveJSProperty, toHaveURL, toHaveTitle, toHaveScreenshot, toHaveAccessibleName, toBeOK
23 web-first matchers share one retry engine. The engine re-queries the same locator; it does not reconsider the selector.
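To make the retry behavior concrete, here is a minimal sketch of a polling loop with a simulated clock. It is a simplified model, not Playwright's implementation: the function name `retryAssertion` and the 100 ms polling interval are assumptions chosen for illustration; only the 5000 ms budget comes from the text above.

```typescript
// Simplified model of a web-first retry loop. This is NOT Playwright's
// implementation; the clock is simulated so the example runs instantly.
interface RetryResult {
  passed: boolean;
  elapsedMs: number;
  attempts: number;
}

function retryAssertion(
  check: (nowMs: number) => boolean, // re-evaluated on every poll
  timeoutMs = 5000, // mirrors the expect.timeout default
  intervalMs = 100 // assumed polling interval, for illustration only
): RetryResult {
  let attempts = 0;
  for (let nowMs = 0; nowMs <= timeoutMs; nowMs += intervalMs) {
    attempts++;
    if (check(nowMs)) return { passed: true, elapsedMs: nowMs, attempts };
  }
  return { passed: false, elapsedMs: timeoutMs, attempts };
}

// Timing flake: the element becomes visible 300 ms in, so retry rescues it.
const timing = retryAssertion((nowMs) => nowMs >= 300);

// Wrong selector: the check can never pass, so the whole budget is spent.
const neverMatches = retryAssertion(() => false);
```

In this model `timing` passes on the fourth poll, while `neverMatches` spends the full 5000 ms and fails exactly as it would have on the first poll.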
Where auto-retry quietly fails
Two kinds of flakiness show up in a real suite. One is timing: something was not ready yet. Web-first retry fixes that cleanly. The other is selector drift: the thing you pointed at does not exist any more, or exists with a different name. Retry cannot synthesize a new locator, so it spins the full 5000 ms budget and then fails with a message that just says the locator was missing.
This matters more when an AI agent wrote the test. LLMs are good at guessing plausible selectors. They are not infallible. The line between button.primary-cta and button.cta-primary is invisible to a model that has not seen the current DOM. When the assertion fails, the human triaging it gets a timeout, not a hint.
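Selector drift can be sketched with a toy DOM. Everything here is hypothetical (the `FakeElement` type, the two lookup helpers, and the "Sign in" accessible name); the point is only that a renamed test id breaks an id-based lookup while a lookup by role and accessible name still resolves.

```typescript
// Toy DOM model: each element carries a test id plus its a11y identity.
interface FakeElement {
  testId: string;
  role: string;
  name: string; // accessible name
}

// Hypothetical lookup helpers, standing in for real locator strategies.
const byTestId = (dom: FakeElement[], id: string) =>
  dom.find((el) => el.testId === id);

const byRole = (dom: FakeElement[], role: string, name: string) =>
  dom.find((el) => el.role === role && el.name === name);

// Yesterday's DOM, and today's after a refactor renamed the test id.
const yesterday: FakeElement[] = [
  { testId: "submit-btn", role: "button", name: "Sign in" },
];
const today: FakeElement[] = [
  { testId: "submit-cta", role: "button", name: "Sign in" },
];

// The stale id finds nothing today, and no amount of retrying changes that.
const staleLookup = byTestId(today, "submit-btn");

// The role + accessible name pair survived the rename.
const roleLookup = byRole(today, "button", "Sign in");
```

A retry loop wrapped around `staleLookup` would return `undefined` on every attempt, which is why the failure surfaces as a timeout rather than a hint.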
Side by side: classic expect() vs scenario.md
Left: a familiar Playwright spec with three web-first assertions behind data-testids. Right: the same coverage as a markdown scenario with natural-language observables. The coverage is identical; the coupling is not. The left file breaks the moment someone renames a testid. The right file breaks only when the behavior actually changes.
login.spec.ts vs scenario.md
```typescript
// login.spec.ts -- classic Playwright, web-first assertion
import { test, expect } from "@playwright/test";

test("sign in redirects to dashboard", async ({ page }) => {
  await page.goto("/login");
  await page.getByTestId("email-input").fill("matt@example.com");
  await page.getByTestId("password-input").fill("correct-horse");
  await page.getByTestId("submit-btn").click();

  // auto-retries for 5 seconds (expect.timeout default)
  await expect(page.getByRole("heading", { name: "Welcome back" })).toBeVisible();
  await expect(page).toHaveURL(/\/app\/dashboard/);
});
```
The anchor fact: the assert tool has three required fields
The whole design sits on one tool definition. Instead of a matcher family that couples a Locator type to a retry engine, Assrt exposes a single MCP tool with exactly three required fields. Everything else, every equivalent of toBeVisible, toHaveText, toContainText, etc., collapses into this one shape. You can read the schema yourself in the source tree.
Three fields. That is the whole assertion API.
- description: What you are asserting, as a natural-language statement. Read by humans in execution.log and by the agent for self-check.
- passed: Boolean. No soft/hard distinction at the tool level; the scenario runner decides aggregate verdict.
- evidence: Free-text rationale the agent writes after observing the a11y tree. This is the field that turns a failure into a diagnosis.
Verified in /Users/matthewdi/assrt-mcp/src/core/agent.ts, lines 132-143. Twenty-three Playwright matchers map onto these three fields via English description.
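The three-field contract can be sketched as a TypeScript type. This is a reconstruction from the field descriptions above, not a copy of the schema in agent.ts, which may carry extra metadata; the `isAssertCall` guard is a hypothetical helper added for illustration.

```typescript
// Assumed shape of an assert call, reconstructed from the three field
// descriptions in this article. The real schema in agent.ts may differ.
interface AssertCall {
  description: string; // what is being asserted, in plain English
  passed: boolean; // the verdict
  evidence: string; // what the agent observed
}

// Hypothetical guard: rejects payloads missing any required field.
function isAssertCall(value: unknown): value is AssertCall {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["description"] === "string" &&
    typeof v["passed"] === "boolean" &&
    typeof v["evidence"] === "string"
  );
}

// A toBeVisible equivalent expressed in the three-field shape.
const headingVisible: AssertCall = {
  description: 'Dashboard heading "Welcome back" is visible',
  passed: true,
  evidence: 'Heading with accessible name "Welcome back" found in snapshot',
};
```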
What replaces expect.timeout = 5000
You do not set a matcher timeout because there is no matcher. You get a single wait_for_stable tool that waits until the DOM stops mutating for a stable window, or the timeout fires. Defaults: a 30-second budget, with 2 seconds of DOM quiet required. The agent calls this after async actions, then re-snapshots, then asserts.
Why DOM quiet instead of a fixed 5 second retry? A slow-hydrating SPA gets the time it actually needs. A snappy page does not idle in a polling loop. The same code handles both.
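The DOM-quiet idea can be modeled as a pure function over mutation timestamps. This is a sketch under stated assumptions, not the code at agent.ts:186-194: `simulateWaitForStable` is a hypothetical name, the wait is assumed to start at t = 0, and only the 30 s and 2 s defaults come from the text.

```typescript
interface StableResult {
  stable: boolean; // true if a quiet window completed within the budget
  resolvedAtMs: number; // when the wait returned
}

// Simulated DOM-quiet wait. Input: times (ms) at which the DOM mutated.
function simulateWaitForStable(
  mutationTimesMs: number[],
  timeoutMs = 30000, // default budget from the article
  stableMs = 2000 // default quiet window from the article
): StableResult {
  const sorted = [...mutationTimesMs].sort((a, b) => a - b);
  let quietSince = 0; // the wait is assumed to start at t = 0
  for (const t of sorted) {
    // A full quiet window elapsed before this mutation: stop scanning.
    if (t - quietSince >= stableMs) break;
    quietSince = t; // each mutation restarts the quiet window
  }
  const resolvedAtMs = quietSince + stableMs;
  return resolvedAtMs <= timeoutMs
    ? { stable: true, resolvedAtMs }
    : { stable: false, resolvedAtMs: timeoutMs };
}

// Snappy page: last mutation at 400 ms, so the wait ends at 2400 ms.
const snappy = simulateWaitForStable([100, 400]);

// Slow-hydrating SPA: mutations keep arriving until 3500 ms.
const slowSpa = simulateWaitForStable([500, 2000, 3500, 6000]);
```

In this model the snappy page resolves at 2400 ms and the slow SPA at 5500 ms, with no change to the calling code.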
The assertion surface, before and after
Twenty-three matchers, each with its own retry semantics. Every test call site names a specific Locator, typically via getByTestId, getByRole, or getByText. A renamed testid is silently flaky: the 5000 ms retry burns before surfacing a 'locator not found'. Soft and hard modes are separate API entry points.
- 23 matchers, one retry engine
- expect.timeout default: 5000ms
- Locator drift = timeout, not diagnosis
- expect.soft is an opt-in wrapper
- Assumes the test author picked the right selector
The pipeline from scenario step to verdict
Three inputs: the English step, the accessibility snapshot, and the observable fact the agent is looking for. One hub: the assert tool, fed by Claude Haiku 4.5. Four outputs, written to disk under /tmp/assrt/<runId>/.
step → a11y tree → assert → artifacts
Numbers to commit to memory
3 required fields on assert
23 Playwright web-first matchers
5000 ms default expect.timeout
2 s wait_for_stable DOM-quiet
- assert: 3 required fields (description, passed, evidence)
- wait_for_stable: 30s timeout, 2s DOM-quiet default
- wait_for_verification_code: polls every 3s, up to 60s
- snapshot runs before every interaction
- 18 total MCP tools in the agent surface
- @playwright/mcp pinned at v0.0.70
What a real run looks like end to end
The same scenario as the markdown above, executed against a Chromium instance driven by @playwright/mcp. Nine tool calls, 4.7 seconds, two assertions, both passed with evidence.
Six things to know about the assertion surface
One assert tool, not 23 matchers
Playwright ships 23+ web-first matchers, each with its own retry semantics. Assrt exposes a single assert tool with three required fields: description, passed, evidence. Every matcher in the family collapses into natural-language observation plus a verdict.
Default expect.timeout was 5000ms
Playwright's web-first retry budget is 5 seconds per matcher. If the selector is wrong, that 5 seconds is wasted before the test surfaces an error.
wait_for_stable defaults
Timeout 30s, stable 2s. The agent waits for DOM quiet instead of polling a locator, so slow SPAs get time and fast pages do not idle. Defined at agent.ts:186-194.
Accessibility tree, not CSS
The agent calls snapshot before every click and type, reads role + accessible name, and interacts via ref=eN IDs. "Click Sign in" resolves to role=button with name="Sign in", never to a class that just got renamed in a refactor.
Evidence is a first-class field
Required string on every assert call. A failing test does not just say "not visible"; it says what the agent saw instead. The closest a11y match is usually enough to diagnose.
Soft by default
Every assert records into the run and does not throw. complete_scenario rolls them up. If you want hard-fail, pass a pass_criteria string to assrt_test.
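The "assertions are data" idea from the last card can be sketched as a roll-up over recorded results. The `rollUp` function and its `mode` flag are hypothetical stand-ins for illustration; they do not reproduce complete_scenario or the pass_criteria mechanism, whose exact behavior is not shown here.

```typescript
interface AssertRecord {
  description: string;
  passed: boolean;
  evidence: string;
}

interface ScenarioVerdict {
  passed: boolean;
  evaluated: number; // how many records were looked at
  failures: AssertRecord[];
}

// Hypothetical roll-up: "soft" collects every failure, "hard" stops at the first.
function rollUp(
  records: AssertRecord[],
  mode: "soft" | "hard" = "soft"
): ScenarioVerdict {
  const failures: AssertRecord[] = [];
  let evaluated = 0;
  for (const record of records) {
    evaluated++;
    if (!record.passed) {
      failures.push(record);
      if (mode === "hard") break;
    }
  }
  return { passed: failures.length === 0, evaluated, failures };
}

const run: AssertRecord[] = [
  { description: "URL is /app/dashboard", passed: false, evidence: "URL is /login" },
  { description: 'Heading "Welcome back" is visible', passed: true, evidence: "found at ref=e14" },
  { description: "Logout link is visible", passed: false, evidence: "no link named Logout" },
];

const softVerdict = rollUp(run); // surfaces both failures at once
const hardVerdict = rollUp(run, "hard"); // stops after the first failure
```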
How to migrate one test, not the whole suite
You do not need to delete your Playwright project. Pick one scenario that has been flaky for the wrong reason and move only that one. Ten minutes, four steps.
Move one flaky assertion
Write the observable, not the selector
Skip `getByTestId(...)`. Describe what a user would see: 'Dashboard heading "Welcome back" is visible'. The agent reads the a11y tree at runtime; your plan never ties to a DOM path that could be renamed in the next refactor.
Let wait_for_stable replace the fixed 5 second retry
Where you used to rely on `expect.timeout = 5000`, you now get a 2-second DOM-quiet window inside a 30 second budget. The agent calls it automatically after async actions (agent.ts:186-194). Override via `wait_for_stable({ timeout_seconds, stable_seconds })` when you need more or less.
Trust the assert tool's three-field contract
Every assertion your agent makes records description + passed + evidence. The evidence string becomes your failure triage: 'closest match: button with name "Log in"' is more actionable than 'locator not found'.
Keep Playwright for pixel diffs and perf
Assrt does not try to replace toHaveScreenshot or performance traces. Keep those where they are. Use Assrt where the flaky class is behavioral (visibility, enablement, URL changes, text content) and the selector was the thing breaking.
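As an illustration of the three-field contract in the steps above, here is a sketch of how an evidence string might name the closest accessibility match when the exact element is missing. The `A11yNode` shape, the word-overlap scoring, and `assertVisible` are all assumptions; in Assrt the evidence is written by the agent, not by a fixed function like this.

```typescript
interface A11yNode {
  ref: string; // e.g. "e3", in the style of the ref=eN ids mentioned above
  role: string;
  name: string; // accessible name
}

// Crude similarity, assumed for illustration: number of shared lowercase words.
function sharedWords(a: string, b: string): number {
  const words = new Set(a.toLowerCase().split(/\s+/));
  return b.toLowerCase().split(/\s+/).filter((w) => words.has(w)).length;
}

function assertVisible(snapshot: A11yNode[], role: string, name: string) {
  const description = `${role} "${name}" is visible`;
  const exact = snapshot.find((n) => n.role === role && n.name === name);
  if (exact) {
    const evidence = `found ${role} "${name}" at ref=${exact.ref}`;
    return { description, passed: true, evidence };
  }
  // No exact match: report the nearest same-role node instead of a bare timeout.
  let closest: A11yNode | undefined;
  let best = 0;
  for (const n of snapshot.filter((node) => node.role === role)) {
    const score = sharedWords(n.name, name);
    if (score > best) {
      best = score;
      closest = n;
    }
  }
  const evidence = closest
    ? `closest match: ${role} with name "${closest.name}" at ref=${closest.ref}`
    : `no ${role} resembling "${name}" in snapshot`;
  return { description, passed: false, evidence };
}

const snapshot: A11yNode[] = [
  { ref: "e3", role: "button", name: "Log in" },
  { ref: "e7", role: "link", name: "Forgot password" },
];
const result = assertVisible(snapshot, "button", "Sign in");
```

Here `result.evidence` points the reader at the "Log in" button, which is usually enough to tell a rename from a regression.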
Side by side: assert vs expect
Not a replacement for Playwright. A replacement for the assertion layer that assumes you already know the right selector.
| Feature | Playwright expect | Assrt assert |
|---|---|---|
| Primitive | expect(locator).toBeVisible() + 22 siblings | assert(description, passed, evidence) |
| How flaky tests are handled | Matcher re-polls the same locator for 5000ms | Agent re-snapshots a11y tree; wait_for_stable adapts |
| When the selector is wrong | Timeout after 5s with generic "not found" | Evidence names the closest a11y match |
| Timeout surface | expect.timeout: 5000ms (per matcher) | wait_for_stable: timeout 30s, stable 2s |
| Selector language | CSS, XPath, data-testid, getByRole | Natural-language role + accessible name |
| Output artifact | .spec.ts you also have to maintain | scenario.md + events.json + webm video |
| Soft vs hard fail | Hard by default; expect.soft to opt into soft | Soft by default; configurable via pass_criteria |
| Price | Playwright is free; AI codegen SaaS up to $7,500/mo | Free, MIT-licensed, self-hosted |
Assrt runs on Playwright under the hood. The split is about the assertion primitive, not the browser automation engine.
Frequently asked questions
What are Playwright's web-first assertions and how do they actually work under the hood?
A web-first assertion is any Playwright matcher that takes a Locator or Page and re-evaluates the expression on a timer until it passes or the timeout expires. The canonical form is `await expect(page.getByRole('button', { name: 'Sign in' })).toBeVisible()`. Under the hood, Playwright polls the matcher, re-queries the locator against the current DOM, checks the condition, and either resolves or retries. The default expect timeout is 5000 ms (not the 30-second test timeout, which is separate; the action timeout is separate again and unlimited by default). You configure it per-project with `expect.timeout` in playwright.config.ts or per-call with `.toBeVisible({ timeout: 10_000 })`. The matcher family includes toBeVisible, toBeHidden, toBeEnabled, toBeDisabled, toBeChecked, toBeAttached, toBeEmpty, toBeEditable, toBeFocused, toBeInViewport, toHaveText, toContainText, toHaveValue, toHaveCount, toHaveClass, toHaveCSS, toHaveAttribute, toHaveJSProperty, toHaveURL, toHaveTitle, toHaveScreenshot, and toHaveAccessibleName, plus their response-level siblings like toBeOK. All of them share the same retry engine.
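The precedence between the per-call option, the project-level `expect.timeout`, and the built-in default can be modeled in a few lines. `resolveExpectTimeout` is a hypothetical helper that mirrors the documented precedence; it is not a function Playwright exports.

```typescript
// Built-in default for expect timeouts, as stated above.
const DEFAULT_EXPECT_TIMEOUT_MS = 5000;

// Hypothetical model: per-call option wins, then project config, then default.
function resolveExpectTimeout(perCallMs?: number, configMs?: number): number {
  return perCallMs ?? configMs ?? DEFAULT_EXPECT_TIMEOUT_MS;
}

const noOverrides = resolveExpectTimeout(); // falls back to the default
const fromConfig = resolveExpectTimeout(undefined, 10000); // project setting
const fromCall = resolveExpectTimeout(2000, 10000); // per-call option wins
```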
So what is wrong with web-first assertions? The docs say they fix flakiness.
They fix one kind of flakiness: timing. The agent clicks a button, the UI is in a loading state, the matcher retries for a few hundred milliseconds, the button becomes enabled, the assertion passes. Great. They do not fix the other kind: wrong selector. If the test was written against `button.primary-cta` and someone renamed it to `button.cta-primary`, web-first retry just burns 5000 ms and then fails the same way it would have failed in the first millisecond. The retry engine cannot synthesize a new locator. It re-runs the same query against a moving DOM. When an AI writes the test, this is the common failure mode: the LLM picked a selector that was plausible but not quite right, and the expect call cannot recover. The 5 second budget is lost before the test surfaces a useful error.
How does Assrt's assert primitive differ from expect(locator).toBeVisible()?
Assrt exposes one general-purpose `assert` tool with exactly three required fields: description (what you are asserting), passed (true or false), and evidence (free-text rationale). Defined at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 132 to 143. There is no Locator object, no matcher family, no timeout parameter on the assert call itself. The agent reads the live accessibility tree via the snapshot tool, finds the element whose role and accessible name match the English in the scenario, checks the observable property, and records the verdict with evidence. If the element is not there yet, the agent waits via `wait_for_stable` (default 30 seconds with a 2 second DOM-quiet window, agent.ts:186-194) before asserting. The retry happens at the agent's reasoning layer, not inside a matcher that is re-evaluating a fixed selector.
Is Assrt actually running real Playwright, or is it a separate browser runtime?
Real Playwright. The agent spawns the official Microsoft-maintained @playwright/mcp server, pinned to v0.0.70 in /Users/matthewdi/assrt/src/core/freestyle.ts (line 586, inside the baseImageSetup shell string: `npm install -g @playwright/mcp@0.0.70`). Clicks, types, navigations, screenshots, and the accessibility snapshot all go through the Playwright MCP toolset. What differs is the layer above: Assrt does not call Playwright's JavaScript expect API. It interprets an English plan, resolves a11y refs per step, and calls the assert tool once it has observable evidence. You keep Playwright's browser automation guarantees. You drop the part of the API that assumes you already know the right selector.
What does the 'evidence' field in assert look like in practice?
It is a one-line rationale the agent writes after observing the page. For a successful toBeVisible equivalent: evidence = 'Heading element with role=heading and accessible name "Welcome back" found at ref=e14 after snapshot'. For a failed toHaveURL equivalent: evidence = 'URL is /login after 3.2s wait_for_stable, expected /app/dashboard'. It is free-text, not a diff, because the audit trail is actually the /tmp/assrt/<runId>/events.json file, which logs every tool call with timestamps and payloads. The evidence string is for a human reading execution.log; the structured data is already captured per tool call. When a test fails, the failing assert plus the preceding snapshot is usually enough to diagnose without opening the .webm video. When it is not, the video auto-opens at 5x speed.
What about the built-in Playwright matchers that do not have an obvious assert equivalent, like toHaveScreenshot or toHaveCSS?
Pixel-level visual regression (toHaveScreenshot) and computed-style assertions (toHaveCSS) are deliberately not in the Assrt model. If you rely on pixel diffs, keep Playwright's visual regression alongside Assrt scenarios. They compose cleanly because Assrt does not monopolise the browser. For things that matter at the user level (element is visible, URL changed, text appeared, input accepts a value, modal dismissed), the single assert tool covers the same ground as the 20+ matcher family because the agent can phrase any of them as a natural-language observation. toBeChecked becomes 'Checkbox labeled "Remember me" is checked'. toHaveCount becomes 'Three product cards are rendered under the Featured heading'. The agent reads the a11y tree, verifies, and writes evidence. The split is: keep Playwright for pixel diffs and performance; use Assrt for behavior.
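The matcher-to-English mapping described above can be sketched as a small translation table. The `Observation` union and `toDescription` are hypothetical; in practice a person or the agent writes these sentences by hand, and the exact phrasing is free-form.

```typescript
// A few matcher calls, reduced to the data a sentence needs.
type Observation =
  | { matcher: "toBeVisible"; target: string }
  | { matcher: "toBeChecked"; label: string }
  | { matcher: "toHaveCount"; count: number; item: string; under: string }
  | { matcher: "toHaveURL"; path: string };

// Hypothetical translator from a matcher call to a natural-language observable.
function toDescription(o: Observation): string {
  switch (o.matcher) {
    case "toBeVisible":
      return `${o.target} is visible`;
    case "toBeChecked":
      return `Checkbox labeled "${o.label}" is checked`;
    case "toHaveCount":
      return `${o.count} ${o.item} are rendered under the ${o.under} heading`;
    case "toHaveURL":
      return `URL is ${o.path}`;
  }
}

const checked = toDescription({ matcher: "toBeChecked", label: "Remember me" });
const counted = toDescription({
  matcher: "toHaveCount",
  count: 3,
  item: "product cards",
  under: "Featured",
});
```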
Does the agent flake when the page is still loading? What replaces the 5 second expect retry?
Two mechanisms. First, `wait_for_stable` (agent.ts:186-194): the agent calls this after any async action, and it blocks until no DOM mutations happen for a stable window (default 2 seconds) or the timeout fires (default 30 seconds, configurable per call). This is better than a fixed 5 second retry because a slow-hydrating SPA gets the time it actually needs, and a snappy page does not idle. Second, the agent is reasoning-based, so after a failed observation it can call snapshot again, scroll, wait longer, or try a different path. A Playwright expect retry can only re-query the same locator. The practical effect: under the same network conditions, Assrt runs tend to finish faster on the happy path (no fixed 5s polling floor) and more informatively on the sad path (evidence explains what the agent saw, not just that a locator was missing).
I use soft assertions (expect.soft) today to collect multiple failures per test. Is there an equivalent?
Yes, implicitly. In Playwright, expect.soft keeps the test running after a failed assertion so you collect all failures, then fail the test at the end. In Assrt, every assert call is soft by default: passed=false does not throw, it just records the result in the scenario run. A scenario finishes with complete_scenario (agent.ts:146-155), which takes a summary and a scenario-level passed boolean. If any individual assert was passed=false, the agent reports that in the summary. You can implement hard-fail semantics by having the scenario stop the moment an assert fails, via the pass_criteria string passed to assrt_test; or you can run through the whole plan and surface N failures at once. Both patterns work because assertions are data, not exceptions.
Does Assrt generate real Playwright code I can check into my repo, or is it proprietary?
Neither. Assrt deliberately does not emit a .spec.ts file. The plan you check into the repo is a markdown file (scenario.md) with #Case headers and English steps. The assertions that get executed are logged per run into /tmp/assrt/<runId>/events.json, and the final verdict is in /tmp/assrt/<runId>/results.json. No .spec.ts means no code-maintenance debt; no proprietary YAML means no vendor lock-in. If you cancel the vendor, the markdown is still a valid test description a human can read, and any Playwright-literate engineer can translate it back into a .spec.ts by hand in 15 minutes. Competitors that charge $7,500 a month typically store tests in their cloud database; you cannot grep them.
How do I migrate one flaky test from expect().toBeVisible() to Assrt without rewriting my whole suite?
Four steps, takes about 10 minutes for a single scenario. One: identify the test that is flaking because of selector drift, not because of a real regression. Two: write a scenario.md with one #Case block. The header is the test name, the body is 3 to 5 lines describing the actions and the observable you want to verify. Three: run `npx assrt run --url <your staging URL> --plan-file scenario.md`. The agent opens Chromium via @playwright/mcp@0.0.70, executes each step, and emits an execution.log plus a .webm. Four: if it passes deterministically across 5 runs, delete the flaky .spec.ts; if it fails, read the evidence strings to see whether the a11y role and accessible name are discoverable (that is a real accessibility bug) or whether the English step needs rephrasing. You can do this one test at a time. Assrt scenarios live next to .spec.ts files without interfering.
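To show what a one-case scenario.md might contain, here is a sketch that parses a plan into a name and a list of steps. The grammar is an assumption: this article only says the file has #Case headers and English steps, so the exact header syntax, the sample steps, and the `parseScenario` helper are all hypothetical.

```typescript
interface ScenarioCase {
  name: string;
  steps: string[];
}

// Hypothetical parser: a line starting with "#Case" opens a case,
// and every following non-empty line is treated as one English step.
function parseScenario(markdown: string): ScenarioCase[] {
  const cases: ScenarioCase[] = [];
  let current: ScenarioCase | undefined;
  for (const raw of markdown.split("\n")) {
    const line = raw.trim();
    if (line.startsWith("#Case")) {
      current = { name: line.slice("#Case".length).trim(), steps: [] };
      cases.push(current);
    } else if (line !== "" && current) {
      current.steps.push(line);
    }
  }
  return cases;
}

// Assumed example plan, mirroring the login spec shown earlier.
const plan = `
#Case Sign in redirects to dashboard
Go to /login
Fill the email field with matt@example.com
Click the "Sign in" button
Verify the heading "Welcome back" is visible
`;

const parsed = parseScenario(plan);
```

The last step is the observable: it names what a user would see, not a selector.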
Related guides on Playwright, selectors, and agent-driven testing.
Keep going
Readable Playwright test generator
Why the most readable Playwright test is one that never compiled to .spec.ts. scenario.md runs directly on @playwright/mcp@0.0.70.
Selector drift detection testing
How an AI agent catches the rename you forgot. Accessibility-first locators vs CSS class paths.
E2E testing best practices
What actually matters in 2026: observable assertions, live a11y snapshots, and video evidence over stack traces.