What is Playwright testing, and what is the one primitive every guide forgets?

The textbook answer is correct. Playwright is browser automation over a DevTools-style protocol (the Chrome DevTools Protocol for Chromium, with browser-specific equivalents for Firefox and WebKit), with cross-browser support, auto-waits, and a TypeScript runner. That definition was complete in 2023. In 2026 it is missing one primitive: a MutationObserver loop that detects when the page has actually finished changing. Vanilla Playwright does not ship one. You feel its absence the moment you let an LLM agent drive a real Chromium against a streaming chat interface or a debounced search box. This guide is about that gap and the 30-line answer that lives at agent.ts:956-1005.

Matthew Diakonov

The textbook answer, with nothing taken for granted

Playwright is a browser automation library from Microsoft. It drives real Chromium, Firefox, and WebKit instances from a separate process, using the Chrome DevTools Protocol for Chromium and browser-specific equivalents for Firefox and WebKit. You write a script that opens a URL, walks through a real user flow, and either passes or fails based on assertions about what the rendering engine actually produced. The defining word in that sentence is real. Unlike unit tests that mock the DOM or component tests that render in jsdom, a Playwright test sees the same composited page a user sees, with the same network behavior, the same animations, the same lazy-loaded carousels, and the same race conditions.

Everything else about Playwright (the test() runner, page.locator(), expect(), fixtures, traces) is a productivity wrapper on top of that one fact. Strip the wrappers away and a Playwright end-to-end test is six primitives in sequence. Launch a context, navigate to a URL, locate an element, interact with it, wait for the page to settle, assert something about the visible state. Most guides describe five of them and skip the wait as if it were trivial. It is not. The wait, which this guide calls the sixth primitive, is where most flaky-test horror stories come from, and on an agent-driven run it is the only step the LLM cannot just guess its way through.

The architecture under an agent-driven run

An agent-driven Playwright run does not start with a .spec.ts file. It starts with a plain-English plan, an LLM agent, and the official Microsoft @playwright/mcp server. The agent loop dispatches eighteen tools against the running browser, and the browser writes a video, an event log, and a JSON report.

scenario.md goes in. WebM + JSON come out.

[Diagram: scenario.md, URL + variables, extension token, and model + provider feed the agent loop, which produces a WebM recording, a JSON event log, a 5x video player, and per-step screenshots.]

The five primitives every guide covers

Read any introduction to Playwright and you will get some version of these five. They are correct, in order, and complete enough to write a passing test against a static page. They start to fail the moment the page is not static.

1. Snapshot the accessibility tree

Call browser_snapshot to get a YAML accessibility tree of the current page with [ref=e1] [ref=e2] [ref=e3] handles for every interactive element. Assrt writes the snapshot to a .yml file via --output-mode file (browser.ts:296) so a 100k-element tree does not blow up the agent's context window. Anything over 120,000 chars is truncated.

2. Pick an element by ref

The agent picks a ref from the snapshot (Sign In button is ref=e22) and passes it to the next tool call. There are no CSS selectors involved. The ref handle is opaque to the agent; it just identifies an element by its accessibility-tree position, which is far more stable than a class name or an XPath.

3. Take the action

browser_click({element: 'Sign In button', ref: 'e22'}). The element string is for the run log; the ref is what Playwright actually uses to locate the DOM node. After the action, Assrt also calls injectOverlay() so the next video frame shows a red cursor gliding to the click coordinates and a ripple expanding from them.

4. Assert and continue

Call assert with a description, a passed boolean, and an evidence string. The agent forms the assertion from what it sees in the post-action snapshot (URL is /app, heading text is Welcome). The result lands in the JSON event log; failed assertions also surface in the WebM with their evidence string, so a reviewer can scrub to the failing moment without reading the log.
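The four steps above can be sketched as plain data handling. This is a minimal illustration, not Assrt's code: the helper names (truncateSnapshot, findRef) and the sample snapshot text are assumptions made for the example. Only the 120,000-character limit, the [ref=eN] handle format, the {element, ref} click arguments, and the three assert fields come from the text.

```typescript
// Hypothetical helpers illustrating steps 1-4; none of these names come from Assrt.

const MAX_SNAPSHOT_CHARS = 120_000; // truncation limit quoted in step 1

/** Step 1: keep the snapshot within the agent's context budget. */
function truncateSnapshot(yaml: string, limit: number = MAX_SNAPSHOT_CHARS): string {
  return yaml.length > limit ? yaml.slice(0, limit) : yaml;
}

/** Step 2: return the ref handle on the first snapshot line that mentions a label. */
function findRef(snapshot: string, label: string): string | undefined {
  for (const line of snapshot.split("\n")) {
    if (!line.includes(`"${label}"`)) continue;
    const match = /\[ref=(e\d+)\]/.exec(line);
    if (match) return match[1];
  }
  return undefined;
}

/** Step 3: arguments in the shape the text shows for browser_click. */
interface ClickArgs {
  element: string; // human-readable label for the run log
  ref: string; // handle used to locate the DOM node
}

/** Step 4: the three fields the assert tool records. */
interface AssertRecord {
  description: string;
  passed: boolean;
  evidence: string;
}

// Illustrative snapshot text (assumed format, not captured from a real run).
const sampleSnapshot = [
  '- heading "Welcome back" [ref=e3]',
  '- textbox "Email" [ref=e14]',
  '- button "Sign In" [ref=e22]',
].join("\n");

const signInRef = findRef(truncateSnapshot(sampleSnapshot), "Sign In");

const clickArgs: ClickArgs | undefined = signInRef
  ? { element: "Sign In button", ref: signInRef }
  : undefined;

const assertRecord: AssertRecord = {
  description: "Sign In button is present in the snapshot",
  passed: clickArgs !== undefined,
  evidence: clickArgs ? `found ref=${clickArgs.ref}` : "no matching ref",
};

console.log(clickArgs, assertRecord);
```

In a real run the agent, not a regex, chooses the ref; the sketch only shows which data moves between the steps.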

30 lines

The wait between an action and an assertion is where every flaky test lives. Vanilla Playwright gives you three weak primitives for that wait. Assrt gives you one strong one.

Assrt agent system prompt, agent.ts:249-254

The sixth primitive: wait_for_stable

Between the action and the assertion, the page has to settle. On a hand-written test the human author already knows what should appear, so they reach for one of three tools: page.waitForTimeout (flaky and slow), page.waitForLoadState('networkidle') (returns too early on streaming flows), or page.waitForSelector (requires knowing the final selector). All three are wrong in different ways. On an agent-driven run none of them are available, because the agent does not know the final selector and does not know the network protocol. It only knows that something asynchronous happened and the next snapshot needs to reflect the settled DOM.

Assrt's answer is a 30-line MutationObserver loop. It lives in a single case statement at /Users/matthewdi/assrt-mcp/src/core/agent.ts:956-1005. The agent calls it after every action that could trigger an async update, and the system prompt at lines 249-254 instructs it to do so explicitly: "Use wait_for_stable to wait until the DOM stops changing. This is better than wait with a fixed time because it adapts to actual load speed."

agent.ts (the wait_for_stable case handler)

What it replaces

Three weak waits become one strong one. The snippet that follows is what you scatter through a hand-written Playwright spec when an action triggers a streaming response or a debounced search. An Assrt agent calls wait_for_stable instead.

The wait primitive, before and after

// What you write today in a hand-authored Playwright spec
// when an action triggers a streaming AI response or
// a debounced search box.

await page.click('text=Send');

// Hope 2 seconds is enough.
await page.waitForTimeout(2000);

// Or guess at a load state that does not actually mean
// "the streaming response finished":
await page.waitForLoadState('networkidle');

// Or wait for a specific selector that may or may not
// appear in time:
await page.waitForSelector('.response-bubble.complete', {
  timeout: 30_000,
});

// All three are wrong in different ways:
//   - waitForTimeout is flaky and slow
//   - networkidle returns too early on streaming flows
//   - waitForSelector requires you to know the final selector
//
// On an agent-driven run, the agent does not know any of
// these in advance. So you write 2000 ms everywhere and
// pray.

The detail nobody else mentions: a 10-second hard cap on the stability window

Read the handler carefully and there is one line that looks innocent and is not. The stability window is Math.min((toolInput.stable_seconds as number) || 2, 10). Default 2 seconds, hard cap 10 seconds. The cap exists because an agent that wants "the page has been quiet for 30 minutes" is almost certainly waiting for the wrong thing, and the right answer is to fail the run with actionable evidence rather than block the whole loop. The timeout has a similar cap at 60 seconds. If your scenario genuinely needs to wait longer (a long video upload, a multi-minute model generation), you split the wait across two wait_for_stable calls separated by an explicit screenshot or assert. The cap forces you to think about what the wait is for, instead of typing 600 and walking away.
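The clamping behavior is easy to check in isolation. A minimal sketch follows. The function names are hypothetical. The stability expression is the one quoted above. The timeout expression is an assumption by analogy, using the 30-second default and 60-second cap stated elsewhere in this guide.

```typescript
// Hypothetical helper names; only the Math.min(... || 2, 10) expression is quoted in the text.

/** Stability window in seconds: default 2, hard cap 10. */
function clampStableSeconds(requested?: number): number {
  return Math.min(requested || 2, 10);
}

/** Timeout in seconds: assumed to follow the same pattern with default 30, cap 60. */
function clampTimeoutSeconds(requested?: number): number {
  return Math.min(requested || 30, 60);
}

console.log(clampStableSeconds(undefined)); // 2
console.log(clampStableSeconds(600)); // 10
console.log(clampStableSeconds(0)); // 2, because 0 is falsy and falls through to the default
console.log(clampTimeoutSeconds(600)); // 60
```

One consequence of the || form is that an explicit 0 cannot be requested; it is treated the same as an omitted value.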

What an actual run looks like end to end

The plan is plain English in /tmp/assrt/scenario.md. The runner watches the file via fs.watch (debounced 1 second, scenario-files.ts:97-103), so you can edit the next case while a run is in progress and the agent picks up the change on the next iteration.
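A debounced file watcher of this kind can be sketched with Node's fs.watch. This is an illustration under assumptions, not the code in scenario-files.ts: the names debounce and watchScenario are hypothetical, and only the use of fs.watch and the 1-second debounce come from the text.

```typescript
import { watch } from "node:fs";

/** Collapse a burst of calls into a single call after `delayMs` of quiet. */
function debounce(fn: () => void, delayMs: number): () => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return () => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(fn, delayMs);
  };
}

/**
 * Hypothetical wrapper: invoke `onChange` at most once per quiet period
 * after the scenario file changes. Editors often emit several change
 * events per save, which is why the debounce matters.
 */
function watchScenario(
  path: string,
  onChange: () => void,
  delayMs: number = 1000, // the 1-second debounce mentioned in the text
): ReturnType<typeof watch> {
  const notify = debounce(onChange, delayMs);
  return watch(path, () => notify());
}
```

A caller would pass the plan path and a callback that re-reads the file, for example watchScenario("/tmp/assrt/scenario.md", reloadPlan), where reloadPlan is whatever function reparses the cases.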

/tmp/assrt/scenario.md
assrt run, with two wait_for_stable calls in case 1

The eighteen tools the agent actually has

Each tool is a real Playwright primitive or a small orchestration on top of one. Together they are the entire surface area of an Assrt run. The list lives as Anthropic Tool objects at agent.ts:16-196.

snapshot, click, type, navigate

The four staples. Every Playwright run is some combination of these. Assrt routes them through the official @playwright/mcp tools, so the underlying browser_click is identical to what a hand-written spec would call. The agent picks elements by ref, not selector.

wait_for_stable

The MutationObserver-based stability detector. Default 2 seconds of quiet (capped at 10), default 30-second timeout (capped at 60). Replaces fixed-time sleeps and waitForLoadState. Source: agent.ts:956-1005.

create_temp_email + wait_for_verification_code

Two tools that turn a signup test from a 30-minute manual run into a 15-second automated one. The first creates a disposable inbox at mail.tm; the second polls it for OTP codes for up to 60 seconds. The agent uses them automatically when a flow asks for an email.

evaluate

Run JavaScript in the page. Used by the OTP-paste trick (agent.ts:235): a single DataTransfer paste fills six single-character OTP fields at once instead of typing into each one. Without this, OTP tests are 6 type calls and 4 minutes of brittle sequencing.

http_request

Make an HTTP call to an external API from inside a test. Used to verify that a webhook landed (Telegram getUpdates), a Slack message arrived, or a GitHub issue was created. Closes the loop between an action in the page and an effect outside it.

assert + complete_scenario

The two terminating tools. assert records a pass or fail with a description and an evidence string. complete_scenario marks the case finished and lets the agent move to the next #Case in the plan.

Concepts that all live under the same definition

The Playwright vocabulary you will see referenced again and again in any deep guide. Each one has a specific role in the agent-driven version.

DevTools Protocol, accessibility tree refs, MutationObserver, @playwright/mcp, Chrome 1600x900 viewport, WebM + VP9 recording, Trace Viewer, locator strategy, waitForLoadState, auto-waiting, disposable email + OTP, persistent browser profile, headless vs headed, Range-seekable playback, plain-English scenarios, self-healing locators.

By the numbers

Four numbers worth memorizing if you are about to drive Playwright with an LLM agent. 30 lines for the entire stability primitive. 500 ms between mutation polls. 18 agent tools total. And exactly 0 CSS selectors you have to write or maintain.

Vanilla Playwright versus an agent-driven run on top of it

Same underlying browser. Same DevTools Protocol. The difference is everything around it.

Feature: How a test is written
Hand-written Playwright: A .spec.ts file with page.click, page.fill, page.waitForSelector, expect, plus fixtures and a config. Maintained by humans.
Assrt on @playwright/mcp: A markdown file with #Case headers and bullet steps. No selectors, no fixtures. Agent picks elements from a fresh accessibility tree on every action.

Feature: Waiting for the page to settle
Hand-written Playwright: page.waitForTimeout (flaky), page.waitForLoadState (early on streaming flows), page.waitForSelector (requires knowing the final selector).
Assrt on @playwright/mcp: wait_for_stable with a MutationObserver counter. Returns when 2 seconds pass with zero new DOM mutations or 30 seconds elapse, whichever comes first.

Feature: What you maintain when the UI changes
Hand-written Playwright: Every locator that touched the renamed element. Every getByText that hardcoded a label. Every CSS selector that anchored on a class name.
Assrt on @playwright/mcp: Nothing. The agent reads the new label out of the next snapshot and clicks the right element. The plan stays the same.

Feature: What you watch back when something fails
Hand-written Playwright: A .webm screencast with no cursor and no keystroke indicator, plus a Trace Viewer that requires a separate launch.
Assrt on @playwright/mcp: A WebM with a red cursor, a click ripple, a green keystroke toast, a heartbeat dot, and a 5x HTML player that auto-opens on completion.

Feature: OTP and disposable email signup
Hand-written Playwright: You wire up a temp-email service yourself, write a polling helper, and parse the inbox HTML. Maybe an hour the first time.
Assrt on @playwright/mcp: Two built-in tools, create_temp_email and wait_for_verification_code, work out of the box. Zero setup. mail.tm under the hood.

Feature: License + cost
Hand-written Playwright: $7,500 per seat per month for enterprise testing platforms that wrap Playwright. Plans encoded in proprietary YAML you cannot run elsewhere.
Assrt on @playwright/mcp: MIT licensed. Runs locally. Plans are plain Markdown you own forever. The only cost is LLM tokens, roughly cents per run on Haiku.

So, what is Playwright testing?

The honest answer in 2026: Playwright testing is a six-step loop where one step, the wait, has not made it into most documentation yet. Snapshot the accessibility tree, pick an element by ref, take the action, wait for mutations to settle, assert against the new snapshot, and continue. Five of those you can find in any tutorial. The wait, which this guide counts as the sixth primitive, lives in 30 lines at /Users/matthewdi/assrt-mcp/src/core/agent.ts:956-1005, and it is the difference between a test that passes 95 percent of the time and a test that passes consistently. If you are writing a Playwright run by hand, port that handler into your project. If you would rather skip the porting, that is what Assrt is for: a real Playwright run, with the sixth primitive built in, driven by an LLM that you instruct in plain English.


Frequently asked questions

What is Playwright testing, in one sentence, with nothing taken for granted?

Playwright testing is browser automation that drives a real Chromium, Firefox, or WebKit instance from a separate process (over the Chrome DevTools Protocol for Chromium, and browser-specific equivalents for Firefox and WebKit), so you can write a script that opens a URL, walks through a real user flow, and either passes or fails based on assertions about what the rendering engine actually produced. The defining word in that sentence is real. Unlike unit tests that mock the DOM or component tests that render in jsdom, a Playwright test sees the same composited page a user sees, with the same network behavior, the same animations, the same lazy-loaded carousels, and the same race conditions. Everything else about Playwright (the test() runner, page.locator(), expect(), fixtures, traces) is a productivity wrapper on top of that one fact.

Why does anyone use Playwright instead of Selenium or Cypress?

Three reasons that the average tutorial buries. First, Playwright talks to the browser over its native automation protocol directly, not through a WebDriver bridge, so it is faster and gets richer signal back (network requests, console logs, accessibility tree, all over one connection). Second, Playwright auto-waits for elements to be visible and stable before interacting, which removes most of the flaky-test recipe that haunts Selenium suites. Third, Playwright runs the same test on Chromium, Firefox, and WebKit through one API, so a single locator string verifies behavior across three rendering engines. Cypress is fine, but it runs inside the browser as JavaScript, which means it cannot test multiple tabs, cannot drive a real new-tab login flow, and offers only experimental WebKit support. Playwright runs outside the browser, so none of those constraints apply.

What is the actual lifecycle of a Playwright test? Step by step.

A Playwright test runs through six primitives. (1) launch a browser context, (2) navigate to a URL, (3) locate an element by role, text, or selector, (4) interact with it (click, type, hover), (5) wait for the page to settle, (6) assert something about the visible state. Most guides cover five of these and treat the wait in step 5 as trivial. It is not. On a human-written test the script itself tells you when to wait and what to wait for, because the human writing it just clicked the button and knows what should appear next. On an agent-driven run, where an LLM picks each next action from a fresh accessibility tree, the agent does not know what it just triggered, so the wait step has to be self-describing. That is why Assrt adds a MutationObserver-backed wait_for_stable primitive at /Users/matthewdi/assrt-mcp/src/core/agent.ts:956-1005. The agent calls it after every action that triggers a network request or a streaming response, and it returns once 2 seconds pass with zero DOM mutations or 30 seconds elapse, whichever comes first.

What does wait_for_stable actually do, line by line?

It does five things in sequence. (1) It calls browser.evaluate to inject a MutationObserver onto document.body that observes childList, subtree, and characterData changes, and increments a global counter window.__assrt_mutations on every batch. (2) It enters a polling loop that runs every 500 ms. (3) On each tick it reads the counter back from the page and compares it to the last value. (4) If the counter has not changed for at least 2 seconds (configurable with stable_seconds, capped at 10), it considers the page stable and breaks out. (5) It cleans up: disconnects the observer, deletes the global counter, and reports either Page stabilized after Xs (N total mutations) or Timed out after Ys (page still changing, N mutations). Default timeout is 30 seconds, hard cap 60 seconds. The whole thing is 30 lines, all in the wait_for_stable case handler at /Users/matthewdi/assrt-mcp/src/core/agent.ts:956-1005.

Why a MutationObserver instead of Playwright's built-in waitForLoadState?

waitForLoadState('networkidle') waits for the network to be idle (no network connections for at least 500 ms). That works for first-load. It does not work for the most common modern flow: a button click that opens a streaming AI response, where the network is busy for 30 seconds while tokens arrive, and you want to wait until the streaming finishes, not until the network falls silent. waitForLoadState would return immediately or hang depending on the streaming protocol. wait_for_stable is independent of network state. It does not care why the DOM is changing, only that it has stopped. That makes it the right primitive for streaming AI chat, debounced search, infinite scroll, lazy-loaded carousels, file upload progress bars, and any other flow where the page settles when the DOM stops changing, not when fetch stops.

Can I use wait_for_stable from a regular Playwright spec file, without Assrt?

Yes, but you have to write it yourself. It is not in @playwright/test. The 30-line implementation in /Users/matthewdi/assrt-mcp/src/core/agent.ts:962-994 is portable: pull out the evaluate calls, inline them into a helper, and you have a stability waiter you can drop into any spec. The Assrt agent calls it automatically because the system prompt at agent.ts:249-254 instructs the LLM to use it after submitting forms, sending chat messages, or triggering any async operation. In a hand-written test you would call it explicitly: await waitForStable(page, { stable_seconds: 2, timeout_seconds: 30 }). It is a primitive that belongs in the standard Playwright vocabulary; it just has not been added yet.
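A hand-rolled version of that helper might look like the sketch below. It follows the five steps described in the previous answer but is a reconstruction from that description, not the handler in agent.ts. Assumptions: the EvaluatingPage interface is a structural stand-in for the part of Playwright's Page the helper uses (page.evaluate accepts a string expression, so a real Page should satisfy it); window.__assrt_observer and the poll_ms option are hypothetical additions; the timeout clamp mirrors the stability clamp by analogy.

```typescript
// Minimal stand-in for the slice of Playwright's Page used here.
interface EvaluatingPage {
  evaluate(expression: string): Promise<unknown>;
}

interface StableOptions {
  stable_seconds?: number; // quiet window: default 2, capped at 10
  timeout_seconds?: number; // overall budget: default 30, capped at 60
  poll_ms?: number; // hypothetical knob; the text describes a 500 ms poll
}

interface StableResult {
  stable: boolean;
  mutations: number;
  elapsedMs: number;
}

// Browser-side snippets, passed as string expressions so this file needs no DOM typings.
const INSTALL_OBSERVER = `(() => {
  window.__assrt_mutations = 0;
  window.__assrt_observer = new MutationObserver(() => { window.__assrt_mutations++; });
  window.__assrt_observer.observe(document.body, {
    childList: true, subtree: true, characterData: true,
  });
})()`;
const READ_COUNTER = `window.__assrt_mutations || 0`;
const CLEANUP_OBSERVER = `(() => {
  if (window.__assrt_observer) window.__assrt_observer.disconnect();
  delete window.__assrt_observer;
  delete window.__assrt_mutations;
})()`;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function waitForStable(
  page: EvaluatingPage,
  opts: StableOptions = {},
): Promise<StableResult> {
  const stableMs = Math.min(opts.stable_seconds || 2, 10) * 1000;
  const timeoutMs = Math.min(opts.timeout_seconds || 30, 60) * 1000;
  const pollMs = opts.poll_ms || 500;

  await page.evaluate(INSTALL_OBSERVER);
  const start = Date.now();
  let lastCount = 0;
  let lastChange = start;
  let stable = false;
  try {
    while (Date.now() - start < timeoutMs) {
      await sleep(pollMs);
      const count = Number(await page.evaluate(READ_COUNTER));
      if (count !== lastCount) {
        // The DOM changed since the previous tick: restart the quiet window.
        lastCount = count;
        lastChange = Date.now();
      } else if (Date.now() - lastChange >= stableMs) {
        stable = true;
        break;
      }
    }
  } finally {
    // Always disconnect the observer and remove the globals, even on errors.
    await page.evaluate(CLEANUP_OBSERVER);
  }
  return { stable, mutations: lastCount, elapsedMs: Date.now() - start };
}
```

In a spec you would call it right after the triggering action, for example after page.click('text=Send'), and then assert on result.stable before reading the page. If the page navigates during the wait, the evaluate calls can throw, so a production version would need to decide how to handle that.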

What is the test input format for an Assrt run? Do I have to learn anything new?

You write a plain-English plan in /tmp/assrt/scenario.md with #Case blocks. Each case has a one-line header and three to six bullet steps. Example: #Case 1: a logged-out user signs up via email, then bullets like Open the homepage, Click the Sign Up button, Use create_temp_email for the email field, Submit, Wait for the verification code, Paste it, Assert the URL is /app. The runner reads that file, builds a Playwright agent loop on top of @playwright/mcp, and decides what to click and type at each step from a fresh accessibility tree on every action. There are no locator strings to maintain. The agent calls navigate, snapshot, click, type_text, wait_for_stable, assert, and complete_scenario in whatever order the scenario requires. Eighteen tools total, defined as Anthropic Tool objects at /Users/matthewdi/assrt-mcp/src/core/agent.ts:16-196.
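To make the plan format concrete, here is a small parser sketch for #Case blocks. It is a hypothetical illustration, not Assrt's plan parser: the names ScenarioCase and parseScenario are invented for the example, and the use of "-" or "*" as bullet markers is an assumption. The sample plan reuses the steps from the answer above.

```typescript
interface ScenarioCase {
  id: number;
  title: string;
  steps: string[];
}

/** Split a plain-English plan into cases, each with its bullet steps. */
function parseScenario(markdown: string): ScenarioCase[] {
  const cases: ScenarioCase[] = [];
  let current: ScenarioCase | undefined;

  for (const raw of markdown.split("\n")) {
    const line = raw.trim();

    // Header such as "#Case 1: a logged-out user signs up via email".
    const header = /^#\s*Case\s+(\d+)\s*:\s*(.+)$/i.exec(line);
    if (header) {
      current = { id: Number(header[1]), title: (header[2] ?? "").trim(), steps: [] };
      cases.push(current);
      continue;
    }

    // Bullet step belonging to the most recent case (assumed "-" or "*" markers).
    const bullet = /^[-*]\s+(.+)$/.exec(line);
    if (bullet && current) current.steps.push((bullet[1] ?? "").trim());
  }
  return cases;
}

const samplePlan = `
#Case 1: a logged-out user signs up via email
- Open the homepage
- Click the Sign Up button
- Use create_temp_email for the email field
- Submit
- Wait for the verification code
- Paste it
- Assert the URL is /app
`;

const parsedCases = parseScenario(samplePlan);
console.log(JSON.stringify(parsedCases, null, 2));
```

The parser only structures the text. Deciding which tool to call for each step is still the agent's job.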

Is Assrt actually open source? What about the LLM costs?

Yes, MIT licensed. The CLI ships as @assrt-ai/assrt on npm and the source lives in /Users/matthewdi/assrt-mcp on disk. The agent runs locally on your machine; the only thing that leaves it is the prompt body sent to whatever LLM endpoint you configure. The default is Anthropic Claude Haiku 4.5 (claude-haiku-4-5-20251001), which works out to roughly cents per test run for typical scenarios. You can swap to Gemini 3.1 Pro by passing --provider gemini and setting GOOGLE_API_KEY. Compare against $7,500 per seat per month for closed enterprise testing platforms that route every accessibility tree through their own backend by design. Assrt's video files stay on disk at /tmp/assrt/, served only over localhost by a persistent Range-capable Node server. Nothing leaves the machine unless you opt in to cloud sync.

How is this different from Playwright Codegen, Microsoft's existing 'record actions, get a test' tool?

Codegen is a recorder. You click around in a real browser and Codegen writes the corresponding page.click and page.fill calls into a .spec.ts file. The output is code you commit and maintain. The moment a button label changes, the file breaks and someone has to update it. Assrt is not a recorder. The plan is plain English. The locators are recomputed from a fresh accessibility tree on every run. There is no .spec.ts file to update when the UI changes. If a button is renamed from Sign In to Log In, the agent reads the new label out of the snapshot and clicks the right element on the next run; no human edits required. The artifact you keep is the scenario in English, the WebM video, and a JSON event log. None of those decay when the UI changes.

What is the relationship between Playwright, @playwright/mcp, and Assrt?

Three layers. Playwright (microsoft/playwright on GitHub) is the underlying browser automation library; it ships the Chromium, Firefox, and WebKit binaries and exposes a TypeScript API on top of each browser's automation protocol. @playwright/mcp is the official Microsoft wrapper that exposes Playwright as a Model Context Protocol server, so any LLM with MCP support can call browser_navigate, browser_snapshot, browser_click, etc. as tools. Assrt is a third layer on top of @playwright/mcp: an agent loop, a plan parser, a video overlay, and a stability primitive. Every action in an Assrt run still ends as a real Playwright tool call. The page in your browser does not know it is being driven by an LLM; in Chromium it only sees standard DevTools Protocol messages. That is what "real Playwright code, not proprietary YAML" means. The Playwright calls are genuine; nothing is reimplemented in a custom format.

Assrt: open-source AI testing framework
© 2026 Assrt. MIT License.
