From the Assrt agent source

A Playwright e2e test agent is not an AI code generator. It is the four patches that sit between @playwright/mcp and a test that actually finishes.

The browser part is solved. Microsoft's Playwright team ships @playwright/mcp, it exposes navigate, click, type, snapshot, and twenty other primitives, and any LLM with function calling can drive it. That is the floor. The ceiling is the part nobody writes about: how to paste a six-digit code into six separate single-character inputs without losing focus, how to fail fast when a dev server is wedged, how to wait for a streaming AI response to settle, and how to reuse a logged-in browser across three scenarios in a row. This page is about those four patches. Every one of them is a file and a line number in the Assrt agent source.

Matthew Diakonov
11 min read

MIT-licensed, 100% open source · Wraps @playwright/mcp · Real Chromium, real cookies · Tests are Markdown on your disk · No SaaS lock-in

Where the browser tool belt ends and the agent layer starts

Think of it as a stack. The model sits on top. Tool-calling sits under the model. The Assrt agent layer sits under tool-calling. Underneath it, Playwright MCP drives the real Chromium. Each layer only exists because the one above it cannot do a particular thing on its own.

The agent layer is the thing you are not building yourself

The stack, top to bottom: your #Case plan → Claude Haiku → tool schemas → Assrt agent (DataTransfer paste, preflightUrl, wait_for_stable) → @playwright/mcp.
Patch 1 of 4

The OTP paste expression, hard-coded in the system prompt

A very common OTP layout is six separate <input maxlength="1"> fields. If you let the model type character by character, focus moves unpredictably during React re-renders, the wrong character lands in the wrong field, and the test fails without a useful error. The fix is a one-line synthetic paste dispatched to the parent container, and it is baked into the system prompt itself so the model never has to improvise it. The exact expression lives at /Users/matthewdi/assrt-mcp/src/core/agent.ts, lines 234 to 236.

agent.ts (excerpt from SYSTEM_PROMPT)

The model is instructed to substitute only CODE_HERE and leave the rest untouched. After the evaluate() call returns the literal string "pasted 6 fields", the agent re-snapshots to confirm all digits are present and clicks the Verify button. Every OTP library I have tested against Assrt (Magic Link, Clerk, Supabase, WorkOS, custom React implementations) triggers its onPaste handler on that single synthetic event and distributes the digits correctly.
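The excerpt itself is not reproduced on this page, but its shape follows from the description above. Here is a hedged sketch of a builder for that kind of page-side expression; the function name and the exact DOM walk are assumptions, not the real agent.ts code:

```typescript
// Sketch of the kind of expression the SYSTEM_PROMPT pins. The real string
// lives at agent.ts:234-236; this reconstruction is an assumption based on
// the behavior described above (DataTransfer + bubbling ClipboardEvent).
function buildOtpPasteExpression(code: string): string {
  // The model is told to substitute only the code; everything else is fixed.
  return `(() => {
    const first = document.querySelector('input[maxlength="1"]');
    const container = first.parentElement;
    const dt = new DataTransfer();
    dt.setData('text/plain', '${code}');
    const ev = new ClipboardEvent('paste', { clipboardData: dt, bubbles: true });
    container.dispatchEvent(ev);
    return 'pasted 6 fields';
  })()`;
}
```

The point of building the string rather than letting the model write it: one fixed expression means one fixed failure surface, and the re-snapshot after evaluate() is the only verification step needed.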

1 line

Six separate single-character inputs are the most common failure mode for AI-driven e2e tests. One synthetic paste on the parent container is the only fix that survives every React re-render.

Assrt SYSTEM_PROMPT, agent.ts:234-236

The four ugly patterns a real test agent has to solve

Pick any log of a Playwright MCP agent trying to finish a real login-gated smoke test and these are the four places where it crashes or stalls. Each one has a concrete fix in Assrt.

Multi-field OTP inputs

Six separate <input maxlength='1'> boxes break character-by-character typing because focus moves unpredictably during re-renders. The agent's SYSTEM_PROMPT pins a DataTransfer + ClipboardEvent expression that pastes the full code to the parent container in one synthetic paste. React's onPaste handler distributes it across all six fields in a single render.

Wedged dev servers

A dev server that accepted the initial TCP handshake but hangs on the request body will hang navigate() for ~180 seconds, drop the Playwright MCP stdio connection, and surface as 'MCP client not connected.' preflightUrl() turns that into an 8-second timeout with an actionable error.

Streaming async UI

Anywhere you have an LLM streaming tokens, a search debouncing, or images lazy-loading, you cannot pin a waitForSelector in advance. wait_for_stable injects a MutationObserver, polls mutations every 500ms, and returns when the page is quiet for N seconds (default 2) or times out at 30.

Cross-scenario session state

Chromium is not closed between #Case blocks. Cookies, localStorage, and open tabs carry over. Case 1 logs in. Case 2 through N reuse the session. Case 3 can also be told that Case 1 passed, via previousSummaries, so the model reasons about the chain instead of every case standing alone.

Patch 2 of 4

Fail in 8 seconds, not 180

The most common Playwright MCP error I see in the wild is MCP client not connected, and the root cause is almost always the same: a dev server accepted the TCP handshake but hung on the request body. navigate() waits the default 180 seconds, the stdio connection to @playwright/mcp drops first, and the user sees the opaque disconnect message instead of the real problem. Assrt runs a HEAD probe with an 8-second AbortController before Chrome even launches.

agent.ts
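The actual excerpt lives in the repository; as a hedged sketch of a probe with these semantics (the names PREFLIGHT_TIMEOUT_MS, probe, and shouldFallBackToGet are illustrative, not the real signatures):

```typescript
// Sketch of a preflight probe in the spirit of preflightUrl()
// (agent.ts:518-543). Exact behavior and error text are assumptions.
const PREFLIGHT_TIMEOUT_MS = 8_000;

// Some dev servers reject HEAD outright; these statuses mean "retry with GET".
function shouldFallBackToGet(status: number): boolean {
  return status === 405 || status === 501;
}

async function preflightUrl(url: string): Promise<void> {
  const probe = async (method: "HEAD" | "GET") => {
    const ctrl = new AbortController();
    const timer = setTimeout(() => ctrl.abort(), PREFLIGHT_TIMEOUT_MS);
    try {
      return await fetch(url, { method, signal: ctrl.signal });
    } finally {
      clearTimeout(timer);
    }
  };
  try {
    const res = await probe("HEAD");
    // Any response at all, even 4xx/5xx, proves the server is alive.
    if (shouldFallBackToGet(res.status)) await probe("GET");
  } catch {
    throw new Error(
      `Target URL did not respond within ${PREFLIGHT_TIMEOUT_MS}ms. ` +
        "The server may be wedged, still starting, or unreachable.",
    );
  }
}
```

The design choice worth copying even if you never use Assrt: probe before you launch the browser, because a wedged server fails identically whether or not Chromium is running, and the probe is three orders of magnitude cheaper.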
Patch 3 of 4

Wait for a page to actually settle, not for a guessed selector

Playwright's built-in waits pin to network events or element presence. Both are useless when the page is already loaded and you are waiting for a streaming LLM response, a debounced search, or lazy images to resolve. Assrt injects a MutationObserver into the target page, counts mutations every 500 ms, and returns as soon as the page has been quiet for N consecutive seconds.

agent.ts (abridged)
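The abridged source is in the repository; the loop below is a sketch of the same polling logic under stated assumptions. The page-side MutationObserver is reduced to a counter callback so the loop itself is visible; waitForStable and pollsNeededForQuiet are illustrative names:

```typescript
// Sketch of the polling side of wait_for_stable. The real version injects
// a MutationObserver into the page and reads its count via evaluate().
const POLL_MS = 500;

// Stable = the cumulative mutation count is unchanged for `quietMs`.
function pollsNeededForQuiet(quietMs: number): number {
  return Math.ceil(quietMs / POLL_MS);
}

async function waitForStable(
  readMutationCount: () => Promise<number>, // e.g. an evaluate() on the injected observer
  quietMs = 2_000,
  timeoutMs = 30_000,
): Promise<boolean> {
  const needed = pollsNeededForQuiet(quietMs);
  let last = await readMutationCount();
  let quiet = 0;
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    await new Promise((r) => setTimeout(r, POLL_MS));
    const now = await readMutationCount();
    quiet = now === last ? quiet + 1 : 0; // any mutation resets the quiet streak
    last = now;
    if (quiet >= needed) return true; // page has been quiet long enough
  }
  return false; // timed out while still mutating
}
```

Note the adaptive property: the same call returns in 2 seconds on a static page and 12 seconds on a streaming one, with no sleep hardcoded in the scenario.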
Patch 4 of 4

One browser across every case in your plan

The scenario loop in agent.ts does not close Chromium between cases. Case 1 logs in. Case 2 through N walk into an already-authenticated browser. The comment in the source is blunt: "keep it alive so the user can take over and interact after the test finishes."

Raw @playwright/mcp, one test at a time

Every test scenario relaunches Chromium. Cookies reset. Every login scenario has to run first, every time. Your plan has a 40-line preamble that logs in before doing anything else.

Assrt shared-session scenarios

One Chromium boots. Case 1 signs up with a disposable email. Case 2 opens billing without going through /signin. Case 3 logs out. The browser stays open at the end so you can manually poke at the final state if you want.

What happens, end to end, when a scenario runs

Every step below maps to a specific block in src/core/agent.ts. You can read the whole loop in about 300 lines. No plugin manager, no orchestrator, no hidden rules engine.

1

Preflight

The agent probes your URL with an 8-second HEAD (falls back to GET on 405 or 501). If the server does not respond, it fails fast with a clear message. No Chrome launch, no hung navigate, no 'MCP client not connected' three minutes later. Source: agent.ts:518-543.

2

Launch and navigate

Chromium starts via @playwright/mcp (local, in-memory, or connected to your live Chrome session via --extension). The agent calls navigate() with a 30-second cap and fails cleanly if the page takes longer than that. Source: agent.ts:441-454.

3

Read the page once, act on ref IDs

Before every action, the agent calls snapshot() to get the accessibility tree with fresh [ref=eN] IDs. Clicks and types reference those ref IDs, not locator strings. If a ref is stale (the page changed), it re-snapshots and picks a new one. Selector rot is not a failure mode because selectors are not persisted.
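To make the "no persisted selectors" claim concrete, here is a sketch of ref discovery against a snapshot tree. The node shape is hypothetical; the real @playwright/mcp snapshot is a YAML-like accessibility tree with [ref=eN] markers:

```typescript
// Hypothetical snapshot node shape, for illustration only.
interface SnapshotNode {
  role: string;
  name: string;
  ref: string; // e.g. "e14" — valid only for the snapshot it came from
  children?: SnapshotNode[];
}

// Each action re-discovers its target from the freshest snapshot,
// so nothing selector-like is ever written into the test file.
function findRef(node: SnapshotNode, role: string, name: string): string | null {
  if (node.role === role && node.name === name) return node.ref;
  for (const child of node.children ?? []) {
    const hit = findRef(child, role, name);
    if (hit) return hit;
  }
  return null;
}
```

If the class names, DOM structure, or test IDs change between runs, nothing breaks, because the lookup key is the accessible role and name, which only changes when the UI semantics change.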

4

Handle the ugly patterns

OTP inputs go through the DataTransfer paste expression in SYSTEM_PROMPT. Streaming content waits on wait_for_stable with a MutationObserver. Disposable email is requested before the form is filled, not after. Each of these is a tool the model can call; the model does not have to improvise.

5

Assert on observable things only

Assertions are structured objects with description, passed, and evidence fields (see TestAssertion at types.ts). The system prompt constrains the model to assert on visible text, URLs, element presence, and titles. Never CSS, never layout, never performance. Flaky assertions become hard to write.
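A sketch of that constraint as types. The description/passed/evidence fields come from the text above; the kind field and the allow-list check are illustrations, not the actual TestAssertion from types.ts:

```typescript
// Kinds of assertion the system prompt permits: observable things only.
type ObservableKind = "visible-text" | "url" | "element-presence" | "title";

interface TestAssertion {
  description: string;
  passed: boolean;
  evidence: string;
  kind: ObservableKind;
}

const OBSERVABLE: ReadonlySet<string> = new Set([
  "visible-text",
  "url",
  "element-presence",
  "title",
]);

// CSS, layout, and performance assertions are rejected by construction.
function isObservableAssertion(a: { kind: string }): boolean {
  return OBSERVABLE.has(a.kind);
}
```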

6

Carry state to the next scenario

complete_scenario does not close the browser. The next #Case in the same plan starts with every cookie and localStorage entry from the previous one. The agent is told, via previousSummaries in runScenario, what passed before and does not repeat auth flows unnecessarily. Source: agent.ts:463-488.
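A sketch of how prior results might be folded into the next case's context. The summary format mirrors the example quoted later on this page; the function name and exact wording are assumptions:

```typescript
// Builds the previousSummaries strings the model sees before the next #Case.
// Format is assumed from the "Case 1: PASSED — ..." example in the FAQ below.
function buildPreviousSummaries(
  results: { name: string; passed: boolean; summary: string }[],
): string[] {
  return results.map(
    (r) => `${r.name}: ${r.passed ? "PASSED" : "FAILED"} — ${r.summary}`,
  );
}
```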

7

Record and report

Every visual action emits a screenshot. The full run is written to /tmp/assrt/runs/latest.json. If --video was set, a .webm is recorded and a player.html opens at 5x with keyboard controls. If a scenario failed, assrt_diagnose can re-read the log and suggest a corrected #Case.

One tool-call round, drawn out

The agent loop is a four-actor conversation: your plan file, the Assrt runner, @playwright/mcp, and the LLM. Each round is snapshot, model-turn, tool-call, tool-result. This is what one round looks like for a single click + stability wait + assertion.

One tool-call round inside runScenario()

  • Assrt agent reads #Case 1 from scenario.md
  • Assrt agent → @playwright/mcp: snapshot(); returns the accessibility tree with [ref=eN] ids
  • Assrt agent → Claude Haiku: system + scenario + snapshot
  • Claude Haiku → Assrt agent: tool_use: click ref=e14
  • Assrt agent → @playwright/mcp: click(ref=e14); returns ok, new DOM
  • Assrt agent → Claude Haiku: tool_result + new snapshot
  • Claude Haiku: tool_use: wait_for_stable; the agent polls the injected MutationObserver via evaluate(); stable after 1.6s
  • Claude Haiku: tool_use: assert(...)
  • Claude Haiku: tool_use: complete_scenario(passed=true)

Raw @playwright/mcp vs. Assrt's agent layer

Both use real Chromium. The difference is everything above it.

| Feature | Raw @playwright/mcp | Assrt |
| --- | --- | --- |
| Scenario format | TypeScript .spec.ts or proprietary YAML | Plain Markdown #Case blocks at /tmp/assrt/scenario.md |
| Selector strategy | Locator strings persisted in the test file | Zero locators. Every step re-discovers from a fresh accessibility snapshot. |
| OTP input handling | Character-by-character typing, breaks on React OTP libs | One-line DataTransfer paste baked into SYSTEM_PROMPT at agent.ts:234-236 |
| Hung dev server behavior | 3-minute hang, then opaque 'MCP client not connected' | preflightUrl fails in 8 seconds with an actionable message |
| Async content stability | Hardcoded sleeps or waitForSelector with guessed names | wait_for_stable + MutationObserver. Returns when the page is actually quiet. |
| Multi-scenario session | Browser relaunched per test, cookies rebuilt every time | Same Chromium across every #Case in the plan. Case 2 reuses Case 1's auth. |
| Disposable email | Fake-email libraries you wire up yourself | create_temp_email opens mail.tm and waits for the OTP, built in |
| Failure diagnosis | Scroll through the log yourself | assrt_diagnose re-reads the run log and returns a corrected #Case |
| Where the tests live | Vendor dashboard behind an account | Your disk. Optionally your Git repo. |
| Cost at comparable scale | $7.5K/month per seat for closed AI QA platforms | $0 + your Anthropic tokens (MIT-licensed, self-hosted) |

Numbers and file paths come from /Users/matthewdi/assrt-mcp as of the commit current at the time of writing. Closed enterprise pricing references public Momentic, Mabl, and Testim list pages.

What a real run prints

A three-case login-gated smoke run against a local Next.js dev server. Notice the single preflight at the top, the DataTransfer evaluate call in scenario 1, the wait_for_stable return time, and the browser not relaunching for scenarios 2 and 3.

~/myapp · assrt run --video

Measured against the Assrt source

Every number on this page is grounded in a specific constant or branch in the agent source. No invented benchmarks.

8s
preflightUrl timeout before Chrome launch
500ms
wait_for_stable MutationObserver poll interval
30s
wait_for_stable max timeout
5x
Default video player playback speed
Playwright MCP · MutationObserver · DataTransfer paste · mail.tm disposable email · Chromium --extension token · claude-haiku-4-5-20251001 · WebM recording at 5x · #Case blocks in Markdown · assrt_diagnose · accessibility snapshot refs

Is this the right tool for you?

If you answer yes to three or more of these, a Playwright e2e test agent is a better fit than hand-maintained codegen output.

Fit checklist

  • Your app has login, signup, or any OTP flow
  • Your UI streams content or uses async rendering
  • Your dev server occasionally hangs on the request body
  • You want tests to survive arbitrary selector/class changes
  • You want the scenario file to be reviewable Markdown, not locator strings
  • You would rather own the tests than rent them

Quickstart: a login-gated smoke suite

Install, write three cases, run. The whole dance is one npx setup, one plan file, and one run command. The agent handles the OTP, the stability wait, the preflight probe, and the shared session automatically.

terminal

Want to talk through which of the four patches actually hits your app?

Fifteen minutes. Bring a URL, or a flaky test you are tired of. I will show you where Assrt does and does not fit.

Frequently asked questions

What is a Playwright e2e test agent, specifically, and how does it differ from Playwright codegen?

Playwright codegen records your clicks and emits a TypeScript .spec.ts file you then maintain by hand. An e2e test agent sits between a language model and a real browser: the model reads a plain-English scenario, decides what to click and type in real time, and verifies what it sees before moving on. Assrt is the second shape. It wraps @playwright/mcp (the official Playwright MCP server) and adds the agent layer on top, which lives at /Users/matthewdi/assrt-mcp/src/core/agent.ts. The agent's tools are the usual Playwright primitives (navigate, click, type, scroll, press_key, snapshot), plus QA-specific ones (assert, create_temp_email, wait_for_verification_code, wait_for_stable, suggest_improvement). The tests never contain selector strings. Each step re-discovers the target element from a fresh accessibility snapshot, so selector rot is not a failure mode.
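To show how little structure a plan needs, here is a minimal parser for the #Case block format as described on this page. The exact grammar Assrt accepts may differ; this is a sketch of the idea, not the real loader:

```typescript
// One #Case heading per scenario; everything under it is the case body.
interface CaseBlock {
  title: string;
  body: string;
}

function parsePlan(markdown: string): CaseBlock[] {
  const cases: CaseBlock[] = [];
  let current: CaseBlock | null = null;
  for (const line of markdown.split("\n")) {
    const m = line.match(/^#Case\s+(.*)$/);
    if (m) {
      if (current) cases.push(current);
      current = { title: m[1].trim(), body: "" };
    } else if (current) {
      current.body += line + "\n";
    }
  }
  if (current) cases.push(current);
  return cases;
}
```

The significant property is what is absent: no selectors, no assertions-as-code, no imports. The case body is prose the model interprets against a live snapshot.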

Why does a Playwright e2e test agent need anything beyond navigate, click, and type?

Because real web apps have four patterns that a browser tool belt alone does not cover: multi-field OTP inputs (six separate one-char inputs that break character-by-character typing), wedged dev servers that hang navigate() for three minutes and then surface an opaque MCP-disconnect error, async DOM updates that finish at an unpredictable time (streaming AI responses, lazy image loading, debounced search results), and session continuity across a multi-case plan (a login scenario should not be repeated before every other scenario that needs to be logged in). Each pattern has a concrete fix in the Assrt agent: the OTP paste expression baked into SYSTEM_PROMPT at agent.ts:234-236, preflightUrl() at agent.ts:518-543, wait_for_stable via MutationObserver at agent.ts:956-1009, and the shared-browser scenario loop at agent.ts:463-488.

Walk me through the actual OTP paste hack. What does it do, and why not just type?

A very common OTP layout is six separate <input maxlength="1"> boxes. If an agent types character by character into the first one, focus moves to the second box, and the second typed character arrives into the now-focused second box. Fine. Then a React-based OTP component re-renders mid-typing, focus drops, the next character goes into the wrong field, and your test fails with no visible error. The fix Assrt hard-codes into the system prompt is a single-line synthetic paste: it finds the first input[maxlength="1"], walks up to the parent container, constructs a DataTransfer with the full code as text/plain, and dispatches a bubbling ClipboardEvent("paste") to the container. React's onPaste handler sees one event with the full code, distributes it across all six inputs in one render, and the test moves on. The model is told 'do not modify this expression except to replace CODE_HERE' so it never guesses. Source: agent.ts:234-236.

What exactly does preflightUrl() do, and why does it matter for a test agent?

Before it even launches Chrome, the agent does a HEAD request to the target URL with an 8-second timeout (preflightUrl at agent.ts:518-543). If the server responds at all, even with 4xx or 5xx, the test continues. If it returns 405 or 501 on HEAD, the agent retries with GET, because some dev servers (and Next's dev middleware) reject HEAD. If the request times out or the connection is refused, the agent throws a clear error with actionable text: 'Target URL did not respond within 8000ms. The server may be wedged, still starting, or unreachable.' This one probe prevents the most common failure mode of Playwright-based agents: a hung dev server that lets the Chrome launch succeed, then hangs browser.navigate() for the 180-second default, then silently drops the Playwright MCP stdio connection, then surfaces to the user as the useless message 'MCP client not connected.' Three minutes of uncertainty become eight seconds of a real error you can act on.

How does wait_for_stable differ from Playwright's built-in waitForLoadState or waitForSelector?

waitForLoadState is tied to network or DOMContentLoaded events, which fire once per navigation. waitForSelector is tied to a specific element you name in advance. Both are useless for the case where the page has already loaded, you just clicked a button, and an AI response is streaming into a div that you did not know the selector for yet. wait_for_stable solves that class of problem. It injects a MutationObserver into window (agent.ts:963-971), counts mutations on a 500 ms poll, and returns as soon as the mutation count has not changed for N consecutive seconds (default 2 seconds of quiet), or after a hard timeout of 30 seconds. The agent calls it after any action that triggers async rendering. It adapts to the page: a fast page returns in 2 seconds, a streaming page waits 12 seconds, both without any hardcoded sleep in the test scenario.

Scenarios in my plan sometimes need auth state from a previous case. How does session continuity work?

The agent holds a single browser session across every #Case in the plan. When runScenario finishes, the browser is not closed (agent.ts:489-492 says 'keep it alive so the user can take over'). When the next scenario starts, it inherits all cookies, localStorage, and any open tabs from the one that just ran. The runScenario signature explicitly takes a previousSummaries array so the model sees 'Case 1: PASSED — logged in as the test user' in the context for Case 2, and knows not to re-login. If you want a clean slate, you write a #Case that explicitly logs out first. The same browser is also reused across MCP calls within one Claude Code session (see sharedBrowser at server.ts:31), so you do not pay the Chromium cold-start tax on every assrt_test invocation.

Is this open source? Where does my data go?

Assrt is an open source MCP server, MIT-licensed. The CLI, the MCP server, and the agent all live in the same npm package (@assrt-ai/assrt), and the full source is at /Users/matthewdi/assrt-mcp. When you run a test, the payload (your URL, the accessibility snapshot, any screenshots) goes directly from your machine to the LLM endpoint the agent is configured for. By default that is the Anthropic API with your own key or your Claude Code OAuth token. Every network boundary is configurable: ANTHROPIC_BASE_URL lets you point at a local proxy or an air-gapped endpoint, and --extension lets you run against your live Chrome so nothing ever leaves your real session. There is no 'Assrt cloud' that sees your DOM. Compare with closed enterprise runners priced at $7.5K per seat per month, which by design route every page you test through their backend for analytics and 'healing' features.

How do I run this against a login-gated app without constantly creating fresh accounts?

Two options. First, use disposable email: the agent's create_temp_email tool opens a mail.tm inbox, fills your signup form with that address, and then wait_for_verification_code polls the inbox for up to 120 seconds, extracts the 4-8 digit code, and pastes it into the OTP inputs using the DataTransfer expression from the previous answer. Second, extension mode. Pass --extension to connect the agent to your already-running Chrome via the Playwright extension token (saved to ~/.assrt/extension-token after first use). Your real logged-in session is the test session. Your Gmail, Slack, Zapier, whatever OAuth app you are testing, already has its cookies. The agent walks through the scenario as you would, but the assertions are structured and the run is recorded.
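One concrete sub-step of that flow is extracting the code from the email body. A sketch, assuming the "4-8 digit" rule quoted above; the regex and function name are illustrations, not the actual wait_for_verification_code internals:

```typescript
// Pull a standalone 4-8 digit code out of an email body. Word boundaries
// keep longer digit runs (timestamps, order IDs) from matching.
function extractVerificationCode(emailText: string): string | null {
  const m = emailText.match(/\b\d{4,8}\b/);
  return m ? m[0] : null;
}
```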

What does a failed run look like, and does the agent try to explain itself?

On failure, every step carries its structured metadata (action, description, status, timestamp) and the agent's final complete_scenario call returns both a boolean and a summary string. The whole scenario run, including the base64-encoded screenshots and the assertion log, is written to /tmp/assrt/runs/latest.json (see scenario-files.ts). On top of that, the assrt_diagnose MCP tool re-reads the run log and calls the model again with a diagnosis prompt, returning a plain-English root cause plus a corrected #Case you can paste back into your plan. The video of the failed run is written as .webm and a player HTML is auto-opened at 5x playback, with keyboard controls for 1x/2x/3x/5x/10x speed and arrow-key seek. Debugging an AI test failure tends to be the difference between re-reading logs and watching the browser do the wrong thing at 5x speed.

Does this work in CI, or is it only for local dev?

Both. The CLI supports --json for structured output, --headed=false for headless runs (the default), and --video for .webm recording of each run. The standard CI flow is: npx @assrt-ai/assrt run --url $PREVIEW_URL --plan-file tests/smoke.md --json --video --no-auto-open. The JSON includes the pass/fail count per scenario, the duration, the path to the recorded .webm, and the path to /tmp/assrt/runs/latest.json. GitHub Actions and other runners can upload the .webm as an artifact. For teams who want cross-run history and shareable URLs, the runner can also sync scenarios and artifacts to app.assrt.ai, but that flag is opt-in.
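A sketch of gating a CI job on that JSON. The field names below follow the description above (pass/fail per scenario); the exact key names in latest.json are assumptions, so check the real file before wiring this up:

```typescript
// Hypothetical shape of the run log; verify against your /tmp/assrt/runs/latest.json.
interface RunLog {
  scenarios: { name: string; passed: boolean }[];
}

// Print each failure and return a process exit code for the CI step.
function ciExitCode(log: RunLog): number {
  const failed = log.scenarios.filter((s) => !s.passed);
  for (const f of failed) console.error(`FAILED: ${f.name}`);
  return failed.length === 0 ? 0 : 1;
}
```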
