Playwright as a testing tool for LLM agents
Most writeups treat Playwright as a library you import into a *.spec.ts file and populate with page.locator(...) calls. That is one life of the project. The other life, the one this page is about, started when Microsoft shipped Playwright MCP. The library is now a tool surface that an LLM can drive directly: 21 browser_* primitives over stdio, one accessibility-tree snapshot per step, zero locator strings in the repo.
The rest of this guide traces what that looks like in practice. Every file path and line number is pulled from the open-source Assrt reference implementation, so you can follow along with your own checkout.
The shift almost no one writes up
Introductory guides about Playwright open with the same paragraph. Install it from npm, write a spec file, call page.locator(...), run npx playwright test. That is a fine way to start and a fine way to run CI for a web app where humans are writing the specs. It is not the only way Playwright is used in 2026.
The part that is missing from the beginner pages is an officially published MCP server. The package name is @playwright/mcp, and it is how a Model Context Protocol client, typically a coding agent like Claude Code or Cursor, drives a browser. The client does not see the Playwright API. It sees 21 tools with browser_-prefixed names, each with a JSON schema. The MCP server translates those calls into real Playwright operations inside its own browser process.
Once you see a test run as a sequence of those tool calls, the rest of the design falls out. The repo stops carrying locator strings. The plan starts looking like English. The report becomes plain JSON. Let us walk through it.
The FAQ at the end of this page enumerates all 21 tools Playwright MCP exposes as of April 2026. The rest of this guide is about what an LLM agent does with them.
What sits on top of the tool surface
Raw Playwright MCP gives an agent 21 low-level actions. A testing tool on top of that adds the missing pieces: scenario parsing, assertions, disposable emails, a completion signal, and a place to put the report. Assrt, the reference implementation used throughout this guide, adds eight tools of its own on top of the 21. The shape looks like this.
A testing tool is a thin layer above Playwright MCP
The anchor fact: the exact CLI the agent spawns
Everything downstream in this guide depends on how Playwright MCP is actually started. In the Assrt implementation the launch args are hard-coded at /Users/matthewdi/assrt-mcp/src/core/browser.ts line 296. Three of those flags are non-obvious and worth calling out individually.
- `--viewport-size 1600x900`. Fixes the render size, which stabilises the accessibility tree across runs. A tree whose column count changes because the viewport shrank will produce different labels for the same element, and the agent will re-resolve the same click to a different ref. Pin it at the runner, not per-spec.
- `--output-mode file`. This is the flag that makes the loop actually feasible. Every `browser_snapshot` and `browser_take_screenshot` goes to disk at `~/.assrt/playwright-output/` instead of being inlined into the MCP transport. A full accessibility tree on something Wikipedia-shaped is tens of kilobytes; inlining it into a Claude response would either exceed the context window or burn the agent's attention on irrelevant markup. File mode lets the agent read a truncated excerpt and fetch the rest only if it needs to.
- `--caps devtools`. Unlocks the `browser_start_video` and `browser_stop_video` tools. The reason Assrt can record a per-run video without context-level `recordVideo` config is that devtools capability. It also keeps recording working with a persistent browser session, which context-level recording does not.
> “--output-mode file is the single setting that makes accessibility-tree-first testing work on apps larger than a todo demo.” — browser.ts line 296
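As a minimal sketch of the args shape those flags produce, assuming a hypothetical `buildMcpArgs` helper (the real array is hard-coded in browser.ts, and `cliPath` resolution is elided here):

```typescript
import os from "node:os";
import path from "node:path";

// Illustrative builder for the launch args discussed above. The real code
// hard-codes this array at browser.ts line 296; treat this as a sketch.
function buildMcpArgs(cliPath: string, outputDir: string): string[] {
  return [
    cliPath,
    "--viewport-size", "1600x900", // deterministic accessibility trees
    "--output-mode", "file",       // snapshots go to disk, not the transport
    "--output-dir", outputDir,
    "--caps", "devtools",          // unlocks browser_start_video / stop_video
  ];
}

const args = buildMcpArgs(
  "playwright-mcp-cli.js", // placeholder; the real cliPath is resolved at runtime
  path.join(os.homedir(), ".assrt", "playwright-output"),
);
```

Spawning the MCP server with these args over stdio is what gives the agent its tool surface; everything else in this guide happens on top of that process.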
The agent loop, one step at a time
The Playwright MCP side exposes the raw primitives. The agent side turns them into a disciplined snapshot-act-assert loop. The system prompt for the agent (at agent.ts lines 198-254) spells out the exact sequence: call snapshot first, use a ref from it, call snapshot again after each action, never rely on a stale ref. Here is what a single scenario produces on the terminal when you run the CLI in verbose mode.
What the agent actually calls
The agent has a fixed set of tool schemas, not the full 21-primitive Playwright MCP surface. That is intentional: the agent schema stays small and opinionated, and every call is translated under the hood to one or more browser_* tools by the wrapper.
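One turn of that snapshot-act-snapshot discipline can be sketched as below; `callTool` and `decide` are assumed signatures for illustration, not the real agent.ts API:

```typescript
// One turn of the snapshot-act-assert loop, as a sketch. The agent never
// acts on a stale ref: every action is bracketed by fresh snapshots.
type ToolCall = { name: string; args: Record<string, unknown> };

async function runStep(
  callTool: (name: string, args: Record<string, unknown>) => Promise<string>,
  decide: (snapshot: string) => ToolCall,
): Promise<string> {
  const tree = await callTool("browser_snapshot", {}); // fresh refs, every step
  const action = decide(tree);                         // pick a [ref=eN] from it
  await callTool(action.name, action.args);            // act on the live page
  return callTool("browser_snapshot", {});             // old refs are now stale
}
```

In the real system, `decide` is the LLM reasoning turn; here it is just a callback so the loop shape is visible.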
The sequence of messages on a single step
Zooming in on one step of the loop, here is the actor-level picture. The agent does not talk to the page. The MCP server does. The agent reads tool results and emits tool calls.
One step through four actors
Why this is different from a spec file
The most common reaction to the accessibility-tree-first loop is that it sounds like a recorder with extra steps. It is not. The difference is where the locator string lives. In a classic spec file, the string lives in git and is evaluated at run time. In the agent loop, nothing lives in git except the plan text; the ref that gets clicked is generated fresh from the current state of the page and is thrown away one snapshot later.
| Feature | Playwright as a spec-file library | Playwright as an agent tool surface |
|---|---|---|
| Target resolution per step | Locator string compiled at spec-write time | Live accessibility tree snapshot, fresh ref per step |
| Source of truth for the test | *.spec.ts with locator strings, per-flow | One Markdown plan at /tmp/assrt/scenario.md |
| Who authors each step | A human, before the run | An LLM agent, during the run, per engine |
| Tool surface under the hood | Playwright library API (page.click, page.type) | Playwright MCP, 21 browser_* tools over stdio |
| What ships in CI | Node process + spec files + HTML report | Node process + plan.md + plain JSON report |
| Maintenance trigger | DOM/class rename breaks a selector string | Accessibility label change breaks a step (rarer) |
| OTP / magic-link path | Bring your own email infra | create_temp_email + wait_for_verification_code built in |
| Replay artifact | trace.zip inside Playwright HTML report | WebM video + step-by-step JSON + snapshots/*.yml |
| License and host | Open source library, optional paid cloud ($7.5K/mo+ in some cases) | MIT, runs locally, LLM tokens are the only variable cost |
What actually sits in the repo
A scenario is a Markdown file. That is the entire input contract: `#Case 1: ...` as a header, dash bullets for steps, `Assert:` lines for the checks. The parser at agent.ts lines 620-631 splits on `#Scenario`, `#Test`, or `#Case`. You can drop this in a repo, run it on Monday, rename every CSS class on Tuesday, and run it again on Wednesday without touching the file.
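A parser consistent with that contract can be sketched as follows; the real one lives at agent.ts lines 620-631, and the exact splitting logic here is an assumption:

```typescript
// Split a plan file into scenarios on #Scenario / #Test / #Case headers.
// Illustrative only; the real parser in agent.ts may differ in detail.
type Scenario = { name: string; lines: string[] };

function parseScenarios(md: string): Scenario[] {
  const blocks = md.split(/\n(?=#(?:Scenario|Test|Case))/);
  const scenarios: Scenario[] = [];
  for (const block of blocks) {
    const lines = block.trim().split("\n");
    if (!/^#(?:Scenario|Test|Case)/.test(lines[0] ?? "")) continue;
    scenarios.push({
      name: lines[0].replace(/^#/, "").trim(),
      lines: lines.slice(1).map(l => l.trim()),
    });
  }
  return scenarios;
}
```

Note that nothing in the plan is a selector; every line is intent text the agent resolves against the live tree at run time.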
Case 2 is the flow that tutorials usually skip: signup-with-OTP. The two lines that matter are `use create_temp_email` and `wait for the verification code`. Both map to dedicated agent tools that handle the disposable-inbox side of the test without an external mail server. The split-input OTP pattern (six single-character fields, one per digit) is handled by a fixed browser_evaluate expression in the system prompt, so typing one digit per field never breaks focus.
What the full workflow looks like, end to end
- Write the plan in plain English as #Case blocks
- Agent calls browser_snapshot to read the live accessibility tree
- Agent picks a [ref=eN] from that tree and calls browser_click / browser_type
- Action fires inside the Playwright-controlled browser
- Agent calls browser_snapshot again to see the new state
- Agent records an assert and moves to the next step
- On complete_scenario, writeResultsFile dumps JSON to /tmp/assrt/results/
The last step matters because it is what makes the result diffable. Two runs produce two JSONs at /tmp/assrt/results/<runId>.json, keyed by UUID, plus an overwritten latest.json. Every scenario entry has a stable name and a boolean passed, so a one-liner jq over two runs tells you exactly which cases flipped.
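The jq-style comparison the text describes can equally be sketched in TypeScript; the `name` and `passed` fields are the ones the report is stated to carry:

```typescript
// Find scenarios whose pass/fail status flipped between two runs, using the
// stable name + boolean passed described for each scenario entry.
type Outcome = { name: string; passed: boolean };

function flippedScenarios(before: Outcome[], after: Outcome[]): string[] {
  const prior = new Map(before.map(s => [s.name, s.passed]));
  return after
    .filter(s => prior.has(s.name) && prior.get(s.name) !== s.passed)
    .map(s => s.name);
}
```

Feed it the `scenarios` arrays from two `/tmp/assrt/results/<runId>.json` files and you get the list of cases that changed outcome between runs.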
Five ideas worth keeping
browser_snapshot is the first call, every step
The accessibility tree, not the DOM. Each node gets a short ref ID like [ref=e5] that survives one snapshot and one snapshot only. The agent never queries by CSS, XPath, or test-id.
--output-mode file keeps the loop feasible
Accessibility trees on a large app can be tens of kilobytes. Dumping them to ~/.assrt/playwright-output/ instead of stuffing them into the MCP transport is what makes the Haiku context budget workable.
18 agent-facing tools wrap 21 browser primitives
navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, plus assert, complete_scenario, create_temp_email, wait_for_verification_code, check_email_inbox, suggest_improvement, http_request, wait_for_stable.
--extension attaches to your real Chrome
For SSO, 2FA, and any flow that needs existing cookies. The token is saved to ~/.assrt/extension-token on first approval and reused. No selector changes, same tool loop.
Claude Haiku 4.5 by default, Gemini swappable
Pinned as claude-haiku-4-5-20251001 at agent.ts:9. Gemini path at lines 354-357 uses gemini-3.1-pro-preview. A typical step is one snapshot plus one action plus one reasoning turn, on the order of a tenth of a cent.
The rough numbers
Ballparks for a five-case plan with roughly twenty steps, run locally against a dev server. These come from the shapes in writeResultsFile and the Haiku rate card as of April 2026.
- Launch flags: hard-coded in browser.ts line 296. Stabilises accessibility-tree output across runs and engines.
- Per-step cost: one snapshot, one action, one short reasoning turn. A full five-case scenario lands around a couple of cents.
When a spec file is still the right answer
The accessibility-tree-first loop is not a replacement for every test. If you are writing unit-level tests against a stable internal component library, a locator string is genuinely cheaper than a snapshot plus a tool-call; you know the DOM, the selector is not going to drift, and the LLM overhead per step is pure cost. Keep using @playwright/test for that.
Where the agent loop earns its keep is at the top of the test pyramid. End-to-end flows against real apps, third-party checkout pages where you cannot stabilise selectors, flows that change shape per tenant, signup flows gated by OTP or magic link, and anything you want an AI coding agent to write or maintain without a human curating the locator strings. That is the space where treating Playwright as a tool surface instead of a library wins.
Frequently asked questions
What exactly is Playwright MCP, and how does a testing tool use it?
Playwright MCP is a Model Context Protocol server that Microsoft ships alongside the Playwright library. Instead of exposing a TypeScript API to a human who writes `page.click(...)` in a spec file, it exposes 21 browser tools over stdio to any MCP client, including LLM agents. The tools are named with a `browser_` prefix: `browser_navigate`, `browser_snapshot`, `browser_click`, `browser_type`, `browser_press_key`, `browser_evaluate`, `browser_resize`, `browser_select_option`, `browser_take_screenshot`, `browser_wait_for`, `browser_tabs`, `browser_hover`, `browser_drag`, `browser_file_upload`, `browser_fill_form`, `browser_handle_dialog`, `browser_console_messages`, `browser_network_requests`, `browser_navigate_back`, `browser_run_code`, and `browser_close`. A testing tool sitting on top of that surface does not write locator strings; it asks the agent to produce tool calls that the MCP server translates into Playwright API calls inside the same browser process. You can see this in Assrt at /Users/matthewdi/assrt-mcp/src/core/browser.ts lines 516-668, where every public method forwards directly to `this.callTool("browser_*", ...)`.
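The forwarding pattern described for browser.ts lines 516-668 can be sketched like this; the method names and `callTool` signature are assumptions, and only the `browser_*` tool names come from the list above:

```typescript
// Every public method forwards straight to an MCP tool call. Sketch only:
// the real class in browser.ts has more methods and a different constructor.
type CallTool = (name: string, args: Record<string, unknown>) => Promise<string>;

class BrowserSurface {
  constructor(private callTool: CallTool) {}

  navigate(url: string) {
    return this.callTool("browser_navigate", { url });
  }
  snapshot() {
    return this.callTool("browser_snapshot", {});
  }
  click(element: string, ref: string) {
    return this.callTool("browser_click", { element, ref });
  }
  type(element: string, ref: string, text: string) {
    return this.callTool("browser_type", { element, ref, text });
  }
}
```

The point of the pattern is that the wrapper holds no page state and no selectors; it is a pass-through to the MCP server, which owns the browser.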
How does an LLM agent know which element to click without a selector string?
Before every action, the agent calls `browser_snapshot`. The snapshot is an accessibility-tree YAML, not a DOM dump. Each element appears as a line with a role, accessible name, and a short reference ID like `[ref=e5]`. The agent passes the ref back into `browser_click` or `browser_type`. That ref lives for exactly one snapshot; the next snapshot generates fresh refs. This is why there is no selector string in the repo and no selector drift between runs. The strategy is codified in the system prompt at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 198-254, which instructs the agent to call snapshot first, use the ref value in click or type_text, and call snapshot again after each action to refresh refs.
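For illustration, here is a tiny helper that pulls a ref out of a snapshot excerpt; the YAML shape shown is simplified from real Playwright MCP output, and the agent does this by reasoning over the tree rather than with a regex:

```typescript
// Extract the [ref=eN] for a node with a given accessible name from a
// simplified accessibility-tree excerpt. Illustrative only; names containing
// regex metacharacters are not handled.
function findRef(snapshot: string, accessibleName: string): string | null {
  const re = new RegExp(`"${accessibleName}"[^\\n]*\\[ref=(e\\d+)\\]`);
  const match = snapshot.match(re);
  return match ? match[1] : null;
}

const excerpt = [
  '- button "Sign up" [ref=e5]',
  '- textbox "Email" [ref=e6]',
].join("\n");
// findRef(excerpt, "Sign up") → "e5", which the agent would pass to browser_click
```

The key property is that `e5` is meaningless outside this snapshot; the next `browser_snapshot` call mints fresh refs for the same elements.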
What are the exact CLI args Assrt uses to spawn Playwright MCP?
Hard-coded in /Users/matthewdi/assrt-mcp/src/core/browser.ts at line 296. The args array is `[cliPath, "--viewport-size", "1600x900", "--output-mode", "file", "--output-dir", outputDir, "--caps", "devtools"]` where `cliPath` resolves to the Playwright MCP binary and `outputDir` is `~/.assrt/playwright-output`. Three flags matter for the agent loop. `--viewport-size 1600x900` fixes the render size so accessibility trees are deterministic across runs. `--output-mode file` writes every snapshot and screenshot to disk rather than inlining them into the MCP transport, which is the piece that makes the loop actually feasible; a Wikipedia-sized accessibility tree inlined into a Claude response would exceed context, but a file reference plus a truncated excerpt will not. `--caps devtools` enables `browser_start_video` and `browser_stop_video`, which is how Assrt records a per-run video without context-level recordVideo config.
Why not just keep writing *.spec.ts files?
Two practical reasons. First, spec files bake locator strings into the repo. Those strings are the single largest source of flake in a cross-browser or cross-release run; a shadow-DOM traversal quirk, an accessibility-role default, or a CSS class rename silently shifts which node a selector resolves to. The agent loop resolves the target per step from the engine's live accessibility tree, so there is nothing to drift. Second, when a human writes selectors, the repo ends up containing a new set of selectors for every new flow. When an agent reads the accessibility tree per step, the only thing kept in source control is the plan file, phrased in plain English. At /Users/matthewdi/assrt-mcp/src/core/scenario-files.ts lines 16-20, the plan lives at `/tmp/assrt/scenario.md` and is short enough to diff by eye. Grepping /Users/matthewdi/assrt-mcp/src/core/agent.ts for `selector`, `xpath`, `testid`, or `locator` returns no matches.
Which model runs the agent loop, and what does it cost per step?
The default model is Claude Haiku 4.5, pinned as `claude-haiku-4-5-20251001` at /Users/matthewdi/assrt-mcp/src/core/agent.ts line 9. A typical step is one `browser_snapshot` call plus one action call plus one short reasoning turn, which is a few thousand input tokens of accessibility tree and a hundred or so output tokens. At Haiku April 2026 rates that is on the order of a tenth of a cent per step; a five-case scenario with twenty total steps comes out to a couple of cents. Provider is pluggable at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 342-367, which also supports Gemini via `--provider gemini` and a default of `gemini-3.1-pro-preview`. Neither path requires a cloud testing platform subscription.
What does a scenario actually look like, and what does the agent do with it?
A scenario is Markdown. `#Case 1: ...` as a header, dash bullets for steps, `Assert:` lines for checks. The parser at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 620-631 splits on any of `#Scenario`, `#Test`, or `#Case`. For each scenario the agent gets an initial `browser_snapshot` plus an optional screenshot, and then runs a tool-use loop until it calls `complete_scenario`. During the loop the same browser state carries across scenarios, which is how you chain login into a follow-up test without re-running the credentials flow. Tool calls can also include `assert` (records an assertion in the report), `create_temp_email` (spins up a disposable inbox for OTP flows), and `http_request` (hits an external API to verify a webhook fired).
What is written to disk after every run, and how is that structured?
Two files, every run, at stable paths defined in /Users/matthewdi/assrt-mcp/src/core/scenario-files.ts. `writeResultsFile(runId, results)` on lines 77-84 writes `/tmp/assrt/results/latest.json` (overwritten each run) and `/tmp/assrt/results/<runId>.json` (historical, UUID-keyed). The JSON is a `TestReport` (types.ts lines 28-35) that wraps an array of `ScenarioResult` (lines 19-26); each scenario carries `name`, `passed`, `steps[]`, `assertions[]`, `summary`, and `duration`. The `steps[]` entries record one row per tool call the agent made, with the human-readable action, status, and optional error. That is enough to reconstruct the full trajectory after the fact, and because it is plain JSON you can `jq`, `git add`, or diff two runs with standard unix tools.
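Putting the described shapes together as a sketch (field names follow the FAQ's description of types.ts; exact optionality and the write logic are assumptions):

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import path from "node:path";

// Shapes reconstructed from the field names listed above; treat the exact
// types in types.ts as authoritative.
interface StepRecord { action: string; status: string; error?: string }
interface Assertion { description: string; passed: boolean }
interface ScenarioResult {
  name: string;
  passed: boolean;
  steps: StepRecord[];
  assertions: Assertion[];
  summary: string;
  duration: number;
}

const RESULTS_DIR = "/tmp/assrt/results";

// Mirrors the described behavior: one historical file keyed by runId, one
// latest.json overwritten each run.
function writeResultsFile(runId: string, scenarios: ScenarioResult[]): void {
  mkdirSync(RESULTS_DIR, { recursive: true });
  const json = JSON.stringify(scenarios, null, 2);
  writeFileSync(path.join(RESULTS_DIR, `${runId}.json`), json);
  writeFileSync(path.join(RESULTS_DIR, "latest.json"), json);
}
```

Because both files are plain JSON at stable paths, any downstream consumer (CI step, jq one-liner, dashboard) can read them without importing anything from the tool.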
Can I use my real Chrome session for logins, instead of a clean profile?
Yes. Pass `--extension` to the CLI or set `extension: true` on the MCP tool. The launcher at /Users/matthewdi/assrt-mcp/src/core/browser.ts lines 299-306 skips the profile directory and passes `--extension` through to Playwright MCP, which then attaches to an already-running Chrome via the Playwright MCP Chrome extension. The first time you run in this mode, Chrome shows an approval dialog; the token is saved to `~/.assrt/extension-token` and reused on subsequent runs (resolveExtensionToken in the same file). This is the only way to test behind an authenticated session that uses SSO, 2FA, or a corporate identity provider, because the real browser already carries all the cookies and the fingerprint that the auth flow expects.
What happens when the page is still loading after an action?
The agent calls `wait_for_stable` (agent.ts lines 186-195, handler at lines 956-1009). That tool injects a MutationObserver via `browser_evaluate`, polls `window.__assrt_mutations` every 500ms, and returns once the count has not changed for the configured stable window (default 2 seconds, max 10). This replaces the older pattern of `wait(ms)` with a magic number. It matters for async content: streaming AI responses, skeleton screens, and search results that populate after an XHR. The MutationObserver watches `document.body` with `childList: true, subtree: true, characterData: true`, and cleans itself up when the waiter exits. After it returns, the agent calls `browser_snapshot` again to see the fully-loaded tree.
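The polling side of that pattern can be sketched as below, with `evaluate` standing in for `browser_evaluate`; the intervals mirror the description, and the real handler in agent.ts differs in detail:

```typescript
// Poll a page-side mutation counter until it stops changing for `stableMs`,
// giving up after `maxMs`. The observer-install expression mirrors the one
// described above; exact wording in agent.ts may differ.
const installObserver = `
  window.__assrt_mutations = 0;
  new MutationObserver(() => { window.__assrt_mutations++; })
    .observe(document.body, { childList: true, subtree: true, characterData: true });
`;

async function waitForStable(
  evaluate: (expr: string) => Promise<unknown>,
  stableMs = 2000,
  maxMs = 10000,
  pollMs = 500,
): Promise<boolean> {
  await evaluate(installObserver);
  let last = -1;
  let stableSince = Date.now();
  const start = Date.now();
  while (Date.now() - start < maxMs) {
    const count = Number(await evaluate("window.__assrt_mutations"));
    if (count !== last) {
      last = count;                 // still mutating: reset the stability clock
      stableSince = Date.now();
    } else if (Date.now() - stableSince >= stableMs) {
      return true;                  // quiet for the whole stable window
    }
    await new Promise(r => setTimeout(r, pollMs));
  }
  return false;                     // never stabilised within maxMs
}
```

The boolean return is the useful part: a `false` tells the agent the page is still churning, which is itself a signal worth recording in the step log.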
What about OTP or magic-link flows that every classic Playwright tutorial skips?
The agent has two dedicated tools. `create_temp_email` spins up a disposable inbox (DisposableEmail.create in core/email.ts) and returns the address for the signup form. `wait_for_verification_code` polls that inbox for up to 120 seconds and returns the code plus sender metadata. The third piece, at agent.ts lines 233-236, handles the split-input OTP pattern where each digit goes into its own `<input maxlength="1">`. The agent is told to call `browser_evaluate` with a ClipboardEvent + DataTransfer paste against the parent element, not to type each digit one by one; typing into the first field tends to move focus unpredictably on React apps. That one exact expression is baked into the system prompt so every run handles OTP without needing a per-app workaround.
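The paste trick can be sketched as the expression the agent would hand to `browser_evaluate`; the exact expression baked into the system prompt may differ:

```typescript
// Build a browser-side expression that pastes the whole OTP into the parent
// of the split inputs via ClipboardEvent + DataTransfer, instead of typing
// digit by digit. Sketch only; agent.ts bakes its own exact expression.
function buildOtpPasteExpression(code: string): string {
  const safe = JSON.stringify(code); // escape the code into a JS string literal
  return `(el) => {
    const dt = new DataTransfer();
    dt.setData("text/plain", ${safe});
    el.dispatchEvent(new ClipboardEvent("paste", {
      clipboardData: dt, bubbles: true, cancelable: true,
    }));
  }`;
}
```

Dispatching one paste event on the parent lets the app's own paste handler distribute the digits, which sidesteps the focus-jumping that per-digit typing triggers on React apps.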
How is this different from recording a test with Codegen?
Codegen records a human's clicks into a spec file full of locator strings, which is exactly what this approach moves away from. The artifact a recorder produces is read-only in practice because renaming a class breaks every selector downstream. The artifact Assrt produces is a plain Markdown plan plus a JSON report; the plan is authored in terms of intent ("Click the Sign up button"), not in terms of the DOM that existed on the day you recorded. When the DOM changes on the next deploy, the same plan still runs, because the agent re-reads the accessibility tree per step. The upside of Codegen is speed of initial authoring; the downside is the maintenance tax. The upside of the agent loop is that maintenance flattens, because intent text rots more slowly than selector strings.
Is this actually open source, or just an open-source-ish wrapper around a paid cloud?
Both the Playwright MCP server (Microsoft) and Assrt (MIT, on GitHub at github.com/m13v/assrt-mcp) run locally with no network dependency beyond the LLM provider. You can read every file referenced in this guide on disk. The agent loop, the tool schemas, and the file layout at `/tmp/assrt/` are all local. A hosted runner at app.assrt.ai exists for sharing results in a browser, but it is optional; `assrt run --url ... --plan "..."` with the `--json` flag produces identical reports locally. There is no call home during a run. The npm package itself lists its dependencies in plain view: `@anthropic-ai/sdk`, `@google/genai`, `@modelcontextprotocol/sdk`, and `@playwright/mcp`.