Playwright as a testing tool for LLM agents
Most writeups treat Playwright as a library you import into a *.spec.ts file and populate with page.locator(...) calls. That is one life of the project. The other life, the one this page is about, started when Microsoft shipped Playwright MCP. The library is now a tool surface that an LLM can drive directly: 21 browser_* primitives over stdio, one accessibility-tree snapshot per step, zero locator strings in the repo.
The rest of this guide traces what that looks like in practice. Every file path and line number is pulled from the open-source Assrt reference implementation, so you can follow along with your own checkout.
The shift almost no one writes up
Introductory guides about Playwright open with the same paragraph. Install it from npm, write a spec file, call page.locator(...), run npx playwright test. That is a fine way to start and a fine way to run CI for a web app where humans are writing the specs. It is not the only way Playwright is used in 2026.
The part that is missing from the beginner pages is an officially published MCP server. The package name is @playwright/mcp, and it is how a Model Context Protocol client, typically a coding agent like Claude Code or Cursor, drives a browser. The client does not see the Playwright API. It sees 21 tools with browser_-prefixed names, each with a JSON schema. The MCP server translates those calls into real Playwright operations inside its own browser process.
Once you see a test run as a sequence of those tool calls, the rest of the design falls out. The repo stops carrying locator strings. The plan starts looking like English. The report becomes plain JSON. Let us walk through it.
The FAQ at the end of this page enumerates all 21 tools Playwright MCP exposes as of April 2026. The rest of this guide is about what an LLM agent does with them.
What sits on top of the tool surface
Raw Playwright MCP gives an agent 21 low-level actions. A testing tool on top of that adds the missing pieces: scenario parsing, assertions, disposable emails, a completion signal, and a place to put the report. Assrt, the reference implementation used throughout this guide, adds eight tools of its own on top of the 21. The shape looks like this.
A testing tool is a thin layer above Playwright MCP
The anchor fact: the exact CLI the agent spawns
Everything downstream in this guide depends on how Playwright MCP is actually started. In the Assrt implementation the launch args are hard-coded at /Users/matthewdi/assrt-mcp/src/core/browser.ts line 296. Three of those flags are non-obvious and worth calling out individually.
- `--viewport-size 1600x900`. Fixes the render size, which stabilises the accessibility tree across runs. A tree whose column count changes because the viewport shrank will produce different labels for the same element, and the agent will re-resolve the same click to a different ref. Pin it at the runner, not per-spec.
- `--output-mode file`. This is the flag that makes the loop actually feasible. Every `browser_snapshot` and `browser_take_screenshot` goes to disk at `~/.assrt/playwright-output/` instead of being inlined into the MCP transport. A full accessibility tree on something Wikipedia-shaped is tens of kilobytes; inlining it into a Claude response would either exceed the context window or burn the agent's attention on irrelevant markup. File mode lets the agent read a truncated excerpt and fetch the rest only if it needs to.
- `--caps devtools`. Unlocks the `browser_start_video` and `browser_stop_video` tools. The reason Assrt can record a per-run video without context-level `recordVideo` config is that devtools capability. It also keeps recording working with a persistent browser session, which context-level recording does not.
> “--output-mode file is the single setting that makes accessibility-tree-first testing work on apps larger than a todo demo.” — browser.ts line 296
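As a minimal sketch of the args shape those flags produce, assuming a hypothetical `buildMcpArgs` helper (the real array is hard-coded in browser.ts, and `cliPath` resolution is elided here):

```typescript
import os from "node:os";
import path from "node:path";

// Illustrative builder for the launch args discussed above. The real code
// hard-codes this array at browser.ts line 296; treat this as a sketch.
function buildMcpArgs(cliPath: string, outputDir: string): string[] {
  return [
    cliPath,
    "--viewport-size", "1600x900", // deterministic accessibility trees
    "--output-mode", "file",       // snapshots go to disk, not the transport
    "--output-dir", outputDir,
    "--caps", "devtools",          // unlocks browser_start_video / stop_video
  ];
}

const args = buildMcpArgs(
  "playwright-mcp-cli.js", // placeholder; the real cliPath is resolved at runtime
  path.join(os.homedir(), ".assrt", "playwright-output"),
);
```

Spawning the MCP server with these args over stdio is what gives the agent its tool surface; everything else in this guide happens on top of that process.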
The agent loop, one step at a time
The Playwright MCP side exposes the raw primitives. The agent side turns them into a disciplined snapshot-act-assert loop. The system prompt for the agent (at agent.ts lines 198-254) spells out the exact sequence: call snapshot first, use a ref from it, call snapshot again after each action, never rely on a stale ref. Here is what a single scenario produces on the terminal when you run the CLI in verbose mode.
What the agent actually calls
The agent has a fixed set of tool schemas, not the full 21-primitive Playwright MCP surface. That is intentional: the agent schema stays small and opinionated, and every call is translated under the hood to one or more browser_* tools by the wrapper.
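One turn of that snapshot-act-snapshot discipline can be sketched as below; `callTool` and `decide` are assumed signatures for illustration, not the real agent.ts API:

```typescript
// One turn of the snapshot-act-assert loop, as a sketch. The agent never
// acts on a stale ref: every action is bracketed by fresh snapshots.
type ToolCall = { name: string; args: Record<string, unknown> };

async function runStep(
  callTool: (name: string, args: Record<string, unknown>) => Promise<string>,
  decide: (snapshot: string) => ToolCall,
): Promise<string> {
  const tree = await callTool("browser_snapshot", {}); // fresh refs, every step
  const action = decide(tree);                         // pick a [ref=eN] from it
  await callTool(action.name, action.args);            // act on the live page
  return callTool("browser_snapshot", {});             // old refs are now stale
}
```

In the real system, `decide` is the LLM reasoning turn; here it is just a callback so the loop shape is visible.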
The sequence of messages on a single step
Zooming in on one step of the loop, here is the actor-level picture. The agent does not talk to the page. The MCP server does. The agent reads tool results and emits tool calls.
One step through four actors
Why this is different from a spec file
The most common reaction to the accessibility-tree-first loop is that it sounds like a recorder with extra steps. It is not. The difference is where the locator string lives. In a classic spec file, the string lives in git and is evaluated at run time. In the agent loop, nothing lives in git except the plan text; the ref that gets clicked is generated fresh from the current state of the page and is thrown away one snapshot later.
| Feature | Playwright as a spec-file library | Playwright as an agent tool surface |
|---|---|---|
| Target resolution per step | Locator string compiled at spec-write time | Live accessibility tree snapshot, fresh ref per step |
| Source of truth for the test | *.spec.ts with locator strings, per-flow | One Markdown plan at /tmp/assrt/scenario.md |
| Who authors each step | A human, before the run | An LLM agent, during the run, per engine |
| Tool surface under the hood | Playwright library API (page.click, page.type) | Playwright MCP, 21 browser_* tools over stdio |
| What ships in CI | Node process + spec files + HTML report | Node process + plan.md + plain JSON report |
| Maintenance trigger | DOM/class rename breaks a selector string | Accessibility label change breaks a step (rarer) |
| OTP / magic-link path | Bring your own email infra | create_temp_email + wait_for_verification_code built in |
| Replay artifact | trace.zip inside Playwright HTML report | WebM video + step-by-step JSON + snapshots/*.yml |
| License and host | Open source library, optional paid cloud ($7.5K/mo+ in some cases) | MIT, runs locally, LLM tokens are the only variable cost |
What actually sits in the repo
A scenario is a Markdown file. That is the entire input contract: `#Case 1: ...` as a header, dash bullets for steps, `Assert:` lines for the checks. The parser at agent.ts lines 620-631 splits on `#Scenario`, `#Test`, or `#Case`. You can drop this in a repo, run it on Monday, rename every CSS class on Tuesday, and run it again on Wednesday without touching the file.
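A parser consistent with that contract can be sketched as follows; the real one lives at agent.ts lines 620-631, and the exact splitting logic here is an assumption:

```typescript
// Split a plan file into scenarios on #Scenario / #Test / #Case headers.
// Illustrative only; the real parser in agent.ts may differ in detail.
type Scenario = { name: string; lines: string[] };

function parseScenarios(md: string): Scenario[] {
  const blocks = md.split(/\n(?=#(?:Scenario|Test|Case))/);
  const scenarios: Scenario[] = [];
  for (const block of blocks) {
    const lines = block.trim().split("\n");
    if (!/^#(?:Scenario|Test|Case)/.test(lines[0] ?? "")) continue;
    scenarios.push({
      name: lines[0].replace(/^#/, "").trim(),
      lines: lines.slice(1).map(l => l.trim()),
    });
  }
  return scenarios;
}
```

Note that nothing in the plan is a selector; every line is intent text the agent resolves against the live tree at run time.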
Case 2 is the flow that tutorials usually skip: signup-with-OTP. The two lines that matter are `use create_temp_email` and `wait for the verification code`. Both map to dedicated agent tools that handle the disposable-inbox side of the test without an external mail server. The split-input OTP pattern (six single-character fields, one per digit) is handled by a fixed browser_evaluate expression in the system prompt, so typing one digit per field never breaks focus.
What the full workflow looks like, end to end
- Write the plan in plain English as #Case blocks
- Agent calls browser_snapshot to read the live accessibility tree
- Agent picks a [ref=eN] from that tree and calls browser_click / browser_type
- Action fires inside the Playwright-controlled browser
- Agent calls browser_snapshot again to see the new state
- Agent records an assert and moves to the next step
- On complete_scenario, writeResultsFile dumps JSON to /tmp/assrt/results/
The last step matters because it is what makes the result diffable. Two runs produce two JSONs at /tmp/assrt/results/<runId>.json, keyed by UUID, plus an overwritten latest.json. Every scenario entry has a stable name and a boolean passed, so a one-liner jq over two runs tells you exactly which cases flipped.
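The jq-style comparison the text describes can equally be sketched in TypeScript; the `name` and `passed` fields are the ones the report is stated to carry:

```typescript
// Find scenarios whose pass/fail status flipped between two runs, using the
// stable name + boolean passed described for each scenario entry.
type Outcome = { name: string; passed: boolean };

function flippedScenarios(before: Outcome[], after: Outcome[]): string[] {
  const prior = new Map(before.map(s => [s.name, s.passed]));
  return after
    .filter(s => prior.has(s.name) && prior.get(s.name) !== s.passed)
    .map(s => s.name);
}
```

Feed it the `scenarios` arrays from two `/tmp/assrt/results/<runId>.json` files and you get the list of cases that changed outcome between runs.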
Five ideas worth keeping
browser_snapshot is the first call, every step
The accessibility tree, not the DOM. Each node gets a short ref ID like [ref=e5] that survives one snapshot and one snapshot only. The agent never queries by CSS, XPath, or test-id.
--output-mode file keeps the loop feasible
Accessibility trees on a large app can be tens of kilobytes. Dumping them to ~/.assrt/playwright-output/ instead of stuffing them into the MCP transport is what makes the Haiku context budget workable.
18 agent-facing tools wrap 21 browser primitives
navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, plus assert, complete_scenario, create_temp_email, wait_for_verification_code, check_email_inbox, suggest_improvement, http_request, wait_for_stable.
--extension attaches to your real Chrome
For SSO, 2FA, and any flow that needs existing cookies. The token is saved to ~/.assrt/extension-token on first approval and reused. No selector changes, same tool loop.
Claude Haiku 4.5 by default, Gemini swappable
Pinned as claude-haiku-4-5-20251001 at agent.ts:9. Gemini path at lines 354-357 uses gemini-3.1-pro-preview. A typical step is one snapshot plus one action plus one reasoning turn, on the order of a tenth of a cent.
The rough numbers
Ballparks for a five-case plan with roughly twenty steps, run locally against a dev server. These come from the shapes in writeResultsFile and the Haiku rate card as of April 2026.
- Launch flags: hard-coded in browser.ts line 296. Stabilises accessibility-tree output across runs and engines.
- Per-step cost: one snapshot, one action, one short reasoning turn. A full five-case scenario lands around a couple of cents.
When a spec file is still the right answer
The accessibility-tree-first loop is not a replacement for every test. If you are writing unit-level tests against a stable internal component library, a locator string is genuinely cheaper than a snapshot plus a tool-call; you know the DOM, the selector is not going to drift, and the LLM overhead per step is pure cost. Keep using @playwright/test for that.
Where the agent loop earns its keep is at the top of the test pyramid. End-to-end flows against real apps, third-party checkout pages where you cannot stabilise selectors, flows that change shape per tenant, signup flows gated by OTP or magic link, and anything you want an AI coding agent to write or maintain without a human curating the locator strings. That is the space where treating Playwright as a tool surface instead of a library wins.
Frequently asked questions
What exactly is Playwright MCP, and how does a testing tool use it?
Playwright MCP is a Model Context Protocol server that Microsoft ships alongside the Playwright library. Instead of exposing a TypeScript API to a human who writes `page.click(...)` in a spec file, it exposes 21 browser tools over stdio to any MCP client, including LLM agents. The tools are named with a `browser_` prefix: `browser_navigate`, `browser_snapshot`, `browser_click`, `browser_type`, `browser_press_key`, `browser_evaluate`, `browser_resize`, `browser_select_option`, `browser_take_screenshot`, `browser_wait_for`, `browser_tabs`, `browser_hover`, `browser_drag`, `browser_file_upload`, `browser_fill_form`, `browser_handle_dialog`, `browser_console_messages`, `browser_network_requests`, `browser_navigate_back`, `browser_run_code`, and `browser_close`. A testing tool sitting on top of that surface does not write locator strings; it asks the agent to produce tool calls that the MCP server translates into Playwright API calls inside the same browser process. You can see this in Assrt at /Users/matthewdi/assrt-mcp/src/core/browser.ts lines 516-668, where every public method forwards directly to `this.callTool("browser_*", ...)`.
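The forwarding pattern described for browser.ts lines 516-668 can be sketched like this; the method names and `callTool` signature are assumptions, and only the `browser_*` tool names come from the list above:

```typescript
// Every public method forwards straight to an MCP tool call. Sketch only:
// the real class in browser.ts has more methods and a different constructor.
type CallTool = (name: string, args: Record<string, unknown>) => Promise<string>;

class BrowserSurface {
  constructor(private callTool: CallTool) {}

  navigate(url: string) {
    return this.callTool("browser_navigate", { url });
  }
  snapshot() {
    return this.callTool("browser_snapshot", {});
  }
  click(element: string, ref: string) {
    return this.callTool("browser_click", { element, ref });
  }
  type(element: string, ref: string, text: string) {
    return this.callTool("browser_type", { element, ref, text });
  }
}
```

The point of the pattern is that the wrapper holds no page state and no selectors; it is a pass-through to the MCP server, which owns the browser.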
How does an LLM agent know which element to click without a selector string?
Before every action, the agent calls `browser_snapshot`. The snapshot is an accessibility-tree YAML, not a DOM dump. Each element appears as a line with a role, accessible name, and a short reference ID like `[ref=e5]`. The agent passes the ref back into `browser_click` or `browser_type`. That ref lives for exactly one snapshot; the next snapshot generates fresh refs. This is why there is no selector string in the repo and no selector drift between runs. The strategy is codified in the system prompt at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 198-254, which instructs the agent to call snapshot first, use the ref value in click or type_text, and call snapshot again after each action to refresh refs.
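For illustration, here is a tiny helper that pulls a ref out of a snapshot excerpt; the YAML shape shown is simplified from real Playwright MCP output, and the agent does this by reasoning over the tree rather than with a regex:

```typescript
// Extract the [ref=eN] for a node with a given accessible name from a
// simplified accessibility-tree excerpt. Illustrative only; names containing
// regex metacharacters are not handled.
function findRef(snapshot: string, accessibleName: string): string | null {
  const re = new RegExp(`"${accessibleName}"[^\\n]*\\[ref=(e\\d+)\\]`);
  const match = snapshot.match(re);
  return match ? match[1] : null;
}

const excerpt = [
  '- button "Sign up" [ref=e5]',
  '- textbox "Email" [ref=e6]',
].join("\n");
// findRef(excerpt, "Sign up") → "e5", which the agent would pass to browser_click
```

The key property is that `e5` is meaningless outside this snapshot; the next `browser_snapshot` call mints fresh refs for the same elements.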
What are the exact CLI args Assrt uses to spawn Playwright MCP?
Hard-coded in /Users/matthewdi/assrt-mcp/src/core/browser.ts at line 296. The args array is `[cliPath, "--viewport-size", "1600x900", "--output-mode", "file", "--output-dir", outputDir, "--caps", "devtools"]` where `cliPath` resolves to the Playwright MCP binary and `outputDir` is `~/.assrt/playwright-output`. Three flags matter for the agent loop. `--viewport-size 1600x900` fixes the render size so accessibility trees are deterministic across runs. `--output-mode file` writes every snapshot and screenshot to disk rather than inlining them into the MCP transport, which is the piece that makes the loop actually feasible; a Wikipedia-sized accessibility tree inlined into a Claude response would exceed context, but a file reference plus a truncated excerpt will not. `--caps devtools` enables `browser_start_video` and `browser_stop_video`, which is how Assrt records a per-run video without context-level recordVideo config.
Why not just keep writing *.spec.ts files?
Two practical reasons. First, spec files bake locator strings into the repo. Those strings are the single largest source of flake in a cross-browser or cross-release run; a shadow-DOM traversal quirk, an accessibility-role default, or a CSS class rename silently shifts which node a selector resolves to. The agent loop resolves the target per step from the engine's live accessibility tree, so there is nothing to drift. Second, when a human writes selectors, the repo ends up containing a new set of selectors for every new flow. When an agent reads the accessibility tree per step, the only thing kept in source control is the plan file, phrased in plain English. At /Users/matthewdi/assrt-mcp/src/core/scenario-files.ts lines 16-20, the plan lives at `/tmp/assrt/scenario.md` and is short enough to diff by eye. Grepping /Users/matthewdi/assrt-mcp/src/core/agent.ts for `selector`, `xpath`, `testid`, or `locator` returns no matches.
Which model runs the agent loop, and what does it cost per step?
The default model is Claude Haiku 4.5, pinned as `claude-haiku-4-5-20251001` at /Users/matthewdi/assrt-mcp/src/core/agent.ts line 9. A typical step is one `browser_snapshot` call plus one action call plus one short reasoning turn, which is a few thousand input tokens of accessibility tree and a hundred or so output tokens. At Haiku April 2026 rates that is on the order of a tenth of a cent per step; a five-case scenario with twenty total steps comes out to a couple of cents. Provider is pluggable at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 342-367, which also supports Gemini via `--provider gemini` and a default of `gemini-3.1-pro-preview`. Neither path requires a cloud testing platform subscription.
What does a scenario actually look like, and what does the agent do with it?
A scenario is Markdown. `#Case 1: ...` as a header, dash bullets for steps, `Assert:` lines for checks. The parser at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 620-631 splits on any of `#Scenario`, `#Test`, or `#Case`. For each scenario the agent gets an initial `browser_snapshot` plus an optional screenshot, and then runs a tool-use loop until it calls `complete_scenario`. During the loop the same browser state carries across scenarios, which is how you chain login into a follow-up test without re-running the credentials flow. Tool calls can also include `assert` (records an assertion in the report), `create_temp_email` (spins up a disposable inbox for OTP flows), and `http_request` (hits an external API to verify a webhook fired).
What is written to disk after every run, and how is that structured?
Two files, every run, at stable paths defined in /Users/matthewdi/assrt-mcp/src/core/scenario-files.ts. `writeResultsFile(runId, results)` on lines 77-84 writes `/tmp/assrt/results/latest.json` (overwritten each run) and `/tmp/assrt/results/<runId>.json` (historical, UUID-keyed). The JSON is a `TestReport` (types.ts lines 28-35) that wraps an array of `ScenarioResult` (lines 19-26); each scenario carries `name`, `passed`, `steps[]`, `assertions[]`, `summary`, and `duration`. The `steps[]` entries record one row per tool call the agent made, with the human-readable action, status, and optional error. That is enough to reconstruct the full trajectory after the fact, and because it is plain JSON you can `jq`, `git add`, or diff two runs with standard unix tools.
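Putting the described shapes together as a sketch (field names follow the FAQ's description of types.ts; exact optionality and the write logic are assumptions):

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import path from "node:path";

// Shapes reconstructed from the field names listed above; treat the exact
// types in types.ts as authoritative.
interface StepRecord { action: string; status: string; error?: string }
interface Assertion { description: string; passed: boolean }
interface ScenarioResult {
  name: string;
  passed: boolean;
  steps: StepRecord[];
  assertions: Assertion[];
  summary: string;
  duration: number;
}

const RESULTS_DIR = "/tmp/assrt/results";

// Mirrors the described behavior: one historical file keyed by runId, one
// latest.json overwritten each run.
function writeResultsFile(runId: string, scenarios: ScenarioResult[]): void {
  mkdirSync(RESULTS_DIR, { recursive: true });
  const json = JSON.stringify(scenarios, null, 2);
  writeFileSync(path.join(RESULTS_DIR, `${runId}.json`), json);
  writeFileSync(path.join(RESULTS_DIR, "latest.json"), json);
}
```

Because both files are plain JSON at stable paths, any downstream consumer (CI step, jq one-liner, dashboard) can read them without importing anything from the tool.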
Can I use my real Chrome session for logins, instead of a clean profile?
Yes. Pass `--extension` to the CLI or set `extension: true` on the MCP tool. The launcher at /Users/matthewdi/assrt-mcp/src/core/browser.ts lines 299-306 skips the profile directory and passes `--extension` through to Playwright MCP, which then attaches to an already-running Chrome via the Playwright MCP Chrome extension. The first time you run in this mode, Chrome shows an approval dialog; the token is saved to `~/.assrt/extension-token` and reused on subsequent runs (resolveExtensionToken in the same file). This is the only way to test behind an authenticated session that uses SSO, 2FA, or a corporate identity provider, because the real browser already carries all the cookies and the fingerprint that the auth flow expects.
What happens when the page is still loading after an action?
The agent calls `wait_for_stable` (agent.ts lines 186-195, handler at lines 956-1009). That tool injects a MutationObserver via `browser_evaluate`, polls `window.__assrt_mutations` every 500ms, and returns once the count has not changed for the configured stable window (default 2 seconds, max 10). This replaces the older pattern of `wait(ms)` with a magic number. It matters for async content: streaming AI responses, skeleton screens, and search results that populate after an XHR. The MutationObserver watches `document.body` with `childList: true, subtree: true, characterData: true`, and cleans itself up when the waiter exits. After it returns, the agent calls `browser_snapshot` again to see the fully-loaded tree.
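The polling side of that pattern can be sketched as below, with `evaluate` standing in for `browser_evaluate`; the intervals mirror the description, and the real handler in agent.ts differs in detail:

```typescript
// Poll a page-side mutation counter until it stops changing for `stableMs`,
// giving up after `maxMs`. The observer-install expression mirrors the one
// described above; exact wording in agent.ts may differ.
const installObserver = `
  window.__assrt_mutations = 0;
  new MutationObserver(() => { window.__assrt_mutations++; })
    .observe(document.body, { childList: true, subtree: true, characterData: true });
`;

async function waitForStable(
  evaluate: (expr: string) => Promise<unknown>,
  stableMs = 2000,
  maxMs = 10000,
  pollMs = 500,
): Promise<boolean> {
  await evaluate(installObserver);
  let last = -1;
  let stableSince = Date.now();
  const start = Date.now();
  while (Date.now() - start < maxMs) {
    const count = Number(await evaluate("window.__assrt_mutations"));
    if (count !== last) {
      last = count;                 // still mutating: reset the stability clock
      stableSince = Date.now();
    } else if (Date.now() - stableSince >= stableMs) {
      return true;                  // quiet for the whole stable window
    }
    await new Promise(r => setTimeout(r, pollMs));
  }
  return false;                     // never stabilised within maxMs
}
```

The boolean return is the useful part: a `false` tells the agent the page is still churning, which is itself a signal worth recording in the step log.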
What about OTP or magic-link flows that every classic Playwright tutorial skips?
The agent has two dedicated tools. `create_temp_email` spins up a disposable inbox (DisposableEmail.create in core/email.ts) and returns the address for the signup form. `wait_for_verification_code` polls that inbox for up to 120 seconds and returns the code plus sender metadata. The third piece, at agent.ts lines 233-236, handles the split-input OTP pattern where each digit goes into its own `<input maxlength="1">`. The agent is told to call `browser_evaluate` with a ClipboardEvent + DataTransfer paste against the parent element, not to type each digit one by one; typing into the first field tends to move focus unpredictably on React apps. That one exact expression is baked into the system prompt so every run handles OTP without needing a per-app workaround.
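The paste trick can be sketched as the expression the agent would hand to `browser_evaluate`; the exact expression baked into the system prompt may differ:

```typescript
// Build a browser-side expression that pastes the whole OTP into the parent
// of the split inputs via ClipboardEvent + DataTransfer, instead of typing
// digit by digit. Sketch only; agent.ts bakes its own exact expression.
function buildOtpPasteExpression(code: string): string {
  const safe = JSON.stringify(code); // escape the code into a JS string literal
  return `(el) => {
    const dt = new DataTransfer();
    dt.setData("text/plain", ${safe});
    el.dispatchEvent(new ClipboardEvent("paste", {
      clipboardData: dt, bubbles: true, cancelable: true,
    }));
  }`;
}
```

Dispatching one paste event on the parent lets the app's own paste handler distribute the digits, which sidesteps the focus-jumping that per-digit typing triggers on React apps.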
How is this different from recording a test with Codegen?
Codegen records a human's clicks into a spec file full of locator strings, which is exactly what this approach moves away from. The artifact a recorder produces is read-only in practice because renaming a class breaks every selector downstream. The artifact Assrt produces is a plain Markdown plan plus a JSON report; the plan is authored in terms of intent ("Click the Sign up button"), not in terms of the DOM that existed on the day you recorded. When the DOM changes on the next deploy, the same plan still runs, because the agent re-reads the accessibility tree per step. The upside of Codegen is speed of initial authoring; the downside is the maintenance tax. The upside of the agent loop is that maintenance flattens, because intent text rots more slowly than selector strings.
Is this actually open source, or just an open-source-ish wrapper around a paid cloud?
Both the Playwright MCP server (Microsoft) and Assrt (MIT, on GitHub at github.com/m13v/assrt-mcp) run locally with no network dependency beyond the LLM provider. You can read every file referenced in this guide on disk. The agent loop, the tool schemas, and the file layout at `/tmp/assrt/` are all local. A hosted runner at app.assrt.ai exists for sharing results in a browser, but it is optional; `assrt run --url ... --plan "..."` with the `--json` flag produces identical reports locally. There is no call home during a run. The npm package itself lists its dependencies in plain view: `@anthropic-ai/sdk`, `@google/genai`, `@modelcontextprotocol/sdk`, and `@playwright/mcp`.