A Playwright AI test agent is a while loop. Here is the loop.
Direct answer: a Playwright AI test agent is a function-calling chat session whose tools are a real browser. The model reads a Markdown scenario, asks for an accessibility snapshot, decides which [ref=eN] to click, executes through @playwright/mcp, and re-snapshots after every action. It stops when the model emits stop_reason: end_turn. The whole thing is roughly fifty lines of TypeScript. The rest of this page traces it.
Every file path on this page is real. The agent source is open at github.com/assrt-ai/assrt-mcp. If a claim here looks specific, it is because the line number is checkable.
The first thing to get right: the agent reads text, not pixels
Most descriptions of these agents draw a screenshot flowing into a vision model. That is wrong for almost every Playwright-based agent shipping today, and it is wrong for this one. The model's primary input on every iteration is text: a serialized accessibility tree returned by the snapshot tool. Each interactable element in that tree carries a stable [ref=eN] id, and the model picks targets by ref. Pixels are along for the ride for the human watching the run video; screenshots are gated to visual actions only.
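For concreteness, here is the shape of such a tree. This is an illustrative sketch, not output copied from a real run; the exact serialization comes from @playwright/mcp's snapshot, but the parts that matter are the role, the accessible name, and the [ref=eN] id on each interactable node:

    - heading "Welcome back" [level=1]
    - textbox "Email" [ref=e4]
    - textbox "Password" [ref=e5]
    - button "Sign in" [ref=e6]
    - link "Forgot password?" [ref=e7]

When the model wants to log in, it asks for a click on ref e6. It never reasons about coordinates.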
The mental model swap
A vision model squints at a screenshot, finds the Sign Up button by its blue color, fires a click at coordinates (412, 308), waits a second, takes another screenshot, and repeats. That agent is doing OCR. The hallmarks of that approach:
- Pixel-based locators
- Coordinate-based clicks
- Vision model on every step
- Brittle to theme, font, viewport changes
“The screenshot tool exists, but it is gated. Look at agent.ts:1024. Screenshots only fire after navigate, click, type_text, select_option, scroll, and press_key. snapshot, wait, assert, evaluate, http_request all skip it. The agent's decisions are made from text.”
agent.ts:1024 in the open Assrt source
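The gate itself is a few lines of control flow. A minimal sketch of that check, with hypothetical names; the real version lives around agent.ts:1024:

    // Hedged sketch: the tool names mirror the list above; the helper name
    // and surrounding structure are illustrative, not the real code.
    const VISUAL_ACTIONS = new Set([
      "navigate", "click", "type_text", "select_option", "scroll", "press_key",
    ]);

    async function maybeCaptureScreenshot(
      toolName: string,
      browser: { screenshot(): Promise<Buffer> }
    ): Promise<Buffer | null> {
      // snapshot, wait, assert, evaluate, http_request, and the rest skip this:
      // the model's next decision is made from the accessibility tree alone.
      if (!VISUAL_ACTIONS.has(toolName)) return null;
      return browser.screenshot(); // only for the human watching the run video
    }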
One full iteration of the loop
Open agent.ts and look at the function called runScenario, starting around line 633. A scenario starts with one snapshot and one screenshot of the initial page. They are appended to the first user message. Then a while loop begins. Each iteration calls the model, executes the returned tool calls, and appends results. The loop body is the spine of the whole product.
One iteration of the agent loop
A few things on this diagram are worth pinning down. The agent loop calls anthropic.messages.create with the system prompt, the tool list, and the running messages array (line 714). The response comes back as content blocks. Text blocks fire a reasoning event for the live UI; tool_use blocks become the tool calls that get executed against the browser. Every tool_use gets a tool_result back on the next user turn (lines 1031-1036). That pairing is non-negotiable: the API rejects an assistant turn that contains tool_use blocks unless the next user turn carries a matching tool_result for each one.
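To make the spine concrete, here is a condensed sketch of one iteration in TypeScript against the Anthropic SDK. It is not the actual runScenario: executeTool and the surrounding wiring are stand-ins, the model id is illustrative, and retries, events, and error handling are stripped out.

    import Anthropic from "@anthropic-ai/sdk";

    // Hedged sketch of the loop body; names other than the SDK's are illustrative.
    async function runLoop(
      anthropic: Anthropic,
      system: string,
      tools: Anthropic.Tool[],
      messages: Anthropic.MessageParam[],
      executeTool: (name: string, input: unknown) => Promise<string>
    ) {
      while (true) {
        const response = await anthropic.messages.create({
          model: "claude-haiku-4-5", max_tokens: 4096, system, tools, messages,
        });
        messages.push({ role: "assistant", content: response.content });

        // The model decided it is done: no tool calls left to execute.
        if (response.stop_reason === "end_turn") break;

        // Every tool_use gets exactly one matching tool_result on the next user turn.
        const results: Anthropic.ToolResultBlockParam[] = [];
        for (const block of response.content) {
          if (block.type !== "tool_use") continue; // text blocks become reasoning events
          const output = await executeTool(block.name, block.input);
          results.push({ type: "tool_result", tool_use_id: block.id, content: output });
        }
        messages.push({ role: "user", content: results });
      }
    }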
The test file is Markdown, not TypeScript
A Playwright codegen output is a TypeScript file with hardcoded locators. A Playwright AI test agent runs against a Markdown plan with #Case blocks. Same browser engine underneath. Different surface. The agent does not parse selectors out of the Markdown; the model does, on each step, against a fresh snapshot.
Codegen vs. agent input
// Generated by Playwright codegen.
// Locators are baked in. UI churn breaks them.
import { test, expect } from "@playwright/test";

test("a new user signs up", async ({ page }) => {
  await page.goto("https://example.com/");
  await page.getByRole("link", { name: "Sign up" }).click();
  await page.locator("input[name='email']").fill("test+a9fj2@example.com");
  await page.getByRole("button", { name: "Continue" }).click();
  await page.locator("input[name='code']").fill("394721");
  await page.getByRole("button", { name: "Verify" }).click();
  await expect(page.getByRole("heading", { name: "Welcome" })).toBeVisible();
});

The Markdown file lives on your disk. You can grep it, diff it, and commit it. There is no proprietary YAML schema and no SaaS dashboard where it secretly lives. The agent parses it with one regex (parseScenarios at agent.ts:620-631), splits it into an array of name/steps pairs, and runs each in order with the same browser session.
What is actually in the box
A Playwright AI test agent is six small pieces. None of them is mysterious. Pull any one out and the system breaks; together they fit into a few hundred lines of TypeScript. If you are evaluating one of these tools, this is the inventory to ask about.
The system prompt
56 lines at agent.ts:198-254. Tells the model to call snapshot first, use [ref=eN] ids returned in that tree, assert on observable text or URLs only, call complete_scenario when done, and fall back to snapshot on every failure. The OTP paste expression is also pinned here (lines 234-236) so the model never has to improvise it.
The 15 tools
Defined as Anthropic.Tool objects at agent.ts:16-196. Browser primitives, QA primitives, email primitives. Each tool is also mirrored as a Gemini function declaration at lines 277-301.
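What one of those objects looks like, sketched rather than copied from agent.ts; the description text and the exact schema are illustrative:

    import Anthropic from "@anthropic-ai/sdk";

    // Hedged sketch of a single tool definition in the Anthropic.Tool shape.
    const clickTool: Anthropic.Tool = {
      name: "click",
      description: "Click an element identified by its [ref=eN] id from the latest snapshot.",
      input_schema: {
        type: "object",
        properties: {
          ref: { type: "string", description: "Element ref from the accessibility tree, e.g. e12" },
        },
        required: ["ref"],
      },
    };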
The accessibility tree
Returned by snapshot at agent.ts:778-783. Serialized DOM with [ref=eN] ids on every interactable node. The tree, not the screenshot, is what the model reasons over. Screenshots are gated to visual actions only (line 1024).
The while loop
agent.ts:692-747. Calls the model, executes the returned tool_use blocks against the browser, appends a tool_result for every tool_use, loops. Exits when the model emits stop_reason: end_turn (line 723) or the API hits a fatal tool_use error (line 735).
The browser
McpBrowserManager at src/core/browser.ts is a thin wrapper around @playwright/mcp over stdio. Real Chromium, real cookies, real network. The agent layer never reaches into Playwright directly; it routes everything through the same handful of manager methods (navigate, snapshot, click, type, scroll, evaluate).
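A sketch of that surface as an interface. The method names are paraphrased from the list above, not lifted from browser.ts:

    // Hedged sketch: the agent-facing surface of the browser manager.
    // Everything below is ultimately forwarded to @playwright/mcp over stdio.
    interface BrowserManagerLike {
      navigate(url: string): Promise<void>;
      snapshot(): Promise<string>;                    // accessibility tree with [ref=eN] ids
      click(ref: string): Promise<void>;
      type(ref: string, text: string): Promise<void>;
      scroll(direction: "up" | "down"): Promise<void>;
      evaluate(expression: string): Promise<unknown>; // raw JS in the page, used sparingly
    }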
The result file
On scenario_complete the agent emits a structured TestReport with passed/failed counts, a per-step log, every assertion, and the run duration. Written as JSON to /tmp/assrt/runs/latest.json so any subsequent process (CI, an IDE agent, assrt_diagnose) can read it without scraping a dashboard.
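The field names below are a sketch of that shape, inferred from the prose above rather than copied from the source:

    // Hedged sketch of the report written to /tmp/assrt/runs/latest.json.
    interface TestReportSketch {
      passed: number;
      failed: number;
      durationMs: number;
      scenarios: Array<{
        name: string;
        passed: boolean;
        steps: Array<{ tool: string; input: unknown; ok: boolean }>;
        assertions: Array<{ description: string; passed: boolean; evidence: string }>;
      }>;
    }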
When does the loop stop, exactly
Three conditions. First and most common: the model decides it is done. After it calls complete_scenario, the next turn has nothing to do, the response comes back with no tool_use blocks, and response.stop_reason === "end_turn" (agent.ts:723). The agent flips completed = true and exits.
Second: a fatal API error. If the response carries a tool_use or invalid_request error, the agent emits a reasoning message, marks the scenario failed, and exits (lines 735-742). It does not retry these, because they indicate the conversation got into a state the API cannot recover from.
Third: a retryable API error. A 529, 429, or 503, or any error matching /overloaded|rate/i, is retried up to four attempts with backoff (lines 728-734). Anything else throws and bubbles up to the scenario crash handler at lines 478-487, which converts it into a failed ScenarioResult and continues to the next #Case. There is no hard step counter; the constants at the top of the file are MAX_STEPS_PER_SCENARIO = Infinity and MAX_CONVERSATION_TURNS = Infinity. The model is trusted to converge or to call complete_scenario.
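Condensed, the error branch looks roughly like this. The classification strings come from the paragraphs above; the helper names and the backoff curve are illustrative:

    // Hedged sketch of the three exit/retry paths described above.
    const RETRYABLE = /overloaded|rate/i;
    const MAX_ATTEMPTS = 4;

    function classifyApiError(err: { status?: number; message: string }) {
      if ([429, 503, 529].includes(err.status ?? 0) || RETRYABLE.test(err.message)) return "retry";
      if (/tool_use|invalid_request/.test(err.message)) return "fatal"; // conversation unrecoverable
      return "crash";                                                   // bubbles to the scenario handler
    }

    async function withBackoff<T>(fn: () => Promise<T>): Promise<T> {
      for (let attempt = 1; ; attempt++) {
        try { return await fn(); }
        catch (err: any) {
          if (classifyApiError(err) !== "retry" || attempt >= MAX_ATTEMPTS) throw err;
          await new Promise((r) => setTimeout(r, attempt * 1000)); // backoff curve is illustrative
        }
      }
    }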
Three things this shape implies for your test stack
One: your tests should be observable text. Headings, button labels, visible error messages. The agent asserts on what it can read in the accessibility tree. CSS classes, layout, and pixel-perfect comparisons are out of scope by design. If you have a page where the only way to know it works is to compare a screenshot to a baseline, an agent is the wrong tool. Visual regression is a separate pass.
Two: every step is a roundtrip to the LLM. A 12-step scenario is 12 model calls plus the tool_use re-injection on each. That costs tokens and time. The Assrt agent uses Claude Haiku 4.5 by default, which keeps each turn cheap and fast, but the shape is fundamentally chattier than a compiled spec.ts file. If your CI runs hundreds of cases on every commit, an agent is good for the long-tail flakiness-prone scenarios; a hand-written spec is still the right answer for the smoke pass.
Three: you can read every byte that flows. Open agent.ts and follow the messages array. The system prompt is right there. The tool list is right there. The accessibility tree is what the model sees on every snapshot. There is no telemetry pipe to a vendor backend, no opaque selector-healing service, no proprietary format hiding the test. If you do not like the prompt, you fork the file and ship a new one. If you want to swap Claude for Gemini, the Gemini branch is right there at lines 700-712. This is the shape of an open agent. Closed AI QA platforms in the same category sell the same loop with a different veneer; the difference shows up the day you want to leave.
Want to see this loop run against your app?
Twenty minutes. We point Assrt at your URL, run a generated plan, and you watch the messages array in real time. If the agent can finish your login-gated flow, you have a baseline you can extend.
Common questions
What is a Playwright AI test agent, in one sentence?
A function-calling chat session whose tools are a real Chromium browser. The model receives a plain-English scenario, calls snapshot to read the page as an accessibility tree, calls click or type_text against [ref=eN] ids returned from that tree, appends the new snapshot back to its message array, and continues until it emits stop_reason: end_turn. In the open Assrt agent, the entire loop fits in roughly fifty lines of agent.ts (lines 692-747). The browser part is a thin wrapper over the official @playwright/mcp server, so the agent inherits real Playwright primitives without re-implementing them.
How is this different from Playwright codegen?
Playwright codegen records your manual clicks and emits a TypeScript .spec.ts file with hardcoded locators. Maintenance is on you forever; when the DOM changes, the locator breaks, the test fails, you fix it. A Playwright AI test agent has no locators in the test file. The test file is Markdown like '#Case 1: a new user signs up' and the agent re-discovers the actual elements from a fresh accessibility snapshot on every step. There is no compiled selector to break. The Assrt source lives at src/core/agent.ts; the system prompt explicitly tells the model 'use snapshot refs (e.g. ref=e5) for reliable element targeting' (line 679 of the user-prompt builder).
Does the agent see pixels or DOM?
Both, but not equally. The model's primary input is text. After every action, the agent calls snapshot, which returns the page as an accessibility tree: a serialized YAML-style tree where every interactable node has a stable [ref=eN] id. That tree is what the model reasons over. The screenshot tool exists too, but it is intentionally gated. Look at agent.ts:1024: the agent only emits a screenshot AFTER visual actions like navigate, click, type_text, scroll, and press_key. snapshot, wait, assert, evaluate, http_request and the rest skip the screenshot entirely. Screenshots are for the human watching the run video; the agent's decisions are made from the accessibility text. This is the single biggest mental shift versus thinking about it as 'computer use'.
What stops the loop?
The model itself. The Anthropic branch of the loop checks response.stop_reason after each turn (agent.ts:723). When the model emits stop_reason: end_turn with no tool_use blocks, the agent marks completed = true and exits the while loop. In practice the model does this after it calls the complete_scenario tool with its summary and pass/fail, because complete_scenario is a terminal tool: the agent records the result, and on the next turn the model has nothing to do. Failure modes: an API error tagged tool_use or invalid_request short-circuits the loop with scenarioPassed = false (agent.ts:735-742). 529, 429, 503, or 'overloaded' errors retry up to four attempts with backoff (agent.ts:728-734). The loop bound itself is set to Infinity at the top of the file because in practice the model converges on its own; the stop conditions that matter are the API and the user, not a counter.
What tools does the agent actually have?
Fifteen, defined as Anthropic.Tool objects at the top of agent.ts (lines 16-196). Browser primitives: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate. QA-specific: assert (records a boolean with description and evidence), complete_scenario (terminal). Email and external IO: create_temp_email (opens a mail.tm inbox), wait_for_verification_code, check_email_inbox, http_request (for API verification of webhooks and external state). Stability and feedback: wait_for_stable (injects a MutationObserver, polls every 500ms, returns when no DOM mutation for N seconds, default 2), suggest_improvement (the agent reports UX bugs it noticed during the run). Each tool is also mirrored as a Gemini function declaration at agent.ts:277-301 so the same agent works against gemini-3.1-pro-preview by swapping a flag.
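The wait_for_stable mechanism is worth a sketch because it is the piece most people re-invent badly. The version below uses a raw Playwright Page for illustration; in the real agent the evaluation goes through the MCP browser manager, and the window property name is made up:

    import type { Page } from "playwright";

    // Hedged sketch of wait_for_stable: record the last DOM mutation time,
    // then poll every 500ms until the page has been quiet for stableMs.
    async function waitForStable(page: Page, stableMs = 2000, timeoutMs = 30_000) {
      await page.evaluate(() => {
        const w = window as any;
        w.__lastMutation = Date.now(); // illustrative global, not the real one
        new MutationObserver(() => { w.__lastMutation = Date.now(); })
          .observe(document.documentElement, { childList: true, subtree: true, attributes: true });
      });
      const deadline = Date.now() + timeoutMs;
      while (Date.now() < deadline) {
        const quietFor = await page.evaluate(() => Date.now() - (window as any).__lastMutation);
        if (quietFor >= stableMs) return true;
        await new Promise((r) => setTimeout(r, 500));
      }
      return false; // still mutating; let the model decide what to do with that
    }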
How does the agent avoid 'tool_use without tool_result' blowups?
Strict pairing. Every tool_use the model emits gets a tool_result appended to messages before the next turn. Look at agent.ts:1031-1036: regardless of whether the action succeeded, errored, or was unknown, exactly one ToolResultBlockParam goes back. On exception inside the switch, the catch at agent.ts:1012-1020 still produces a result string (with the page snapshot embedded for context) and feeds it back as a tool_result so the conversation stays valid. This matters because the Anthropic API rejects assistant turns that contain tool_use blocks without matching tool_result blocks on the next user turn; without that pairing the whole conversation 400s. It also helps the model: the error message includes the new accessibility tree, so the next turn can adapt instead of guessing.
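Sketched, the recovery path looks like this. The snapshot-embedding detail comes from the paragraph above; the function names are placeholders:

    import Anthropic from "@anthropic-ai/sdk";

    // Hedged sketch: even a failed action must produce exactly one tool_result,
    // and embedding a fresh accessibility tree lets the next turn adapt.
    async function toolResultFor(
      block: Anthropic.ToolUseBlock,
      execute: () => Promise<string>,
      snapshot: () => Promise<string>
    ): Promise<Anthropic.ToolResultBlockParam> {
      try {
        return { type: "tool_result", tool_use_id: block.id, content: await execute() };
      } catch (err: any) {
        const tree = await snapshot();
        return {
          type: "tool_result",
          tool_use_id: block.id,
          content: `Action failed: ${err?.message ?? err}\n\nCurrent page snapshot:\n${tree}`,
          is_error: true, // optional flag the API accepts on tool_result blocks
        };
      }
    }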
Can I read the system prompt?
Yes, all 56 lines of it. SYSTEM_PROMPT is a const at agent.ts:198-254. It tells the model six things: (1) it is an automated web testing agent, (2) call snapshot first to get [ref=eN] ids, (3) make assertions with the assert tool, (4) call complete_scenario when done, (5) on failure call snapshot to see what the page actually looks like and try a different ref, (6) for OTP inputs use the exact DataTransfer expression baked into lines 234-236 verbatim. There are no hidden instructions. This is one of the things that distinguishes an open agent from a closed SaaS: you can fork the prompt and ship a new version. The plan-generation prompt (the one that turns a URL into a #Case list) is in a separate file at src/mcp/server.ts:219-236, and is also auditable.
What does 'agent' mean here that 'codegen' does not?
Codegen is a transcription. The user does the test once, the tool writes down what happened, the test ships, and the loop closes. There is no live decision-making after generation. An agent loops at runtime. Every tool_use is the model deciding what to do based on the latest tool_result. If a modal appeared unexpectedly, the agent sees it on the next snapshot and dismisses it. If a button moved, the new accessibility tree still has it, just at a different ref. This is why agent runs are slower than compiled spec.ts files but more durable across UI churn. It is also why the Assrt agent does not ship a 'lock the locator' optimization; the whole point is that locators are re-derived per step.
Keep reading
Playwright e2e test agent: the four ugly patterns
OTP inputs, wedged dev servers, async DOM stability, and shared auth across cases. The four patches that sit between @playwright/mcp and a test that finishes.
AI Playwright test generator with an open prompt
The other half of the system. How three screenshots and an 18-line prompt turn a URL into a #Case Markdown plan you can grep, diff, and commit.
AI-powered agentic test execution: the 18-tool vocabulary
Each tool the agent can call, why it exists, and what fails when it does not. The vocabulary an LLM needs to drive a real browser to a useful end state.