Test architecture, picked apart
Accessibility tree vs screenshots for Playwright AI testing: which is the agent's source of truth?
Both inputs exist for a reason. Only one of them is the substrate the agent reasons over. The other is an artifact for humans. The line between them is not folklore: it is set in the code of every working Playwright AI agent, including ours.
Direct answer (verified 2026-05-04)
Use the accessibility tree as the agent's primary input. Screenshots are for humans and for visual regression, not for the agent's decisions. Concretely: an aria snapshot is 2 to 5 KB of structured YAML per step; a JPEG screenshot is roughly 500 KB; and screenshot-mode runs of Playwright MCP cost about 114,000 tokens per test versus about 27,000 tokens for tree mode (a ~4x reduction). Authoritative source: playwright.dev/docs/aria-snapshots.
Token figures from public benchmarks of Playwright MCP. The 4x ratio is what shows up on your bill at the end of the month.
Two paradigms, one runner
A Playwright AI agent has, in principle, two ways to perceive a page. It can read a structured representation of what is on the page, or it can look at the rendered pixels. These are not equivalent. They are not even the same kind of thing. One is a tree of facts, the other is an image to be interpreted. The choice of which one feeds the model on every step decides almost everything about how the agent behaves: its latency, its determinism, its cost, the quality of its failure logs.
Same login flow, two different substrates. Start with the screenshot-first version.
The agent receives a JPEG of the page. It runs a vision model to detect a button-shaped region with text that looks like "Sign in." It returns coordinates. The runner does a mouse click at (482, 311). When the test fails, the trail is a sequence of (x, y) clicks. Six months later nobody can read it.
- 500 KB image per step
- Vision model can misread lookalikes
- Failure log is coordinates, not semantics
- Renders differently across browsers and DPRs
What this looks like in code
The shape of the test loop is not subtle. With a screenshot-first agent, the model is the locator. With a tree-first agent, the accessibility tree is the locator and the model just picks which ref to act on. The vision-first loop comes first; its tree-first counterpart follows.
The agent loop, two ways
// Vision-first agent: input is a 500 KB JPEG.
// The model "sees" the page and decides what to click.
//
// What the model sees: pixels.
// What the test code says: nothing structural.
// What the failure log shows: (x: 482, y: 311), good luck.
const screenshot = await page.screenshot();
const action = await visionModel.decide({
image: screenshot,
goal: "sign in with the email and password",
});
await page.mouse.click(action.x, action.y);
The tree-first loop, sketched below, is what Assrt does, and what Microsoft's Playwright MCP does by default. The loop above is what vision-first tools do, and it is what most teams pattern-match to when they hear “AI testing.” The pattern match is wrong: AI testing does not require pixels.
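For contrast, here is a minimal sketch of the tree-first loop. page.locator().ariaSnapshot() and getByRole() are real Playwright APIs; languageModel.decide is an illustrative stand-in, like visionModel.decide above, and the shape of its return value is an assumption, not Assrt's actual interface.
// Tree-first agent: input is a few KB of structured YAML, not pixels.
// The model reads roles, names, and states and picks a target;
// the runner resolves it with a semantic locator.
//
// What the model sees: roles, names, states, refs.
// What the failure log shows: the role and name it acted on.
const tree = await page.locator("body").ariaSnapshot();
const target = await languageModel.decide({
  tree,
  goal: "sign in with the email and password",
}); // e.g. { name: "Sign in" } -- illustrative shape
await page.getByRole("button", { name: target.name }).click();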
The anchor fact: a JPEG quality of 50
The most direct evidence that the tree is the substrate and the screenshot is an artifact lives in the open-source Assrt code. In assrt-mcp/src/core/browser.ts, the screenshot tool is called like this:
async screenshot(): Promise<string | null> {
  const result = await this.callTool(
    "browser_take_screenshot",
    { type: "jpeg", quality: 50 },
  );
  // ...
}
That is a deliberate downgrade. If you were betting your test decisions on the screenshot, you would push the format to PNG and the quality to 95. You would never throw half the bytes away on purpose. Assrt does throw them away on purpose, because the screenshot is not how the agent decides what to click. It is a thumbnail for the run viewer, a frame on the failure overlay, something a human glances at while scrolling a CI report. Quality 50 is plenty for that.
“If screenshots were the agent's input, you would not save them as quality-50 JPEGs. The fact that we do is the cleanest signal that the accessibility tree is the substrate, not the pixels.”
assrt-mcp/src/core/browser.ts, line 601
The other anchor fact: a 120,000 character ceiling
The flip side of trusting the tree is dealing with its weight. A dense admin dashboard or a Wikipedia article serializes to an accessibility tree that will overflow any agent's context window. Assrt's answer is a hard cap, also in assrt-mcp/src/core/browser.ts:
/** Max characters for a snapshot before truncation
 * (roughly ~30k tokens). */
private static readonly SNAPSHOT_MAX_CHARS = 120_000;
When the cap is hit, the snapshot is truncated at the last clean newline and a marker is appended so the agent knows the view was partial. Practically this means tests should target the section of the page they care about (a dialog, a form, a table) rather than crawling everything. That is a useful constraint, not a limitation. Tests that try to operate on entire pages are usually testing the wrong thing.
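A minimal sketch of what the cap does in practice. The constant matches the repo; the helper name and marker text here are illustrative, not the exact implementation:
const SNAPSHOT_MAX_CHARS = 120_000; // roughly ~30k tokens

function truncateSnapshot(snapshot: string): string {
  if (snapshot.length <= SNAPSHOT_MAX_CHARS) return snapshot;

  // Cut at the last clean newline before the cap so no YAML node is split in half.
  const slice = snapshot.slice(0, SNAPSHOT_MAX_CHARS);
  const lastNewline = slice.lastIndexOf("\n");
  const truncated = lastNewline > 0 ? slice.slice(0, lastNewline) : slice;

  // Append a marker so the agent knows it is looking at a partial view.
  return truncated + "\n# [snapshot truncated: page exceeded snapshot limit]";
}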
The third anchor fact: screenshot polling, removed
Earlier versions of the agent polled a screenshot every 1.5 seconds while a scenario ran, to keep a live preview flowing into the run viewer. The current version does not. From assrt-mcp/src/core/agent.ts:
// Screenshot polling removed: screenshots are now
// captured only after visual actions (navigate, click,
// type, select, scroll, press_key) to avoid duplicate/
// redundant captures that waste tokens and time.
The agent now snapshots the accessibility tree before each decision, and grabs a screenshot only when something visible actually changed. Two consequences. The token budget on a long scenario drops sharply. And a human reading the run sees a frame for every meaningful state change, not 60 frames of a static loading spinner.
The agent loop, in five tools
Putting all three anchor facts together, here is the loop the agent actually runs. Notice that snapshot is called twice (once before, once after), and screenshot is called once, only after the click changed the page state.
snapshot, decide, act, re-snapshot, capture
- snapshot: Tree (~3 KB), with [ref=eN] markers
- decide: Model picks a ref to act on
- act: click, type, select via Playwright
- snapshot: Fresh tree; refs are stable per snapshot
- screenshot: JPEG q=50, after the visual change
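As a single sketch, here is the shape of one step. The agent object and its method signatures are assumptions standing in for the MCP tools Assrt actually calls; only the step order and the visual-action condition come from the sections above.
const VISUAL_ACTIONS = new Set([
  "navigate", "click", "type", "select", "scroll", "press_key",
]);

// Illustrative shape of one agent step; names and signatures are stand-ins.
async function runStep(agent: {
  snapshot(): Promise<string>;                 // accessibility tree as YAML
  decide(tree: string): Promise<{ tool: string; ref: string }>;
  act(tool: string, ref: string): Promise<void>;
  screenshot(): Promise<string | null>;        // JPEG q=50, for humans
}): Promise<void> {
  const tree = await agent.snapshot();         // 1. snapshot
  const action = await agent.decide(tree);     // 2. decide: the model picks a ref
  await agent.act(action.tool, action.ref);    // 3. act via a Playwright primitive
  await agent.snapshot();                      // 4. re-snapshot: fresh refs
  if (VISUAL_ACTIONS.has(action.tool)) {
    await agent.screenshot();                  // 5. capture, only after a visual change
  }
}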
Head to head
| Feature | Screenshot first (vision agents) | Tree first (Assrt, Playwright MCP default) |
|---|---|---|
| Bytes per step | ~500 KB JPEG (or larger PNG) | 2 to 5 KB structured text |
| Tokens per full test | ~114K (screenshot inline) | ~27K (Playwright MCP, tree mode) |
| Determinism | Non-deterministic, vision model guesses | Deterministic, refs resolve to nodes |
| Failure log readability | (x, y) coordinates, sometimes a crop | Role, name, and ref of the failed node |
| Survives class renames | Sometimes; vision is forgiving but inconsistent | Yes; role and name are independent of CSS |
| Works on canvas-only apps | Yes; pixels are the only signal | No; tree is empty for a single canvas |
| Doubles as accessibility check | No correlation | Yes; missing label = test failure |
| Cross-DPR / cross-browser stability | Different pixels, more flakes | Identical tree across viewports |
What lives in an aria snapshot
The point of the tree is not that it is small (although it is). The point is that everything in it is the kind of thing a test wants to target, and almost nothing in it is coupled to the markup that designers rewrite every sprint.
One YAML node per accessible element
- Role: button, textbox, link, heading, dialog, listitem
- Accessible name (the label a screen reader would speak)
- State: disabled, checked, expanded, selected, focused
- Hierarchy: a dialog contains its buttons, a list contains its items
- Element ref like [ref=e5] for the agent to refer back to
- Truncation marker if the tree exceeded 120,000 characters
What it does not contain: CSS classes, inline styles, image pixel data, JavaScript event handlers, or markup that exists purely for layout. That omission is the point. If a thing is not in the accessibility tree, your tests cannot target it, and your users with assistive tech cannot use it either. Both gaps are the same gap.
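In plain Playwright you can read that tree for a region and pin it as a structural assertion. locator.ariaSnapshot() and toMatchAriaSnapshot() are real Playwright APIs; the URL and the YAML shape below are placeholders, so adjust them to whatever your page actually exposes:
import { test, expect } from "@playwright/test";

test("login page exposes its semantics", async ({ page }) => {
  await page.goto("https://your-app.com/login"); // placeholder URL

  // The raw tree, if you want to log or inspect it.
  console.log(await page.locator("body").ariaSnapshot());

  // Structural assertion: roles and names, no CSS classes, no pixels.
  await expect(page.locator("body")).toMatchAriaSnapshot(`
    - heading "Sign in"
    - textbox "Email"
    - textbox "Password"
    - button "Sign in"
  `);
});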
When screenshots earn their keep
The honest position is not that screenshots are bad. They are the wrong substrate for behavioral test decisions, and the right substrate for a small set of cases.
Use a screenshot when the test is actually visual
- Visual regression diff: pixels are exactly what the test is checking
- Canvas-only apps (CAD, charts, games) where the tree is empty
- Shadow DOM nests so deep that semantic refs are not exposed
- Human review: a thumbnail on the run, an attachment on a failure
- Marketing layouts where a hover color or transition timing matters
Treat the two as different runners. Behavioral tests run against the accessibility tree. Visual regression runs against pixels and lives in its own pipeline. Mixing the substrates is how teams end up with flaky pixel diffs every time the design system tweaks a token, and missed regressions every time someone removes a button's name.
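In practice the split is literal: visual checks live in their own spec and their own CI job, using Playwright's built-in screenshot assertion. A minimal sketch; the URL, test name, and threshold are placeholders:
import { test, expect } from "@playwright/test";

// Pixels are the thing under test here, so pixels are the right substrate.
test("marketing hero renders as designed", async ({ page }) => {
  await page.goto("https://your-app.com/");
  await expect(page).toHaveScreenshot("hero.png", {
    maxDiffPixelRatio: 0.01, // tolerate minor anti-aliasing differences
  });
});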
What Assrt gives you, in one paragraph
Assrt is the open-source, MIT-licensed implementation of the tree-first pattern described above. It runs as a Node CLI plus a local MCP server. Under the hood it spawns @playwright/mcp as a child process, so every click, type, and snapshot is a real Playwright primitive. The agent reads the accessibility tree, picks a ref, calls real Playwright. The generated tests are standard Playwright spec files you commit to git. No rented YAML, no proprietary cloud editor, no per-seat fee.
npx @m13v/assrt discover https://your-app.com
For a deeper guide on why the tree is the right target, read the long form on accessibility tree web testing. If you want a side-by-side against a competing vision-first product, see Playwright vs AI browser automation.
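If you want to see what “spawns @playwright/mcp as a child process” looks like at the protocol level, here is a minimal sketch using the MCP TypeScript SDK directly. This is not Assrt's internal code; the tool names follow Playwright MCP's published tool set, and the URL is a placeholder:
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn @playwright/mcp as a child process and talk to it over stdio.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["@playwright/mcp@latest"],
});
const client = new Client({ name: "tree-first-demo", version: "0.0.1" });
await client.connect(transport);

// Navigate, then ask for the accessibility tree the agent reasons over.
await client.callTool({
  name: "browser_navigate",
  arguments: { url: "https://your-app.com" },
});
const snapshot = await client.callTool({ name: "browser_snapshot", arguments: {} });
console.log(snapshot);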
Bring the test that keeps flaking
Thirty minutes with the Assrt team. Walk through your worst flaky scenario and leave with a tree-first plan to express it.
Frequently asked questions
Which is the agent's source of truth, the accessibility tree or the screenshot?
For a Playwright AI agent that values determinism, the accessibility tree is the source of truth. The tree is a structured YAML representation of roles, names, and states with stable element refs (Playwright calls these aria snapshots). A screenshot is an artifact: it helps a human reviewing a run, and it backs up visual regression checks, but it is not what the agent reads to decide what to click. Assrt's own system prompt instructs the agent to call snapshot first, every time, before doing anything else (see assrt-mcp/src/core/agent.ts).
What is the actual byte and token cost difference?
An aria snapshot is typically 2 to 5 KB of structured text per interaction. A JPEG screenshot at modest quality is roughly 500 KB. End to end, public benchmarks of Playwright MCP put screenshot mode at about 114,000 tokens per test versus about 27,000 tokens for tree-only mode, a roughly 4x reduction (see currents.dev/posts/state-of-playwright-ai-ecosystem-in-2026). Tokens are not the only cost, but they correlate with latency and dollar spend on every model call.
Why does Assrt cap the tree at 120,000 characters?
In assrt-mcp/src/core/browser.ts the constant SNAPSHOT_MAX_CHARS is set to 120,000, which works out to about 30,000 LLM tokens. The cap exists because some pages (Wikipedia, sprawling admin dashboards, infinite scrolls) produce accessibility trees that would otherwise blow past an agent's context window. When the cap is hit, the snapshot is truncated at the last clean newline and a marker is appended so the agent knows the view was partial. Tests that target a section, a dialog, or a flow rarely come close.
Why is the screenshot saved at JPEG quality 50?
Look at assrt-mcp/src/core/browser.ts around line 601: the screenshot tool is called with { type: "jpeg", quality: 50 }. That is a deliberate downgrade. If screenshots were the agent's input, you would push them to PNG or quality 95. They are not. They exist as a low-fidelity artifact: a thumbnail in the run viewer, an attachment on a failure, a frame to debug a regression with. Quality 50 keeps file sizes small and uploads cheap without losing the only thing humans need from them, which is a passable preview.
Is screenshot polling still a thing?
In Assrt, no. Earlier versions polled screenshots every 1.5 seconds during a scenario. The current code in assrt-mcp/src/core/agent.ts has a comment that says exactly: "Screenshot polling removed: screenshots are now captured only after visual actions (navigate, click, type, select, scroll, press_key) to avoid duplicate/redundant captures that waste tokens and time." The agent now snapshots the tree before each decision and grabs a screenshot only when something visible actually changed.
When are screenshots actually the right input?
Three cases. Visual regression: you are checking pixels, so use pixels. Canvas-rendered apps (large data visualizations, in-browser CAD, some game engines) produce one canvas DOM node and an empty accessibility tree, so the only signal you have is what is on screen. Shadow DOM trees that are nested several layers deep can hide elements from the accessibility snapshot, depending on the component author's hygiene; in those cases a vision pass plus coordinate clicks is sometimes the only working path.
Do I lose anything by skipping screenshot-based testing entirely?
Yes, you lose pixel-level regression detection, and you lose the ability to test things that have no semantic identity (an animated background, a chart's exact bar heights, a print stylesheet). The honest framing is not "tree replaces screenshot" but "tree is the default substrate for behavioral tests, screenshots are the right tool for the things that are actually visual." Assrt is built around that split: the agent reasons over the tree; visual regression has its own runner and its own diff.
How does this compare to vision-first AI test tools?
Vision-first tools (think computer-vision agents that decide what to click from raw pixels) have two failure modes. They misread lookalike elements, and they leave a debug trail of (x, y) coordinates that nobody can interpret six months later. Driving Playwright through accessibility refs gives you LLM flexibility on top of a deterministic substrate. When a click fails, the failure log shows "clicked [ref=e14] labeled Submit, got no navigation," which a human Playwright author can read.
Does Playwright itself recommend tree over screenshots?
The Playwright docs at playwright.dev/docs/aria-snapshots describe aria snapshots as a YAML representation of the accessibility tree, intended for structural and accessibility regression. Microsoft's Playwright MCP server documentation states the design choice plainly: it operates on the accessibility tree, not screenshots, because AI models do not need vision capabilities to interact with pages and the resulting automation is deterministic and fast. Vision is available as an opt-in for the cases above, not the default.
Where does Assrt fit in the open-source picture?
Assrt is MIT licensed (npm: @m13v/assrt). It runs as a Node CLI (npx @m13v/assrt discover https://your-app.com) plus a local MCP server. Under the hood it spawns @playwright/mcp as a child process, so every click, type, and snapshot is a real Playwright primitive. Tests are emitted as standard Playwright spec files you commit to git. There is no rented YAML, no proprietary cloud editor, no per-seat fee.