Playwright deep dive

Playwright visual testing without baseline PNGs. A vision model watches every screen change, and the recording carries its own overlay.

Every guide on this teaches the same thing: toHaveScreenshot(), pixelmatch, a baseline PNG committed to tests/__snapshots__/, and a long checklist for taming flake (disable animations, freeze fonts, mask the ads). Assrt is built on real Playwright, but visual testing inside Assrt has no baselines and no pixelmatch. It captures a JPEG quality-50 screenshot after every visual action and pushes it to a vision model alongside the live accessibility tree, and if you record the run you get a WebM with an injected DOM overlay (red cursor, click ripple, keystroke toast, heartbeat dot) so a human can watch what the agent watched.

Matthew Diakonov · 12 min read · every claim sourced to a file + line in the assrt-mcp repo

  • JPEG q50 to a vision model, not pixelmatch against a baseline PNG
  • 1600x900 WebM with overlay you can watch, MIT-licensed source
  • No tests/__snapshots__/, no --update-snapshots, no per-platform diff
q=50

await this.callTool("browser_take_screenshot", { type: "jpeg", quality: 50 })

assrt-mcp/src/core/browser.ts:601

The four numbers that define how Assrt looks at your page

One viewport. One screenshot per visual action. One overlay layer that only the human sees. One vision model deciding what passed.
1600x900 · Viewport + WebM size

50 · JPEG quality on the wire

6 · Action types that trigger a screenshot

0 · Baseline PNGs in your repo

What the standard pattern is, and why this one is different

Every page about this opens the same way: install @playwright/test, write await expect(page).toHaveScreenshot(), commit the resulting PNG, and re-run. Subsequent runs use pixelmatch to count the differing pixels and fail the test if the count exceeds your threshold. The rest of the guide is a long stability checklist for keeping that count low: disable animations, freeze fonts, mask dynamic regions, set a deterministic timezone and locale. The official Playwright snapshots doc describes the mechanism plainly: the first run takes screenshots until two consecutive ones match, then saves the last one to disk. Every subsequent assertion is pixelmatch.

The Assrt agent does not call toHaveScreenshot() at all. The visual-judgement loop is built into the agent itself: a screenshot is captured after every visual action, the screenshot rides as an image block in the next conversation turn, and a vision model reads the rendered pixels against the English assertion in the #Case paragraph. No baseline file is ever written; no diff PNG is ever generated. The check is whether the model agrees that what is on screen matches what the paragraph said should be on screen.
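
Sketched as code, the loop is small. This is an illustrative shape, not the real assrt-mcp internals; the six action names and the screenshot call are the ones documented below.

// hypothetical sketch of the post-action screenshot loop
type CallTool = (name: string, args: Record<string, unknown>) => Promise<unknown>;

// the six visual action types (agent.ts:1024)
const VISUAL_ACTIONS = new Set([
  'navigate', 'click', 'type', 'select', 'scroll', 'press_key',
]);

async function dispatch(callTool: CallTool, action: string, args: Record<string, unknown>) {
  const result = await callTool(action, args);

  // read paths, waits, and out-of-band utilities skip the screenshot:
  // the model already holds the previous frame
  if (!VISUAL_ACTIONS.has(action)) return { result };

  // the exact call at browser.ts:601
  const screenshot = await callTool('browser_take_screenshot', {
    type: 'jpeg',
    quality: 50,
  });
  return { result, screenshot };
}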

The same check, written both ways

Same scenario: the upgrade modal on a pricing card looks right. The toHaveScreenshot version commits a PNG and trusts pixelmatch. The #Case version writes the assertion in English and trusts the model.

TOHAVESCREENSHOT vs #CASE

// the standard Playwright visual testing pattern
import { test, expect } from '@playwright/test';

test('pricing modal looks right', async ({ page }) => {
  await page.goto('https://your-app.com/pricing');
  await page.getByRole('button', { name: 'Upgrade' }).click();

  // first run: writes pricing-modal-chromium-darwin.png
  // every other run: pixelmatch against that PNG
  await expect(page.getByRole('dialog')).toHaveScreenshot(
    'pricing-modal.png',
    { maxDiffPixels: 100, animations: 'disabled' }
  );
});

// failure mode: a designer tweaks padding by 1px,
// fonts subpixel-render differently in CI, an A/B
// banner pushes the modal down 4px. pixelmatch fails.
// you re-run with --update-snapshots and try to remember
// what the dialog was supposed to look like.
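
// the #Case version of the same check: a sketch following the
// #Case format shown in the FAQ below; the exact wording is yours

#Case 1: the pricing modal looks right. Navigate to /pricing.
Click "Upgrade" inside the Pro card. Verify a modal appears
centered on the page, the title reads "Upgrade to Pro", the price
reads $39/mo, and the "Pay" button is enabled.

// no PNG written, no pixel threshold tuned: the vision model
// judges the screenshot against that paragraph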

A single click, frame by frame

The most important loop in Assrt's visual layer is short. After any of six action types, capture a JPEG, push it as an image block, let the model decide. Watch it run.

THE POST-ACTION SCREENSHOT LOOP

Step 1: agent dispatches a visual action. click({ element: 'Upgrade button', ref: 'e21' }) goes through @playwright/mcp into real Chromium.

Step 2: the click lands and mutates the rendered page.

Step 3: browser_take_screenshot captures a JPEG at quality 50.

Step 4: the JPEG rides into the next model turn as an image block, alongside the latest accessibility-tree YAML.

Step 5: the model judges the pixels against the #Case assertion and picks the next tool call.

Why JPEG quality 50, of all the numbers

The screenshot is not for archive. It is for one model turn, billed as image-block tokens. Quality 50 is the sweet spot where layout, color, and copy remain perceptually intact while the payload collapses to a fraction of a lossless PNG at the same dimensions. The model never asked to see the page at print resolution; it asked “is the modal centered, is the button enabled, is the price $39?”

Per-screenshot wire size, picked deliberately
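
You can measure that wire size yourself with plain Playwright; page.screenshot() supports both encodings. A minimal sketch (the URL is a placeholder; the exact ratio depends on the page):

import { chromium } from 'playwright';

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage({ viewport: { width: 1600, height: 900 } });
  await page.goto('https://your-app.com/pricing'); // placeholder

  // same frame, two encodings
  const png = await page.screenshot({ type: 'png' });
  const jpeg = await page.screenshot({ type: 'jpeg', quality: 50 });

  console.log(`png     : ${png.length} bytes`);
  console.log(`jpeg q50: ${jpeg.length} bytes`);
  console.log(`ratio   : ${(png.length / jpeg.length).toFixed(1)}x smaller`);

  await browser.close();
})();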

The two visual layers Assrt produces

One layer is for the model. One layer is for the human reviewing the run. They are deliberately not the same image. The model sees the clean page so it can judge the real UI; the human sees the page plus an injected overlay so the recording is watchable instead of just “a screen with no cursor.”

THE MODEL'S SCREENSHOT vs THE HUMAN'S WEBM

A clean JPEG of the rendered page at viewport 1600x900, no cursor, no overlay. The model judges the actual UI exactly as a real user would see it, with no agent-introduced visual noise.

  • browser.ts:601 calls browser_take_screenshot directly
  • Captures the page snapshot, not the screencast layer
  • Overlay z-index 2147483647 is invisible to this path
  • Lossy JPEG q50 keeps token cost predictable

The injected overlay, broken out

The overlay is not a video editor effect added after the fact. It is real DOM, declared in TypeScript at browser.ts:33-97, evaluated into the page on first contact, and rendered by the same browser that renders your app. CDP screencast picks up those frames and writes them into the WebM. Each piece has a job.

Red cursor dot, 20px, rgba(239,68,68,0.85)

Position-tracked across navigations (cursorX/Y at browser.ts:128-129 default to 640/400). 2px white border, 8px red shadow; left/top transition over 300ms with ease.

Click ripple, scales 0.5 → 2.0

Same red as the cursor (rgba(239,68,68,0.6) border). Triggered by __pias_showClick(x,y). Ripple expands over 50ms, fades opacity to 0. The frame shows a clear visual confirmation of which element was clicked.

Green monospace keystroke toast

Bottom of viewport, translateX(-50%), bg rgba(0,0,0,0.85), color #22c55e, font-family monospace 14px. Each `type` call shows the typed text for 2500ms then fades.

6px green heartbeat dot, bottom-right

rgba(34,197,94,0.6). The reason it exists: the Web Animations API runs `[opacity 0.2 → 0.8, scale 0.8 → 1.2]` on an 800ms infinite-alternate loop. This forces the browser compositor to keep pushing frames during waits, so the WebM does not skip when nothing else is moving.

All on z-index 2147483647

The maximum 32-bit signed integer. Nothing on your page can stack above it. Overlay survives any modal, any z-stack, any portal.

Injected once, gated by __pias_cursor_injected

browser.ts:34: `if (!window.__pias_cursor_injected) { window.__pias_cursor_injected = true; ... }`. Re-injection on the same page is a no-op; SPA route changes preserve it because it lives in the document body.

Read the overlay yourself

One file. One injected script. The whole “visible interaction” story fits on one screen.

assrt-mcp/src/core/browser.ts
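
If you would rather not open the repo, here is a condensed sketch of the script's shape. The guard flag, colors, sizes, z-index, and animation parameters are the ones documented on this page; everything else is abbreviated, and the real CURSOR_INJECT_SCRIPT at browser.ts:33-97 is the source of truth.

// condensed sketch, not the real browser.ts source
const CURSOR_INJECT_SCRIPT = `
  if (!window.__pias_cursor_injected) {
    window.__pias_cursor_injected = true;

    // 20px red cursor dot on the max int32 z-index,
    // so no app modal or portal can stack above it
    const cursor = document.createElement('div');
    cursor.style.cssText =
      'position:fixed;width:20px;height:20px;border-radius:50%;' +
      'background:rgba(239,68,68,0.85);border:2px solid #fff;' +
      'box-shadow:0 0 8px rgba(239,68,68,0.8);pointer-events:none;' +
      'z-index:2147483647;left:640px;top:400px;' +
      'transition:left 300ms ease,top 300ms ease;';
    document.body.appendChild(cursor);

    // 6px green heartbeat dot: an infinite-alternate Web Animations
    // loop that keeps the compositor emitting frames during idle waits
    // (the exact corner offset here is assumed)
    const beat = document.createElement('div');
    beat.style.cssText =
      'position:fixed;right:10px;bottom:10px;width:6px;height:6px;' +
      'border-radius:50%;background:rgba(34,197,94,0.6);' +
      'pointer-events:none;z-index:2147483647;';
    document.body.appendChild(beat);
    beat.animate(
      [
        { opacity: 0.2, transform: 'scale(0.8)' },
        { opacity: 0.8, transform: 'scale(1.2)' },
      ],
      { duration: 800, iterations: Infinity, direction: 'alternate' }
    );
  }
`;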

And the screenshot loop, also one screen

The post-action allowlist sits inside the agent's tool-dispatch loop. Six action names trigger a fresh JPEG; eleven names suppress it. The discriminator is whether the action plausibly mutated the rendered scene.

assrt-mcp/src/core/agent.ts
50 · JPEG quality (browser.ts:601)
1600px · Recording width, also viewport (browser.ts:628)
900px · Recording height (browser.ts:628)
2147483647 · Overlay z-index (browser.ts, max int32)

Which actions trigger a screenshot, and which do not

The discriminator is “could the rendered pixels plausibly have changed.” Tool calls that are pure read paths (snapshot, evaluate), pure waits, or out-of-band utilities (email check, HTTP request) do not consume a screenshot, because the model already has the previous one and re-reading it would burn tokens for no information.

Triggers a fresh JPEG

  • navigate (URL change → new page DOM)
  • click (the most common visual mutation)
  • type (input/textarea contents change)
  • select (option/select state)
  • scroll (viewport changes, lazy content loads)
  • press_key (Enter, Escape, arrow nav)

Suppressed (no screenshot)

  • snapshot, wait, wait_for_stable, assert, complete_scenario
  • create_temp_email, wait_for_verification_code, check_email_inbox
  • screenshot (manual), evaluate, http_request
6 actions · trigger a post-action screenshot (agent.ts:1024)
11 actions · explicitly suppressed from screenshotting
800ms · heartbeat compositor pulse interval
2500ms · keystroke toast fade-out timer

The pipeline, end to end

Six steps from the moment a click lands to the moment the model decides what to do next. None of them touch a baseline PNG. None of them write a file unless you opted into video.

Click to verdict

1

Real Chromium, real viewport

Playwright MCP launches `npx @playwright/mcp/cli.js --viewport-size 1600x900` in a persistent profile at ~/.assrt/browser-profile (browser.ts:296). Same Chromium binary as Playwright Test.

2

Cursor + overlay script injected on first navigate

browser.ts:33-97 declares CURSOR_INJECT_SCRIPT. It gets evaluated into the page so the heartbeat, cursor, ripple, and toast elements live in document.body. They are visible to the screencast layer that the WebM captures, but not to browser_take_screenshot.

3

Agent dispatches a visual action

click, type, navigate, scroll, select, press_key. Each one mutates the rendered page. The agent then checks against the allowlist at agent.ts:1024 and decides whether to take a fresh screenshot.

4

browser_take_screenshot at JPEG quality 50

browser.ts:601: `await this.callTool("browser_take_screenshot", { type: "jpeg", quality: 50 })`. Returns inline base64 in normal mode, or a file path that gets re-read in --output-dir mode.

5

Image block + AX tree to the model

agent.ts:1037 packs the JPEG as `{ type: "image", source: { type: "base64", media_type: "image/jpeg", data: ... } }` for Anthropic, or `inlineData` for Gemini at agent.ts:684. The latest accessibility-tree YAML rides alongside, truncated to 3000 chars.

6

Model reads pixels + DOM, judges, decides next action

Either it asserts what it expected (English `#Case` paragraph), continues the scenario, or calls complete_scenario. No baseline PNG was consulted at any point. The screenshot is discarded after this turn.
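
Steps 4 and 5 in code shape. A sketch of the two provider payloads: the property names are the ones quoted above from agent.ts:1037 and agent.ts:684, while screenshotB64 and axTreeYaml are illustrative variables.

function visionTurnBlocks(screenshotB64: string, axTreeYaml: string) {
  const axTree = axTreeYaml.slice(0, 3000); // truncated per agent.ts:1054

  // Anthropic: base64 image content block + text block
  const anthropic = [
    { type: 'image', source: { type: 'base64', media_type: 'image/jpeg', data: screenshotB64 } },
    { type: 'text', text: axTree },
  ];

  // Gemini: inlineData part + text part
  const gemini = [
    { inlineData: { mimeType: 'image/jpeg', data: screenshotB64 } },
    { text: axTree },
  ];

  return { anthropic, gemini };
}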

The chip row of things this approach does not need

The traditional Playwright visual testing checklist is twelve items long. Most of them exist to keep pixelmatch happy. None of them exist here.

no `tests/__snapshots__/` directory
no `*-chromium-darwin.png` baselines
no `*-chromium-linux.png` divergence
no `--update-snapshots` flag
no `maxDiffPixels` tuning
no `animations: 'disabled'` config
no `mask: [page.locator('.ad')]` regions
no font-loading dance
no timezone freeze
no color-scheme pinning
no per-platform baseline storage
no diff-PNG triage UI

How a vision-model run actually looks in the terminal

One CLI invocation, one preview URL, one English paragraph. The screenshot exists in memory only; the WebM is the durable record.

$ assrt test ... --case '...' --video

How a screenshot becomes a verdict, in one diagram

Inputs to the vision model, outputs back to the agent

Real Chromium renders the page → live AX tree + JPEG q50 go to the vision model → pass / fail comes back → the verdict drives the next tool call. The same frames, plus overlay, land in the WebM.

The screenshot is the comparator, not the artefact. Discarded after the model turn.

The full feature comparison

| Feature | toHaveScreenshot() + pixelmatch | Assrt (vision model + WebM) |
| --- | --- | --- |
| What detects the regression | pixelmatch counts differing pixels against a baseline PNG; failure if count > maxDiffPixels. | A vision model reads the JPEG screenshot and judges it against an English assertion in the #Case. |
| Where the baseline lives | tests/__snapshots__/<spec>.spec.ts-snapshots/<name>-chromium-darwin.png, committed to your repo. | There is no baseline. The English assertion is the spec; the model is the comparator. |
| Update workflow when the UI changes | `npx playwright test --update-snapshots`, then review every diff in the PR. | Edit the English paragraph in the #Case. No regenerated PNGs to review. |
| Anti-flake setup required | Disable animations, freeze fonts, mask dynamic regions, set timezone, set color scheme, set locale. | wait_for_stable polls the DOM until mutation stops (default 2s). That is it. |
| Cross-OS rendering differences | Baselines are per-platform: <name>-chromium-darwin.png vs <name>-chromium-linux.png. CI vs local often diverges. | The model reads both screenshots the same way; subpixel rendering differences do not break the assertion. |
| Cost shape | Storage of N baselines × platforms; CI compute for diff. | Per-screenshot vision tokens against your Anthropic key. Quality 50 keeps payload small. |
| Permanent visual artefact | Diff PNG (expected/actual/diff) on failure only. | Optional WebM at 1600x900 with cursor + ripple + keystroke + heartbeat overlays for every step. |

When pixelmatch is still right

Lock-in beats judgement when the spec is “pixel-identical to this PNG.”

An icon set, a marketing screenshot grid, a generated PDF preview, a chart with strict design tokens, anywhere a one-pixel shift is itself the regression: pixelmatch is the right tool. Keep your toHaveScreenshot() tests for those. The vision-model loop is for the long tail where the assertion is “the right thing is on the screen,” not “the same pixels are on the screen.” Both are real Playwright. They run in the same Chromium, on the same viewport, against the same app. Pick the one that matches the question you are asking.

Want to see this run against a screen you keep regressing?

Bring a preview URL and one screen that pixelmatch keeps flaking on. We point the vision model at it, watch the WebM together, and you walk away with a #Case paragraph and a recording.

Frequently asked questions

Is this actually visual testing, or just E2E testing with screenshots attached?

It is visual testing in the working sense: a visual regression (color, layout, copy, a modal that fails to open, a missing element) is caught by something other than a DOM assertion. The mechanism is different from pixelmatch baselines. Assrt's agent loop at assrt-mcp/src/core/agent.ts:1024 captures a fresh JPEG quality-50 screenshot after every visual action (navigate, click, type, select, scroll, press_key) and pushes it back to the LLM as a tool-result image block at agent.ts:1036-1037. The model sees the screen exactly as a user would, and writes its assertion in plain English ("the upgrade modal is open and the annual price reads $39/mo"). Pixelmatch is one way to detect a regression; a vision model reading the screenshot is another. Both are visual.

Why JPEG quality 50? That sounds aggressive.

It is. The line is browser.ts:601: `await this.callTool("browser_take_screenshot", { type: "jpeg", quality: 50 })`. Two reasons. First, every screenshot becomes an image block in a Claude or Gemini API request, and image blocks are billed by pixel count and base64 size. Quality 50 cuts the payload to a fraction of an equivalent lossless PNG, which adds up over a 30-step scenario. Second, vision models do not read a 1600x900 frame at full lossless fidelity anyway; the lossy compression artifacts at q50 are below the model's perceptual threshold for layout, color, and text legibility. If a regression is visible to a human at q50, the model will see it.

How is this not just `toHaveScreenshot()` with extra steps?

Because there is no baseline file. `toHaveScreenshot()` writes a `*-chromium-darwin.png` to disk on the first run, then runs pixelmatch on every subsequent run and fails if the diff exceeds the threshold. Assrt's screenshot path never persists. The screenshot exists in memory for one model turn, the model reads it, and it is discarded (or, if you ran with `--video`, captured into a 1600x900 WebM at browser.ts:628 instead). There is no `tests/__snapshots__/` directory in your repo. There is no "update baselines" flag. The assertion is whatever the English `#Case` paragraph said it was.

What about flake from animations, fonts, dynamic content, ads?

The traditional fix list (disable animations, hide elements, mask regions, set a deterministic timezone, preload fonts) exists because pixelmatch is binary: a single-pixel difference fails the test. A vision model reasoning about a screenshot does not care that two adjacent frames of a CSS spinner differ by 12 pixels of subpixel anti-aliasing; it cares whether the spinner is present. Assrt does not ship masking config or animation-disabling CSS, because the failure mode it is built for is different. There is one wait helper (`wait_for_stable` at agent.ts:186-195, polls until the DOM stops mutating for N seconds, default 2s) and that is it.
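
For the curious, a DOM-quiet poller with the same contract can be built in plain Playwright with a MutationObserver. This is a guess at the mechanism, not the real agent.ts:186-195 implementation; it just resolves once the DOM has stopped mutating for quietMs (default 2s):

import type { Page } from 'playwright';

async function waitForStable(page: Page, quietMs = 2000, timeoutMs = 30000) {
  await page.evaluate(
    ({ quietMs, timeoutMs }) =>
      new Promise<void>((resolve, reject) => {
        const started = Date.now();
        let lastMutation = Date.now();

        // any DOM change resets the quiet clock
        const observer = new MutationObserver(() => {
          lastMutation = Date.now();
        });
        observer.observe(document.documentElement, {
          childList: true, subtree: true, attributes: true, characterData: true,
        });

        const timer = setInterval(() => {
          if (Date.now() - lastMutation >= quietMs) {
            clearInterval(timer);
            observer.disconnect();
            resolve();
          } else if (Date.now() - started >= timeoutMs) {
            clearInterval(timer);
            observer.disconnect();
            reject(new Error(`DOM still mutating after ${timeoutMs}ms`));
          }
        }, 100);
      }),
    { quietMs, timeoutMs }
  );
}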

Where does the recording end up, and what is in it?

If you pass `--video` to the CLI (cli.ts:89-91), Playwright MCP starts a WebM recording at viewport size 1600x900 (browser.ts:628), and the file ends up under your test output directory as `recording.webm` (cli.ts:540-542). The recording is not just raw page video. It contains the injected DOM overlay defined at browser.ts:33-97: a 20px red cursor (`rgba(239,68,68,0.85)` with white border), a click ripple that scales 0.5 to 2.0 over 50ms, a green monospace keystroke toast at the bottom that fades after 2500ms, and a 6px green heartbeat dot at bottom-right pulsing every 800ms. The heartbeat is there for a single technical reason: it forces the compositor to push frames continuously, so the WebM does not skip during idle waits.

Does the agent see the cursor and keystroke overlay too?

No, and this is the subtle part. The injected overlay only matters for video frames that the CDP screencast captures. The screenshots passed to the LLM are taken via `browser_take_screenshot` (browser.ts:601), which Playwright takes from the page snapshot, not from the screencast layer. The model sees the real page; the human watching the WebM sees the page with the overlay on top. So the overlay never confuses the model into clicking on a fake red dot, and a human reviewing a failed scenario sees exactly which element the agent moved to and what it typed.

What does the agent actually do with the screenshot it just captured?

The screenshot rides as an `image` content block in the next user-role message of the conversation. For Anthropic, it is `{ type: "image", source: { type: "base64", media_type: "image/jpeg", data: screenshot } }` (agent.ts:1037). For Gemini, it is `{ inlineData: { mimeType: "image/jpeg", data: screenshot } }` (agent.ts:684). Alongside it, the same message carries the latest accessibility-tree YAML (truncated to 3000 chars at agent.ts:1054). So every model turn after a visual action is reading both the rendered pixels and the AX tree at the same time. The DOM tells it what to click; the screenshot tells it whether the click did what was expected.

Can I still write `await expect(page).toHaveScreenshot()` if I want pixelmatch?

Yes. Assrt is built on real Playwright under the hood, exposed through the official @playwright/mcp server (browser.ts:296 launches it as `npx @playwright/mcp/cli.js` with `--viewport-size 1600x900`). Anything Playwright supports is available via `browser_evaluate` or by writing a normal `*.spec.ts` next to your `#Case` markdown. The two patterns coexist. Most teams that adopt Assrt for the LLM-watching pattern keep their existing pixel-diff tests for components where they actually want pixel-perfect lock-in (icons, marketing screenshots, design-token regressions). The LLM-watching pattern handles the long tail of "is the right thing on the screen" assertions where pixelmatch is brittle.

Is the screenshot data ever stored anywhere I can inspect later?

Two paths. In normal mode, the JPEG bytes live in the model conversation only (sent inline as base64), and the agent emits a `screenshot` event over the EventEmitter at agent.ts:653 and 1025 that includes the base64 payload, so the web UI can render a thumbnail strip. In file output mode (`--output-dir`), Playwright MCP writes the JPEGs to disk and Assrt reads them back with `readFileSync` and base64-encodes them for the model (browser.ts:606-620). If you want a permanent visual record, use `--video` and you get the WebM that contains every frame plus the cursor/keystroke overlay.

How do I run a visual check on just one component?

Write a `#Case` paragraph that focuses the agent. Example: `#Case 1: the pricing card upgrade modal looks right. Navigate to /pricing. Click "Upgrade" inside the Pro card. Verify a modal appears centered on the page, the title reads "Upgrade to Pro", the price reads $39/mo, the "Pay" button is enabled, and the modal can be closed with Escape.` That paragraph becomes the prompt. The agent will snapshot, click, snapshot again, screenshot, and the model judges every visible part of that scene against the English checklist. No selectors, no baseline image. If a designer changes the modal background tomorrow and breaks visual hierarchy, the model can see that and report it.

assrt · Open-source AI testing framework
© 2026 Assrt. MIT License.
