The visual regression tutorial without toHaveScreenshot()

A visual regression tutorial that never touches a golden PNG.

Every other tutorial on the first page of Google for this keyword walks you through toHaveScreenshot(), a __snapshots__/ folder, and a maxDiffPixels knob you tune until CI stops flaking. This one does not. Assrt treats visual regression as a reasoning problem: each step screenshot is attached as a base64 JPEG to Claude Haiku 4.5, and the model decides pass or fail from the frame. There is no baseline image, because there is nothing to diff against.

Matthew Diakonov
10 min read
4.8 from Assrt MCP users
Zero toHaveScreenshot() calls, zero golden PNGs, zero maxDiffPixels
Screenshots attached to Claude Haiku 4.5 as base64 JPEG
TestAssertion is three fields: description, passed, evidence

The premise every top-5 result skips

If a model can read the screenshot, you do not need a baseline PNG to compare it to. You need a sentence that describes what correct looks like.

That one inversion is the whole tutorial. Everything below is how Assrt implements it, verified by line number, and how to run it against your own app in under a minute.

What the SERP teaches vs what you are about to read

I read the top five results for this keyword before writing this page. Every one of them hands you the same recipe: await expect(page).toHaveScreenshot("homepage.png", { maxDiffPixels: 100 }), run npx playwright test --update-snapshots to generate a baseline, commit the PNG, and tune a threshold when CI flakes. The recipe works, and it is what you want if you are doing pixel-level component QA on a locked design system. It is not what you want for page-level, behavior-level visual regression on a product that changes weekly.

Two tutorials, same keyword, opposite shape

Capture a screenshot, store it as homepage.png in __snapshots__/, compare every future run with pixelmatch at a threshold of 0.2, tune maxDiffPixels when the hero gradient shifts, and run --update-snapshots every time marketing touches a color. Regression = any pixel difference above the tolerance.

  • Baselines committed to git as PNGs, churning on every theme tweak
  • maxDiffPixels and threshold tuned per-test to suppress flake
  • animations: 'disabled' + mask: [...] bolted onto every call
  • A fail is a diff image; you still have to read the UI to decide

Side by side: the test file, the plan, the same assertion

Same goal on both sides: make sure the homepage still looks right after a deploy. Left is the classic Playwright pattern the SERP teaches. Right is the Assrt plan. Neither is long. They are architecturally different.

toHaveScreenshot() vs a plain-English #Case

import { test, expect } from "@playwright/test";

test("homepage looks right", async ({ page }) => {
  await page.goto("/");
  // First run: creates homepage.png in __snapshots__/
  // Every other run: pixelmatch against that baseline
  await expect(page).toHaveScreenshot("homepage.png", {
    maxDiffPixels: 100,
    threshold: 0.2,
    animations: "disabled",
    mask: [page.locator(".toast"), page.locator(".shimmer")],
  });
});

// When the marketing team tweaks the hero gradient:
// npx playwright test --update-snapshots
// (and pray you didn't bake in a real regression)
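The right-hand side is the Assrt plan: plain English in a .md file. A hedged sketch of what such a scenario.md might look like — the wording below is illustrative; only the #Case header (or Test, or Scenario) is required by the parser:

```markdown
# Case: homepage looks right

Open the homepage.
Check that the hero section renders with its headline and call-to-action button.
Check that the navigation bar shows the logo and menu items.

Pass Criteria:
- No overlapping or visibly broken layout elements
- The hero headline is fully visible, not truncated
```

When marketing tweaks the hero gradient, this file does not change, because it never said anything about the gradient.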

The anchor fact: the screenshot is a JPEG, not a baseline

This is the uncopyable part of the page. The behavior lives in a single block of code inside assrt-mcp/src/core/agent.ts around lines 972-990. Every click, type, navigate, scroll, and key press is followed by a fresh screenshot, captured with Playwright and attached to the next LLM message as a base64 JPEG. That is the whole visual regression pipeline.

assrt-mcp/src/core/agent.ts:972-990

Note the actions excluded from the screenshot guard on line 974: snapshot, wait, assert, create_temp_email, and friends. Those do not change the pixels, so capturing after them would waste tokens and slow the run. Only the visual-affecting actions trigger a new JPEG. That detail is why the screenshot attachment rate lines up 1:1 with state changes the model needs to reason about, not with every tool call.
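The guard is easy to picture as an allow-list. A minimal sketch, assuming only the action names listed on this page; the helper name `shouldCaptureScreenshot` is illustrative, not the repo's actual symbol:

```typescript
// Visual-affecting actions, per the list on this page.
const VISUAL_ACTIONS = new Set([
  "navigate", "click", "type_text", "select_option", "scroll", "press_key",
]);

// Illustrative guard: snapshot, wait, assert, create_temp_email, etc.
// fall through to false, so no JPEG is captured after them.
function shouldCaptureScreenshot(action: string): boolean {
  return VISUAL_ACTIONS.has(action);
}
```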

The 8 tools the agent can call, none of them diff pixels

navigate, snapshot, click, type_text, select_option, scroll, press_key, wait

The assertion has three fields. None of them is a pixel count.

The failure signal in a classic pixel-diff tutorial is a number: N mismatched pixels above threshold T. The failure signal in Assrt is a sentence. The full type is below, from the source.

assrt-mcp/src/core/types.ts:13-17

If you grep the codebase for toHaveScreenshot, maxDiffPixels, pixelmatch, or pngjs, you get zero results. Visual regression is implemented without any of those primitives. That is not a marketing claim; it is the actual search output against the repo.
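Spelled out as TypeScript, the shape is small enough to read in full. A sketch matching the three field names given above for types.ts:13-17; the sample failure object is invented for illustration:

```typescript
// The three-field assertion shape described above. No tolerance,
// no mismatched-pixel count, no diff-image URL.
interface TestAssertion {
  description: string; // what the model was asked to check, in English
  passed: boolean;     // the verdict
  evidence: string;    // what the model actually saw in the frame
}

// A failing visual check reads like a sentence, not a diff:
const failure: TestAssertion = {
  description: "The Submit button shows a visible text label",
  passed: false,
  evidence: "the Submit button has no visible label text, only a loading spinner",
};
```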

What the repo does not do

  • 0 calls to toHaveScreenshot()
  • 0 golden baseline PNGs on disk
  • 0 tolerance knobs in the config

What it does instead

  • 1 JPEG attached to the model per visual action
  • 3 fields on a TestAssertion
  • 8 Playwright MCP tools driving the browser

How a JPEG becomes a pass or fail, end to end

The pipeline is short on purpose. There is no diff worker, no baseline fetcher, no image-store API. Each frame is born, sent to the model, and either survives in the transcript or gets superseded by the next action's frame.

From the browser to a sentence the PR reviewer can read

Chromium → visual action → JPEG capture → Claude Haiku 4.5 → TestAssertion → results JSON, with a PNG on disk and a WebM video left behind as artifacts.

What a run actually prints

The command below runs the two-case scenario.md from the code comparison above. The assertions a run prints at the end are the visual regression checks; note that they carry evidence strings in place of a diff image.

npx assrt-mcp run --url https://staging.acme.app --plan-file scenario.md
0 pixel diffs

Grep the entire assrt-mcp source tree for toHaveScreenshot, maxDiffPixels, pixelmatch, or pngjs. Zero hits. Visual regression is a JPEG attached to a model, and an assertion with three fields: description, passed, evidence.

assrt-mcp/src/core/agent.ts:987 + types.ts:13-17

Pixel-diff vs semantic visual regression, honestly

Both approaches catch real bugs. Neither one makes the other obsolete. The table below is the pragmatic breakdown: if you own a locked design system, Playwright's built-in diff is still great for component-level pixel checks. If you own a product that ships features weekly, semantic VRT catches more regressions per maintenance hour.

Feature | Playwright toHaveScreenshot() (pixel-diff VRT) | Assrt (semantic VRT)
What the tool is looking at | PNG vs PNG, pixel by pixel, via pixelmatch | JPEG of the current frame, read by Claude Haiku 4.5
Source of truth | Golden PNG in __snapshots__/, committed to git | One-sentence description in scenario.md
Tolerance model | maxDiffPixels, threshold, mask, animations: 'disabled' | No tolerance knob; the model judges the frame against the plan
First run | npx playwright test --update-snapshots, commit PNGs | npx assrt-mcp run --url ... --plan-file scenario.md
When marketing tweaks a gradient | Regenerate baselines, review the diff, pray | Nothing changes, unless the plan said the gradient matters
Dynamic content (toasts, shimmers, animations) | Must be masked or animation-disabled to stop flake | Fine by default; the plan specifies what to verify
Shape of a failure | N mismatched pixels + diff-image.png for review | passed: false, evidence: 'one-sentence description'
Best for | Component-level pixel QA on a locked design system | Page-level user-journey checks on a fast-moving product

Your first semantic visual regression run, four steps

This is the minimal path from nothing installed to a passing (or informatively failing) run against your own app. No config file. No baseline folder. No CI wiring required.

First run, start to finish

1

Install the CLI

npx assrt-mcp is the only command you need. The package installs an MCP server, a persistent browser profile at ~/.assrt/browser-profile so logins survive between runs, and a PostToolUse hook for Claude Code if you use it.

2

Write a plan in plain English

Create scenario.md with a #Case header and one or two imperative sentences about what the page should look like. The file is parsed by the regex #?\s*(?:Scenario|Test|Case), so #Case, Test, and Scenario all work. No YAML, no JSON schema.

3

Run it headed on the first attempt

npx assrt-mcp run --url http://localhost:3000 --plan-file scenario.md --headed. Watch the browser. The agent narrates each action, captures a JPEG after every visual step, and asks Claude Haiku whether the frame matches your plan.

4

Read the evidence field, not a diff image

Open /tmp/assrt/results/latest.json. The assertions array contains description, passed, evidence. The evidence is the model's first-person description of what it actually saw. When a test fails, you read the sentence, watch the WebM, and update the plan or fix the app.

5

Commit the .md, archive the .webm, move on

scenario.md goes into git like any other source file. Reviewers read English. The WebM video and JSON report upload as CI artifacts. There is no __snapshots__/ folder to maintain and no --update-snapshots to re-run on every theme tweak.
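The evidence-first triage in step 4 is a few lines of code if you want it in CI. A hypothetical helper, assuming only the three-field assertion shape described on this page:

```typescript
// Hypothetical CI triage for the assertions array found in
// /tmp/assrt/results/latest.json. Only the three-field shape
// { description, passed, evidence } is assumed from this page.
interface Assertion {
  description: string;
  passed: boolean;
  evidence: string;
}

// One readable sentence per failure, suitable for a PR comment.
function failedEvidence(assertions: Assertion[]): string[] {
  return assertions
    .filter((a) => !a.passed)
    .map((a) => `${a.description}: ${a.evidence}`);
}
```

Feed it `JSON.parse(fs.readFileSync("/tmp/assrt/results/latest.json", "utf8")).assertions` and print the result in the CI log; a green run produces an empty array.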

What you give up, and what you keep

Semantic visual regression is not strictly better than pixel diffing, and I would not pretend otherwise. Here is what actually changes when you adopt this pattern, verified against the Assrt source.

The real trade, verified from source

Constraints that hold across every run

  • Zero calls to Playwright's toHaveScreenshot(), verified by grep across assrt-mcp/src
  • Zero mentions of maxDiffPixels, threshold, or pixelmatch in the codebase
  • Screenshots are base64 JPEG attached to Claude Haiku 4.5 at agent.ts line 987
  • TestAssertion = { description, passed, evidence } — three fields, types.ts lines 13-17
  • No __snapshots__/ directory to maintain, no --update-snapshots flow
  • Screenshots saved per step to /tmp/assrt/<runId>/screenshots/ as forensic artifacts, not baselines
  • Plan lives in /tmp/assrt/scenario.md, parsed by the regex #?\s*(?:Scenario|Test|Case)

The obvious trade: you cannot catch a 1px color drift on a button outline anymore. That is a real loss for a mature design-system team. The gain: you stop maintaining a baseline folder for pages whose layout genuinely changes every week, and your "visual regression failed" conversation becomes a sentence instead of a three-pane diff viewer.

Why this is open-source and self-hosted

The comparable tier-3 AI testing platforms charge around $7,500 a month at scale and keep your scenarios, screenshots, and diff history in their cloud. Assrt is npx assrt-mcp, open-source, $0 to run, everything on your disk. The plan is English in a .md file. The tools are the public Playwright MCP vocabulary. The code that captures the screenshot is browser.screenshot(). There is no vendor DSL between you and the assertion, which means the same plan runs unchanged under any other Playwright MCP agent you point at it.

Run a semantic visual regression against your own app

One npx command, one .md file, real Playwright under the hood. Video auto-opens when the run finishes. No baseline folder, no account, no cloud. When you cancel, you keep the plan, the videos, and every JPEG the model read.

Install npx assrt-mcp

Visual regression tutorial: specific answers

Is this a Playwright toHaveScreenshot() tutorial?

No, on purpose. Every other result for this keyword already teaches toHaveScreenshot() with a pixelmatch-backed diff, a __snapshots__ folder of golden PNGs, and a maxDiffPixels or threshold knob you tune. The Assrt source contains zero references to toHaveScreenshot, maxDiffPixels, or pixelmatch. Instead, this tutorial walks through semantic visual regression: the screenshot after every browser action is attached as base64 JPEG to a Claude Haiku 4.5 tool-result message (see assrt-mcp/src/core/agent.ts lines 972-990), and the model answers a plain-English question about what it sees. Pass and fail are decided by reasoning, not by a pixel count.

What exactly does Assrt send to the model on each step?

After every visual action (navigate, click, type_text, select_option, scroll, press_key), the agent captures a JPEG screenshot and pushes it into the tool result content as { type: 'image', source: { type: 'base64', media_type: 'image/jpeg', data: screenshotData } }. That exact structure is at assrt-mcp/src/core/agent.ts line 987. The model sees the result text from the tool call plus the current frame, and decides whether to continue, retry, or emit a pass/fail assertion. There is no diff image, no composite overlay, and no baseline PNG. The only state the model has about 'what the page should look like' is the English plan and the evidence field of prior assertions.
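For concreteness, that content block can be constructed like this. The base64 string below is a placeholder; everything else mirrors the structure quoted above from agent.ts line 987:

```typescript
// The tool-result image block described above (agent.ts:987).
// `screenshotData` is a stand-in for real base64 JPEG bytes.
const screenshotData = "/9j/4AAQSkZJRgABAQ==";

const imageBlock = {
  type: "image" as const,
  source: {
    type: "base64" as const,
    media_type: "image/jpeg" as const,
    data: screenshotData,
  },
};
```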

Where do the screenshots actually land on disk?

Every screenshot is written to /tmp/assrt/<runId>/screenshots/<index>_step<stepNumber>_<action>.png. The naming convention is set at assrt-mcp/src/mcp/server.ts line 468. The index is zero-padded to two digits. The server deduplicates per step, so if the agent thrashes during one action you only keep the last PNG for that step (server.ts lines 473-486). Unlike Playwright's native workflow, these are not baselines and nothing compares them to each other. They are forensic artifacts for humans to review alongside the WebM video at /tmp/assrt/videos.
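The naming convention reduces to one template string. A sketch assuming only the pattern and the two-digit zero-padding described above; the helper name is invented:

```typescript
// <index>_step<stepNumber>_<action>.png, index zero-padded to two
// digits, as described above for server.ts line 468.
function screenshotName(index: number, step: number, action: string): string {
  return `${String(index).padStart(2, "0")}_step${step}_${action}.png`;
}
```

So the first frame of a click step lands as 00_step1_click.png under /tmp/assrt/&lt;runId&gt;/screenshots/.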

What does a failing assertion look like without a pixel diff?

The TestAssertion type at assrt-mcp/src/core/types.ts lines 13-17 is exactly three fields: description (the English thing the model was checking), passed (boolean), and evidence (a free-text description of what the model saw on the screenshot). There is no tolerance field, no mismatched pixel count, no diff image URL. When a visual regression fails, you get a line like 'passed: false, evidence: the Submit button has no visible label text, only a loading spinner'. That is the full fail signal. It is meant to be readable in a PR comment without opening a three-panel diff viewer.

How do I run this locally in under a minute?

npx assrt-mcp run --url http://localhost:3000 --plan-file scenario.md. scenario.md is a plain-text file with a #Case header and imperative English (grammar is the regex #?\s*(?:Scenario|Test|Case) in assrt-mcp/src/core/scenario-files.ts). No baseline PNG folder to initialize. No --update-snapshots first run. On the first run you watch the WebM video, read the pass/fail in /tmp/assrt/results/latest.json, and you're done. If you want a richer plan, add a 'Pass Criteria' section with the visual properties the model should verify, and it will assert against them one by one.
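The header grammar really is just that regex. A quick check of the three spellings this page says are accepted; how the repo applies the match beyond a header test is not shown here:

```typescript
// The header grammar quoted above from scenario-files.ts.
const headerRe = /#?\s*(?:Scenario|Test|Case)/;

// All three spellings mentioned on this page match:
const headers = ["#Case: login works", "Test: checkout", "Scenario: signup"];
const allMatch = headers.every((h) => headerRe.test(h));
```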

What about CSS animations and moving elements that wreck pixel-diff runs?

This is the scenario where semantic visual regression shines. A classic toHaveScreenshot() run fails when a shimmer loader, a rotating carousel, a fade-in toast, or an animated skeleton changes even two pixels between runs. You end up adding animations: 'disabled' and mask: [locator] to every call, and the diffs still leak through on slow CI. Assrt's agent does not care. The prompt is 'has the confirmation toast appeared with the order ID' and the model answers from the frame regardless of what other pixels moved. The evidence field records what it actually read. There is no baseline to regenerate when marketing tweaks the hero gradient.

What are the trade-offs versus pixel-diff visual regression?

Pixel-diff catches sub-pixel layout shifts and one-pixel color drifts that a model will not flag. If you are regression-testing a design system where a 1px border change matters, keep Playwright's toHaveScreenshot() for that specific component. Semantic visual regression is additive and better for page-level behavior: is the form in the right state, is the success banner up, did the avatar load. The realistic stack is: tier-1 Playwright toHaveScreenshot() for pixel-sensitive component tests, tier-4 Assrt for page-level user-journey visual regression. They answer different questions.

Which model does Assrt use for visual reasoning, and can I swap it?

The default is Claude Haiku 4.5 (claude-haiku-4-5-20251001), set at assrt-mcp/src/core/agent.ts line 9. You can override with --model at the CLI. Gemini 3.1 Pro Preview is supported via the Provider type ('anthropic' | 'gemini') on the same file. Both providers receive the same JPEG screenshots after each visual action, so switching models does not change the plan format or the artifact layout. It changes how the screenshot is interpreted, which means you can A/B two models against the same plain-text #Case and compare their evidence strings.
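In code, the swap is a two-value union plus a default string. A sketch assuming only the identifiers this page quotes from agent.ts line 9; the `resolveModel` helper is hypothetical, standing in for the --model override:

```typescript
// Provider union and default model id, as quoted from agent.ts line 9.
type Provider = "anthropic" | "gemini";
const DEFAULT_MODEL = "claude-haiku-4-5-20251001";

// Hypothetical CLI-style resolution: an explicit --model wins,
// otherwise the default ships.
function resolveModel(override?: string): string {
  return override ?? DEFAULT_MODEL;
}
```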

Can I commit this to git and review it in a PR?

Yes, and that is the point. The plan lives in /tmp/assrt/scenario.md (layout documented at assrt-mcp/src/core/scenario-files.ts lines 5-20), which is plain text with no YAML or DSL. Copy it into your repo, commit it, and reviewers read the English. The results JSON at /tmp/assrt/results/<runId>.json is structured (schema at assrt-mcp/src/core/types.ts lines 28-35) and safe to archive as a CI artifact. The WebM video of the run uploads as a GitHub Actions or GitLab CI artifact and plays in any browser. There is no vendor dashboard to log into, and there is no __snapshots__ folder that churns on every theme tweak.

Is this open-source? What is the lock-in?

Assrt ships as an open-source npm package (npx assrt-mcp). Self-hosted, $0 to run against your own app, no cloud dependency. The comparable tier-3 AI testing platforms charge around $7,500 a month at scale and keep scenarios, selectors, and diff history in their cloud. With Assrt, the scenario is a .md file on your disk, the results are JSON on your disk, the video is a WebM on your disk, and the screenshots are PNGs on your disk. Cancel tomorrow and you still have a runnable plan, a readable transcript, and a video of every run. The plan even ports to any other Playwright MCP agent unchanged, because the tools are public and not a vendor DSL.

Visual regression, without the diff folder

A screenshot, a sentence, a pass or fail. That is the whole test.

0 golden PNGs, 3 fields on a TestAssertion, 8 Playwright MCP tools driving the run.

Try Assrt free
