Field guide, no marketing

Visual regression testing, including the baseline tax nobody mentions.

Most guides on this topic teach baseline management as the central skill. Storing goldens, approving diffs, tuning thresholds. They skip the part where every intentional design change costs a human-hour of reviewer attention. This page covers how visual regression actually works, when it is worth the tax, and a no-baseline alternative that records a Playwright WebM you can question in English after the fact.

Matthew Diakonov
11 min read
4.8 from 180+ engineering teams
Open source, MIT licensed
Self-hosted, runs on your machine
Generates plain Playwright code

The classical pipeline, in one diagram

Strip the marketing off any visual regression product and the wiring is roughly the same. A test runner takes a screenshot, a diff engine compares it to a stored baseline, and either an automated threshold or a human reviewer decides whether to approve the new image as the new baseline. The variable bits are which renderer takes the screenshot, which algorithm does the diff, and where the baselines live.

What every traditional visual regression tool actually does:

Test runner → Headless browser (takes the screenshot) → Diff engine (compares against the Stored baseline) → Pass / fail → Diff image → Reviewer queue

What it costs in PR review time

The hidden line item on every visual regression rollout is reviewer attention per intentional UI change. Numbers below are rough field observations, not benchmarks. They scale almost linearly with the surface area of your UI and the number of contributors.

  • 40 baselines that flip on a single header redesign across a 40-page site.
  • 0-7 min of human attention per diff to scan the red-overlay PNG and decide if it is intentional.
  • 0 directories named __snapshots__ in the assrt-mcp source. Zero .png baselines. Zero pixel-diff dependencies.
  • ~3 hours of reviewer time a week at the pace in the quote below (30+ diffs at a few minutes each).

"The team approves 30+ baseline diffs a week and at this point nobody actually looks at them." (Engineering lead, mid-size SaaS, post-rollout)

Two ways to write the same check

Below is the same intent expressed two ways: first, the standard Playwright recipe with a committed baseline; then an Assrt scenario that produces no baseline file at all.

Visual regression, two styles

// tests/checkout.spec.ts
import { test, expect } from "@playwright/test";

test("checkout page looks right", async ({ page }) => {
  await page.goto("/checkout");
  await page.waitForLoadState("networkidle");
  await page.evaluate(() => document.fonts.ready);

  await expect(page).toHaveScreenshot("checkout.png", {
    fullPage: true,
    animations: "disabled",
    mask: [
      page.locator("[data-testid=order-id]"),
      page.locator("[data-testid=timestamp]"),
      page.locator("iframe[src*=stripe]"),
    ],
    maxDiffPixels: 100,
    maxDiffPixelRatio: 0.001,
  });
});

// On first run: npx playwright test --update-snapshots
// commits tests/checkout.spec.ts-snapshots/checkout.png (Playwright's
// default snapshot folder, next to the test file) to the repo.
// Every subsequent PR diffs against this PNG.
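
The Assrt side of the comparison is a plain Markdown scenario rather than code. The repo's actual example is not reproduced on this page, so here is a sketch in the scenario format described in the FAQ below; the file name, URL, and step wording are illustrative:

scenarios/checkout.md (illustrative)

# Test: Checkout page looks right

1. Open https://your-app.example/checkout
2. Confirm the order summary, the payment form, and the place-order button are all visible.
3. Apply a coupon and confirm a discount line appears in the order summary.
4. Confirm nothing looks broken: no overlapping text, no missing images, no raw unstyled HTML.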
45% fewer lines, no PNG to commit

Both styles are valid. The Playwright version will catch a 1px border drift; the Assrt scenario will catch a coupon that fails to apply even when the layout is pixel-identical to last week. Pick the tool that matches the question you are actually asking.

Where each approach earns its keep

| Feature | Pixel-diff baselines | Semantic vision |
| --- | --- | --- |
| Catches a 1px border or color drift | Yes, this is the strength | Unreliable, the model will not flag it |
| Catches a broken coupon flow | No, screenshot still matches | Yes, scenario asserts the discount line |
| Survives a font that loads 50ms late | Often flakes, needs document.fonts.ready guard | Tolerant, the model judges layout meaning |
| Survives an ad iframe rotating creative | Needs explicit mask: [...] | Tolerant by default |
| Maintenance per intentional UI change | One reviewer scan per affected baseline | Zero, the scenario does not name pixels |
| Artifact you keep on disk | tests/*-snapshots/*.png | /tmp/assrt/<runId>/videos/recording.webm + player.html |
| Can you re-question the run later? | Only by re-running | Yes, assrt_analyze_video reads the cached WebM |
| Vendor lock-in | Format-dependent (Percy, Chromatic, Applitools) | None: scenarios are .md, results are JSON, recordings are WebM |

Both are valid; most mature suites end up running both, scoped to what each is good at.

The Assrt run, end to end

Here is what a single Assrt invocation does, traced through the source. File and line numbers are from the open-source assrt-mcp repo so you can verify any of this.

1. Read the scenario

Parse the .md file. The case header is matched by the regex /#?\s*(?:Scenario|Test|Case)/. Each numbered step becomes an instruction the agent will work through.
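
The regex is permissive on purpose. A quick illustration of what it accepts (a sketch, not repo code):

const CASE_HEADER = /#?\s*(?:Scenario|Test|Case)/;
CASE_HEADER.test("# Test: Header is teal on /");   // true
CASE_HEADER.test("Scenario: coupon applies");      // true
CASE_HEADER.test("## Case 12: empty cart");        // true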

2. Launch a Playwright browser with video recording on

Chromium boots at a 1600x900 viewport. Playwright's native video recording is enabled (the file is surfaced via page.video()), which means the browser session is recorded to a WebM file from frame zero.
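
In Playwright's API, that is the recordVideo option on the browser context, with page.video() as the accessor for the finished file. A minimal sketch of the wiring, with an illustrative output directory (not the repo's exact code):

import { chromium } from "playwright";

const browser = await chromium.launch();
const context = await browser.newContext({
  viewport: { width: 1600, height: 900 },
  recordVideo: { dir: "/tmp/assrt/raw-videos" }, // illustrative; assrt-mcp manages its own run dir
});
const page = await context.newPage();
// ... the agent loop drives `page` here ...
await context.close();                           // flushes the WebM to disk
const webmPath = await page.video()?.path();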

3. Loop, action by action

For each visual action (navigate, click, type_text, scroll, press_key, select_option), the agent captures a JPEG screenshot at quality 50 and attaches it to the next Claude message as image input. Source: assrt-mcp/src/core/agent.ts:987.
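
A sketch of that capture-and-attach step, using the public Anthropic Messages API image-block shape (variable names and the step text are illustrative):

const shot = await page.screenshot({ type: "jpeg", quality: 50 });

messages.push({
  role: "user",
  content: [
    {
      type: "image",
      source: { type: "base64", media_type: "image/jpeg", data: shot.toString("base64") },
    },
    { type: "text", text: "After action: click #apply-coupon. Judge the frame against the scenario." },
  ],
});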

4. Claude Haiku 4.5 judges the frame

DEFAULT_ANTHROPIC_MODEL is claude-haiku-4-5-20251001 (assrt-mcp/src/core/agent.ts:9). The model reads each frame against the scenario and decides whether to continue, retry, or emit an assertion via the assert tool, whose fields are description (string), passed (boolean), and evidence (string).
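
A tool with exactly those fields, declared in the public Anthropic tool-use format (the three field names come from the text above; the descriptions are assumed):

const assertTool = {
  name: "assert",
  description: "Record a pass/fail judgment about the current scenario step.",
  input_schema: {
    type: "object",
    properties: {
      description: { type: "string" },  // what was checked
      passed: { type: "boolean" },      // the verdict
      evidence: { type: "string" },     // what in the frame supports it
    },
    required: ["description", "passed", "evidence"],
  },
};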

5. Stop the recording, finalize the artifacts

When the scenario finishes, the recorder is stopped (server.ts:578). The WebM is moved to /tmp/assrt/<runId>/videos/recording.webm. A self-contained player.html is generated next to it (server.ts:618).

6. Optional: question the recording in English

If GEMINI_API_KEY is set, the assrt_analyze_video MCP tool is registered (server.ts:929). It reads the WebM off disk, base64-encodes it, and sends it to Gemini 3.1 Flash Lite Preview as a video/webm part. You can ask any English question about the run, as many times as you want.

The retrospective question loop

Once the WebM is on disk, the cost of a new question is one Gemini call. You do not re-run the test, you do not re-take screenshots, you do not approve a baseline. You just ask.

assrt-mcp/src/mcp/server.ts (abridged)
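The abridged source is elided in this rendering. Going by the description above, its shape is roughly the following sketch against the @google/genai SDK; the exact model ID and variable names are assumed, not copied from the file:

import { readFile } from "node:fs/promises";

// Inside the assrt_analyze_video handler: read the cached WebM,
// base64-encode it, and send it to Gemini with the user's question.
const webm = await readFile(lastVideoFile);        // path cached at server.ts:270

const response = await gemini.models.generateContent({
  model: "gemini-3.1-flash-lite-preview",          // assumed ID for "Gemini 3.1 Flash Lite Preview"
  contents: [{
    role: "user",
    parts: [
      { inlineData: { mimeType: "video/webm", data: webm.toString("base64") } },
      { text: question },                          // any English question about the run
    ],
  }],
});
return response.text;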

When to keep pixel diffs anyway

This is not a pitch to delete every existing snapshot. The honest breakdown of where each approach earns its keep:

Keep pixel diffs for design system components

Button, Input, Card, Badge, Tag. These are pure visual primitives where 1px shifts and 1-shade color changes are real regressions. Run toHaveScreenshot on a Storybook-style isolation page.

Use semantic checks for user journeys

Checkout, signup, dashboard load, search. The question is functional, not pixel-level. The cost of a baseline is high, the value is low.

Pick one for marketing pages

Either works. Pixel diffs catch typo deployments. Semantic checks catch broken hero CTAs without flaking on font load timing.

Avoid pixel diffs on anything with live data

Dashboards, feeds, lists with timestamps. The mask:[...] config grows until it negates the test.

Record the run regardless

Even if you keep your pixel-diff suite, recording the WebM is essentially free and lets a human re-watch any failure in the browser. No reproduction steps, no flaky-on-CI debugging.

A 30-second sanity check on the source

The claims on this page about the Assrt-MCP source are verifiable. Clone the repo, run these three commands, and you will see what is and is not in there.

verify-the-claims.sh
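
The script body is elided in this rendering; three checks that match the page's claims would look like this (grep targets are the ones named in the FAQ below; the clone URL is omitted):

# 1. No pixel-diff machinery anywhere in the source
grep -rn "toHaveScreenshot\|pixelmatch\|resemble\|looks-same\|odiff" src/ \
  || echo "clean: no pixel-diff references"

# 2. No committed baseline directories
find . -type d -name "__snapshots__" | grep . || echo "clean: no __snapshots__ directories"

# 3. No .png baselines
find . -name "*.png" | grep . || echo "clean: no PNGs in the repo"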

Decision shortcut

  • Component library with strict visual contracts? Pixel diffs.
  • User journey through a flow? Semantic vision check.
  • Anything with live data, dates, or rotating ad creative? Semantic.
  • Want zero baseline approval workflow? Semantic, full stop.
  • Want a 1px border guard? Pixel diff, full stop.
  • Want both? Run them in parallel, scoped to what each is good at.

Want to see the no-baseline run on your app?

15 minutes. We point Assrt at one of your real flows, run it live, and you walk away with the WebM and the scenario .md.

Frequently asked questions

What is visual regression testing in plain language?

It is the practice of detecting when a UI changes in ways nobody intended. The classical recipe is: take a screenshot of a known-good page (the baseline or golden), commit it to the repo, then on every CI run take a fresh screenshot, pixel-diff it against the baseline, and fail if the diff exceeds a threshold. Tools like Playwright's expect(page).toHaveScreenshot(), Percy, Chromatic, and Applitools all start from this idea. The interesting bits are how each one handles fonts, animations, dynamic data, anti-aliasing, and the human approval workflow when a baseline genuinely needs to change.

Why does everyone complain about flakiness in visual regression?

Because pixel diffs trip on things humans never notice: a font that loads 50ms late and gets laid out one pixel differently, an animation that hasn't fully settled, an iframe ad that rotates its creative, a date that updates to today, a GPU rendering color slightly differently between Linux CI and macOS dev. Every mature suite ends up with a stack of mitigations: animations: 'disabled' in Playwright, mask: [...] for dynamic regions, page.waitForFunction(() => document.fonts.ready), explicit clock freezing, and a maxDiffPixels threshold tuned per page. Each mitigation is a config knob the team has to maintain.

What is the baseline approval tax, exactly?

Every intentional UI change generates a diff against every baseline that includes the affected element. A header tweak across a 40-page site can produce 40 diffs that each need a human pair of eyes. Reviewers learn to rubber-stamp them, which defeats the point. Then a real regression slips through under the same rubber stamp. The tax is the wall-clock time between 'I shipped a CSS change' and 'all baselines are approved and merged,' multiplied by every PR that touches anything visual. Most teams underestimate it by an order of magnitude.

How is Assrt's approach different from Playwright's toHaveScreenshot?

Assrt does not call toHaveScreenshot. The Assrt-MCP source has zero references to toHaveScreenshot, pixelmatch, resemble.js, looks-same, or odiff, and zero __snapshots__ directories. Instead, after each visual action (click, type_text, navigate, scroll, press_key), the browser screenshot is captured as JPEG quality 50 and attached to the next Claude message as an image input (assrt-mcp/src/core/agent.ts:987). Claude reads the frame against the scenario in plain English and decides if the assertion passed. The artifact you keep is not a baseline PNG, it is the full WebM recording (assrt-mcp/src/mcp/server.ts:578) plus a self-contained player.html (server.ts:618). You can re-watch any run, and if you set GEMINI_API_KEY you can also ask the recording questions in English via the assrt_analyze_video MCP tool (server.ts:925-1018).

Does that mean I never need pixel diffs again?

No. Pixel diffs are still the right tool for design system component libraries, brand-color regressions, and 1px border drifts. A model will not reliably catch a button that lost 1px of padding. But for user-journey level questions like 'is the cart total right after applying the coupon' or 'did the success toast appear,' the pixel diff is overkill and the maintenance is a tax. The honest answer is: keep toHaveScreenshot scoped to your component library, and use a semantic approach for full-page flows. The two are complementary.

How do I set up visual regression testing with Playwright today?

Add expect(page).toHaveScreenshot('checkout.png') to a test, run npx playwright test --update-snapshots once to seed the baseline, commit the resulting checkout.spec.ts-snapshots folder (Playwright's default snapshot directory, next to the test file), then run the suite normally on CI. Playwright will fail any test where the new screenshot differs from the committed PNG beyond the per-pixel color threshold (default 0.2) or any maxDiffPixels / maxDiffPixelRatio budget you set. Lock fonts in your test setup, disable animations with the animations: 'disabled' option, mask dynamic regions, freeze the clock if your UI shows timestamps, and pin the OS in CI so anti-aliasing is consistent.

Where does Assrt put the recording so I can keep it?

Each run lands in /tmp/assrt/<runId>/. Inside that folder you get videos/recording.webm (the Playwright video recorder output), player.html (a static HTML page that embeds the video, the scenario, and the assertions list), screenshots/*.jpg for each frame Claude saw, and a JSON TestReport matching the shape at assrt-mcp/src/core/types.ts:28. The MCP tool response includes a videoPlayerUrl pointing to a local static server that auto-opens in your browser at the end of each test (server.ts:629). Move the folder anywhere you like, the player works offline.

Can I question a past run instead of re-running it?

Yes, that is what the assrt_analyze_video MCP tool is for. Once a run finishes, the WebM path is cached in a module-level variable lastVideoFile (assrt-mcp/src/mcp/server.ts:270). You can call assrt_analyze_video with any English prompt, no videoPath argument required, and the entire recording is base64-encoded and sent to Gemini 3.1 Flash Lite as a single video/webm part. Repeat with new questions as often as you like, at the cost of one Gemini call each: 'did the modal flicker,' 'estimate the LCP element visually,' 'was there a flash of red on the pricing page.' Each question is a fresh request against the same recording on disk.

What runs Assrt's per-frame judgment, and is the model swappable?

The default is Claude Haiku 4.5. The exact model ID claude-haiku-4-5-20251001 is declared at assrt-mcp/src/core/agent.ts:9 as DEFAULT_ANTHROPIC_MODEL. The agent loop at agent.ts:1022 attaches the screenshot only after visual actions, not on every step, which keeps token usage bounded. You can override the model via env var or scenario config; the codebase reads ANTHROPIC_MODEL if set. Phase 2 (post-run video Q&A) uses Gemini 3.1 Flash Lite Preview because it accepts video/webm as a first-class modality in a single inlineData part.

Is any of this proprietary, or do I keep my tests if I cancel?

Everything stays on your disk in standard formats. Scenarios are .md files (the case header regex is #?\s*(?:Scenario|Test|Case)). Results are JSON matching the TestReport interface at assrt-mcp/src/core/types.ts:28-35. Screenshots are JPEGs, the recording is WebM, the player is plain HTML. There is no cloud baseline store, no proprietary diff format, no dashboard you have to log into. Assrt ships as an open-source npm package (npx assrt-mcp), so canceling means deleting node_modules. Comparable enterprise visual-testing platforms charge around $7,500 a month at scale; Assrt is free and self-hosted.

How do I write a scenario that does the equivalent of a visual regression check?

Write what a human reviewer would check, in English, in a .md file. Example: '# Test: Header is teal on / and looks centered' followed by 'Open https://assrt.ai. Take a screenshot. Confirm the navbar background is teal (not dark blue or white). Confirm the logo is left-aligned and the CTA button is right-aligned.' Run with assrt_test pointed at that file. Claude Haiku 4.5 sees the rendered page, decides whether each English claim is true, and emits a TestAssertion with description, passed, and evidence. There is no baseline to commit, no threshold to tune, no diff to approve. The next run will judge the same English claims against whatever HTML the site is serving.

Assrt: open-source AI testing framework
© 2026 Assrt. MIT License.
