A field guide, not a baseline tutorial

Visual regression baselines: eight failure modes no threshold knob fixes.

Every guide on this topic teaches you to capture a golden PNG, tune maxDiffPixels, mask the dynamic regions, and mark the run green. That works for design-system components and breaks for everything else. This is a catalog of where it breaks at the page-journey level, with the line of code in the Assrt source that does the same job without any stored reference image at all.

Assrt is open-source, runs locally via npx assrt-mcp, generates real Playwright code under the hood, and stores zero baseline images on disk. The repo has no references to toHaveScreenshot, pixelmatch, resemble.js, or maxDiffPixels. Verifiable with one grep.

Matthew Diakonov
14 min read

0 baseline images stored anywhere in the assrt or assrt-mcp repos. The TestAssertion type at types.ts:13-17 has three fields: description, passed, evidence. No tolerance, no diff URL.

grep -r 'toHaveScreenshot\|pixelmatch\|maxDiffPixels' /Users/matthewdi/assrt-mcp returns no matches

What baselines are, and why this catalog exists

A baseline is a reference screenshot saved into version control, usually in a __snapshots__ folder. Every later run renders the same page in headless Chromium and pixel-compares against that frozen PNG. If the diff exceeds a tolerance you tune, the test fails. Playwright's toHaveScreenshot, BackstopJS, Percy, Chromatic, and Applitools Eyes all share this primitive even when their dashboards wrap it in a smarter algorithm.

The primitive works for what it was designed for: pixel-faithful regression checks on a single component in a controlled environment. It breaks when you point it at user journeys: dashboards, forms, checkout flows, anything that loads real data, animates, runs across browsers, or ships ten times a week. The eight entries below are the pixel-level reasons it breaks. Each one names the cause, the mitigation every other guide recommends, why the mitigation cracks under real conditions, and what Assrt does instead in the actual source.

Things baselines cannot survive at the journey level

antialiasing drift · font hinting on CI · shimmer loaders · scrollbar chrome · focus ring deltas · emoji font shifts · timestamps in fixtures · devicePixelRatio rounding · OS minor updates · approve-all fatigue · gigabyte __snapshots__ folders · maxDiffPixels creep

What replaces the baseline, in four steps

The baseline is one moving part doing two jobs at once: it is the reference frame and the comparator. Splitting those jobs is the whole idea. The plan is the reference. The model is the comparator. The capture is forensic, not a candidate for diffing. The record is plain text.

The four parts that replace one baseline

  1. Plan

    scenario.md, English #Case headers

  2. Capture

    JPEG after every non-denylist tool

  3. Judge

    Claude Haiku 4.5 on the next message

  4. Record

    TestAssertion: description, passed, evidence
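
The plan in step 1 is an ordinary text file. A minimal sketch with illustrative content (the exact header grammar is whatever the regex parser in scenario-files.ts accepts; the Pass Criteria convention is the one described in the FAQ below):

#Case: Checkout shows a confirmation
Navigate to /checkout, fill in the card form, and submit.

Pass Criteria:
- A success toast appears and includes the order ID
- The total reads as a dollar amount, not NaN or a gray placeholder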

Eight failure modes, with the Assrt source line beside each

Each entry names the pixel-level cause, the standard mitigation, why that mitigation cracks at the user-journey scale, and the file:line in Assrt's source that does the same job without a baseline.

1. Antialiasing drift across renderers

Pixel-level cause
Chromium on Linux, WebKit on macOS, and Firefox each rasterize the same DOM with subtly different sub-pixel coverage. A diagonal border draws with 30% gray on one and 28% gray on another. The diff flags 200+ changed pixels even though the rendered intent is identical.
Standard mitigation
Pin the rendering engine. Run all visual tests in a single browser, single OS, single Docker image. Bump maxDiffPixels until the noise floor disappears.
Why it cracks
Cross-browser tests are now off the table because their baselines fight each other. Bumping maxDiffPixels eats real one-pixel regressions silently. The first time someone runs the suite locally on a different OS, every baseline is 'broken'.
What Assrt does instead
agent.ts:1037 attaches the JPEG to a model that does not care about sub-pixel coverage. The assertion is 'is the border visible' or 'is the table aligned', not 'are the pixels identical'. The same scenario.md runs across browsers without per-browser goldens.

2. Font hinting on CI vs developer machines

Pixel-level cause
macOS uses subpixel-positioned text rendering by default. Linux on CI uses TrueType hinting with grayscale antialiasing. Same font, same size, different glyph shapes. The 'l' character in Inter renders 1px wider on one and the line wraps differently.
Standard mitigation
Bake the font binary into the baseline machine. Use a Docker image with one specific freetype build. Disable subpixel rendering with CSS font-smoothing.
Why it cracks
The font binary changes whenever the CI image is rebuilt. A new freetype patch ships in Ubuntu, every baseline becomes stale. font-smoothing CSS does not affect headless Chromium's text shaping pipeline; it only affects how the rasterized glyph is composited onto the canvas.
What Assrt does instead
The model reads the rendered text from the JPEG (and from the accessibility tree captured at agent.ts:494 alongside the screenshot). Glyph shape does not change what the text says. 'Submit' is 'Submit' whether it rendered with 1px or 2px stroke width.

3. Animated skeletons, shimmer loaders, fade toasts

Pixel-level cause
A 30fps shimmer loader covers 200x60 px and shifts every frame. A fade-in toast finishes its animation between 1.8s and 2.2s depending on CPU pressure. Two consecutive screenshots of a 'stable' page can differ by thousands of pixels.
Standard mitigation
Pass animations: 'disabled' to Playwright's toHaveScreenshot(). Pass mask: [page.locator('.shimmer')]. Wait for a network idle event before the diff.
Why it cracks
animations: 'disabled' clobbers the very animation you are trying to verify (the toast slide-in is the assertion). Masking grows until the baseline is 60% masked rectangles, at which point you are testing nothing. network idle does not mean visual idle.
What Assrt does instead
The wait_for_stable tool at agent.ts:957 watches DOM mutations directly via MutationObserver and waits for two seconds of zero mutations before the model looks. Then the model is asked 'has the success toast appeared with the order ID', and the answer is reasoned, not pixel-counted. No mask required.
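
The mechanic is small enough to sketch. This is an illustration of the idea, not the code at agent.ts:957; it assumes a Playwright page handle, and the two-second quiet window is the one described above:

// Runs in the page: resolve once the DOM has gone quietMs ms with zero mutations.
await page.evaluate((quietMs) => new Promise<void>((resolve) => {
  let timer: ReturnType<typeof setTimeout>;
  const observer = new MutationObserver(restart);
  function restart() {
    // Any mutation restarts the quiet window.
    clearTimeout(timer);
    timer = setTimeout(() => { observer.disconnect(); resolve(); }, quietMs);
  }
  observer.observe(document.body, {
    childList: true, subtree: true, attributes: true, characterData: true,
  });
  restart(); // arm the initial window
}), 2000);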

4. OS chrome: scrollbars, focus rings, native widgets

Pixel-level cause
Windows draws an 18px scrollbar by default. macOS draws a 0px overlay scrollbar that appears on scroll. Chromium on Linux draws a 12px scrollbar with a different gradient. A focused button has a 3px blue ring on Mac, a 2px ring on Windows, a faint dotted outline on Linux.
Standard mitigation
Inject CSS to hide ::-webkit-scrollbar. Force outline: none on every focused element in the test fixture. Pin viewport size precisely.
Why it cracks
outline: none breaks accessibility on the page that is now under test. ::-webkit-scrollbar overrides bleed into the actual production styles. Pinning viewport does not help because the scrollbar inset offsets the content area by a different amount per OS.
What Assrt does instead
The model is shown the JPEG and asked semantic questions. 'Is the navigation bar fully visible' returns true regardless of whether the scrollbar took 18px or 0px. The visual differences exist in the PNG; they just stop being a failure mode.

5. Emoji and color font shifts

Pixel-level cause
A new macOS minor version updates Apple Color Emoji. Twitter switches Twemoji from version 14 to 15. The 🎉 in your hero ships a slightly different palette and outline. Diff: 4000+ pixels in a single emoji.
Standard mitigation
Replace emoji with SVG icons. Or pin Apple Color Emoji to a specific .ttc file in your CI image. Or mask every emoji region.
Why it cracks
Replacing emoji rewrites the product. Pinning fonts on CI is undone the next time someone updates the base image. Masking emoji means you are not testing whether the celebration banner shipped its emoji at all.
What Assrt does instead
The model reads the JPEG and answers 'does the hero contain a celebration emoji'. The exact glyph variant is not the assertion. The assertion is the presence and intent of the emoji.

6. Timestamps, dates, IDs, generated content

Pixel-level cause
Every order page shows a created_at timestamp. Every dashboard shows 'Last synced 4 minutes ago'. Every receipt shows an order ID. None of those values are the same across runs.
Standard mitigation
Mask the timestamp region. Or freeze Date.now() with sinon. Or mock the API to return a fixed payload during visual tests.
Why it cracks
Freezing Date.now() in production-like environments means the test is no longer integration. Mock APIs drift away from real APIs. Masking grows until the baseline is rectangles, see #3.
What Assrt does instead
The model is told to assert 'a created_at timestamp is shown in human-readable form', not 'a created_at timestamp matches a specific string'. The evidence field captures the actual value the model saw, which means a regression that produces 'Invalid Date' or '1970-01-01' is caught without a baseline lookup.
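
Concretely, a passing record might look like this. The values are illustrative; the three-field shape is the TestAssertion at types.ts:13-17:

const assertion: TestAssertion = {
  description: "a created_at timestamp is shown in human-readable form",
  passed: true,
  evidence: "Order header shows 'Created Feb 3, 2026, 2:14 PM' above the line items",
};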

7. Viewport rounding and zoom

Pixel-level cause
1366x768 at devicePixelRatio 1 looks correct. 1367x768 at devicePixelRatio 1 reflows your sidebar to a different breakpoint. devicePixelRatio 2 makes the same baseline four times larger and antialiases differently.
Standard mitigation
Lock viewport to one exact size. Lock devicePixelRatio to one value. Generate one baseline per (width, dpr) tuple you care about.
Why it cracks
Storage of N goldens per page across M viewports across K browsers explodes. The baseline folder reaches gigabytes. Reviewing a diff that changes 12 goldens for one CSS edit is a multi-screen exercise, and most reviewers click 'approve all'.
What Assrt does instead
TestRunOptions.viewport at types.ts:46 takes a preset string or a {width, height} object. The same scenario runs at any viewport with the same plan; the model sees the rendered layout and asserts that the layout is correct for the viewport it is in. No per-viewport goldens to maintain.
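
In practice, something like this. runScenario is a hypothetical entry point used for illustration; the two accepted shapes of viewport are the ones at types.ts:46:

// Same plan, same assertions, two viewports. No goldens created either way.
await runScenario("scenario.md", { viewport: "mobile" });                      // preset string
await runScenario("scenario.md", { viewport: { width: 1366, height: 768 } }); // exact size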

8. Legitimate design changes (the slow-poison failure)

Pixel-level cause
Marketing tweaks the hero gradient. Dark mode is added. The footer gets a new column. None of these are bugs. All of them invalidate every baseline that included the changed region.
Standard mitigation
Run the suite with --update-snapshots. Review the diffs. Approve and commit the new baselines.
Why it cracks
The diff viewer shows a wall of binary blobs. After the third 'approve all' click, no one is reading the diffs. A real regression that snuck in during the gradient change ships. This is the failure mode that has caused every public 'we shipped a regression that our visual tests should have caught' postmortem.
What Assrt does instead
The plan is the source of truth, not a stored image. When marketing changes the hero gradient, no plan needs updating. When a real regression appears, the model evaluates against the unchanged plan and the assertion fails on real signal. There is no 'approve all goldens' step that buries the regression.

The line every tutorial writes vs the line in Assrt

Every guide on this topic ends with a code snippet that calls expect(page).toHaveScreenshot(). That is the line that creates the baseline and locks the test to it forever. The Assrt loop is shaped differently: capture is decided by a denylist, the JPEG attaches to a tool-result message, and the model on the other side reasons. Both blocks are below.

Same job, two primitives

// Standard Playwright visual regression tutorial
import { test, expect } from "@playwright/test";

test("dashboard renders correctly", async ({ page }) => {
  await page.goto("/dashboard");
  // The whole assertion is a pixel diff against a stored PNG.
  // First run writes __snapshots__/dashboard.png. Every later run
  // pixelmatches against that PNG. The only knob is maxDiffPixels.
  await expect(page).toHaveScreenshot("dashboard.png", {
    maxDiffPixels: 100,
    animations: "disabled",
    mask: [page.locator(".timestamp"), page.locator(".shimmer")],
  });
});
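
And the Assrt side, condensed. This is a sketch of the loop around agent.ts:1024-1037 with illustrative variable names; the denylist and the message shape are the ones documented below:

// Tools that do NOT trigger a capture (the denylist at agent.ts:1024).
const NO_CAPTURE = new Set([
  "snapshot", "wait", "wait_for_stable", "assert", "complete_scenario",
  "create_temp_email", "wait_for_verification_code", "check_email_inbox",
  "screenshot", "evaluate", "http_request",
]);

if (!NO_CAPTURE.has(toolName)) {
  // Forensic frame, never a diff candidate. It rides the next
  // tool-result message to the model (attach site: agent.ts:1037).
  const screenshotData = (await page.screenshot({ type: "jpeg" })).toString("base64");
  toolResult.content.push({
    type: "image",
    source: { type: "base64", media_type: "image/jpeg", data: screenshotData },
  });
}
// The model answers against the plan with { description, passed, evidence }.
// No golden is written, read, or compared anywhere on this path.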

What is actually on disk after an Assrt run

None of this is a baseline. None of it is compared to anything else. All of it is forensic, reviewable in any browser, and yours to keep.

  • /tmp/assrt/<runId>/screenshots/00_step1_navigate.png (per-step PNG, naming convention at server.ts:468)
  • /tmp/assrt/<runId>/videos/recording.webm (full session video, finalized at server.ts:577-594)
  • /tmp/assrt/results/latest.json (TestReport JSON, shape at types.ts:28-35)
  • /tmp/assrt/scenario.md (the plan, plain text, regex parser at scenario-files.ts)
  • /tmp/assrt/<runId>/videos/player.html (self-contained replay, opens in any browser)

Every one of those paths is a real artifact you can ls after running npx assrt-mcp run --url http://localhost:3000. There is no proprietary baseline format, no signed cloud URL, no DB row to lose access to.

Baselines vs Assrt's baseline-free loop

The dimensions that matter when you are trying to keep page-level visual checks honest at scale.

Feature | Pixel-diff baseline tools | Assrt
Stores reference frames in version control | Yes — __snapshots__ folder of PNGs | No — zero baseline images on disk
Tolerance knob you tune to suppress noise | maxDiffPixels, threshold, antialias options | None — TestAssertion has no tolerance field
Comparison primitive on each run | pixelmatch / resemble.js / proprietary diff | Model reasoning over a JPEG (agent.ts:1037)
When the design legitimately changes | --update-snapshots, then click 'approve all' | Edit the plan if needed. Often no edit at all.
Cross-browser without separate baselines | One golden per browser per viewport | One scenario.md, any browser, any viewport
Failure signal that pastes into Slack | Three-panel diff viewer, binary PNGs | Plain-text evidence string from the model
Fits in a self-hosted, open-source stack | Often $7,500+/mo SaaS for review workflow | Yes — open-source MCP server, npx assrt-mcp
Output is real Playwright code you keep | Proprietary YAML, low-code, or DB rows | Yes — Playwright runner under the hood

For component-level pixel-faithful checks (a 1px border on a design system primitive), keep toHaveScreenshot. The two layers answer different questions.

How the denylist actually decides

The closest thing the Assrt codebase has to a baseline policy is a single list at /Users/matthewdi/assrt-mcp/src/core/agent.ts:1024. Eleven tool names. Anything not in that list captures a JPEG. Anything in the list does not. That is the entire capture decision.

  1. snapshot
  2. wait
  3. wait_for_stable
  4. assert
  5. complete_scenario
  6. create_temp_email
  7. wait_for_verification_code
  8. check_email_inbox
  9. screenshot
  10. evaluate
  11. http_request

The reasoning behind each exclusion is concrete. The pure-text tools (assert, complete_scenario, create_temp_email, check_email_inbox, http_request) do not change the page, so a fresh JPEG would only repeat the previous frame. The wait tools are instrumentation; the model reasons about what changes during them from the next visual action's screenshot. The screenshot tool already returns its own image, no redundant capture needed. The evaluate tool runs JavaScript that often does not produce visible changes. Every other tool (navigate, click, type_text, select_option, scroll, press_key) does change the page, so a JPEG is captured and attached to the next message.

Frequently asked questions

What is a 'visual regression baseline' in plain English?

A baseline is a screenshot saved into version control (typically a __snapshots__ folder of golden PNGs) that every future test run is pixel-compared against. Playwright's toHaveScreenshot(), Storybook's Chromatic, BackstopJS, Percy, and Applitools Eyes all share this primitive, even when they wrap the diff in a smarter algorithm. The 'regression' is detected when the new render differs from the saved baseline by more than maxDiffPixels (or whatever the tool calls its tolerance knob). The baseline is the reference frame and the comparator simultaneously.

Why do teams say baselines are 'flaky'?

Because the baseline is a frozen artifact of one specific runtime (a font cache, an OS rasterizer, a GPU, a browser version, a moment in time) and every future run happens in a slightly different runtime. Antialiasing changes one pixel. Apple's text rendering vs Chromium on Linux changes a glyph by a sub-pixel. The CI machine has no Helvetica installed and falls back to a different font. A toast that appeared at 1.9s in one run appears at 2.1s in the next. The baseline does not know which of those changes are real bugs and which are noise. So you raise maxDiffPixels until the noise is masked, and now you also mask real regressions. Every page on the internet calls this 'flakiness'. It is not flakiness, it is the wrong primitive.

What does Assrt do instead of storing a baseline?

It does not store one. After every visual action (navigate, click, type_text, select_option, scroll, press_key) the agent captures a JPEG and attaches it to the next tool-result message as { type: 'image', source: { type: 'base64', media_type: 'image/jpeg', data: screenshotData } }. The exact attach site is /Users/matthewdi/assrt-mcp/src/core/agent.ts line 1037. The model receiving that message (Claude Haiku 4.5 by default, set at agent.ts line 9) reasons about the frame against the plain-English plan and returns a pass/fail assertion with an evidence string. There is no diff. There is no folder of golden PNGs. The repo has zero references to toHaveScreenshot, pixelmatch, resemble, or maxDiffPixels.

How does the agent decide when to capture a JPEG?

By denylist. agent.ts line 1024 contains an explicit list of eleven tool names that do NOT trigger a capture: snapshot, wait, wait_for_stable, assert, complete_scenario, create_temp_email, wait_for_verification_code, check_email_inbox, screenshot, evaluate, http_request. Anything else (every actual interaction with the page) takes a JPEG and pushes it to the model. This is the closest thing the codebase has to a baseline policy: a list of when NOT to ask the model to look. Compare that to a Playwright project where every test author writes their own page.screenshot() / toHaveScreenshot() calls and decides which goldens to commit; the policy is centralized in one file and one list.

Are the captured JPEGs ever compared to each other?

No. Every JPEG is forensic, not comparative. The model sees the most recent frame plus the result text from the tool that just ran, and it decides what to do next. After the run, the screenshots are written to /tmp/assrt/<runId>/screenshots/<index>_step<stepNumber>_<action>.png (the naming convention is at server.ts line 468). Nothing in the codebase compares any of those PNGs to anything else. They sit on disk as evidence next to the WebM video. If you want to interrogate the recording, that is what assrt_analyze_video is for, and it sends the whole video to Gemini, not the per-step PNGs.

When are pixel-diff baselines still the right tool?

For component-level visual regression on a design system where a 1px border change is a real regression, keep Playwright's toHaveScreenshot() with maxDiffPixels: 0 against a controlled headless render. The realistic stack is layered: pixel-diff baselines for design-system components where the assertion really is 'these pixels must match', and a baseline-free model loop for page-level user journeys where the assertion is 'is the form in the right state, did the success toast appear, is the avatar a real image'. The two answer different questions. Assrt's claim is not that baselines are useless. It is that using them at the user-journey layer is a category error.

What is the TestAssertion that comes out the other end?

Three fields. /Users/matthewdi/assrt-mcp/src/core/types.ts lines 13-17 declares: description (the English thing the model checked), passed (boolean), evidence (free-text describing what the model saw). No tolerance, no diffPixelCount, no diff image URL, no baseline path. A failed assertion looks like 'passed: false, evidence: the Submit button has no visible label, only a loading spinner'. That is the full fail signal. It pastes into a PR comment without a three-pane diff viewer. It also pastes into a Slack message and reads correctly to a non-engineer.
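
Small enough to reconstruct in full from that field list (the authoritative declaration lives at types.ts:13-17):

interface TestAssertion {
  description: string; // the English thing the model checked
  passed: boolean;     // the verdict
  evidence: string;    // free text: what the model actually saw
}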

Where does this stop working? When should I be skeptical of the AI loop?

Two places. First, if your page has a sub-pixel layout regression that does not change any meaning the model can read (a 1px border shift, a 2px line-height drift, a kerning change), the model will not flag it. Use a pixel-diff baseline scoped to that component. Second, if your test plan is vague ('check the page looks right'), the model will return vague evidence. Tighten the plan with a Pass Criteria section listing the visual properties the model should assert (the order ID is shown, the price field reads $X, the avatar is not a gray placeholder). Vague prompts produce vague evidence regardless of the model.

Can I switch models without rewriting plans?

Yes. Provider selection lives at agent.ts (Provider type 'anthropic' | 'gemini'). The plan format does not change. The screenshot capture policy does not change. The denylist at agent.ts line 1024 does not change. Only the model that receives the JPEG changes. So you can A/B Claude Haiku 4.5 against Gemini against the same scenario.md and compare the evidence strings. With pixel-diff baselines this is impossible: switching tools means switching baseline format, threshold semantics, and folder structure all at once.
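
The entire switch surface is one union type, as stated above:

type Provider = "anthropic" | "gemini"; // agent.ts: swap the judge, keep the plan and capture policy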

How is this different from the Assrt visual-regression-tutorial page?

The tutorial page at /t/visual-regression-tutorial walks through running an end-to-end test without baselines: the npx command, the scenario.md format, the on-disk artifacts. This page is a reference. It is organized as a taxonomy of what specifically goes wrong with baselines at the page-journey level, with each failure mode mapped to the Assrt source line that replaces it. Read the tutorial first if you want to run a test. Read this if you want to argue against baselines in a design review or pick which scenarios should still keep them.

Want to walk through your __snapshots__ folder together?

Bring a Playwright project that has visual baselines that have been hurting. We will model what a baseline-free version of the same suite looks like and where to keep the pixel-diff layer.

assrt · Open-source AI testing framework
© 2026 Assrt. MIT License.
