Visual regression testing screenshots, the three pathways teams confuse for one.
Most guides on this topic write "screenshot" as if there is one kind of artifact. In a modern suite there are three, with different formats, different lifecycles, and different best-fit detection problems. Pick the wrong pathway for the wrong question and the screenshot becomes the source of flake instead of the instrument that catches it.
Direct answer, verified 2026-05-08
A regression screenshot has three viable shapes today. A lossless PNG baseline committed to the repo and pixel-diffed on every CI run. An ephemeral JPEG quality 50 frame sent to a vision LLM that judges the screen against a plain-English scenario, then discarded. A WebM frame captured at a fixed viewport with a DOM overlay (cursor, click ripple, keystroke toast, heartbeat) for human replay after the run. Each format encodes a different bet about who or what is doing the judging. Pick by the question you are asking, not by what your test runner happened to ship with.
Sources cross-checked against Playwright snapshot docs and the open-source assrt-mcp source.
The three pathways, side by side
Strip the marketing off any visual regression product and the screenshot is doing one of three jobs. The job determines the format, the format determines the lifecycle, and the lifecycle determines what the screenshot can actually catch. The diagram below traces the three pathways from capture to disposition.
Diagram: what a regression screenshot can become. A capture moment (viewport, timing, mask) branches into three artifacts: a PNG baseline (lossless, on disk, pixel-diff), a JPEG q50 frame (lossy, in-memory, LLM-judged), and a WebM frame (with overlay, archival, replay).
Pathway 1: PNG baseline
Lossless, committed beside the test (Playwright's default is tests/<file>-snapshots/), diffed by pixelmatch or odiff on every PR. Catches a 1px border drift. Costs reviewer attention per intentional UI change. The tax scales linearly with surface area.
Best for: design-system primitives, icon sets, marketing screenshots, isolated component pages.
Pathway 2: JPEG quality 50 frame
Lossy on purpose. Captured fresh after every visual action, base64-encoded into one image part of the next LLM message, judged in plain English against the scenario, then garbage collected. Source: browser.ts:601.
Best for: full user journeys, functional checks, anything where a 1px shift is not the question.
Pathway 3: WebM frame, with overlay
Captured at a fixed 1600x900 viewport with an injected DOM overlay (red cursor, ripple, keystroke toast, heartbeat dot) that the screenshot pathway never sees. The recording survives the run on disk for human replay.
Best for: post-failure forensics, sharing a repro with a teammate, sanity-checking that the agent did what you think it did.
Same intent, two screenshot formats
Below is the same UI check expressed first as a PNG-baseline pathway, then as a JPEG-vision pathway. Read both. The capture step is one line in each. The lifecycle around the capture is what differs.
Two screenshot pathways, one assertion
// tests/checkout.spec.ts
import { test, expect } from "@playwright/test";
test("checkout looks right", async ({ page }) => {
await page.goto("/checkout");
await page.waitForLoadState("networkidle");
// Lossless PNG baseline lives at:
// tests/checkout.spec.ts-snapshots/checkout-chromium-darwin.png
// Diff engine: pixelmatch. Threshold: maxDiffPixels.
// Lifecycle: committed to the repo, reviewed on every PR.
await expect(page).toHaveScreenshot("checkout.png", {
fullPage: true,
animations: "disabled",
mask: [page.locator("[data-testid=order-id]")],
maxDiffPixels: 100,
});
});

The PNG pathway will catch a 1px border drift; the JPEG pathway will tolerate one and instead catch a coupon flow that silently broke while the layout stayed pixel-identical to last week. Two real regressions, two different instruments.
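The JPEG-vision counterpart replaces the test file with a plain-English scenario. The sketch below is hypothetical: the filename and the exact phrasing are inferred from how this page describes the pathway, not copied from Assrt's documented scenario grammar.

# scenarios/checkout.md (hypothetical sketch, not Assrt's documented grammar)
Navigate to /checkout and wait for the page to settle.
Apply the coupon SAVE10.
Check that a discount line appears in the order summary and the total drops.
Check that a toast reads "Coupon applied".

No baseline file, no threshold, no mask list. Each fresh JPEG q50 frame is judged against those lines, then released after the model turn.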
Why the WebM frame carries content the JPEG does not
The third pathway is the one most teams underuse. A WebM recording and a JPEG screenshot both come from the same browser, but they read two different surfaces. The screenshot is a page snapshot. The WebM is a screencast frame. Anything injected into the DOM after the page loads will appear in the screencast and not in the snapshot, which means you can enrich the human-replay artifact without ever contaminating what the model sees.
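The enrichment mechanics are simple to sketch. Below is a minimal, hypothetical version of injecting a cursor marker before recording starts; the element id and event wiring are ours for illustration, and Assrt's actual overlay in browser.ts differs in detail.

// Inside a Playwright test or driver script, before recording begins.
// "__demo_cursor" is a hypothetical id, not Assrt's.
await page.evaluate(() => {
  const cursor = document.createElement("div");
  cursor.id = "__demo_cursor";
  cursor.style.cssText =
    "position:fixed;width:20px;height:20px;border-radius:50%;" +
    "background:rgba(239,68,68,0.85);pointer-events:none;z-index:2147483647;";
  document.body.appendChild(cursor);
  // Track pointer position so the replay shows where actions landed.
  document.addEventListener("mousemove", (e) => {
    cursor.style.left = `${e.clientX - 10}px`;
    cursor.style.top = `${e.clientY - 10}px`;
  });
});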
What ends up in the bytes
The page as the user sees it. Plain DOM. No cursor, no ripple, no keystroke toast, no heartbeat. The model judges the screen the same way a human visiting the URL would. The bytes never touch disk; they live for one model turn and are released.
- No cursor overlay
- No keystroke toast
- No heartbeat dot
- Discarded after one model turn
The capture lifecycle, six moments that matter
Most flake in a visual regression suite traces back to one of these six moments going wrong. The order matters. A screenshot taken before the DOM is stable produces a fresh failure on every run; a screenshot taken at a different viewport on CI than on dev produces a phantom regression.
1. Pin viewport: 1600x900, document the choice
2. Wait for stable: MutationObserver, 2s quiet window
3. Capture: PNG lossless, or JPEG q50
4. Judge: pixelmatch threshold, or LLM read
5. Record: WebM frame with overlay (optional)
6. Dispose: commit, garbage collect, or archive
The wait_for_stable step is the one most home-grown setups skip, and the resulting flake gets blamed on the screenshot itself. The pattern is concrete: a MutationObserver attached to document.body polls every 500ms, the agent only proceeds once 2 seconds pass with zero childList, subtree, or characterData mutations, and the ceiling is 30s before it gives up. A fast SPA settles in 400ms; a streaming chat UI churns for 4 seconds. Both are handled by the same primitive, and neither requires the test author to guess a timeout.
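A minimal sketch of that primitive, written here as an in-page evaluate; the names and structure are ours, the constants (500ms poll, 2s quiet window, 30s ceiling) are the ones described above.

// Inside a Playwright test body. Resolves true once the DOM has been
// quiet for 2s; gives up and resolves false at the 30s ceiling.
const settled = await page.evaluate(() =>
  new Promise<boolean>((resolve) => {
    const QUIET_MS = 2_000, POLL_MS = 500, CEILING_MS = 30_000;
    const started = Date.now();
    let lastMutation = Date.now();
    const observer = new MutationObserver(() => { lastMutation = Date.now(); });
    observer.observe(document.body, {
      childList: true, subtree: true, characterData: true,
    });
    const timer = setInterval(() => {
      const now = Date.now();
      if (now - lastMutation >= QUIET_MS || now - started >= CEILING_MS) {
        clearInterval(timer);
        observer.disconnect();
        resolve(now - lastMutation >= QUIET_MS);
      }
    }, POLL_MS);
  })
);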
Numbers that decide which pathway wins
The numbers below are observable in the open-source assrt-mcp source and the standard Playwright defaults. They are what each pathway actually costs and produces, not vendor benchmark claims.
“The team approves 30+ baseline diffs a week and at this point nobody actually looks at them.”
Engineering lead, mid-size SaaS, post baseline rollout
The on-disk wrinkle worth knowing
One real quirk in the Assrt screenshot pipeline that nobody flags: the JPEG bytes that the LLM judged are also written to disk for the replay player, but with a .png filename extension. The directory and filename pattern are declared at /Users/matthewdi/assrt-mcp/src/mcp/server.ts:431 and :468:
// server.ts:431
const screenshotDir = join(runDir, "screenshots");
// server.ts:468 (filename pattern)
const filename = `${String(screenshotIndex).padStart(2, "0")}_step${currentStep}_${currentAction || "init"}.png`;

So a run lands at /tmp/assrt/<runId>/screenshots/00_step1_navigate.png, 01_step2_click.png, and so on. The bytes inside are JPEG quality 50; the extension is .png only because the replay player serves the file as raw bytes and most browsers autodetect the format from the magic header anyway. Worth knowing if you ever scp a folder to a teammate and they open it in a strict image viewer, because some viewers refuse a JPEG inside a .png. Rename if you care; otherwise let it be.
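If you want to confirm the wrinkle yourself, the magic bytes settle it: a JPEG starts with FF D8 FF, a PNG with 89 50 4E 47. A minimal Node sketch (the script name is ours):

// sniff.ts: report what a file actually contains, regardless of extension.
import { readFileSync } from "node:fs";

const bytes = readFileSync(process.argv[2]); // e.g. 00_step1_navigate.png
const isJpeg = bytes[0] === 0xff && bytes[1] === 0xd8 && bytes[2] === 0xff;
const isPng = bytes[0] === 0x89 && bytes[1] === 0x50 && bytes[2] === 0x4e;
console.log(isJpeg ? "JPEG bytes" : isPng ? "PNG bytes" : "something else");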
Decision criteria, by the question you are asking
Pick by intent, not by what your runner shipped with. The shorthand below covers the bulk of real cases.
- Is the question pixel-level (1px borders, exact colors, icon fidelity)? Use a PNG baseline pathway. Pixel-diff exists for this. A vision model will not reliably catch a 1px shift.
- Is the question functional (did the discount apply, did the modal open, did the toast read "Coupon applied")? Use a JPEG q50 vision pathway. The question is layout meaning, not pixel state, and the maintenance per intentional UI change drops to zero.
- Are you trying to share a repro with a teammate after a CI failure? Use the WebM-frame pathway. A 30-second recording with the cursor overlay communicates more than eight paragraphs of Slack.
- Are you trying to prove the agent did what you think it did? Combine: judge with JPEG q50 during the run, archive the WebM after. The model decides pass or fail, the human watches the recording when something looks off.
- Is your suite already drowning in PNG diffs that nobody actually reviews? That is the symptom of pathway-mismatch, not a tooling failure. Move full-flow checks to the vision pathway, scope PNG diffs back to the design system, and the review queue clears within a sprint.
What to verify in the source if you want to confirm any of this
Every line-level claim on this page traces to the open-source assrt-mcp repository. The four greps below cover the load-bearing ones; run them yourself and inspect what comes back.
# 1. The JPEG q50 capture line
grep -n "browser_take_screenshot" assrt-mcp/src/core/browser.ts
# 2. The viewport declaration (CLI launch + WebM recorder)
grep -n "1600x900\|width: 1600" assrt-mcp/src/core/browser.ts
# 3. The injected DOM overlay (cursor, ripple, toast, heartbeat)
grep -n "__pias_heartbeat\|__pias_cursor\|__pias_toast" assrt-mcp/src/core/browser.ts
# 4. The on-disk filename pattern (JPEG bytes, .png extension)
grep -n "screenshotDir\|step\${currentStep}" assrt-mcp/src/mcp/server.tsWant to see all three pathways running on one of your flows?
15 minutes. We point Assrt at one of your real pages, run it live, and you walk away with the WebM, the JPEGs, and the scenario .md. No baseline files committed.
Frequently asked questions
What is a visual regression testing screenshot in plain language?
It is a captured image of UI state, taken at a known moment, used to detect a visual change you did not intend. The capture itself is the simple part. The hard parts are deciding the format (lossless PNG so a single pixel of drift is detectable, or lossy JPEG because a vision model is doing the judging), the timing (after fonts load, after animations settle, after the network goes quiet), the viewport size (so two runs on different machines produce comparable frames), and what you do with the bytes after (commit them as a baseline, send them to a model, drop them into a video, throw them away). The format and lifecycle decisions matter more than the capture step. Most flake comes from getting one of those wrong, not from the screenshot itself.
How are PNG baselines and JPEG vision-model screenshots actually different?
A PNG baseline lives on disk forever. It is committed to the repo, it gets diffed against a fresh PNG on every CI run, and the diff engine flags any pixel-level deviation that exceeds a numeric threshold. The format is lossless on purpose, because a 1px border or a 1-shade color shift is exactly what the test exists to catch. A JPEG quality 50 vision-model screenshot lives in memory for one model turn. It is base64-encoded into a single image part of an Anthropic or Google API request, the model reads it against a plain-English checklist, and the bytes are discarded. The format is lossy on purpose, because the model is judging layout meaning and the compression artifacts at q50 are below the model's perceptual threshold while halving the per-request payload. Same word, different artifact, different lifecycle, different cost line.
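For concreteness, this is roughly what the one-turn artifact looks like on the wire, using the Anthropic messages image-part shape; the variable names are ours and the snippet is a sketch of the pattern, not Assrt's request code.

// Inside a Playwright-driven agent loop.
const jpeg = await page.screenshot({ type: "jpeg", quality: 50 });
const message = {
  role: "user" as const,
  content: [
    { type: "text" as const, text: "Does the order summary show the discount?" },
    {
      type: "image" as const,
      source: {
        type: "base64" as const,
        media_type: "image/jpeg" as const,
        data: jpeg.toString("base64"), // lives for this turn only
      },
    },
  ],
};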
Where does the third pathway, the WebM frame, fit in?
It is the human-replay artifact. Pixel diffs and vision-model checks both fail the test or pass it, but neither lets a human re-watch what happened. A WebM recording captures every compositor frame at a known viewport (Assrt records at 1600x900, declared at /Users/matthewdi/assrt-mcp/src/core/browser.ts:628), and the trick is that the recording itself can be enriched. Assrt injects a DOM overlay before recording starts: a 20px red cursor at rgba(239,68,68,0.85), a click ripple that scales 0.5 to 2.0 over 50ms, a green monospace keystroke toast that fades after 2500ms, and a 6px green heartbeat dot at the bottom-right pulsing every 800ms. The overlay only appears in the WebM, not in the JPEGs the model judges, because the screenshot tool reads the page snapshot and the recorder reads the screencast. Different surfaces, different content, same source page.
Why does the heartbeat dot exist? It looks decorative.
It is not decorative, it is a compositor primitive. WebM recordings driven by Chrome's screencast API only emit a frame when the compositor decides something changed. If your scenario sits in a 6-second wait_for_stable while a streaming response settles, those 6 seconds can collapse into a single repeated frame, which makes the recording look like the browser hung. A 6px dot at the bottom-right that pulses every 800ms forces a guaranteed change in pixel state on a regular cadence, so the screencast keeps emitting frames during idle periods. The recording stays continuous, so the human reviewer can scrub through it and see real wall-clock time. The line that does this is at /Users/matthewdi/assrt-mcp/src/core/browser.ts:44-48 (the heartbeat.animate call with iterations: Infinity).
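The primitive is a few lines of the Web Animations API. A hypothetical sketch of the idea, run in the page before recording starts (the styling and keyframes are ours, not a copy of Assrt's):

// A tiny dot that guarantees a pixel change every 800ms, so the
// screencast keeps emitting frames through otherwise idle waits.
const dot = document.createElement("div");
dot.style.cssText =
  "position:fixed;right:8px;bottom:8px;width:6px;height:6px;" +
  "border-radius:50%;background:#22c55e;z-index:2147483647;";
document.body.appendChild(dot);
dot.animate(
  [{ opacity: 1 }, { opacity: 0.2 }, { opacity: 1 }],
  { duration: 800, iterations: Infinity },
);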
Why JPEG quality 50 specifically, and not PNG, for the model-judged screenshot?
Because you pay for it twice and the model would not see the difference. Every screenshot Assrt sends to the LLM is billed by image pixel count and by base64 payload size. JPEG quality 50 cuts the payload to roughly 30 to 50% of a lossless PNG of the same viewport, which compounds across a 30-step scenario. The compression artifacts at q50 are visible to a pixel-diff engine but not to a vision model judging layout, color, and text legibility. The exact line that makes the call is at /Users/matthewdi/assrt-mcp/src/core/browser.ts:601: callTool('browser_take_screenshot', { type: 'jpeg', quality: 50 }). Drop the quality lower and the model starts mis-reading small text. Bump it higher and the bill goes up without affecting any judgment. q50 is empirical, not arbitrary.
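The payload claim is measurable on any page of yours with two standard Playwright calls; a sketch:

// Compare base64 payload sizes for the same viewport, PNG vs JPEG q50.
const png = await page.screenshot({ type: "png" });
const jpg = await page.screenshot({ type: "jpeg", quality: 50 });
console.log(
  `png: ${png.toString("base64").length} chars base64, ` +
  `jpeg q50: ${jpg.toString("base64").length} chars base64`,
);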
When in the lifecycle should the screenshot actually be taken?
After every visual action, but only after the DOM has stopped mutating. Assrt takes a fresh screenshot only when the prior tool was a navigate, click, type_text, scroll, press_key, or select_option. Anything else (snapshot, wait, assert, evaluate, http_request) does not trigger a capture, because the screen has not changed. The exclusion list is hard-coded at /Users/matthewdi/assrt-mcp/src/core/agent.ts:1024. Before the screenshot is taken, the agent calls wait_for_stable when needed: a MutationObserver attached to document.body polls every 500ms and the agent only proceeds once 2 seconds pass with zero childList, subtree, or characterData mutations (default, configurable, ceiling 30s). That is the difference between 'we caught the regression' and 'we caught the spinner mid-rotation'.
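The trigger logic reduces to a set-membership check. A sketch of the shape (the set contents come from the list above; the function name is ours):

// Only tools that visibly change the screen trigger a fresh capture.
const VISUAL_TOOLS = new Set([
  "navigate", "click", "type_text", "scroll", "press_key", "select_option",
]);

function shouldCapture(lastTool: string): boolean {
  return VISUAL_TOOLS.has(lastTool); // snapshot, wait, assert, etc. are skipped
}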
Can I keep both PNG baselines and vision-model screenshots in the same suite?
Yes, and it is what most mature suites end up doing. The two are complementary, not competing. PNG baselines belong on isolated component pages (Storybook routes, design-system primitives, marketing screenshots, icon sets) where a 1px shift is a real regression and the surface area is small enough that the baseline-approval workflow does not consume the team. Vision-model screenshots belong on full user journeys (checkout, signup, dashboard, search) where the question is functional rather than pixel-level and the surface area is too large for baseline review to scale. Run both, scope each to what it is good at, and stop pretending one tool covers the whole problem.
Where do the on-disk screenshots end up if I want to inspect them after a run?
Each Assrt run writes its screenshots to /tmp/assrt/<runId>/screenshots/, with filenames following the pattern {index:02d}_step{N}_{action}.png (the directory is created at /Users/matthewdi/assrt-mcp/src/mcp/server.ts:431, the filename pattern is at server.ts:468). One small wrinkle worth knowing: the bytes inside those files are JPEG quality 50, even though the extension is .png. The pattern preserves a stable filename ordering for the replay player, but if you open one in an image viewer, your viewer is reading JPEG headers under a .png extension. Rename to .jpg if your tooling cares, or just let the files exist as-is. The replay player serves them as bytes regardless of extension.
What viewport size should I take screenshots at, and why does it matter?
Pick one viewport, hard-code it everywhere, and never change it without explicitly retaking baselines. Assrt uses 1600x900 across the entire pipeline: the @playwright/mcp launcher is started with --viewport-size 1600x900 at /Users/matthewdi/assrt-mcp/src/core/browser.ts:296, and the WebM recording is started with size: { width: 1600, height: 900 } at browser.ts:628. The reason for picking once and pinning is that responsive layouts cross breakpoint boundaries at viewport changes, and a screenshot taken at 1280px wide is genuinely a different screen than the same scenario at 1600px wide, even when no code has changed. Pin it once, document the choice, and cross-reference any responsive testing as a separate suite.
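In a Playwright-based suite the pin is a few lines of config, using standard options; a sketch:

// playwright.config.ts: pin the viewport once, project-wide.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  use: {
    viewport: { width: 1600, height: 900 },
    // If you record video too, pin the recording to the same size.
    video: { mode: "retain-on-failure", size: { width: 1600, height: 900 } },
  },
});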
How do I avoid the screenshot itself becoming the source of flake?
Three rules cover most of it. First, never screenshot before the DOM is stable. A network-driven SPA is not done rendering 50ms after navigate; wait for mutations to settle (Assrt uses a 2s quiet window with a 30s ceiling). Second, pin everything that the rendered pixel state depends on: the viewport size, the operating system if you are doing pixel-diff, the timezone if your UI shows times, the locale if your UI shows numbers, the fonts (preload them, do not let them lazy-load mid-test). Third, mask elements that are honest non-determinism (timestamps, generated order IDs, ad iframes) so the diff engine does not flag them as regressions. If you are using a vision-model pathway instead, the second and third rules relax considerably, because the model tolerates anti-aliasing and font-load timing the way a human reviewer does.
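The first and third rules translate to a few lines in the test body; a sketch using standard web and Playwright APIs (the route and selectors are hypothetical):

import { test, expect } from "@playwright/test";

test("dashboard visual", async ({ page }) => {
  await page.goto("/dashboard");
  // Rule 1: wait for fonts before capturing.
  await page.evaluate(() => document.fonts.ready.then(() => true));
  // Rule 3: mask honest non-determinism so the diff engine ignores it.
  await expect(page).toHaveScreenshot("dashboard.png", {
    animations: "disabled",
    mask: [
      page.locator("[data-testid=timestamp]"),
      page.locator("[data-testid=order-id]"),
    ],
  });
});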
Adjacent angles on the same problem
Related guides
Visual regression testing, the honest field guide
What pixel-diff baselines actually cost in PR review time, and a no-baseline alternative that records a Playwright WebM you can question after the fact.
Playwright visual testing without baseline PNGs
How an LLM watches every screen change in Assrt, what the JPEG quality 50 capture path looks like end-to-end, and why the model never sees the cursor overlay.
Visual regression baselines
The baseline approval workflow, where it scales, where it falls apart, and how to scope it so PR review stays sane.