The best visual regression tools, ranked by what they actually do.
Every other page for this keyword lists 10 to 20 brands as if they were 10 to 20 categories. They are not. Under the hood, most of them call pixelmatch (or a pixelmatch-equivalent) against a committed baseline PNG. One tier adds an AI overlay on top of that diff. One tool, Assrt, replaces the diff entirely with a multimodal LLM reading the frame as a JPEG. Three tiers, not twenty brands.
The premise every other "best of" page skips
If a list ranks Percy, Chromatic, and BackstopJS as three different things, it is ranking the same algorithm under three different brands.
That is not a dig at any of them. Pixel-diff against a committed baseline is a great primitive for the job it is designed for. What this page does differently: group tools by the primitive they implement, so you can pick a tier instead of a brand. The tier you pick is the architectural decision. The brand inside the tier is mostly UX taste.
The three tiers, with every tool in its real bucket
Here is the whole field sorted by comparison primitive. Tier 1 is where most of the cited tools live, because pixel-diff is the default since 2010. Tier 2 is a small group that runs a model over the diff regions to filter noise. Tier 3 is Assrt.
Tier 1 — Baseline pixel-diff
A committed golden PNG, a pixelmatch call, a tolerance knob. The default visual regression primitive since 2010. Works great on locked design systems, burns CI time on fast-moving products.
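To make the primitive concrete, here is a toy re-implementation of the tier-1 idea: count pixels whose channel delta exceeds a tolerance, then compare the count to a maxDiffPixels knob. Real tools use pixelmatch, which adds anti-aliasing detection and perceptual color distance on top of this; all names here are illustrative, not from any tool's source.

```typescript
// Toy sketch of the tier-1 primitive: per-channel RGBA delta vs a tolerance,
// then mismatch count vs a maxDiffPixels knob. Illustration only.
function countDiffPixels(
  baseline: Uint8ClampedArray, // committed golden frame, RGBA byte stream
  current: Uint8ClampedArray,  // freshly captured frame, same dimensions
  tolerance = 25,              // per-channel delta, 0-255
): number {
  let mismatched = 0;
  for (let i = 0; i < baseline.length; i += 4) {
    const delta = Math.max(
      Math.abs(baseline[i] - current[i]),         // R
      Math.abs(baseline[i + 1] - current[i + 1]), // G
      Math.abs(baseline[i + 2] - current[i + 2]), // B
    );
    if (delta > tolerance) mismatched += 1;
  }
  return mismatched;
}

// Two 2-pixel frames: first pixel identical, second shifted far past tolerance.
const golden = new Uint8ClampedArray([255, 0, 0, 255, 10, 10, 10, 255]);
const actual = new Uint8ClampedArray([255, 0, 0, 255, 200, 10, 10, 255]);
const maxDiffPixels = 0; // the tier-1 knob: any mismatch fails
const passed = countDiffPixels(golden, actual) <= maxDiffPixels;
```

The verdict is a number compared against a knob, which is exactly the failure shape tier 1 hands a reviewer.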
Tier 2 — AI-augmented pixel-diff
Still takes two PNGs and diffs them, but a model filters diff regions by semantic salience. Kills a lot of tier-1 flake, keeps the baseline model, keeps the committed PNGs, keeps the cloud.
Tier 3 — Model-as-judge
No baseline PNG. No diff step. Each frame is base64-JPEG-attached to a multimodal LLM as a tool result, and a plain-English plan says what the page should look like. Failures are sentences, not pixel counts.
The brands most "best of" lists stack next to each other
Of the ten brands those lists cite, eight share one primitive, two are tier 2, and zero are tier 3. That is the shape of the market, and the shape of the SERP gap this page fills.
The anchor fact: grep the tier-3 tool
This is the part of the page a reader can verify in ten seconds. Clone the assrt-mcp repo, run the grep, read the result. The claim is not marketing copy; it is what the file system returns.
The visual-regression step itself lives in one block of code inside assrt-mcp/src/core/agent.ts. Every click, type, navigate, scroll, and key press is followed by a fresh JPEG screenshot attached to the next LLM message. That is the whole pipeline.
The TestAssertion type, for comparison
In a tier-1 tool the failure signal is a number: N mismatched pixels above threshold T, plus a diff-image.png for the reviewer to squint at. In a tier-3 tool the failure signal is a sentence. Here is the full type, copied straight from source.
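A sketch of that shape, built from the three fields this page names (description, passed, evidence); the exact declaration in types.ts may differ in modifiers:

```typescript
// Sketch of the TestAssertion shape, from the three fields this page names.
// Exact syntax in src/core/types.ts is an assumption.
interface TestAssertion {
  description: string; // the English check the model was asked to verify
  passed: boolean;     // the model's verdict
  evidence: string;    // one free-text sentence describing what it saw
}

// A failing assertion, in the shape this page quotes elsewhere:
const failure: TestAssertion = {
  description: "the submit button shows its label after submission",
  passed: false,
  evidence: "the submit button shows only a loading spinner, no label text",
};
```

No tolerance, no pixel count, no diff-image URL: the failure is a sentence a reviewer can read inline in a PR comment.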
How a frame becomes a pass or fail, per tier
Same starting pixels, three different paths to a verdict. The beam diagram below is the shortest honest summary of what the primitives actually do once Chromium hands over a screenshot.
One frame in, three tiers of verdict out
“Grep the entire assrt-mcp source tree for pixelmatch, maxDiffPixels, toHaveScreenshot, or pngjs. Zero hits. The visual regression is a JPEG attached to Claude Haiku 4.5 at agent.ts line 987, and the assertion has three fields: description, passed, evidence. That is the whole pipeline.”
assrt-mcp/src/core/agent.ts:987 and src/core/types.ts:13-17
Tool-by-tool, in the only table that sorts by primitive
The left column is the feature. The middle column is what tier 1 and tier 2 tools do (collapsed into one column because the primitive is the same, with a model layer on top for tier 2). The right column is Assrt, the one tier-3 tool. If you were looking for a brand-by-brand rank, the four-question walkthrough below the table covers how to pick between brands once you have picked your tier.
| Feature | Tier 1 + 2 tools (pixel-diff, optionally AI-filtered) | Assrt (tier 3, model-as-judge) |
|---|---|---|
| What the tool is actually looking at | Two PNGs, pixel by pixel, via pixelmatch or an equivalent | One JPEG of the current frame, read by Claude Haiku 4.5 |
| Source of truth | Golden PNG in __snapshots__/ or the vendor cloud, committed to git | One-sentence #Case in scenario.md, plain text on disk |
| Tolerance model | maxDiffPixels, threshold, mask, animations: 'disabled' | No tolerance knob; the model judges the frame against the plan |
| First-run ritual | npx playwright test --update-snapshots, then commit PNGs | npx assrt-mcp run --url ... --plan-file scenario.md |
| When marketing tweaks a hero gradient | Regenerate baselines, review the diff image, hope you did not bake in a bug | Nothing changes, unless the plan said the gradient mattered |
| Dynamic content (toasts, shimmers, animations) | Must be masked or animations-disabled per call to stop CI flake | Fine by default; the plan specifies what to verify |
| Shape of a failure | N mismatched pixels plus diff-image.png for a human to interpret | passed: false, evidence: 'one-sentence description of what it saw' |
| Artifact portability | Baseline PNGs locked to the tool; scenario logic in vendor YAML or JS bindings | Plain .md plan runs under any Playwright MCP agent unchanged |
| Cost to run against your own app, $0 starting point | Open-source tier-1 tools are $0 + CI time; managed tier-1 $149-$400/mo; tier-2 quoted around $7,500/mo at team scale | npx assrt-mcp is open source, $0, self-hosted, plus LLM token cost |
| Best for | Component-level pixel QA on locked design systems (tier 1); large dynamic SaaS UIs where tier-1 flakes (tier 2) | Page-level user-journey visual regression on fast-moving products |
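For concreteness, here is a plan file of the shape the table's "Source of truth" row describes: a plain-text #Case header followed by English. The case text below is an invented example, not copied from the Assrt docs.

```markdown
# Case: signup happy path
Open the pricing page, click the primary call-to-action, complete the signup
form with a test email, and submit. Verify the dashboard loads and shows a
welcome banner that includes the account email.
```

That file is the entire baseline: no committed PNGs, no snapshot folder, nothing to regenerate when the visuals legitimately change.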
How to pick a tier (then pick a brand inside the tier)
The tier decision is architectural. The brand decision is taste, budget, and integration fit. Most teams pick a brand without noticing the tier, which is why they bounce between Percy and Chromatic twice before realising both answer the same question.
Pick a tier in four questions
Is your design system locked and 1px drift matters?
Pick tier 1. Playwright toHaveScreenshot() is free and built-in. Chromatic is the nicest UX if you live in Storybook. BackstopJS is the nicest if you want full-local open-source with a config file.
Is your UI dynamic enough that tier 1 flakes weekly?
Tier 2. Applitools Visual AI clusters diff regions by salience and ignores rendering noise. You keep baselines, you keep the SaaS cloud, you pay the enterprise price. The primitive is still pixel-diff, just smarter.
Do you care about flow behavior, not pixel accuracy?
Tier 3. Assrt. The plan is a one-paragraph #Case in plain English. Every step screenshot goes to Claude Haiku 4.5 as base64 JPEG. Pass or fail is the model's sentence about what it saw. Best for page-level user-journey regression on fast-moving products.
Do you want both?
Run tier 1 and tier 3 in the same CI pipeline. They answer different questions. Tier 1 catches the 1px border drift on a locked button component. Tier 3 catches the 'checkout is completely broken but renders fine' class of regressions. Neither replaces the other.
What a failure looks like, tier 1 vs tier 3
Same bug: the primary CTA rendered as a loading spinner and never recovered. Two tools report the failure in two very different shapes. Neither shape is strictly better; they are optimised for different reviewer workflows.
Same bug, two failure shapes
FAIL: homepage.png differs from baseline by 8,412 mismatched pixels (threshold: 0.2, maxDiffPixels: 100). Diff images: test-results/homepage-expected.png, homepage-actual.png, homepage-diff.png. The reviewer opens all three side by side in a diff viewer and tries to spot what regressed. Possible causes: a real bug, an unrelated design change, a shimmer loader that leaked past a mask, or dynamic data that changed between runs.
- Fail is a mismatched-pixel count above a tolerance
- Diff-image review workflow requires a viewer and human interpretation
- Root cause is ambiguous until the reviewer opens the images
- Noise from unrelated design changes is common and costly
What the tier-3 approach guarantees, verified from source
The checklist below is not aspiration. Every item corresponds to a specific file and line in the assrt-mcp repo or a grep command you can run yourself against src/.
The real trade, verified from source
Constraints that hold across every run
- Zero pixelmatch imports in assrt-mcp/src, verified by grep against the repo
- Zero maxDiffPixels, threshold, or toHaveScreenshot references anywhere in source
- Screenshots attached as base64 JPEG to Claude Haiku 4.5 tool result at agent.ts:987
- TestAssertion = { description, passed, evidence } — three fields, types.ts lines 13-17
- No __snapshots__/ folder, no --update-snapshots flow, no committed baseline PNGs
- Plans are plain-text .md files parsed by a regex, not a vendor YAML or proprietary DSL
- Open source; every artifact (plan, results JSON, WebM video, per-step PNGs) lives on your disk
The obvious trade: tier 3 cannot flag a 1px border drift. For a mature component library on a locked design system, keep a tier-1 tool. Tier 3 is additive, not a replacement. The question this page exists to answer is not "which tool wins," it is "which tier does your product actually need."
Try the one tier-3 tool on this list, on your own app
One npx command, one .md file, real Playwright MCP under the hood. The video of the run auto-opens when it finishes. No baseline PNG folder, no account, no cloud. When you cancel Assrt tomorrow, you still have the plan, the videos, and every JPEG the model read.
Install npx assrt-mcp →

Best visual regression tools: specific questions
Is this just another 'top 10 visual regression tools' list?
No, and the structure proves it. Every other page for this keyword ranks 10 to 20 brands one after another without naming the algorithm any of them use. I read the top five before writing this. Four of them list Percy, Chromatic, BackstopJS, Playwright, Loki, and Happo as if they were six different categories. They are not: all six pixel-diff a current screenshot against a committed baseline PNG using pixelmatch or a pixelmatch-equivalent. Applitools adds an AI layer on top of the diff. Assrt replaces the diff with a model-as-judge. Three primitives, not twenty brands. The table lower on this page is organized that way.
What exactly is the 'comparison primitive' of a visual regression tool?
It is the single line of code that answers 'did this pixel change enough to matter.' For tier 1 tools, that is a pixelmatch call (or an equivalent PNG-diff library like odiff, resemblejs, or pngjs-based code) comparing two images with a tolerance. For tier 2 (Applitools Visual AI, TestMu SmartUI), a model clusters the diff regions and scores them by semantic salience, but the input is still two pixel-grids and the output is still a diff score. For tier 3 (Assrt), there is no diff step at all. The current frame is base64-encoded and attached to a multimodal LLM tool result; the model answers a plain-English question. Verify yourself: `grep -r 'pixelmatch\|maxDiffPixels\|toHaveScreenshot\|pngjs' assrt-mcp/src` returns zero hits.
Where is the actual attach-screenshot-to-model code in Assrt?
assrt-mcp/src/core/agent.ts around lines 972-990. Every visual action (navigate, click, type_text, select_option, scroll, press_key) is followed by a `browser.screenshot()` call that returns a base64 JPEG. The very next block wraps it in `{ type: 'image', source: { type: 'base64', media_type: 'image/jpeg', data: screenshotData } }` and appends it to the tool-result content the model reads on the next turn. The list of action names that skip the screenshot (snapshot, wait, assert, evaluate, http_request, etc.) is inline on line 974 because those do not change the pixels and would waste tokens. There is no image-comparison code anywhere in the file.
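A sketch of that attach step follows. The image-block shape and the action names are quoted from the answer above; the function and set names are illustrative, not from the repo's source.

```typescript
// Sketch of the attach step: visual actions get a fresh base64 JPEG wrapped
// in an image block; read-only actions skip it to avoid wasting tokens on
// an unchanged frame. Names other than the quoted shape are illustrative.
type ImageBlock = {
  type: "image";
  source: { type: "base64"; media_type: "image/jpeg"; data: string };
};

const VISUAL_ACTIONS = new Set([
  "navigate", "click", "type_text", "select_option", "scroll", "press_key",
]);

function frameForToolResult(action: string, jpegBase64: string): ImageBlock[] {
  if (!VISUAL_ACTIONS.has(action)) return []; // e.g. snapshot, wait, assert
  return [{
    type: "image",
    source: { type: "base64", media_type: "image/jpeg", data: jpegBase64 },
  }];
}
```

The model reads this block on its next turn; that attach-and-read loop is the entire "comparison" step of tier 3.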
What does a visual regression 'assertion' look like when there is no pixel count?
Three fields. Here is the entire shape, from assrt-mcp/src/core/types.ts lines 13-17: description (the English thing the model was asked to verify), passed (boolean), evidence (a free-text sentence describing what the model saw). No tolerance, no mismatched-pixel number, no diff-image URL. A failing assertion reads like 'passed: false, evidence: the submit button shows only a loading spinner, no label text'. The whole point is that failures become one sentence a reviewer can read in a PR comment, without opening a three-pane diff viewer.
Which tool should I pick if I care about a one-pixel border drift on a button?
Tier 1, and Playwright's built-in `toHaveScreenshot()` is usually enough. If you own a locked design system and you need to catch a 1px shift, a hex color change, or a subpixel antialiasing difference, pixel-diff is strictly better than model-as-judge. Chromatic is the nicest UX of the tier-1 group for component-level review. BackstopJS is the nicest if you want it fully local and open-source. Tier-2 tools (Applitools) help when tier-1 flakes too much in CI on a changing product. Tier 3 (Assrt) is the wrong answer for that job and I will not pretend otherwise.
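For reference, these are the knobs a tier-1 Playwright check turns, shaped like the real options object of `expect(page).toHaveScreenshot()`; the selector in the usage comment is illustrative.

```typescript
// Real Playwright toHaveScreenshot() option names; values here are examples.
const screenshotOptions = {
  animations: "disabled" as const, // freeze CSS animations and transitions
  maxDiffPixels: 100,              // tolerate up to this many mismatched pixels
  threshold: 0.2,                  // per-pixel color-distance tolerance (0-1)
};

// In a spec file this is used as (selector illustrative):
//   await expect(page).toHaveScreenshot("button.png", {
//     ...screenshotOptions,
//     mask: [page.locator(".avatar")], // hide regions that legitimately vary
//   });
```

Every one of those knobs is a per-call maintenance decision, which is exactly the cost the tier framing is about.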
When is tier 3 (Assrt) the right answer?
Page-level, user-journey visual regression on a product that ships weekly. When the homepage gradient, the hero copy, the pricing card layout, or the dashboard navigation can legitimately change between deploys and you care whether the signup flow still works, not whether two pixels shifted. Tier 1 drowns in --update-snapshots noise in this case. Tier 3 answers a different question: 'from this JPEG, did the flow reach the expected state,' not 'do pixel blocks match.' The plan is English in a .md file; the only thing that would make a test invalid is a change in the English intent, not a change in the visual implementation.
How much does each tier cost, realistically?
Tier 1 open source (BackstopJS, Playwright, Loki) is $0 of software plus CI minutes. Tier 1 managed (Percy, Chromatic) is seat-based and screenshot-metered, commonly $149 to $400 per month at small scale. Tier 2 (Applitools Eyes, TestMu SmartUI) sits higher: the Applitools enterprise tier is frequently quoted around $7,500 a month at team scale based on public pricing notes and third-party reports. Tier 3 (Assrt) is `npx assrt-mcp`, self-hosted, $0 of software, plus whatever the LLM calls cost (Claude Haiku 4.5 is inexpensive at current rates). Cost is not the axis this page ranks on; primitive is. But the price-to-primitive ratio is worth naming.
Does Assrt lock me into a vendor format?
No. The plan is a plain-text .md file with a regex-parsed header (`#?\s*(?:Scenario|Test|Case)`, defined in assrt-mcp/src/core/scenario-files.ts). The tools the agent calls are the public Playwright MCP vocabulary (navigate, click, snapshot, type_text, scroll, press_key, select_option, wait). The result JSON schema is typed in src/core/types.ts. Every artifact (scenario.md, results JSON, WebM video, per-step PNGs under /tmp/assrt/<runId>/screenshots/) lives on your disk. If you cancel Assrt tomorrow, the .md file still runs under any other Playwright MCP agent, because there is no proprietary DSL between you and the browser.
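The header regex quoted above, applied to a few candidate first lines (the regex literal is from this page; everything else is illustrative):

```typescript
// The case-header regex quoted from scenario-files.ts, tried on three lines.
const CASE_HEADER = /#?\s*(?:Scenario|Test|Case)/;

const starts = [
  "# Case: signup happy path", // markdown-style header
  "Scenario: checkout",        // bare header, the leading # is optional
  "just a paragraph",          // no header keyword, not a case start
].map((line) => CASE_HEADER.test(line));
```

That regex is the entire "parser" between you and the browser, which is why the same .md file is portable to any other Playwright MCP agent.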
Which model does Assrt use, and can I swap it out?
Default is Claude Haiku 4.5 (claude-haiku-4-5-20251001), pinned at assrt-mcp/src/core/agent.ts line 9. Override at the CLI with `--model`. Gemini is also supported via the Provider type ('anthropic' | 'gemini') in the same file. Both providers receive the same base64 JPEG after each visual action, so swapping providers is a one-flag change that does not affect the plan format or the artifact layout. This matters for the 'vendor lock-in' question: the model is a swappable dependency, not a service you rent from Assrt.
What about flaky CI from dynamic content (toasts, shimmers, animations)?
This is where tier 1 tools burn the most maintenance time. A shimmer loader, a fade-in toast, a rotating carousel, or any pseudo-random avatar image breaks a pixel-diff because two runs will differ by more than maxDiffPixels even when nothing is wrong. Tier 1 fixes are `animations: 'disabled'`, `mask: [locator]`, and per-region masks on every call. Tier 3 does not care: the plan says 'verify the success toast appears with the order ID,' and the model reads the frame regardless of what else moved. Evidence records what it actually saw. No mask configuration per test.
Is Assrt really open source? What is the catch?
Assrt ships as an open-source npm package: `npx assrt-mcp`. Self-hosted, MIT-compatible license, the full source is in the assrt-mcp repo. The catch, in the same spirit as this page: the quality of the failure evidence is only as good as the underlying multimodal LLM. Claude Haiku 4.5 is good enough for the page-level checks that this approach is designed for. It will occasionally miss a 1px drift or a subtle hex-color shift that a pixel-diff would flag. This is a deliberate trade, not a bug. For those cases, keep a tier-1 tool in the same CI pipeline. Tier 3 is additive.
Adjacent guides: the step-by-step tier-3 tutorial, the broader E2E framework taxonomy, and the self-healing side of agent-driven testing.
Keep reading
Visual regression tutorial without toHaveScreenshot()
How to run a semantic visual regression in under a minute with npx assrt-mcp. The step-by-step companion to this taxonomy.
E2E testing frameworks: the 4-tier taxonomy
Scripted, low-code, AI-compiled, agent-interpreted. Tier-4 agent-interpreted is the same layer that makes tier-3 VRT possible.
Automated self-healing tests
The flip side of model-as-judge VRT: when a selector moves, the agent re-snapshots and continues. No locator to patch.
Three primitives, one list
Eight tier-1 brands, two tier-2 brands, one tier-3 tool.
That is the whole market. Pick the tier, then pick the brand inside it.