The two-phase AI pipeline SERP tools pretend is one

AI visual regression is two phases, not one.

Every result for this keyword on page one of Google sells you the same thing: swap pixelmatch for a proprietary visual model, keep the baseline, keep the dashboard. That is one phase of AI. Assrt runs two. Claude Haiku 4.5 judges each step screenshot live as the test runs. When it finishes, you can call a second tool and ask Gemini 3.1 Flash Lite Preview an open-ended English question about the entire WebM recording. Same test, two different AI surfaces, zero baselines.

Matthew Diakonov
11 min read
4.8 from Assrt MCP users
Phase 1: Claude Haiku 4.5 judges each JPEG screenshot mid-run
Phase 2: Gemini 3.1 Flash Lite Preview answers prompts on the full WebM
Tool: assrt_analyze_video, defined at assrt-mcp/src/mcp/server.ts:925

The blind spot in every AI visual regression product on the SERP

Read the top five results for this keyword. Applitools Eyes, BrowserStack Percy, Mabl, LambdaTest SmartUI, Reflect. They all tell the same story: the pixel diff is brittle, so we added a vision model that compares a fresh render against a stored baseline more intelligently. It ignores dynamic regions, weighs structural elements, suppresses false positives. That is good work, but it is one surface. The model is glued to the diff step and nowhere else.

There is a second surface nobody in that list ships: after the run finishes, you have a full video recording of the browser session sitting on disk. Nothing stops you from handing that video to a different model and asking it a free-form question. That is what phase 2 is. The SERP has not named it because the SERP is selling baselines.

The pipeline

scenario.md → Assrt agent → live browser (Claude Haiku 4.5 judges each frame) → recording.webm → Gemini 3.1 Flash Lite → TestReport JSON

Phase 1 — Claude Haiku reads each frame as it lands

Every time the agent takes a visual action (navigate, click, type_text, select_option, scroll, press_key) the browser emits a fresh screenshot. That screenshot is pushed into the next tool-result message as a base64 JPEG. The attach site is agent.ts line 987. The model that reads it is set at the top of the file: DEFAULT_ANTHROPIC_MODEL = "claude-haiku-4-5-20251001" on line 9. Nothing compares that JPEG to a golden image. There is no golden image on disk. The model is the judgment surface.

assrt-mcp/src/core/agent.ts
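The attach described above can be sketched as a small message builder. The image-block shape follows the Anthropic Messages API format this page quotes in the FAQ; `buildScreenshotToolResult` and its argument names are illustrative, not the real code in agent.ts.

```typescript
// Sketch of the phase-1 attach: after each visual action, the fresh browser
// screenshot rides into the next tool-result message as a base64 JPEG.
// No golden image is read or compared anywhere on this path.

type ImageBlock = {
  type: "image";
  source: { type: "base64"; media_type: "image/jpeg"; data: string };
};

type ToolResultBlock = {
  type: "tool_result";
  tool_use_id: string;
  content: ImageBlock[];
};

const DEFAULT_ANTHROPIC_MODEL = "claude-haiku-4-5-20251001";

// Illustrative helper name; the real attach site lives in agent.ts.
function buildScreenshotToolResult(
  toolUseId: string,
  screenshotData: string, // base64-encoded JPEG straight from the browser
): ToolResultBlock {
  return {
    type: "tool_result",
    tool_use_id: toolUseId,
    content: [
      {
        type: "image",
        source: { type: "base64", media_type: "image/jpeg", data: screenshotData },
      },
    ],
  };
}
```

The block then joins the conversation sent to DEFAULT_ANTHROPIC_MODEL, which judges the frame against the plan in English.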

Phase 2 — Gemini answers English questions about the whole run

Playwright records a WebM of the entire browser session natively. The file is finalized and moved into /tmp/assrt/<runId>/videos/recording.webm at the end of the run (server.ts lines 577-594). The path is cached in a module-level lastVideoFile variable (server.ts line 270) so later MCP calls know which recording to open. Now comes the second tool.

assrt_analyze_video is registered only when process.env.GEMINI_API_KEY is set (server.ts line 929). It reads the WebM off disk, base64-encodes the whole thing, and hands it to Gemini 3.1 Flash Lite Preview in a single prompt alongside your question. This is the uncopyable bit. The exact constant is at line 927: const GEMINI_VIDEO_MODEL = "gemini-3.1-flash-lite-preview". The mimeType that ships with the bytes is "video/webm" (line 977). That is the whole mechanism.

assrt-mcp/src/mcp/server.ts

Call assrt_analyze_video, pass an English prompt, get an English answer grounded in the frames of the run. No videoPath needed if you just ran assrt_test — the server remembers the last recording.

assrt-mcp/src/mcp/server.ts line 931
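The request the tool assembles can be sketched as follows. The model constant and the inlineData part with mimeType "video/webm" come straight from this page; `buildVideoAnalysisRequest` is a hypothetical helper, not the actual server.ts implementation, and in the real tool the bytes come from the recording under /tmp/assrt.

```typescript
// Sketch of the phase-2 payload: the whole WebM, base64-encoded, plus an
// English prompt, in a single Gemini request.
const GEMINI_VIDEO_MODEL = "gemini-3.1-flash-lite-preview";

type GeminiPart =
  | { inlineData: { mimeType: "video/webm"; data: string } }
  | { text: string };

// Illustrative helper; server.ts does the equivalent inline.
function buildVideoAnalysisRequest(webmBytes: Buffer, prompt: string) {
  return {
    model: GEMINI_VIDEO_MODEL,
    contents: [
      {
        role: "user",
        parts: [
          {
            inlineData: {
              mimeType: "video/webm",
              data: webmBytes.toString("base64"),
            },
          },
          { text: prompt },
        ] as GeminiPart[],
      },
    ],
  };
}
```

Ten questions against the same run are ten calls to this builder with the same bytes and a different prompt.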

What this looks like end to end

You run the test. Claude judges each frame. The WebM lands on disk. You ask Gemini whatever you want. You can rerun that last question with a different wording. You can ask ten questions against the same recording and only pay for the ten Gemini requests, because the video is not uploaded to a cloud baseline store — it is a file under /tmp/assrt.

Phase 1 + Phase 2 in one sequence

  1. You → Assrt agent: click #submit
  2. Browser → Claude Haiku: JPEG screenshot
  3. Claude Haiku → Assrt agent: pass/fail + evidence
  4. Assrt agent → /tmp/assrt: write recording.webm
  5. You → assrt_analyze_video: prompt
  6. assrt_analyze_video → /tmp/assrt: read WebM as base64 bytes
  7. assrt_analyze_video → Gemini Flash Lite: inlineData video/webm + prompt
  8. Gemini Flash Lite → assrt_analyze_video: English analysis
  9. assrt_analyze_video → You: analysis text
  • 2 AI models in the regression loop (Claude + Gemini)
  • 0 baseline PNGs, diff masks, or tolerance knobs
  • 1 MCP tool you call for retrospective video analysis

Where the SERP stops and Assrt keeps going

Every row in the table below is a specific capability, not a mood-board adjective. The right column points back to the exact lines in the open-source repo, so you can verify any claim in thirty seconds.

| Feature | SERP tool (single-phase AI) | Assrt (two-phase AI) |
|---|---|---|
| Judges a single frame against a golden image | Yes, core flow | No golden image exists |
| Lets you query the full test recording in English | No | Yes — assrt_analyze_video prompt='...' |
| Model used for live per-step judgment | Proprietary cloud model | Claude Haiku 4.5 (agent.ts line 9) |
| Model used for retrospective video analysis | Not a feature | Gemini 3.1 Flash Lite Preview (server.ts line 927) |
| Test artifact format | Proprietary DB rows, signed URLs | WebM, PNG, JSON, .md under /tmp/assrt |
| Self-hosted, zero-cloud option | No | Yes — open-source npm, run locally |
| Re-query the same run with new questions | No | Unlimited — video is on disk, rerun Gemini |
| Public, non-proprietary scenario format | YAML DSL or low-code builder | Plain English in scenario.md |

Specifically about the AI visual regression surface, not overall tool scope.

When phase 2 pays off, specifically

Phase 1 is always on. Phase 2 is the one you pull out when a pixel diff would not have given you the answer anyway. Good prompts look like questions, not assertions.

Moments when assrt_analyze_video earns its keep

  • The test passed but something felt off in the recording — ask 'was there any layout shift during the first 3s'.
  • A toast, modal, or banner should appear for a specific duration — ask 'how long was the success toast visible'.
  • You want a spot-check on animation timing without adding screenshot assertions to the plan.
  • Someone on the team asks 'did the dashboard ever flash blank' and you need an answer from the actual recording.
  • A client calls with a UI complaint — re-query last night's CI video with their exact words as the prompt.
  • You are auditing a visual regression in production and need a second opinion on the frames, not the diff.

What this doesn't replace

Pixel-perfect component regression. If you maintain a design system where a 1px border color drift is a real bug, keep Playwright's toHaveScreenshot() on the component. A model will not flag one pixel. The two-phase AI approach is for page-level and journey-level regressions: did the onboarding flow render correctly, did the pricing table reflow, did the error toast appear when it was supposed to. Both stacks can live in the same repo, answering different questions.

Want to see both phases running against your app?

Fifteen minutes. Bring a flaky user journey. We will run Assrt against it live and ask Gemini a question you pick about the recording.

Book a call

FAQ: AI visual regression in two phases

What actually makes this 'AI visual regression' and not regular visual regression?

A regular visual regression run pixel-diffs a fresh screenshot against a stored golden PNG, typically via Playwright's toHaveScreenshot(), pixelmatch, or resemble.js. The only knob you tune is maxDiffPixels. AI visual regression replaces that pixel diff with a model that looks at the frame and judges it by meaning: is the Submit button in the right state, did the success toast appear, is the avatar a real image or a gray placeholder. The Assrt codebase has zero calls to toHaveScreenshot, pixelmatch, resemble.js, looks-same, or odiff. The judgment lives in the model, so the test question is English (from scenario.md) and the answer is English (evidence string on each TestAssertion at assrt-mcp/src/core/types.ts lines 13-17).
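To illustrate English-in, English-out, here is a hypothetical approximation of an assertion record carrying an evidence string. Only the existence of an English evidence field on each TestAssertion is taken from this page; the other field names and the `summarize` helper are illustrative, not the real types.ts shape.

```typescript
// Hypothetical approximation of a model-judged assertion: the question is
// English (from scenario.md) and the answer is English (the evidence string).
type TestAssertion = {
  description: string; // the English question from the scenario
  passed: boolean;
  evidence: string; // the model's English justification for pass/fail
};

// Illustrative report helper: no pixel counts, no diff masks, just sentences.
function summarize(assertions: TestAssertion[]): string {
  const failed = assertions.filter((a) => !a.passed);
  return failed.length === 0
    ? `all ${assertions.length} assertions passed`
    : failed.map((a) => `FAIL: ${a.description}: ${a.evidence}`).join("\n");
}
```

Compare that with a pixel-diff report, which can only tell you how many pixels changed, never why it matters.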

Which model does Assrt use for the live judgment, and where is that set?

The default is Claude Haiku 4.5. The exact model ID claude-haiku-4-5-20251001 is declared at assrt-mcp/src/core/agent.ts line 9 as DEFAULT_ANTHROPIC_MODEL. After every visual action (navigate, click, type_text, select_option, scroll, press_key), the browser screenshot is captured and attached to the next tool-result message as { type: 'image', source: { type: 'base64', media_type: 'image/jpeg', data: screenshotData } }. That attach site is at agent.ts line 987. The model sees each new frame, reasons about it against the plan, and decides whether to continue, retry, or emit a pass/fail assertion. No baseline PNG is involved on this path.

What is the second phase, exactly?

After the run finishes, Playwright's native video recorder has written a full WebM of the browser session (stopped at server.ts line 578, finalized and moved into /tmp/assrt/<runId>/videos). You can call the assrt_analyze_video MCP tool with a natural-language prompt. The tool reads the WebM off disk (server.ts line 953), base64-encodes the entire file (line 964), and sends it to Gemini 3.1 Flash Lite Preview as a single inlineData part with mimeType 'video/webm' (lines 976-981). The model replies with an English analysis. You can ask 'did the confirmation banner appear for the full 2 seconds' or 'was there ever a flash of red on the pricing page' and get an answer grounded in the pixels of the run.

Why Gemini for phase 2 specifically and not Claude again?

Because Gemini accepts video as a first-class modality in a single prompt. The WebM is passed straight in via inlineData with mimeType 'video/webm'. Claude can see images, but sending an entire test recording frame-by-frame burns tokens and misses timing. The assrt_analyze_video tool at server.ts lines 925-1018 is registered only if process.env.GEMINI_API_KEY is set (line 929). If you don't set the key, the tool simply never appears on the MCP server. You still get phase 1 from Claude, just not the retrospective video phase. Nothing else in the codebase changes.
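The opt-in registration described above reduces to one environment check. This is a sketch under stated assumptions: `availableTools` is an illustrative function, not the real server.ts registration code, but the gating condition (GEMINI_API_KEY present or not) is the one the page cites at line 929.

```typescript
// Sketch of the conditional tool surface: phase 1 always ships,
// phase 2 only appears when a Gemini key is configured.
type ToolName = "assrt_test" | "assrt_analyze_video";

function availableTools(env: Record<string, string | undefined>): ToolName[] {
  const tools: ToolName[] = ["assrt_test"]; // Claude-judged runs, always on
  if (env.GEMINI_API_KEY) {
    tools.push("assrt_analyze_video"); // retrospective video analysis, opt-in
  }
  return tools;
}
```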

Can I run both phases against the same test?

Yes, and that is the intended flow. Call assrt_test to run the scenario (phase 1 happens automatically, Claude Haiku 4.5 judges per frame). The server remembers the path of the WebM in a module-level variable lastVideoFile (server.ts line 270). Now call assrt_analyze_video with any prompt, no videoPath argument needed (server.ts line 939 falls back to lastVideoFile). Repeat with different prompts for free: once the video is on disk, you can interrogate it as many times as you want. 'Did the left sidebar flicker', 'did any modal appear unexpectedly', 'estimate the LCP element visually'. Each question is a different Gemini request against the same base64 blob.
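The videoPath fallback in that flow can be sketched as follows. The variable name lastVideoFile mirrors the page; `resolveVideoPath` and the error message are illustrative, not the actual server.ts code.

```typescript
// Sketch of the fallback: an explicit videoPath wins, otherwise the server
// reaches for the module-level path cached by the previous assrt_test run.
let lastVideoFile: string | undefined;

function resolveVideoPath(videoPath?: string): string {
  const resolved = videoPath ?? lastVideoFile;
  if (!resolved) {
    throw new Error("no recording available: run assrt_test first or pass videoPath");
  }
  return resolved;
}
```

Every repeat question resolves to the same file on disk, which is why re-querying costs only the Gemini request.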

What does the SERP miss about this?

The top results for 'ai visual regression' (Applitools Eyes, BrowserStack Percy, Mabl Visual AI, LambdaTest SmartUI, Reflect) describe AI as a replacement for pixelmatch. They compare rendered screenshots against baselines more cleverly: they ignore dynamic zones, weight structural elements, and suppress false positives. That is one-phase AI. None of them let you take your recorded test run and ask it open-ended questions in English after the fact. The retrospective video analyst is a second surface the industry has not named yet. That is the spine of this page.

How is this different from the Assrt visual-regression-tutorial page?

The tutorial page at /t/visual-regression-tutorial explains the in-run model in depth: why each screenshot is a JPEG attached to Claude, why there is no baseline PNG, what a TestAssertion looks like. It is a phase-1 deep dive. This page is about the pair. Phase 1 plus the assrt_analyze_video retrospective that the tutorial does not cover. If you care about why baselines are gone, read the tutorial. If you care about the full AI loop (judge during, analyze after), read this.

What happens to the video if I never call assrt_analyze_video?

It stays on disk anyway. The WebM is recorded by Playwright's native page.video() regardless of whether Gemini is configured. It lands at /tmp/assrt/<runId>/videos/recording.webm and a self-contained HTML player is generated next to it at player.html (server.ts line 618). The MCP tool response includes a videoPlayerUrl served by a local static server (server.ts line 629) that auto-opens by default. So even without an API key, you can watch every run replay in any browser. Phase 2 only kicks in if you want programmatic queries over the recording.

Does AI visual regression actually replace toHaveScreenshot()?

Not for everything. Pixel diffs catch sub-pixel layout shifts and one-pixel color drifts a model will not flag. For a design-system component library where a 1px border change is a real regression, keep Playwright's toHaveScreenshot() on the component. Use the two-phase AI approach at the user-journey level: is the checkout flow in the right state end to end, did the dashboard render the correct data, did the onboarding modal show up exactly when it was supposed to. The realistic stack is both, scoped to what each is good at. The win of the AI approach is that you stop maintaining __snapshots__ folders for tests that are really asking page-level questions.

Is any of this proprietary, or can I walk away with everything?

Everything stays on your disk in standard formats. The scenario is a .md file (regex #?\s*(?:Scenario|Test|Case) for the case header). Results are JSON matching the TestReport shape at types.ts lines 28-35. Screenshots are PNGs, the video is a WebM, the player is a plain HTML file. There is no vendor baseline store, no proprietary diff format, no cloud-only dashboard. The comparable tier-3 AI testing platforms charge around $7,500 a month at scale. Assrt ships as an open-source npm package (npx assrt-mcp) you run against your own app. Cancel the relationship and the scenario still runs, the video still plays, the results still parse.
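The case-header regex is quoted verbatim above; a quick sketch shows it accepting plain-English headers. The sample lines are made up, and note the pattern is unanchored, so it matches the keyword anywhere in a line.

```typescript
// The exact regex the page quotes for scenario.md case headers.
const CASE_HEADER = /#?\s*(?:Scenario|Test|Case)/;

function isCaseHeader(line: string): boolean {
  return CASE_HEADER.test(line);
}
```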
