Testing Infrastructure

Automated Visual Regression, End to End

A working reference for teams who want real visual regression running on every pull request, not a demo that falls apart the second production traffic hits it. Pipeline, code, CI wiring, and the flakiness fixes nobody talks about.


TL;DR. Automated visual regression means a script captures screenshots of your app, compares them to stored baselines, and fails the build if the diff exceeds a threshold. The hard parts are not the comparison. They are the baselines, the flakiness, the CI wiring, and the workflow for rolling forward on intentional UI changes. This guide walks through every piece with runnable code.

What Counts as Automated

A visual test is automated when a machine runs it without human prompting, on a schedule or trigger, and produces a binary pass/fail that a CI system can gate on. Taking screenshots and eyeballing them in a Slack channel does not count. Running Percy or Chromatic once a week does not count either, because the feedback loop is too slow to catch regressions before they ship.

The bar is higher than most teams think. For a suite to actually be automated, five properties need to hold:

  • Reproducible capture. The same input produces the same screenshot, regardless of who runs it or which machine runs it.
  • Stored baselines. The expected images live in version control alongside the code, not in a vendor cloud where they can quietly drift.
  • Deterministic diff. The comparison algorithm returns the same result on every run with the same inputs. No sampling, no random seeds.
  • CI integration. A failing diff fails the build. A human has to approve the new baseline before it lands on the main branch.
  • Rollback path. Intentional UI changes update baselines atomically with the code change, so reverting one reverts both.

Miss any one of these and you end up with a theater of tests that pass locally, fail in CI, get quarantined, and eventually get deleted. This is the failure mode for roughly half the visual regression setups I have seen in the wild.

The Pipeline

Every automated visual regression setup follows the same shape, regardless of tooling. Understanding the shape first makes it obvious where each piece of configuration belongs.

Visual regression pipeline:

  1. Boot app: deterministic seed.
  2. Navigate: wait for network idle.
  3. Screenshot: mask dynamic regions.
  4. Compare: pixel or perceptual diff.
  5. Gate: pass, fail, or review.
  6. Update: commit new baseline.

Notice that only two of the six steps are about pixel math. The rest are environmental. This is why teams that focus on the diff algorithm and ignore the setup tend to end up with flaky suites. Font rendering, image loading, animations, and layout reflow produce more failures than any real regression.

Baselines and the Diff Budget

A baseline is a PNG file that represents the correct appearance of a page or component at a specific viewport and browser. The name of the file encodes the test case, the browser, and the platform, so a single test produces multiple baselines when it runs across a matrix. Playwright stores them in a tests/*.spec.ts-snapshots/ folder by default.

The diff budget is the tolerance you allow before the test fails. Playwright exposes two knobs: maxDiffPixels for an absolute count, and maxDiffPixelRatio for a percentage. Use the ratio. Absolute counts break the second you change viewport size.

playwright.config.ts
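A minimal sketch of that config. The baseURL, viewport, and single-browser project are assumptions to adapt for your app; the expect and use options are standard Playwright:

```typescript
// Sketch of a visual-testing config. baseURL, viewport size, and the
// single chromium project are assumptions; adjust them to your setup.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  testDir: './tests/visual',
  expect: {
    toHaveScreenshot: {
      animations: 'disabled',  // freeze CSS animations and transitions
      caret: 'hide',           // no blinking text cursor in captures
      maxDiffPixelRatio: 0.01, // 1% budget; prefer ratio over absolute counts
    },
  },
  use: {
    baseURL: 'http://localhost:3000',
    viewport: { width: 1440, height: 900 },
    deviceScaleFactor: 1,      // avoid retina/HiDPI baseline mismatches
    timezoneId: 'UTC',         // locked timezone for date-dependent UI
    locale: 'en-US',
    colorScheme: 'light',
  },
  projects: [{ name: 'chromium', use: { browserName: 'chromium' } }],
});
```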

The combination of animations: 'disabled', caret: 'hide', and a locked timezone kills the top three sources of flakiness in one shot. The remaining flakiness comes from dynamic content, which the test itself handles with masks.

A Real Test File

The pattern below is a production template, not a hello world. It covers full-page captures, element captures, dynamic content masking, and a before-hook that waits for fonts and images to load. Copy it, rename the test names, and you have a working suite.

tests/visual/dashboard.spec.ts
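A sketch of that template, assuming a /dashboard route, a metrics API at /api/metrics, and data-testid hooks your app may name differently:

```typescript
import { test, expect } from '@playwright/test';

// All selectors and the /api/metrics route below are assumptions.
test.describe('dashboard', () => {
  test.beforeEach(async ({ page }) => {
    await page.goto('/dashboard');
    await page.waitForLoadState('networkidle');
    // networkidle does not cover web fonts; wait for them explicitly.
    await page.evaluate(() => document.fonts.ready);
    // Wait for every <img> to finish loading or error out before capture.
    await page.evaluate(() =>
      Promise.all(
        Array.from(document.images)
          .filter((img) => !img.complete)
          .map(
            (img) =>
              new Promise((resolve) => {
                img.onload = img.onerror = resolve;
              })
          )
      )
    );
  });

  test('full page matches baseline', async ({ page }) => {
    await expect(page).toHaveScreenshot('dashboard.png', {
      fullPage: true,
      mask: [
        page.locator('[data-testid="last-updated"]'),
        page.locator('[data-testid="user-avatar"]'),
      ],
    });
  });

  test('metrics panel in isolation', async ({ page }) => {
    const panel = page.locator('[data-testid="metrics-panel"]');
    await expect(panel).toHaveScreenshot('metrics-panel.png');
  });

  test('empty state', async ({ page }) => {
    // Intercept the metrics API and return no data, then reload so the
    // page renders its empty state deterministically.
    await page.route('**/api/metrics', (route) =>
      route.fulfill({
        status: 200,
        contentType: 'application/json',
        body: '[]',
      })
    );
    await page.reload();
    await page.waitForLoadState('networkidle');
    await expect(
      page.locator('[data-testid="metrics-panel"]')
    ).toHaveScreenshot('metrics-empty.png');
  });
});
```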

The third test is the interesting one. It intercepts the metrics API and returns an empty array, then captures the empty state. This is how you test UI states that are hard to reach through the real backend. The intercept makes the screenshot deterministic because the data never varies.

CI Wiring

Visual tests need to run in the same environment where the baselines were generated. Otherwise font rendering and subpixel positioning differ, and every test fails on the first run. The cheapest way to get this right is to generate baselines inside a Docker container that your CI also uses.

.github/workflows/visual.yml
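A minimal workflow along those lines. The image tag is an example; pin it to match your installed @playwright/test version, or font and browser rendering will drift from your baselines:

```yaml
name: visual
on: pull_request

jobs:
  visual:
    runs-on: ubuntu-latest
    container:
      # Example tag; keep it in lockstep with your @playwright/test version.
      image: mcr.microsoft.com/playwright:v1.49.0-jammy
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: npm ci
      - name: Run visual tests
        run: npx playwright test tests/visual
      - name: Upload diffs on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: visual-diffs
          path: test-results/
```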

The Microsoft Playwright Docker image ships with all browser binaries and matching system fonts, which is exactly what you need for byte-stable rendering. Skip the image and you will chase font hash mismatches for weeks. On failure, the workflow uploads the full test-results/ tree so reviewers can download the diff PNGs and see exactly what changed.

Hand-Written vs Generated

A visual regression test is mostly boilerplate: navigate, wait, mask the usual suspects, call toHaveScreenshot(). Writing it by hand is a perfect target for generation, which is why Assrt exists. The key property is that generated tests are plain TypeScript files you check into git, not proprietary YAML that lives in a vendor console.

Hand-written vs Assrt-generated

import { test, expect } from '@playwright/test';

test.describe('pricing page', () => {
  test.beforeEach(async ({ page }) => {
    await page.goto('/pricing');
    await page.waitForLoadState('networkidle');
    await page.evaluate(() => document.fonts.ready);
  });

  test('full page matches baseline', async ({ page }) => {
    await expect(page).toHaveScreenshot('pricing.png', {
      fullPage: true,
      mask: [page.locator('[data-testid="countdown"]')],
      maxDiffPixelRatio: 0.01,
    });
  });

  test('plan card in isolation', async ({ page }) => {
    const card = page.locator('[data-testid="plan-card-pro"]');
    await expect(card).toHaveScreenshot('plan-pro.png');
  });

  test('mobile viewport', async ({ page }) => {
    await page.setViewportSize({ width: 375, height: 812 });
    await expect(page).toHaveScreenshot('pricing-mobile.png', {
      fullPage: true,
    });
  });
});
86% fewer lines to write by hand.

Both files test the same thing. The difference is that you own the generated file, you can edit it, and it runs on plain Playwright with no vendor SDK. When Assrt stops being useful to you, the tests keep working. This is the hard rule behind everything the project does: tests are yours to keep, zero vendor lock-in.

Generate real Playwright tests, not YAML

Assrt is open-source, self-hosted, and outputs standard TypeScript. Replace $7,500/month visual testing vendors with a pipeline you actually own.


Killing Flakiness at the Source

A visual suite that flakes 10% of the time is worse than no suite, because the team learns to ignore failures. The fix is to cut flakiness at the source instead of widening the diff budget. Every one of the patterns below eliminates a specific class of false positives.

Animations. Set animations: 'disabled' in the Playwright config. This freezes CSS animations and transitions so a screenshot taken mid-animation produces the same frame every time. If a specific component uses JavaScript-driven animation (Framer Motion, GSAP, Lottie), add a data-testid that signals the end state and wait for it explicitly.

Fonts. networkidle does not wait for web fonts to finish rendering. document.fonts.ready does. Always wait for both.

Images. Preload every image the page needs before screenshotting. The quickest way is to wait for all img elements to have complete === true before capturing.

Dynamic content. Mask timestamps, avatars, ads, and anything else that changes between runs. The mask replaces the region with a solid color in the screenshot, so the comparison skips it cleanly. Prefer masking over hiding because hidden elements can shift the layout of everything below them.

Scrollbars. Different operating systems render scrollbars differently. Capture element-level screenshots when possible, or force overflow hidden on the body before a full-page capture.

tests/visual/helpers.ts
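One sketch of that helper. It is typed structurally against the two Page methods it uses, so the file carries no hard Playwright dependency; a real Playwright Page satisfies the shape:

```typescript
// Stabilization helper for visual tests. PageLike is a structural stand-in
// for the parts of Playwright's Page this helper touches.
type PageLike = {
  waitForLoadState(state: 'networkidle'): Promise<void>;
  evaluate(fn: () => unknown): Promise<unknown>;
};

export async function waitForStableUI(page: PageLike): Promise<void> {
  // 1. No in-flight network requests.
  await page.waitForLoadState('networkidle');

  // 2. Web fonts finished loading (networkidle does not cover this).
  //    globalThis.document resolves inside the browser context.
  await page.evaluate(() => (globalThis as any).document.fonts.ready);

  // 3. Every <img> has finished loading or errored out.
  await page.evaluate(() => {
    const doc = (globalThis as any).document;
    return Promise.all(
      Array.from(doc.images as ArrayLike<any>)
        .filter((img: any) => !img.complete)
        .map(
          (img: any) =>
            new Promise((resolve) => {
              img.onload = img.onerror = resolve;
            })
        )
    );
  });
}
```

In a spec, call `await waitForStableUI(page);` immediately before each `toHaveScreenshot` assertion.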

Import waitForStableUI in every visual test and call it before any toHaveScreenshot call. This single helper eliminates the large majority of real-world flakiness in my experience.

Updating Baselines Without Creating Chaos

The workflow question nobody answers in tutorials: how do you intentionally change the UI without burning a weekend updating baselines? The answer has two parts: generate updates inside the PR that changes the UI, and require a human to view the diff before merging.

The happy path looks like this. A developer opens a PR that changes a component. The visual regression job fails. The developer runs npx playwright test --update-snapshots inside a CI container (local generation produces the wrong hashes), commits the updated PNGs, and pushes. A reviewer looks at the diff in the PR (GitHub renders PNGs side by side) and approves or requests changes. The baselines and the code land in the same commit, so reverting the commit reverts both.
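The in-container update step can look like this. The image tag and mount paths are assumptions; the tag must match your @playwright/test version, which is the whole point of running it in Docker:

```shell
# Regenerate baselines inside the same image CI uses, so font rendering
# matches byte for byte. v1.49.0-jammy is an example tag; pin your own.
docker run --rm \
  -v "$PWD":/work -w /work \
  mcr.microsoft.com/playwright:v1.49.0-jammy \
  npx playwright test tests/visual --update-snapshots
```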

The anti-pattern is letting anyone update baselines on main without review. This turns the suite into a rubber stamp. I have watched teams lose an entire quarter of regression coverage this way because the baselines drifted further and further from what the product actually looked like.

Scaling Costs and the Self-Hosted Path

Commercial visual testing tools charge per screenshot or per build, and the pricing gets silly fast. A mid-size team with 300 visual checks running on 20 PRs a day hits 180,000 screenshots a month. At the rates Percy, Chromatic, and Applitools charge in 2026, that is between $3,500 and $7,500 per month, sometimes more once cross-browser multiplies the count.

A self-hosted Playwright suite running on GitHub-hosted runners costs the compute time of the CI job, which is usually under $200 a month for the same volume. The tradeoff is that you own the storage and comparison infrastructure. For teams with a real CI platform this is fine. For teams that do not want to own any of it, a paid service is reasonable. The middle path is the one Assrt targets: generate the tests with AI, keep them as plain Playwright files, run them on your own CI, store baselines in git.

Storage is worth a note. Baselines are PNGs and they grow with the suite. A 500-test suite at 1440p produces roughly 300MB of baselines, which git handles fine. At 5,000 tests you want git LFS. Past 20,000 tests you probably want a blob store referenced by hash. These are happy problems and you will cross them one at a time.
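If you reach the LFS tier, the switch is small. The glob below is an assumption based on Playwright's default `-snapshots` folder naming; existing baselines already in history need `git lfs migrate` separately:

```shell
# Track baseline PNGs with git LFS instead of storing them as regular blobs.
git lfs install
git lfs track "tests/**/*-snapshots/**/*.png"
git add .gitattributes
git commit -m "Move visual baselines to LFS"
```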

What to Automate First

Do not try to cover the whole app on day one. Start with the pages that cost the most when they break, which is almost always the top of the funnel and the money pages. A practical order of operations:

  1. Marketing home, pricing, and signup. One full-page capture each at desktop and mobile. Six baselines, done in an afternoon.
  2. Checkout or core onboarding flow. Element-level captures of the form, the summary, and the confirmation screen. This is where the revenue leaks happen.
  3. The three most-viewed screens inside the app. Full-page captures with dynamic content masked. This is where the support tickets come from.
  4. Design system components. A Storybook-based visual test per variant. This is where the slow drift shows up.
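For the design-system step, a loop over story IDs keeps the spec tiny. This sketch assumes Storybook's standard iframe URL and the `#storybook-root` element from Storybook 7+; the story list is hypothetical:

```typescript
import { test, expect } from '@playwright/test';

// Hypothetical story IDs; list yours, or generate the list from
// Storybook's index instead of hardcoding it.
const stories = ['button--primary', 'button--danger', 'badge--default'];

for (const id of stories) {
  test(`story ${id} matches baseline`, async ({ page }) => {
    // Storybook serves each story in isolation at iframe.html?id=<story-id>.
    await page.goto(`http://localhost:6006/iframe.html?id=${id}&viewMode=story`);
    await page.evaluate(() => document.fonts.ready);
    await expect(page.locator('#storybook-root')).toHaveScreenshot(`${id}.png`);
  });
}
```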

Ten tests in the first week will catch more real regressions than a hundred tests written over a quarter. The constraint is not test count, it is how fast the feedback loop runs and how quickly people trust it. Build the trust first, then add coverage.

Closing

Automated visual regression is not hard to set up. It is hard to keep running. The tooling in 2026 is good enough that any team with Playwright and a CI pipeline can have a working suite by the end of a Friday. What kills most setups is the stuff that happens next: flakiness that erodes trust, baselines that drift, reviewers who stop looking at diffs, and vendor bills that creep up until someone decides to rip everything out.

The setup in this guide is boring on purpose. It uses plain Playwright, stores baselines in git, runs on standard CI, and fails loudly when something looks wrong. Assrt exists to shortcut the generation step so you can skip the boilerplate and keep the parts that matter. Everything it emits is code you own.
