AI-generated Playwright tests review: watch the run, not the .spec.ts

Every guide on reviewing AI-generated Playwright tests says to read the code: check the selectors, look for hallucinated APIs, verify the assertions test something real. That advice is not wrong, it is just incomplete. A .spec.ts file can be perfectly written and still pass because the agent took a path that never touched the bug. Reading the code confirms syntax. Only the run confirms behavior.

This is the reviewer-first guide. The unit of review is the run, not the file. Four artifacts, three minutes per test, zero hallucinated selectors to audit by eye.

Matthew Diakonov · 10 min read · Rated 4.8 by 214 engineers
  • Plan at /tmp/assrt/scenario.md in plain Markdown with #Case blocks
  • Every click bound to a Playwright MCP ref like ref='e5' from a live snapshot
  • Per-step PNG at /tmp/assrt/<runId>/screenshots/NN_stepN_action.png
  • 14 fixed tools in the agent's MCP schema

The agent calls a fixed list of 14 tools defined in assrt-mcp/src/core/agent.ts lines 14-196. Every click, type, and assertion goes through that schema. You cannot hallucinate a Playwright API when the surface is a tool schema, not a code generator.

Reviewing the tool-call trace instead of .spec.ts

Why .spec.ts review misses the bug

Here is the classic failure mode. An AI assistant writes a Playwright test for the checkout flow. The code compiles. The selectors look sensible. The assertions read well. CI runs it and the green dot appears. Four days later a real customer reports the order button does nothing, and you discover the test was clicking a v2 button that never shipped, and asserting on the word “discount” which appears in the footer on every page.

Reading the file does not catch this. You would need to know the real DOM, remember which button version is deployed, and mentally simulate the assertion against the full page. Reviewers do not do that; they skim. The cheaper review is to watch what happened: did the cursor click the actual Add to cart button? Did the assertion evidence quote the correct subtotal? Those two questions take seconds and answer the thing code review cannot.

Hallucinated .spec.ts versus the real tool-call trace

Left tab: a plausible-looking .spec.ts a reviewer might nod at. It invents a data-testid that does not exist, asserts on a regex that passes on any page, and calls a fictitious page.fillFormFields method. Right tab: what Assrt actually records for the same flow. Every call is one of 14 fixed MCP tools, every click carries the accessibility ref from a live snapshot, and every assertion has evidence you can read.


import { test, expect } from "@playwright/test";

test("guest checkout with promo", async ({ page }) => {
  await page.goto("https://staging.example.com/products/sku-42");
  // AI-invented, not in the real DOM
  await page.locator("[data-testid='add-to-cart-btn-v3']").click();
  await page.getByRole("button", { name: "View Cart" }).click();
  await page.locator("#promo-input-field").fill("PROMO25");
  await page.locator(".btn.btn-primary.promo-apply").click();
  // Assertion that will pass on any page with the word "discount"
  await expect(page.getByText(/discount/i)).toBeVisible();
  await page.locator(".checkout-guest-btn").click();
  // Hallucinated API on a fictitious fixture
  await page.fillFormFields({ name: "Test", zip: "10001" });
  await page.getByRole("button", { name: "Place order" }).click();
  await expect(page).toHaveURL(/thanks/);
});
The Assrt trace for the same flow: 65% fewer tool calls, every one in a fixed MCP schema.
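For contrast, here is a hypothetical excerpt of the tool-call trace Assrt records for the same flow. The field names and JSON shape are illustrative (the real trace format may differ); the tool names, refs, and evidence string follow the examples elsewhere on this page:

```json
[
  { "step": 1, "tool": "navigate", "url": "https://staging.example.com/products/sku-42" },
  { "step": 2, "tool": "snapshot" },
  { "step": 3, "tool": "click", "ref": "e5", "element": "Add to cart button" },
  { "step": 4, "tool": "type_text", "ref": "e12", "element": "Promo code field", "text": "PROMO25" },
  { "step": 5, "tool": "assert",
    "claim": "subtotal shows the discounted price",
    "evidence": "text $24.99 found next to ref=e22" }
]
```

Every entry names a fixed tool, a ref from a live snapshot, and, for assertions, evidence a human can verify by eye.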

The four artifacts a reviewer opens

An Assrt run writes a small, predictable set of files. Each one is addressable at an exact path. The reviewer does not need a dashboard, a trace viewer, or a proprietary desktop client. A terminal, a text editor, and any video player are enough.

The #Case Markdown plan

/tmp/assrt/scenario.md is a plain text file using `#Case N:` blocks. The reviewer reads intent in English, commits to git, and diffs across runs. No DSL, no YAML, no proprietary file.

Accessibility-tree refs

Every click and type names an element ref like `e5` that comes from a live `browser_snapshot` of the page. The reviewer sees a provable handle, not a CSS selector the model guessed.

Zero-padded PNGs

/tmp/assrt/<runId>/screenshots/NN_stepN_action.png. Filenames carry the step number and tool name (click, type_text, assert) so a file manager is enough to audit the run.

Standalone WebM recording

/tmp/assrt/<runId>/video/recording.webm with a painted cursor, click ripples, and keystroke toasts. Plays in VLC, Chrome, or ffmpeg without a trace viewer.

JSON result file

/tmp/assrt/results/latest.json is the full report with per-assertion evidence. `jq '.cases[] | select(.passed == false)'` lists every failing case from the command line.
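The screenshot naming convention is mechanical enough to sketch. A minimal TypeScript reconstruction of the scheme (an illustration of the described format, not the actual server.ts code):

```typescript
// Reconstructs the zero-padded screenshot name NN_stepN_action.png
// described above. Two-digit index keeps shell sort order correct.
function screenshotName(index: number, step: number, tool: string): string {
  const nn = String(index).padStart(2, "0");
  return `${nn}_step${step}_${tool}.png`;
}

console.log(screenshotName(2, 3, "type_text")); // "02_step3_type_text.png"
console.log(screenshotName(5, 6, "assert"));    // "05_step6_assert.png"
```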

How a #Case Markdown plan becomes a reviewable run

The reviewer sees three layers. A Markdown plan goes in. The agent runs it through a fixed Playwright MCP tool schema. The outputs are the files on your disk. Because the middle layer is a constrained schema, the reviewer can trust that no step invented a Playwright API it does not have access to.

Plan -> Playwright MCP schema -> artifacts on disk

  • Input: scenario.md, the #Case plan
  • Runtime: the agent driving a live page through the 14 MCP tools
  • Output: results/latest.json, screenshots/*.png, video/recording.webm, plus a snapshot of scenario.md
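The constraint in the middle layer can be pictured as a whitelist check. A minimal TypeScript sketch, assuming the tool names listed on this page; the real agent.ts schema also validates each tool's arguments, which this sketch omits:

```typescript
// Sketch: a fixed tool surface means an unknown method name is rejected
// before it ever reaches the browser.
const TOOLS = new Set([
  "navigate", "snapshot", "click", "type_text", "select_option",
  "scroll", "press_key", "wait", "screenshot", "evaluate",
  "create_temp_email", "wait_for_verification_code", "assert", "complete_scenario",
]);

function dispatch(call: { tool: string; args: Record<string, unknown> }): void {
  if (!TOOLS.has(call.tool)) {
    // A hallucinated API like "fillFormFields" dies here, not in CI days later.
    throw new Error(`unknown tool: ${call.tool}`);
  }
  // ...forward the validated call to Playwright MCP...
}

dispatch({ tool: "click", args: { ref: "e5" } }); // accepted: a known tool
```

This is why the agent cannot invent a Playwright API: the only surface it can touch is the fixed schema.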

What the plan you are reviewing looks like

Here is a real scenario.md a reviewer would open. Two cases, ten steps, zero CSS selectors. The Markdown is the source of truth. Edit it in your editor, save, and the runtime re-syncs in one second.

/tmp/assrt/scenario.md
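Since the file itself is not reproduced here, below is a hypothetical plan in the `#Case N:` shape this page describes. The SKU, promo codes, and prices are illustrative, borrowed from the checkout example earlier on this page:

```markdown
#Case 1: Guest checkout with promo code
Open the product page for SKU-42.
Add the item to the cart and view the cart.
Apply promo code PROMO25 and expect the subtotal to drop to $24.99.
Continue as guest, fill in the name and ZIP, and place the order.
Expect the confirmation page URL to contain "thanks".

#Case 2: Expired promo code is rejected
Open the cart with one item in it.
Apply promo code EXPIRED10.
Expect an error saying the code is not valid and an unchanged subtotal.
```

Note what is absent: no selectors, no waits, no DOM internals. Each line names an outcome a user would recognize.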

Your review workflow, step by step

The full loop, in the order a reviewer should run it. Intent first (Markdown), behavior second (video), evidence third (JSON), diagnosis last (only on red).

1. Skim the #Case Markdown for intent

Open /tmp/assrt/scenario.md. Each `#Case N:` block is an English description of one user flow. Ask: does this actually test what we care about, or is it a happy path that skips the bug?

2. Play the WebM at 5x

Hit Space to start, keep the speed at 5x. Watch the red cursor dot click the real elements. Keystroke toasts show what got typed. A 30-second run finishes in 6 seconds.

3. Spot-check one or two screenshots

Open the highest-index PNG in /tmp/assrt/<runId>/screenshots/. That is the final state the agent saw. If the assertion passed and the screenshot agrees, the test verified. If the screenshot shows a modal the agent never clicked, you caught a hallucinated pass.

4. Check the assertion evidence

jq '.cases[].assertions' /tmp/assrt/results/latest.json. Every assertion carries an `evidence` string: 'text $24.99 found next to ref=e22', 'URL matches /thanks'. Evidence in free text beats a green dot in a dashboard.

5. If anything failed, run assrt_diagnose

The diagnose tool returns root cause (app bug, test bug, env), recommended fix, and a corrected #Case you can paste straight into scenario.md. You do not write the fix by hand.

6. Commit the #Case to tests/scenarios/

The Markdown plan is your durable artifact. Drop it in tests/scenarios/checkout.md, push, and the next contributor gets the same reviewable, rerunnable test without the .spec.ts baggage.

The review itself, in a terminal

Three commands and a video player. No dashboard, no proprietary viewer, no extra install.

assrt-review.sh
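A sketch of what that script could look like, assuming the paths documented on this page (the runId argument and the script itself are illustrative, not shipped with Assrt):

```shell
#!/bin/sh
# assrt-review.sh -- sketch of the three-command review loop.
# Usage: assrt-review.sh [runId]

ASSRT_DIR="${ASSRT_DIR:-/tmp/assrt}"
RUN_ID="${1:-latest}"

# 1. Intent: skim the #Case plan.
if [ -f "$ASSRT_DIR/scenario.md" ]; then
  cat "$ASSRT_DIR/scenario.md"
fi

# 2. Evidence: every failing case with its assertion evidence.
if [ -f "$ASSRT_DIR/results/latest.json" ]; then
  jq '.cases[] | select(.passed == false)' "$ASSRT_DIR/results/latest.json"
fi

# 3. Behavior: point any player at the recording (VLC, Chrome, mpv all work).
echo "play: $ASSRT_DIR/$RUN_ID/video/recording.webm"
```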
  • 14 Playwright MCP tools the agent can call (fixed schema)
  • 0 hallucinated selectors committed to the plan
  • 5x default video playback speed in the built-in player
  • 1000 ms fs.watch debounce before scenario.md re-syncs

Reviewing .spec.ts versus reviewing an Assrt run

Same feature, two review surfaces. One is a code artifact you read; the other is a run trace you watch.

Feature | .spec.ts code review | Assrt run review
What the reviewer opens | A .spec.ts file full of selectors and assertions | A Markdown plan, a short WebM, and a JSON report
How elements are identified | String selectors the model invented (may not exist) | Playwright MCP accessibility refs like ref='e5' from a live snapshot
Hallucinated APIs | page.fillFormFields, page.clickByLabel, invented helpers | Agent can only call 14 tools from a fixed MCP schema
Does the test actually verify the feature? | You guess from reading assertions | You watch the video and check the assertion evidence field
Drift resistance over weeks | CSS selectors rot as the DOM changes | Refs are computed at run time from the current accessibility tree
Version control | Yes, as source code | Yes, as plain Markdown under tests/scenarios/*.md
Where the source of truth lives | A vendor cloud editor or a proprietary YAML DSL | A file on your disk you can cat, grep, and git add
Review time per test | Minutes of code reading plus a CI run | Seconds at 5x video speed plus one jq query
Fallback when a test fails mid-review | Open the trace viewer, hunt for the line that broke | Call assrt_diagnose, get a root-cause verdict plus a corrected #Case

What passes review

A concrete checklist. If every item is green for a given run, the test both executes correctly and verifies the intended feature. If any item is red, send the #Case back for rewrite or call assrt_diagnose on the failing run.

Reviewer checklist

  • Every click and type names a ref like 'e5' pulled from a fresh browser_snapshot
  • The #Case names the outcome, not the selector ('click Add to cart' not 'click .btn-v2')
  • Every assertion in results/latest.json has an `evidence` string you can verify by eye
  • The final screenshot matches the expected page state for the case
  • The recording plays without jump-cuts or suspicious gaps (the agent did not get stuck)
  • The plan avoids CSS selectors and DOM-internals; it describes what a user would do
  • No tool call in the trace is an invented method; every tool is one of the 14 in the schema

Where the anchor fact lives in the source

Three files pin this whole workflow down. Clone the Assrt repositories and the claim above is auditable.

  • assrt-mcp/src/core/agent.ts lines 14 to 196 define the 14-tool MCP schema. Every agent step must be one of these.
  • assrt-mcp/src/mcp/server.ts line 468 is the zero-padded PNG naming (two-digit index, step number, tool name): 02_step3_type_text.png, 05_step6_assert.png, and so on.
  • assrt-mcp/src/core/scenario-files.ts lines 16-48 pin the plan to /tmp/assrt/scenario.md and start an fs.watch that re-syncs on save.
  • assrt-mcp/src/mcp/server.ts line 240 is the assrt_diagnose prompt: senior QA persona, root-cause verdict, corrected #Case on output.

Want a live review of your AI-generated Playwright tests?

Bring a failing run, walk through the scenario.md and the WebM with a human, leave with a concrete next step.

Book a call

Frequently asked questions

Why is reviewing AI-generated Playwright test code not enough?

A .spec.ts file can be syntactically perfect and still not verify the feature. The model can invent a selector like getByTestId('submit-btn-v2') that matches nothing, write an assertion against a string that never appears, or take a happy path that skips the actual bug. Reading the code tells you the test compiles; only running it against a real browser tells you the test verifies. That is the gap: reviewing code confirms syntax, reviewing a run confirms behavior.

What are the three artifacts to review in an Assrt run?

First, /tmp/assrt/scenario.md, the plan in plain Markdown with `#Case N:` blocks. Second, the accessibility-tree ref the agent clicked (ref='e5', ref='e12', etc.) pulled from a `browser_snapshot` call on the live page. Third, the per-step PNG at /tmp/assrt/<runId>/screenshots/NN_stepN_action.png and the standalone WebM at /tmp/assrt/<runId>/video/recording.webm. Reading those three answers the question 'did this test actually verify the feature' in about a minute.

What is a ref=e5 and why should a reviewer care?

Playwright MCP's `browser_snapshot` tool returns the page as an accessibility tree and tags every focusable element with an opaque id like `e5`. When the agent clicks, it passes `ref='e5'` back to Playwright, which looks up the element by that id on the live page. There is no CSS selector to hallucinate. If the snapshot did not contain the element, the agent cannot click it; if the snapshot says it is a button, it is a button. As a reviewer, you can open the run's trace, see `click ref='e5' element='Submit order button'`, and know the agent clicked a real, identified element rather than guessing a selector.
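For a sense of what the reviewer sees, here is a hypothetical snapshot excerpt. The exact output format depends on the Playwright MCP version in use; the refs follow the examples on this page:

```
- button "Add to cart" [ref=e5]
- textbox "Promo code" [ref=e12]
- text "$24.99" [ref=e22]
- link "View cart" [ref=e23]
```

A click step then reads `click ref='e5'`, and the element it resolves to is a matter of record, not of selector luck.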

Where exactly do I look on disk to review a run?

Four paths. `/tmp/assrt/scenario.md` holds the #Case plan. `/tmp/assrt/results/latest.json` holds the full run report as JSON (plan snapshot, per-case timing, assertion evidence). `/tmp/assrt/<runId>/screenshots/` holds the PNGs, named `00_step1_init.png`, `01_step2_click.png`, and so on (zero-padded so shell sort stays correct). `/tmp/assrt/<runId>/video/recording.webm` is the standalone recording. You can `cat`, `jq`, `open`, and `ffmpeg` every one. There is no dashboard required to do the review.

How is reviewing an Assrt run different from reviewing a Playwright codegen .spec.ts?

Playwright codegen emits brittle selectors like page.locator('div:nth-child(3) > button') that look fine in a pull request but fail the week you ship a new div. Reviewing it means guessing which selectors will drift. Assrt never commits CSS selectors to the plan; the plan names elements in English ('click the Add to cart button') and the runtime pairs each click with the accessibility ref at run time. You review the Markdown for intent and the video for behavior, not a file of selectors that will rot.

What about hallucinated Playwright APIs? Can AI-generated tests invent methods?

Hallucinated APIs are a real pain point with code-generating tools. Assrt sidesteps that class of bug because the agent does not emit Playwright code at all. It calls a fixed list of tools defined in `assrt-mcp/src/core/agent.ts` lines 14 to 196: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, assert, and complete_scenario, plus http_request and wait_for_stable. Every call goes through Playwright MCP's tool schema. If the model tries to invent a method name, the MCP server rejects the call. You cannot hallucinate a Playwright API when the surface is a tool schema.

If a test fails, how do I know whether the app is broken or the test is?

Call `assrt_diagnose`. The prompt at assrt-mcp/src/mcp/server.ts line 240 feeds the failing case, the evidence, and the URL to a senior-QA-engineer persona that returns a root-cause verdict: bug in the app, flaw in the test, or environment issue, plus a corrected `#Case` you can paste back into scenario.md. Review workflow: watch the video to see what actually happened, then run diagnose to classify the failure. Most drift-type failures are the test; most regressions are the app.

How long does reviewing one AI-generated Playwright test take with Assrt?

Rough benchmark: the bundled video player auto-plays at 5x speed with keyboard shortcuts (Space to pause, the 1/2/3/5/10 digit keys to change speed, arrows to seek 5 s). A 30-second test run plays back in 6 seconds. Add 30 seconds to skim the #Case plan and 30 seconds to check the assertion evidence in results/latest.json, and one test takes just over a minute to clear, which is faster than reviewing a .spec.ts by eye and much more honest about whether it actually passes.

Can the plan file be version-controlled and diffed across runs?

Yes. The scenario is a plain Markdown file with `#Case N:` blocks. Commit `/tmp/assrt/scenario.md` to your repo, or copy each block into a `tests/scenarios/` directory. `git diff` shows exactly what changed between runs in human-readable text, not a DSL. The runtime watches the file with fs.watch (scenario-files.ts line 97); save in any editor and it re-syncs with a 1-second debounce. The review workflow is the one reviewers already use for any other Markdown file in the repo.

Does any of this work with the tests I already have in Playwright code?

Two answers. One: Assrt does not replace an existing Playwright suite; it runs alongside. If you have 400 .spec.ts files, keep them. Use Assrt for the tests a human would rather describe in English than write in code, and for the flows the AI keeps failing to author correctly. Two: the agent connects to a real Chrome instance via Playwright's `--extension` mode, so the runtime is the same Playwright you already use. You are not adopting a parallel testing stack; you are writing Markdown that drives it.
