Debuggable Playwright automation, when the driver is an LLM
Every guide on the first page of Google for this phrase is about debugging tests you wrote. Breakpoints. Inspector. Trace Viewer. That playbook falls over the moment the entity picking the next click is a language model. You cannot set a breakpoint on a decision that has not happened yet. What you actually need is a trail: a recording where the mouse is visible, a player where you can scrub at speed, and a folder on disk that tells you what the agent saw and what it decided, step by step. Assrt ships all of this automatically. This guide walks through the pieces and the exact files they come from.
The top results for this keyword all debug tests you wrote
Search "debuggable Playwright automation" and the first page is uniformly the Playwright Inspector, UI Mode, Trace Viewer, VS Code breakpoints. Excellent tools. Useless in 2026 when the thing driving the browser is a model that chose its own click. You cannot pause a decision. You cannot step into a snapshot the agent made seventeen tool calls ago. The debug question is no longer "what does this line of my test do?" It is "what did the agent see, what did it pick, and what happened on the page when it did it?". Answering that needs a recording where the cursor is visible and a structured log you can read after the fact.
The LLM already picked its next tool call before you could set one. The choice is in the past by the time the error surfaces.
You replay at 5x, find the frame where the button was clicked, then diff events.json against a successful run to see which tool call changed.
Traditional debugging vs. AI-driven run debugging
// Debugging a Playwright test YOU wrote.
// The driver is deterministic. The playbook is well-known.
// 1. Add a breakpoint
await test.step("click submit", async () => {
  await page.pause(); // Inspector opens
  await page.getByRole("button", { name: "Submit" }).click();
});
// 2. Step through. Inspect locators. Edit selectors.
// 3. Trace viewer shows every action you called.
// 4. You already know what each line is supposed to do.
// npx playwright test --debug
// npx playwright show-trace trace.zip
The red cursor is injected into the DOM before every action
Headless Chromium does not render a native mouse cursor. If you just recorded the browser tab, the .webm would be a slideshow of DOM states with no indication of what was clicked or typed. Assrt's fix is to treat the overlay as part of the page: before each click or keystroke, it injects a small script that creates four fixed-position elements, each with z-index 2147483647 (signed int32 max, the canonical "above everything" number). A red dot for the cursor, a 40px ripple that scales on click, a black keystroke toast, and a 6px green heartbeat in the bottom right. The heartbeat exists purely to force CDP compositor frames during idle moments so the recording has no gaps while the agent is thinking. Every one of those choices is in one file you can read.
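The four elements can be sketched as plain style records. This is a minimal illustration of the layering decision described above, not the real CURSOR_INJECT_SCRIPT; the names (`overlayStyles`, `cssText`) are hypothetical, and in a browser context this logic would run inside a `page.evaluate()` call that actually creates the elements.

```typescript
// z-index chosen to sit above anything a page can reasonably stack.
const Z_TOP = 2147483647; // signed int32 max

interface OverlayStyle {
  position: "fixed";
  zIndex: number;
  width: string;
  height: string;
}

// One record per overlay element; sizes follow the description above.
const overlayStyles: Record<string, OverlayStyle> = {
  cursorDot:   { position: "fixed", zIndex: Z_TOP, width: "20px", height: "20px" },
  clickRipple: { position: "fixed", zIndex: Z_TOP, width: "40px", height: "40px" },
  keyToast:    { position: "fixed", zIndex: Z_TOP, width: "auto", height: "auto" },
  heartbeat:   { position: "fixed", zIndex: Z_TOP, width: "6px",  height: "6px"  },
};

// Render the inline CSS each element would receive on injection.
function cssText(s: OverlayStyle): string {
  return `position:${s.position};z-index:${s.zIndex};width:${s.width};height:${s.height};`;
}
```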
“The overlay is the difference between a .webm you can diagnose and a .webm you have to re-run.”
Assrt changelog, overlay v2
What a debuggable run leaves on disk
Every run dumps a predictable set of files under /tmp/assrt/ so you never have to log into a dashboard to see what happened. The paths are defined in scenario-files.ts lines 16-20 and referenced by every tool in the stack. Grep, jq, diff, and tail work on all of them.
recording.webm
A .webm of the session with the red cursor, click ripple, keystroke toast, and heartbeat pulse overlaid into the DOM so the video is actually watchable at 5x. Output of Playwright's --caps devtools video capture.
player.html
A single self-contained HTML file with keyboard scrubbing (Space, ←, →, 1/2/3/5/0). Ships next to the .webm in the run directory. Works offline, no JS bundle.
events.json
Every tool call the agent made, in order, with duration in ms and the args it used. Diff two of these to see where behavior drifted between runs.
scenario.md
The test plan as plain Markdown at /tmp/assrt/scenario.md. Edit it live during a run; it is watched with fs.watch and auto-synced back to cloud storage on debounce.
results/latest.json
Structured pass/fail, per-assertion evidence, per-step description. The shape expected by CI systems. Readable in three seconds, parseable in one.
execution.log
Full stderr: [mcp] <tool> args (Xms) lines plus reasoning deltas. Tail this while the run is live to watch the agent think.
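Because the outputs are plain JSON on disk, the triage step reduces to a few lines of script. A minimal sketch of summarizing results/latest.json into a one-line verdict; the field names (`passed`, `assertions`, `evidence`) are assumptions about the shape, not a documented schema:

```typescript
// Assumed shape of one assertion record in results/latest.json.
interface AssertionResult {
  description: string;
  passed: boolean;
  evidence?: string;
}
interface RunResult {
  passed: boolean;
  assertions: AssertionResult[];
}

// Reduce a run result to the one line you want at 3am.
function summarize(result: RunResult): string {
  const failed = result.assertions.filter((a) => !a.passed);
  return result.passed
    ? `PASS (${result.assertions.length} assertions)`
    : `FAIL: ${failed.map((a) => a.description).join("; ")}`;
}

// In practice you would read the file:
//   JSON.parse(fs.readFileSync("/tmp/assrt/results/latest.json", "utf8"))
const sample: RunResult = {
  passed: false,
  assertions: [
    { description: "cart shows 2 items", passed: true },
    { description: "checkout button enabled", passed: false, evidence: "button disabled" },
  ],
};
console.log(summarize(sample)); // → FAIL: checkout button enabled
```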
How a single click becomes a debuggable frame
When the agent decides to click "Submit", here is the exact chain that fires. Every arrow corresponds to a real tool call you can find in events.json after the run.
click('Submit') — the round trip
Everything a run takes in, everything a run gives back
The shape of the trail is deliberate. The inputs are loose (a URL, a plan in Markdown). The outputs are strict and file-system-native (a .webm, a player, a JSON log, a pass/fail record). This is why grepping, diffing, and scripting over debug sessions is trivial.
Assrt: the shape of a debuggable run
A failing run, debugged end-to-end, in two terminal blocks
Here is the full workflow when a scenario fails. One run produces enough artifacts that you can reach a verdict without re-running.
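The first move in that workflow is always the same: resolve the run directory into its artifact paths, then scrub and diff. A minimal sketch under the layout described earlier; the helper name is hypothetical:

```typescript
import * as path from "node:path";

// Map a run id to the artifacts every run leaves behind.
// Layout follows the /tmp/assrt/<runId>/ convention described above.
function runArtifacts(runId: string, root = "/tmp/assrt"): Record<string, string> {
  const dir = path.join(root, runId);
  return {
    recording: path.join(dir, "recording.webm"),
    player: path.join(dir, "player.html"),
    events: path.join(dir, "events.json"),
    log: path.join(dir, "execution.log"),
  };
}
```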
The five things Assrt does differently, per run
Overlay is injected before each action
Assrt calls browser_evaluate with CURSOR_INJECT_SCRIPT, drops four fixed-position elements into the page DOM, and restores the cursor's last position without animation so it does not jump in from off-screen on navigation.
Every tool call is logged with its duration
browser.ts line 408 emits '[mcp] <name> args (Xms)' on stderr per call. Navigations, snapshots, clicks, types, all timed. Transport failures mark the client dead so the next call returns a clear error instead of hanging.
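The wrapper pattern behind that log line is simple enough to sketch. This is an approximation of the '[mcp] <name> args (Xms)' format, not the actual browser.ts code; `timedCall` and its signature are illustrative:

```typescript
type ToolFn = (args: Record<string, unknown>) => Promise<unknown>;

// Time a tool call and emit one stderr line per call, even on failure:
// the finally block guarantees the timing line is logged either way.
async function timedCall(
  name: string,
  args: Record<string, unknown>,
  fn: ToolFn,
  log: (line: string) => void = (l) => process.stderr.write(l + "\n"),
): Promise<unknown> {
  const start = Date.now();
  try {
    return await fn(args);
  } finally {
    log(`[mcp] ${name} ${JSON.stringify(args)} (${Date.now() - start}ms)`);
  }
}
```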
Accessibility trees are written to disk, not inlined
Playwright MCP is launched with --output-mode file --output-dir ~/.assrt/playwright-output. Snapshots land as .yml files up to 120k chars. Huge pages (Wikipedia-scale) are truncated with a visible note so the agent's context does not blow up mid-run.
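The truncation itself can be sketched as a cap with a visible marker. The 120k figure is from the text above; the function name and note format are hypothetical:

```typescript
// Character budget for one accessibility snapshot, per the text above.
const SNAPSHOT_CHAR_LIMIT = 120_000;

// Clip an oversized snapshot and append a note the agent can see,
// so truncation is explicit rather than a silent cut.
function capSnapshot(yaml: string, limit = SNAPSHOT_CHAR_LIMIT): string {
  if (yaml.length <= limit) return yaml;
  const note = `\n# [truncated: snapshot was ${yaml.length} chars, limit ${limit}]`;
  return yaml.slice(0, limit) + note;
}
```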
Video, log, events, screenshots all land under one run directory
Per-run artifacts go to /tmp/assrt/<runId>/. The run directory is the unit of investigation: every recording has its events next to it, every events file has its log next to it, every failure has its screenshot next to it.
Browser is kept alive on opt-in
keepBrowserOpen: true detaches the Playwright MCP child process (unref stdin/stdout/stderr), clears the transport without calling close(), and lets you take over the exact state the agent left. You can open DevTools, inspect Redux, check network waterfalls.
The player ships with the recording, zero runtime required
Every run generates a player.html next to its recording.webm. It is a single file, no bundler, no vendor iframe. Open it anywhere a browser runs. The default speed is 5x because agentic runs are paced by LLM decisions; at 5x clicks still feel connected to the text you can read on the page. The keyboard map mirrors what most developers expect from video editors.
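The keyboard map itself fits in a switch. A minimal sketch of the mapping described above (Space, arrows, 1/2/3/5/0 for 1x/2x/3x/5x/10x); the real player.html is generated server-side, and the type names here are illustrative:

```typescript
type PlayerAction =
  | { kind: "togglePlay" }
  | { kind: "seek"; seconds: number }
  | { kind: "speed"; rate: number };

// Map a KeyboardEvent.key value to a player action; null means unhandled.
function keyToAction(key: string): PlayerAction | null {
  switch (key) {
    case " ":          return { kind: "togglePlay" };
    case "ArrowLeft":  return { kind: "seek", seconds: -5 };
    case "ArrowRight": return { kind: "seek", seconds: 5 };
    case "1": return { kind: "speed", rate: 1 };
    case "2": return { kind: "speed", rate: 2 };
    case "3": return { kind: "speed", rate: 3 };
    case "5": return { kind: "speed", rate: 5 };
    case "0": return { kind: "speed", rate: 10 };
    default:  return null;
  }
}
```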
What a debuggable run looks like, in practice
The features below are not marketing bullets. They each trace to a specific line in the Assrt MCP source.
| Feature | Typical AI QA tool | Assrt |
|---|---|---|
| Visible mouse and keystrokes in the video | Native cursor is not rendered in headless Chromium | DOM-injected red cursor, 40px click ripple, keystroke toast |
| Player | Raw .webm in a file manager, clunky to scrub | player.html with Space, ←, →, 1/2/3/5/0 keyboard shortcuts |
| Default playback speed | 1x by default; you scrub with the timeline | 5x by default, tuned for autonomous-agent pacing |
| Takeover after a run | Browser closes the moment the run ends | keepBrowserOpen: true detaches the child process, browser stays |
| Real Chrome session | Headless Chromium, lose your logins and cookies | extension: true attaches to your running Chrome via MCP extension |
| Tool-call audit log | Opaque vendor timeline in a cloud UI | events.json on disk, grep-able and diff-able |
| Idle frame capture | Compositor skips frames while the model thinks | 6px heartbeat pulse forces continuous CDP frames |
| Test format | Proprietary YAML or vendor dashboard rows | Plain Markdown #Case at /tmp/assrt/scenario.md |
| Execution engine | Custom runner, opaque internals | @playwright/mcp subprocess over stdio, real browser_* calls |
| Cost | Closed competitors at $7.5K/mo | $0 plus LLM tokens, self-hosted, open source |
The ten things that make a Playwright run debuggable in 2026
- The mouse and keystrokes appear in the recording
- The player opens at 5x by default
- Space pauses, 1 drops back to 1x, arrows seek 5s
- events.json lists every tool call with millisecond timings
- scenario.md is editable and watched for changes
- execution.log streams to stderr during the run
- results/latest.json has pass/fail and per-assertion evidence
- keepBrowserOpen leaves the browser on the exact final state
- extension mode reuses your real Chrome logins
- Tests are plain Markdown you can commit with the PR
The point is you own every artifact
The reason to care about any of this is that when an AI-driven test goes wrong in CI at 3am, you have minutes, not hours, to reach a verdict. A closed vendor dashboard that was fine during the demo becomes a wall: someone has to log in, someone has to grant access, someone has to figure out which dashboard the failing run landed in. A folder on disk with a recording, a player, a log, and a results file is a flashlight you can just point. That is what debuggable means in 2026. Assrt is the version of that flashlight that is open-source, self-hosted, and built on the exact same Playwright your team already uses.
If you want to see it run against a repo of your choice and watch the trail it leaves, book a call and we will walk through it live, including the keep-browser-open takeover.
See a debuggable run end-to-end
20 minutes, your repo or ours. We scrub the recording, diff events.json, and hand you the keep-open browser.
Book a call →
FAQ: debuggable Playwright automation
What does 'debuggable' actually mean for AI-driven Playwright automation?
It means the opposite of what it means for a hand-written Playwright test. When you wrote the test, debuggable means a breakpoint, the Playwright Inspector, and stepping through statements you chose. When an LLM is picking the next click, debuggable means you need a trail that lets you answer three questions after the fact: what did the agent see, what did it decide to do, and what happened on the page when it did it. That is a video plus a structured log plus an on-disk snapshot of the plan and the results. Assrt paints the mouse cursor and keystrokes into the recording, writes every tool call to stderr with timings, saves the plan at /tmp/assrt/scenario.md, the outcome at /tmp/assrt/results/latest.json, and leaves a .webm plus events.json per run under /tmp/assrt/<runId>/. You scrub the video at 5x, read the log, and know inside of a minute why the run went sideways.
How is the red cursor in the recording implemented, and why bother?
It is implemented by CURSOR_INJECT_SCRIPT in /Users/matthewdi/assrt-mcp/src/core/browser.ts lines 33-98. Before each click or keystroke, the runner injects a small DOM overlay: a fixed-position 20px red dot at rgba(239,68,68,0.85) with a white border, a 40px click ripple that scales from 0.5 to 2 on each click, and a black keystroke toast at the bottom of the screen that shows what text is being typed. Without this, a .webm of a headless Chromium playing back at 5x is genuinely hard to follow because the browser does not render a native cursor in headless mode. With it, you can watch the agent navigate, click, type, and the video reads like a demo. The same file also plants a 6px green heartbeat pulse at rgba(34,197,94,0.6) in the bottom right. That pulse exists to force CDP compositor frames during otherwise-idle moments so the recording does not have gaps where the agent was thinking.
Where do the test artifacts actually live on disk after a run?
Predictable, static paths, defined in /Users/matthewdi/assrt-mcp/src/core/scenario-files.ts lines 16-20. The plan is at /tmp/assrt/scenario.md (editable, auto-syncs to cloud when you change it), metadata at /tmp/assrt/scenario.json, the latest run result at /tmp/assrt/results/latest.json, and per-run history at /tmp/assrt/results/<runId>.json. For each run, a separate run directory at /tmp/assrt/<runId>/ contains recording.webm, a self-contained player.html, events.json (the full tool-call trace), execution.log (stderr), and per-step screenshots. You can tail the log in another terminal while the run is live, or diff two events.json files between runs to see where the agent's decisions diverged.
What is the scrubbable player, and why does it default to 5x?
The player is generated as a single self-contained HTML file per run by generateVideoPlayerHtml in server.ts lines 42-108. It ships with Space to play/pause, ArrowLeft/ArrowRight to seek 5 seconds, and 1/2/3/5/0 for 1x/2x/3x/5x/10x playback. 5x is the default because autonomous runs are slower than a human recording: the agent snapshots the page, waits on the model, then acts, which makes real time too slow to watch. 5x is empirically the speed where clicks still feel connected to what you are reading on the page. If the run is short, 10x reduces a 90-second run to 9 seconds of watching. If something went wrong at a specific step, you tap 1, seek precisely, and see the overlay tell you exactly what happened.
What is the 'keep the browser open' mode and why is it useful for debugging?
Pass keepBrowserOpen: true to assrt_test and the browser stays alive after the run instead of closing. The implementation lives in /Users/matthewdi/assrt-mcp/src/core/browser.ts lines 672-695: it detaches the Playwright MCP child process, unrefs stdin/stdout/stderr, and clears the transport reference without calling client.close(). What you get is the exact browser state the agent left: same page, same cookies, same scroll position, same DOM. You can open DevTools and poke at it like you would after any manual repro. Most QA tools close the browser the moment the run ends and leave you with only a recording. Recordings are great for understanding intent; they are lousy for inspecting a live redux store or a CSS rule cascade. Keepalive + the recording together is the pair you want.
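The Node-level mechanics of that detach are worth seeing in isolation. A minimal sketch of the pattern (spawn detached, drop the stdio pipes, unref so the parent can exit without killing the child); this is not the actual browser.ts code, and `detach` is a hypothetical helper:

```typescript
import { spawn, type ChildProcess } from "node:child_process";

function detach(cmd: string, args: string[]): ChildProcess {
  const child = spawn(cmd, args, {
    detached: true,  // own process group: child survives the parent's exit
    stdio: "ignore", // no shared pipes holding the parent's event loop open
  });
  child.unref();     // parent's event loop no longer waits on the child
  return child;
}
```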
How does Assrt let me reuse my real Chrome session for debugging?
Pass extension: true to assrt_test. This uses Playwright's --extension flag (see browser.ts lines 299-307) to connect to a running Chrome instance via the Playwright MCP Chrome extension, rather than launching a headless or headed Chromium. You keep all your logins, cookies, tabs, and devtools extensions. The first time you use it, Chrome surfaces a one-time approval dialog and Assrt saves the resulting token at ~/.assrt/extension-token for future use. This is the mode you want when the bug only reproduces against a production-ish session or a site with aggressive bot detection, because the request fingerprint is your real browser. You still get the red cursor overlay, the video, and the scenario.md trail.
Does this actually use real Playwright, or is it a custom runner?
Real Playwright. The runner spawns @playwright/mcp/cli.js as a subprocess over stdio (browser.ts lines 280-377) and every action the agent takes is a call to a Playwright MCP tool like browser_navigate, browser_click, browser_type, browser_snapshot. There is no proprietary YAML, no closed DSL, no vendor format in between. If you read events.json, you see the same tool names you would see in any Playwright MCP client. If tomorrow you want to stop using Assrt, the tests you wrote in #Case format port trivially because the underlying execution model is already plain Playwright. That is the point of building on Microsoft's MCP rather than reinventing.
How do I diff two failed runs to see where behavior changed?
Each run writes events.json under /tmp/assrt/<runId>/. The structure is a flat array of events with type, payload, and timestamp, captured in the order they were emitted. To see where runs diverged, run 'diff <(jq -c .[] /tmp/assrt/<runA>/events.json) <(jq -c .[] /tmp/assrt/<runB>/events.json)'. You will see the first place the agent's tool calls differ, which is usually the first place the pages differed. Because each step logs its own duration from the [mcp] <name> (Xms) stderr pattern in browser.ts line 408, you can also diff wall-clock times and spot flakiness where one run hit a 12-second render and the other hit a 200ms one.
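The same divergence check can be done in a few lines of script when the shell one-liner is not enough. A sketch assuming the event shape described above (a flat array with type, payload, timestamp); timestamps are excluded from comparison because they differ on every run:

```typescript
interface ToolEvent {
  type: string;
  payload: unknown;
  timestamp?: number;
}

// Return the index of the first differing event, the shorter length if one
// run is a strict prefix of the other, or -1 if the runs are identical.
function firstDivergence(a: ToolEvent[], b: ToolEvent[]): number {
  const key = (e: ToolEvent) => JSON.stringify([e.type, e.payload]);
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) {
    if (key(a[i]) !== key(b[i])) return i;
  }
  return a.length === b.length ? -1 : len;
}
```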
What is the OTP-paste trick in the system prompt, and why is it documented?
Because OTP forms that split the code across six maxlength=1 inputs are a classic failure mode for any keystroke-driven automation: typing a character fires a focus change, which eats the next keystroke, which desyncs the form. The Assrt system prompt (agent.ts lines 234-236) hard-codes a ClipboardEvent-based paste in a single call to evaluate() that dispatches a synthetic paste to the parent element of the first maxlength=1 input and fills all six fields in one shot. It is a workaround documented at the place where future readers will be debugging OTP failures: the agent's own instructions. That is the template for how to ship debug knowledge in this product: bake it into the prompt next to the tool that needs it.
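The core of the trick is that the six fields are filled in one pass, so no per-field keystroke can trigger the focus churn. A sketch of just that distribution step: in the browser this runs inside one evaluate() call and dispatches a synthetic ClipboardEvent, but here the inputs are duck-typed so the logic is testable outside a DOM; `fillOtp` and `CodeInput` are illustrative names:

```typescript
// Duck-typed stand-in for an <input maxlength="1"> element.
interface CodeInput {
  maxLength: number;
  value: string;
}

// Distribute the code across all maxlength=1 cells in a single pass --
// no per-field typing, so no focus change can eat the next character.
function fillOtp(inputs: CodeInput[], code: string): void {
  const cells = inputs.filter((i) => i.maxLength === 1);
  cells.forEach((cell, i) => {
    cell.value = code[i] ?? "";
  });
}
```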
Does any of this have a vendor lock-in?
No. The runner is open source and self-hosted. The tests are Markdown in /tmp/assrt/scenario.md or in your repo. The execution backend is Microsoft's Playwright MCP. The on-disk artifacts (webm, events.json, execution.log) are all open formats. You can copy /tmp/assrt elsewhere and lose nothing. Compared with closed competitors in the $7.5K/month range where the tests live in a vendor dashboard and depart with the subscription, Assrt's trade is the opposite: you run it locally, you own the artifacts, and if you migrate, the #Case files go with you. That is what 'debuggable' means at the business level, too: you can still reason about what the tests did after you stop paying for the tool.