AI code E2E verification

The AI agent that writes the code should be the one that verifies it.

Most "AI e2e testing" pages describe a pipeline: your agent writes code, some other system ingests a URL, a dashboard shows green. That is not verification; that is reporting after the fact. Assrt collapses the loop: three MCP tools, one singleton browser, and a { description, passed, evidence } assertion primitive that fails a scenario the moment any single assert returns false. Everything lands on disk at /tmp/assrt/ so the next turn of the same conversation can re-read its own output.

Matthew Diakonov
11 min read

- Real Playwright code, never YAML
- Three MCP tools: assrt_plan, assrt_test, assrt_diagnose (src/mcp/server.ts:335, 764, 867)
- One false assert flips scenarioPassed (src/core/agent.ts:886)
- Artifacts on disk at /tmp/assrt/results/latest.json and /tmp/assrt/<runId>/video/*.webm
- MIT licensed, bring your own API key, no cloud dependency

What changes when the test tool lives inside the same conversation

Standard "AI e2e" products are shaped like a CI pipeline with a friendly face. A coding agent opens a PR, a runner ingests the PR URL, a dashboard reports green or red, a human reads the dashboard and pings the agent in chat to fix the thing. Four hops, two context switches, and the error message arrives as a blob of screenshot + video the coding agent has to interpret second-hand.

Assrt removes the hops. The coding agent itself calls assrt_test, reads the structured assertion payload in its tool response, and can fix the bug or edit the scenario before the human is even back from lunch. The hub here is assrt-mcp; the sources are any MCP-speaking client; the destinations are three tools that all share one browser and one on-disk artifact tree.

One MCP server, three tools, one browser

[Diagram: Claude Code, Cursor, and the CLI all talk to one assrt-mcp server, which exposes assrt_plan, assrt_test, and assrt_diagnose against a single shared browser.]

The assert primitive: one boolean flips the whole scenario

This is the file the rest of the page is about. The browser-driving agent calls a tool named assert every time it wants to claim something about the page. Three fields, not four, not a score, not a probability. A single passed: false sets the scenario to failed and keeps it failed. This is intentional: LLMs are bad at graded scoring, and "the page looks roughly fine" is the failure mode that AI e2e tools most need to avoid.

src/core/agent.ts:886

if (!passed) scenarioPassed = false;

One false assert is enough; scenarioPassed never flips back.
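The latch behavior that one line implies can be sketched in a few lines. This is a minimal illustration, not the real agent loop: AssertInput and runAssertions are hypothetical names, and only the `if (!passed) scenarioPassed = false` line mirrors the source.

```typescript
// Illustrative sketch of the failure latch. Only the one-line latch itself
// mirrors src/core/agent.ts; the surrounding types and helper are invented
// names for demonstration.
interface AssertInput {
  description: string;
  passed: boolean;
  evidence: string;
}

function runAssertions(calls: AssertInput[]): {
  scenarioPassed: boolean;
  assertions: AssertInput[];
} {
  let scenarioPassed = true;
  const assertions: AssertInput[] = [];
  for (const a of calls) {
    assertions.push(a);
    // The latch: a single false flips the scenario, and nothing flips it back.
    if (!a.passed) scenarioPassed = false;
  }
  return { scenarioPassed, assertions };
}
```

Because the flag only ever moves from true to false, a later passing assert cannot "recover" a failed scenario, which is the whole point: no averaging, no graded scoring.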

The tools the browser agent has at its disposal

assert is the keystone, but it is not alone. The agent inside assrt_test can also call any of the following. The full tool list is in src/core/agent.ts around lines 100-200.

assert

The single verification primitive. description, passed, evidence. On line 886 a single false flips scenarioPassed for the whole case.

complete_scenario

Called by the agent when it has made all the assertions it needed. Emits the scenario summary and ends the agent loop so the tool can respond.

suggest_improvement

Orthogonal to pass/fail. The agent can flag confusing UX, missing aria-labels, or broken links without failing the scenario. Surfaced as improvements[] in the report.

wait_for_stable

A MutationObserver polled every 500ms, default 2s quiet window, capped at 10s. The opposite of waitForTimeout. Described in agent.ts:941-994.

snapshot

Forwarded to @playwright/mcp. Returns the accessibility tree with ref IDs (e5, e12...) that the agent uses instead of CSS selectors, because refs survive re-renders.

create_temp_email + wait_for_verification_code

For signup flows. The agent requests a disposable inbox and polls it for the OTP. No brittle hard-coded test accounts.
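The wait_for_stable behavior described above (500ms polls, 2s quiet window, 10s cap) can be sketched as a generic quiet-window waiter. This is an approximation: the real tool wires this logic to a MutationObserver inside the page, while this standalone version takes a lastChangeAt() probe, and the function name and signature are assumptions.

```typescript
// Quiet-window waiter in the spirit of wait_for_stable: poll for recent
// activity, resolve true once nothing has changed for `quietMs`, give up
// and resolve false at `capMs`. The defaults echo the constants the doc
// cites (500ms poll, 2s quiet, 10s cap); the real implementation uses a
// MutationObserver rather than a timestamp probe.
function waitForStable(
  lastChangeAt: () => number,
  { pollMs = 500, quietMs = 2000, capMs = 10000 } = {}
): Promise<boolean> {
  const start = Date.now();
  return new Promise((resolve) => {
    const timer = setInterval(() => {
      const now = Date.now();
      if (now - lastChangeAt() >= quietMs) {
        clearInterval(timer);
        resolve(true); // page went quiet
      } else if (now - start >= capMs) {
        clearInterval(timer);
        resolve(false); // cap hit while the page was still mutating
      }
    }, pollMs);
  });
}
```

Unlike a fixed waitForTimeout, this resolves as soon as the page settles and never waits longer than the cap, which is why the doc calls it "the opposite of waitForTimeout."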

What a single round of the loop actually looks like

No slides, no sequence diagrams that hide what happened. This is the same conversation: one agent, one editor, one tool call, one artifact trail. When a scenario fails the agent reads the evidence string and either patches the code, edits the scenario, or calls assrt_diagnose for structured root-cause analysis.


Six steps, one conversation

1. The coding agent edits the file

Claude Code writes the AddToCart button, saves, and the dev server hot-reloads on http://localhost:3000. No human intervention yet. This is the only step where code is produced.

2. Same agent calls assrt_test in the same turn

Not a separate CI job. The agent issues a tool call with { url: 'http://localhost:3000', plan: '#Case 1: signed-out user adds item to cart…' }. The MCP server at src/mcp/server.ts:335 receives it, writes /tmp/assrt/scenario.md so the plan is auditable, then spawns or reuses the singleton Playwright browser.

3. Browser agent executes and emits typed events

Inside the assrt_test handler a TestAgent runs the SYSTEM_PROMPT at agent.ts:198. It calls snapshot, click, type_text, and assert. Each assert pushes onto assertions[] and emits an 'assertion' event with { description, passed, evidence } visible in the test report.

4. Results return structured, not as a 200 OK blob

When the agent calls complete_scenario, the run finalizes. The calling agent (Claude Code) sees { passed, failedCount, assertions[], improvements[], screenshots[] } inline in its tool response, plus all of it written to /tmp/assrt/results/latest.json and /tmp/assrt/results/<runId>.json for later turns.

5. On failure, assrt_diagnose runs, not guesswork

The coding agent reads the false assertion, sees 'price prop missing from AddButton' in evidence, and either fixes directly or calls assrt_diagnose for a structured Root Cause / Analysis / Recommended Fix from Claude Haiku. Defined at server.ts:867 with the DIAGNOSE_SYSTEM_PROMPT on line 240.

6. Next turn re-reads the artifacts

Nothing is lost between turns. The agent can open /tmp/assrt/results/latest.json at the start of the next message, confirm the fix worked by re-running the same scenarioId (it is a UUID, cached at ~/.assrt/scenarios/<uuid>.json), and move on. No re-upload, no cloud round-trip.
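The loop above hinges on the result payload being machine-readable. A sketch of what the next turn might do with it, using the field names the doc lists ({ passed, failedCount, assertions[], improvements[], screenshots[] }); the RunResult type and nextAction helper are illustrative, not part of assrt-mcp.

```typescript
// Illustrative consumer of the artifact on disk. Field names come from the
// doc's description of the tool response; nextAction is a made-up helper.
import { readFileSync } from "node:fs";

interface AssertionRecord {
  description: string;
  passed: boolean;
  evidence: string;
}

interface RunResult {
  passed: boolean;
  failedCount: number;
  assertions: AssertionRecord[];
  improvements: string[];
  screenshots: string[];
}

function nextAction(r: RunResult): "commit" | "diagnose" {
  // All green: move on. Anything failed: go read the evidence strings
  // and call assrt_diagnose.
  return r.passed && r.failedCount === 0 ? "commit" : "diagnose";
}

// At the start of the next turn the agent would do something like:
// const r: RunResult = JSON.parse(
//   readFileSync("/tmp/assrt/results/latest.json", "utf8"));
// nextAction(r);
```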

The scenario file is a plain markdown block, not YAML

The test plan lives at /tmp/assrt/scenario.md, editable mid-run. The parser at agent.ts:597 accepts any heading shaped like #Case N:, Scenario N:, or Test N:. Each block runs independently in the shared browser, with cookies and localStorage carrying over between cases so signup + login + use flows naturally compose.

/tmp/assrt/scenario.md
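A minimal parser for those heading shapes can be written as one regex plus a split. The real parser at agent.ts:597 is not shown here; HEADING, ScenarioBlock, and splitScenarios are approximations for illustration only.

```typescript
// Accepts the three heading shapes the doc names: "#Case N:", "Scenario N:",
// "Test N:". Everything after a heading, up to the next heading, is that
// scenario's body. Sketch only; the real parser may differ.
const HEADING = /^(?:#\s*Case|Scenario|Test)\s+(\d+)\s*:\s*(.*)$/i;

interface ScenarioBlock {
  n: number;
  title: string;
  body: string[];
}

function splitScenarios(md: string): ScenarioBlock[] {
  const blocks: ScenarioBlock[] = [];
  for (const line of md.split("\n")) {
    const m = line.match(HEADING);
    if (m) {
      blocks.push({ n: Number(m[1]), title: m[2], body: [] });
    } else if (blocks.length) {
      blocks[blocks.length - 1].body.push(line);
    }
  }
  return blocks;
}
```

Each block running independently in the same browser is what lets a "Case 1: sign up" block leave cookies behind for "Case 2: log in".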

The JSON the next turn of the conversation will read

When the run finishes, two files are written: a run-specific record at /tmp/assrt/results/<runId>.json, and a convenient alias at /tmp/assrt/results/latest.json. Structured, not scraped. Every assertion the agent made is there with its evidence string, so the next turn of the same conversation can reason about what failed without calling a dashboard.

/tmp/assrt/results/latest.json

Concrete constants the model is working against

Short list, all from the source. No invented benchmarks.

- 3 MCP tools, registered at server.ts lines 335, 764, 867
- 500ms poll interval for wait_for_stable
- 2s default quiet window, capped at 10s, for the MutationObserver
- 1600px viewport width for recordings

How this differs from cloud AI e2e products

The short version: Assrt treats the coding agent as a peer, not a client. Cloud products treat the coding agent as something that uploads work and waits for a verdict. If your reason for using an agent in the first place was to reduce the handoffs, adding a cloud handoff for verification defeats the point.

| Feature | Cloud AI e2e products | Assrt |
| --- | --- | --- |
| Where the test runs | Their cloud, URL-reachable only | Your laptop, localhost dev server |
| Tool the coding agent calls | HTTP dashboard or GitHub Action | MCP (assrt_test, assrt_plan, assrt_diagnose) |
| Output format | YAML files or web UI run logs | Structured assertions + /tmp/assrt/results/latest.json |
| Verification primitive | Vendor DSL or "AI said it passed" | assert({ description, passed, evidence }) |
| Video recording | Cloud player behind login | Local .webm + Range-served player.html |
| Scenarios on failure | Re-record flow via a browser extension | Editable /tmp/assrt/scenario.md, fs.watch synced |
| Cost | $1.5K to $7.5K / month | $0 runtime + your own API key |
| License | Proprietary, data leaves your machine | MIT |

Install assrt-mcp in two minutes

npx assrt-mcp registers three tools with any MCP client: assrt_plan, assrt_test, assrt_diagnose. The first assrt_test call spawns the browser, records a video to /tmp/assrt, and returns structured assertions. No YAML, no cloud dashboard, no vendor DSL.
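Registration looks like any other MCP server entry. The snippet below uses the mcpServers config shape common to Claude Code and similar MCP clients; the exact file name and schema vary by client, so treat this as a representative example rather than the canonical config.

```json
{
  "mcpServers": {
    "assrt": {
      "command": "npx",
      "args": ["assrt-mcp"]
    }
  }
}
```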

See the install docs

Frequently asked questions

What does 'AI code E2E test verification' actually mean when the same model wrote the code?

It means the agent that just wrote the React component, API route, or Playwright fixture is also the one calling the E2E tool, reading structured assertion results, and deciding whether to keep or revise. Not a separate CI job, not a human handing the diff to a QA bot. In Assrt this is three MCP tools (assrt_plan, assrt_test, assrt_diagnose) wired into the same conversation as the editor. Source: /Users/matthewdi/assrt-mcp/src/mcp/server.ts lines 335, 764, and 867.

How does Assrt decide a scenario has actually passed?

The agent running the browser calls a tool named assert with three fields: description, passed (boolean), evidence (string). On /Users/matthewdi/assrt-mcp/src/core/agent.ts:886 a single passed: false sets scenarioPassed = false for the whole scenario. No fuzzy scoring, no 'mostly green.' One false flips it. That is the entire verification primitive.

Why not a YAML DSL like other AI testing products?

YAML hides the real action. When the test fails at 2am in CI you are grepping a proprietary file format that only the vendor's runner can execute. Assrt runs the official @playwright/mcp server as a stdio child process, so every click, type, and setOffline is a real Playwright action under the hood. What you save to /tmp/assrt/scenario.md is a readable #Case block; what you read back in /tmp/assrt/results/latest.json is structured JSON. If you ever leave Assrt the scenarios are portable and the evidence is plain text.

Where are the verification artifacts on disk so the next turn of the conversation can read them?

Three deterministic paths. The test plan lives at /tmp/assrt/scenario.md and is editable mid-run via fs.watch. Structured results land at /tmp/assrt/results/latest.json and /tmp/assrt/results/<runId>.json. Per-run screenshots go to /tmp/assrt/<runId>/screenshots/. Video recording goes to /tmp/assrt/<runId>/video/*.webm and is served by a single persistent localhost HTTP server with Range request support so the agent or player can seek without re-downloading.
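The one nontrivial piece of that local artifact server is honoring Range headers so the player can seek inside a .webm without re-downloading. A sketch of the single-range case follows; parseRange is a hypothetical helper, and the real server may handle more of RFC 9110's range grammar.

```typescript
// Parse "bytes=start-end" into byte offsets for a file of `size` bytes,
// covering the three common forms: "bytes=0-499", "bytes=500-" (open-ended),
// and "bytes=-200" (last 200 bytes). Returns null for anything unservable.
function parseRange(header: string, size: number): { start: number; end: number } | null {
  const m = /^bytes=(\d*)-(\d*)$/.exec(header);
  if (!m || (m[1] === "" && m[2] === "")) return null;
  let start: number;
  let end: number;
  if (m[1] === "") {
    // Suffix range: the last N bytes of the file.
    start = Math.max(0, size - Number(m[2]));
    end = size - 1;
  } else {
    start = Number(m[1]);
    end = m[2] === "" ? size - 1 : Math.min(Number(m[2]), size - 1);
  }
  return start <= end && start < size ? { start, end } : null;
}

// A 206 Partial Content response would then stream bytes [start, end] with
// a header like: Content-Range: bytes <start>-<end>/<size>.
```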

Does the coding agent really call the test tool inside the same conversation, or is it a pipeline?

Same conversation. Assrt ships as an MCP server (npx assrt-mcp), and Claude Code, Cursor, and any other MCP client pick up the three tools the moment it is registered. On /Users/matthewdi/assrt-mcp/src/mcp/server.ts:282 the server advertises instructions that tell the agent exactly when to call them: after implementing a feature, before committing, and, when a test fails, assrt_diagnose. The model does not need to remember this; it is in the tool metadata.

What keeps the browser state coherent when one conversation fires multiple assrt_test calls?

A singleton McpBrowserManager, declared on /Users/matthewdi/assrt-mcp/src/mcp/server.ts:31 as sharedBrowser. The first assrt_test call launches @playwright/mcp as a stdio child with --viewport-size 1600x900 --output-mode file --output-dir ~/.assrt/playwright-output --caps devtools. Subsequent calls reuse the same browser, cookies, localStorage, and dev server tab. That is why the system prompt at agent.ts:239 tells the agent 'Scenarios run in the SAME browser session. Cookies, auth state carry over.'
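The singleton shape that answer describes reduces to a lazily initialized static instance. The sketch below elides everything McpBrowserManager actually does (spawning @playwright/mcp with those flags); only the reuse pattern is the point, and BrowserManager is an illustrative stand-in.

```typescript
// Reduced singleton sketch. The real McpBrowserManager spawns @playwright/mcp
// as a stdio child on first use; here launch cost is stood in by a timestamp.
class BrowserManager {
  private static shared: BrowserManager | null = null;

  private constructor(public readonly launchedAt: number) {}

  static get(): BrowserManager {
    // First call "launches"; every later call reuses the same instance,
    // which is what lets cookies and localStorage carry across scenarios.
    if (!BrowserManager.shared) {
      BrowserManager.shared = new BrowserManager(Date.now());
    }
    return BrowserManager.shared;
  }
}
```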

How is a failing test handled? Does the agent just guess a fix?

The agent is steered toward assrt_diagnose, a separate tool that sends the failed scenario + error to Claude Haiku with the DIAGNOSE_SYSTEM_PROMPT defined at /Users/matthewdi/assrt-mcp/src/mcp/server.ts:240. The prompt forces structured output: Root Cause, Analysis, Recommended Fix, Corrected Test Scenario in the same #Case format. The diagnosis distinguishes between three failure classes: application bug, flawed test, environment issue. It is not a generic 'try again' loop.

What is pass criteria and why is it MANDATORY in the prompt?

pass criteria is free text you pass to assrt_test describing the conditions that must hold true (for example: 'Cart total shows $42.99' or 'Error toast does NOT appear'). On /Users/matthewdi/assrt-mcp/src/core/agent.ts:655 the runner injects a section titled ## Pass Criteria (MANDATORY) into the agent's user prompt. The phrasing is not decorative. It tells the model to fail the scenario if any one condition is unmet, which is what prevents the 'the page loaded so it must be fine' failure mode common to shallow AI test agents.
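The injection itself is plain string assembly. A sketch under the assumption that the section heading matches the doc's "## Pass Criteria (MANDATORY)" wording; buildUserPrompt and its closing instruction line are illustrative, not the actual prompt text at agent.ts:655.

```typescript
// Append a mandatory pass-criteria section to the agent's user prompt.
// Hypothetical helper; only the section heading echoes the source.
function buildUserPrompt(plan: string, passCriteria?: string): string {
  const parts = [plan.trim()];
  if (passCriteria?.trim()) {
    parts.push(
      "## Pass Criteria (MANDATORY)",
      passCriteria.trim(),
      "If ANY criterion is not met, call assert with passed: false."
    );
  }
  return parts.join("\n\n");
}
```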

Can I run this against localhost and have the agent record a video of what it did?

Yes, and you get the player HTML for free. On /Users/matthewdi/assrt-mcp/src/mcp/server.ts:541 startVideo() is called after the browser is confirmed connected. The run ends with stopVideo() and the .webm is moved into /tmp/assrt/<runId>/video. A player HTML is generated with speed controls from 1x to 10x and keyboard shortcuts (Space, arrow keys, number keys 1/2/3/5). It auto-opens in your browser by default (autoOpenPlayer defaults to true).

Is it really free, and what is the open source trade-off?

The code is MIT licensed and the runtime is free. You bring your own Anthropic API key (or an existing Claude Code OAuth token that Assrt reads via keychain) for the agent brain. That is the trade-off: competitor products charge $1.5K to $7.5K a month because they bundle the inference in a SaaS price. Assrt unbundles it, so you pay only the token cost of the test runs you actually execute. Everything else (browser, artifacts, CLI, MCP server) runs on your machine.

Assrt: open-source AI testing framework
© 2026 Assrt. MIT License.
