The AI agent that writes the code should be the one that verifies it.
Most "AI e2e testing" pages describe a pipeline: your agent writes code, some other system ingests a URL, a dashboard shows green. That is not verification, that is reporting after the fact. Assrt collapses the loop. Three MCP tools, one singleton browser, and a { description, passed, evidence } assertion primitive that fails a scenario the moment any single assert returns false. Everything lands on disk at /tmp/assrt/ so the next turn of the same conversation can re-read its own output.
What changes when the test tool lives inside the same conversation
Standard "AI e2e" products are shaped like a CI pipeline with a friendly face. A coding agent opens a PR, a runner ingests the PR URL, a dashboard reports green or red, a human reads the dashboard and pings the agent in chat to fix the thing. Four hops, two context switches, and the error message arrives as a blob of screenshot + video the coding agent has to interpret second-hand.
Assrt removes the hops. The coding agent itself calls assrt_test, reads the structured assertion payload in its tool response, and can fix the bug or edit the scenario before the human is even back from lunch. The hub here is assrt-mcp; the sources are any MCP-speaking client; the destinations are three tools that all share one browser and one on-disk artifact tree.
One MCP server, three tools, one browser
The assert primitive: one boolean flips the whole scenario
This is the file the rest of the page is about. The browser-driving agent calls a tool named assert every time it wants to claim something about the page. Three fields, not four, not a score, not a probability. A single passed: false sets the scenario to failed and keeps it failed. This is intentional: LLMs are bad at graded scoring, and "the page looks roughly fine" is the failure mode that AI e2e tools most need to avoid.
if (!passed) scenarioPassed = false;
/Users/matthewdi/assrt-mcp/src/core/agent.ts:886
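The one-way flip can be sketched in a few lines. This is an illustrative reduction, not the actual agent.ts code; the `Assertion` interface and `evaluateScenario` name are stand-ins, but the three fields and the latching `false` mirror what the page describes.

```typescript
// Illustrative sketch of the assert reducer, not the real agent.ts implementation.
interface Assertion {
  description: string;
  passed: boolean;
  evidence: string;
}

// One false flips the whole scenario and keeps it flipped; a later passing
// assert can never win the scenario back.
function evaluateScenario(
  assertions: Assertion[],
): { passed: boolean; failedCount: number } {
  let scenarioPassed = true;
  let failedCount = 0;
  for (const { passed } of assertions) {
    if (!passed) {
      scenarioPassed = false; // mirrors the single line at agent.ts:886
      failedCount++;
    }
  }
  return { passed: scenarioPassed, failedCount };
}
```

There is deliberately no weighting or threshold: the return type has nowhere to put a score, so "mostly green" cannot be expressed.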
The tools the browser agent has at its disposal
assert is the keystone, but it is not alone. The agent inside assrt_test can also call any of the following. The full tool list is in src/core/agent.ts around lines 100-200.
assert
The single verification primitive. description, passed, evidence. On line 886 a single false flips scenarioPassed for the whole case.
complete_scenario
Called by the agent when it has made all the assertions it needed. Emits the scenario summary and ends the agent loop so the tool can respond.
suggest_improvement
Orthogonal to pass/fail. The agent can flag confusing UX, missing aria-labels, or broken links without failing the scenario. Lands as improvements[] in the report.
wait_for_stable
A MutationObserver polled every 500ms, default 2s quiet window, capped at 10s. The opposite of waitForTimeout. Described in agent.ts:941-994.
snapshot
Forwarded to @playwright/mcp. Returns the accessibility tree with ref IDs (e5, e12...) that the agent uses instead of CSS selectors, because refs survive re-renders.
create_temp_email + wait_for_verification_code
For signup flows. The agent requests a disposable inbox and polls it for the OTP. No brittle hard-coded test accounts.
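The wait_for_stable timing described in the list above (500ms poll, 2s quiet window, 10s cap) reduces to a small loop. This is a sketch of just the timing logic; `lastMutationAt` is a hypothetical injected clock standing in for the real MutationObserver wiring in agent.ts:941-994.

```typescript
// Sketch of the wait_for_stable timing logic only. lastMutationAt() stands in
// for a MutationObserver reporting the timestamp of the most recent DOM change.
async function waitForStable(
  lastMutationAt: () => number,
  quietMs = 2000,   // default quiet window from the description above
  timeoutMs = 10000, // hard cap
  pollMs = 500,      // poll interval
): Promise<boolean> {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    // Settled: no mutation observed for a full quiet window.
    if (Date.now() - lastMutationAt() >= quietMs) return true;
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  return false; // still mutating at the cap; the agent decides what to do next
}
```

The contrast with waitForTimeout is that this returns as soon as the page settles and gives up with a definite answer instead of sleeping blind.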
What a single round of the loop actually looks like
No slides, no sequence diagrams that hide what happened. This is the same conversation: one agent, one editor, one tool call, one artifact trail. When a scenario fails the agent reads the evidence string and either patches the code, edits the scenario, or calls assrt_diagnose for structured root-cause analysis.
Six steps, one conversation
The coding agent edits the file
Claude Code writes the AddToCart button, saves, and the dev server hot-reloads on http://localhost:3000. No human intervention yet. This is the only step where code is produced.
Same agent calls assrt_test in the same turn
Not a separate CI job. The agent issues a tool call with { url: 'http://localhost:3000', plan: '#Case 1: signed-out user adds item to cart…' }. The MCP server at src/mcp/server.ts:335 receives it, writes /tmp/assrt/scenario.md so the plan is auditable, then spawns or reuses the singleton Playwright browser.
Browser agent executes and emits typed events
Inside the assrt_test handler a TestAgent runs the SYSTEM_PROMPT at agent.ts:198. It calls snapshot, click, type_text, and assert. Each assert pushes onto assertions[] and emits an 'assertion' event with { description, passed, evidence } visible in the test report.
Results return structured, not as a 200 OK blob
When the agent calls complete_scenario, the run finalizes. The calling agent (Claude Code) sees { passed, failedCount, assertions[], improvements[], screenshots[] } inline in its tool response, plus all of it written to /tmp/assrt/results/latest.json and /tmp/assrt/results/<runId>.json for later turns.
On failure, assrt_diagnose runs, not guesswork
The coding agent reads the false assertion, sees 'price prop missing from AddButton' in evidence, and either fixes directly or calls assrt_diagnose for a structured Root Cause / Analysis / Recommended Fix from Claude Haiku. Defined at server.ts:867 with the DIAGNOSE_SYSTEM_PROMPT on line 240.
Next turn re-reads the artifacts
Nothing is lost between turns. The agent can open /tmp/assrt/results/latest.json at the start of the next message, confirm the fix worked by re-running the same scenarioId (it is a UUID, cached at ~/.assrt/scenarios/<uuid>.json), and move on. No re-upload, no cloud round-trip.
The scenario file is a plain markdown block, not YAML
The test plan lives at /tmp/assrt/scenario.md, editable mid-run. The parser at agent.ts:597 accepts any heading shaped like #Case N:, Scenario N:, or Test N:. Each block runs independently in the shared browser, with cookies and localStorage carrying over between cases so signup + login + use flows naturally compose.
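A splitter for headings of that shape is small enough to show. This is a sketch of the accepted heading forms, not the actual parser at agent.ts:597; the regex and the `splitScenarios` helper are illustrative.

```typescript
// Illustrative splitter for the scenario.md format: accepts headings shaped
// like "#Case N:", "Scenario N:", or "Test N:". Not the real agent.ts parser.
const CASE_HEADING = /^#?\s*(?:Case|Scenario|Test)\s+(\d+):\s*(.*)$/i;

interface ScenarioCase {
  id: number;
  title: string;
  body: string[];
}

function splitScenarios(markdown: string): ScenarioCase[] {
  const cases: ScenarioCase[] = [];
  for (const line of markdown.split("\n")) {
    const m = line.match(CASE_HEADING);
    if (m) {
      cases.push({ id: Number(m[1]), title: m[2], body: [] });
    } else if (cases.length > 0 && line.trim() !== "") {
      cases[cases.length - 1].body.push(line.trim());
    }
  }
  return cases;
}
```

Because each block is just a heading plus free-text steps, editing the file mid-run means changing plain prose, not a DSL.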
The JSON the next turn of the conversation will read
When the run finishes, two files are written: a run-specific record at /tmp/assrt/results/<runId>.json, and a convenient alias at /tmp/assrt/results/latest.json. Structured, not scraped. Every assertion the agent made is there with its evidence string, so the next turn of the same conversation can reason about what failed without calling a dashboard.
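A later turn re-reading that file might look like the sketch below. The field names (passed, failedCount, assertions, improvements, screenshots) come from the run summary described above; treat the exact shape as an assumption rather than a published schema, and `readLatestRun` / `summarizeFailures` as hypothetical helpers.

```typescript
// Sketch of re-reading the run record in a later turn. The RunResult shape is
// assumed from the fields named in this page, not a published schema.
import { readFileSync } from "node:fs";

interface RunResult {
  passed: boolean;
  failedCount: number;
  assertions: { description: string; passed: boolean; evidence: string }[];
  improvements: string[];
  screenshots: string[];
}

function readLatestRun(path = "/tmp/assrt/results/latest.json"): RunResult {
  return JSON.parse(readFileSync(path, "utf8")) as RunResult;
}

// Pull out only the failing assertions, evidence attached, so the coding
// agent can reason about them without a dashboard.
function summarizeFailures(run: RunResult): string[] {
  return run.assertions
    .filter((a) => !a.passed)
    .map((a) => `${a.description}: ${a.evidence}`);
}
```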
Concrete constants the model is working against
Short list, all from the source. No invented benchmarks.
server.ts lines 335, 764, 867
The three tool handlers: assrt_test, assrt_plan, assrt_diagnose.
wait_for_stable
500ms poll interval, 2s default quiet window, 10s cap.
How this differs from cloud AI e2e products
The short version: Assrt treats the coding agent as a peer, not a client. Cloud products treat the coding agent as something that uploads work and waits for a verdict. If your reason for using an agent in the first place was to reduce the handoffs, adding a cloud handoff for verification defeats the point.
| Feature | Cloud AI e2e products | Assrt |
|---|---|---|
| Where the test runs | Their cloud, URL-reachable only | Your laptop, localhost dev server |
| Tool the coding agent calls | HTTP dashboard or GitHub Action | MCP (assrt_test, assrt_plan, assrt_diagnose) |
| Output format | YAML files or web UI run logs | Structured assertions + /tmp/assrt/results/latest.json |
| Verification primitive | Vendor DSL or 'AI said it passed' | assert({ description, passed, evidence }) |
| Video recording | Cloud player behind login | Local .webm + Range-served player.html |
| Scenarios on failure | Re-record flow via a browser extension | Editable /tmp/assrt/scenario.md, fs.watch synced |
| Cost | $1.5K to $7.5K / month | $0 runtime + your own API key |
| License | Proprietary, data leaves your machine | MIT |
Install assrt-mcp in two minutes
npx assrt-mcp registers three tools with any MCP client: assrt_plan, assrt_test, assrt_diagnose. The first assrt_test call spawns the browser, records a video to /tmp/assrt, and returns structured assertions. No YAML, no cloud dashboard, no vendor DSL.
See the install docs →
Frequently asked questions
What does 'AI code E2E test verification' actually mean when the same model wrote the code?
It means the agent that just wrote the React component, API route, or Playwright fixture is also the one calling the E2E tool, reading structured assertion results, and deciding whether to keep or revise. Not a separate CI job, not a human handing the diff to a QA bot. In Assrt this is three MCP tools (assrt_plan, assrt_test, assrt_diagnose) wired into the same conversation as the editor. Source: /Users/matthewdi/assrt-mcp/src/mcp/server.ts lines 335, 764, and 867.
How does Assrt decide a scenario has actually passed?
The agent running the browser calls a tool named assert with three fields: description, passed (boolean), evidence (string). On /Users/matthewdi/assrt-mcp/src/core/agent.ts:886 a single passed: false sets scenarioPassed = false for the whole scenario. No fuzzy scoring, no 'mostly green.' One false flips it. That is the entire verification primitive.
Why not a YAML DSL like other AI testing products?
YAML hides the real action. When the test fails at 2am in CI you are grepping a proprietary file format that only the vendor's runner can execute. Assrt runs the official @playwright/mcp server as a stdio child process, so every click, type, and setOffline is a real Playwright action under the hood. What you save to /tmp/assrt/scenario.md is a readable #Case block; what you read back in /tmp/assrt/results/latest.json is structured JSON. If you ever leave Assrt the scenarios are portable and the evidence is plain text.
Where are the verification artifacts on disk so the next turn of the conversation can read them?
Three deterministic paths. The test plan lives at /tmp/assrt/scenario.md and is editable mid-run via fs.watch. Structured results land at /tmp/assrt/results/latest.json and /tmp/assrt/results/<runId>.json. Per-run screenshots go to /tmp/assrt/<runId>/screenshots/. Video recording goes to /tmp/assrt/<runId>/video/*.webm and is served by a single persistent localhost HTTP server with Range request support so the agent or player can seek without re-downloading.
Does the coding agent really call the test tool inside the same conversation, or is it a pipeline?
Same conversation. Assrt ships as an MCP server (npx assrt-mcp), and Claude Code, Cursor, and any other MCP client pick up the three tools the moment it is registered. On /Users/matthewdi/assrt-mcp/src/mcp/server.ts:282 the server advertises instructions that tell the agent exactly when to call them: after implementing a feature, before committing, and when a test fails use assrt_diagnose. The model does not need to remember this, it is in the tool metadata.
What keeps the browser state coherent when one conversation fires multiple assrt_test calls?
A singleton McpBrowserManager, declared on /Users/matthewdi/assrt-mcp/src/mcp/server.ts:31 as sharedBrowser. The first assrt_test call launches @playwright/mcp as a stdio child with --viewport-size 1600x900 --output-mode file --output-dir ~/.assrt/playwright-output --caps devtools. Subsequent calls reuse the same browser, cookies, localStorage, and dev server tab. That is why the system prompt at agent.ts:239 tells the agent 'Scenarios run in the SAME browser session. Cookies, auth state carry over.'
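The reuse pattern is ordinary lazy-singleton code. The sketch below is illustrative; only the names sharedBrowser / McpBrowserManager and the launch flags are taken from the source, and the `ensureLaunched` / `getSharedBrowser` helpers are hypothetical.

```typescript
// Sketch of the singleton reuse pattern. Only sharedBrowser, McpBrowserManager,
// and the launch flags come from the source; the rest is illustrative.
class McpBrowserManager {
  private launched = false;

  async ensureLaunched(): Promise<void> {
    if (this.launched) return; // later assrt_test calls reuse the session
    // First call would spawn the stdio child, roughly:
    //   npx @playwright/mcp --viewport-size 1600x900 --output-mode file
    //     --output-dir ~/.assrt/playwright-output --caps devtools
    this.launched = true;
  }
}

let sharedBrowser: McpBrowserManager | null = null;

function getSharedBrowser(): McpBrowserManager {
  sharedBrowser ??= new McpBrowserManager();
  return sharedBrowser;
}
```

Because every call path goes through the same instance, cookies, localStorage, and the open dev-server tab survive across tool calls for free.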
How is a failing test handled? Does the agent just guess a fix?
The agent is steered toward assrt_diagnose, a separate tool that sends the failed scenario + error to Claude Haiku with the DIAGNOSE_SYSTEM_PROMPT defined at /Users/matthewdi/assrt-mcp/src/mcp/server.ts:240. The prompt forces structured output: Root Cause, Analysis, Recommended Fix, Corrected Test Scenario in the same #Case format. The diagnosis distinguishes between three failure classes: application bug, flawed test, environment issue. It is not a generic 'try again' loop.
What is pass criteria and why is it MANDATORY in the prompt?
pass criteria is free text you pass to assrt_test describing the conditions that must hold true (for example: 'Cart total shows $42.99' or 'Error toast does NOT appear'). On /Users/matthewdi/assrt-mcp/src/core/agent.ts:655 the runner injects a section titled ## Pass Criteria (MANDATORY) into the agent's user prompt. The phrasing is not decorative. It tells the model to fail the scenario if any one condition is unmet, which is what prevents the 'the page loaded so it must be fine' failure mode common to shallow AI test agents.
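The injection itself is simple string assembly. This is a sketch of the idea, not the code at agent.ts:655; `buildUserPrompt` and its closing sentence are hypothetical, while the section title and the fail-on-any-unmet-condition rule come from the description above.

```typescript
// Sketch of the pass-criteria injection, not the actual agent.ts:655 code.
// The section title "## Pass Criteria (MANDATORY)" comes from the source.
function buildUserPrompt(plan: string, passCriteria?: string): string {
  const sections = [plan];
  if (passCriteria) {
    sections.push(
      "## Pass Criteria (MANDATORY)\n" +
        passCriteria +
        "\nFail the scenario if ANY of these conditions is unmet.",
    );
  }
  return sections.join("\n\n");
}
```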
Can I run this against localhost and have the agent record a video of what it did?
Yes, and you get the player HTML for free. On /Users/matthewdi/assrt-mcp/src/mcp/server.ts:541 startVideo() is called after the browser is confirmed connected. The run ends with stopVideo() and the .webm is moved into /tmp/assrt/<runId>/video. A player HTML is generated with speed controls from 1x to 10x and keyboard shortcuts (Space, arrow keys, number keys 1/2/3/5). By default it auto-opens in your browser, controlled by autoOpenPlayer which defaults to true.
Is it really free, and what is the open source trade-off?
The code is MIT licensed and the runtime is free. You bring your own Anthropic API key (or an existing Claude Code OAuth token that Assrt reads via keychain) for the agent brain. That is the trade-off: competitor products charge $1.5K to $7.5K a month because they bundle the inference in a SaaS price. Assrt unbundles it, so you pay only the token cost of the test runs you actually execute. Everything else (browser, artifacts, CLI, MCP server) runs on your machine.