
AI E2E testing: the tool surface, not the script.

Every article ranked for this topic is a list of products with one paragraph each. None of them tells you what an AI actually needs to drive a browser. It is 18 tool definitions, an uncapped step loop, and a small background side channel that proposes new cases for every page the agent touches. This is the view from inside the runner.

Matthew Diakonov
11 min read
18 tools defined in agent.ts lines 16-196
MAX_STEPS_PER_SCENARIO = Infinity at line 7
3 concurrent page-discovery calls during every run
18 tools, not 180 · Playwright MCP under the hood · refs, not selectors · MutationObserver waits · uncapped step loop · parallel page discovery · disposable inbox + OTP paste · Markdown #Case plan · MIT-licensed · self-hosted

Two architectures sharing one label

The phrase covers two very different systems. In the first, the model writes a .spec.ts file against Playwright or Cypress; a normal runner executes that file. The AI is a code generator; the runtime is unchanged. In the second, the model IS the runner. It receives the page as an accessibility tree, decides on each turn what tool to call, and the browser executes it through a fixed schema. The plan is plain Markdown. No .spec.ts file is ever emitted.

The distinction matters for one practical reason. In the first architecture, selector drift surfaces at the next regression run: the generated file looked fine on commit, broke when the UI shipped a redesign. In the second, selector drift is absorbed at click time: the agent calls snapshot, binds a fresh ref, then clicks. Same plan on Tuesday and on Friday, even if the team renamed every class in between.


The 18-tool surface, in one file

Here is the exact list. It lives in a single TOOLS array at assrt-mcp/src/core/agent.ts lines 16 to 196. Every name is a key the MCP server recognizes; any other name the model emits is rejected before it reaches the browser. Call it the blast radius of an AI test agent: the union of these 18 behaviors is the whole thing it can do.

assrt-mcp/src/core/agent.ts

navigate, snapshot, click, type_text, select_option, scroll, press_key

The 7 browser-motion primitives. Every click and type names an accessibility ref like e5 from the preceding snapshot, so there is no CSS selector to hallucinate.

wait, wait_for_stable

wait is a short fixed sleep or wait-for-text. wait_for_stable injects a MutationObserver and returns when the DOM is quiet for 2s. No magic-number timeouts.

create_temp_email, wait_for_verification_code, check_email_inbox

The OTP round-trip in-process. Disposable inbox, 120s poll for the code, split-digit paste via ClipboardEvent DataTransfer hard-coded in the system prompt.

assert, complete_scenario

Every assertion records a description, a pass/fail boolean, and a free-form evidence string. complete_scenario ends the scenario with a summary you can read in results/latest.json.
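The three recorded fields map straight onto a JSON entry. A plausible shape for one assertion in results/latest.json, based on the description above (field names follow the article's wording; the exact schema may differ):

```json
{
  "scenario": "Case 1: Guest checkout",
  "status": "PASSED",
  "assertions": [
    {
      "description": "Order confirmation page is shown",
      "passed": true,
      "evidence": "URL is /thanks; heading reads 'Thank you for your order' (ref e3)"
    }
  ]
}
```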

screenshot, evaluate

Explicit screenshot when the auto-capture-after-visual-action is not enough. evaluate runs a JS expression in the page, used for the OTP paste and for DOM assertions that refs cannot express.

http_request, suggest_improvement

http_request verifies external integrations (Telegram, Slack, a webhook) in the same test flow. suggest_improvement logs UX bugs the agent noticed that were not part of the plan.
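Taken together, the surface is just an array of named schemas. A sketch of its shape (tool names come from the list above; the descriptions and schema fields here are illustrative, not the repo's file):

```typescript
// Sketch of the TOOLS array shape. Names match the article's list;
// descriptions and input schemas are illustrative, not the real file.
type ToolDef = {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, unknown>;
    required?: string[];
  };
};

const TOOLS: ToolDef[] = [
  {
    name: "snapshot",
    description: "Return the page as an accessibility tree with refs e1..eN",
    input_schema: { type: "object", properties: {} },
  },
  {
    name: "click",
    description: "Click the element bound to a ref from the last snapshot",
    input_schema: {
      type: "object",
      properties: { ref: { type: "string" } },
      required: ["ref"],
    },
  },
  {
    name: "wait_for_stable",
    description: "Return once the DOM has been quiet for N consecutive seconds",
    input_schema: {
      type: "object",
      properties: { quietSeconds: { type: "number" }, maxSeconds: { type: "number" } },
    },
  },
  // ...15 more definitions follow the same shape.
];

// Any name outside the array is rejected before it reaches the browser.
const isKnownTool = (name: string): boolean => TOOLS.some((t) => t.name === name);
```

That last line is the whole "blast radius" argument in one function: a hallucinated method name never compiles into a browser action.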

The anchor: three constants, infinity, and the discovery caps

Most of the architectural decisions collapse into six lines at the top and middle of agent.ts. If you want to know whether any of this is real, open the file and scroll to these lines.

assrt-mcp/src/core/agent.ts

Infinity is the interesting one. Most agentic systems pick a step budget (25, 50, 100) because an unbounded loop is scary. In an E2E test with a human-authored plan, a cap is a subtle failure mode: a long scenario that legitimately needs 60 turns hits the cap, reports FAILED with an "out of steps" summary, and you have to guess whether the bug was in the test or in the cap. By leaving it uncapped, Assrt lets the model decide: it calls complete_scenario when the scenario is done. If you need a hard timeout, set one at the CLI or CI layer where it belongs.
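Spelled out as code, those lines look something like this. The first four constant names appear in the article; the two stability names are my guesses for the values the text describes:

```typescript
// The governing constants, as described in the article -- a sketch, not a
// verbatim excerpt of agent.ts. The two STABLE_* names are assumed.
const MAX_STEPS_PER_SCENARIO = Infinity;  // line 7: the model, not a cap, ends the loop
const MAX_CONVERSATION_TURNS = Infinity;  // line 8: same policy for conversation length
const MAX_CONCURRENT_DISCOVERIES = 3;     // background page-discovery LLM calls in flight
const MAX_DISCOVERED_PAGES = 20;          // total pages examined per run
const STABLE_QUIET_SECONDS = 2;           // wait_for_stable returns after 2s of DOM quiet
const STABLE_MAX_WAIT_SECONDS = 30;       // hard ceiling on any single stability wait
```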

The runtime loop

Six stages from the moment you run assrt run to the moment the artifacts land on disk. Every step below maps to a concrete location in assrt-mcp/src/core/agent.ts.

1

preflightUrl with an 8-second timeout

HEAD the URL before spawning Chrome. A wedged dev server fails fast with an actionable error instead of a 3-minute browser.navigate() hang.

2

launch Playwright MCP over stdio

Spawn a local Playwright MCP process. Cookies and logins persist under ~/.assrt/browser-profile unless --isolated is passed.

3

navigate, capture first snapshot + screenshot

Every scenario's first message to the model includes the initial accessibility tree plus a JPEG screenshot. The agent always sees the page before deciding what to call.

4

tool-call loop until complete_scenario

The model returns one or more tool calls per turn. Each one runs through Playwright MCP, the result plus a fresh screenshot go back. Uncapped: the loop ends when the model calls complete_scenario.

5

parallel page discovery on every navigate

queueDiscoverPage is called on every navigate. Up to 3 background LLM calls generate test cases for pages the agent saw but the plan did not explicitly test.

6

write artifacts and exit

The scenario saves a plan file, a per-assertion JSON report, zero-padded PNGs, and a WebM recording. No dashboard required.
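Stage 1 is the easiest to sketch. A minimal preflightUrl under the described behavior (the function name, the 8-second timeout, and the error text come from the article; the body and the injectable FetchLike are assumptions for the sketch):

```typescript
// Preflight sketch: HEAD the target URL with a hard timeout before any
// browser is spawned. A wedged server fails in 8s, not 3 minutes.
type FetchLike = (
  url: string,
  init?: { method?: string; signal?: AbortSignal },
) => Promise<{ ok: boolean; status: number }>;

// Grab the global fetch untyped so the sketch compiles without DOM libs.
const globalFetch: FetchLike = (globalThis as any).fetch;

async function preflightUrl(
  url: string,
  timeoutMs = 8000,
  fetchImpl: FetchLike = globalFetch,
): Promise<void> {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), timeoutMs);
  try {
    const res = await fetchImpl(url, { method: "HEAD", signal: ctrl.signal });
    if (!res.ok && res.status >= 500) {
      throw new Error(`Target URL responded ${res.status}`);
    }
  } catch (err) {
    if (ctrl.signal.aborted) {
      // The actionable error the article quotes, instead of a Chrome hang.
      throw new Error(`Target URL did not respond within ${timeoutMs}ms`);
    }
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```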

One test run, ten turns, as the agent saw it

A real 10-turn trace from a guest-checkout scenario. Note the interleaved background discovery events: while the agent is on turn 2 clicking Add to cart, a second LLM call is already generating case ideas for /products/sku-42. The completion lands on turn 4 as discovered_cases_complete, while the primary scenario is still in flight.

assrt run --json
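The --json stream is newline-delimited events. A hedged reconstruction of the two discovery events called out above (the event names come from the article; every other field is illustrative):

```json
{"event": "page_discovered", "url": "/products/sku-42", "turn": 2}
{"event": "discovered_cases_complete", "url": "/products/sku-42", "cases": 2, "turn": 4}
```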

Inside the agent: inputs, tool hub, outputs

Three inputs feed the agent on every turn: the running scenario text, the latest accessibility snapshot, and the most recent screenshot. Those get turned into tool calls through the fixed 18-tool schema. The outputs are the test artifacts on disk and a live event stream that the CLI prints.

Inputs -> 18-tool hub -> artifacts

  • Inputs: scenario.md, the latest accessibility snapshot, the latest screenshot
  • Hub: the fixed 18-tool schema
  • Outputs: results/latest.json, screenshots/NN_stepN.png, video/recording.webm, discovered.md

The side channel no other tool has: parallel page discovery

This is the piece everyone else leaves out. Every time the agent calls navigate inside your scenario, Assrt also queues the new URL for discovery. A separate short-prompt LLM call examines the page and returns 1-2 candidate #Case blocks for it, running in the background while your plan continues to execute. Up to 3 discoveries run concurrently and the total caps at 20 pages per run.

assrt-mcp/src/core/agent.ts

The practical effect: one 30-second test run can surface 8 to 10 candidate cases for adjacent flows you did not plan to test. The cases arrive on the emit stream as discovered_cases_complete events and can be appended to discovered.md for human review. A second pair of eyes on the app, automatically, and it shares the session the plan already opened.
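The queue-plus-caps behavior reduces to a small bounded-concurrency sketch. queueDiscoverPage and the two cap values are named in the article; this implementation is mine, not the repo's:

```typescript
// Bounded-concurrency discovery queue -- a sketch of the behavior the
// article describes, not the code in agent.ts.
const MAX_CONCURRENT_DISCOVERIES = 3;
const MAX_DISCOVERED_PAGES = 20;

type DiscoverFn = (url: string) => Promise<string[]>; // returns candidate #Case blocks

function makeDiscoveryQueue(discover: DiscoverFn) {
  const seen = new Set<string>();
  const pending: string[] = [];
  const results: string[] = [];
  let inFlight = 0;

  const drain = (): void => {
    while (inFlight < MAX_CONCURRENT_DISCOVERIES && pending.length > 0) {
      const url = pending.shift()!;
      inFlight++;
      discover(url)
        .then((cases) => { results.push(...cases); })
        .catch(() => { /* best-effort: discovery never fails the run */ })
        .finally(() => { inFlight--; drain(); });
    }
  };

  return {
    queueDiscoverPage(url: string): void {
      // Skip already-seen URLs; stop once the per-run page cap is reached.
      if (seen.has(url) || seen.size >= MAX_DISCOVERED_PAGES) return;
      seen.add(url);
      pending.push(url);
      drain();
    },
    results,
  };
}
```

Because drain() refills the pool from finally(), the primary scenario never blocks on discovery and never sees more than three background calls at once.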

The agent to the browser, turn by turn

The sequence below shows five turns of a single scenario. The agent asks for a snapshot, picks a ref, acts, gets a new snapshot back, and repeats. The browser is the source of truth; the model never acts on a stale view of the page.

One scenario, five turns

1. Model emits tool_use: snapshot() → the runner calls browser_snapshot on Playwright MCP → the browser returns the accessibility tree → refs e1..eN come back as tool_result text plus a JPEG screenshot.
2. Model emits click ref=e5 (Add to cart) → browser_click dispatches the click → the post-click DOM returns ok → tool_result + JPEG.
3. Model emits wait_for_stable(2s quiet, 30s max) → the runner injects a MutationObserver → 0 mutations for 2s → page stable.
4. Model emits assert "URL contains /thanks" → passed=true.
5. Model emits complete_scenario.

wait_for_stable, the one tool that kills a whole category of flake

Most flaky Playwright tests fail for one reason: a hardcoded wait that was fine on the author's machine but flaky in CI. 1000 ms, 2000 ms, the dreaded page.waitForTimeout. The AI version of this problem is worse, because the model loves to pick a round number for reasons it cannot defend.

wait_for_stable replaces that. It injects a MutationObserver, increments __assrt_mutations on every DOM change, and polls every 500 ms. The moment N consecutive seconds pass without a new mutation, it returns. A fast login finishes in 2 seconds. A slow chat stream finishes when the stream ends. No number to tune.

assrt-mcp/src/core/agent.ts
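Stripped of the browser, the tool is a polling loop over a mutation counter. A runnable sketch with the counter injected (the real code injects a MutationObserver via evaluate and reads window.__assrt_mutations; the 2 s quiet / 30 s max / 500 ms poll defaults match the article):

```typescript
// wait_for_stable reduced to its polling core. In the browser,
// readMutationCount would read window.__assrt_mutations; here it is
// injected so the loop runs anywhere. A sketch, not the repo's code.
async function waitForStable(
  readMutationCount: () => number,
  quietMs = 2000,
  maxMs = 30000,
  pollMs = 500,
): Promise<boolean> {
  const start = Date.now();
  let last = readMutationCount();
  let quietSince = Date.now();
  while (Date.now() - start < maxMs) {
    await new Promise((r) => setTimeout(r, pollMs));
    const now = readMutationCount();
    if (now !== last) {
      last = now;
      quietSince = Date.now();          // DOM changed: restart the quiet window
    } else if (Date.now() - quietSince >= quietMs) {
      return true;                       // quiet for quietMs: page is stable
    }
  }
  return false;                          // hit the ceiling without stability
}
```

A fast page returns as soon as the quiet window elapses; a streaming page keeps resetting the window until the stream ends. The number adapts instead of being tuned.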

The plan that drives all of this

The input to the whole runtime is a small Markdown file. No imports, no fixtures, no page-object class. Every #Case is a named scenario. The parser at agent.ts line 620 splits on the #Case N: header and hands each block to the runScenario() method. Cookies persist between cases in the same run.

checkout.md
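The article never reproduces checkout.md itself; under the described format (#Case N: headers, plain prose, no selectors), a plausible file might read:

```markdown
#Case 1: Guest checkout happy path
Open the store, add any product to the cart, and check out as a guest
with a disposable email. Assert the confirmation page URL contains /thanks.

#Case 2: Empty-cart guard
Go straight to /checkout with an empty cart. Assert the page shows an
empty-cart message and that the Pay button is disabled.
```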

What the runtime numbers actually are

Four constants that govern the architecture. All are readable in assrt-mcp/src/core/agent.ts.

18 tools in the agent schema
3 concurrent page discoveries
20 max pages discovered per run
2 seconds of stability to return from wait_for_stable

Architecture checklist: is a given tool actually AI-driven?

Use this when evaluating any product in the category. Anything that cannot check all 9 is operating a layer above the browser: an AI code generator, not an AI driver.

Nine architectural tells

  • The AI emits tool calls at runtime, not a file that runs later
  • Every click carries an accessibility ref from a live snapshot
  • The tool surface is finite and named; new methods cannot be invented
  • Waits are adaptive (MutationObserver), not fixed milliseconds
  • Preflight the URL before launching the browser
  • Uncapped step loop; the model, not a constant, decides when done
  • Every assertion records free-form evidence a human can read
  • The plan is a plain text format (Markdown) you can version-control
  • Page discovery runs in parallel with the explicit scenario

Hosted platforms versus a tool-surface agent

Most products in this category are hosted SaaS with proprietary DSLs and team-tier pricing in the thousands of dollars per month. The architectural comparison is not about feature count; it is about where the runtime executes, what the plan format is, and whether the selector layer still exists.

Hosted AI QA platform vs Assrt tool-surface agent

Same category, different runtime architecture.

| Feature | Hosted AI QA SaaS | Assrt (tool-surface agent) |
| --- | --- | --- |
| Where selector brittleness surfaces | At the next regression run, when an AI-emitted selector drifts | Never; refs are rebound on every snapshot, and the plan has no selectors |
| Max steps per scenario | Hard cap (often 25 to 50), silent FAIL on overflow | Infinity (agent.ts line 7); the model decides when done |
| Plan format | Proprietary DSL or YAML, rendered by a vendor UI | Plain Markdown #Case blocks, committed to your repo |
| Runtime | Hosted SaaS; runs execute on the vendor's cloud | Local Chromium via Playwright MCP, or your existing Chrome via --extension |
| Page-discovery side channel | Not exposed; the run only does what the plan says | 3 concurrent LLM calls proposing cases for every new page |
| License and price | Closed source, $1K-$7.5K / month at team tier | MIT-licensed, self-hosted, free |

Want to see the 18-tool loop run your own app?

Bring a URL and a one-paragraph scenario. We will watch the trace together and pick out which turns would have flaked on a selector-based runner.

Book a call

Frequently asked questions

What does AI E2E testing actually mean? Is it code generation, runtime driving, or both?

The label covers two different architectures that get marketed the same way. The first architecture treats AI as a code generator: the model writes a .spec.ts file against a framework like Playwright or Cypress, then a normal runner executes the file. Maintenance shifts from hand-writing selectors to reviewing AI-written selectors; nothing about the runtime changes. The second architecture treats AI as the runner: the model is given a tool surface (navigate, click, type, assert), receives an accessibility-tree snapshot of the page, and decides on each turn what to call next. No .spec.ts file is ever written. Assrt is the second kind. The distinction matters because only the second kind adapts to UI drift at the moment of the click, rather than at the next regression run.

Exactly how many tools does the AI need to drive an E2E test? What is in the list?

Assrt defines 18 tools in the TOOLS array at assrt-mcp/src/core/agent.ts lines 16 to 196: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, and wait_for_stable. That is the whole API the model sees. It cannot invent a nineteenth, because every call goes through the MCP tool schema and the server rejects names it does not know. The practical consequence for reviewers: 'can the AI call a hallucinated Playwright method?' is not a question you have to ask. The schema mechanically prevents it.

What is MAX_STEPS_PER_SCENARIO = Infinity and why does that matter?

At agent.ts line 7, Assrt sets MAX_STEPS_PER_SCENARIO = Infinity, and at line 8 sets MAX_CONVERSATION_TURNS = Infinity. Most tools with an AI in the loop pick a hard cap: 25 steps, 50 turns, whatever feels safe. A cap is a quiet failure mode; a long or recovery-heavy scenario hits it, the test reports FAILED with an 'out of steps' summary, and the human has to guess whether the test was right or the cap was too tight. By leaving it uncapped, Assrt lets the agent exhaust its own budget naturally: it either calls complete_scenario, runs out of API retries, or the run timeout at the browser layer kicks in. The model decides when the scenario is done, not a hard-coded constant. If you want a cap, you set it at the CLI or CI layer.

What is the page-discovery side channel and how does it run in parallel with my test?

Every time the agent calls navigate() inside a scenario, Assrt also calls queueDiscoverPage() for that URL (agent.ts line 775). The queue is drained by flushDiscovery() (lines 564 to 583) which runs up to MAX_CONCURRENT_DISCOVERIES = 3 simultaneous LLM calls to generate 1-2 test cases for each newly-seen page, using a separate short prompt (DISCOVERY_SYSTEM_PROMPT, line 256). A cap at MAX_DISCOVERED_PAGES = 20 stops it from consuming the run. So while your explicit plan is running on page A, three background calls are examining pages B, C, and D and proposing #Case blocks for them. The results arrive on the emit stream as page_discovered and discovered_cases_complete events. This is how one 30-second test run can surface 8-10 candidate test cases for adjacent flows without you writing them.

Why accessibility refs instead of CSS selectors?

Every snapshot call returns the page as an accessibility tree where each focusable element has an opaque id like e5 or e12. When the agent clicks, it passes ref='e5' and Playwright MCP resolves that id on the live page. Three consequences. First, there is no selector string for the model to hallucinate; if an element is not in the snapshot, the agent cannot click it. Second, UI redesigns that change class names do not break the test, because the ref is rebound on each snapshot call. Third, assertion evidence is human-readable: a trace reads 'click e5 (Submit order button)' rather than 'click div:nth-child(3) > button.v2'. The SYSTEM_PROMPT at line 198 enforces 'ALWAYS call snapshot FIRST' because stale refs are the only way for this scheme to fail.
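Concretely, the round trip might look like this (the ref ids follow the article's examples; the exact rendering of the snapshot text and the tool-call JSON are illustrative):

```
snapshot →
  button "Add to cart" [ref=e5]
  link "Checkout" [ref=e6]

tool call →
  { "name": "click", "input": { "ref": "e5" } }
```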

How does wait_for_stable replace hardcoded timeouts?

Traditional Playwright tests pick a number: wait 2 seconds after clicking submit, wait 500ms for an animation, wait 10 seconds for the API. Too short and you get flake; too long and your suite takes an hour. wait_for_stable (agent.ts lines 956 to 1009) injects a MutationObserver into the page, counts DOM mutations in __assrt_mutations, and returns as soon as N consecutive seconds pass without a new mutation (default N is 2, max wait is 30). It adapts to the actual page: a fast login returns in 2 seconds, a slow chat response returns when the streaming stops. No magic number. The cleanup after the wait disconnects the observer and deletes the window globals, so your app sees no lingering overlay.

Does the agent actually watch a video or a screenshot? How much does each turn cost?

Every turn sends the Anthropic API a list of messages that includes the latest accessibility snapshot (text, up to 3000 chars) plus a JPEG screenshot of the current viewport. After every tool call except snapshot, wait, assert, and a few other non-visual ones (see the exclusion list at line 1024), the agent automatically captures a fresh screenshot so the next turn has up-to-date vision. The system prompt tells the model to prefer refs from the snapshot over visual matching, but the screenshot is there as a correctness check (the tree can be ambiguous, the image is not). Sliding window logic at lines 1064 to 1080 keeps the conversation from growing unboundedly by trimming at assistant/model boundaries, so a long scenario does not blow the context.

What happens when the page under test hangs or the server dies mid-run?

The preflightUrl() method at agent.ts line 518 does a HEAD/GET check with an 8-second timeout before Assrt even launches Chrome. A wedged dev server fails fast with an actionable 'Target URL did not respond within 8000ms' error, instead of burning time on a Chrome launch and then surfacing an opaque 'MCP client not connected' three minutes later. Once the run is going, the navigate wrapper at line 443 applies a 30-second timeout per navigation for the same reason. These are not nice-to-haves; without them, a single misbehaving server turns a 30-second test into a 3-minute hang plus a confusing error message.

Can I test flows that require email verification, like signup with an OTP code?

Yes, in the same plan, without a Mailosaur account. create_temp_email spins up a disposable inbox via the DisposableEmail class at core/email.ts (a call to an ephemeral-email service). wait_for_verification_code polls the inbox for up to 120 seconds and parses the OTP out. If the OTP input is the common split-across-six-fields pattern, the system prompt at line 234 instructs the agent to use evaluate() with a specific ClipboardEvent + DataTransfer expression that pastes all digits at once; typing into each field separately breaks on most React OTP components. That exact expression is hard-coded in the system prompt so the model does not have to rediscover the trick each run.
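The digit-distribution half of that trick reduces to a few lines. This sketch runs against plain objects standing in for the six inputs (the function and field shapes are mine; the real agent dispatches one synthetic paste via a ClipboardEvent + DataTransfer expression inside evaluate(), so React-controlled inputs see a single paste event):

```typescript
// Split-digit OTP paste, sketched against plain objects standing in for
// the six <input> fields. Shows only the digit distribution; the real
// agent fires one synthetic ClipboardEvent so controlled inputs update.
interface OtpField { value: string }

function pasteOtp(code: string, fields: OtpField[]): void {
  const digits = code.replace(/\D/g, "").split("");
  if (digits.length !== fields.length) {
    throw new Error(`expected ${fields.length} digits, got ${digits.length}`);
  }
  digits.forEach((d, i) => { fields[i].value = d; });
}
```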

Is this the same as self-healing tests? Is it related to Momentic or QA Wolf?

Self-healing is a patch on top of a selector-based stack: the framework tries alternate selectors when the primary one breaks. Assrt removes the selector layer entirely, which is a different architecture. Momentic and QA Wolf are hosted SaaS products with their own runtimes, closed test formats, and per-seat pricing that tends to sit in the thousands per month. Assrt is npm-installed, MIT-licensed, runs on your machine or your CI, and the tests are plain Markdown files you commit to your repo. Same category on a shelf, very different cost and portability profile. The code is at github.com/assrt-ai/assrt-mcp and you can read every file yourself.
