AI for testing, read-the-source edition

AI for testing: the four failure modes every other guide pretends do not exist.

Every article about this topic is a listicle of tools with pricing, feature bullets, and a screenshot. None of them show you what the agent actually does when the dev server is wedged, the OTP field is six single-character inputs, the page is still streaming, or the DOM reference you just saw is gone. This guide reads the Assrt agent source aloud. Eighteen tools, four source-level moves, zero vendor lock.

Matthew Diakonov · 11 min read
Key facts, from agent.ts lines cited:
agent.ts line 9: default driver is Claude Haiku 4.5
agent.ts line 235: DataTransfer paste expression, quoted verbatim
agent.ts lines 518-543: 8-second HEAD preflight probe
agent.ts lines 956-1009: MutationObserver stability gate
Tags: wedged dev servers, split-digit OTP inputs, streaming DOM, stale element refs, broken CSS selectors, silent stdio hangs, integrations with no UI proof, backend side-effects, late-rendered modals, lazy-loaded lists

What "AI for testing" actually means at the tool-call level

The phrase covers two different things most guides never distinguish. The first is AI that writes test code for you: a model outputs a .spec.ts file which a conventional Playwright runner then executes. The second is AI that is itself the runner: the model receives a live accessibility tree of the current page, picks one of a small fixed set of browser tools, executes it, reads the next tree, picks the next call, and loops. No script file is ever produced.

Assrt is squarely the second. The model in question, by default, is Claude Haiku 4.5 (DEFAULT_ANTHROPIC_MODEL = "claude-haiku-4-5-20251001" at agent.ts line 9). The tool surface is exactly eighteen calls. The rest of this guide is the four ugly realities that break any naive implementation of that loop, and the four specific pieces of source that defeat each one.
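The decide-act-reread loop described above can be sketched in a few lines. This is an illustration, not the Assrt source: `decide` stands in for the model call, `execute` for the Playwright MCP bridge, and all names are hypothetical.

```typescript
// Hypothetical sketch of a decide-act-reread agent loop (not the Assrt source).
type ToolCall = { name: string; input: Record<string, unknown> };
type ToolResult = { output: string };

async function runScenario(
  decide: (observation: string) => Promise<ToolCall>,   // the LLM turn
  execute: (call: ToolCall) => Promise<ToolResult>,     // the browser bridge
  maxTurns = 50,
): Promise<string> {
  // Start from an initial snapshot, then loop: the model picks one tool,
  // the tool runs, and its result becomes the next observation.
  let observation = (await execute({ name: "snapshot", input: {} })).output;
  for (let turn = 0; turn < maxTurns; turn++) {
    const call = await decide(observation);
    if (call.name === "complete_scenario") return observation;
    observation = (await execute(call)).output;
  }
  throw new Error("turn budget exhausted before complete_scenario");
}
```

No script file is ever produced; the "test" exists only as this sequence of tool calls and observations.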

The run loop, end to end

A single scenario looks like this: a user gives the agent a URL and a plan in plain markdown. The agent probes the URL, launches Playwright MCP, navigates, takes an initial snapshot, and enters the decide-act-reread loop. Each turn the model receives a page tree and emits one or more tool calls. Tool results feed back into the next turn. A scenario ends when the model calls complete_scenario.

One test scenario, one call at a time

Sequence diagram (Plan → Agent → Playwright MCP → Browser → Target app), Case 1: sign up:

  • preflight HEAD → 200 OK (142ms)
  • launch local browser → navigate(url) → open Chrome
  • snapshot with refs → click(ref=e6)
  • POST /send-code → 200 + email sent
  • evaluate(DataTransfer paste)
  • wait_for_stable → page stable 3.1s
  • assert passed → complete

Failure mode 1. The dev server that silently hangs

This one eats first-run debugging sessions. A local dev server in a bad state accepts the TCP handshake but never returns an HTTP response. Chrome opens. Playwright MCP calls page.goto(). The navigate tool blocks. The stdio pipe between the agent and Playwright MCP eventually drops. The only thing that surfaces to the engineer is a cryptic MCP client not connected error, 180 seconds later.

The fix is twenty lines of source. Before the browser is touched, a HEAD request is fired at the URL with an 8-second AbortController. A 404 or a 500 is fine: any HTTP response proves the server is alive. Only a timeout or a connection refused fails preflight, and it fails with a message the engineer can act on immediately.

assrt-mcp/src/core/agent.ts
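The same idea can be sketched independently of the Assrt source. Below is a minimal preflight probe, assuming a fetch-style client; the function name `preflight` and the return shape are hypothetical, not quotes from agent.ts.

```typescript
// Minimal HTTP preflight probe (hypothetical names, not the Assrt source).
// Any HTTP response, including 404 or 500, proves the server is alive;
// only a timeout or connection failure fails the probe.
async function preflight(
  url: string,
  timeoutMs = 8000,
  fetchImpl: typeof fetch = fetch,   // injectable for testing
): Promise<{ ok: boolean; status?: number; error?: string }> {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), timeoutMs);
  try {
    const res = await fetchImpl(url, { method: "HEAD", signal: ctrl.signal });
    return { ok: true, status: res.status };
  } catch {
    return {
      ok: false,
      error: `Target URL ${url} did not respond within ${timeoutMs}ms.`,
    };
  } finally {
    clearTimeout(timer);
  }
}
```

The design point is the asymmetry: status codes are irrelevant, liveness is everything. A wedged server fails in 8 seconds with a message naming the URL, instead of 3 minutes with a message naming the MCP transport.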

Failure mode 2. The split-digit OTP input

Modern auth flows split the six-digit verification code across six separate input[maxlength="1"] elements. Each one has its own onChange handler that auto-advances focus to the next box. Naive automation types "4" into the first box, the handler moves focus, but the second keystroke arrives before the focus event has settled. Keystrokes get dropped, focus jumps unpredictably, the form reads half the code.

The Assrt system prompt solves this by telling the model exactly what JavaScript expression to pass to evaluate(). The expression constructs a synthetic DataTransfer, sets the code as plain text on it, and dispatches a ClipboardEvent("paste") at the parent element. One atomic paste. Six boxes fill at once. The instruction is explicit: do not modify the expression except to replace CODE_HERE.

agent.ts SYSTEM_PROMPT (the exact string the model reads), one call

The model is told: 'Do NOT modify this expression except to replace CODE_HERE with the actual code.'

assrt-mcp/src/core/agent.ts line 236

Failure mode 3. Asserting against a page that is still rendering

A chat UI streams LLM tokens for two to four seconds. A lazy list paints the first three rows immediately, fetches the rest, then paints them 600 ms later. A framework that does client-side routing paints a skeleton first and swaps in content when the data resolves. Any assertion that fires before the page is actually stable either flakes or asserts against a half-rendered DOM.

The Assrt primitive for this is wait_for_stable. It injects a MutationObserver onto document.body, polls every 500 ms, and returns as soon as the mutation counter has been quiet for a configurable window. Default quiet window is 2 seconds, default overall timeout is 30 seconds. The tool cleans up its observer after returning, so it leaves no residue on the page.

assrt-mcp/src/core/agent.ts
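The polling core is separable from the browser glue. Here is a hedged sketch of just the quiet-window logic: the MutationObserver is replaced by an injected `sample` callback so the shape is testable outside a browser. Names and structure are illustrative, not the Assrt source.

```typescript
// Quiet-window polling sketch (hypothetical, not the Assrt source).
// In a browser, `sample` would read a counter incremented by a
// MutationObserver on document.body; here it is injected for clarity.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function waitForStable(
  sample: () => number,   // current mutation count
  pollMs = 500,           // how often to check
  quietMs = 2000,         // how long the count must hold still
  timeoutMs = 30000,      // overall budget
): Promise<{ stable: boolean; elapsedMs: number }> {
  const start = Date.now();
  let last = sample();
  let quietSince = Date.now();
  while (Date.now() - start < timeoutMs) {
    await sleep(pollMs);
    const now = sample();
    if (now !== last) {
      last = now;
      quietSince = Date.now();   // page changed: restart the quiet window
    } else if (Date.now() - quietSince >= quietMs) {
      return { stable: true, elapsedMs: Date.now() - start };
    }
  }
  return { stable: false, elapsedMs: Date.now() - start };
}
```

The key property: a burst of mutations at second 2.9 of a 3-second stream resets the window, so the tool cannot return early on a page that merely paused.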

Failure mode 4. The DOM reference that went stale

Every page-object-model test suite eventually dies from this. A designer renames a class. A framework upgrade changes the rendered HTML. An A/B test wraps the button in a new div. The hand-written selector div.auth-v2 > button.primary-btn.sign-in-btn passed yesterday. Today it throws a TimeoutError.

The agent does not write selectors. Before every interaction it calls snapshot(). Playwright MCP returns the accessibility tree with a freshly minted reference ID on every element. Click and type take the ref directly. When an action fails, the agent's error path re-reads the tree and hands the model new refs. There is nothing to rot between runs because nothing is cached between runs.
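To make the ref mechanics concrete, here is a hedged sketch of resolving an element from a snapshot tree. The node shape and the `findRef` helper are hypothetical illustrations, not the actual Playwright MCP snapshot format.

```typescript
// Hypothetical accessibility-tree shape and ref lookup. The real
// Playwright MCP format differs; this only illustrates the idea that
// targeting is "role + name -> fresh ref", never a CSS selector.
type AxNode = {
  role: string;        // e.g. "button", "textbox"
  name: string;        // accessible name, e.g. "Sign in"
  ref: string;         // minted fresh on every snapshot
  children?: AxNode[];
};

function findRef(node: AxNode, role: string, name: string): string | null {
  if (node.role === role && node.name === name) return node.ref;
  for (const child of node.children ?? []) {
    const hit = findRef(child, role, name);
    if (hit) return hit;
  }
  return null;
}
```

A click then becomes something like `click({ ref: findRef(tree, "button", "Sign in") })`. Because the tree is re-read every turn, the ref is allowed to change between runs without breaking anything.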

What the snapshot tool returns to the model

Hand-written selector vs. fresh ref per turn

await page.click('div.auth-v2 > button.primary-btn.sign-in-btn'); // Works today.
// Breaks the moment a class is renamed, a wrapper div is added,
// a framework upgrade changes the render tree, or an A/B test fires.
// Fix: open the file. Edit the selector. Re-run. Repeat every sprint.

  • Selector lives in your test code
  • Any markup change breaks it
  • Fix requires engineering time
  • Suite rots by default

The four failure modes, side by side with the source that fixes them


Failure mode 1: the dev server that silently hangs

A wedged local server accepts the TCP connection but never responds to HTTP. A naive agent navigates, Playwright's stdio pipe blocks for minutes, and the run eventually surfaces an opaque 'MCP client not connected'. Assrt fixes this with an 8-second HEAD probe before Chrome ever launches. An unreachable URL becomes a 1-line actionable error instead of a 3-minute mystery.

agent.preflight.fail url=http://localhost:3000 aborted=true durationMs=8014
Error: Target URL http://localhost:3000 did not respond within 8000ms.

Failure mode 2: the six-box OTP field that eats keystrokes

Modern signup forms split the six-digit code across six input[maxlength=1] elements. Type into box one and a framework event handler auto-advances focus; type into box two and it drops the keystroke; by box six your agent has typed '473918' but the form reads '39181_'. The fix is not 'type slower' — it is a single synthetic clipboard paste against the parent element.

The model passes the DataTransfer + ClipboardEvent expression to evaluate() verbatim. One call. Six fields fill atomically.

Failure mode 3: the page that is still rendering when you assert

AI chat UIs stream tokens for 2 to 4 seconds. Lazy lists render above-the-fold items first, then fetch and paint the tail. A naive 'wait 500ms then assert' either flakes or passes against half-rendered DOM. The MutationObserver stability primitive watches document.body and only returns after the page has actually stopped changing.

Page stabilized after 3.1s (47 total mutations)

Failure mode 4: the CSS selector that broke between runs

div.auth-v2 > button.primary-btn.sign-in-btn passes today, fails tomorrow when a designer renames a class. The agent never writes selectors. It calls snapshot(), reads the accessibility tree, and clicks by the ref the tree hands it right now. There is nothing to rot between runs.

- button "Sign in" [ref=e6] → click({ ref: "e6" })

A real run, annotated

Here is what a single scenario looks like end to end. The shape is always the same: preflight, launch, snapshot, decide, act, re-read, assert, complete. Notice where the OTP paste and the stability wait appear.

Terminal — assrt run against a staging signup flow

Where the tools fit in the broader data path

The eighteen tools are not the only moving parts. The agent also talks to a disposable email service for OTP verification, can call arbitrary external APIs to verify backend side-effects, and writes scenario results to disk for post-run inspection.

The full picture: what the agent connects to during a run

  • Test plan, target URL, and variables — inputs to the Assrt agent
  • Assrt agent ↔ Playwright MCP — browser control
  • Assrt agent ↔ temp-mail.io — disposable email for OTP verification
  • Assrt agent ↔ external APIs — backend side-effect checks
  • Assrt agent → results + video — written to disk for post-run inspection

The agent by the numbers

Specific numbers, all verifiable in the agent source.

18 — tools in the agent surface
8s — preflight HEAD timeout
500ms — mutation poll interval
$0/mo — Assrt license cost

What you get when the driver is the runner

18 tools. That's the entire surface.

agent.ts lines 16 to 196 define the complete TOOLS array. navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, wait_for_stable. No hidden SDK.
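The article does not reproduce the array, but the shape of a function-calling tool surface is standard. Here is a hedged sketch of what two entries could look like, using the Anthropic tool-use convention (`name`, `description`, `input_schema`); the exact Assrt definitions may differ.

```typescript
// Sketch of a tool-definition array in the Anthropic tool-use shape.
// Illustrative only; the real TOOLS array in agent.ts may differ in detail.
type ToolDef = {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, unknown>;
    required?: string[];
  };
};

const TOOLS: ToolDef[] = [
  {
    name: "click",
    description: "Click the element identified by a snapshot ref.",
    input_schema: {
      type: "object",
      properties: { ref: { type: "string", description: "ref from the latest snapshot" } },
      required: ["ref"],
    },
  },
  {
    name: "wait_for_stable",
    description: "Wait until the DOM has stopped mutating for a quiet window.",
    input_schema: {
      type: "object",
      properties: { quietMs: { type: "number" }, timeoutMs: { type: "number" } },
    },
  },
];
```

Eighteen entries of this shape are the entire contract between the model and the browser.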

Driver model: Claude Haiku 4.5

DEFAULT_ANTHROPIC_MODEL = 'claude-haiku-4-5-20251001' at agent.ts line 9. Fast enough to keep a test run conversational, cheap enough to run a scenario for a few cents.

Gemini driver available

Swap provider = 'gemini' and the same 18 tools run against gemini-3.1-pro-preview via a shared function-declaration layer. The scenario file does not change.

Preflight in twenty-odd lines of source

agent.ts lines 518 to 543: HEAD request with an 8-second AbortController. Fails fast on wedged servers with a message an engineer can act on.

Browser stays warm after the report

The comment at agent.ts line 491 is deliberate: 'Don't close the browser here — keep it alive so the user can take over.' Failed scenarios leave a live page to inspect.

Script-gen AI testing vs. Assrt, at the mechanics level

The two popular product categories wear the same label but are fundamentally different runtimes.

Same label, different runtime

| Feature | Script-generating AI testing platforms | Assrt |
| --- | --- | --- |
| What the AI produces | A .spec.ts or YAML test file | Tool calls against a live accessibility tree |
| Selector strategy | CSS or XPath baked into generated code | Fresh ref from snapshot() each turn |
| OTP-field handling | Typed character by character (often flakes) | DataTransfer paste via evaluate() (atomic) |
| Stream / lazy-load waits | page.waitForTimeout() guesses | MutationObserver stability gate |
| Wedged-server protection | Navigate hangs until Playwright timeout | 8-second HEAD preflight probe |
| External side-effect verification | Requires a separate test harness | http_request tool in the agent surface |
| License cost | $1,000 to $7,500+ per month | Free (open-source) |
| Source readable | Closed | github.com/assrt-ai/assrt-mcp |

The anchor fact, in one place

The system prompt at assrt-mcp/src/core/agent.ts line 235 hard-codes this exact JavaScript expression, and instructs the model to pass it verbatim to the evaluate tool whenever the page shows a split-digit OTP field:

() => {
  const inp = document.querySelector('input[maxlength="1"]');
  if (!inp) return 'no otp input found';
  const c = inp.parentElement;
  const dt = new DataTransfer();
  dt.setData('text/plain', 'CODE_HERE');
  c.dispatchEvent(new ClipboardEvent('paste', {
    clipboardData: dt,
    bubbles: true,
    cancelable: true
  }));
  return 'pasted ' + document.querySelectorAll('input[maxlength="1"]').length + ' fields';
}

It is not an example in a blog post. It is the production string the model receives on every run. Grep the file and you will find it. That level of specificity is the difference between an AI that markets itself as a QA agent and an AI that actually passes a signup flow the first time.

What an Assrt agent actually gives you

Every item below corresponds to a specific function or constant in assrt-mcp/src/core/agent.ts. Open the file and verify each one.

The real surface, line by line

  • Live accessibility tree snapshot before every action (agent.ts snapshot tool)
  • Ref-based targeting — no hand-written CSS selectors
  • 8-second preflight HEAD probe against the target URL
  • MutationObserver stability gate for streaming DOM
  • DataTransfer + ClipboardEvent paste for split-digit OTPs
  • Disposable email + verification-code polling (temp-mail.io)
  • http_request tool for verifying external side-effects
  • Retry with exponential backoff on 429/503/529 API responses
  • Sliding conversation window that never orphans tool_use / tool_result pairs
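The retry bullet above is a standard pattern worth seeing in miniature. Below is a hedged sketch of exponential backoff keyed to those status codes; the names are hypothetical and the real implementation in agent.ts may add jitter or a delay cap.

```typescript
// Exponential backoff on retryable API status codes (sketch, not the
// Assrt source). Delay doubles each attempt: base, 2x, 4x, ...
const RETRYABLE = new Set([429, 503, 529]);

async function withBackoff<T>(
  attempt: () => Promise<{ status: number; body: T }>,
  maxRetries = 5,
  baseDelayMs = 500,
): Promise<{ status: number; body: T }> {
  for (let i = 0; ; i++) {
    const res = await attempt();
    // Success, a non-retryable error, or an exhausted budget: return as-is.
    if (!RETRYABLE.has(res.status) || i >= maxRetries) return res;
    await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
  }
}
```

Only 429/503/529 are retried; a 400 or 401 returns immediately, since retrying a malformed or unauthorized request never helps.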

Want to see the agent survive your ugliest flow?

Bring a staging URL with a real OTP step, a streaming page, or a flow that keeps flaking. We will run it live and walk through every tool call the agent emits.

Book a call

Frequently asked questions

What does 'AI for testing' actually mean when you read the source of a working agent?

It means an LLM that receives a live accessibility tree of the current page as a text snapshot, picks one of 18 concrete browser tools (navigate, click, type_text, assert, wait_for_stable, evaluate, http_request, and a dozen more), executes that tool against a real Playwright browser, receives the result, and loops. Nothing is pre-rendered into a .spec.ts file. The model is not writing code that another runner executes later — the model is the runner. In Assrt's case the default driver is Claude Haiku 4.5 (agent.ts line 9: DEFAULT_ANTHROPIC_MODEL = 'claude-haiku-4-5-20251001').

How does Assrt handle the split-digit OTP verification field that breaks most AI agents?

This is the single most specific trick baked into the system prompt. At agent.ts line 235, the prompt tells the model that if the OTP input is split across six single-character fields (a common pattern), the model must NOT type into each one individually. Instead it must call evaluate() with an exact JavaScript expression that constructs a DataTransfer object, sets the plain-text code on it, and dispatches a paste ClipboardEvent against the parent element. The full expression is quoted in the prompt verbatim, and the model is instructed not to modify it except to substitute the real code for CODE_HERE. After the paste, the model calls snapshot() to verify all six fields filled, then clicks Verify.

What is the preflight probe and why does it matter for AI-driven tests?

agent.ts lines 518 to 543 implement preflightUrl. Before launching Chrome, the agent does a HEAD request (falling back to GET on 405 or 501) against the target URL with an 8-second AbortController timeout. If the server is unreachable, the agent fails fast with an actionable message: 'Target URL <url> did not respond within 8000ms. The server may be wedged, still starting, or unreachable.' Without this probe, a wedged dev server causes Playwright MCP's stdio connection to silently hang for minutes, eventually surfacing as an opaque 'MCP client not connected' error. The preflight is twenty-odd lines of source that save most first-run debugging sessions.

How does Assrt decide that a page has finished streaming before making an assertion?

The wait_for_stable tool at agent.ts lines 956 to 1009. When called, the agent injects a MutationObserver onto document.body watching childList, subtree, and characterData mutations. It then polls every 500 ms and tracks when the mutation counter last changed. If the counter has been stable for the configured window (default 2 seconds) within the overall timeout (default 30 seconds), the tool returns 'Page stabilized after Ns'. If not, it returns 'Timed out after Ns (page still changing)'. This is the correct primitive for asserting against a streaming AI chat response, a lazy-loading search result list, or any page whose final DOM is not ready at navigation time.

Why does Assrt use ref identifiers like ref=e5 instead of CSS selectors?

The agent never writes selectors by hand. Before every interaction it calls snapshot(), which returns the page's accessibility tree with each element tagged with a reference ID such as ref=e5. Clicks and type operations take that ref directly. This works for two reasons: first, the ref is generated by Playwright MCP at the moment of the snapshot, so it corresponds to whatever the DOM looks like right now, not yesterday. Second, when an action fails, the agent's error path re-calls snapshot() and hands the model a fresh tree with fresh refs. UI changes that would break a brittle CSS selector just produce a new ref, which the model picks up on the next turn. See agent.ts lines 206 to 226 for the selector strategy rules the model follows.

Does the AI actually verify external side-effects, or only DOM state?

Both. The http_request tool at agent.ts lines 925 to 955 lets the model call any external HTTP endpoint during a test run with a 30-second abort timeout. Concrete example from the system prompt: after a flow that 'connects Telegram' in a web app, the agent can fire a GET against https://api.telegram.org/bot<token>/getUpdates to verify the expected message actually arrived at Telegram's side. Same pattern works for verifying a webhook fired to your own API, a Slack message landed, or a GitHub comment was posted. This turns the agent from a DOM-only checker into an end-to-end integration verifier.

How many tools does the agent actually have access to in a single scenario?

Eighteen. They are defined as a TOOLS array at agent.ts lines 16 to 196: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, wait_for_stable. That's the entire surface area. No hidden SDKs, no proprietary DSL, no YAML schema. If you can describe a verification in terms of those 18 calls, the agent can execute it.

What happens to the browser after a test scenario completes?

It stays open. agent.ts line 491 has an explicit comment: 'Don't close the browser here — keep it alive so the user can take over and interact after the test finishes.' The design choice is deliberate: when a scenario fails, the engineer wants to poke the live browser state to see what the AI saw. Closing the browser the moment the report is written forces a full re-run just to inspect a button. Assrt keeps the session warm via the keepBrowserOpen path in close(), and screenshots, snapshots, and a shareable local video player at 127.0.0.1 remain reachable after the test report is emitted.

Is Assrt free? And what is an actual run priced at?

Assrt MCP is MIT licensed and free to install from npm. The only cost on a run is the Anthropic API invoice for Claude Haiku calls. A typical 10-step scenario costs a few cents in tokens, a full order of magnitude cheaper than commercial AI testing platforms that start around $1,000 per month and climb to $7,500+ per month for midmarket enterprise tiers. There is no Assrt subscription, no per-seat fee, no cloud lock. The full source of the MCP server is at github.com/assrt-ai/assrt-mcp.
