AI for software testing is three perception channels, not one
Every guide on this topic writes the same sentence: the AI sees your app. That sentence hides the mechanism. Assrt's agent reads a text accessibility tree on every turn, captures a JPEG only after six specific tool calls, and fires HTTP requests to verify outcomes the browser cannot observe. The rule for each channel is a specific line in agent.ts.
The sentence every other guide skips past
Read any introduction to this topic and you will see a variant of the same claim: modern tools let an AI agent read your application, decide what to do, and verify it worked. That sentence is true but it hides three very different operations behind one verb. What does "read your application" actually mean, at the wire level, when an LLM is driving a browser?
In Assrt the answer is three separate channels, each governed by a rule you can audit. The accessibility tree feeds the model on every turn. Screenshots are captured conditionally, and the condition is a named deny-list. External APIs are reachable through a tool that returns status code and body text, not pixels. Together these are the full perception surface of the agent. Individually, each exists because the others miss something.
Below is each channel with the exact source-level rule that governs it, and then the diagram of how all three combine in a single turn of the agent loop.
The three channels converging on one model
Three inputs, one decision, three possible outputs
Channel one: the accessibility tree
The primary channel is the text tree the model receives from the snapshot tool. Playwright MCP walks the DOM and emits each interactive element with a role, a name, and a reference ID the model can quote back when it wants to click or type. The tree is text, which is cheap; it is structured, which is deterministic; and it uses stable ref IDs, which removes the need for the model to guess pixel coordinates or generate XPath selectors. Below is what the model actually reads on a typical turn.
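The sketch below is illustrative, not literal Assrt output: the page, element names, and ref values are invented for the example, but the shape (role, accessible name, ref=eN reference) matches the scheme described here.

```yaml
- heading "Welcome back" [level=1]
- textbox "Email" [ref=e2]
- textbox "Password" [ref=e3]
- button "Sign in" [ref=e4]
- link "Forgot password?" [ref=e5]
```

To press the button, the model would cite ref e4 in its next click call, with a human-readable description for the log, rather than generating a selector or a pixel coordinate.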
The system prompt given to the model is explicit about how to use this tree: always call snapshot first, find the target element in the returned tree, cite its ref value in the next click or type_text call, and pass a human-readable description for logging. If a ref goes stale because the page changed, the next failure message nudges the model to re-snapshot and pick a fresh ref rather than retry with the old one.
Channel two: screenshots, conditional on which tool just ran
A screenshot on every turn would blow up both token cost and latency. Assrt's rule, defined as a literal list of tool names in agent.ts, is that only six tools trigger a screenshot after they run. The remaining eleven tools suppress it. The model still has full access to the accessibility tree on the following snapshot call, so suppressing the screenshot is not a loss of signal when the UI did not change.
This is the anchor fact of this guide, and you can verify it directly. Open src/core/agent.ts in the assrt-mcp repository and search for the expression that starts with ![ near line 1024. The eleven-name array is the perception policy.
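A minimal sketch of that policy, assuming deny-list semantics of the shape described here. The tool names are taken from the lists that follow; the helper itself is hypothetical, not the actual agent.ts code.

```typescript
// Hypothetical sketch of the screenshot policy described in this guide.
// The real rule lives in agent.ts near line 1024; this helper only
// illustrates the deny-list shape.
const NON_VISUAL_TOOLS: ReadonlySet<string> = new Set([
  "snapshot", "wait", "wait_for_stable", "assert", "complete_scenario",
  "create_temp_email", "wait_for_verification_code", "check_email_inbox",
  "screenshot", "evaluate", "http_request",
]);

function shouldCaptureScreenshot(toolName: string): boolean {
  // Deny-list semantics: capture unless the tool is known to be non-visual.
  return !NON_VISUAL_TOOLS.has(toolName);
}
```

The inversion is the point of the design: a new tool added to the array defaults to triggering a screenshot, and suppression has to be opted into by name.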
Six tools that trigger a screenshot
- navigate — new URL loaded; viewport likely fully different
- click — pressing a button can open a modal, submit a form, or replace content
- type_text — typing can trigger live validation, autocomplete, or search results
- select_option — choosing a dropdown value can change visible fields
- scroll — new region of the page is now in view
- press_key — keys like Enter, Tab, Escape can submit, advance, or dismiss
Eleven tools that suppress the screenshot
- snapshot — returns the accessibility tree; page has not changed
- wait — the model just waited for text or time; no user action happened
- wait_for_stable — MutationObserver-gated wait; same reasoning
- assert — records a pass/fail judgement; no DOM change
- complete_scenario — terminates the loop; nothing to capture
- create_temp_email — provisions a disposable address via the email service
- wait_for_verification_code — polls the disposable inbox, not the browser
- check_email_inbox — another inbox read, browser untouched
- screenshot — the model asked for one explicitly; captured separately
- evaluate — runs JavaScript for a value; screenshot would not add context
- http_request — calls an external API; response is text, not pixels
The shape of this rule matters beyond cost control. It encodes a design claim: when the agent calls a non-visual tool, the state of the page is unchanged, and sending another JPEG to the model would waste tokens on a duplicate frame. A screenshot after assert or http_request or wait_for_stable would be identical to the one taken after the last visual action. The deny-list is a lossless compression of the perception stream.
Channel three: HTTP requests to everything the browser cannot see
The browser is opaque to its own side effects. When a form submission sends a Telegram message, fires a webhook, or writes to a Slack channel, the browser only sees the confirmation screen. Any guide that stops at screenshot plus DOM assertion misses half the system under test.
Assrt exposes an http_request tool. The model passes a URL, method, headers, and optional body; the tool fires a 30-second-timeout fetch with a Content-Type: application/json default header and returns the status line plus up to 4000 characters of response text. The model then asserts against that payload the same way it asserts against a visible heading.
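A sketch of a tool with that behaviour, assuming the contract described above (30-second timeout, JSON default header, 4000-character truncation). The function name and shape are assumptions for illustration, not the actual Assrt implementation.

```typescript
// Illustrative sketch of an http_request-style tool. Mirrors the behaviour
// described in this guide; names and shape are hypothetical.
function truncateBody(text: string, limit = 4000): string {
  // Cap the response body so a huge payload cannot flood the model context.
  return text.slice(0, limit);
}

async function httpRequestTool(
  url: string,
  method: string = "GET",
  headers: Record<string, string> = {},
  body?: string,
): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 30_000); // 30-second cap
  try {
    const res = await fetch(url, {
      method,
      headers: { "Content-Type": "application/json", ...headers },
      body,
      signal: controller.signal,
    });
    const text = await res.text();
    // Return the status line plus at most 4000 characters of body text.
    return `${res.status} ${res.statusText}\n${truncateBody(text)}`;
  } finally {
    clearTimeout(timer);
  }
}
```

The return value is plain text on purpose: the model asserts against it with the same assert tool it uses for page content.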
In a typical test this tool fires once or twice per scenario, not on every turn. It is the least common channel but the most decisive when what you care about is server-side state. Writers of test plans do not need to request it explicitly; the model decides to use it when the step description mentions an external service and the browser alone cannot confirm the outcome.
How the three combine in a single turn
Across one turn of the agent loop, the three channels are interleaved deterministically. The model snapshots (channel one), decides a tool call, executes it, optionally receives a screenshot (channel two) or an HTTP response body (channel three), and passes the combined context into its next decision. The loop does not terminate on a step count; it runs until the model emits complete_scenario or stops producing tool calls.
One turn of the agent loop
Snapshot
Model emits snapshot(). Assrt calls Playwright MCP, resolves the file-output .yml, returns the accessibility tree as compact text. Tokens: low.
Decide
Model reads the tree, writes a reasoning paragraph, emits the next tool call with the ref ID it just saw. No pixel guessing.
Execute
Assrt dispatches the tool. Visual tools change the page; the other eleven tools return data without UI change.
Capture
If the tool was one of the six visual tools (navigate, click, type_text, select_option, scroll, press_key), a JPEG is captured and attached to the next model message. Otherwise the reply is text-only.
Append
Tool result + optional screenshot appended to conversation. Sliding window trims at assistant boundaries so tool_use/tool_result pairs never orphan.
Next turn
Loop continues until the model emits complete_scenario, or stops emitting tool calls. There is no hard step limit.
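The turn described above can be sketched as a control-flow skeleton. Every identifier here is hypothetical; only the loop structure mirrors the guide's description (decide, execute, capture for visual tools, append, repeat until complete_scenario or silence).

```typescript
// Schematic of the agent loop described above. Names are invented for the
// sketch; only the control flow follows the guide.
type ToolCall = { name: string; input: Record<string, unknown> };

interface Model {
  // Returns the next tool call given the conversation so far,
  // or null when the model stops emitting tool calls.
  nextToolCall(history: string[]): ToolCall | null;
}

const VISUAL_TOOLS = new Set([
  "navigate", "click", "type_text", "select_option", "scroll", "press_key",
]);

function runAgentLoop(
  model: Model,
  execute: (call: ToolCall) => string,
  captureScreenshot: () => string,
): string[] {
  const history: string[] = [];
  for (;;) {
    const call = model.nextToolCall(history);      // Decide
    if (call === null) break;                      // model went silent
    const result = execute(call);                  // Execute
    history.push(`${call.name}: ${result}`);       // Append tool result
    if (VISUAL_TOOLS.has(call.name)) {
      history.push(captureScreenshot());           // Capture: visual tools only
    }
    if (call.name === "complete_scenario") break;  // terminal tool
  }
  return history;
}
```

Note that there is no step counter anywhere in the loop: the only exits are the terminal tool and the model declining to call anything.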
What this looks like next to the standard "AI sees your app" framing
Most introductions to this topic collapse the three channels into a single line. That framing is fine as a tagline and broken as a model of the system. Side by side, the difference between an auditable perception policy and a black box looks like this.
Perception, stated explicitly vs. stated vaguely
| Feature | Typical AI testing framing | Assrt |
|---|---|---|
| Primary perception input | Usually unstated — "the AI sees your app" | Compact accessibility tree with ref=eN IDs, returned on every snapshot call |
| When a screenshot is taken | Either every turn (cost heavy) or unspecified | Only after the 6 visual tools; suppressed for the 11 non-visual tools (agent.ts line 1024) |
| External side-effect verification | Browser-only; webhooks and bot messages are not checked | http_request tool lets the model call any external API from inside the test |
| How the model targets elements | Text matching, XPath, or pixel coordinates | ref IDs from the most recent snapshot; re-snapshot on stale |
| Waiting for async content | Fixed sleep, or a self-healing wrapper | wait_for_stable with MutationObserver; polls every 500ms, defaults to 2s quiet window |
| Can you audit the perception rules? | Closed implementation | Open-source. Read agent.ts TOOLS array and SYSTEM_PROMPT |
The numbers that define the policy
Every number in this guide is countable in the source. The tool total comes from the length of the TOOLS array. The visual and non-visual splits come from the deny-list condition at line 1024.
Reading the policy yourself
This page is written as a derivation of the source, not an abstract summary. The claims above are each anchored to a specific section of the agent file. If you want to verify or extend them, these are the landmarks.
Where each claim lives in agent.ts
- TOOLS array at line 16 — every tool the model can call, with its input schema
- SYSTEM_PROMPT at line 198 — the playing rules (always snapshot first, cite ref IDs)
- snapshot case at line 778 — resolves the Playwright MCP file output into the text tree
- http_request case at line 925 — 30-second timeout, 4000-char truncation
- wait_for_stable case at line 956 — MutationObserver injection, 500ms poll
- Screenshot deny-list at line 1024 — the eleven tool names that suppress capture
- Sliding window at line 1060 — trims at assistant boundaries so tool_use never orphans
Want to see the three channels fire on your app?
Bring a URL and a flow. We will run it together and walk you through the tree, the screenshots, and the HTTP calls the agent makes during the run.
Book a call →
Frequently asked questions
What does AI for software testing actually look at when it runs a test?
Assrt's agent has three distinct perception channels and uses them on different turns. On every turn, the model receives the page's accessibility tree as compact text, with each interactive element tagged by a reference like ref=e5. After any of six visual tools (navigate, click, type_text, select_option, scroll, press_key), the agent captures a JPEG screenshot and attaches it to the next model call. After the remaining eleven tools (snapshot, wait, wait_for_stable, assert, complete_scenario, create_temp_email, wait_for_verification_code, check_email_inbox, screenshot, evaluate, http_request) the screenshot is explicitly suppressed. Separately, the http_request tool lets the model call any external API from inside the test, so it can verify things the browser never sees, like a Telegram bot message or a webhook payload.
Why are screenshots conditional instead of taken on every turn?
A JPEG on every turn would blow up both cost and latency. Screenshots sent with every model call balloon prompt size by roughly 30-100KB per image, and a 40-step scenario would send the equivalent of a 2-4MB image stream through the API. Assrt's rule is simple: only take a screenshot when the last action could have visibly changed the page. Navigating, clicking, typing, selecting, scrolling, and pressing keys can all change the viewport. Calling snapshot or asserting cannot. The rule is defined explicitly in agent.ts line 1024 as a deny-list of eleven tool names; if the tool that just ran is not on that list, a screenshot is captured and attached.
How does the accessibility tree differ from a screenshot for an AI test?
The accessibility tree is a structured text representation of every interactive element on the page, already labelled with role, name, and a stable reference ID the model can cite in the next tool call. A screenshot is a flat image that requires the model to re-read pixels every turn. The tree is cheap, deterministic, and lets the model target elements by ref rather than by pixel coordinate, which is why it is the primary channel. The screenshot is a fallback for visual state that the tree cannot express: styling bugs, overlap, images that rendered blank, and modals that obscure content. Assrt uses both because neither alone is complete.
Why does an AI testing tool need to make HTTP requests of its own?
Because some test outcomes never touch the browser. If your app submits a form that triggers a Telegram bot message, the browser sees only the confirmation screen. To verify the bot actually received the message, the agent needs to call the Telegram Bot API directly and inspect the getUpdates response. Assrt exposes this as the http_request tool: the model passes a URL, method, headers, and body, and gets back the status code plus up to 4000 characters of response. The test then asserts on that data the same way it asserts on page content. This makes AI for software testing a verification channel for server-side and third-party state, not just UI state.
Can the model target elements by visual position instead of by ref ID?
It can, but Assrt's system prompt strongly discourages it. The canonical rule given to the model is to always call snapshot first, find the element in the returned tree, and use the ref ID in the next click or type_text call. Targeting by visible text alone is slower and less reliable because repeated text on the page produces collisions. Targeting by pixel position fails whenever layout shifts between runs. The ref IDs are stable for the duration of a single snapshot and are the intended addressing scheme. If a ref goes stale, the agent re-snapshots and picks a fresh ref rather than guessing coordinates.
What happens if the page is still loading when the model tries to act?
Assrt ships a dedicated wait_for_stable tool that the model can invoke instead of a fixed timeout. The tool injects a MutationObserver into the page, counts DOM mutations, and polls every 500ms until no mutations have been observed for a configurable stable window, defaulting to two seconds. Maximum wait is sixty seconds. This adapts to real load time rather than padding every test with a three-second sleep. It is the preferred way to wait for AI chat responses, search results, or any async content the page streams in after a tool call.
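The polling logic described above can be sketched as follows. In the browser the mutation count would come from an injected MutationObserver; here it is an injected callback so the logic runs anywhere. All names and the exact shape are assumptions, not the actual Assrt implementation.

```typescript
// Sketch of wait_for_stable's polling logic as described in this guide:
// poll every 500ms, resolve true after a quiet window (default 2s) with
// no new mutations, give up after a 60s cap. Names are hypothetical.
async function waitForStable(
  getMutationCount: () => number, // stand-in for the MutationObserver counter
  pollMs = 500,
  quietMs = 2_000,
  maxMs = 60_000,
): Promise<boolean> {
  const start = Date.now();
  let lastCount = getMutationCount();
  let quietSince = Date.now();
  while (Date.now() - start < maxMs) {
    await new Promise((r) => setTimeout(r, pollMs));
    const count = getMutationCount();
    if (count !== lastCount) {
      lastCount = count;        // page mutated: reset the quiet window
      quietSince = Date.now();
    } else if (Date.now() - quietSince >= quietMs) {
      return true;              // stable for the full quiet window
    }
  }
  return false;                 // timed out without stabilising
}
```

The quiet-window reset is what makes this adaptive: a page that streams content in bursts keeps pushing the resolution out, while a page that settles early resolves as soon as the window elapses.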
How does Assrt decide which tool the model gets to call next?
It does not decide. The model does. Assrt publishes the full tool surface to the LLM (navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, wait_for_stable) and the model picks the next one based on the current accessibility tree and the test goal. There is no scripted sequence, no DSL, and no intermediate rule engine. Each tool returns its result, Assrt appends it to the conversation, and the model decides the next move. The agent loop runs until the model calls complete_scenario or the model stops emitting tool calls.
Can I read the source code that governs how Assrt's AI perceives the page?
Yes. Assrt MCP is open-source and published at github.com/assrt-ai/assrt-mcp. The perception rules live in src/core/agent.ts. The TOOLS array starting at line 16 defines every tool the model can call, with its input schema. The SYSTEM_PROMPT at line 198 contains the playing rules (always snapshot first, use ref IDs, use wait_for_stable for async). The screenshot decision rule is at line 1024 as a deny-list. The http_request implementation is at line 925. Every behaviour discussed on this page is verifiable by reading those lines.