AI Testing Guide

How AI-Powered Agentic Test Execution Works: The Perception Loop Nobody Explains

Search for "agentic test execution" and you will find dozens of pages explaining that an AI agent "interacts with your app like a human." They describe the benefits: self-healing tests, natural language authoring, reduced maintenance. What they never explain is the mechanism. How does the agent actually see the page? How does it know which element to click? What happens when the page changes between steps? This guide answers those questions by walking through the perception-action-recovery loop that makes agentic testing work, with implementation details from Assrt, an open-source tool you can inspect yourself.

$0/mo

Generates real Playwright code, not proprietary YAML. Open-source and free vs $7.5K/mo competitors.

Assrt vs commercial platforms

1. What "Agentic" Actually Means in Test Execution

Most testing tools follow a script: go to URL, find element by CSS selector, click it, assert result. If the selector breaks, the test fails and a human fixes it.

An agentic test runner replaces the fixed script with a decision loop. Instead of following pre-written selectors, the agent observes the current state of the page, decides what action to take based on its test objective, executes the action, observes the result, and decides what to do next. If something goes wrong, it can try a different approach without stopping the run.

This is not a metaphor. It is a literal loop: perceive, act, observe, repeat. The agent runs inside that loop for every step of every test case, up to a configurable maximum (60 steps per scenario in Assrt). The loop only exits when the agent calls a completion function, either marking the test as passed or failed with evidence.

The critical difference from traditional automation is that the agent makes real-time decisions at every step. It does not have a pre-compiled list of selectors. It reads the page, reasons about what it sees, and chooses an action. That reasoning step is what makes it agentic.

2. Perception: How the Agent Sees the Page

This is the part that every other guide skips. When an agentic test runner says it "sees" the page, what does that actually mean? There are three possible approaches, and they have very different tradeoffs.

Approach 1: Screenshot + Vision Model

Take a screenshot of the page, feed it to a vision-capable LLM, and ask the model to identify elements. This is intuitive but expensive. Every step requires a large image payload, the model must infer element boundaries from pixels, and coordinate mapping between the screenshot and actual clickable areas is error-prone.

Approach 2: Raw DOM/HTML

Dump the full HTML of the page and send it to the LLM. This gives complete structural information but generates enormous context. A typical SPA page has tens of thousands of DOM nodes, most of which are irrelevant to test execution. The signal-to-noise ratio is terrible, and the token cost per step is prohibitive.

Approach 3: Accessibility Tree Snapshots

This is what Assrt uses. Instead of screenshots or raw HTML, the agent requests an accessibility tree snapshot from Playwright. The accessibility tree is a structured representation of the page that browsers already maintain for screen readers. It contains only the semantically meaningful elements: buttons, links, text fields, headings, labels, and their relationships.

Each element in the snapshot gets a unique reference ID (like ref="e5"). The tree is compact (hundreds of lines vs. thousands of DOM nodes), structured (the LLM can reason about element roles and labels), and directly actionable (those ref IDs can be used to target elements for clicks and typing).

The agent's system prompt in Assrt enforces a strict rule: "ALWAYS call snapshot FIRST to get the accessibility tree with element refs." Before any action, the agent must perceive. This is defined in agent.ts as part of the system prompt that governs the agent loop. No perception, no action.
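As a concrete illustration, a snapshot of a simple login page might look like the following. The exact serialization varies by Playwright MCP version, and the labels and refs here are hypothetical:

```
- heading "Welcome back" [level=1] [ref=e1]
- textbox "Email" [ref=e3]
- textbox "Password" [ref=e4]
- button "Sign In" [ref=e5]
- link "Forgot password?" [ref=e6]
```

Compare that to the thousands of DOM nodes behind the same page: the tree keeps only what a screen reader, or an agent, needs.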

See the accessibility tree in action

Run Assrt against any URL and watch the agent read the page structure before each action. Free and open-source.

Get Started

3. Action: Ref-Based Element Targeting

Once the agent has the accessibility tree with ref IDs, targeting an element is straightforward. If the agent wants to click the "Sign In" button, it scans the tree, finds the button element labeled "Sign In" with ref e12, and sends a click command with that ref.

This is fundamentally different from CSS selector-based targeting. A CSS selector like button.auth-btn.primary breaks when someone renames the class, moves the button into a different container, or refactors the component hierarchy. A ref-based approach does not depend on class names, DOM depth, or component structure. It depends on the element being present in the accessibility tree with a recognizable label.

Ref IDs are ephemeral. They are assigned fresh on every snapshot call. The ref e12 that pointed to "Sign In" before a page transition might not exist after the page loads new content. This is by design. The agent is expected to take a new snapshot after every action that might change the page. The refs are not persistent identifiers; they are frame-by-frame labels.

Assrt's browser manager wraps 18 distinct browser tools through the Playwright MCP protocol: navigate, click, type, snapshot, scroll, press key, wait, screenshot, evaluate JavaScript, and select option, among others. Every tool call is logged with argument summaries and duration. A typical log line looks like [mcp] browser_click el="Sign In" (243ms), giving full observability into what the agent does and how long each action takes.
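To make ref-based targeting concrete, here is a toy TypeScript helper that does what the agent does when it reads the tree: scan snapshot lines for an element with a given role and label and return its ref. The snapshot serialization here is illustrative, not Assrt's exact format:

```typescript
// Hypothetical helper: find the ref for an element by role and label.
// Assumes one element per line in the form: - role "Label" [ref=eN]
function findRef(snapshot: string, role: string, label: string): string | null {
  for (const line of snapshot.split("\n")) {
    const m = line.match(/- (\w+) "([^"]+)" \[ref=(\w+)\]/);
    if (m && m[1] === role && m[2] === label) return m[3];
  }
  return null; // not present in this frame's tree
}

const snapshot = `- textbox "Email" [ref=e3]
- button "Sign In" [ref=e12]`;

console.log(findRef(snapshot, "button", "Sign In")); // → e12
```

Note that the lookup is by role and label, not by CSS class or DOM position, which is why it survives refactors that would break a selector.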

4. Recovery: Fuzzy Matching When Refs Go Stale

Ref-based targeting works cleanly when the page is stable. But pages are not always stable. A modal might appear. A navigation might redirect. An AJAX call might replace half the DOM. When that happens, the ref the agent planned to use no longer exists.

This is where self-healing happens, and it is worth understanding how it actually works instead of accepting the marketing claim at face value.

In Assrt, when a ref-based action fails, the recovery mechanism is a function called showClickAt() in the browser module. It queries the page for all interactive elements: links, buttons, inputs, elements with role="button", selects, textareas, labels, and elements with onclick handlers or href attributes.

Then it scores each candidate against the intended target using a three-tier matching system:

| Match type | Score | Condition |
|---|---|---|
| Exact match | 3 | Element text content equals the target string exactly |
| Partial match | 2 | Element text contains the target string as a substring |
| Word overlap | Proportional | Score based on the ratio of shared words between element text and target |

The highest-scoring element wins. If someone renamed "Sign In" to "Sign in to your account," the partial match scores 2 and the agent clicks the right button. If they renamed it to "Log In," the word overlap on "In" still produces a nonzero score, though a lower one. This is not magic. It is fuzzy text matching across a known set of interactive elements.
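The scoring logic is simple enough to sketch. This is a reconstruction from the three-tier description above, not Assrt's actual source:

```typescript
// Three-tier fuzzy score: exact match (3), substring match (2),
// else the proportion of target words found in the element text.
function scoreCandidate(elementText: string, target: string): number {
  const text = elementText.trim().toLowerCase();
  const goal = target.trim().toLowerCase();
  if (text === goal) return 3;        // exact match
  if (text.includes(goal)) return 2;  // partial (substring) match
  const textWords = new Set(text.split(/\s+/));
  const goalWords = goal.split(/\s+/);
  const shared = goalWords.filter((w) => textWords.has(w)).length;
  return shared / goalWords.length;   // proportional word overlap
}

console.log(scoreCandidate("Sign in to your account", "Sign In")); // → 2
console.log(scoreCandidate("Log In", "Sign In"));                  // → 0.5
```

Running it over every interactive element and taking the maximum reproduces the recovery behavior described above: a rename survives as a partial match, and even a rewording keeps a nonzero overlap score.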

On top of this, the agent-level recovery follows a defined escalation path. If an action fails, the agent takes a fresh snapshot to see the current page state. It looks for page changes like modals or navigation. It tries a different ref or approach. If stuck after three attempts, it scrolls and retries. If still stuck, it marks the step as failed with evidence and moves on.

5. The Agent Loop: How All Three Phases Connect

The perception-action-recovery cycle runs inside a loop that continues until the agent decides the scenario is complete or it hits the 60-step maximum. Here is the sequence for each iteration:

  1. The agent calls the LLM with the current conversation history (all prior observations and actions) plus available tools
  2. The LLM returns one or more tool calls (snapshot, click, type, navigate, etc.)
  3. Each tool call executes against the live browser and returns its result, which can include text content and screenshots
  4. Results feed back into the conversation for the next LLM call
  5. The loop repeats until the agent calls complete_scenario with a pass/fail verdict
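The iteration above can be sketched in a few lines of TypeScript. All names here are hypothetical, and the "LLM" is stubbed as a function from conversation history to tool calls; this is the loop's shape, not Assrt's implementation:

```typescript
// Sketch of the agent loop: LLM proposes tool calls, results feed back
// into history, and the loop exits on complete_scenario or the step budget.
type ToolCall = { name: string; args: Record<string, string> };
type LLM = (history: string[]) => ToolCall[];

function runScenario(llm: LLM, execTool: (c: ToolCall) => string, maxSteps = 60): string {
  const history: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const calls = llm(history);                   // steps 1-2: ask for tool calls
    for (const call of calls) {
      if (call.name === "complete_scenario") return call.args.verdict; // step 5
      history.push(execTool(call));               // steps 3-4: execute, feed back
    }
  }
  return "failed";                                // hit the 60-step budget
}

// Toy run: snapshot, click, then complete.
const script: ToolCall[][] = [
  [{ name: "snapshot", args: {} }],
  [{ name: "click", args: { ref: "e12" } }],
  [{ name: "complete_scenario", args: { verdict: "passed" } }],
];
let turn = 0;
const result = runScenario((h: string[]) => script[turn++], (c) => `${c.name} ok`);
console.log(result); // → passed
```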

This is not a pre-planned sequence of actions. The agent decides at each step what to do next based on what it observes. If the page loads slowly, the agent sees a loading spinner in its snapshot and waits. If a modal blocks the intended target, the agent reads the modal content and decides whether to dismiss it or interact with it. If an unexpected error message appears, the agent can read it and adjust its approach.

Error handling during the loop has two layers. At the API level, if the LLM provider returns a rate limit (429) or service error (503), the system retries up to 4 times with exponential backoff starting at 5 seconds. At the agent decision level, if a browser action fails, the error message is augmented with a fresh accessibility tree snapshot (first 2,000 characters) so the LLM can see both what went wrong and what the page currently looks like.

This error context propagation is critical. When a traditional test fails, the error message says something like "element not found: #submit-btn." When an agentic test fails an intermediate step, the agent sees the error plus the current page state, which often reveals that the page navigated somewhere unexpected or a popup appeared. The agent can then course-correct without human intervention.

Watch the agent loop in real time

Assrt records video of every test run with live cursor overlays. See exactly where the agent clicks and what it reads at each step.

Get Started

6. Stability Waiting: Why the Agent Does Not Use sleep()

One of the most common reliability problems in test automation is timing. Traditional tests either use fixed waits (await page.waitForTimeout(3000)) that slow down fast pages and break on slow ones, or they poll for specific selectors (await page.waitForSelector('.loaded')) that require knowing in advance what to wait for.

Assrt takes a different approach with a function called wait_for_stable. Instead of waiting a fixed amount of time or watching for a specific element, it uses a MutationObserver to monitor all DOM changes: child list mutations, subtree changes, and character data updates. It polls every 500 milliseconds and checks whether the mutation count has changed. If the count has not changed for a configurable quiet period (default: 2 seconds), it considers the page stable.

This adapts automatically. A page that finishes rendering in 200ms gets marked stable in about 2.2 seconds. A page that takes 8 seconds to finish loading API data and rendering results will wait the full 8 seconds plus the 2-second quiet window. No developer needs to tune the wait time per page.

The maximum timeout is 60 seconds. If the page is still mutating after a minute, the agent proceeds anyway. This prevents infinite hangs on pages with continuous animations or real-time data feeds.

7. Setting Up Agentic Test Execution with Assrt

Assrt runs locally as an MCP (Model Context Protocol) server. This means it integrates directly with AI coding assistants like Claude Code. There is no separate dashboard to log into, no cloud service to configure, and no monthly subscription.

Install and run

Add the MCP server to your Claude Code configuration, then use the testing tools directly from your terminal or IDE. The primary command to run a test:

assrt_test({ url: "http://localhost:3000", plan: "#Case 1: Login flow\nNavigate to /login\nFill email and password\nClick Sign In\nVerify dashboard loads" })

That is the complete API. Pass a URL and a plan (or a scenario ID from a previous run to re-execute it). The agent handles browser launch, page navigation, element targeting, action execution, stability waiting, and pass/fail verification.

Auto-generate test plans

If you do not know what to test, use assrt_plan to point the agent at a URL and have it generate test cases automatically. It navigates the page, analyzes the accessibility tree, identifies interactive elements, and writes 3-5 test cases in the #Case N: name format.
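A generated plan is plain Markdown in that format. A hypothetical example for a marketing site:

```
#Case 1: Newsletter signup
Navigate to /
Fill the email field with a test address
Click Subscribe
Verify a confirmation message appears

#Case 2: Pricing navigation
Navigate to /
Click the Pricing link in the header
Verify the pricing tiers are visible
```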

Diagnose failures

When a test fails, assrt_diagnose sends the full failure report (steps taken, assertions that failed, error messages) to a diagnostic model. The diagnosis distinguishes between three root causes: a genuine application bug, a flawed test case, or an environment issue. It returns a corrected test case you can re-run immediately.

What you get after each run

Every test run saves its results locally, including the scenario file and the recorded artifacts.

The scenario files are editable. Change the plan in scenario.md, and the changes sync back to cloud storage. The test plan is yours to keep, modify, and run outside of Assrt using standard Playwright.

FAQ

Does the agent use screenshots or computer vision to interact with the page?

No. The primary perception mechanism is accessibility tree snapshots, not screenshots or vision models. The accessibility tree provides structured, labeled data about every interactive element on the page. This is faster, cheaper (no image tokens), and more reliable than visual approaches. Screenshots are taken during test runs for human review and video recording, but the agent does not use them for decision-making.

How does the agent handle single-page applications where content loads asynchronously?

The wait_for_stable function monitors DOM mutations via MutationObserver. It watches for child list changes, subtree modifications, and character data updates. Once mutations stop for 2 seconds, the page is considered stable and the agent proceeds. This adapts automatically to fast and slow page loads without requiring per-page timeout configuration.

What happens if the LLM makes a wrong decision during test execution?

The agent has up to 60 steps per scenario and a defined escalation path: retry with a different approach, scroll to reveal hidden elements, and if still stuck after three attempts, mark the step as failed with evidence. Failed steps include the current page state as context, so you can see exactly what the agent saw when it got confused. The 60-step limit prevents infinite loops from a confused agent.

Can I run the same test scenario multiple times without rewriting it?

Yes. Every test run generates a unique scenario ID (UUID). Pass that ID back to assrt_test instead of a plan, and it re-executes the same scenario. Scenarios are stored locally under ~/.assrt/scenarios/ and synced to cloud storage. The scenario ID acts as a capability URL, so you can share it with teammates without configuring access permissions.

Is there vendor lock-in? Can I use the generated tests outside of Assrt?

No lock-in. Assrt generates standard Playwright code and saves test plans as plain Markdown files. You can copy any generated test directly into your own Playwright test suite and run it with npx playwright test. There is no proprietary format, no YAML DSL, and no required runtime dependency on Assrt to execute saved tests.

How much does AI inference cost per test run?

Assrt uses Claude Haiku for the inner test execution agent, which is one of the cheapest available LLMs. A typical test scenario with 10-20 steps costs fractions of a cent in inference. The tool itself is free and open-source. You bring your own API key or use it through Claude Code, which handles the inference. There is no per-seat, per-run, or monthly platform fee.

Does the agent need to be trained on my specific application?

No training or configuration is required. The agent reads the accessibility tree of your page fresh on every snapshot. It does not maintain a model of your application between runs. Point it at any URL and it works immediately. The tradeoff is that it does not learn patterns from previous runs, but it also never carries stale assumptions about your UI.

Run agentic tests in your terminal. No dashboard required.

Point Assrt at any URL. The agent reads the page, runs your test plan, and produces real Playwright code you own. Free, open-source, self-hosted.

$ npx assrt-mcp