AI-Powered Agentic Test Execution with Tool: What the Agent Actually Calls
Every page about agentic test execution describes the concept the same way: "an AI agent runs your tests." Then it stops. The agent becomes a black box. You are told it "perceives the page," "reasons about actions," and "recovers from errors," but nobody shows you the actual interface between the AI and the browser.
This guide opens that black box. Assrt's inner test agent (a Claude Haiku instance that drives the browser during each assrt_test call) has access to exactly 18 tools, defined in agent.ts lines 16 through 196. Those 18 tools are the complete vocabulary of what the agent can do. Understanding them is the difference between trusting a marketing claim and understanding a system.
“18 tools, open-source, defined in a single TypeScript array. Read the code yourself.”
agent.ts in assrt-mcp
1. Why the Tool Vocabulary Is the Thing That Matters
An LLM without tools is a language model. An LLM with tools is an agent. The tools define the boundary of what the agent can do. If the agent does not have a scroll tool, it cannot scroll. If it does not have an http_request tool, it cannot verify an API response after submitting a form.
This is not a philosophical point. When evaluating any "AI-powered agentic test execution" product, the first question should be: what tools does the agent have? The answer tells you what it can test, what it will miss, and where it will break.
Assrt answers this question in the open. The TOOLS array in agent.ts is a plain TypeScript constant. It is the contract between the Anthropic API (which decides which tools to call) and the Playwright browser session (which executes those calls). Every tool has a name, a description, and a typed input schema. The LLM reads the descriptions to decide when to use each tool. The input schema constrains what parameters it can pass.
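To make the shape of that contract concrete, here is a minimal sketch of what one entry in a TOOLS-style array looks like. This is an illustration, not a copy of assrt's actual definitions: the `ToolDefinition` interface and the `clickTool` constant are hypothetical names, modeled on the Anthropic tool-definition format (name, description, input_schema).

```typescript
// Hypothetical sketch of one entry in a TOOLS-style array.
// Mirrors the Anthropic tool-definition format; agent.ts may differ in detail.
interface ToolDefinition {
  name: string;
  description: string; // the LLM reads this to decide when to call the tool
  input_schema: {
    type: "object";
    properties: Record<string, { type: string; description?: string }>;
    required?: string[];
  };
}

const clickTool: ToolDefinition = {
  name: "click",
  description:
    "Click an element. Prefer the ref from the latest snapshot when available.",
  input_schema: {
    type: "object",
    properties: {
      element: { type: "string", description: "Human-readable element description" },
      ref: { type: "string", description: "Ref ID (eN) from the latest snapshot" },
    },
    required: ["element"], // ref is optional; element is the fallback/documentation
  },
};
```

The description field doubles as prompt engineering: it is the only place the LLM learns when a tool is appropriate, which is why phrases like "ALWAYS call this before interacting with elements" appear verbatim in the definitions.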
2. Five Categories, 18 Tools
The 18 tools fall into five functional groups. Each group serves a distinct role in the test execution loop.
| Category | Tools | Role |
|---|---|---|
| Perception | navigate, snapshot, screenshot, evaluate | See and understand the page |
| Interaction | click, type_text, select_option, scroll, press_key | Manipulate page elements |
| Timing | wait, wait_for_stable | Adapt to async page behavior |
| Intelligence | assert, complete_scenario, suggest_improvement | Record judgments and findings |
| External | http_request, create_temp_email, wait_for_verification_code, check_email_inbox | Reach beyond the browser |
The rest of this guide walks through each category in detail, with the actual tool definitions from the source code.
See these tools in action
Assrt is free and open-source. Run assrt_test from your coding agent and watch the 18 tools execute against your own app.
Get Started →
3. Perception: How the Agent Sees the Page
Before the agent can test anything, it needs to understand what is on the page. Four tools handle perception, and the design choices here are what separate reliable agentic testing from brittle screen-scraping.
navigate
Opens a URL in the browser. This is the starting point of every test scenario. The agent calls navigate with the target URL and Playwright handles page load, redirects, and any initial JavaScript execution.
snapshot
This is the most important perception tool. It returns the accessibility tree of the current page, with each interactive element tagged with a [ref=eN] identifier. The tool description in agent.ts explicitly says: "ALWAYS call this before interacting with elements."
The accessibility tree is a structured representation of the page that assistive technologies use. It includes element roles (button, link, textbox), labels, and states (disabled, checked, expanded). It does not include CSS, layout position, or pixel coordinates. This is deliberate. The agent reasons about what elements are (a "Submit button," an "Email input") rather than where they are on screen.
The ref IDs (e1, e2, e3, ...) are regenerated on every snapshot call. They are stable within a single snapshot but not across snapshots. This forces the agent to call snapshot before every interaction, which means it always acts on the current page state, not a stale representation.
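The regeneration behavior can be sketched in a few lines. This is a hypothetical illustration of the numbering scheme, not assrt's actual snapshot code: the `A11yNode` type and `assignRefs` function are invented names showing why a ref from one snapshot cannot be reused in the next.

```typescript
// Hypothetical sketch: each snapshot call re-walks the tree and hands out
// fresh eN IDs in document order, so refs never survive across snapshots.
interface A11yNode { role: string; label: string }

function assignRefs(nodes: A11yNode[]): Array<A11yNode & { ref: string }> {
  return nodes.map((n, i) => ({ ...n, ref: `e${i + 1}` }));
}

const first = assignRefs([{ role: "button", label: "Submit" }]);
const second = assignRefs([
  { role: "link", label: "Home" },       // a new element appeared
  { role: "button", label: "Submit" },
]);
// The same "Submit" button is e1 in the first snapshot but e2 in the second.
```

If the page gains or loses a single element, every ref after it shifts, which is exactly why acting on a stale ref would be dangerous and why the snapshot-before-interaction rule exists.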
screenshot
Takes a pixel screenshot of the current page. This is the visual complement to snapshot. The agent uses it when it needs to verify something the accessibility tree cannot capture: a color, a layout, an image, or a visual state like a loading spinner. Screenshots are expensive in tokens (they are sent as images in the next API call), so Assrt captures them selectively, after visual actions rather than after every step.
evaluate
Runs arbitrary JavaScript in the browser and returns the result. This is the escape hatch for anything the accessibility tree and screenshots cannot capture: checking localStorage values, reading cookies, inspecting network responses, or querying the DOM directly. The agent can verify that a JWT was stored after login, that an analytics event fired, or that a specific API was called with the right parameters.
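As a concrete example, here is the kind of check an agent might run through evaluate after a login step. This is a hypothetical sketch: in the real tool the expression executes inside the browser via Playwright, so a `Map` stands in for localStorage here, and the key name `auth_token` is invented.

```typescript
// Hypothetical sketch: the kind of verification an agent might run via
// `evaluate` after login. A Map stands in for the browser's localStorage
// so the check is visible end to end outside a browser.
const localStorageStub = new Map<string, string>([
  ["auth_token", "eyJhbGciOiJIUzI1NiJ9.payload.sig"], // fake JWT-shaped value
]);

// A JWT has three dot-separated segments: header.payload.signature.
function jwtWasStored(store: Map<string, string>, key: string): boolean {
  const token = store.get(key);
  return token !== undefined && token.split(".").length === 3;
}

console.log(jwtWasStored(localStorageStub, "auth_token")); // → true
```

In the browser, the equivalent expression would be something like `localStorage.getItem('auth_token')?.split('.').length === 3`, returned directly as the tool result.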
4. Interaction: How the Agent Acts
Five tools let the agent manipulate the page. Each one maps to a user action.
click
Clicks an element. Takes two parameters: element (a human-readable description like "Submit button") and an optional ref (the eN ID from the latest snapshot). The ref is "preferred when available" per the tool description, because it is an exact match. The element description serves as a fallback and as documentation for what the agent intended.
This dual-targeting approach is significant. Traditional test automation uses CSS selectors or XPath. Those break when the DOM changes. Assrt's agent uses semantic references (what the element is) backed by deterministic refs (which exact node to target). Neither approach alone is sufficient; together they are both reliable and readable.
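The dual-targeting resolution can be sketched as follows. This is a hypothetical illustration of the strategy, not assrt's implementation: `resolveTarget` and `SnapshotNode` are invented names, and the real fallback matching is likely more sophisticated than a substring check.

```typescript
// Hypothetical sketch of dual targeting: prefer the exact ref from the
// latest snapshot; fall back to matching the human-readable description.
interface SnapshotNode { ref: string; role: string; label: string }

function resolveTarget(
  snapshot: SnapshotNode[],
  element: string,
  ref?: string,
): SnapshotNode | undefined {
  if (ref) {
    const exact = snapshot.find((n) => n.ref === ref);
    if (exact) return exact; // deterministic match wins
  }
  // Fallback: case-insensitive substring match against the label.
  const needle = element.toLowerCase();
  return snapshot.find((n) => n.label.toLowerCase().includes(needle));
}

const snap: SnapshotNode[] = [
  { ref: "e1", role: "textbox", label: "Email" },
  { ref: "e2", role: "button", label: "Submit order" },
];

console.log(resolveTarget(snap, "Submit button", "e2")?.ref); // → e2
console.log(resolveTarget(snap, "submit")?.ref);              // → e2
```

The point of the design shows up in the failure modes: if the ref is stale, the semantic description can still recover the target; if two elements have similar labels, the ref disambiguates.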
type_text
Types text into an input field. Clears existing content first. Takes element, text, and an optional ref. The "clears first" behavior is important: when retesting after a fix, input fields may still contain values from a previous attempt. The agent does not need to worry about residual state.
select_option, scroll, press_key
select_option handles dropdowns by passing values to a select element. scroll takes pixel offsets (positive y scrolls down, default 400px). press_key sends keyboard events (Enter, Tab, Escape). These three tools cover the remaining user interactions that click and type_text do not handle.
Zero vendor lock-in
The agent, the tools, and the browser are all open-source. Your test plans are Markdown files. Results are JSON. Nothing requires Assrt to keep running.
Get Started →
5. Timing: How the Agent Waits for the Right Moment
Timing is where most test automation breaks. A button click triggers an API call. A form submission loads a new page. A search input populates results after a debounce. Traditional tests handle this with fixed-duration sleeps or fragile "wait for element" selectors. Assrt gives the agent two timing tools that adapt to what the page is actually doing.
wait
The simpler of the two. It can wait for specific text to appear on the page (preferred), or wait a fixed number of milliseconds (capped at 10 seconds). The text-based mode is a content assertion combined with a wait: the agent says "I expect this text to appear" and blocks until it does or times out.
wait_for_stable
The more sophisticated tool. It attaches a MutationObserver to the DOM and monitors for child list mutations, subtree changes, and character data updates. It polls every 500ms and considers the page "stable" when no mutations have occurred for a configurable period (default: 2 seconds). The maximum wait is configurable too (default: 30 seconds).
This tool exists specifically for testing applications with streaming or progressive content. Think of an AI chat interface that streams tokens, a search page that progressively loads results, or a dashboard that hydrates widgets asynchronously. The agent calls wait_for_stable after triggering an async action and proceeds only when the page has stopped changing. No hardcoded timeouts. No guessing how long the server takes.
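The stability logic described above can be sketched as a polling loop. This is a hypothetical, simplified version: the real tool attaches a browser-side MutationObserver, so a mutation-counter callback stands in here to make the loop runnable anywhere, and the function name `waitForStable` is invented.

```typescript
// Hypothetical sketch of wait_for_stable: poll a mutation counter and
// declare the page stable once nothing has changed for `quietMs`.
// (The real tool uses a MutationObserver in the browser.)
async function waitForStable(
  mutationCount: () => number, // stand-in for the observer's mutation tally
  quietMs = 2000,              // default quiet period per the description above
  maxMs = 30000,               // default maximum wait
  pollMs = 500,                // default poll interval
): Promise<boolean> {
  const start = Date.now();
  let last = mutationCount();
  let quietSince = Date.now();
  while (Date.now() - start < maxMs) {
    await new Promise((r) => setTimeout(r, pollMs));
    const now = mutationCount();
    if (now !== last) {
      last = now;
      quietSince = Date.now(); // still mutating: reset the quiet window
    } else if (Date.now() - quietSince >= quietMs) {
      return true; // no mutations for quietMs: the page is stable
    }
  }
  return false; // timed out while the page was still changing
}
```

The return value is the key design choice: a `false` result tells the agent the page never settled, which is itself useful evidence (for example, a spinner that never resolves).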
6. Intelligence: How the Agent Reports Results
Three tools let the agent record its judgments. These are what turn raw browser interaction into structured test results that a calling agent (or a human) can act on.
assert
The core testing primitive. Takes three required fields:description (what is being asserted),passed (boolean), andevidence (the concrete observation that supports the judgment). Every assertion produces a structured record with all three fields. This is not a pass/fail flag; it is a judgment with reasoning. The calling agent (or a human reviewing results) can read the evidence and independently evaluate whether the assertion is correct.
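A sketch of the record shape, with hypothetical field values (the three field names come from the description above; the interface name `AssertionRecord` and the example content are invented):

```typescript
// Hypothetical sketch of the structured record an `assert` call produces.
interface AssertionRecord {
  description: string; // what is being asserted
  passed: boolean;     // the agent's judgment
  evidence: string;    // the concrete observation supporting it
}

const record: AssertionRecord = {
  description: "Order confirmation is shown after checkout",
  passed: true,
  evidence:
    'Snapshot contains heading "Thank you for your order" and an order ID',
};
```

Because evidence is required, a reviewer can audit any assertion without rerunning the test: if the evidence does not actually support the judgment, the record is self-incriminating.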
complete_scenario
Marks a test scenario as finished. Takes a summary and an overall pass/fail boolean. This is the signal that moves the agent from one test case to the next. Without it, the execution loop would run until it hit the 60-step limit (MAX_STEPS_PER_SCENARIO in agent.ts).
suggest_improvement
A tool for reporting bugs or UX issues the agent discovers while testing, even if those issues are not part of the test plan. Takes a title, severity (critical, major, or minor), a description of what is wrong, and a suggestion for how to fix it. This turns the test agent into an opportunistic QA reviewer: it follows the test plan, but if it notices a broken link, a confusing form label, or a missing error message, it reports that too.
7. External: How the Agent Reaches Beyond the Browser
Four tools let the agent interact with systems outside the browser. These exist because real user flows often span multiple services.
http_request
Makes an HTTP request to any URL. Supports GET, POST, PUT, and DELETE with custom headers and body. The tool description in agent.ts says: "Use for verifying webhooks, polling APIs (Telegram, Slack, GitHub), or any external service interaction."
This is the tool that lets the agent verify side effects. After submitting a form, the agent can call the application's API to confirm the record was created. After triggering a notification, it can poll the notification service to verify delivery. Without this tool, the agent could only verify what appears in the browser, not what happened on the backend.
create_temp_email, wait_for_verification_code, check_email_inbox
These three tools work together to test signup and verification flows end to end. The agent calls create_temp_email to get a disposable email address before filling in a signup form (the tool description explicitly says "Use BEFORE filling signup forms"). After submitting the form, it calls wait_for_verification_code, which polls the inbox for up to 60 seconds looking for an OTP or verification link. check_email_inbox provides a direct inbox check for cases where the agent needs to read the full email content, not just a code.
This is the kind of capability that separates a demo from a production test tool. Signup flows with email verification are one of the most common user journeys, and most testing tools cannot test them end to end because they have no way to receive email.
8. How the Tools Compose into a Perception Loop
The 18 tools do not execute in isolation. They form a perception-action loop that runs for up to 60 steps per test scenario. Here is the sequence, from agent.ts:
- The agent receives a test scenario in natural language plus an initial page snapshot.
- The LLM (Claude Haiku by default) reads the scenario and the snapshot, then responds with one or more tool calls.
- Each tool call is executed against the Playwright browser session. The result is packaged as a tool_result content block.
- After visual actions (click, type_text, select_option, scroll, press_key), a screenshot is captured and included in the tool result as an image. This gives the LLM visual feedback on what changed.
- The tool results are appended to the conversation history, and the loop repeats. The LLM sees the cumulative history of what it has done and what the page looked like at each step.
- When the LLM calls complete_scenario, the loop ends for that test case. The agent moves to the next case in the plan.
A sliding window manages context length. The system keeps the initial user message (the test plan) plus the most recent turns, and trims older turns. The trimming algorithm never cuts between a tool_use block and its corresponding tool_result, because breaking that pair would confuse the LLM about what happened.
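The pair-preserving trim can be sketched as follows. This is a hypothetical simplification, not assrt's actual algorithm: `Turn` and `trimHistory` are invented names, and real conversation turns carry full content blocks rather than boolean flags.

```typescript
// Hypothetical sketch of the sliding-window trim: keep the first message
// (the test plan) plus the most recent turns, and never cut between an
// assistant tool_use turn and the user turn carrying its tool_result.
interface Turn {
  role: "user" | "assistant";
  hasToolUse?: boolean;   // assistant turn that issued tool calls
  isToolResult?: boolean; // user turn carrying the matching tool_result
}

function trimHistory(history: Turn[], keepRecent: number): Turn[] {
  if (history.length <= keepRecent + 1) return history; // nothing to trim
  let cut = history.length - keepRecent; // first index kept from the tail
  // If the cut would orphan a tool_result, move it back to keep the
  // tool_use / tool_result pair together.
  while (cut > 1 && history[cut].isToolResult) cut--;
  return [history[0], ...history.slice(cut)];
}
```

The invariant this protects is an API requirement, not a style preference: a tool_result with no preceding tool_use is a malformed conversation, so the window may keep one extra turn rather than split a pair.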
Rate-limit responses (HTTP 429, 503) trigger retries with a backoff delay (up to 4 retries, 5 seconds apart). Fatal errors (malformed tool calls, unrecoverable browser crashes) end the scenario immediately. This distinction matters: a transient API hiccup should not fail a test, but a broken page should.
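That transient-versus-fatal distinction can be sketched as a small retry wrapper. This is a hypothetical illustration of the policy described above, not assrt's code: `withRetries` and `isTransient` are invented names.

```typescript
// Hypothetical sketch of the retry policy: retry only transient failures
// (e.g. HTTP 429/503), up to a fixed number of attempts; rethrow fatal
// errors immediately so a broken page fails fast.
async function withRetries<T>(
  call: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxRetries = 4,
  delayMs = 5000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt >= maxRetries || !isTransient(err)) throw err; // fatal or exhausted
      await new Promise((r) => setTimeout(r, delayMs)); // back off, then retry
    }
  }
}
```

The classifier function is where the judgment lives: a 429 means "slow down and try again," while a malformed tool call means the scenario itself is broken and no amount of waiting will fix it.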
Frequently Asked Questions
Can I add custom tools to the agent?
The TOOLS array in agent.ts is a TypeScript constant. Because Assrt is open-source, you can fork the repo and add your own tools. Each tool needs a name, description (the LLM reads this to decide when to use it), and a Zod-style input schema. The execution handler in the agent loop dispatches based on the tool name.
Why does the agent use accessibility tree snapshots instead of pixel coordinates?
Accessibility trees are semantic: they describe what elements are, not where they are positioned. This makes them resilient to layout changes, responsive breakpoints, and CSS updates. A button labeled "Submit" in the accessibility tree is the same button whether it is at the top of the page or the bottom. Pixel-based targeting breaks when the viewport size changes or when content above the target element shifts position.
How many LLM calls does a single test scenario take?
Each step in the perception loop is one LLM call. A simple test (navigate, fill a form, click submit, verify result) might take 8 to 12 calls. Complex multi-page flows with email verification can take 30 to 40. The maximum is 60 steps per scenario, enforced by MAX_STEPS_PER_SCENARIO in agent.ts. Using Claude Haiku (the default), each step costs fractions of a cent, so a typical test scenario runs for well under a dollar in inference cost.
What happens when the agent encounters an element it cannot identify?
The agent falls back to the evaluate tool. It can run JavaScript to query the DOM directly, inspect element attributes, or check computed styles. If the accessibility tree does not expose a needed element (which can happen with custom web components or canvas-based UIs), the agent can still interact with it through JavaScript evaluation.
How does wait_for_stable differ from a fixed timeout?
A fixed timeout waits the same duration whether the page loads in 200ms or 5 seconds. wait_for_stable uses a MutationObserver to detect when the DOM stops changing. A fast page resolves in under a second. A slow page gets the full 30-second window. The agent never waits longer than necessary and never moves on too early. This makes tests both faster (on fast pages) and more reliable (on slow pages) compared to fixed sleeps.
Is this approach specific to Assrt, or is it how all agentic testing works?
The general pattern (LLM plus tools plus perception loop) is common to agentic systems. What differs between products is the specific tool vocabulary, how the perception loop is implemented, and whether any of this is visible to the user. Most commercial platforms keep their tool definitions proprietary. Assrt's are in a public TypeScript file you can read, modify, and extend.
What LLM providers does the test agent support?
Anthropic (Claude Haiku by default) and Google Gemini. The provider is configurable via the provider parameter. The default model is claude-haiku-4-5-20251001 for Anthropic and gemini-3.1-pro-preview for Gemini. The tool definitions are translated to each provider's native format automatically.
See all 18 tools in action. One command.
Assrt is a free, open-source MCP server. Point your coding agent at a URL, call assrt_test, and watch the perception loop execute against your app.