AI browser automation testing is not one tool. It is eighteen, and you can read every line.
The SERP makes it sound like the whole field is an agent that clicks and types. That is ten of the tools. The other eight are where real end-to-end QA happens: a disposable inbox, OTP polling, a MutationObserver stability primitive, arbitrary HTTP against third-party APIs, and a mid-test bug-filing primitive. All in one 1,087-line file.
The mental model: a plan, a hub, four real worlds
The thing no competitor diagram shows is what the agent actually touches during a run. It is not one puppet arm on a browser. It is four sources fanning into a single hub and four artifacts fanning out. This is why AI browser automation testing feels qualitatively different from a plain Playwright script.
What the agent reads from, what it writes to
The anchor fact: eighteen tools in one array
Open /Users/matthewdi/assrt-mcp/src/core/agent.ts. Go to line 16. You will see an array called TOOLS with exactly eighteen entries. That array is the entire tool surface. There is no hidden cloud API that adds more at runtime. There is no plugin mechanism that loads more at startup. The agent can do what is in that array and nothing else, and you can read the whole thing on a coffee break.
Eight tools that most SERP articles miss
If you read enough "AI browser automation testing in 2026" posts, you come away thinking the field is ten tools: click, type, scroll, and a few variations. These are the eight that actually separate a useful QA agent from a fancy cursor.
create_temp_email
POST https://api.internal.temp-mail.io/api/v3/email/new. The agent gets a real inbox with a real token before it fills the signup form.
wait_for_verification_code
Polls the disposable inbox every 3 seconds for up to 60 seconds. Regex-extracts the 4-to-8-digit code from the first message that matches.
http_request
GET, POST, PUT, DELETE with custom headers. 30-second timeout. Response text is fed back into the agent context so the next step can assert on it.
wait_for_stable
Injects a MutationObserver into document.body, polls window.__assrt_mutations every 500ms, and unblocks only after stable_seconds of zero-mutation silence.
suggest_improvement
title + severity (critical, major, minor) + description + suggestion. Files UX bugs while the test is still running.
evaluate
Arbitrary JavaScript in the page context. The agent uses it for multi-field OTP paste, clipboard reads, and anything the accessibility tree cannot express.
assert
description + passed + evidence. Multiple assertions per scenario. One failed assert marks the whole scenario failed. Nothing else flips that bit.
complete_scenario
summary + passed. The agent calls this when it believes the scenario is done. Missing this is the one way to reach a timeout verdict.
The http_request tool is the one that closes the loop
Most AI browser automation testing frames the browser as the whole world. But the browser is the lie layer. A green checkmark does not mean the webhook fired, the Telegram message arrived, or the Stripe event posted. Your test has to reach out and check. That tool is thirty lines of TypeScript.
“The agent can hit your backend directly, right in the middle of a test, and assert on the response body.”
agent.ts lines 925-955
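The article says the tool is about thirty lines; a minimal sketch of that shape, assuming a fetch-based handler with an AbortController timeout (the type and function names here are illustrative, not Assrt's actual code):

```typescript
// Hypothetical sketch of an http_request-style tool handler.
// HttpRequestInput and runHttpRequest are illustrative names.
type HttpRequestInput = {
  method: "GET" | "POST" | "PUT" | "DELETE";
  url: string;
  headers?: Record<string, string>;
  body?: string;
};

async function runHttpRequest(
  input: HttpRequestInput,
  timeoutMs = 30_000, // the 30-second timeout described above
): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(input.url, {
      method: input.method,
      headers: input.headers,
      body: input.body,
      signal: controller.signal,
    });
    // Status plus body text, so the next agent step can assert on it.
    return `${res.status} ${await res.text()}`;
  } finally {
    clearTimeout(timer);
  }
}
```

The key design point is the return value: the response body goes straight back into the agent context as a tool_result, so the assertion step sees the real payload, not a summary.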
The suggest_improvement tool files bugs while the test is running
Every other AI browser automation testing tool treats "did it work?" and "is the UX broken?" as two separate questions you answer in two separate runs. Assrt does not. The agent has a tool whose only job is to flag issues the scenario did not ask about: a button label that contradicts its action, a modal that traps focus, a loading state that never resolves. Severity is one of three words. You get those bugs in the same report as the pass/fail verdict.
A scenario that uses five of the tools in one pass
Here is the plan most AI browser automation testing tools cannot express without extra plumbing. It is plain markdown. You can paste it straight into scenario.md and run it. Note how case 2 uses http_request to verify the webhook hit Telegram, not just that the UI toast said it did.
Watch it run, top to bottom
This is the output from case 2 above, annotated. The five tool calls that matter are click, type_text, wait_for_stable, http_request, and assert. Every one of them maps to a specific line in agent.ts.
How an AI browser automation test actually executes
Zoom in on a single scenario. This is the agent loop with the scaffolding stripped away. Notice that there is no hidden retry loop, no cloud call that adds features, no closed-source step. Every bit of decision-making is the LLM picking from the eighteen names and the switch statement routing to a real action.
The agent loop enters runScenario
agent.ts:633 kicks off. initialSnapshot and initialScreenshot capture the starting state. The user prompt template at agent.ts:679 bakes in the scenario, a disposable-email hint, and the raw accessibility tree.
The LLM picks a tool by name
Claude Haiku 4.5 (default) or a Gemini model sees all 18 tools as function declarations. It picks one. The switch statement at agent.ts:766 translates the name into a real action against the live browser or a live external service.
The tool runs against real systems, not stubs
create_temp_email talks to temp-mail.io. http_request talks to whatever URL you gave it. wait_for_stable injects real JavaScript into the real page. Nothing is mocked; that is the entire point.
The result feeds back into the prompt
Tool output (HTTP body, screenshot, mutation count, disposable email address) becomes a tool_result message. The agent uses it to decide the next step. This is how an OTP code extracted from one tool ends up typed into a 6-digit input by the next.
Assertions are flat and countable
Every assert call appends to the scenario's assertions array (agent.ts:897). One failed assertion flips scenarioPassed to false. There is no 'soft fail.' When complete_scenario fires, the ScenarioResult is emitted and the next #Case starts on the same browser.
The wait_for_stable sequence, start to finish
Other AI browser automation testing tools use network-idle or fixed timeouts. Assrt injects a real MutationObserver into the page and polls a counter. Here is exactly what flows between the four actors when a scenario says "wait until the page is stable."
(Sequence diagram: wait_for_stable)
Where the SERP is shallow and where Assrt goes deeper
Most top results for ai browser automation testing treat the category as "an agent that clicks and types like a human." That half of the story is true but not interesting. This is the other half.
| Feature | Generic AI browser automation testing articles | Assrt |
|---|---|---|
| The agent can click, type, scroll | Yes | Yes (tools: click, type_text, scroll, press_key) |
| Agent creates a real disposable inbox for signup flows | Usually not mentioned; you pre-provision a fixed test email | create_temp_email at agent.ts:850 hits temp-mail.io per run |
| Agent polls an inbox for OTP codes and extracts them | Out of scope or closed SaaS feature | wait_for_verification_code at agent.ts:858, regex code extraction |
| Agent can hit external APIs to verify side effects | Almost never; mouse and keyboard only | http_request at agent.ts:925, GET/POST with custom headers |
| DOM-stability primitive backed by MutationObserver | Fixed sleeps or network-idle heuristics | wait_for_stable at agent.ts:956, live injection + poll |
| Mid-test UX bug filing with severity | Not a feature | suggest_improvement at agent.ts:914, emits title + severity |
| Entire tool surface readable in one file | Closed source or spread across a SaaS backend | agent.ts lines 16-196 and 753-1010 |
What you can actually do on Monday
Things you can verify today
- You can grep the 18 tool names in one file and know the entire agent surface.
- You can watch the agent call temp-mail.io live on your machine and poll a real inbox.
- You can point http_request at your own backend and verify webhooks without writing a separate integration test.
- You can read the MutationObserver injection in JavaScript inside a TypeScript string inside a testing tool.
- Bug reports come out of the same run that verified the happy path.
- Tests are plain markdown (scenario.md) on your disk. Move off Assrt and you keep them.
Who this is for
Developers who write their own tests. Teams who test flows that span multiple services (auth + email + payment + third-party webhook). Anyone who landed here from a Reddit thread about AI browser agents drifting, flaking, or getting locked behind a $7,500 SaaS monthly bill. The difference is not a bigger model. The difference is which primitives the agent ships with, and whether you can read them. Eighteen tools, one file, 1,087 lines.
Want to see all 18 tools run against your app?
Twenty-minute call, your staging URL, a live #Case plan. We record the run and hand you the scenario.md.
Book a call →

Frequently asked questions
What exactly is the 18-tool tool surface and where does it live?
The tool list is a single Anthropic.Tool[] array at /Users/matthewdi/assrt-mcp/src/core/agent.ts, lines 16 to 196. Eighteen tools: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, wait_for_stable. Ten of those are the mouse-and-keyboard tools you would expect. The other eight are where AI browser automation testing gets interesting: an inbox, OTP polling, arbitrary HTTP, DOM-stability via MutationObserver injection, mid-test bug filing, and the assert/complete bookkeeping. Everything the agent is capable of is enumerable by grepping the word name in one file.
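For readers who have not opened the file, entries in an Anthropic.Tool[] array follow the SDK's standard shape: a name, a description, and a JSON-Schema input_schema. A sketch of what two of the eighteen entries plausibly look like (the description strings and schema details here are paraphrases, not Assrt's actual definitions):

```typescript
// Illustrative shape of TOOLS entries; matches the Anthropic SDK's Tool
// structure (name, description, input_schema), with paraphrased contents.
type Tool = {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, unknown>;
    required?: string[];
  };
};

const TOOLS: Tool[] = [
  {
    name: "http_request",
    description: "Make an HTTP request against an external API and return the response.",
    input_schema: {
      type: "object",
      properties: {
        method: { type: "string", enum: ["GET", "POST", "PUT", "DELETE"] },
        url: { type: "string" },
        headers: { type: "object" },
        body: { type: "string" },
      },
      required: ["method", "url"],
    },
  },
  {
    name: "wait_for_stable",
    description: "Block until the DOM has produced zero mutations for stable_seconds.",
    input_schema: {
      type: "object",
      properties: { stable_seconds: { type: "number" } },
    },
  },
];
```

Because the model only ever sees these declarations, grepping the name fields really is a complete enumeration of what the agent can do.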
Why does an AI browser automation testing tool need an http_request tool at all?
Because most real bugs are not in the DOM. Clicking 'Connect Telegram' in your app might show a green checkmark and still fail to actually deliver a message to the Telegram Bot API. A mouse-and-keyboard-only agent has no way to tell the difference between a correct backend and a lying UI. http_request, defined at agent.ts:925-955, lets the agent GET https://api.telegram.org/bot{token}/getUpdates right after the UI action and assert on the payload. The response body is fed back into the agent context so the next step (the assertion) sees it. This single primitive closes the 'works in staging, breaks in production' loop for any integration where the truth lives in a third-party API: Telegram, Slack, GitHub, Stripe webhooks, your own backend.
How is the disposable email feature different from pre-provisioning a test@myapp.com?
A fixed test@myapp.com means every run of your signup scenario hits the same user record, which means deduplication code paths, email-already-exists errors, and rate limits ruin your test within a week. Assrt's create_temp_email, defined at agent.ts:850 and implemented at /Users/matthewdi/assrt-mcp/src/core/email.ts, calls https://api.internal.temp-mail.io/api/v3/email/new per run and gets a brand-new address plus a token. The next tool, wait_for_verification_code (agent.ts:858), polls that specific inbox every 3 seconds for up to 60. A regex inside DisposableEmail.waitForVerificationCode extracts 4-to-8-digit codes and common 'code: 123456' variants. No Zapier, no mail mock, no test-user cleanup cron. The inbox is born, used once, and forgotten.
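The extract-and-poll behavior described here can be sketched as follows. The regexes and function names are assumptions for illustration, not the actual DisposableEmail internals; only the 3-second/60-second timings and the 4-to-8-digit rule come from the article:

```typescript
// Illustrative OTP extraction: prefer an explicit "code: 123456"-style
// label, then fall back to any bare 4-to-8-digit run.
function extractCode(message: string): string | null {
  const labeled = message.match(/code[:\s]+(\d{4,8})/i);
  if (labeled) return labeled[1];
  const bare = message.match(/\b(\d{4,8})\b/);
  return bare ? bare[1] : null;
}

// Illustrative polling loop; fetchInbox is a stand-in for the
// temp-mail.io inbox call made with the per-run token.
async function waitForVerificationCode(
  fetchInbox: () => Promise<string[]>,
  intervalMs = 3_000,
  timeoutMs = 60_000,
): Promise<string | null> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    for (const msg of await fetchInbox()) {
      const code = extractCode(msg);
      if (code) return code;
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return null; // timed out without a matching message
}
```

The returned code goes back into the agent context as a tool_result, which is how it ends up typed into the OTP input on the next step.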
Does wait_for_stable really inject JavaScript into the page?
Yes, at agent.ts:956-1009. When the agent calls wait_for_stable, Assrt runs this.browser.evaluate with a literal JavaScript string that creates a MutationObserver on document.body with childList, subtree, and characterData set to true. Every mutation increments window.__assrt_mutations. The agent then polls that global every 500ms. If the counter stops changing for the configured stable_seconds (default 2), the wait unblocks. When the scenario ends, the observer is disconnected and the globals are deleted. The whole thing is about fifty lines, and the reason it matters is that AI browser automation testing against modern SPAs has to deal with streaming tokens, optimistic UI, and React re-renders; fixed sleeps either cut off the assistant mid-response or waste eight seconds on a page that finished loading in one. A DOM-stability primitive adapts.
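A sketch of the mechanism just described, written against an abstract evaluate(js) function so it runs without a browser. The __assrt_mutations global and the 500ms/2s defaults come from the article; the injected snippet and the polling logic here are illustrative reconstructions, not the actual agent.ts code:

```typescript
// Illustrative injected snippet: count every DOM mutation into a global.
const INSTALL_OBSERVER = `
  window.__assrt_mutations = 0;
  new MutationObserver(() => { window.__assrt_mutations++; })
    .observe(document.body, { childList: true, subtree: true, characterData: true });
`;

// Illustrative polling loop; evaluate stands in for this.browser.evaluate.
async function waitForStable(
  evaluate: (js: string) => Promise<number>,
  stableSeconds = 2,
  pollMs = 500,
): Promise<void> {
  await evaluate(INSTALL_OBSERVER);
  let last = -1;
  let quietMs = 0;
  while (quietMs < stableSeconds * 1000) {
    await new Promise((r) => setTimeout(r, pollMs));
    const count = await evaluate("window.__assrt_mutations");
    quietMs = count === last ? quietMs + pollMs : 0; // any mutation resets the clock
    last = count;
  }
}
```

The adaptive part is the reset: a streaming response or a React re-render keeps bumping the counter and keeps the wait alive, while a genuinely settled page unblocks in exactly stable_seconds.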
What does suggest_improvement actually emit, and can I see it in CI?
suggest_improvement (agent.ts:914-924) takes title, severity (critical, major, minor), description, and suggestion. When the agent calls it, Assrt emits an improvement_suggestion event over the SSE stream and logs it on the scenario report. In MCP mode, the event shows up inline in Claude Code's output next to the test run. In CLI mode with --json, it lands in the JSON report. It is not a Linear ticket or a GitHub issue out of the box; it is a structured record that the agent noticed something that was not strictly part of the scenario. Teams wire it into whatever bug tracker they already use. The important part is that happy-path verification and UX bug filing come out of the same run; most AI browser automation testing tools treat those as two separate workflows.
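The record itself is small enough to type out. The four fields and the improvement_suggestion event name come from the article; the collection mechanics below are an illustrative stand-in for the SSE stream, showing the shape a bug-tracker hook would consume:

```typescript
// Illustrative suggest_improvement record and event envelope.
type Severity = "critical" | "major" | "minor";
type ImprovementSuggestion = {
  title: string;
  severity: Severity;
  description: string;
  suggestion: string;
};

const events: Array<{ type: string; data: ImprovementSuggestion }> = [];

function emitImprovement(s: ImprovementSuggestion): void {
  // Assrt is described as emitting this over SSE and into the scenario
  // report; here we just collect it so downstream code can read it.
  events.push({ type: "improvement_suggestion", data: s });
}
```

Wiring this into Linear or GitHub Issues is then a matter of mapping one flat object per event, which is why teams can reuse whatever tracker integration they already have.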
I came from a Reddit thread about AI browser agents that drift. Does this one drift?
Drift in AI browser automation testing usually means one of three things: the locator broke because a class name changed, the model hallucinated an element that does not exist, or a fixed sleep did not match the real page timing. Assrt sidesteps all three. Scenarios are written as intent ('Click the Sign up button', not 'click .btn-primary:nth-child(2)'). Before each action, the agent calls snapshot (agent.ts:779) and gets a fresh accessibility tree with ref IDs; a ref that worked two steps ago but is now stale triggers a fallback to text matching. And wait_for_stable adapts to real page timing, so scenarios do not break on a slow CI box. When something does drift, the same agent surfaces it through suggest_improvement and assrt_diagnose (the MCP tool that re-opens the failing URL and proposes a corrected scenario). You get a specific diff, not a generic 'the test flaked.'
How does this compare to Browser Use, Stagehand, Skyvern, or Operator?
Those are browser agents built for general web tasks: booking flights, filling forms, scraping. They are excellent at that. Assrt is a browser agent narrowed to QA testing, which means it ships primitives general agents do not: the assert tool that splits passed scenarios from failed ones, complete_scenario that terminates a test run cleanly, suggest_improvement that files bugs on purpose, the #Case markdown format that a whole team can share via git, scenario continuity across multiple cases in one browser, and a video+screenshot recorder attached to every run. A general browser agent can certainly type into a signup form, but it will not emit a pass/fail verdict, will not poll an inbox on your behalf, will not file a bug report, and will not give you a scenario.md file you can check into your repo. If you want an agent to do end-to-end QA against your own web app, the primitives matter more than the model.
Can the agent run against my existing logged-in Chrome instead of a fresh browser?
Yes. Extension mode (--extension flag, or extension: true in the MCP call) connects Assrt to your running Chrome via @playwright/mcp's --extension transport. All your cookies, localStorage, and 2FA state are already there. First run, the agent prints a token that you paste back; the token saves to ~/.assrt/extension-token and future runs are zero-setup. This is the single most useful switch for testing behind auth: the agent uses your real session, so flows like 'invite a teammate to my workspace' work without provisioning a test account, and flows like 'verify the Stripe webhook fired' can hit your production-mode backend with your real API keys.
Is this actually open source or is there a closed core?
The agent, the MCP server, and the CLI are open source. /Users/matthewdi/assrt-mcp is the MCP server and test runner. /Users/matthewdi/assrt is the web app. The 18-tool agent loop, the wait_for_stable MutationObserver injection, the disposable email client, the HTTP request tool, and the suggest_improvement primitive are all plain TypeScript you can clone, read, and modify. There is an optional cloud service at app.assrt.ai that stores scenario runs and artifacts for sharing, but the test engine does not depend on it; you can run the whole stack against localhost forever without ever touching assrt.ai. Compare that to closed-source AI browser automation testing SaaS that charges $7,500 a month and locks your generated tests inside their dashboard.
What does 'real Playwright code, not proprietary YAML' actually mean in this context?
Under the hood, Assrt uses @playwright/mcp as its browser control layer (agent.ts:3 imports McpBrowserManager, which spawns the Playwright MCP server over stdio). Every click, type, and scroll your AI browser automation test performs is a real Playwright call. The scenario format above those calls is markdown (#Case blocks), not YAML or a proprietary DSL. If you ever need to drop into handwritten Playwright, the scenario's intent translates cleanly: a step like 'Click the Sign up button' becomes page.getByRole('button', { name: 'Sign up' }).click(). Closed-SaaS tools that emit proprietary YAML force you to re-author tests if you ever move off the platform; the Assrt format is readable at the top of this page. No migration cost either way.