Compare AI testing tools by the model doing the work and the finite list of actions it can take.
Feature tables compare AI testing tools by price and bullet count. Both are downstream of the two facts that actually predict what a tool can do: the model in the driver's seat and the exact surface of tools it is allowed to call. Assrt prints both in one file. Eighteen tools between lines 16 and 196 of agent.ts. One default model on line 9. Anthropic billed to your key.
The two disclosures that tell you what a tool can actually do
Every agentic web tester boils down to a loop: a model reads state, picks one tool from a finite list, runs it, reads the new state. The tool's real behavior is a function of which model is in the loop and what the list of tools contains. Everything else is wrapping paper. Any comparison that does not name both is telling you about the wrapping paper.
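That loop is small enough to sketch. Here is a minimal, hypothetical TypeScript version of it; the names (runLoop, Tool, ToolCall) and the deterministic stand-in for the model are illustrative, not Assrt's code, but the shape is the same: anything not in the tool map is something the agent simply cannot do.

```typescript
// A minimal sketch of the agentic loop: the model reads state, picks one
// tool from a finite list, the tool runs, and the result feeds the next
// turn. A real implementation calls an LLM API where `model` is invoked.
type ToolCall = { name: string; args: Record<string, unknown> };
type Tool = (args: Record<string, unknown>, state: string) => string;

function runLoop(
  model: (state: string) => ToolCall | null, // null = scenario complete
  tools: Record<string, Tool>,
  initialState: string,
  maxTurns = 20,
): string {
  let state = initialState;
  for (let turn = 0; turn < maxTurns; turn++) {
    const call = model(state);
    if (call === null) return state; // model decided it is done
    const tool = tools[call.name];
    if (!tool) throw new Error(`unknown tool: ${call.name}`); // off the list = impossible
    state = tool(call.args, state); // tool result becomes the next turn's input
  }
  return state;
}
```

Everything a tool like this can ever do is reachable from that `tools` map, which is why the two disclosures below are the whole story.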
Disclosure 1
What model is in the loop
Model capability shapes everything. A weaker model needs more prompting and still drops context. A stronger model costs more and hits different rate limits. If you do not know the name, you cannot predict the behavior or price the tests.
Disclosure 2
What tools the model can call
The finite list of actions the model is allowed to take. Anything not on the list is a capability the tool flat out does not have, no matter how it is marketed. The list also tells you where the limits are, and whether you can push them yourself.
Disclosure 1: the model, in source
Two string constants, one for Anthropic and one for Google, both overridable at runtime with --model or the ASSRT_MODEL environment variable. The string is passed verbatim into the SDK, so any model ID that supports tool use works. No allowlist, no vendor approval, no consulting call.
Disclosure 2: the tool surface, in source
The shape of a single model call
The inner loop is eight lines. Everything Assrt does against any web application is the sum of repeated invocations of this block, each one feeding tool results from Playwright back into the next turn.
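A hedged reconstruction of that call's shape, using the option names from the public @anthropic-ai/sdk Messages API. The helper buildTurnRequest is illustrative, not Assrt's literal eight lines, which also await the response and feed tool_result blocks back into the history:

```typescript
// Sketch of the request a single turn sends. The option names (`model`,
// `max_tokens`, `tools`, `messages`) follow the public Anthropic
// Messages API; the function itself is an illustration, not agent.ts.
function buildTurnRequest(
  messages: object[],
  tools: object[],
  model = process.env.ASSRT_MODEL ?? "claude-haiku-4-5-20251001",
) {
  return {
    model, // overridable via --model or ASSRT_MODEL
    max_tokens: 4096, // illustrative cap, not Assrt's documented value
    tools, // the finite list of eighteen actions, sent every turn
    messages, // full history, including tool_result blocks from Playwright
  };
}
// In the loop, roughly:
// const res = await this.anthropic.messages.create(buildTurnRequest(history, TOOLS));
```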
How the 18 tools connect to a real browser
The agent is the hub. Scenarios come in as Markdown on the left. The model picks tools, Playwright MCP executes them, and the browser state plus assertion results flow out as a structured report on the right. No intermediate DSL, no vendor runtime.
Scenario → Haiku → 18 tools → browser → report
The eighteen tools, grouped by job
Read in five minutes, memorize in fifteen. Most scenarios use six or seven of these in the same order. The specialty tools (evaluate, http_request, wait_for_stable, the email trio) are what separates a DOM checker from an end to end integration verifier.
Look at the page
snapshot returns an accessibility tree tagged with ref identifiers like [ref=e5]. screenshot captures an image. Both reads are free; the model calls them whenever it needs a fresh view of the DOM.
Touch the page
click, type_text, select_option, scroll, press_key. Every interaction is addressed by the ref from the most recent snapshot. When a ref goes stale the model fetches a new snapshot rather than nursing a selector.
Wait correctly
wait waits for text or a fixed duration. wait_for_stable injects a MutationObserver on document.body and returns when child, subtree, and characterData mutations have been quiet for two seconds. No fixed sleeps that flake at 3am.
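The quiet-window logic can be sketched independently of the browser. In this hypothetical version the injected MutationObserver is abstracted behind a getMutationCount callback so the polling logic is checkable anywhere; the function name and defaults are illustrative, not Assrt's code:

```typescript
// Resolve true once the mutation counter has stopped changing for
// `quietMs`; in the real tool the counter comes from a MutationObserver
// on document.body injected via page.evaluate.
async function waitForStable(
  getMutationCount: () => number,
  quietMs = 2000,
  pollMs = 100,
  timeoutMs = 30_000,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  let last = getMutationCount();
  let quietSince = Date.now();
  while (Date.now() < deadline) {
    await new Promise((r) => setTimeout(r, pollMs));
    const now = getMutationCount();
    if (now !== last) {
      last = now; // DOM still changing: reset the quiet window
      quietSince = Date.now();
    } else if (Date.now() - quietSince >= quietMs) {
      return true; // no mutations for quietMs: the page is stable
    }
  }
  return false; // timed out while the page kept mutating
}
```

The point of the design is that "done rendering" is defined by observed silence, not by a guessed sleep duration.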
Verify the real world
assert records a pass or fail with an evidence string. http_request fires any HTTP call with a 30 second abort timeout so a test can confirm that a webhook actually reached your API, a Telegram message actually arrived, or a Slack post actually landed.
Escape hatch
evaluate runs arbitrary JavaScript in the page context. That one tool is why the agent can paste a six-digit OTP across six single-character inputs with a synthetic ClipboardEvent, or read any computed style, or walk a shadow DOM. When you inevitably hit a DOM that the standard tools can't express, evaluate is there.
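The split-digit OTP trick reduces to distributing one code string across N single-character inputs. This sketch uses plain objects in place of DOM elements so the distribution logic is testable anywhere; in the real evaluate() call each input would also receive a synthetic input or clipboard event so framework change handlers fire. pasteOtp is illustrative, not Assrt's literal snippet:

```typescript
// Spread one code string across N single-character inputs. In the
// browser, `inputs` would be real elements and each assignment would be
// followed by dispatching a synthetic event, e.g.
// el.dispatchEvent(new Event("input", { bubbles: true })).
interface DigitInput {
  value: string;
}

function pasteOtp(code: string, inputs: DigitInput[]): void {
  const digits = code.trim().split("");
  if (digits.length !== inputs.length) {
    throw new Error(`expected ${inputs.length} digits, got ${digits.length}`);
  }
  digits.forEach((digit, i) => {
    inputs[i].value = digit;
  });
}
```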
Email without humans
create_temp_email mints a disposable inbox. wait_for_verification_code polls it for OTP codes. check_email_inbox reads it on demand. Signup and magic-link flows run end to end without a shared test account.
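The polling half of that trio can be sketched as follows, with the disposable-inbox API abstracted behind a fetchMessages callback (its real name and shape are not specified on this page); the six-digit regex and the timeouts are illustrative defaults:

```typescript
// Poll an inbox until a message containing an OTP-looking code arrives,
// or give up at the deadline. Returns the captured code or null.
async function waitForVerificationCode(
  fetchMessages: () => Promise<string[]>,
  pattern = /\b(\d{6})\b/,
  timeoutMs = 60_000,
  pollMs = 2000,
): Promise<string | null> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    for (const body of await fetchMessages()) {
      const m = body.match(pattern);
      if (m) return m[1]; // first six-digit run wins
    }
    await new Promise((r) => setTimeout(r, pollMs));
  }
  return null; // no code arrived in time
}
```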
Report
complete_scenario closes out a case with a pass flag and a summary. suggest_improvement records UX bugs the agent noticed along the way as first-class output, not just commentary.
Why opaque tools can't answer the same two questions
A closed AI testing platform is billed as a single opinionated product. Under the hood it also has a model and a tool list, but printing them would make the product look small. Haiku 4.5 plus eighteen tools does not sound like a twelve hundred dollar a month subscription. So they don't print it. The cost is your ability to reason about what the tool will do tomorrow.
| Feature | Closed AI testing platform | Assrt |
|---|---|---|
| Default model, documented in source | Not disclosed. Marketed as 'proprietary AI engine' or 'best-in-class models'. | agent.ts line 9. claude-haiku-4-5-20251001. Swappable via --model flag. |
| Exact list of actions the AI can take | Not documented. You discover limits by hitting them on a trial. | Eighteen entries in the TOOLS constant, agent.ts lines 16 to 196. Ten minute read. |
| Who pays for inference tokens | Bundled into a per-seat SaaS subscription that starts around $1K/mo. | You pay Anthropic directly with your own API key. Haiku rates, cents per run. |
| Where the tests live | In the vendor's database, accessed through the vendor's UI. | On your disk. /tmp/assrt/scenario.md is editable Markdown watched by fs.watch. |
| Waiting for streaming DOM | Fixed sleeps and auto-retries. Flaky against streaming AI chat output. | wait_for_stable tool, MutationObserver on document.body, two seconds quiet default. |
| Verifying a webhook actually fired | DOM-only. The AI can see the success banner, not the third-party side effect. | http_request tool. The same scenario can hit api.telegram.org to confirm delivery. |
| Signup flow with OTP email verification | Needs a pre-configured test account or a human in the loop. | create_temp_email + wait_for_verification_code + evaluate paste trick. No humans. |
| If the vendor shuts down tomorrow | Your tests disappear with them. | MIT licensed npm package. Tests are Markdown on your disk. Run offline, forever. |
The competitor column describes the shape of a typical closed agentic testing product in this category, not any single vendor. The Assrt column cites line numbers and file paths you can grep for in a fresh clone of the repo.
A three minute audit for any AI testing tool
If you run this checklist against every tool on your shortlist, you will eliminate most of them before lunch. If a tool refuses to answer any of the four questions, the refusal is the answer.
Find the model
In a truly open tool the default model name is a string constant in the source you can grep for. In Assrt it is DEFAULT_ANTHROPIC_MODEL at agent.ts line 9, value claude-haiku-4-5-20251001. In a closed tool the trial dashboard will talk about 'our proprietary AI engine' and dodge the question when you ask what is actually running.
Find the tool list
Every agentic tester has a finite set of actions the model can take. In Assrt they are the TOOLS constant, lines 16 to 196 of agent.ts, eighteen entries. In a closed tool that list is either obfuscated behind an SDK or simply not disclosed. If you cannot enumerate what the AI is allowed to do, you cannot predict what it will fail to do.
Find who pays for inference
Assrt calls this.anthropic.messages.create(...) with a key you bring. The invoice lands at Anthropic, priced at Haiku 4.5 rates, usually a few cents per ten-step scenario. Closed platforms wrap inference into a flat seat price that starts around one thousand dollars a month and climbs into the five figures at enterprise tier, with no line-item for what the model actually cost.
Find the output format
A tool that cannot export its tests into a format your repo accepts is a rental. Assrt writes scenario.md and results/latest.json under /tmp/assrt and the MCP server returns a structured JSON summary. You can grep it, diff it, commit it, and rerun it without the tool.
What the audit looks like on a real Assrt install
Everything below is one npm install assrt-mcp away. No trial signup, no quota, no marketing email. The paths and greps are identical on your Mac and on mine.
The anchor fact
One model name. Eighteen tool names. One file.
The entire capability surface of Assrt as an AI testing tool fits in a single TypeScript file, assrt-mcp/src/core/agent.ts, about one thousand lines long. The model default lives on line 9. The eighteen tool definitions live between lines 16 and 196. The call that sends the tool list to Anthropic lives on line 714. You can read all three in ten minutes, which is less time than a sales call.
Everything else on this page, every metric, every claim, every FAQ answer, is derived from those line numbers. That is what readable AI testing tools look like.
“The model is claude-haiku-4-5-20251001. The 18 tools are navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, and wait_for_stable. You pay Anthropic.”
assrt-mcp/src/core/agent.ts
Want a live audit of your current AI testing tool?
Bring the trial dashboard, we will grep the SDK together and line it up against these four disclosures. Thirty minutes, no pitch.
Book a call →

Frequently asked questions
What makes this comparison of AI testing tools different from other comparison pages?
Most comparison pages sort tools by price, logo count, and bullet lists of features. Those are marketing inputs, not engineering inputs. The two facts that predict what an AI testing tool can and cannot do are the underlying model (because models have different capabilities, cost profiles, and context windows) and the finite tool surface the model is allowed to call (because anything outside that surface is something the tool flat out cannot do). Almost no vendor prints either one. Assrt prints both, in a single file agent.ts, and this page reads them aloud.
What model does Assrt actually use, and can I change it?
The default is Claude Haiku 4.5, declared on line 9 of assrt-mcp/src/core/agent.ts as DEFAULT_ANTHROPIC_MODEL = "claude-haiku-4-5-20251001". Line 10 declares a Gemini fallback at DEFAULT_GEMINI_MODEL = "gemini-3.1-pro-preview". You can override at runtime with the --model flag or the ASSRT_MODEL environment variable. The model string is passed verbatim into this.anthropic.messages.create({ model, ... }) at line 714, so any Anthropic model ID that supports tool use will work. Haiku is the default because it is fast, cheap, and strong enough to drive an accessibility tree through tool calls.
How many tools does the AI have access to, and what are they?
Eighteen, defined as the TOOLS constant between lines 16 and 196 of agent.ts. In order: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, wait_for_stable. That is the entire action surface. There is no dynamic plugin registry, no vendor-private sub-tool, no hidden reasoning action. If you can describe your test in terms of those eighteen calls, the agent can execute it. If you can't, neither can the agent, and the honest answer is to open a pull request.
Who pays for the model tokens?
You do, directly to Anthropic. The Anthropic client is initialized in the agent constructor with process.env.ANTHROPIC_API_KEY. Every test run invoices against that key at Haiku 4.5 rates. A typical ten step scenario lands at a few cents in tokens. Compared to commercial AI testing platforms that start around one thousand dollars a month and climb to seventy five hundred plus at enterprise tier, that is one to three orders of magnitude cheaper, and you get an itemized Anthropic invoice instead of a flat SaaS line.
Why does judging an AI testing tool by its tool surface matter more than judging by features?
Because marketing features are generated from the tool surface, not the other way around. Every bullet a vendor prints about 'auto heals broken selectors' or 'tests OTP flows' is the output of a particular tool doing a particular thing. If you know the tool surface, you can predict which features are real and which are theatrical. If you do not know the tool surface, a plausible sounding feature list can hide enormous gaps. Assrt's assertion that tests survive UI changes is not a feature, it is a consequence of snapshot returning fresh refs on every call. Someone else's 'self healing' claim might be the same consequence, or might be an LLM guessing at a new CSS selector after the old one breaks. Reading the tool surface tells you which.
Does Assrt actually generate Playwright code, or does it run its own engine?
Every tool call in that list of eighteen maps to a real Playwright action. click calls page.getByRole(...).click() against the ref returned from snapshot. type_text calls page.getByRole(...).fill(). evaluate calls page.evaluate(). wait_for_stable runs a MutationObserver via page.evaluate with a polling loop in TypeScript around it. There is no intermediate DSL that compiles to Playwright at runtime. The agent drives Playwright directly through the @playwright/mcp bridge. If you want to see exactly what runs, open src/core/agent.ts and src/core/browser.ts in the assrt-mcp repo; the full path from model tool call to Playwright API is about two files and a few hundred lines.
How do you verify a third-party side effect, like a Slack post landing?
The http_request tool. It accepts a URL, method, headers, and body, runs with a 30 second AbortController timeout, and returns the status and up to 4000 characters of the response body. In a test, the agent signs up, triggers the action that should post to Slack, then fires http_request against the Slack web API or a webhook verification endpoint and makes an assert call against the response. That promotes the test from a DOM-only checker to an end to end integration verifier. Same pattern works for Stripe webhook confirmations, Telegram bot deliveries, and GitHub comment posts.
What does the system prompt that steers the model look like?
It is defined immediately after the TOOLS array, also in agent.ts, about sixty lines. It tells the model to always call snapshot first, to use ref identifiers over text matching, to call snapshot again after any action fails, and to prefer wait_for_stable over fixed sleeps when the page is still rendering. It also contains a few dozen lines of very specific edge case guidance, including a literal JavaScript expression the model pastes verbatim into evaluate() when it encounters split-digit OTP inputs. Because it is in the source, you can read it, disagree with it, and submit changes.
Can I self-host Assrt, or does it have to call a cloud control plane?
The MCP server is a plain Node.js process you run locally. It does not phone home. The only outbound call during a test is to api.anthropic.com (or generativelanguage.googleapis.com if you pick Gemini) to talk to the model. Browser automation runs on your machine through a local Playwright, so the agent can hit localhost URLs your dev server is serving without tunnels, VPNs, or public exposure. Test results and scenarios sit on your disk under /tmp/assrt. There is also an optional cloud sync to Firestore for teams that want a shared dashboard, controlled by an environment flag and off by default.
Is Assrt actually free, and where is the source?
The MCP package is MIT licensed and available at npm install assrt-mcp. The full source, including agent.ts with all the line numbers referenced above, is at github.com/assrt-ai/assrt-mcp. Your only running cost is the Anthropic invoice for Haiku tokens consumed during test runs, which in practice is a rounding error next to any team's existing LLM spend. There is no seat fee, no build quota, no flaky test surcharge, no enterprise tier.