Open-source AI testing tools, April 2026: a four-question filter

Most April 2026 round-ups for this topic give you a list of twelve to thirty framework names and call it a guide. The useful artifact is shorter: four questions you can answer in five minutes against any tool claiming the “open-source AI testing” label, with file-path answers from one MIT-licensed reference so you can verify each check in your own terminal. Read the prompt. Read the artifact. Run it in your real Chrome. Count the agent's tool surface.

Every claim below points at a file and line number in the Assrt MCP reference on GitHub. No vendor key, no procurement form, no per-seat license. Bring a candidate tool and run the same checks side by side.


The framing the listicles miss

For a regular npm package, “open source” is a license file plus a published source tree. For an AI testing tool the bar is higher, because the tool's behaviour is not in the source you read; it is in the system prompt the testing agent reads at runtime, and in the JSON tool schema the agent is bounded by. A repo can be MIT licensed and still ship a closed prompt fetched from a vendor CDN every run. A repo can publish its CLI and still emit an artifact only the closed runtime can execute. So the practical question becomes: what parts of the tool determine the testing behaviour, and can you read all of them?

Below is the filter I use. Four questions, five minutes per candidate. The reference I anchor every answer against is the Assrt MCP repository, because it is the one I built and the one I know answers each question with a committed file. Bring whichever other tool you are evaluating and run the same checks. The point of the filter is to leave you with a comparison table, not a sales pitch.

Candidates worth running through the same filter include Playwright MCP, Codegen, ZeroStep, playwright-ai, Octomind, BrowserStack AI, KaneAI, TestRigor, Magnitude, Browserbase, Stagehand, Operator, Computer Use, Selenium IDE, and Playwright Generator.

Question one: can you read the system prompt the agent reads?

The most consequential string in any AI testing tool is the prompt the agent reads before it touches your app. That string defines the agent's priorities, error-recovery strategy, selector style, and the way it interprets your scenario. If the string lives in your repository as committed text, you can read it, fork it, modify it, and predict the agent's behaviour. If the string lives behind an HTTP endpoint the SDK fetches at runtime, you cannot. The license file on the repo is irrelevant to this distinction.

For Assrt, the SYSTEM_PROMPT is at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 198 to 254. 57 lines. There is also a smaller PLAN_SYSTEM_PROMPT at src/mcp/server.ts lines 219 to 236, used by the auto-generation tool. Both are inline string constants, both committable, both forkable. The excerpt below is the first quarter of the SYSTEM_PROMPT; the full text is short enough to read on a flight.

[Excerpt from src/core/agent.ts: the SYSTEM_PROMPT, lines 198-254. Read it in full in the repository.]

The check itself: open the candidate tool's repo, search for SYSTEM_PROMPT, AGENT_PROMPT, BASE_PROMPT, INSTRUCTIONS. If you find a string constant in src/, the tool passes. If your grep returns nothing and the SDK code shows a fetch to a vendor endpoint named something like /api/agent/instructions, the tool fails. About thirty seconds per tool.
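A minimal form of that grep, assuming a local clone of the candidate repo (the constant names are the ones listed above; adjust for the repo's layout):

```bash
# Pass signal: a committed prompt constant somewhere under src/
grep -rn "SYSTEM_PROMPT\|AGENT_PROMPT\|BASE_PROMPT\|INSTRUCTIONS" src/

# Fail signal: no constant, but the SDK fetches instructions at runtime
grep -rni "fetch\|axios" src/ | grep -i "prompt\|instruction"
```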

Question two: what file lands on disk, and could a teammate execute it by reading?

The second consequential thing is the artifact the tool produces. After a generated test, what file appears on disk? Is it readable to a human who has never seen the tool, and can they execute the steps with eyes alone? An artifact that survives without the runtime is one you can keep when you switch tools. An artifact that requires the closed runtime to interpret is something you rent.

For Assrt, the artifact is one Markdown file at /tmp/assrt/scenario.md. The format is #Case 1: name, then plain English steps, optionally followed by a passCriteria: block. Below is an example in that shape. Hand it to a colleague who has never installed Assrt; they can read it top to bottom and execute the steps in any browser.

/tmp/assrt/scenario.md
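An illustrative file (the specific steps here are mine, not from the repo; the #Case heading, plain-English steps, and passCriteria: block are the committed format):

```markdown
#Case 1: log in and reach the dashboard

Navigate to https://example.com/login
Type test@example.com into the email field
Type the test password into the password field
Click the Login button

passCriteria:
The dashboard heading is visible and the URL contains /dashboard
```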

The check: run one generation in the candidate tool, then ls the output directory and cat the file. If you can mentally execute the contents in five minutes, the tool passes. If the file opens with something like !vendor:ai/v2/scenario wrapped around an opaque DSL, or if the file is a binary blob, the tool fails. Portability shows up at this surface or it does not show up at all.

Question three: can it drive your real Chrome with your real cookies?

Most AI testing tools launch a fresh Chromium with no cookies, no extensions, and no logged-in identity. That is fine for a public landing page. It is useless for anything behind a real login: your Stripe dashboard, your Cal.com booking page, your team's admin console, the half of your product that only matters to authenticated users. Re-creating those auth states in a fresh sandbox means scripted login flows, fixture seeds, OAuth juggling, or shipping production credentials to a vendor sandbox.

The shorter path is to attach to your existing Chrome instance, where the cookies already live. For Assrt, the flag is --extension, wired through to Playwright's CDP-attach mode at /Users/matthewdi/assrt-mcp/src/core/agent.ts line 338. The first run shows a Chrome approval dialog; the resulting token is saved to ~/.assrt/extension-token; every subsequent run skips the dialog. You test the page you are already looking at, signed in as you.

terminal
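The run, sketched (the --extension flag and the token path are from the repo; the CLI entrypoint name here is an assumption):

```bash
# First run: Chrome shows an approval dialog, then the token is cached
npx assrt-mcp --extension    # entrypoint name assumed

# Every later run skips the dialog; the token lives here
cat ~/.assrt/extension-token
```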

The check: search the candidate tool's docs for extension, CDP attach, connect existing browser, or use my session. If the docs mention any of these, the tool passes. If the only browser surface is a vendor-hosted sandbox or a fresh Chromium with no auth path, the tool fails. There is no in-between option.

Question four: is the agent's tool surface fixed or open-ended?

The fourth and final question is the most subtle. There are two architectures for an AI testing agent. The first is a code emitter: the agent writes Playwright TypeScript or JavaScript as its output, and a runner executes it later. The second is a plan executor: the agent calls fixed tools from a JSON schema (a navigate tool, a click tool, a type tool, etc.), and the schema rejects anything not in the list. The architectural difference matters because a code-emitting agent is free to invent page.fillFormFields() or page.clickByLabel(). The output looks plausible until you run it and discover those methods do not exist.

Assrt is a plan executor. The full surface is 18 entries declared at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 16 to 196. The MCP server rejects any call not in the list, so hallucinated Playwright APIs are physically impossible. Below is the shape, with one-line summaries instead of the full schema.

src/core/agent.ts
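The 18 names are from the committed array; the one-line glosses are mine, inferred from the names rather than quoted from the schema:

  • navigate: load a URL
  • snapshot: capture the page snapshot the agent reads before acting
  • click: click an element
  • type_text: type into a field
  • select_option: pick an option from a select
  • scroll: scroll the page
  • press_key: send a key press
  • wait: pause for a duration or condition
  • screenshot: capture pixels
  • evaluate: run a JavaScript expression in the page
  • create_temp_email: provision a throwaway inbox
  • wait_for_verification_code: poll that inbox for a code
  • check_email_inbox: read the inbox contents
  • assert: record a pass/fail check
  • complete_scenario: mark the run finished
  • suggest_improvement: attach a note to the scenario
  • http_request: call an external API
  • wait_for_stable: wait for the page to settle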

The check: open the file that defines the agent's tool schema (TOOLS, AVAILABLE_TOOLS, registerTool calls), count the entries, and look at one to verify the input schema is a real JSON shape. If the count is finite and the file is in src/, the tool passes. If the agent's output is free-form Playwright code (a string of TypeScript the runtime executes blindly), the tool fails this check, because the failure mode of hallucinated APIs is now your problem.

The full agent surface is 18 fixed tool entries at agent.ts lines 16-196, counted on 24 April 2026. The MCP server rejects anything else, which is why a plan executor cannot hallucinate Playwright APIs the way a code emitter can.

How the four checks fit together

The four checks are not independent. They form a loop. The system prompt instructs the agent. The agent calls fixed tools from the bounded surface. The tools drive a real Chrome that you can attach to. The artifact that lands on disk is what you take with you when you change tools. If any one of the four is closed, the loop is closed somewhere a user cannot inspect, and the “open-source AI testing” label is decorative.

The four checks audit four different parts of the same loop

[Diagram: SYSTEM_PROMPT and PLAN_SYSTEM_PROMPT instruct the assrt_test agent; the agent acts through the 18 fixed tools against your real Chrome; the run leaves scenario.md and results/latest.json on disk.]

Run the audit on Assrt itself in 60 seconds

Below is what running the four checks on Assrt's repository looks like, with no installation required beyond git clone. The same shape works against any other tool you are evaluating. If the equivalent commands return nothing in the candidate's repo, the candidate is failing one or more of the four checks.

terminal
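A sketch of that session (repository URL and line ranges are from the reference; expected results are in the comments):

```bash
git clone https://github.com/m13v/assrt-mcp && cd assrt-mcp

# Check 1: the system prompt is committed text
grep -n "SYSTEM_PROMPT" src/core/agent.ts        # expect a hit near line 198

# Check 4: the tool schema is a finite list
sed -n '16,196p' src/core/agent.ts | grep -c "name:"    # expect 18, one per entry

# Check 3: real-Chrome attach is documented
grep -ni "extension" README.md

# Check 2: the artifact is plain Markdown at a known path
grep -rn "scenario.md" src/
```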

Side by side: what answers look like for the two architectures

Below are the six rows of comparison most listicles do not write, because writing them requires reading the candidate tool's repo. The middle column is the answer pattern for a tool that fails one or more checks; the right column is the answer for a tool that passes all four. Read the right column as "what your shortlist should look like" and the middle as "what most marketing pages do not invite you to compare against."

| Feature | SaaS-leaning code emitter (typical pattern) | Open-source plan executor (Assrt reference) |
| --- | --- | --- |
| Where the system prompt lives | Fetched from a closed endpoint at runtime; not in the repo | Committed text in src/core/agent.ts lines 198-254 (57 lines) |
| What lands on disk after a generated test | Vendor-namespaced YAML or JSON DSL the closed runtime alone can execute | /tmp/assrt/scenario.md, plain Markdown #Case blocks a human can read |
| How the tool reaches a logged-in surface | Vendor-hosted sandbox; you ship credentials to a SaaS or write a fixture | --extension attaches to your running Chrome via CDP (agent.ts line 338) |
| How the agent knows which actions exist | Free-text Playwright code the model writes; hallucinated APIs are possible | 18 fixed Anthropic.Tool entries; the MCP server rejects unknown calls |
| What you keep when you switch tools | A folder of vendor YAML that no other runner understands | Markdown plans plus a JSON report; both portable, neither vendor-shaped |
| Cost shape an employer signs off on | Per-seat license, four to five figures per month per team | MIT runtime plus a few cents per scenario in Haiku 4.5 tokens |

Check 1: the system prompt

agent.ts lines 198-254. 57 lines, committed text. Read it before you trust anything else. If a tool does not let you point at this file in its repo, the prompt is fetched from a closed endpoint at runtime.

Check 2: the artifact

/tmp/assrt/scenario.md, plain Markdown #Case blocks. A teammate can execute it by reading. Vendor YAML or JSON DSL fails this check.

Check 3: real Chrome reuse

agent.ts line 338. The --extension flag attaches to your running Chrome via CDP. Token saved to ~/.assrt/extension-token after first approval.

Check 4: bounded tool surface

agent.ts lines 16-196. 18 entries. Anthropic.Tool[] array. The MCP server rejects anything else. Hallucinated Playwright APIs are not possible.

Why this filter exists

An AI testing tool's behaviour is not in the source you read; it is in the prompt the agent reads. A repo can be MIT and still keep the prompt server-side. The filter is what catches that gap.

How to run the filter on any tool, step by step

Five steps. Five minutes per candidate. After the fourth tool you have a comparison table; after the eighth, a decision. The point is not to declare winners; it is to leave you with a row of four answers per tool that you can defend in a conversation.

1

Open the repo and grep for the system prompt

Look for a string constant called SYSTEM_PROMPT, AGENT_PROMPT, BASE_PROMPT, or similar. If you find one in src/, the tool passes check one. If the prompt is fetched from an HTTP endpoint at runtime, the tool fails. For Assrt the file is src/core/agent.ts, lines 198 to 254.

2

Run a generation once and ls the output directory

Look at what the tool wrote to disk. If the artifact is a Markdown file, a JSON file with documented fields, or a TypeScript file you can read, the tool passes check two. If the output is wrapped in a vendor namespace (foobar:ai/v2/scenario, etc.) and depends on a closed runtime to execute, the tool fails. For Assrt the file is /tmp/assrt/scenario.md.
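For Assrt, the equivalent is two commands (the output path is from the repo; your generated contents will differ):

```bash
ls /tmp/assrt/
cat /tmp/assrt/scenario.md
```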

3

Search the docs for 'extension', 'CDP attach', 'connect existing browser'

If the docs include a flag for attaching to your running Chromium with your cookies, the tool passes check three. If the only browser surface is a vendor sandbox, the tool fails because anything behind a real login becomes a fixture-seeding nightmare. For Assrt the flag is --extension, wired at agent.ts line 338.

4

Open the file that defines the agent's tool schema and count entries

Look for a TOOLS array, an AVAILABLE_TOOLS object, or a list of registered MCP tools. If the count is finite and the file is in src/, the tool passes check four. If the agent can write free-form Playwright code as its output (rather than calling fixed tools), the tool fails because hallucinated APIs become your problem. For Assrt the count is 18 at agent.ts lines 16 to 196.

5

Stop, write down four answers, move on to the next tool

Five minutes per tool. The point of the filter is comparison, not endorsement. A tool that fails one check might still be the right purchase for your team; that is now an informed decision instead of a marketing one. Repeat for each candidate, save the table, and choose with eyes open.

  • 18 tools the agent can call (agent.ts:16-196)
  • 57 lines in the SYSTEM_PROMPT (agent.ts:198-254)
  • 18 lines in PLAN_SYSTEM_PROMPT (server.ts:219-236)
  • 5 minutes to audit one tool

Cost shape, the part employers actually ask about

For an employer choosing between “sign up for the AI testing SaaS” and “run the open-source reference”, the cost gap is most of the conversation. Hosted AI testing platforms competitive at this surface still price their offerings in the four to five figures per month per team in April 2026. The local CLI produces the same JSON report and the same .webm recording on a laptop. The math does not require a spreadsheet.
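A back-of-envelope version, assuming a team runs 1,000 scenarios a month (the volume is an assumption; the per-scenario figure is the stack's own): 1,000 × $0.06 is about $60 in tokens, against a SaaS floor of $1,000 per month at the very bottom of the quoted range.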

  • Competitor SaaS: four to five figures per month per team
  • Tokens per scenario (Haiku 4.5): about $0.06
  • Tools the agent can call: 18
  • Time to audit one candidate tool: 5 minutes

Audit checklist you can run today

If you have an hour to evaluate the open-source AI testing tools your team is considering for the second half of 2026, the audit is short and concrete. None of these steps require a vendor account or a credit card.

Run the filter on any tool in five minutes

  • Open the repo. Grep for SYSTEM_PROMPT in src/.
  • If absent, search HTTP code for a /prompts endpoint.
  • Run one generation. ls the output directory.
  • Cat the artifact. Could a teammate execute it by reading?
  • Search README for 'extension', 'CDP', 'attach'.
  • Open the file that defines the tool schema. Count entries.
  • Write four answers in a row. Move on.
  • After 4 tools you have a comparison table; after 8, a decision.

Why this filter exists

A tool's license file does not predict whether you can read its prompt, keep its artifact, attach it to your real browser, or trust its tool surface. Those four properties do, and they cluster: tools that pass one check tend to pass the others, and tools that fail one tend to fail several. The filter is what separates a tool you can leave from a tool you can only rent. After running it on five candidates you will have a one-page table that survives turnover, headcount changes, and budget cycles.

Auditing AI testing tools for your team this quarter?

Walk through the four checks against your shortlist. We will run the audit on Assrt and one of your candidates side by side.

Frequently asked questions

What does 'open source' actually mean for an AI testing tool in April 2026, and why is the bar higher than for a normal library?

For a regular npm package, MIT or Apache on the repository plus a published source tree is enough. For an AI testing tool the bar is higher because the tool's behaviour is not in the source you read; it is in the system prompt the tool sends to its LLM at runtime. A repo can be MIT licensed and still keep the prompt server-side, fetched by the SDK from a closed endpoint. So the practical question becomes: can you read the prompt the testing agent reads before it touches your app? In Assrt's reference, the answer lives in two files. The 57-line SYSTEM_PROMPT is at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 198 to 254. The 18-line PLAN_SYSTEM_PROMPT is at /Users/matthewdi/assrt-mcp/src/mcp/server.ts lines 219 to 236. Both are committed text in a public repo. If you cannot point at the equivalent file in another tool's repo, the tool is open source in name only.

What are the four questions in this filter, in one sentence each?

One: can you read the system prompt the testing agent reads? Two: when the tool produces a test, what file lands on disk and can a human who has never seen the tool execute it by reading? Three: can the tool drive your real Chrome instance with your real cookies, or does it require a sandbox? Four: is the agent's tool surface fixed (a finite JSON schema) or open-ended (free-text Playwright code that can hallucinate APIs)? A 'yes / portable / yes / bounded' tool is genuinely open-source AI testing. A tool that misses any of these is something else, possibly useful, but not the same shape.

How long does it take to run all four checks against one tool?

About five minutes per tool, mostly clicking through GitHub. Question one is two grep commands against the repo. Question two is one ls of the output directory after a generated run. Question three is 'does the docs index page mention extension mode, --extension, or CDP attach?'. Question four is 'open the file that defines the agent's tool schema and count entries.' For Assrt the answers are agent.ts:198-254 (yes), /tmp/assrt/scenario.md plus a Markdown #Case format (portable), --extension flag wired at agent.ts line 338 (yes), and 18 entries at agent.ts:16-196 (bounded). For tools where any of these answers is 'closed', 'vendor YAML', 'sandbox only', or 'free-text codegen', note that and move on.

Why does 'bounded tool surface' matter for an AI testing tool that is supposed to be flexible?

Because flexibility at the tool surface means the agent can hallucinate calls. A code-emitting agent that writes Playwright is free to invent page.fillFormFields or page.clickByLabel; the resulting .spec.ts looks plausible until you run it and discover those methods do not exist. A bounded-surface agent is physically incapable of hallucinating because the MCP server rejects any call not in the schema. Assrt's surface is 18 entries declared at agent.ts:16-196: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, wait_for_stable. Every action the agent takes is one of those 18, and every one is a real Playwright MCP call you can read the implementation of. There is no other call site.

What does 'portable test artifact' actually look like, and why is Markdown the right shape?

A portable artifact is one you can hand to a colleague who has never used the tool, and they can read it top to bottom and execute the steps with eyes alone. Assrt's artifact is one Markdown file with #Case blocks. The format is: '#Case 1: name', then plain English steps. A reader who has never heard of Assrt can read 'Click the Login button. Type test@example.com into the email field. Verify the dashboard appears.' and execute it manually or paste it into another agent. Compare with vendor formats that look like 'browserstack:ai/v1/scenario' wrapped around a JSON DSL. If you stop using the vendor, the JSON is dead. The Markdown file is just instructions; it survives the tool. Assrt writes the file at /tmp/assrt/scenario.md, watches it for edits with fs.watch, and syncs changes back so you can hand-tune the plan in any text editor.

Why does 'real Chrome reuse' matter, and how is it implemented in Assrt's reference?

Most AI testing tools launch a fresh Chromium with no cookies and no extensions. That is fine for testing a public landing page. It is useless for testing anything behind a real login: your Stripe dashboard, your Cal.com booking page, your Slack admin console. Re-creating those auth states in a fresh browser means scripted login flows, fixture seeds, OAuth juggling, or shipping production credentials to a SaaS sandbox. The shorter path is to attach to your existing Chrome instance, where the cookies already live. In Assrt's reference, the --extension flag is wired through to Playwright's CDP attach mode at /Users/matthewdi/assrt-mcp/src/core/agent.ts line 338, with extensionToken persistence at the same path. The first time you use it, Chrome shows an approval dialog; the token is saved to ~/.assrt/extension-token; every subsequent run skips the dialog. You test the same page you are looking at, signed in as you.

Which tools claiming the 'open source AI testing tool' label fail one or more of these checks in April 2026?

Without naming individual products in marketing copy: code-emitting generators that ship a free CLI but a closed agent prompt fail check one. Tools that emit a vendor-specific YAML or JSON DSL fail check two. Tools whose 'browser' is a SaaS-hosted sandbox without a CDP attach option fail check three. Tools whose agent can write arbitrary Playwright code in the output fail check four because the runtime cost of hallucinated APIs falls on you. None of these are bad tools categorically; they are different products. Calling them 'open source AI testing' tools is the misnomer, because the parts you cannot read or relocate are the parts that determine the testing behaviour.

What is the fastest way to run the four-question filter against Assrt right now without installing anything?

Open the repository at github.com/m13v/assrt-mcp in a browser and read four files. src/core/agent.ts lines 198 to 254 is the SYSTEM_PROMPT (check one passes). src/core/agent.ts lines 16 to 196 is the TOOLS array (check four passes; count is 18). README.md mentions --extension and the Playwright extension token (check three passes). src/mcp/server.ts and src/core/scenario-files.ts describe /tmp/assrt/scenario.md as the test artifact (check two passes; the artifact is Markdown). Total time, about three minutes. Compare against any other tool you are evaluating, file by file. The quality of the comparison depends entirely on whether the equivalent files exist in the other tool's public repo.

Cost is part of the conversation in April 2026. What does running this stack actually cost compared with the SaaS competitors?

Two costs: the LLM tokens, and the seat license. Assrt defaults to claude-haiku-4-5-20251001, set at agent.ts line 9. Per scenario, the cost lands in the cents range, dominated by snapshot text and screenshot bytes the agent reads before each click. The seat license is zero, because the runtime is MIT licensed and self-hosted. Comparable hosted AI testing platforms in April 2026 still price competitive offerings at four to five figures per month per team for the equivalent functionality. The arithmetic over a year of one team is not subtle. The reason the vendors can charge that much is not the runtime itself; it is the closed prompt and the lock-in shape that make migration off the vendor expensive.

Does the bounded tool surface mean Assrt cannot do something a code-emitting tool can do?

Yes, and the trade-off is intentional. With 18 tools, the agent cannot, for example, run an arbitrary Playwright fixture composition, install a per-test browser context with custom storage, or invoke a Playwright trace viewer programmatically. Those things still belong in code-emitter territory. The 18 tools cover the common path: navigate, see what is on the page, interact with it, verify outcomes, talk to email, talk to external APIs, mark assertions, complete the run. For the 80 percent of testing that is 'click some buttons, fill some fields, check some text', a bounded surface is faster, cheaper, and harder to break. For the remaining 20 percent, a code emitter is still the right shape. The two architectures coexist; pick the one that matches the test, not the brand.

What about open-source AI testing tools that are not for browsers? CLI tools, mobile, API testing — does the same filter apply?

Yes, with the verbs swapped. Question one (readable prompt) is unchanged. Question two becomes 'what artifact lands on disk and is it portable?' For an API testing tool, that might be a YAML or JSON spec; it is portable if a curl-literate human can run the cases by hand. Question three becomes 'can it run against your real environment, or does it require a vendor sandbox?'. Question four becomes 'is the agent's tool surface fixed?' The answer for a mobile testing tool would point at an Appium adapter; for an API tool, at the HTTP verb set the agent is allowed to call. The principle is identical: the boundary the agent moves inside is the part you must be able to read.

I am building one of these tools. What is the minimum to honestly call it 'open source AI testing'?

Three commits and a license. Commit the system prompt as a string in your repo (not a runtime fetch from your CDN). Commit the tool schema as a typed array in your repo (not generated dynamically from a closed source). Document one example artifact your tool produces and show that a human can read it without your software running. License the runtime under MIT, Apache, or BSD; if your business model is a hosted control plane, the runtime can still be open. Assrt's example: SYSTEM_PROMPT and PLAN_SYSTEM_PROMPT are committed strings; TOOLS is a typed Anthropic.Tool array; the artifact is /tmp/assrt/scenario.md in plain Markdown; the license is MIT. The honesty bar is low; surprisingly few products clear it.
