Automation test tool, the second-output edition

An automation test tool that also reports the bugs you did not ask about.

Every guide on this topic is a table of vendors and a paragraph each. Skip the table. There is one property that actually separates these products: can the runner tell you something you did not ask for? In Assrt, tool number 16 of 18 is called suggest_improvement. Its output lives in a separate improvements[] array in your results file. Same run, second answer.

Matthew Diakonov
10 min read
suggest_improvement: agent.ts line 158
improvements[] aggregated at mcp/server.ts line 663
Pass/fail and bug log are architecturally independent
Markdown plans · MIT license · self-hosted · no dashboard required · JSON results on disk · improvement log, orthogonal to pass/fail · Playwright MCP underneath · BYO model key · one install, npx · agent-driven, not script-driven

One run, two outputs

A script-based automation tool is a regression machine. You hand it a list of assertions, it replays the pre-decided clicks, it reports pass or fail on each assertion. That is a single output channel: binary per scenario, aggregated at the suite level. Nothing about the footer it walked past. Nothing about the modal that did not trap focus. Nothing about the password field that cleared on tab-out. The plan did not ask; the plan did not get.

An agent-driven runner changes the shape of the output because it changes the shape of the runner. The agent is reading the page, reasoning about it, and making tool calls turn by turn. The same reasoning that lets it click the right button lets it notice that the button next to it has a stale label. The architectural question is whether the tool schema gives the agent a way to report that observation without derailing the scenario. In Assrt it does, and the side output is a first-class field in the results file.

What lands in your results file

From a script-based runner, you get a single boolean per scenario. The assertions the plan named, evaluated against the live app. Anything adjacent to the flow that happened to be broken is invisible, because nothing in the spec asked about it.

  • passed: true | false
  • assertions you wrote
  • nothing else

The tool definition that makes it possible

The schema below is lines 157-170 of assrt-mcp/src/core/agent.ts. It is tool number 16 in a TOOLS array of 18 that the agent sees on every turn. Three things to notice. First, severity is a required field with a stated enum (critical, major, minor), which means the model ranks before filing. Second, suggestion is required, not optional, so the agent has to propose a fix, not just flag a smell. Third, there is no mention of the current scenario in the schema, which is what lets this tool be orthogonal to pass/fail.

assrt-mcp/src/core/agent.ts
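The source file is not reproduced here, so the following is a minimal sketch of what a tool definition with the described shape could look like, assuming the usual MCP-style `{ name, description, inputSchema }` convention. The four field names come from the FAQ below; everything else is an assumption, not the verbatim source.

```typescript
// Sketch of a suggest_improvement tool definition (assumed shape, not the
// verbatim source). All four fields are required; severity is a closed enum.
const suggestImprovementTool = {
  name: "suggest_improvement",
  description:
    "Report a bug or improvement you noticed that the plan did not ask about. " +
    "Does not affect the current scenario's pass/fail.",
  inputSchema: {
    type: "object",
    properties: {
      title: { type: "string", description: "Short summary of the issue" },
      severity: { type: "string", enum: ["critical", "major", "minor"] },
      description: { type: "string", description: "What is wrong" },
      suggestion: { type: "string", description: "How to fix it" },
    },
    required: ["title", "severity", "description", "suggestion"],
  },
};
```

Note that nothing in the schema references the current scenario; that absence is the structural detail that keeps the channel orthogonal to pass/fail.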

How a tool call becomes a row in your JSON report

When the agent calls suggest_improvement, the switch case at agent.ts line 914 handles it. Note what is absent: no assignment to scenarioPassed, no call to complete_scenario, no break out of the main loop. The agent keeps going on its current scenario and logs the observation as a side effect. The emit fires an improvement_suggestion event that both the CLI and the MCP server listen for.

assrt-mcp/src/core/agent.ts
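A hedged sketch of the handler's shape, mirroring the behavior described above: the case logs and emits, and deliberately leaves the verdict alone. The names scenarioPassed and improvement_suggestion come from the surrounding text; the rest is assumed for illustration.

```typescript
// Assumed shape of the tool-call dispatch. The real handler lives at
// agent.ts line 914; this sketch only mirrors the behavior described above.
type Improvement = {
  title: string;
  severity: "critical" | "major" | "minor";
  description: string;
  suggestion: string;
};

function makeAgentState() {
  const emitted: { event: string; data: Improvement }[] = [];
  let scenarioPassed = true; // verdict owned by assert / complete_scenario

  function handleToolCall(name: string, args: Improvement) {
    switch (name) {
      case "suggest_improvement":
        // Logged as a side effect: no write to scenarioPassed, no loop break.
        emitted.push({ event: "improvement_suggestion", data: args });
        return { ok: true }; // short tool result; the agent keeps going
      // ...other cases (assert, complete_scenario) do touch the verdict
    }
    return { ok: false };
  }

  return {
    emitted,
    handleToolCall,
    get scenarioPassed() {
      return scenarioPassed;
    },
  };
}
```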

The MCP server aggregates those events into a plain array and drops it into the response at the top level of the summary object. This is the file that ends up at /tmp/assrt/results/latest.json and the same shape the MCP tool returns to Claude Code or Cursor.

assrt-mcp/src/mcp/server.ts
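A sketch of the aggregation step, assuming the server simply collects improvement_suggestion events during the run and attaches the array to the summary. The field names mirror the results-file description later in this piece; the function itself is illustrative.

```typescript
// Assumed aggregation logic around mcp/server.ts line 663: collect the
// improvement events for a run, then attach the array to the summary.
type Improvement = {
  title: string;
  severity: string;
  description: string;
  suggestion: string;
};

function buildSummary(
  url: string,
  scenarios: { name: string; passed: boolean }[],
  improvements: Improvement[],
) {
  return {
    url,
    passedCount: scenarios.filter((s) => s.passed).length,
    failedCount: scenarios.filter((s) => !s.passed).length,
    scenarios,
    improvements, // top-level field, independent of pass/fail
  };
}
```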

One run, three landing spots

  • suggest_improvement → emit() → CLI stdout and improvements[]
  • assert / complete_scenario → scenarios[]

What a real terminal session looks like

Three scenarios, a smoke suite. The plan is a dozen lines of Markdown. The run passes. Along the way, three issues get logged inline. None of them fail the run. All of them end up in improvements[] in the JSON output. This is the whole product pitch compressed into 30 lines of terminal output.

npx @assrt-ai/assrt run --plan-file smoke.md
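The actual transcript depends on your app; what follows is an illustrative sketch of the shape (scenario lines, inline [issue] lines matching the CLI format quoted in the FAQ, and a final summary). The exact wording and counts are assumed, not copied.

```text
✓ Case 1: Visitor can sign up                      (12 turns)
  [issue] minor: Stale copyright year in footer
✓ Case 2: Wrong password shows an error
  [issue] minor: Error toast disappears in ~800ms
✓ Case 3: Logged-in user reaches the dashboard
  [issue] major: Password field clears on tab-out

3 passed, 0 failed · 3 improvements logged
Results: /tmp/assrt/results/latest.json
```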

The plan that produced that run

Twelve lines of plain Markdown. Three #Case blocks. No selectors, no waits, no hard-coded timeouts. The agent binds the actual DOM refs at runtime by reading each page's accessibility tree. The plan is grep-able, git-diff-able, and portable to any machine that has Node and a model API key.

smoke.md
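An example of what such a plan could look like. The app and the flows are hypothetical; only the #Case block convention comes from the source.

```markdown
# Smoke suite (hypothetical app)

#Case 1: Visitor can sign up
Go to /signup, register with a fresh email, expect a welcome screen.

#Case 2: Wrong password shows an error
Try to log in with a bad password, expect an inline error message.

#Case 3: Logged-in user reaches the dashboard
Log in with valid credentials, expect the dashboard to load.
```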

What the results file carries

Top-level: passedCount, failedCount, scenarios[], improvements[]. Same file for both channels. The improvements entries below are what a typical run of a consumer web app surfaces: one stale copyright, one focus-trap issue, one password-field bug. Three things no script-based runner would have mentioned.

/tmp/assrt/results/latest.json
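A hedged sketch of the file's shape, carrying the three example findings described above. The top-level field names come from the FAQ; the concrete values are illustrative.

```json
{
  "url": "http://localhost:3000",
  "passedCount": 3,
  "failedCount": 0,
  "scenarios": [
    { "name": "Case 1: Visitor can sign up", "passed": true },
    { "name": "Case 2: Wrong password shows an error", "passed": true },
    { "name": "Case 3: Logged-in user reaches the dashboard", "passed": true }
  ],
  "improvements": [
    {
      "title": "Stale copyright year in footer",
      "severity": "minor",
      "description": "Footer reads 'Copyright 2024'.",
      "suggestion": "Render the year dynamically."
    },
    {
      "title": "Signup modal does not trap focus",
      "severity": "major",
      "description": "Tab moves focus behind the open dialog.",
      "suggestion": "Add a focus trap while the modal is open."
    },
    {
      "title": "Password field clears on tab-out",
      "severity": "major",
      "description": "Value is lost when focus leaves the field.",
      "suggestion": "Remove the blur handler that resets the input."
    }
  ]
}
```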

The kind of things the agent actually files

Six recurring categories from real runs. None of these are assertions a sensible regression author would hand-write. All of them show up in improvements[] often enough that teams end up with a small weekly triage backlog instead of a single flat pass/fail feed.

Copy and date drift

"Copyright 2024" in a footer in April 2026. "Beta" badge on a product that launched two years ago. A pricing page that still references the old plan name. The plan would never assert on these, but the agent reads the page and notices.

Broken or stale links

A 'Privacy' link in the footer returning 404. A 'Docs' CTA pointing at a deleted subdomain. The agent does not follow every link (that would balloon the run), but it flags ones it saw during the main scenario.

Accessibility smells

A dialog that does not trap focus. An input with no label association. A button with aria-disabled but no visual distinction. These are recognized pattern-matches, not a full axe-core pass, but they catch the obvious regressions that slip past manual review.

Low-severity UX bugs

Tab order that skips the submit button. An error toast readable only for 800ms. A password field that clears on blur. A dropdown that closes on hover instead of click. Things that ship.

Visually disabled but clickable

A CTA styled with opacity-50 that is still clickable and fires the action. A form submit rendered as :disabled CSS without the actual disabled attribute. Only something that reads the live DOM catches this.

Silent console errors

The plan does not ask for console output, but the agent has evaluate() and will sometimes spot a red Uncaught TypeError during a flow and file it under critical. Not a CDP network log replacement, just the obvious ones.

Improvement event routing

  • Tool call — the agent calls suggest_improvement with { title, severity, description, suggestion }.
  • Switch case — agent.ts line 914 handles the call, with no effect on scenarioPassed.
  • Emit event — agent.ts line 919 emits 'improvement_suggestion' on the event stream.
  • CLI + MCP — the CLI prints an [issue] line; the MCP server pushes to improvements[].
  • results.json — /tmp/assrt/results/latest.json carries improvements[] alongside scenarios[].

What a run actually costs

The bug log is not free in tokens, but it is close. Each suggest_improvement call is one tool request and one short tool result; the model drafts four string fields. That is roughly the cost of one extra type_text call per observation. A 10-scenario run with 40 snapshots and 3 improvements stays in the cents-per-run range on Claude Haiku 4.5, the default model at agent.ts line 9.

18 tools in the agent's schema
#16 — suggest_improvement's index
4 required fields per issue
0 impact on the scenario verdict

How an engineering manager should read a run

Three passes over the same results.json. If you treat it as a regression pass, you scan passedCount and failedCount. If you treat it as a bug scan, you skim improvements[] for anything marked critical, then batch the rest into a weekly triage. The third pass is the one that scripted runners cannot give you: cross-referencing an unchanged pass/fail with a growing improvements[] shows you where the app is rotting quietly while the features still work.

1. Read 1: pass/fail — scenarios[] tells you whether the features named in the plan still work. Same signal every regression suite has given you for a decade.

2. Read 2: critical issues — improvements[] where severity === 'critical'. These are auth-broken, checkout-down, prod-public 500s. Promote to tickets immediately.

3. Read 3: the slow leak — count improvements[] over time. A pass/fail that holds flat while the bug log grows is a codebase shipping polish debt. No script-based runner surfaces this curve.
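The three reads can be sketched as one small function over the results file. The input shape mirrors /tmp/assrt/results/latest.json as described elsewhere in this piece; the function itself is an illustration, not part of the tool.

```typescript
// Sketch: the three management reads over a results file. The input shape
// mirrors the latest.json structure described in this piece.
type Results = {
  passedCount: number;
  failedCount: number;
  improvements: { title: string; severity: string }[];
};

function triage(r: Results) {
  return {
    regression: r.failedCount === 0 ? "green" : "red", // read 1: pass/fail
    critical: r.improvements.filter((i) => i.severity === "critical"), // read 2
    polishDebt: r.improvements.length, // read 3: track this number over time
  };
}
```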

How this maps against the usual options

| Feature | Typical runner | Assrt |
| --- | --- | --- |
| Test plan format | DSL, code, or a row in a vendor database | plain Markdown, #Case N: blocks, commit to git |
| Selectors | CSS/XPath/getByTestId hand-written ahead of time | accessibility refs bound at runtime from a live snapshot |
| Pass/fail output | equivalent report file, same semantics | scenarios[] in /tmp/assrt/results/latest.json |
| Unplanned-bug output | none; the runner cannot report what was not asked | improvements[] in the same JSON, orthogonal to pass/fail |
| Runs without a dashboard | varies; many hosted tools require login for results | yes, everything is a local file; cloud sync is optional |
| License | mixed; hosted tools per-seat, some in the thousands/mo | MIT, @assrt-ai/assrt on npm |

Three questions to replace the feature-grid comparison

The next time you evaluate one of these products, these three questions will surface the architectural facts faster than a 15-row checkbox sheet. They are simple, direct, and vendors either answer them cleanly or dodge.

On the next demo call

  • How many artifacts does one test run produce? — If the answer is one, you are looking at a script-based runner. If the answer is two or more (pass/fail plus a bug log, or plus a page-discovery log), you are looking at an agent-driven runner. Both categories call themselves 'automation test tools.' The distinction is not in the tool page, it is in the shape of the output directory.
  • What is the file format of my test plan? — A plain-text format (Markdown, YAML, .spec.ts, even a well-formed Gherkin .feature) that I can edit in any IDE and diff in git is the bar. A proprietary visual format that only loads in the vendor's cloud editor is not. Ask to see the file that lives in the repo after adoption. If the answer is 'there is no file, it is a row in our database,' note that.
  • Does the runner make its own observations, or only the ones I asked for? — An agent-driven runner has the structural capacity to report unplanned findings. A script-based runner can only report on assertions you hand-wrote. Neither is strictly better; they are different products that share a label. If your biggest blind spot is 'small issues we keep meaning to file,' the first category is worth the switch.

Where to go from here

Install is one line: npx @assrt-ai/assrt setup. It registers the MCP server with Claude Code, drops a QA reminder hook into your CLAUDE.md, and reads your Anthropic OAuth token from the macOS Keychain if you already have Claude Code installed (see assrt-mcp/src/core/keychain.ts line 10 for the keychain entry name). First run, take a small existing regression suite, translate it into three #Case blocks, point it at your localhost, and read both arrays in the results file. That reading is the evaluation.

Want to see a real run against your app?

Bring a URL and a flow description. We will build the #Case plan live and read the improvements[] array together on the call.

Frequently asked questions

What makes one automation test tool meaningfully different from another in 2026? The feature tables all look identical.

The feature tables look identical because they compare the same three things: what language your tests are written in, which browsers the runner supports, and how the tool plugs into CI. Those are the 2018 axes. The 2026 axis is the architecture of the runner itself. A script-based runner (Playwright, Cypress, Selenium, and every hosted tool built on top of them) replays pre-decided clicks and asserts pre-decided outcomes. An agent-driven runner (Assrt, a handful of newer tools) reads your plan, reasons about the page in front of it, and makes the next tool call at runtime. The practical consequence shows up in what the runner produces. A script-based runner produces one output: pass or fail. An agent-driven runner can produce a second: a bug log for things it noticed that were not in the plan. That second output exists because the agent is structured enough to emit it (via a dedicated tool call) and un-scripted enough to have opinions. Pick on this axis, and your evaluation will feel different from a spec sheet.

What is the 'second output' exactly? Is it just console noise?

It is a structured, orthogonal record. In Assrt, the agent has a tool named suggest_improvement (defined at assrt-mcp/src/core/agent.ts line 158) with four required fields: title (short), severity (critical, major, or minor), description (what is wrong), suggestion (how to fix it). When the agent calls it during a scenario, the runner emits an improvement_suggestion event (agent.ts line 919). In the MCP response, these events are aggregated into a dedicated improvements[] array that sits alongside the pass/fail scenarios array (mcp/server.ts line 663). In the CLI, each prints as a single line: [issue] major: Stale copyright year in footer (cli.ts line 165). This is not debug output mixed into pass/fail. It is its own channel. A test scenario can pass while carrying four improvement entries; a scenario can fail while carrying none. The two signals are independent by design.

Why would an agent report bugs that were not in the plan? Does it slow down the main test?

The system prompt gives the agent a tool for it, and the agent uses that tool when it sees something obvious. No extra instructions required. A typical trigger: the agent navigates to /checkout as part of a plan about signup, notices a broken 'Terms' link at the bottom of the page, calls suggest_improvement with severity=minor, and keeps going with the signup flow. It does not abandon the scenario to chase the bug. It does not rerun anything. The cost is one additional tool call and a few hundred extra tokens of context, both of which are dwarfed by the snapshot + screenshot pairs every turn already ships. The main scenario's pass/fail is unaffected because suggest_improvement does not touch the scenarioPassed variable (contrast with the assert case at agent.ts line 899, which does). From the agent's perspective, reporting a side-observation is just another tool call, like navigate or click.

What kinds of issues does the agent actually catch that a scripted test would not?

Four categories show up regularly. First, copy and date drift: 'Copyright 2024' in a footer in April 2026. A script would not notice because the scenario never asserted on the year. Second, broken or stale links visible on pages the plan happened to walk through. Third, accessibility smells the model recognizes from its training: a modal that does not trap focus, a form input with no associated label, a click target under 24px. Fourth, small UX bugs: the password field clears when the user tabs to the next input, an error toast that disappears in 800ms and is unreadable, a CTA that looks disabled but is clickable. These are the 'small things you keep meaning to file a ticket for' category. A regression suite ignores them because they are not in the spec. An agent run logs them because it has a tool for it and the question 'does anything look wrong?' is adjacent to the question 'did my plan pass?'.

How is this different from a hosted exploratory-testing tool like Rainforest or Testim autopilot?

Hosted exploratory tools are separate products. You pay per tester or per crawl, you run them on a schedule, results live in their dashboard, and the bugs they find are on a different tab from the regression results. Assrt fuses both. You write one plan, you run it once, and you get two outputs in the same JSON report. The improvements[] array lives in the same file as the scenarios[] array (mcp/server.ts lines 654 to 664). No second product, no second subscription, no reconciliation between two dashboards. The tradeoff is depth: a dedicated exploratory crawler will click through more of your app than a targeted test plan ever will. If you need both a deep crawl and a regression pass, you run both. If you want 80% of the bug-finding value as a side effect of your existing regression runs, this architecture gives it to you for the cost of the extra tokens.

Where does the improvements log actually live on disk so I can wire it into Linear or Jira?

Every run writes /tmp/assrt/results/latest.json (plus /tmp/assrt/results/<runId>.json for history), defined in scenario-files.ts lines 18 to 20. The improvements array is a top-level field in that JSON: { url, passedCount, failedCount, scenarios: [...], improvements: [...] }. Each improvement is a plain object with title, severity, description, suggestion. From there, a five-line shell pipe can open Linear tickets: jq '.improvements[]' latest.json | while read -r row; do ... done. No API integration, no webhook, no proprietary output format. The file is grep-able, diff-able, and committable. For CI, assrt run --json prints the same structure to stdout so you do not even need the /tmp file. Wire it wherever you already ship test results.
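In Node terms, the same wiring could look like the sketch below: read the file, map each improvement to a ticket payload. The ticket shape is hypothetical and would be adapted to Linear's or Jira's API; only the file path and the field names come from the text.

```typescript
// Sketch: turn improvements[] from latest.json into ticket payloads.
// The ticket shape here is hypothetical; adapt it to your tracker's API.
type Improvement = {
  title: string;
  severity: string;
  description: string;
  suggestion: string;
};

function toTickets(json: string) {
  const { improvements } = JSON.parse(json) as { improvements: Improvement[] };
  return improvements.map((i) => ({
    title: `[assrt/${i.severity}] ${i.title}`,
    body: `${i.description}\n\nSuggested fix: ${i.suggestion}`,
  }));
}

// Usage, assuming a run has already written the results file:
//   import { readFileSync } from "node:fs";
//   const tickets = toTickets(readFileSync("/tmp/assrt/results/latest.json", "utf8"));
```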

Is this tool open source? What am I actually installing, and what happens if Assrt the company vanishes?

The CLI is @assrt-ai/assrt on npm. Install with npx @assrt-ai/assrt setup, MIT license. The source is readable at github.com/assrt-ai. What lives in your repo after adopting it: plain Markdown plans (any text editor, any VCS), a package.json dependency on @assrt-ai/assrt, and nothing else Assrt-specific. The agent that drives the browser is Anthropic's Claude Haiku 4.5 by default (agent.ts line 9) or Google Gemini, with the provider and model configurable per run. The browser layer is @playwright/mcp, which is Microsoft's. If Assrt shuts down tomorrow, the CLI continues to work on the version you installed, your Markdown plans continue to be readable Markdown, and the only loss is the optional cloud sync for history. The agent's reasoning layer is an API key you own against a model provider you choose, not a subscription to Assrt.

How do I evaluate an automation test tool using this frame? What do I actually ask on the vendor call?

Three questions. First: 'When I run one test, how many distinct artifacts do I get?' A script-based runner answers one (the report). An agent-driven runner should answer at least two (the report plus a bug log, possibly a third if it also discovers new pages). Second: 'What format is my test plan in, and can I commit it to git as a text file?' Markdown, YAML, and .spec.ts all qualify; a proprietary format that only exists inside their cloud editor does not. Third: 'What happens to my plans and my results if I cancel my account?' An acceptable answer is 'you downloaded them to disk during the trial and they still run.' An unacceptable answer is 'we export to a schema no other tool reads.' This frame replaces the usual 15-feature checkbox matrix with three architectural questions that surface whether the tool's shape matches how you actually want to work.

Does the bug log ever generate false positives, and how noisy is it in practice?

Yes, and it is calibrated low by design. The system prompt instructs the agent to report 'obvious' bugs, and suggest_improvement requires a severity field that forces the model to rank. In practice, on a 10-scenario run, the improvements[] array tends to land between 0 and 5 entries. A critical entry is rare (the agent reserves it for things like checkout failing or auth being broken on a public page). Minor is the usual severity, and minor entries are triaged in batch, not as tickets. False positives happen, mostly around 'this copy is confusing' style UX opinions; a .assrtignore pattern or a severity filter at the CLI level (grep) resolves them. The alternative is silence, which is what a scripted runner gives you: the bugs exist, the run just does not mention them.

Can I disable the bug log if my team does not want it?

You can ignore it: it never fails a scenario, and if no one reads improvements[] it sits in latest.json unread. You can also prompt it off: add 'Do not call suggest_improvement; focus only on the plan' to your scenario text, and the agent will comply. The current architecture does not expose a CLI flag to remove the tool definition from the schema (agent.ts line 158) because the tool is cheap when unused, but there is no cost to leaving it on. Most teams that adopt Assrt end up keeping the log and skimming it weekly, the same way you might skim Sentry's low-signal issues tab. It becomes a small backlog of polish items attached to your main regression suite.
