QA + Automation, As It Actually Looks In 2026

QA and automation is no longer a separate team. It is 17 tools an MCP server hands the coding agent between writing the feature and opening the PR.

Most guides on this topic describe record-and-replay scripting, list a dozen Selenium clones, and stop at “saves time vs manual.” The vocabulary they use assumes a human authoring step and a separate human fixing step. Both of those steps collapsed into one conversation when QA moved into the coding agent loop. This page walks the new surface file by file: 17 callable browser tools, three MCP entrypoints, one disposable inbox that lets the test agent sign itself up.

Matthew Diakonov
12 min read
4.9 rating from coding agents calling QA tools per scenario, not per release.
17 callable browser tools defined in one TOOLS array (agent.ts:16-196), not a proprietary recorder.
Three MCP entrypoints (assrt_test, assrt_plan, assrt_diagnose) close the write/run/diagnose loop.
create_temp_email + wait_for_verification_code lets the test agent sign itself up. No fixture data.

The shape that broke when coding agents got good

The old shape of QA and automation was three jobs in three tools: a developer wrote a feature in an IDE, a QA engineer recorded a click-path in a Selenium-style recorder or Cypress-style spec, and a CI runner replayed those scripts on every commit. The handoff between the three was email, dashboards, and screenshots taped to Jira tickets. None of the three could touch the others' tools without context-switching into a different surface.

When the developer became a coding agent, that shape stopped working in two places at once. The agent could write the feature in seconds but could not verify its own change without a human running the test. The recorder script the QA engineer had built last quarter was already brittle, because the agent had renamed three CSS classes between commits. The dashboard full of screenshots was useless to the agent, because the agent could not parse a PNG into a pass/fail decision.

The new shape is one tool in one conversation: an MCP server that exposes a tiny set of testing primitives the coding agent can call between writing the feature and pushing the commit. Three entrypoints, 17 browser actions under them, one structured TestReport coming back. That is the entire surface, and it fits in a single file.

One full QA round trip, no human

1. Coding agent → Assrt MCP: assrt_test(url, plan)
2. Assrt MCP → test agent + browser: spawn with the 17 tools
3. Test agent → disposable inbox: create_temp_email POST returns burn-xyz@temp-mail.io
4. Test agent → browser: type the email, click Continue
5. Test agent → disposable inbox: wait_for_verification_code (poll) returns OTP 384029
6. Test agent → browser: type the OTP, click Verify, assert
7. Test agent → Assrt MCP → coding agent: TestReport JSON with passed=true, screenshots[], video

The 17-tool surface, defined in one file

The interesting thing about the test-agent surface is how small it is. There are 17 tools total, declared in one TOOLS array. Twelve of them you would guess from any Playwright tutorial. Five of them are the difference between “a browser the agent can drive” and “a QA engineer the agent can hire by the second.”

assrt-mcp/src/core/agent.ts (TOOLS, lines 16-196, MIT)

The five non-obvious tools are the ones that make the loop close. create_temp_email lets the agent sign itself up. wait_for_verification_code lets it pass an OTP gate without a fixture. wait_for_stable kills async-timing flakiness using a MutationObserver. http_request lets it verify side effects in external systems (a Slack message posted, a Telegram bot updated, a webhook fired). suggest_improvement lets it file a UX bug it stumbled across while running an unrelated scenario.

The autonomous-signup loop, in one scenario file

The single most common reason a QA suite calls in a human is the email-OTP gate on signup. Either the test account quota runs out, or a real inbox is required because the SMTP server rejects fixture addresses, or the OTP code is one-shot and cannot be reused across runs. The standard fix is a paid service like Mailosaur with an SDK and a per-month budget.

The MCP-driven approach replaces that with two tool calls and a free public inbox. The scenario file describes the user flow in English; the agent fills in the inbox plumbing on its own.

tests/signup.txt
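A plausible version of that file, written in the #Case format this page describes. This is an illustrative example, not the shipped sample; note how it names the user-visible flow and leaves the inbox plumbing to the agent:

```
#Case 1: New user signs up with email verification
Go to the signup page.
Sign up with a brand-new email address.
Wait for the verification code to arrive and enter it.
Verify that the dashboard greets the new user.
```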

The plumbing under those two tool calls is 24 lines of TypeScript. There is no proprietary inbox SDK and no test-account management layer. The disposable inbox itself comes from the public temp-mail.io API. The agent treats the inbox the way it would treat any other browser-side resource: spin one up, use it, throw it away.

assrt-mcp/src/core/email.ts (lines 33-56, MIT)
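The file itself is not reproduced here, but its shape is easy to sketch from the description above. The following is an illustrative reconstruction, not the shipped code: the endpoint URL and the email/token response fields match the temp-mail.io call described elsewhere on this page, while the function name, error handling, and injectable fetch are assumptions made for this example.

```typescript
// Illustrative sketch of a disposable-inbox wrapper (not the shipped
// email.ts). Endpoint and { email, token } response shape follow the
// temp-mail.io call described in this article; names are assumed.
type TempInbox = { email: string; token: string };

async function createTempEmail(
  fetchFn: typeof fetch = fetch, // injectable so the logic is testable offline
): Promise<TempInbox> {
  const res = await fetchFn(
    "https://api.internal.temp-mail.io/api/v3/email/new",
    { method: "POST", headers: { "Content-Type": "application/json" } },
  );
  if (!res.ok) throw new Error(`temp-mail.io returned HTTP ${res.status}`);
  // The inbox is a throwaway resource: no account, no auth, no cleanup.
  const { email, token } = (await res.json()) as TempInbox;
  return { email, token };
}
```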

What flows where in the closed loop

The diagram below is the entire architecture. There is one inbound side (your repo, your app, your LLM key), one hub (the Assrt MCP server, running on your laptop), and one outbound side (a structured report, a video file, and a Playwright spec you can run without the tool). Nothing in the middle is a vendor service. The healing decisions, the test report, and the recording all live on disk in paths the agent told you about.

Inputs, the QA hub, outputs

Inputs: tests/*.txt scenario files, your app under test, your Anthropic API key.
Hub: Assrt MCP (3 entrypoints).
Outputs: TestReport JSON, recording.webm, real Playwright code.

How the loop runs in practice

A real run looks unremarkable, which is the point. The coding agent calls one MCP tool, the test agent does its job, and a JSON blob comes back. The interesting line is the one where the test agent creates its own inbox, types its own address, and reads its own OTP. No human authored a fixture for any of that.

npx @assrt-ai/assrt run --plan-file tests/signup.txt

What the new shape lets you stop doing

The most useful way to read the 17-tool list is as a list of chores it removes. Each tool below is one thing the QA team used to maintain, schedule, or pay a vendor for. Here they are in plain English.

No more recorded scripts

Scenarios are plain text with #Case N: headers. The agent reads a fresh accessibility tree on every interaction, so a renamed CSS class never breaks a scenario.

No more test-email fixtures

create_temp_email + wait_for_verification_code replace the rotating pool of test inboxes most teams maintain. The inbox is real; the OTP is real; the cost is zero.

No more brittle waits

wait_for_stable injects a MutationObserver and waits for DOM quiet. Async timing flakes (streaming AI responses, lazy loads) stop showing up as red runs.
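The polling half of that wait is simple enough to sketch. This is a minimal version of the idea, with the mutation counter injected as a callback so it runs outside a browser; in the real tool the counter comes from a MutationObserver injected into the page, and the function name and defaults below are assumptions, not the shipped implementation.

```typescript
// Sketch of the wait_for_stable idea: poll a mutation counter every
// pollMs until it has been unchanged for stableSeconds, or give up at
// timeoutMs. getMutationCount stands in for the page-side observer.
async function waitForStable(
  getMutationCount: () => number,
  stableSeconds = 2,
  pollMs = 500,
  timeoutMs = 30_000,
): Promise<boolean> {
  const needed = Math.ceil((stableSeconds * 1000) / pollMs);
  let last = getMutationCount();
  let quietPolls = 0;
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    await new Promise((r) => setTimeout(r, pollMs));
    const now = getMutationCount();
    quietPolls = now === last ? quietPolls + 1 : 0;
    last = now;
    if (quietPolls >= needed) return true; // DOM quiet long enough
  }
  return false; // timed out while the page was still mutating
}
```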

No more dashboard-only reports

Every run writes /tmp/assrt/results/latest.json. The shape is small and stable. CI gates with one jq expression. No vendor login required to read the file.
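A gate in that spirit, assuming the { "passed": ... } report shape described in the FAQ below. The sample report written here is fabricated for illustration; in CI you would point jq at /tmp/assrt/results/latest.json directly.

```shell
# Write a sample report, then gate on it exactly as CI would gate on
# /tmp/assrt/results/latest.json (sample data is illustrative).
cat > /tmp/assrt-sample-report.json <<'EOF'
{ "passed": true, "passedCount": 5, "failedCount": 0 }
EOF

# jq -e exits 0 when .passed is true, non-zero otherwise.
if jq -e '.passed' /tmp/assrt-sample-report.json > /dev/null; then
  echo "QA gate: pass"
else
  echo "QA gate: fail" >&2
  exit 1
fi
```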

No more proprietary scenario format

#Case N: is a single regex. The plain text file lives in your repo. Any other tool can read it. Any model can extend it. There is nothing to migrate off.
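To make "a single regex" concrete, here is a sketch of how little parsing the format needs. The header pattern matches the #Case N: convention; the function and field names are made up for this example and are not the tool's actual parser.

```typescript
// Minimal parser for the #Case format: one "#Case N: title" header per
// scenario, followed by free-text steps until the next header.
type Scenario = { id: number; title: string; body: string };

function parseScenarios(text: string): Scenario[] {
  const header = /^#Case (\d+):\s*(.*)$/; // the single regex
  const scenarios: Scenario[] = [];
  let current: Scenario | null = null;
  for (const line of text.split("\n")) {
    const m = header.exec(line);
    if (m) {
      current = { id: Number(m[1]), title: m[2], body: "" };
      scenarios.push(current);
    } else if (current) {
      current.body += line + "\n"; // step text belongs to the open case
    }
  }
  return scenarios;
}
```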

No more vendor recording dashboard

Video recording runs locally and the player opens at 127.0.0.1. Recordings are .webm files on your disk that play offline forever. No upload step, no streaming server.

The four steps to wire QA + automation into a coding agent

The actual onboarding is short because the surface is small. You install one MCP server, the coding agent picks up three tools, and the loop closes the next time the agent ships a feature.

1. Install the MCP server once

npx @assrt-ai/assrt setup. This registers the server globally with Claude Code (or Cursor, or any MCP-aware client), installs a QA reminder hook, and updates your CLAUDE.md so the agent knows when to call assrt_test on its own.

2. Hand the agent a starting URL

After the agent finishes a feature, ask it to call assrt_plan against the local dev server URL. It will navigate, scroll, take three screenshots at different positions, and write 5-8 #Case scenarios into /tmp/assrt/scenario.md.

3. Let it run the test loop

The agent calls assrt_test with the plan and the URL. The test agent spins up Chromium via Playwright MCP, runs the 17 tools, records a video, and emits a TestReport. The coding agent reads the report and decides whether to commit or self-correct.

4. On a failure, call diagnose

If a scenario failed, the agent calls assrt_diagnose with the URL, the failing #Case, and the evidence. The MCP server returns a root-cause analysis and a corrected scenario in the same #Case format. The agent applies the fix and re-runs.

The mental model

QA and automation in 2026 is a 17-tool surface an MCP server hands the coding agent, not a separate team running a separate suite on a separate schedule.

The work that used to be authored by a recorder, scheduled by a runner, and watched by a dashboard now happens in one conversation. The artifacts (#Case file, TestReport JSON, recording.webm, real Playwright code) live on your disk. The vendor surface is one MIT-licensed npx command.

The numbers that make the loop concrete

Four numbers are enough to compare this against the old QA + automation stack: the size of the tool surface, the size of the entrypoint surface the coding agent sees, the cost per scenario, and the polling budget for the inbox loop.

17 callable browser tools in TOOLS (agent.ts)
3 MCP entrypoints the coding agent sees
$0 license cost for the agent itself (MIT)
60s default poll budget for the OTP inbox

The OSI-licensed pieces underneath the loop

The reason the loop closes without a vendor in the middle is that every layer is OSI licensed. The disposable inbox is the only third-party API call, and even that one has no authentication. Every other moving part is on your laptop.

- assrt MCP server, MIT
- 17-tool TOOLS array, MIT (180 lines)
- DisposableEmail wrapper, MIT (24 lines)
- wait_for_stable observer, MIT
- Playwright, Apache-2.0
- @playwright/mcp, Apache-2.0
- Model Context Protocol SDK, MIT
- temp-mail.io public API, free
- jq for CI gating, MIT

When a recorded suite is still the better answer

The closed-loop model is not free. Every test interaction spends an LLM round trip, which makes it slower per scenario than a baked Playwright spec running purely in Node. For a suite of 50 scenarios that runs nightly, the cost is fine. For a suite of 10,000 scenarios that runs on every commit, the cost stacks up and a pre-compiled spec is faster and cheaper per run.

The escape hatch is built into the model: the test agent emits real Playwright code on demand. You can run the MCP-driven loop while you are authoring and debugging scenarios, then export the stabilised ones to a regular Playwright spec file and run them headless in CI without the agent. The closed loop is for the iterative half. The compiled spec is for the steady-state half.

Want to watch the 17-tool loop run on your own app?

20 minutes, a screenshare, and we will run a real signup scenario against your staging URL with a fresh disposable inbox. Bring a URL and a flow that usually needs a human.

Book a call

Frequently asked questions

What does "QA and automation" actually mean in 2026, after coding agents got good?

It means three different things stitched into one workflow. (1) Authoring: the agent writes scenarios in plain English, no Java, no Selenium IDE recordings. (2) Execution: a separate test-runner agent loads the scenarios, drives a real Chromium via Playwright MCP, and reports pass/fail. (3) Closure: the result feeds back to the coding agent so it can self-correct before opening the PR. The Assrt MCP server exposes only three entrypoints (assrt_test, assrt_plan, assrt_diagnose) precisely because that is the smallest surface that closes the loop.

Why do most QA-and-automation guides feel out of date?

Because they were written when automation meant recording a click-path in a desktop tool and replaying it. The vocabulary (script, recorder, suite, regression run) assumes a human-in-the-loop authoring step and a separate human-in-the-loop fixing step. With an MCP-driven test agent, both steps collapse into the same conversation. The agent that wrote the feature also wrote the test plan, ran it, read the report, and either fixed itself or asked for help. The old vocabulary cannot describe that loop in a single sentence.

How many tools does the test agent actually have, and which ones are non-obvious?

Seventeen, defined in the TOOLS array at assrt-mcp/src/core/agent.ts:16-196. The expected ones are navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, assert, complete_scenario. The non-obvious ones are create_temp_email (creates a real disposable inbox), wait_for_verification_code (polls that inbox for an OTP), wait_for_stable (injects a MutationObserver and waits for DOM quiet), http_request (calls external APIs like Telegram or Slack to verify the side effect of a UI action), and suggest_improvement (the agent files a UX bug as a side product of running the scenario).

Does "automation" still mean writing scripts that need maintenance?

No, and that is the easiest part to verify. A scenario file in this model is plain text with #Case N: headers. There are no selectors, no XPath, no waits, no page-object classes. The agent reads a fresh accessibility tree from Playwright MCP on every interaction, so a renamed CSS class or a restructured DOM does not break a scenario. The maintenance load shifts from "keep the selectors current" to "keep the user-visible language correct," which is much closer to writing a help-doc than writing code.

How does an autonomous signup actually work end to end?

Three tool calls. (1) The agent calls create_temp_email, which POSTs to https://api.internal.temp-mail.io/api/v3/email/new and returns a fresh inbox plus a token (see assrt-mcp/src/core/email.ts:33-56). (2) The agent types that address into the signup form and submits. (3) The agent calls wait_for_verification_code, which polls the same inbox for up to 60 seconds, extracts the OTP, and returns it as a string. The agent then types the code in. There is no human, no test fixture, no Mailosaur SDK, no recorded video of a QA engineer doing this once. The whole signup is a side effect of running #Case 1.

What about flaky tests caused by async timing, not selectors?

Selector healing is half the story. The other half lives in wait_for_stable (agent.ts:956-1009). It injects a MutationObserver into the page, counts DOM mutations every 500ms, and blocks until the mutation count has been steady for stable_seconds (default 2). This handles the cases that selector healing cannot: streaming AI responses, lazy-loaded sections, async search results populating, image lightboxes finishing their fade-in. It is one tool, called once after a triggering action, and it removes most of the timing flakiness in a real-world run.

How is this different from Playwright MCP alone?

Playwright MCP gives a coding agent browser primitives (navigate, click, type). It does not give it a QA engineer. The agent has to invent its own pass/fail criteria, structure its own scenarios, manage its own video recording, and parse its own diagnosis when a test fails. Assrt wraps Playwright MCP with three opinionated MCP tools: assrt_test runs a #Case file and emits a structured TestReport, assrt_plan auto-generates scenarios from a URL, assrt_diagnose explains a failure and writes a corrected #Case. The browser tools are the same, but the testing discipline is added on top.

Is this only for greenfield apps or can it audit a legacy codebase?

Either. The scenario format is intentionally URL-first, not framework-first. assrt_plan navigates to any URL (your staging app, your competitor's site, a 15-year-old internal tool), takes three screenshots at different scroll positions, sends the accessibility snapshots to an LLM with the PLAN_SYSTEM_PROMPT (server.ts:219-236), and emits 5-8 #Case scenarios. None of that depends on what is behind the URL. A WordPress site and a Next.js 16 app generate the same shape of plan.

What does the test result look like programmatically? Is it structured enough to gate CI?

Every run writes /tmp/assrt/results/latest.json. The shape is small: passed (boolean), passedCount, failedCount, duration, scenarios[] (each with name, passed, summary, assertions[]), screenshots[] (file paths, not base64), videoFile, and improvements[] (UX bugs the agent noticed in passing). You can pipe it into jq in a CI script: assrt run --json | jq -e '.passed' returns 0 if every scenario passed, non-zero otherwise. Non-blocking mode is supported by running the CLI with run_in_background and reading the resultsFile path when the background job completes.

What happens to the existing QA team in this model?

Their role moves up the stack. The boring half (write the click sequence, fix the broken selector, re-record the flow after the redesign) is now an LLM call that costs roughly one to five cents per scenario at Haiku prices. The interesting half (decide what is worth testing, model risk, write pass criteria that catch real failure modes) is what remains. The Assrt CLI even exposes this directly with the passCriteria parameter (server.ts:343), which forces the agent to verify a list of explicit conditions before marking the scenario passed.

Where do the scenarios and results actually live? On a vendor server?

On your laptop by default. Scenario text writes to /tmp/assrt/scenario.md, results write to /tmp/assrt/results/latest.json, video records to /tmp/assrt/<runId>/video/recording.webm, and the persistent video player runs on 127.0.0.1 (server.ts:118-215). Optionally, runs sync to app.assrt.ai for sharing, but the source of truth is the disk file. There is no proprietary scenario format; the file is a plain text list of #Case N: headers any human can read and any other tool can parse with a single regex.