Testing for AI, not by AI

Every article ranking for the query "testing for AI" assumes a human is driving and an AI is helping. That framing is upside down in 2026. Coding agents write most of the diffs now. The testing tool has to serve the coding agent directly, not a human who orchestrates one. Assrt is built that way from the first file.

This guide walks through the three Model Context Protocol tools that Claude Code, Cursor, and any other MCP client get when they connect to Assrt; the 57-line instruction block that tells the agent when and how to use each one; and the single scenario file on disk that keeps the loop honest.

Matthew Diakonov
11 min read
3 closed-loop MCP tools, registered at server.ts lines 335, 768, 866
57-line SERVER_INSTRUCTIONS contract every MCP client reads on boot
Scenarios live at /tmp/assrt/scenario.md so the agent's normal Read/Edit tools work
line 1 of the contract

Proactively use Assrt after any user-facing change. Do not wait for the user to ask for testing.

SERVER_INSTRUCTIONS, server.ts line 280

The inversion that makes this a category

Every page on the first SERP for "testing for AI" covers AI-assisted QA tooling for human teams: record-and-playback helpers, autonomous test generators, self-healing selector maintainers, visual regression bots. In those products, the human is the agent. AI is the autocomplete.

"Testing for AI" is the inverse. The coding agent is the user. The human is an optional reviewer at the end. That changes every design decision downstream. The testing tool needs to be addressable from inside a tool-use loop, not a web dashboard. Its test format needs to be agent-editable Markdown, not a visual graph. Its failure mode needs to produce something the agent can reason about, not a Slack message to a human.

Assrt ships the three things a coding agent needs to close its own QA loop: an MCP server with a small number of well-named tools, a single on-disk scenario file the agent can edit with the tools it already has, and a boot-time instruction block that tells the agent when to call which tool.

The three tools and one file

assrt_plan(url)

Navigates to any URL, takes three scroll-position screenshots, inspects the accessibility tree, and returns a test plan in #Case N: Markdown. Registered at server.ts line 768. The forbidden-things list in its system prompt rules out CSS, responsive-layout, and JS-inspection tests; only observable user flows get generated. Output lands at /tmp/assrt/scenario.md.
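The #Case format is plain Markdown. A hypothetical plan for a signup page might look like this (the case titles and steps are illustrative, not actual assrt_plan output):

```markdown
#Case 1: A new user can complete signup
- Open the signup page
- Fill in name, email, and password
- Click the Submit button
- Expect a confirmation message to appear

#Case 2: Invalid email shows a validation error
- Open the signup page
- Enter "not-an-email" in the email field
- Click the Submit button
- Expect an inline error near the email field
```

Each case stays within the 3-5 step rule, and every step is an observable user action or expectation, which is exactly what the forbidden-things list enforces.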

assrt_test(url, plan)

The primary tool. Takes a URL plus either a plan string or a scenarioId UUID; executes every #Case in a real Chromium via @playwright/mcp. Returns passedCount, failedCount, per-assertion evidence, screenshots, and a videoPlayerUrl. Registered at line 335, with 18 optional Zod fields covering viewport, extension mode, variables, pass criteria, and tags.
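From the agent's side, the interesting part is the structured result. A sketch in TypeScript of the decision the agent makes with it; passedCount, failedCount, and videoPlayerUrl are named above, while the evidence shape here is an illustrative assumption:

```typescript
// Sketch of the structured result an agent gets back from assrt_test.
// The evidence entry shape is assumed for illustration.
interface AssrtTestResult {
  passedCount: number;
  failedCount: number;
  evidence: { caseTitle: string; passed: boolean; detail: string }[];
  videoPlayerUrl: string;
}

// The decision the agent makes after every run: done, or diagnose?
function nextAction(result: AssrtTestResult): "commit" | "diagnose" {
  return result.failedCount === 0 ? "commit" : "diagnose";
}

const run: AssrtTestResult = {
  passedCount: 2,
  failedCount: 1,
  evidence: [
    { caseTitle: "Case 1", passed: true, detail: "confirmation visible" },
    { caseTitle: "Case 2", passed: true, detail: "error shown" },
    { caseTitle: "Case 3", passed: false, detail: "OTP field never enabled" },
  ],
  videoPlayerUrl: "http://localhost:8081/runs/latest",
};

console.log(nextAction(run)); // "diagnose"
```

Because the response is plain structured data rather than a dashboard, this branch is one line of agent reasoning instead of a human reading a report.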

assrt_diagnose(url, scenario, error)

Takes a failed scenario and an error, returns root-cause analysis plus a corrected #Case. Registered at line 866. This is the tool that closes the loop: the coding agent calls it, pastes the corrected case into scenario.md, and re-runs. No human reads the failure before the retry.

assrt_analyze_video(question)

Optional fourth tool at line 930, gated on a Gemini key. Takes a plain-English question ('did the modal flash open then immediately close?') and runs a Gemini vision pass over the most recent run's video. Useful when a run passed but the UX looks wrong to the agent; otherwise skip it.

File system, not an API

The agent's memory of the test suite is a single file at /tmp/assrt/scenario.md. It already knows Read, Edit, Write, and Bash. No new API surface, no DSL manual, no schema reference. The cloud sync to app.assrt.ai is a mirror, not a gate.

The contract the server hands every agent on boot

Model Context Protocol lets a server pass a single instructions string to any client that connects. Assrt uses that string to pin down the agent's behavior: when to test, where the artifacts live, what the escape hatches are. Because the string ships with the server, every connected client (Claude Code, Cursor, Devin, Windsurf, any other MCP-aware agent) sees the same contract on the same day. The agent does not need product training. The server teaches it.

This is the anchor fact of the whole page: open /Users/matthewdi/assrt-mcp/src/mcp/server.ts, jump to line 276, and read the 57 lines. Nothing else on the "testing for AI" SERP has an equivalent.

server.ts: the SERVER_INSTRUCTIONS string, line 276

Inputs into the agent, decisions back out

Any MCP client on the left can drive the same three tools. The server talks to Chromium, disposable email, and external APIs on the agent's behalf, and returns structured results the agent can reason about. There is no closed cloud runner in between.

MCP clients in, structured test outcomes out

Claude Code · Cursor · Devin / Windsurf → assrt-mcp server → Chromium · Disposable inbox · External APIs · scenario.md on disk

Why the scenario file lives on disk

An AI coding agent already has four file-system tools: Read, Edit, Write, Bash. A testing tool that wants the agent to collaborate on its test plan should expose that plan as a file, not an API. Assrt does. After every assrt_test run, the server writes /tmp/assrt/scenario.md with the current plan and /tmp/assrt/results/latest.json with the last run's pass/fail evidence.

The agent edits the file the same way it edits any other file in the repo. Changes auto-sync to app.assrt.ai for humans who want a browsable dashboard; the disk is the source of truth. If the cloud disappears tomorrow, the test suite is still on your laptop, in plain Markdown, under version control if you chose to commit it. The full layout is defined in /Users/matthewdi/assrt-mcp/src/core/scenario-files.ts.
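Because the suite is just a file, "add a test case" is an ordinary file edit. A minimal self-contained sketch, using a temp directory as a stand-in for /tmp/assrt and invented case text:

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Stand-in for /tmp/assrt/scenario.md so the sketch is self-contained.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "assrt-demo-"));
const scenarioPath = path.join(dir, "scenario.md");

// What assrt_plan might have written (illustrative, not real output).
fs.writeFileSync(
  scenarioPath,
  "#Case 1: A new user can complete signup\n- Open the signup page\n- Submit the form\n- Expect a confirmation\n"
);

// The agent appends a case the same way it edits any other file.
fs.appendFileSync(
  scenarioPath,
  "\n#Case 2: Invalid email shows a validation error\n- Enter a malformed email\n- Submit the form\n- Expect an inline error\n"
);

const caseCount = fs.readFileSync(scenarioPath, "utf8").match(/^#Case /gm)!.length;
console.log(caseCount); // 2
```

The next assrt_test call re-reads the file, so the appended case runs with no further API ceremony.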

/tmp/assrt/scenario.md

One closed loop, written as the agent sees it

The coding agent never leaves its own tool-use loop. It calls assrt_test, gets a failure with evidence, calls assrt_diagnose, applies the suggested edit to either its code or its scenario file, and calls assrt_test again. There is no human in this sequence and no separate review surface.
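The loop can be sketched with stubbed tool calls; the stubs below stand in for the real MCP tools, and the failure text is invented:

```typescript
// Minimal sketch of the agent's retry loop. assrtTest and assrtDiagnose
// are stubs standing in for the real MCP calls.
type RunResult = { passedCount: number; failedCount: number; error?: string };

let planFixed = false;
let appliedCorrection = "";

function assrtTest(): RunResult {
  return planFixed
    ? { passedCount: 3, failedCount: 0 }
    : { passedCount: 2, failedCount: 1, error: "Case 3: Submit button never enabled" };
}

function assrtDiagnose(error: string): string {
  // The real tool returns a root cause plus a corrected #Case.
  return `#Case 3 (corrected, for: ${error}): wait for the Submit button to enable before clicking`;
}

function applyCorrection(correctedCase: string): void {
  // Real agent: Edit /tmp/assrt/scenario.md (or the app source).
  appliedCorrection = correctedCase;
  planFixed = true;
}

let result = assrtTest();
let retries = 0;
while (result.failedCount > 0 && retries < 3) {
  applyCorrection(assrtDiagnose(result.error!));
  result = assrtTest();
  retries++;
}
console.log(result.failedCount); // 0
```

The retry cap is a sketch-level safety valve; the point is that the exit condition is data the agent already holds, not a human sign-off.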

what the agent's tool-call log actually looks like

Who talks to whom during one closed loop

Coding agent → assrt-mcp: assrt_plan(url)
assrt-mcp → Coding agent: #Case 1..3 Markdown + scenarioId UUID
Coding agent → assrt-mcp: assrt_test(scenarioId)
assrt-mcp → Playwright / Chromium: @playwright/mcp launch + tool loop (18 tool calls, accessibility-tree refs)
assrt-mcp → Coding agent: passedCount: 2, failedCount: 1, evidence[]
Coding agent → assrt-mcp: assrt_diagnose(scenario, error)
assrt-mcp → Coding agent: rootCause + correctedCase (as #Case text)
Coding agent → scenario.md: Edit /tmp/assrt/scenario.md (file patched)
Coding agent → assrt-mcp: assrt_test(scenarioId) again
assrt-mcp → Coding agent: passedCount: 3, failedCount: 0

What Claude Code's log looks like during one change

A single feature edit: the agent proactively plans, runs, fails once, diagnoses, edits the plan, and reruns. Total wall-clock time under 25 seconds on a normal signup page.

claude-code session, mcp.assrt=enabled

What a coding agent gets from this

3 closed-loop MCP tools exposed to the agent
57 lines in the SERVER_INSTRUCTIONS contract
1,056 lines of TypeScript in /assrt-mcp/src/mcp/server.ts
18 low-level browser tools the agent composes on demand

The numbers above are constants in the source, not marketing benchmarks. server.ts is 1,056 lines. The three tool registrations are at lines 335, 768, and 866. The SERVER_INSTRUCTIONS block starts at line 276 and runs 57 lines. The agent loop under the hood at assrt-mcp/src/core/agent.ts exposes 18 low-level browser tools the coding agent never has to touch, because the three MCP tools compose them already.

The step the coding agent still has to take

Inside Claude Code, Cursor, or any other MCP client, you add one line to the MCP config to point at the assrt-mcp binary (npx @assrt-ai/assrt mcp), restart the client, and the three tools appear in the agent's tool menu. The 57 lines of SERVER_INSTRUCTIONS load on the first message. From there, the agent knows when to test. You do not have to prompt it each time.
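Assuming a Claude Code-style mcpServers config file, that one line sits in an entry like this (the server key name is your choice; the command is the one quoted above):

```json
{
  "mcpServers": {
    "assrt": {
      "command": "npx",
      "args": ["@assrt-ai/assrt", "mcp"]
    }
  }
}
```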

The only thing a human still has to do: point at a dev server URL once. The agent handles the rest.

One scenario, start to finish, from the agent's seat

Launch to exit, with no human input

1. User finishes a prompt in Claude Code

The agent completes an Edit that touches a form. SERVER_INSTRUCTIONS trigger #1 fires in its reasoning.

2. assrt_plan on the changed page

If no scenario exists yet, the agent calls assrt_plan(url). The server launches Chromium at 1600x900, captures three scroll screenshots, and returns 2-4 #Case blocks saved to /tmp/assrt/scenario.md.

3. assrt_test against the local dev server

The agent runs every #Case in real Playwright. The call blocks until the run finishes, then returns passedCount, failedCount, per-assertion evidence, screenshots, and a video_player_url.

4. assrt_diagnose on the one failure

For each failed case the agent calls assrt_diagnose(url, scenario, error). The server replies with a root cause, an expected-behavior analysis, and a corrected #Case in plain Markdown.

5. The agent chooses: fix the app or fix the test

If the diagnosis blames the app, the agent Edits the application source. If it blames the test, the agent Edits /tmp/assrt/scenario.md. Both are ordinary files on disk; both edits are normal Edit tool calls.

6. assrt_test again with the same scenarioId

The scenarioId from step 2 is stable across edits. The agent re-runs the same UUID and inherits stored variables, pass criteria, and tags. The loop exits when passedCount == total and failedCount == 0.

Testing for AI vs testing with AI

| Feature | Classical AI-assisted QA (testing with AI) | Assrt (testing for AI) |
| --- | --- | --- |
| Primary user | Human QA engineer with AI autocomplete | AI coding agent (Claude Code, Cursor, Devin) with no human in the loop |
| Surface the agent sees | A web dashboard with record / playback buttons | Three MCP tool calls (assrt_plan, assrt_test, assrt_diagnose) |
| Where tests live | Closed cloud database with proprietary YAML or visual graph | Plain Markdown #Case blocks at /tmp/assrt/scenario.md on your disk |
| How the agent iterates after a failure | Opens a chat window, pastes the error, asks a human | Calls assrt_diagnose, edits scenario.md, reruns assrt_test |
| Instruction format for the agent | Marketing docs, Zoom training, sales call | 57-line SERVER_INSTRUCTIONS block at server.ts line 276 |
| Browser runtime | Proprietary fork of Chromium in a closed cloud | Real Playwright via @playwright/mcp, your machine or your CI |
| Cost at seat parity | Enterprise SaaS at $7,500 / month | Open source, your own Anthropic or Gemini key |
| Migration cost if you leave | Export to CSV, rewrite every test | The tests are already Markdown files on your disk |

Every MCP client on this list drives Assrt the same way

One MCP config line, three tools, one 57-line instruction block. No per-client integration work.

Claude Code · Cursor · Devin · Windsurf · Cline · Zed · Codex · Goose · Roo Code · Aider · Continue · any MCP client

Where this leaves a team moving to coding-agent workflows

If your engineers spend most of their day watching a coding agent write diffs, the bottleneck is no longer code authorship. It is verification. A testing tool that requires a human to translate between the agent's world and its own is slower than no testing tool at all. A testing tool that the agent can drive directly, whose test plans it can read, edit, and rerun with its own file tools, collapses that gap.

"Testing for AI" in that sense is a narrow, specific design category: testing built so the coding agent is the customer and the human is optional. Assrt is the first tool that ships in this shape. If you want to see the full transcript of a Claude Code session that plans, runs, diagnoses, and repairs an OTP-gated signup without a human step, book 30 minutes and we'll run it live against your staging environment.

Want Claude Code to run end-to-end browser tests on your stack this week?

30 minutes. We'll wire assrt-mcp into your coding agent of choice, watch it plan and run a real test, and leave you the scenario.md file to keep.

Book a call

Frequently asked questions about testing for AI

What does 'testing for AI' actually mean in the Assrt context?

It inverts the usual framing. Mainstream AI testing content ('AI testing', 'AI-powered testing') covers tools that use AI to assist a human QA engineer. 'Testing for AI' is the inverse: tools designed so an AI coding agent (Claude Code, Cursor, Devin, Windsurf) can plan, run, and repair end-to-end browser tests with no human step in the middle. Assrt is built this way. It ships as a Model Context Protocol server with three tools (assrt_plan, assrt_test, assrt_diagnose) registered at lines 335, 768, and 866 of /Users/matthewdi/assrt-mcp/src/mcp/server.ts. An optional fourth tool, assrt_analyze_video, is registered at line 930. The full server file is 1,056 lines. Every connected agent receives a boot-time SERVER_INSTRUCTIONS string that tells it when to call each tool and where the artifacts live on disk.

What is inside SERVER_INSTRUCTIONS and why does it matter?

It is a 57-line briefing at server.ts line 276 that every MCP client reads the first time it connects. The first sentence is direct: 'Proactively use Assrt after any user-facing change. Do not wait for the user to ask for testing.' It names the three trigger conditions (after a feature or bug fix that touches UI, before committing, when a test fails), tells the agent exactly where the scenario plan and results live (/tmp/assrt/scenario.md, /tmp/assrt/results/latest.json, /tmp/assrt/scenario.json), describes the #Case N: format with the 3-5 step rule, and gives the agent an explicit escape hatch to the CLI (npx assrt run ... --run-in-background via the Bash tool) when it wants non-blocking test runs. This is not marketing copy. It is a contract that makes the agent's behavior deterministic across Claude, Gemini, and any other MCP client.

Why ship a scenario file on disk at /tmp/assrt/scenario.md instead of keeping it in memory?

Because an AI coding agent already has a file system tool. It does not have a 'call our REST API with this JSON schema' tool unless we add one, and adding one means the agent has to reason about a second API surface. By writing the scenario to /tmp/assrt/scenario.md after every assrt_test run, Assrt reduces the agent's universe to tools it already uses: Read, Edit, Write, Bash. The agent can open the file, diff it, add a case, remove a case, and save. When the agent calls assrt_test again, server.ts re-reads the file and picks up the edits. The cloud sync to app.assrt.ai is a mirror of whatever is on disk, not a replacement. The canonical scenario files are defined in /Users/matthewdi/assrt-mcp/src/core/scenario-files.ts and include scenario.md (the plan), scenario.json (the metadata), and results/latest.json (the last run's pass/fail summary with evidence).

What does the closed loop look like when Claude Code drives it?

Four steps. (1) Claude finishes an edit to a React component that affects a form. It sees the guidance in SERVER_INSTRUCTIONS and calls assrt_plan(url) to get a #Case N: plan generated from the current live page. (2) It calls assrt_test(url, plan) to run the plan in a real Chromium browser via @playwright/mcp. The MCP call blocks the conversation until the run finishes and returns passedCount, failedCount, assertions, screenshots, and a videoPlayerUrl. (3) If a case fails, Claude calls assrt_diagnose(url, scenario, error) to get a root-cause analysis and a corrected #Case. (4) Claude edits /tmp/assrt/scenario.md with the corrected case, or edits the application code if the diagnosis says the bug is in the app, then calls assrt_test again. No human in the loop. The reason this works is that every step returns structured text the agent can reason about directly, and every artifact persists at a stable file path the agent can re-read.

How is this different from having Claude Code write Playwright tests directly?

Three things. First, the tests Assrt produces are a Markdown DSL (#Case N: blocks with 3-5 plain-English steps), not TypeScript. The agent does not have to maintain selector strings, imports, or a Playwright config. The scenario text stays readable if you open it in a year. Second, Assrt runs through @playwright/mcp with accessibility-tree refs (like [ref=e12]) rather than CSS or XPath. When the product ships a redesign, the same plan keeps running because 'the Submit button' is still named Submit in the accessibility tree even if its class name changed. Third, Assrt maintains its own agent loop with 18 low-level browser tools at /Users/matthewdi/assrt-mcp/src/core/agent.ts, including the primitives that made this category possible: wait_for_stable (MutationObserver-based), create_temp_email (real disposable inbox via temp-mail.io), http_request (for downstream webhook verification), and an infinite-step recovery loop. The coding agent does not build those; it gets them for free.

Does the agent need to know Playwright to use Assrt?

No. The only surface it sees is the three MCP tools and the scenario file. The tool schema for assrt_test, defined at server.ts lines 338-358, takes a URL, a Markdown plan, optional variables, optional pass criteria, optional viewport preset ('mobile' or 'desktop'), optional tags, and a few run-mode flags (extension, isolated, headed, keepBrowserOpen). The LLM, by default claude-haiku-4-5-20251001, handles everything below that: deciding which element to click, snapshotting the accessibility tree, pasting an OTP atomically, polling a disposable inbox, or making an http_request to verify a webhook. From the coding agent's perspective, 'run a browser test' is one tool call and a structured response.
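A hedged reconstruction of the parameter shape from the fields named above; the authoritative Zod schema lives at server.ts lines 338-358, and any field or helper not listed in the text is an assumption:

```typescript
// Illustrative parameter shape for assrt_test, reconstructed from the
// fields named in the text; not the authoritative Zod schema.
interface AssrtTestParams {
  url: string;
  plan?: string;            // Markdown #Case blocks
  scenarioId?: string;      // UUID of a stored scenario
  variables?: Record<string, string>;
  passCriteria?: string;
  viewport?: "mobile" | "desktop";
  tags?: string[];
  extension?: boolean;      // drive the user's already-running Chrome
  isolated?: boolean;       // in-memory context, zero persistence
  headed?: boolean;
  keepBrowserOpen?: boolean;
}

// Hypothetical agent-side sanity check before making the MCP call.
function validate(p: AssrtTestParams): string[] {
  const problems: string[] = [];
  if (!/^https?:\/\//.test(p.url)) problems.push("url must be http(s)");
  if (!p.plan && !p.scenarioId) problems.push("need a plan or a scenarioId");
  return problems;
}

console.log(validate({ url: "http://localhost:3000/signup", plan: "#Case 1: ..." }));
// []
```

Everything below this surface, from element selection to OTP handling, is the server's job, not the caller's.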

What about tests that need a real logged-in user — how does an AI agent handle that?

Pass extension: true to assrt_test. Assrt then connects to the user's already-running Chrome via Chrome DevTools Protocol (Playwright's --extension mode), which keeps saved cookies, logins, tabs, and browser fingerprint. On the first call, the server returns an extension_token_required error with setup instructions; the agent prompts the user once for a token from chrome://inspect, the token saves to ~/.assrt/extension-token, and every later call reuses it transparently. This means the AI agent can drive a test that hits Gmail, an internal dashboard, or a Stripe-gated flow without dealing with OAuth itself. The other profile modes are 'isolated' (in-memory browser context, zero persistence) and the default (~/.assrt/browser-profile on disk, persistent cookies across runs).

Why block the MCP conversation during a test run — isn't that slow?

It is the correct default because 99 percent of test runs are short and because the agent is evaluating one hypothesis at a time. Blocking keeps the conversation coherent: the agent calls assrt_test, gets a result, decides what to do next. For the one percent of long-running integration suites, the SERVER_INSTRUCTIONS block explicitly hands the agent the non-blocking escape at line 315: 'npx assrt run --url ... --plan ... --json --run-in-background' via the Bash tool, with the same video recording and the same --video / --extension / --json output. The agent can keep coding while the suite runs in the background and pick up the JSON report when it finishes.

What does 'testing for AI' look like in a single terminal transcript?

Add the assrt-mcp server to a Claude Code or Cursor MCP config. Ask the agent to make any UI change. Watch the tool-call log: Edit(src/components/Signup.tsx), Read(/tmp/assrt/scenario.md), assrt_test({url: 'http://localhost:3000/signup', plan: '#Case 1: A new user can complete signup...'}), and, if the case fails, assrt_diagnose({url, scenario, error}), then Edit(src/components/Signup.tsx) or Edit(/tmp/assrt/scenario.md), then assrt_test again. The loop runs end-to-end with no human input. The video_player_url field returned from assrt_test opens a local HTML5 player on port 8081 with 1x-10x playback and frame seeking; if you want to see what the agent saw, open the URL.
