Agentic coding, tested
Test coverage during agentic coding: put the runner inside the loop, not at the end of it
A coding agent makes 5 to 20 user-facing edits before it ever hands back. CI fires once, days later, on the human's push. The gap in between is where regressions live. It is the gap the standard advice on agentic coding never closes, and it has a concrete fix.
Direct answer (verified 2026-05-08)
The way to keep behavioral test coverage during agentic coding is to plug the test runner into the same tool loop the agent is editing in. Not as a CI step. Not as a manual command the human runs after. As an MCP tool the agent calls itself, with results the agent reads as files before its next edit.
Assrt does this with three MCP tools: assrt_test runs the scenarios, assrt_plan crawls the app and writes new ones, and assrt_diagnose analyses a failure and proposes a fix. The agent calls them the same way it calls Read, Edit, or Bash. Plans are written to /tmp/assrt/scenario.md and pass/fail JSON to /tmp/assrt/results/latest.json so the agent can read its own results between turns.
Source for the integration surface: assrt-ai/assrt-mcp/src/mcp/server.ts. MIT licensed, open source.
“Three tools the agent calls itself: assrt_test, assrt_plan, assrt_diagnose. The whole MCP integration surface fits on one screen.”
assrt-ai/assrt-mcp on GitHub
Why "raise the coverage number" misses the point
Most advice you'll find on test coverage when working with coding agents reads like advice from before the agent existed. It says: write more unit tests. Set a coverage threshold in CI. Use the agent to generate the tests. Review them carefully.
That advice is fine when a human is doing the coding and CI is the only consumer of the test results. It does almost nothing for an agent in autonomous mode, because of one detail: the agent edits the code in a loop, and the loop has many turns before any commit. A typical Claude Code or Cursor agent run will make a dozen user-facing changes, fix two bugs the user did not ask about, and rename a route, all in the same tool-use loop. CI fires once at the end. By the time it fails, the agent is in a completely different mental state and the human has to reconstruct the regression by hand.
The number you actually want kept high is the number of real user flows that have passed an end-to-end run since the last edit. That number drops every time the agent touches a route, a form, an auth flow, or a payment integration, and it can only be restored by running the tests right then. Not at PR time. Right then.
Two mental models for "tests as feedback"
The shift here is small in words and large in practice. Tests stop being a thing the human runs to validate a finished change. They start being a thing the agent runs to decide what to do next.
Where the test runner sits
The agent edits code in a long autonomous turn. The user presses commit. CI runs the suite. CI reports pass/fail to the user, who interrupts the agent if it failed. The agent has to reconstruct the regression from a stale context window. Most fixes happen by hand because the agent never sees the failure in the same turn it caused it.
- Test results arrive after commit
- User is the integrator between CI and the agent
- Agent's context has rolled past the regression
- Fixes happen by hand or in a fresh agent session
What the loop actually looks like
Spelled out, with the file system and the MCP tool calls labelled, the agentic test loop looks like this. The interesting thing is how short it is. The whole feedback channel between the agent and the test runner is two files in /tmp/assrt and three MCP tool calls.
One turn of the agentic test loop
The agent writes its diff. The MCP tool runs the scenarios. The results land on disk. The agent reads them. If something failed, the agent calls diagnose on the same scenario and gets back a root cause it can act on. None of this requires the human to be in the chair.
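Spelled as pseudo tool calls, in the same style as the comparison in the next section, one turn looks roughly like this. The tool names and file paths are the documented ones; the parameter names are assumptions.
// The agent's own turn, no human in the chair.
await Edit({ file_path: "src/routes/checkout.tsx", old_string: "...", new_string: "..." });
await assrt_test({ url: "http://localhost:3000" });        // run the current scenario
await Read({ file_path: "/tmp/assrt/results/latest.json" });
// latest.json: per-case pass/fail, assertions, error strings, video path.
// All green: keep editing or hand back.
// Anything red: get a root cause before touching the code again.
await assrt_diagnose({ runId: "latest" });                  // parameter name assumed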
The integration surface, in source
The reason this works is that the MCP server is small enough to read in one sitting. Three server.tool(...) calls register the agent-facing surface. A short string of instructions tells the agent when to call each one. Everything else (the browser driver, the scenario sync, the video player) is private to the server.
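As a sketch of that shape (not the real file: the tool names come from this article, while the schema, handler, and instructions wording are placeholders), the registration with the TypeScript MCP SDK looks like this:
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// The instructions string is delivered to the client on connect and lands
// in the agent's system context. Wording here is paraphrased, not verbatim.
const server = new McpServer(
  { name: "assrt", version: "0.0.0" },
  { instructions: "Proactively use Assrt after any user-facing change..." }
);

// assrt_test: run the current scenario against a local dev server.
server.tool(
  "assrt_test",
  "Run the current scenario in a real browser and return per-case pass/fail.",
  { url: z.string().describe("Local dev server URL") },   // parameter name assumed
  async ({ url }) => ({
    content: [{ type: "text" as const, text: `ran scenario against ${url}` }],
  })
);

// assrt_plan and assrt_diagnose register the same way:
// a name, a one-line description, a zod schema, and a handler.

await server.connect(new StdioServerTransport());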
Two ways an agent can run E2E tests
// Agent in tool-loop, no test runner
// Has to: shell out to npx playwright,
// parse stdout, and make do without the
// trace viewer, the video, and the
// retry semantics. In practice it
// does none of this and just commits.
await Bash({
command: "npm test",
timeout: 120000,
});
// Output: 1 passed, 4 skipped, 0 failed
// Agent: "looks good!" Commits.
// Reality: the four skips matter.

The instructions string the MCP server hands to the connected agent is the cultural piece. It says, verbatim, what the agent should do and when:

"Proactively use Assrt after any user-facing change. Do not wait for the user to ask for testing."

"After implementing a feature or bug fix that touches UI, routes, forms, or user flows: run assrt_test against the local dev server to verify the change works end-to-end in a real browser."

"When a test fails: use assrt_diagnose to understand root cause before attempting a fix. Do not guess."
The MCP protocol delivers that string to the connected client when the agent first sees the server, so it lands directly in the agent's system context. There is no prompt engineering for the user to do; the runner ships with its own etiquette.
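For contrast with the Bash block above, the second of the two ways collapses to a single tool call. A sketch, with the url parameter name assumed:
// Agent in tool-loop, runner registered as an MCP tool
await assrt_test({ url: "http://localhost:3000" });
// Returns structured per-case pass/fail; nothing to parse from stdout,
// no skipped cases to misread. The full detail, including the video
// path, lands in /tmp/assrt/results/latest.json for the next Read.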
The two files the agent watches
The feedback channel between the test runner and the editing agent is intentionally boring: two files in /tmp/assrt. The agent already has a Read tool. It already has an Edit tool. The MCP server uses the same surface so the agent does not have to learn anything new.
File layout (verified in scenario-files.ts:17-20)
- /tmp/assrt/scenario.md — The current test plan, in plain Markdown with #Case N: headers. The agent can Read this to see what is being tested. If the agent edits it, an fs.watch with a 1-second debounce syncs the change back to the central scenario store automatically.
- /tmp/assrt/scenario.json — Metadata for the current scenario: id, name, url, updatedAt. The agent uses this to understand which scenario it is operating on.
- /tmp/assrt/results/latest.json — The most recent run's structured results: per-case pass/fail, assertions, error strings, and the path to the recorded video. This is the file the agent reads before deciding to edit again or to call assrt_diagnose.
- /tmp/assrt/results/<runId>.json — Per-run history. Useful when the agent wants to compare the current failure to the previous passing run on the same scenario.
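As orientation, here is the shape of latest.json that the layout above implies, written as a TypeScript interface. The field names are illustrative assumptions, not the real schema.
// Illustrative only: fields taken from the description above,
// names invented for the sketch.
interface AssrtRunResults {
  runId: string;
  cases: Array<{
    name: string;              // matches a "#Case N:" header in scenario.md
    status: "passed" | "failed";
    assertions: string[];      // what was checked
    error?: string;            // present on failure
  }>;
  videoPath: string;           // the recorded run
}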
That is the entire surface. No database. No proprietary trace format. No vendor server the agent has to authenticate against. When you want to leave, the directory is yours, the scenarios are plain text, and the runner under the hood is Playwright.
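Plain text means, concretely, something a human could have typed. A made-up example in the format described above; only the #Case N: header convention comes from the docs.
#Case 1: Sign in with valid credentials
Open /login. Fill email and password. Submit.
Expect the dashboard to load with the signed-in user's name visible.

#Case 2: Reject a wrong password
Open /login. Fill email and a wrong password. Submit.
Expect an error message and no navigation away from /login.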
What "in the loop" means in practice, by edit type
You do not want the agent calling assrt_test after it edits a comment. You do want it calling after a route change, a form change, or a payment-integration touch. The following is the rule of thumb that keeps coverage useful without turning every turn into a 90-second wait.
When the agent should test, and when it should not
Test:
- After editing or adding a route, page, or layout component
- After touching a form, login flow, or payment integration
- After upgrading a UI dependency that ships its own DOM (Monaco, Radix, Headless UI, etc.)
- Before any commit that changes user-facing behavior
- After a known-flaky scenario, with assrt_diagnose if it fails twice in a row

Do not test:
- After a typo fix, an import reorder, or a comment change
- On every line edit (this turns the agent into a slow CI machine and burns its context)
- Without a running dev server (the MCP server will refuse and tell the agent to start one first)
A short, honest list of failure modes
Putting the runner inside the loop is not free. Three things to watch.
- The agent will keep going if you let it. Once an agent learns it can test its own work, it can spend ten minutes iterating on a single failing case while you sip coffee. That is fine when the case is right. It is expensive when the case is wrong. Cap it with the timeout parameter on assrt_test and review the scenario before you walk away.
- The dev server is now part of the loop. The MCP server expects a real local URL. If the dev server crashes mid-turn, the agent sees navigation errors and may misdiagnose them as application bugs. The fix is to keep the dev server in a watched, restart-on-crash process and to instruct the agent (in your project's CLAUDE.md or equivalent) to verify the dev server is up before its first assrt_test of the session.
- Behavioral coverage is not the same as line coverage. None of this raises the line/branch coverage number you might be reporting to a board. It raises the proportion of real flows that pass after each edit, which is the number that actually correlates with users not seeing broken pages. Decide which number you want to optimise; do not pretend they are the same.
Getting started in the agent you already use
Assrt's MCP server runs over stdio, so any client that speaks MCP can connect: Claude Code, Cursor, the open-source Continue extension, custom agents built on the Anthropic SDK. Setup is a single line in the client's MCP config. The npm package is @assrt-ai/assrt; the binary is named assrt-mcp. On first connection the SERVER_INSTRUCTIONS land in the agent's context and it knows when to call the three tools.
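That single line usually looks like this in the mcpServers-style JSON most clients read. The package name is the one above; the key names and the npx invocation are assumptions to adapt to your client's config format.
{
  "mcpServers": {
    "assrt": {
      "command": "npx",
      "args": ["-y", "@assrt-ai/assrt"]
    }
  }
}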
The repo is at github.com/assrt-ai/assrt-mcp. MIT licensed; the only telemetry is an opt-out PostHog ping. Read src/mcp/server.ts before installing if you want to see the full integration surface in one file.
Wire test coverage into your agent loop
Twenty-minute call. Walk through your current setup, leave with the MCP config and the first three scenarios for the routes that break most often.
Frequently asked questions
What does test coverage even mean during agentic coding?
Two things, and people conflate them. There is line/branch coverage, which is a number you can compute statically against whatever the agent has written so far. And there is behavioral coverage, which is the set of user flows that have been verified end to end since the last edit. The first does not change much when an agent is editing; the second decays every time the agent touches a route or a form. The decay is what most teams are missing. The number you want to keep up is behavioral coverage, measured by how many of your real user flows passed the last time the agent ran the tests.
Why is CI too late for an agentic loop?
CI fires on commit. An agent doing autonomous work makes 5 to 20 user-facing edits per commit and never sees CI until the human pushes. By the time CI says something is broken, the agent has typed twelve more changes on top of the broken code, the context window has rolled past the regression, and the diff that caused it is buried. The signal arrives, but it cannot be acted on. The runner has to live inside the same tool loop the agent is already iterating in, with results the agent reads before its next edit.
Can the agent really run end-to-end browser tests itself?
Yes. That is what the Model Context Protocol is for. Assrt ships an MCP server (the binary is named assrt-mcp) that registers three tools the agent can call: assrt_test runs scenarios and returns structured pass/fail; assrt_plan crawls a URL and writes new scenarios; assrt_diagnose takes a failed run and proposes a fix. The agent calls them the same way it calls Read, Edit, or Bash. The browser, the navigation, the assertions, and the video recording all happen on the user's machine, not in CI.
How does the agent see the results between turns?
Every test run writes two files. The plan goes to /tmp/assrt/scenario.md as plain Markdown the agent can Read or Edit. The pass/fail JSON goes to /tmp/assrt/results/latest.json, plus a per-run copy at /tmp/assrt/results/<runId>.json for history. There is also a 1-second debounced fs.watch on /tmp/assrt/scenario.md, so when the agent edits the scenario file directly (because it noticed a flaky case during a fix), the change syncs back to the central scenario store without a separate save call.
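A minimal sketch of that watcher, with a hypothetical syncScenario helper standing in for the real push back to the scenario store:
import { watch } from "node:fs";
import { readFile } from "node:fs/promises";

// Hypothetical stand-in for the real sync call.
async function syncScenario(path: string): Promise<void> {
  const markdown = await readFile(path, "utf8");
  // ...send `markdown` back to the central scenario store...
}

// Re-sync at most once per second while the agent is editing the plan.
let timer: ReturnType<typeof setTimeout> | undefined;
watch("/tmp/assrt/scenario.md", () => {
  if (timer) clearTimeout(timer);
  timer = setTimeout(() => void syncScenario("/tmp/assrt/scenario.md"), 1000);
});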
Does Assrt lock me into a proprietary test format?
No. The plan format is plain Markdown with #Case N: headers, the kind a human would write. Under the hood the agent that runs the test drives a real browser through Playwright. If you stop using Assrt tomorrow, the scenarios are still readable, the videos still play, and the pattern (MCP server exposing tools to a coding agent) is reusable with any other Playwright-based runner. The repo is MIT, the code is on GitHub at assrt-ai/assrt-mcp.
Won't the agent just call the test tool too often and burn time?
It can, and the SERVER_INSTRUCTIONS string in the MCP server explicitly steers it. The instructions tell the agent to test after a user-facing change, before a commit, and after a failure (with assrt_diagnose). They also tell it to start a dev server first if one is not running. In practice the bigger risk is the opposite: an agent that ships UI work without testing because the user did not ask. The instructions push it to the right side of that trade-off without asking the user every time.
What is the difference between assrt_test and a Playwright spec file?
A Playwright spec is code; assrt_test is a tool call with a Markdown plan. The agent driving the browser inside Assrt translates the plan into Playwright actions at run time, and the resulting trace, video, and structured assertions are what gets returned. If you want raw Playwright, the underlying driver is open source and the scenarios on disk are the authoritative description of what you are testing. Most teams find the Markdown plan easier for an LLM to maintain than a hand-written spec, but you can mix both.
Can I see this without installing anything first?
Read src/mcp/server.ts on GitHub at assrt-ai/assrt-mcp. The three server.tool registrations and the SERVER_INSTRUCTIONS string are the entire integration surface. Everything else (browser session reuse, video player, scenario sync) is implementation detail you do not need to understand to use it.
Related guides
Agentic testing as an engineering discipline
Why splitting the tester into a separate agent with its own system prompt catches bugs that single-agent setups miss.
Vibe coding and the test coverage gap
AI coding tools ship features fast but skip tests. Why vibe-coded apps need automated coverage and how to add it.
Readable AI-generated tests
What it takes for AI-generated Playwright tests to actually be readable, and why that matters when an agent has to maintain them.