End to end testing for AI generated code, without selectors that break on the next regeneration

Cursor, Claude Code, Lovable, v0, Bolt, and the rest do not promise stable markup. They rewrite className strings on a whim. They restructure wrapper divs between regenerations. They drop and re-add data-testid attributes silently. A test plan that hard codes any of those dies the first time the agent rebuilds the component, and you spend the next twenty minutes diffing locators instead of shipping. The fix is not better selectors. It is no selectors at all.

This walks through how the open source Assrt agent handles end to end testing for AI generated code: the scenario file is plain Markdown that names a user intent, the agent calls a snapshot tool that returns the live accessibility tree with [ref=eN] IDs, and every click and type is resolved from the ref the snapshot just produced. Nothing about the implementation is persisted in the test. Every claim below points at a file path and a line number in the public repo.

Matthew Diakonov · 12 min read

MIT licensed · Real Playwright under @playwright/mcp · Zero selectors in scenario files · Survives the next regeneration

Why selector based tests die the moment AI touches the code

A coding agent is allowed to rewrite the implementation on every iteration. That is the entire value proposition: tell it what you want, accept any working answer. The trouble is your existing test suite assumes the implementation is stable across runs. The Send button selector you wrote yesterday was .btn-primary; today, after one prompt to clean up the styling, it is a Tailwind chain forty characters long; tomorrow, after a shadcn upgrade, the wrapper element flips from div to form and the type changes. None of those changes broke the user flow. All of them broke your test.

This is not a flaky tests problem. Flaky tests are a timing problem. This is a contract problem. Your test bound itself to the implementation, the AI is allowed to change the implementation, the test was always going to lose. The only durable contract is the one a real user (or a screen reader) experiences: there is a button, it is labeled Send, clicking it sends the message. Bind the test to that contract and the AI can rewrite the markup as often as it wants without invalidating the suite.

Cursor regenerates your markup. So do Claude Code, v0, Bolt, Lovable, Replit Agent, Windsurf, Aider, Continue, Cline, Devin, and GitHub Copilot Workspace.

The pattern that survives regeneration: nothing in the plan, everything from the tree

The Assrt scenario format is plain Markdown. A test case is a heading like #Case 1: A user sends a message followed by a few bullets of imperative English. There is no DSL. There is no selector syntax. The bullets describe what a user would do; the agent figures out how at runtime.

At runtime the agent calls one of 18 fixed tools (defined at agent.ts:16-196). The load bearing one for this discussion is snapshot: it forwards to browser_snapshot on the underlying Playwright MCP server, which returns the page's accessibility tree as text with each interactive element tagged [ref=eN]. The agent reads the tree, picks the ref for the element that matches the bullet ("Send button"), then calls click(element, ref). The ref is generated this run, used this run, and forgotten. Tomorrow's run gets a new tree, new refs, the same passing test.
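A minimal sketch of that loop in TypeScript. The McpClient interface and the regex are hypothetical stand-ins, not the repo's API; the tool names and argument shapes are @playwright/mcp's; in the real agent the resolution step is done by the model, not a regex:

interface McpClient {
  callTool(name: string, args: Record<string, unknown>): Promise<string>;
}

async function clickByIntent(mcp: McpClient, name: string): Promise<void> {
  // 1. Fresh tree, fresh refs: nothing is cached from a previous run.
  const tree = await mcp.callTool("browser_snapshot", {});
  // The tree contains lines like: - button "Send" [ref=e22]

  // 2. The real agent has the model read the tree and pick the ref whose
  //    role and accessible name match the bullet; a regex stands in here.
  const match = tree.match(new RegExp(`- button "${name}" \\[ref=(e\\d+)\\]`));
  if (!match) throw new Error(`no button named "${name}" in the current tree`);

  // 3. Resolve the click against the ref this snapshot just produced.
  //    The ref is used this run and forgotten; tomorrow's run re-snapshots.
  await mcp.callTool("browser_click", { element: `${name} button`, ref: match[1] });
}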

The runtime loop the agent runs for every interaction

scenario.md (user intent) → Assrt agent → snapshot → click / type → assert

Brittle .spec.ts vs durable #Case, side by side

Same user flow, two encodings. The Playwright spec below is correct today and likely wrong tomorrow; the #Case file after it is bound to user intent and survives the next AI regeneration.

One feature, two test contracts

// tests/chat.spec.ts (handwritten Playwright)
// 38 lines. Will break the next time Cursor regenerates the
// page and changes the className strings or the wrapper element.

import { test, expect } from "@playwright/test";

test("a user sends a message and gets a reply", async ({ page }) => {
  await page.goto("http://localhost:3000/chat");

  // The selector below is whatever class Cursor used today.
  // Tomorrow it might be 'btn-new-chat' or
  // 'flex-row-reverse-w-full-shadow-md-rounded-2xl-px-4-py-2'.
  await page.locator(".btn-new-chat").click();

  // The textarea selector is even worse: shadcn changes the
  // wrapper depth between minor versions.
  await page
    .locator('textarea[placeholder="Type a message..."]')
    .fill("Summarize the meeting notes");

  // The Send button is a different selector again because the
  // generated component composes shadcn Button with custom variant.
  await page.locator('button[type="submit"]').click();

  // The success state is the worst: AI generated UIs love to
  // wrap everything in fresh divs every regeneration.
  await page.waitForSelector(".message-assistant.message-final");

  const lastTurn = await page
    .locator(".message-assistant")
    .last()
    .textContent();
  expect(lastTurn?.length ?? 0).toBeGreaterThan(0);

  await expect(page.locator('button[type="submit"]')).toBeEnabled();
});
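
The durable encoding of the same flow is a #Case file along these lines; a sketch assembled from the scenario quoted in the FAQ below, so the repo's exact wording may differ:

tests/chat.case.md (sketch)

#Case 1: A user sends a message and gets a reply
- Navigate to /chat
- Click "New chat"
- Type "Summarize the meeting notes" into the message input
- Click Send
- Wait for the assistant reply to appear in the conversation
- Assert the latest assistant turn is non empty and Send is enabled again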
56% fewer lines, zero selectors

What the agent actually sees when it calls snapshot

Snapshot does not return the DOM. It does not return a screenshot. It returns the accessibility tree, the same data structure a screen reader announces from. Roles, accessible names, ARIA states, hierarchy, and the [ref=eN] IDs that are stable across the round trip from agent to browser and back. The tree below is what the agent gets back for a freshly Cursor generated chat page; the agent reads it, finds the textbox labeled Type a message... at [ref=e10], types into that ref, then re-snapshots before clicking Send.
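Abridged and illustrative; the refs and ordering here are examples rather than captured output, and a real tree runs much longer:

agent.snapshot.output (excerpt)

- region "Chat" [ref=e2]:
  - button "New chat" [ref=e5]
  - list [ref=e7]
  - textbox "Type a message..." [ref=e10]
  - button "Send" [ref=e12] [disabled]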

0 — the number of CSS selectors in a working Assrt scenario file. Selector resolution happens at runtime from the accessibility tree, so AI codegen churn cannot invalidate it. The plan file encodes user intent; the implementation is disposable.

Watching the loop run, step by step

The animation below walks through one full interaction. Five stages: the plan, the snapshot, the resolved tool call, the assert, and what happens when the AI regenerates the page underneath. The contract held; the test passed; the file did not change.

Plan → snapshot → ref → click → assert, then regenerate and run again
scenario.md

#Case 1: Send a message
- Navigate to /chat
- Click "New chat"
- Type "hello" into the message input
- Click Send

Step 1 of 5: The plan is plain English. Zero selectors. Bound to user intent, not implementation.

A real run, captured verbatim

The terminal output below is the agent driving Chrome through one #Case file twice. Between the two runs the application is regenerated by Cursor; the className strings change, the wrapper hierarchy shifts, the ref IDs all renumber. The plan file is not edited. The same scenario passes against both versions.

assrt run, two passes, plan unchanged

A complete chat scenario in 21 lines

For comparison with the weight of an equivalent Playwright suite, the file below is the entire end to end test for an AI generated chat page: three cases, eight assertions, full conversation persistence. No imports, no fixtures, no helpers, no selectors.

tests/chat.case.md
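
A plausible shape for that file, assuming case names and assertions consistent with the coverage described above; the repo's actual file differs in wording:

#Case 1: A user sends a message and gets a reply
- Navigate to /chat
- Click "New chat"
- Type "hello" into the message input
- Click Send
- Assert "hello" appears in the conversation
- Assert the latest assistant turn is non empty

#Case 2: Starting a new chat clears the composer
- Click "New chat"
- Assert the message input is empty
- Assert the conversation shows no prior turns

#Case 3: The conversation survives a reload
- Type "remember me" into the message input
- Click Send
- Reload the page
- Assert "remember me" still appears in the conversation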

18 fixed tools, no escape hatches into arbitrary code

The agent's surface is bounded. The full schema is a typed array at agent.ts:16-196. The Anthropic SDK rejects any tool name not in that array, which means the agent cannot invent a new locator API, cannot call into your repo, and cannot generate Playwright code that compiles to something different next run. It can decide which of the 18 to call and what arguments to pass; that is the entire decision space. Most scenarios use seven of them.
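
For flavor, one entry in that array looks roughly like this; an illustrative sketch in the shape the Anthropic Messages API expects for a tool, not a verbatim excerpt of agent.ts:

const clickTool = {
  name: "click",
  description:
    "Click an element on the page. Pass a human readable description of " +
    "the element plus the [ref=eN] ID from the most recent snapshot.",
  input_schema: {
    type: "object" as const,
    properties: {
      element: { type: "string", description: "What the element is, e.g. 'Send button'" },
      ref: { type: "string", description: "Ref from the latest snapshot, e.g. 'e22'" },
    },
    required: ["element", "ref"],
  },
};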


The bound surface matters specifically for AI generated code because it removes the temptation that every "AI generates Playwright tests" tool runs into: the agent gets to invent a locator on the fly, the locator looks plausible, it works on the page in front of the agent today, and it is wrong the moment the implementation churns. With a bound surface and an accessibility tree as the resolution layer, there is no locator to invent. The user said "Click Send"; the tree has a button labeled Send; the agent picks its current ref; click goes through. The next run does the same dance from scratch.

18 — fixed tools the agent can call
120,000 chars — max accessibility tree size before truncation
0 — CSS selectors in a #Case scenario
21 — lines for a complete chat test

The accessibility tree is the contract you actually want

Worth saying this out loud: binding to the accessibility tree is not a workaround. It is the most stable contract your application has. If a real user can find the Send button, the screen reader can announce the Send button, and the AI can label the Send button, then any of those three can drive a test. The accessibility tree is what they all agree on. Implementations that fail this test are accessibility bugs first and test bugs second; fix the bug, the test follows.

The size of the tree the agent will tolerate is bounded at 120,000 characters (browser.ts:523, the SNAPSHOT_MAX_CHARS constant). Anything beyond that is truncated with a marker telling the agent which refs survived. Most application pages are 5k to 20k of tree, so the cap rarely fires; when it does, the agent scrolls and re-snapshots a different region.
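
A sketch of what that guard could look like; the constant's value and location are from the repo, while the helper name and marker format here are invented:

const SNAPSHOT_MAX_CHARS = 120_000; // browser.ts:523

function capSnapshot(tree: string): string {
  if (tree.length <= SNAPSHOT_MAX_CHARS) return tree;
  const kept = tree.slice(0, SNAPSHOT_MAX_CHARS);
  // Tell the agent which refs survived so it can scroll and re-snapshot.
  const refs = [...kept.matchAll(/\[ref=(e\d+)\]/g)].map((m) => m[1]);
  return `${kept}\n[truncated at ${SNAPSHOT_MAX_CHARS} chars; ` +
    `${refs.length} refs preserved, ${refs[0]} through ${refs[refs.length - 1]}]`;
}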

From freshly generated app to a passing test, four steps

The on-ramp from a Cursor or Claude Code generated app to a working end to end test is short enough to run during the same coffee break that produced the feature. No fixtures, no test runner config, no Playwright project file.

1. Install the MCP into your coding agent

Run npx @assrt-ai/assrt setup. This registers the MCP server, drops a CLAUDE.md note, and installs a hook so the coding agent is reminded to test after user-facing changes. Works in Claude Code, Cursor, Windsurf, and any MCP-aware editor. Source for the setup script is in the open repo.

2. Write a #Case file that names what a user does, not what the markup looks like

Create tests/chat.case.md with bullet points that read like product requirements. Every step is a verb plus a target named the way a user would name it: "Click New chat" not "click .btn-new-chat". "Type into the message input" not "locate textarea#composer-input". The plan file is bound to user intent. The implementation can change underneath.

3. Run the agent against your dev server

npx @assrt-ai/assrt run --url http://localhost:3000 --plan tests/chat.case.md. The agent boots Chrome through Playwright MCP, navigates, calls snapshot, picks fresh ref IDs from the accessibility tree on every interaction, asserts on visible text and roles, and records a video. The default model is Claude Haiku for speed (agent.ts:9). Add --extension to attach to your real Chrome session for authenticated flows.

4. Re-run after the next AI regeneration, change nothing

When Cursor or Claude Code rewrites the component, do not touch the plan file. Re-run the same command. The agent re-reads the accessibility tree, picks new ref IDs, and executes the same user flow. If the user contract held, the test passes. If it fails, you have a real regression with a video and a screenshot for the human reviewer or the next agent loop. The plan is the regression contract; the AI's implementation churn is invisible to it.

The closed loop: the agent that wrote the code can run the test itself

Because Assrt ships as an MCP server with three tools (assrt_test, assrt_plan, assrt_diagnose), the same coding agent that just generated your feature can verify it works in a real browser before claiming the task done. Claude Code or Cursor calls assrt_test, the agent boots Chrome, runs the #Case file, returns a structured pass or fail with a video URL. If the test fails the coding agent sees exactly what the test agent saw and iterates on the implementation, not on the test. The plan file is the regression contract; the AI's implementation is disposable until the contract holds.
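
On the wire that is an ordinary MCP tools/call request. A hypothetical example; the argument names are illustrative, not the server's published schema:

{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "tools/call",
  "params": {
    "name": "assrt_test",
    "arguments": {
      "url": "http://localhost:3000",
      "plan": "tests/chat.case.md"
    }
  }
}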

That closed loop is the part most other testing tools cannot do. They run as a separate CI step, after the human has reviewed and merged. Assrt runs before the agent claims success, which is the only point where the agent can still learn from the failure. Same agent on both sides of the loop, same accessibility tree as the truth, same plan file describing the user contract.

Want help wiring this into your AI codegen workflow?

Bring a freshly generated repo and a flow you want to verify. We will write the first #Case together, run it against the live app, and leave you with a regression contract that survives the next regeneration.

Frequently asked questions

Why do tests written for AI generated code break so often, and how is this different from normal flaky tests?

Because the AI is allowed to rewrite the implementation any time. A normal test suite assumes the markup is stable across runs: today the Send button is a `<button class='btn-primary'>`, tomorrow it is still a `<button class='btn-primary'>`, and any selector you wrote keeps working. AI codegen breaks that contract. Cursor regenerates the form and decides to use Tailwind utility classes instead of a custom class; Claude Code switches the wrapper from a `<div>` to a `<form>` with a different aria-label; Lovable rewrites the entire page in a different shadcn variant. The user-visible behavior is the same: there is still a Send button, still labeled Send, still triggers a network call. But every selector and test id in your suite is now wrong. Flaky tests are timing problems. AI churn tests are contract problems, and the only way to make them survive is to bind the test to the user-visible contract (the accessible label, the role, the visible text) instead of the implementation.

How does Assrt avoid this when other tools cannot?

Two design choices in the agent. First, the scenario file is plain Markdown that says what a user would do, not what the implementation looks like. A real scenario reads `Click Send` not `await page.locator('.btn-primary[type=submit]').click()`. Second, the agent resolves the click at runtime by calling its `snapshot` tool, which returns the page's accessibility tree with `[ref=eN]` IDs (agent.ts:27-30). The agent reads the tree, finds the element labeled Send, picks the ref ID it just received (e.g. `e22`), and calls `browser.click(element, ref)` with that ref. Tomorrow the AI regenerates the component, the className changes, the DOM rearranges, the new render still labels the Send button Send. Snapshot reads the new tree, the new ref is `e34`, the agent uses `e34`. The plan file is unchanged. The test passes.

What does a working scenario look like end to end?

Six bullets per case in a plain text file. A real one: `#Case 1: New chat, type a message, send it, verify it appears.` followed by `Navigate to /chat. Click 'New chat'. Type 'hello' into the message input. Click Send. Wait for the message to appear in the conversation. Assert the latest assistant turn is non empty.` That is the entire test. No imports, no fixtures, no selectors. Run it with `npx @assrt-ai/assrt run --url http://localhost:3000 --plan path/to/case.md`. The agent fills in the rest each run: which ref to click, when to wait, when to take a screenshot, when to write the result. The contract you maintain is `#Case` plus user intent.

What about element ambiguity? If the page has two buttons labeled Send the agent could pick the wrong one.

True, and the accessibility tree is exactly how you disambiguate. The tree includes role, accessible name, and the surrounding hierarchy for every element. The agent's tool description for click (agent.ts:32-42) takes a human readable element description plus the ref. So you write `Click the Send button in the message composer` and the agent reads the tree, finds two buttons named Send, sees that one is in a region named Composer and the other is in a region named Settings, and picks the right ref. The disambiguation is in the description string, not in a CSS selector. If your UI has two truly indistinguishable buttons, that is a real accessibility bug for screen reader users; fix the bug, the test follows.
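
Illustratively, the disambiguating context lives right in the tree; the names and refs here are invented for the example:

- region "Composer" [ref=e8]:
  - textbox "Type a message..." [ref=e10]
  - button "Send" [ref=e12]
- region "Settings" [ref=e20]:
  - button "Send" [ref=e24]

The description "the Send button in the message composer" resolves to e12; no selector ever exists.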

What is the actual list of tools the agent has? Can it run arbitrary code, and is that not its own risk?

The surface is exactly 18 fixed tools, declared as a typed array at agent.ts:16-196. They are: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, wait_for_stable. The agent cannot invent new tools. The Anthropic SDK rejects any tool name not in the schema. So while the agent can decide which of the 18 to call and what arguments to pass, it cannot generate, eval, or escape into arbitrary Playwright. `evaluate` does run JavaScript inside the browser (the `page.evaluate` analogue), but that is the only escape hatch and it returns a string the agent then has to act on. Most scenarios never call it.
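
Rendered as a TypeScript union for reference; a convenience rendering of the list above, not the repo's own declaration:

type AssrtTool =
  | "navigate" | "snapshot" | "click" | "type_text" | "select_option"
  | "scroll" | "press_key" | "wait" | "screenshot" | "evaluate"
  | "create_temp_email" | "wait_for_verification_code" | "check_email_inbox"
  | "assert" | "complete_scenario" | "suggest_improvement"
  | "http_request" | "wait_for_stable";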

I already write Playwright tests. Why would I switch to a Markdown plan?

If your UI is hand maintained and the selectors are stable, you should not switch. Playwright is great for that case. The reason this exists is that AI generated UIs are not that case. Every time Cursor regenerates a screen, half your selectors lie. The maintenance cost of a 60 line .spec.ts grows linearly with the number of regenerations. A six bullet #Case file is bound to user intent, so its maintenance cost is closer to zero across the same churn. You can also keep both: write Playwright for the slow moving pages, write #Case files for the AI churned ones, run them in the same CI pipeline. The two outputs are interoperable because the underlying engine is real Playwright through @playwright/mcp. Nothing is locked in.

Does the agent actually use real Playwright or some custom browser thing?

Real Playwright through the official @playwright/mcp package. The Assrt agent is a thin layer that holds the agent loop, the scenario parser, and the test report writer; the browser primitives (navigate, click, type, snapshot) are forwarded to a Playwright MCP server that the agent spawns over stdio. browser.ts:116 and browser.ts:561 show `callTool('browser_snapshot')` issuing the request and returning the tree. Anything Playwright can do is reachable. The video recording is the same WebM Playwright produces. The Chrome instance is the same. The accessibility tree is Playwright's. No custom rendering, no proprietary protocol.

How big can the accessibility tree get before this falls over?

The agent caps the snapshot at 120,000 characters before it gets sent to the LLM (browser.ts:523, the SNAPSHOT_MAX_CHARS constant). A typical app page is 5k to 20k characters of tree. A long marketing page or a Wikipedia article hits the cap and the agent gets a truncation marker telling it which refs were preserved. In practice this only matters for two cases: very long lists (tables, feeds) where you want to scroll then re-snapshot, and pages that dump huge inline SVG. The cap exists because the model context budget is the real constraint, not the browser. If your AI generated app routinely produces pages too large for 120k of tree, that is also a UX bug worth fixing.

How do I run this in CI for a PR opened by a coding agent?

Three pieces. First, your dev server runs against the PR branch. Second, the CI job calls `npx @assrt-ai/assrt run --url http://localhost:3000 --plan tests/case.md --json` and gates the merge on the JSON exit. Third, on failure the run uploads the WebM video and the JSON report as CI artifacts so the human reviewer (or the next agent loop) sees exactly what happened. Because the plan never names a selector, you can run the same plan against `main` and the PR branch with no changes and trust that any failure is a real regression in user visible behavior, not a selector that drifted. The MCP server form (`assrt_test`) lets a coding agent run the same checks itself before opening the PR, which is the closed loop case.
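
As shell steps, runner agnostic; a sketch where `wait-on` and the dev server command stand in for whatever your project uses, and which assumes the CLI exits non-zero on failure:

npm run dev &                         # serve the PR branch
npx wait-on http://localhost:3000     # block until the server answers
npx @assrt-ai/assrt run --url http://localhost:3000 \
  --plan tests/chat.case.md --json > assrt-report.json
# Upload assrt-report.json and the WebM video as artifacts
# in your CI system's own syntax.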

What is the smallest possible setup that gets me from a freshly Cursor generated app to a passing end to end test?

Five steps. Install the MCP into your editor (`npx @assrt-ai/assrt setup`). Generate the feature with Cursor or Claude Code. Write a four line #Case file describing the user flow in plain English. Run `assrt_test` against `http://localhost:3000`. Watch the recorded video to confirm the agent did what you expected. Total time on a small feature is under five minutes. The plan file goes into the repo. The next time the AI regenerates the same component, you re-run the same plan; if it fails you have a real regression, if it passes you have proof the new code preserves the user contract.
