AI generated Playwright E2E tests: two shapes, only one survives the next UI regen
One shape emits `.spec.ts` once and prays. The other writes a plain English plan once and re-derives the Playwright calls every run from a fresh accessibility tree. This is the second shape, with the source files and line numbers from the open source agent.
Yes, AI can generate Playwright E2E tests, in two flavors. Flavor one: the model writes a `.spec.ts` file with hard-coded locators. It works on the first run and breaks on the next UI regen. Flavor two: the model writes a 6-bullet plain English plan once, and a runtime agent re-derives the Playwright calls every run from a fresh accessibility tree. Flavor two is what Assrt ships; the 18-tool agent loop is at src/core/agent.ts lines 16-198 of the open source assrt-mcp repo.
The two flavors, side by side
An AI test generator can produce either of two outputs. Both are real. Only the second is durable on a UI that the AI itself keeps regenerating.
What an AI generated Playwright E2E test actually is
Flavor one is a one-shot `.spec.ts` codegen. It hard-codes the className the model saw today: `await page.locator('input.email-field-v3').fill('a@b.c');` followed by `await page.locator('button.btn-primary[type=submit]').click();`. On the first UI regen, `.email-field-v3` becomes `.input` and the submit button loses `.btn-primary`. The test fails. You re-prompt the model. Repeat forever.
- Hard codes today's className strings
- Re-prompt the model on every UI regen
- Test failures are usually selector drift, not real regressions
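The drift described above can be reproduced without a browser. The following is a minimal sketch, not Assrt or Playwright code: `RenderedElement`, both sample renders, and the accessible names `Email` and `Create account` are assumptions made up for illustration. It compares a class-based lookup with a role-plus-name lookup across a regen.

```typescript
// Hypothetical, simplified model of a rendered element (illustration only).
interface RenderedElement {
  tag: string;
  classes: string[];
  role: string;
  accessibleName: string;
}

// Render the model saw when it wrote the .spec.ts.
const beforeRegen: RenderedElement[] = [
  { tag: "input", classes: ["email-field-v3"], role: "textbox", accessibleName: "Email" },
  { tag: "button", classes: ["btn-primary"], role: "button", accessibleName: "Create account" },
];

// Same user-visible behavior after a UI regen; only the class names changed.
const afterRegen: RenderedElement[] = [
  { tag: "input", classes: ["input"], role: "textbox", accessibleName: "Email" },
  { tag: "button", classes: ["px-4", "py-2", "rounded"], role: "button", accessibleName: "Create account" },
];

// Flavor one: bind to the class name that happened to exist at generation time.
function findByClass(page: RenderedElement[], tag: string, cls: string): RenderedElement | undefined {
  return page.find((el) => el.tag === tag && el.classes.includes(cls));
}

// Flavor two: bind to what the user (and assistive tech) actually perceives.
function findByRoleAndName(page: RenderedElement[], role: string, name: string): RenderedElement | undefined {
  return page.find((el) => el.role === role && el.accessibleName === name);
}

const classHitBefore = findByClass(beforeRegen, "button", "btn-primary") !== undefined;
const classHitAfter = findByClass(afterRegen, "button", "btn-primary") !== undefined;
const nameHitAfter = findByRoleAndName(afterRegen, "button", "Create account") !== undefined;
console.log({ classHitBefore, classHitAfter, nameHitAfter });
```

The class lookup succeeds only against the render it was written for, while the role-and-name lookup keeps matching as long as the button is still a button labeled the same way.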
How the durable flavor actually works
Four moves. Each one points at a real file in the open source repo. There is no SaaS in the loop.
1. Generate the plan
`npx @m13v/assrt discover https://your-app.com` runs once. Three screenshots plus an 8k-char accessibility tree go to the model with the 18-line PLAN_SYSTEM_PROMPT. You get 5 to 8 `#Case` bullets in plain English.
2. Commit the Markdown
The plan is a text file. No selectors, no .spec.ts. Diff it in PRs, grep it for coverage, edit it by hand when product changes. The plan is the artifact you own.
3. Resolve at run time
Each run, the agent reads a fresh accessibility tree, picks the ref for each interaction, and calls the matching Playwright primitive through `@playwright/mcp`. Today's ref is `e7`, tomorrow's might be `e34`. The plan never changes.
4. Review the run, not the code
Per-step PNG, full WebM, and a JSON report land in `/tmp/assrt/results/latest.json`. The `.spec.ts` you would have reviewed was always a derivative; the run is the truth.
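The run-time resolution step can be sketched in a few lines. This is an illustration under stated assumptions, not the code in the assrt-mcp repo: it assumes snapshot lines of the simplified form `- role "name" [ref=eN]`, and `parseSnapshot`, `resolveRef`, and both sample snapshots are hypothetical.

```typescript
// One interactive node from an accessibility snapshot (simplified, hypothetical shape).
interface SnapshotNode {
  role: string;
  name: string;
  ref: string;
}

// Assumed line format: `- button "Create account" [ref=e7]`, optionally with
// extra bracketed attributes between the name and the ref.
const SNAPSHOT_LINE = /^\s*-\s+(\w+)\s+"([^"]*)".*?\[ref=(\w+)\]/;

function parseSnapshot(text: string): SnapshotNode[] {
  const nodes: SnapshotNode[] = [];
  for (const raw of text.split("\n")) {
    const m = SNAPSHOT_LINE.exec(raw);
    if (m) nodes.push({ role: m[1], name: m[2], ref: m[3] });
  }
  return nodes;
}

// Resolve the intent "the button named X" to whatever ref this render assigned.
function resolveRef(snapshot: string, role: string, name: string): string | undefined {
  const node = parseSnapshot(snapshot).find((n) => n.role === role && n.name === name);
  return node ? node.ref : undefined;
}

const today = [
  '- heading "Sign up" [ref=e3]',
  '- textbox "Email" [ref=e5]',
  '- button "Create account" [ref=e7]',
].join("\n");

// A regen added elements and renumbered every ref; the intent is unchanged.
const tomorrow = [
  '- banner "Promo" [ref=e2]',
  '- heading "Sign up" [level=1] [ref=e30]',
  '- textbox "Email" [ref=e32]',
  '- button "Create account" [ref=e34]',
].join("\n");

console.log(resolveRef(today, "button", "Create account"));    // e7
console.log(resolveRef(tomorrow, "button", "Create account")); // e34
```

The plan line "Click Create account" is the only stable input here; the ref is looked up fresh on each run and discarded afterwards.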
Anchor fact: the 18 tools the model is allowed to pick from
The agent's tool surface is fixed at compile time. It is declared as a typed `Anthropic.Tool[]` array at src/core/agent.ts lines 16 to 198 in the assrt-mcp repo. Default model: claude-haiku-4-5-20251001 (line 9). The Anthropic SDK rejects any tool name not in the array, so the model gets to decide which Playwright primitive to fire and what ref to pass, but it cannot escape into freeform code you would then have to trust.
Every browser primitive forwards to a spawned @playwright/mcp subprocess (resolver at browser.ts:284; snapshot call at browser.ts:590). Real Playwright underneath, every time.
Why bother with the plan layer at all
If your UI is hand maintained and selectors are stable, you should not. Hand written Playwright is excellent for that case. The plan-then-resolve flavor exists for the case the AI testing tools market does not name out loud: most teams now ship UI that an AI keeps regenerating. The maintenance cost of a 60-line `.spec.ts` grows linearly with the number of regenerations. The maintenance cost of a 6-bullet plan stays roughly flat across the same churn. You can keep both: hand written `.spec.ts` for the slow pages, `#Case` plans for the AI-churned ones, run them in the same CI pipeline.
The other reason: a teammate who does not know Playwright can read a plan. A product manager can review one before a release. A coding agent can edit one between turns without breaking your build. The plan is a contract; the Playwright code is a byproduct.
What a durable AI generated Playwright E2E test should have
The durable-flavor checklist
- Plan written in plain English, not in `await page.locator(...)` calls
- Selectors resolved at run time from a fresh accessibility tree
- Real Playwright under the hood through `@playwright/mcp`, not a custom protocol
- Disposable email, OTP, and magic-link flows handled by named tools
- WebM recording and zero-padded PNGs written per step
- Open source, MIT, no SaaS dashboard, no proprietary YAML
One thing the marketing pages do not say
“The `.spec.ts` was never the test. The run was. Once you stop treating the generated code as the artifact and start treating the plan plus the run video as the artifact, your suite stops costing a sprint a quarter to maintain.”
Want a plan generated against your app live?
20 minutes. We point assrt_plan at your URL, walk through the cases it emits, and you keep the Markdown.
Frequently asked questions
Can AI actually generate Playwright E2E tests that pass?
Yes, with a caveat that decides whether they keep passing. Two shapes are possible. One is one-shot codegen: feed a model your app, ask for `.spec.ts`, paste it into your repo. This works for the first run. It fails the moment the underlying app changes any className, aria label, or DOM nesting that the model hard coded into a locator, which on an AI generated UI is roughly every day. The other shape is plan-then-resolve: the model writes a short plain English plan describing user intent, and a runtime agent re-derives the Playwright calls each run from a fresh accessibility tree. Assrt is the second shape. Both are 'AI generated Playwright E2E tests'; only the second survives a regen.
What does an AI generated Playwright E2E test look like in the plan-then-resolve shape?
Six bullets in plain English. A real one: `#Case 1: New user signs up and reaches the dashboard.` then `Navigate to /signup. Fill the email input with a fresh address. Fill the password input. Click Create account. Wait for the dashboard heading. Assert the URL contains /dashboard.` That is the entire test file. No imports, no fixtures, no selectors. The model that wrote it never had to guess what the Create account button's className is. At run time the Assrt agent calls its `snapshot` tool, reads the accessibility tree, sees a button with the accessible name Create account, gets back ref `e7`, and calls `browser.click('Create account', 'e7')`. The Playwright code exists, it just exists for one run.
How does the AI pick the right Playwright primitive each step?
From a fixed 18-tool surface declared at src/core/agent.ts lines 16 to 198 in the assrt-mcp repo. The tools are: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, wait_for_stable. The default model is `claude-haiku-4-5-20251001` (agent.ts line 9). The model cannot invent a 19th tool; the Anthropic SDK rejects any tool name not in the typed array. So the agent gets to decide which Playwright primitive fits the next step and which ref to pass to it, but it cannot generate freeform Playwright code that you then have to trust or review.
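To make the closed tool surface concrete, here is a minimal sketch of a local allowlist guard. It is not the declaration in agent.ts and it is not how the SDK enforces anything; the tool names are the 18 listed above, while `isToolName`, `ToolCall`, and `dispatch` are hypothetical names for illustration.

```typescript
// The 18 tool names listed in this article, frozen as a literal tuple.
const TOOL_NAMES = [
  "navigate", "snapshot", "click", "type_text", "select_option", "scroll",
  "press_key", "wait", "screenshot", "evaluate", "create_temp_email",
  "wait_for_verification_code", "check_email_inbox", "assert",
  "complete_scenario", "suggest_improvement", "http_request", "wait_for_stable",
] as const;

// Union type derived from the tuple: "navigate" | "snapshot" | ...
type ToolName = (typeof TOOL_NAMES)[number];

// Runtime guard that narrows an arbitrary string to the closed set.
function isToolName(name: string): name is ToolName {
  return (TOOL_NAMES as readonly string[]).includes(name);
}

// Hypothetical shape of a tool call coming back from the model.
interface ToolCall {
  name: string;
  input: Record<string, unknown>;
}

// Hypothetical dispatcher: anything outside the allowlist is refused
// before it can reach a browser primitive.
function dispatch(call: ToolCall): string {
  if (!isToolName(call.name)) {
    throw new Error(`Unknown tool: ${call.name}`);
  }
  return `dispatching ${call.name}`;
}

console.log(dispatch({ name: "click", input: { element: "Create account", ref: "e7" } }));
```

Deriving the `ToolName` union from the array keeps the type and the runtime check in sync, so adding a 19th tool is a single, reviewable change to one list.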
Is this real Playwright or some custom browser thing dressed up as Playwright?
Real Playwright through the official `@playwright/mcp` package. The Assrt agent spawns it as a stdio subprocess; the resolver is at src/core/browser.ts line 284 (`require_.resolve('@playwright/mcp/package.json')`). Every browser primitive the agent calls forwards to that subprocess. The snapshot tool calls `browser_snapshot` (browser.ts line 590). The Chrome instance is Playwright Chromium. The video output is the WebM Playwright records. Nothing in the path is custom rendering or a proprietary protocol. Anything Playwright can do is reachable.
Why is the one-shot .spec.ts flavor so brittle on AI generated apps specifically?
Because the contract those tests bind to does not survive. A normal test suite assumes today's `button.btn-primary[type=submit]` is still there tomorrow. AI codegen breaks that. Cursor regenerates the form and decides Tailwind utility classes are cleaner; Claude Code switches a `<div>` to a `<form>`; Lovable rewrites the whole page in a different shadcn variant. The user-visible behavior is identical: there is still a Send button, still labeled Send, still triggering a network call. Every selector in your suite is now wrong. The one-shot `.spec.ts` flavor was great for slow-moving hand-maintained UIs; it is a tax on AI-generated ones.
Do I lose Playwright's debugging story by working with a plan instead of a .spec.ts?
No, you gain a different one. Each step writes three things to disk: a zero-padded PNG of the post-action page, an entry in `/tmp/assrt/results/latest.json` with the exact tool call and arguments, and a WebM recording of the whole run. You also get the resolved ref the agent clicked, which is more informative than a CSS selector because it tells you what the page actually exposed to assistive tech at that instant. The Playwright trace viewer still works because the run is real Playwright. The plan is what you commit; the resolved Playwright calls per run are diffable artifacts.
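One small detail worth illustrating is why the per-step PNG names are zero-padded. The sketch below is an assumption about naming, not the tool's actual scheme: `stepScreenshotName` and the width of 3 are hypothetical.

```typescript
// Hypothetical naming helper: pad the step index so names sort in run order.
function stepScreenshotName(stepIndex: number, width: number = 3): string {
  return String(stepIndex).padStart(width, "0") + ".png";
}

const steps = [1, 2, 10];

// Without padding, a plain lexicographic sort puts step 10 before step 2.
const unpadded = steps.map((n) => `${n}.png`).sort();

// With padding, lexicographic order and execution order agree.
const padded = steps.map((n) => stepScreenshotName(n)).sort();

console.log(unpadded);
console.log(padded);
```

Padded names mean a file browser, a CI artifact listing, or a plain `ls` shows the screenshots in the order the steps ran.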
Can I use AI generated Playwright E2E tests for an app with auth, OTP, or magic links?
Yes, the 18-tool surface includes `create_temp_email`, `wait_for_verification_code`, and `check_email_inbox` for exactly that case. The agent provisions a disposable inbox, types its address into your signup form, then polls the inbox for the code and types it back in. The plan bullet looks like `Fill email with a fresh disposable address. Fill password. Click Sign up. Wait for verification code. Enter the code.` The agent does the rest. You do not write the inbox wiring; you do not hard code a test email address that goes stale.
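The poll-then-type loop can be sketched as follows. This is a hedged illustration, not the implementation behind `wait_for_verification_code`: the six-digit code format, the `fetchLatestBody` callback, and the retry defaults are all assumptions.

```typescript
// Assumption: the verification code is a standalone run of exactly six digits.
function extractVerificationCode(body: string): string | undefined {
  const match = /\b(\d{6})\b/.exec(body);
  return match ? match[1] : undefined;
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Hypothetical poller: `fetchLatestBody` stands in for whatever reads the
// disposable inbox and returns the newest message body, if any.
async function waitForCode(
  fetchLatestBody: () => Promise<string | undefined>,
  attempts: number = 10,
  delayMs: number = 1000,
): Promise<string> {
  for (let i = 0; i < attempts; i++) {
    const body = await fetchLatestBody();
    const code = body ? extractVerificationCode(body) : undefined;
    if (code) return code;
    // Wait between polls, but not after the final attempt.
    if (i < attempts - 1) await sleep(delayMs);
  }
  throw new Error(`No verification code after ${attempts} attempts`);
}

console.log(extractVerificationCode("Your verification code is 482913. It expires in 10 minutes."));
```

Bounding the number of attempts matters in CI: a signup flow that never sends the email should fail the case with a clear message instead of hanging the job.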
How is this different from Playwright's own codegen recorder?
Playwright's recorder watches you click, then emits `.spec.ts` with `getByRole`, `getByText`, and the like. The output is good Playwright but the same brittleness category as any one-shot codegen: it freezes a specific path through a specific render of the page. If the page changes, you re-record. The plan-then-resolve shape skips the freeze. The recorder is great when you already know the flow and the UI is stable. The plan format is great when the UI churns or when you want a non-Playwright-fluent teammate (or a coding agent) to maintain the suite by editing a Markdown file.
Where do I run these in CI?
Same place you run any Playwright E2E suite. The Assrt CLI is `npx @m13v/assrt discover https://your-app.com` for a fresh plan, then `npx @m13v/assrt run --url $URL --plan path/to/case.md --json` per CI job. The JSON exit gates the merge. On failure, the WebM video and the latest.json report upload as CI artifacts. Because the plan never names a selector, the same plan runs against `main` and the PR branch with no diff; any failure is a real regression in user-visible behavior, not a selector that drifted. The MCP form (`assrt_test`) lets a coding agent run the same checks before opening the PR.
What does `assrt_plan` actually send to the model when it generates the tests?
Three screenshots taken at different scroll depths, the page's accessibility tree truncated to about 8000 characters, and a system prompt that lives at src/mcp/server.ts line 219. The prompt opens with `You are a Senior QA Engineer generating test cases for an AI browser agent.` It enumerates the agent's capabilities and explicit limits (cannot resize the browser, cannot inspect CSS, cannot test network errors) so the generated cases stay inside what the runtime agent can actually execute. The output is `#Case` Markdown, which the executor parses one case at a time. The whole pipeline is open source, MIT, and the prompts are not minified or obfuscated; they are 18-line literals you can read in the repo.
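Parsing `#Case` Markdown one case at a time can be sketched briefly. The parser below is an illustration, not the executor in the repo: it assumes a `#Case N: title` header line followed by one step per line with an optional `- ` marker, and `PlanCase` and `parsePlan` are hypothetical names.

```typescript
// Hypothetical in-memory shape for one parsed case.
interface PlanCase {
  id: number;
  title: string;
  steps: string[];
}

// Assumed layout: a `#Case N: title` header, then one plain English step per line.
function parsePlan(markdown: string): PlanCase[] {
  const cases: PlanCase[] = [];
  let current: PlanCase | undefined;
  for (const raw of markdown.split("\n")) {
    const line = raw.trim();
    if (line === "") continue;
    const header = /^#Case\s+(\d+):\s*(.+)$/.exec(line);
    if (header) {
      current = { id: Number(header[1]), title: header[2], steps: [] };
      cases.push(current);
    } else if (current) {
      // Drop an optional leading bullet marker; keep the sentence as written.
      current.steps.push(line.replace(/^-\s+/, ""));
    }
  }
  return cases;
}

const plan = [
  "#Case 1: New user signs up and reaches the dashboard.",
  "- Navigate to /signup.",
  "- Fill the email input with a fresh address.",
  "- Fill the password input.",
  "- Click Create account.",
  "- Wait for the dashboard heading.",
  "- Assert the URL contains /dashboard.",
  "",
  "#Case 2: Existing user signs in.",
  "- Navigate to /login.",
  "- Fill email and password.",
  "- Click Sign in.",
].join("\n");

const parsed = parsePlan(plan);
console.log(parsed.map((c) => `${c.id}: ${c.steps.length} steps`));
```

Because the steps stay as plain sentences, the parser never needs to know anything about selectors; interpreting each sentence is left to the runtime agent.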