E2E testing without selectors and without code.
Every guide on this topic compares Cypress, Playwright, and Selenium and tells you which to pick. None of them asks the more useful question: does your end-to-end test artifact have to be code at all? In Assrt it is a Markdown file parsed by exactly one regex, executed by the same @playwright/mcp the Playwright team ships, with zero selectors in the file. This page walks through what changes when you remove the selector layer.
Every E2E framework's biggest decision is the artifact
Cypress writes JavaScript. Playwright writes TypeScript. Selenium writes Java or Python. Cucumber writes Gherkin .feature files with regex glue code underneath. Hosted SaaS competitors record proprietary YAML inside a vendor visual editor. Each of those artifacts dictates everything downstream: who can author tests, what code review feels like, whether you can grep them, whether they survive a runner migration, and how badly the suite breaks when the UI changes.
The artifact debate gets dressed up as a runner debate (which is faster? which has better debugging?), but the runner is replaceable; the artifact is not. Once you have a hundred .spec.ts files, you are not switching. So the most consequential decision in adopting an end-to-end stack is the one almost never asked: what does the test file look like?
Same flow, two artifacts
```typescript
// e2e/checkout.spec.ts (Playwright)
import { test, expect } from "@playwright/test";

test("guest checkout", async ({ page }) => {
  await page.goto("/products/sku-42");
  await page.getByTestId("add-to-cart-v2").click();
  await page.waitForTimeout(2000);
  await page.getByRole("link", { name: /cart/i }).click();
  await page.getByRole("button", { name: "Checkout" }).click();
  await page.getByLabel("Email").fill("test+abc@example.com");
  await page.getByLabel("Card number").fill("4242424242424242");
  await page.getByLabel("Expiry").fill("12/30");
  await page.getByLabel("CVC").fill("123");
  await page.getByRole("button", { name: /place order/i }).click();
  await expect(page).toHaveURL(/\/thanks/);
});
// Brittleness lives here. data-testid renames break it.
// The 2000ms hardcoded sleep is a guess. The next UI shuffle
// breaks the file, not the run.
```
- data-testid renames break the file
- page.waitForTimeout(2000) is a magic number guess
- Maintenance lives in the test you wrote
The whole test format is one regex
When the artifact is plain Markdown, the parser fits in a single line of TypeScript. There is no AST, no validation pass, no schema migration story. The parsing function at src/app/app/test/page.tsx:451 is the entire grammar.
That is the whole API surface a human or an agent has to learn. The regex accepts #Case 1:, Case 2., #Scenario 3:, Test 4: and a few obvious variants, because reasonable defaults beat strict syntax for a format humans hand-edit.
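To make that concrete, here is a hedged reconstruction of the parser. The regex and the name parsePlanText are quoted from this page; the surrounding code is an illustrative sketch, not the actual contents of page.tsx:451.

```typescript
// Sketch of the one-regex parser. Splits a Markdown plan into per-case
// blocks; everything that is not a case header passes through verbatim.
const CASE_SPLIT = /(?:#?\s*(?:Case|Scenario|Test))\s*\d*[:.]\s*/gi;

function parsePlanText(markdown: string): string[] {
  // Split on any "#Case 1:" / "Scenario 2." / "Test 3:" style header,
  // trim each block, and drop the empty chunk before the first header.
  return markdown
    .split(CASE_SPLIT)
    .map((block) => block.trim())
    .filter(Boolean);
}
```

Because split discards the headers themselves, each returned block is exactly the plain-English body the agent receives.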
Anchor file
/tmp/assrt/scenario.md
This is what an Assrt run reads on disk. Two cases, plain English, no imports, no fixtures, no afterEach. The agent edits this file in place; an fs.watch with a 1-second debounce syncs every save back to cloud storage so you can share a URL or replay the exact text later.
```markdown
# checkout.md (commit this file to your repo)

#Case 1: Guest checkout with a real card
Navigate to /products/sku-42. Click Add to cart. Click the Cart icon.
Click Checkout. Fill the shipping form with a US address. Use a
disposable email. Pick the test Stripe card 4242 4242 4242 4242,
expiry 12/30, CVC 123. Verify the URL contains /thanks and the heading
reads "Order confirmed".

#Case 2: The order shows up in account history
Navigate to /account/orders. Cookies from Case 1 should still be valid.
The newest row should match the order from Case 1.
```
What runs the file: the official Playwright MCP
Assrt does not implement a test runner. It embeds @playwright/mcp (the official Playwright MCP server, maintained by the Playwright team) and spawns it over stdio. An LLM agent reads your Markdown case, emits one tool call at a time (navigate, snapshot, click, type_text, wait_for_stable, assert, and twelve others) and Playwright MCP executes each call against a real Chromium. If you remove Assrt tomorrow, the MCP server keeps working and your Markdown files are still readable as documentation.
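What travels over that stdio pipe is JSON-RPC 2.0: every browser action is one tools/call request. A minimal sketch of the framing, assuming an illustrative "click" tool name and ref argument — the real tool names and schemas are defined by @playwright/mcp, not here.

```typescript
// Build one MCP tool-call message. MCP clients send JSON-RPC 2.0 frames
// to the server child process, one message per line on its stdin.
function toolCallFrame(
  id: number,
  name: string,
  args: Record<string, unknown>
): string {
  const msg = {
    jsonrpc: "2.0" as const,
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
  return JSON.stringify(msg) + "\n";
}
```

The agent's whole job reduces to emitting frames like this, one per turn, and reading the results back off stdout.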
What an Assrt run actually wires together
The lifecycle of one #Case
You write a Markdown file with #Case headers
Anything between two #Case lines is plain English. No imports, no fixtures, no afterEach hooks. Commit the file to your repo next to your code.
Assrt reads the file with one regex
parsePlanText splits on /(?:#?\s*(?:Case|Scenario|Test))\s*\d*[:.]\s*/gi and feeds each block to the agent verbatim. There is no DSL to learn.
TestAgent spawns @playwright/mcp over stdio
The official Playwright MCP server is launched as a child process. Cookies persist under ~/.assrt/browser-profile unless you pass --isolated.
Each turn: snapshot, decide, call a tool
The agent receives the page as an accessibility tree plus a JPEG screenshot, picks one of the 18 named tools, and binds refs from the live snapshot just before the click.
wait_for_stable replaces fixed timeouts
A MutationObserver is injected, mutations counted, and the call returns when the DOM is quiet for 2 consecutive seconds. The instrumentation is cleaned up before the next turn.
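The timing rule behind that wait can be modeled as a pure function. The 2-second quiet window and the 30-second ceiling are taken from this page; the function name and shape are illustrative, since the real tool drives a live MutationObserver rather than a timestamp list.

```typescript
// Pure sketch of wait_for_stable's quiet-period rule: the wait resolves
// once the DOM has produced no mutations for quietMs, and never waits
// past ceilingMs regardless of ongoing churn.
function stableAt(
  mutationTimesMs: number[], // ms offsets of each mutation since wait start
  quietMs = 2000,            // "quiet for 2 consecutive seconds"
  ceilingMs = 30_000         // hard ceiling mentioned in the FAQ below
): number {
  const last = mutationTimesMs.length ? Math.max(...mutationTimesMs) : 0;
  return Math.min(last + quietMs, ceilingMs);
}
```

A login that settles at 1.4 seconds resolves at 3.4 seconds; a page that never settles is cut off at the ceiling.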
complete_scenario writes the artifacts
Plan, results JSON with per-assertion evidence, and a WebM recording all land under /tmp/assrt/. The video player auto-opens at 5x speed; close it or share the file.
Selectors moved from your test to the live page
The traditional E2E pain story is selector drift: you wrote .btn-primary-v2 last sprint, the design team renamed it to .cta-buy-now this sprint, and a hundred tests fail with errors that have nothing to do with the user flow. Frameworks have responded with better locator APIs (Playwright's getByRole), data-testids, and self-healing patches that try alternate selectors when the primary breaks. Those help, but they all keep the binding authored once at write time.
Snapshot-then-act inverts that. Every click is preceded by a fresh accessibility-tree dump where each focusable element has an opaque id like e5 or e12. The agent reads the tree, finds an element whose accessible name matches its intent, and passes the ref straight to Playwright MCP, which resolves the id on the live page. Refs cannot drift because they only live for one turn. If the element is not in the tree, the agent cannot click it, full stop.
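A minimal sketch of that one-turn binding, with hypothetical types — the real accessibility tree carries more fields than role and name:

```typescript
// Hypothetical shape for one node of the accessibility snapshot. Refs
// like "e5" are opaque ids valid only for the snapshot they came from.
type SnapshotNode = { ref: string; role: string; name: string };

// Bind intent ("click the Add to cart button") to a ref at act time.
// If the element is not in the tree, there is nothing to click.
function resolveRef(
  snapshot: SnapshotNode[],
  role: string,
  name: string
): string {
  const hit = snapshot.find(
    (n) => n.role === role && n.name.toLowerCase() === name.toLowerCase()
  );
  if (!hit) throw new Error(`no ${role} named "${name}" in this snapshot`);
  return hit.ref; // stale after the next snapshot, by design
}
```

Note what is absent: no CSS selector, no data-testid, no XPath. The lookup key is the accessible name a human would also use.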
Same flow, where the binding lives
```typescript
// e2e/checkout.spec.ts (Playwright)
import { test, expect } from "@playwright/test";

test("guest checkout", async ({ page }) => {
  await page.goto("/products/sku-42");
  await page.getByTestId("add-to-cart-v2").click();
  await page.waitForTimeout(2000);
  await page.getByRole("link", { name: /cart/i }).click();
  await page.getByRole("button", { name: "Checkout" }).click();
  await page.getByLabel("Email").fill("test+abc@example.com");
  await page.getByLabel("Card number").fill("4242424242424242");
  await page.getByLabel("Expiry").fill("12/30");
  await page.getByLabel("CVC").fill("123");
  await page.getByRole("button", { name: /place order/i }).click();
  await expect(page).toHaveURL(/\/thanks/);
});
// Brittleness lives here. data-testid renames break it.
// The 2000ms hardcoded sleep is a guess. The next UI shuffle
// breaks the file, not the run.
```

The artifact-level differences in one table
What changes when the test stops being code.
| Feature | Selector-based stacks | Assrt |
|---|---|---|
| Test artifact | .spec.ts / .py / .feature with selectors baked in | scenario.md, plain Markdown, no selectors |
| Parser | Full TypeScript / Python / Gherkin grammar | One regex at page.tsx:451 |
| Selector binding | Authored once, breaks when DOM changes | Rebuilt at click time from a live accessibility snapshot |
| Wait strategy | Hardcoded ms or framework auto-wait on a known selector | MutationObserver, returns when DOM is quiet for 2s |
| Step cap | Implicit, by file length and assertion count | MAX_STEPS_PER_SCENARIO = Infinity (agent.ts line 7) |
| Where the runner runs | Vendor cloud or your CI, paid by seat | Local Chromium via @playwright/mcp, MIT-licensed, free |
| Login flows | Auth fixtures, storage state files, secrets management | --extension reuses your real Chrome profile, --isolated for clean runs |
| OTP / email loop | Mailosaur or similar paid inbox, glue code in the test | create_temp_email + wait_for_verification_code in the same run |
What a healthy E2E suite looks like in 2026
- The test file is readable English a non-engineer can review
- Selector strings live nowhere in the artifact
- Wait durations are not hardcoded; instrumentation decides them
- Login state is reusable from the developer's real Chrome profile
- Email and OTP flows complete without a paid inbox provider
- The runner is open source and matches what other tools use
- Every assertion stores free-form evidence a human can read
- The artifact survives a framework migration
When you should still write a .spec.ts
Some tests are easier to write as code. A trace-level performance probe with custom Playwright tracing API calls. A non-UI integration test that opens a database connection and asserts on rows. A hot-loop test that fires a thousand requests in two seconds and times them. Assrt is built for the part of E2E that mirrors a person: 5 to 20 actions on a real page, with assertions on what a human would see. For the rest, the right answer is a Playwright spec, a Vitest integration test, or a small Node script.
The two coexist comfortably. Both run the same Chromium under the hood. Both can share a logged-in profile with --extension or a saved storage state. The decision is per-flow, not all-or-nothing. The flows where Markdown wins are the ones where the maintenance cost dwarfs the writing cost: signup with OTP, checkout with a card, an admin page that changes every release.
Want a hand pointing Assrt at your stack?
Bring a flow you would rather not write Playwright for. Twenty minutes, one #Case, you keep the file.
Frequently asked questions
What does E2E testing mean? How is it different from unit and integration testing?
End-to-end testing exercises a real user flow against a running application: a browser opens the app, clicks through the UI, and verifies a final state. The unit layer below it isolates a single function with mocked dependencies; the integration layer in the middle checks that two or three components talk to each other correctly. E2E is the only layer where 'it works' actually means 'it works for a person'. The cost is that E2E suites are slow, flaky, and expensive to maintain because every layer the user touches (browser, network, backend, third-party APIs) sits in the test path. Most of the maintenance cost has historically been in one place: the selector strings that point at DOM elements. Change a className or rewrite a component and dozens of tests break with errors that have nothing to do with the user-visible behavior.
Is the test artifact in Assrt actually plain Markdown? What parses it?
Yes. The plan lives at /tmp/assrt/scenario.md and is split into cases by one regex: /(?:#?\s*(?:Case|Scenario|Test))\s*\d*[:.]\s*/gi at src/app/app/test/page.tsx line 451. There is no grammar, no AST, no proprietary YAML schema, no .config file, no compile step. Every line that is not a #Case header is plain English instruction passed straight to the model. The metadata sidecar at /tmp/assrt/scenario.json holds the id, name, and url; results land in /tmp/assrt/results/latest.json. The same Markdown file is the editing surface for both humans and agents. fs.watch with a 1-second debounce syncs edits back to cloud automatically. You can grep your scenarios. You can diff them in a PR. You can keep them in your repo. None of that is true of a vendor visual-editor scenario.
If there's no .spec.ts file, what actually runs the test?
Assrt embeds @playwright/mcp, the official Playwright MCP server, and spawns it over stdio at scenario start (assrt-mcp/src/core/browser.ts launches a local node process pointing at the cli.js inside the @playwright/mcp package). The LLM agent (Claude Haiku 4.5 by default, model can be overridden) reads the case text and emits tool calls one at a time: navigate, snapshot, click, type_text, wait_for_stable, assert. Playwright MCP executes each call against a real Chromium and returns the result. There is no Assrt-owned test runner; the engine running your test is the same MCP server the Playwright team ships, and if you uninstall Assrt tomorrow your scenario.md files are still readable English. That portability is the point.
Where do CSS selectors live in this architecture? Why are they not in my test?
They live in the live page, and they are looked up at click time, not at authoring time. Every snapshot tool call returns the page as an accessibility tree where each focusable element has an opaque id like e5 or e12. When the agent decides to click 'Add to cart', it asks for the snapshot first, finds the matching element in the tree, and passes ref=e5 to Playwright MCP, which resolves the id on the live page. If your team renames a button class from .btn-primary-v2 to .btn-cta the test does not break, because no test ever named that class. Selectors are recomputed on every snapshot, so they cannot drift between authoring and execution.
Why is MAX_STEPS_PER_SCENARIO = Infinity at agent.ts line 7? Is that safe?
It is the only honest setting for an LLM-driven runner. A hard cap (25 steps, 50 turns) is a quiet failure mode: if a scenario legitimately needs 28 actions, the test reports FAILED with an out-of-steps summary and a human has to guess whether the test was wrong or the cap was too tight. Assrt sets the per-scenario cap to Infinity (and MAX_CONVERSATION_TURNS likewise) and lets the agent exhaust its own budget naturally. The model ends a scenario by calling complete_scenario, runs out of API retries, or hits a CLI/CI-level timeout you configured. Time and money budgets belong at the orchestration layer, not as a magic number deep inside the runner.
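In code, "budgets belong at the orchestration layer" means a wall-clock wrapper around the run rather than a step counter inside it. The run callback below stands in for the real agent loop; this is a sketch of the pattern, not Assrt's API.

```typescript
// Enforce a time budget at the orchestration layer: signal an abort when
// the wall clock runs out instead of capping the scenario's step count.
// The runner receives an AbortSignal it can honor between turns.
async function withBudget<T>(
  run: (signal: AbortSignal) => Promise<T>,
  budgetMs: number
): Promise<T> {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), budgetMs);
  try {
    return await run(ctrl.signal);
  } finally {
    clearTimeout(timer);
  }
}
```

A 28-action scenario under a generous time budget completes; a looping scenario gets cut off by the clock, with no magic step number to second-guess.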
How are waits handled if there are no fixed millisecond timeouts in the plan?
wait_for_stable injects a MutationObserver into the page, counts DOM mutations into a window global, and returns as soon as N consecutive seconds pass without a new mutation (default N is 2, ceiling is 30 seconds). A login that finishes in 1.4 seconds returns in 1.4 seconds; a streaming chat response returns when the streaming actually stops; a slow API call returns when the spinner is gone. There is no magic number sitting in your test that will be wrong on a slower machine next month. The cleanup after the wait disconnects the observer and deletes the window globals, so your app never sees the instrumentation.
How does this compare to Cypress, Playwright, Selenium, and Cucumber?
Cypress and Playwright tests are TypeScript files with chained selector calls (cy.get('button').click(), page.getByRole('button', { name: 'Login' }).click()). Selenium tests are usually Java or Python with WebDriver locators. Cucumber tests are Gherkin .feature files with Given/When/Then sentences that look like English but are matched by regex into glue code you still write in Ruby/Java/JS. Closed-source SaaS tools (Momentic, QA Wolf) ship visual editors that produce proprietary YAML or scenario records inside their cloud. In every one of those, the artifact is structurally tied to the runner. Assrt's artifact is plain Markdown that the runner has no claim on; if the open-source project disappeared tomorrow your scenario.md files would be readable as documentation. The runner is @playwright/mcp, which the Playwright team maintains.
What does it cost to run a single scenario? Where does the LLM bill come from?
Each turn sends the Anthropic API a list of messages that includes the latest accessibility snapshot (text, capped at 3000 chars) plus a JPEG screenshot of the viewport. Default model is claude-haiku-4-5-20251001, which is the cheapest Anthropic model that produces reliable tool calls; you can override with --model. A typical 8 to 12-turn scenario sits in the cents range at Haiku list price, and Assrt's sliding-window logic trims the conversation at assistant/model boundaries so a long scenario does not blow the context window. Self-hosted Chromium runs on your machine or your CI; there is no per-test seat license. Compare with hosted SaaS competitors at four or five figures a month for a fixed seat count.
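The sliding-window trim can be sketched with a crude character budget standing in for tokens. The boundary rule (cut only at assistant-turn boundaries) follows this page's description; everything else, including the Msg type, is illustrative.

```typescript
// Hypothetical message shape; real Anthropic messages carry richer content.
type Msg = { role: "user" | "assistant"; content: string };

// Drop the oldest turns, always cutting just past an assistant message so
// the surviving window still starts on a user turn, until under budget.
function trimWindow(messages: Msg[], maxChars: number): Msg[] {
  let msgs = [...messages];
  const size = () => msgs.reduce((n, m) => n + m.content.length, 0);
  while (size() > maxChars && msgs.length > 2) {
    const i = msgs.findIndex((m) => m.role === "assistant");
    msgs = msgs.slice(i === -1 ? 1 : i + 1);
  }
  return msgs;
}
```

Trimming at turn boundaries matters because a dangling assistant reply with no preceding user turn is an invalid conversation for most chat APIs.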
Can I run E2E tests against authenticated routes without baking credentials into the test file?
Yes. With --extension Assrt connects to your existing Chrome profile via Playwright's extension mode and inherits the cookies, sessions, and login state already in your browser. With --isolated false (the default) it persists a Chromium profile under ~/.assrt/browser-profile across runs so a one-time login carries over. For email-loop flows like signup-with-OTP, create_temp_email spins up a disposable inbox and wait_for_verification_code polls it for up to 120 seconds; the system prompt hard-codes a ClipboardEvent + DataTransfer paste expression for the common 6-digit split-input pattern. None of that is in your scenario.md, which keeps the test artifact free of secrets and free of timing magic.
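The email-loop step can be sketched as a polling helper. The 120-second window comes from this page; fetchLatest and the 6-digit regex are hypothetical stand-ins for whatever wait_for_verification_code does internally.

```typescript
// Poll a disposable inbox until a 6-digit code shows up or time runs out.
// fetchLatest is an injected reader returning the newest message body.
async function waitForCode(
  fetchLatest: () => Promise<string | null>,
  timeoutMs = 120_000,
  intervalMs = 2_000
): Promise<string> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const body = await fetchLatest();
    const m = body?.match(/\b(\d{6})\b/); // common 6-digit OTP pattern
    if (m) return m[1];
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error("verification code did not arrive in time");
}
```

The point of keeping this inside the run, rather than in the scenario file, is that the Markdown never mentions inboxes, intervals, or regexes at all.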
How do I review an Assrt run after it finishes?
Every run writes three things: /tmp/assrt/scenario.md (the plan as it ran), /tmp/assrt/results/<runId>.json (per-assertion pass/fail with free-form evidence strings), and a WebM video of the entire browser session. The CLI passes --video --json so a CI job can capture the structured JSON and link the recording. The MCP server (used inside Claude Code, Cursor, etc.) auto-opens the video player at the end of a run so a human can scrub at 5x or 10x. There is no proprietary dashboard you have to log into; the artifacts are local files and you can grep them.