A field guide sorted by one axis, not by feature count

E2E testing tools, grouped by the one thing that decides maintenance cost: what each tool stores

Almost every guide on this topic lists the same fifteen tools against the same feature grid: parallel browsers, auto-waiting, retries, price per seat. Those rows have converged until they are nearly identical. The row nobody draws is the one that actually predicts your maintenance bill: when you author a test, what gets written to disk, and who has to fix it when the UI moves. Sorted that way, the landscape falls into three clean groups.

Matthew Diakonov, Written with AI

Published May 30, 2026 · 9 min read

Direct answer (verified 2026-05-30)

E2E testing tools split into three groups by how a test is authored and what gets stored. Code-first frameworks (Playwright, Cypress, Selenium, WebdriverIO) store hand-written selectors in code you commit. Cloud-recorded platforms (testRigor, Mabl, Rainforest QA, and managed services like QA Wolf) store recorded steps in the vendor's own format inside their cloud. Agent-resolved runners (Assrt) store a plain-English Case in your repo and resolve each element at runtime against the live page, so no selector is pinned. The right group depends on who writes the tests and how often your UI changes, not on which feature grid is longest.

Tool landscape cross-checked against current public guides on end-to-end testing tools. Assrt internals verified directly from source at assrt-mcp/src/core/browser.ts.

The landscape, sorted by what gets written to disk

Read this top to bottom as a spectrum of how tightly your test is bound to today's markup. The further down you go, the less there is to break and the less there is to maintain, at the cost of giving up some byte-level control.

Tool	What you author	What gets stored	What breaks it
Group 1 — Code-first frameworks
Playwright	Selectors + assertions in TS/JS/Python	Your test code (CSS/XPath locators)	Renamed class, restructured DOM, moved element
Cypress	Selectors + assertions in JS	Your test code in your repo	Selector drift; cross-origin and tab limits
Selenium / WebdriverIO	Selectors + driver glue code	Your test code + driver config	Selector drift; driver and wait flakiness
Group 2 — Cloud-recorded platforms
testRigor	Plain-English steps in their editor	Steps in the vendor cloud, proprietary format	Subscription lapse; lossy or no export
Mabl	Click-through recording, low-code	Recording in the vendor cloud	Account access; export limits
Rainforest QA / QA Wolf	No-code editor, or a managed team writes them	In the vendor cloud / vendor infra	Contract terms; you do not own the runtime
Group 3 — Agent-resolved runners
Assrt	A plain-English #Case block	Markdown in your repo; no selector pinned	Rarely; element re-resolved at runtime

Group 2 descriptions reflect each vendor's documented authoring model; exact export and pricing terms change, so confirm them on the vendor's own site before you commit.

Why "what gets stored" is the axis that matters

Pick any two tools from Group 1 and their feature pages are nearly interchangeable. Both run parallel browsers. Both auto-wait. Both screenshot and integrate with CI. If you choose between them on features, you are splitting hairs. The number that is not interchangeable, and almost never written on a comparison page, is how tightly the test you just wrote is bound to the exact markup that existed the day you wrote it.

A stored selector is a bet that the DOM will not change. When a developer renames btn-primary to btn-cta, every test that pinned that class fails, and someone has to go fix selectors instead of shipping. That is the hidden line item: not the cost of writing the suite, but the cost of keeping it green as the product moves underneath it. The longer that bill runs, the more teams quietly stop running the suite.
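To make the bet concrete, here is a minimal sketch of the rename described above. It uses a hypothetical, simplified element model (FakeElement, findByClass, and findByText are illustration names, not a browser or Playwright API) and only shows why a lookup pinned to a class name fails after the rename while a lookup bound to the visible label does not.

```typescript
// Hypothetical, simplified model of a DOM element; not a real browser API.
interface FakeElement {
  tag: string;
  classes: string[];
  text: string;
}

// Lookup pinned to markup: succeeds only while the class name is unchanged.
function findByClass(page: FakeElement[], tag: string, className: string): FakeElement | undefined {
  for (const el of page) {
    if (el.tag === tag && el.classes.some((c) => c === className)) {
      return el;
    }
  }
  return undefined;
}

// Lookup bound to what the user sees: the visible label.
function findByText(page: FakeElement[], tag: string, label: string): FakeElement | undefined {
  const wanted = label.trim().toLowerCase();
  for (const el of page) {
    if (el.tag === tag && el.text.trim().toLowerCase() === wanted) {
      return el;
    }
  }
  return undefined;
}

// The page on the day the test was written.
const pageBefore: FakeElement[] = [{ tag: "button", classes: ["btn-primary"], text: "Sign in" }];
// The same page after a developer renames btn-primary to btn-cta.
const pageAfter: FakeElement[] = [{ tag: "button", classes: ["btn-cta"], text: "Sign in" }];

const classHitBefore = findByClass(pageBefore, "button", "btn-primary") !== undefined; // true
const classHitAfter = findByClass(pageAfter, "button", "btn-primary") !== undefined; // false: the bet was lost
const textHitAfter = findByText(pageAfter, "button", "Sign in") !== undefined; // true: the label did not move
```

Nothing about the button's behavior changed between the two pages; only the stored string stopped matching.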

The two ways out are not symmetric. Group 2 removes the selector you write, but replaces it with a recording you do not own, stored in a format you cannot read, on infrastructure you rent. You traded maintenance for lock-in. Group 3 removes the stored selector entirely and keeps the authored intent in your repo as readable text. That is the trade this page is really about.

The same test, stored two ways

The code-first sign-in test below is the selector-storing version: the brittle parts are the selector strings you commit. Written as an Assrt Case, the same scenario has no selector to drift, because the binding from "Sign in" to an actual element happens at runtime, not at authoring time.

What you store: selector vs intent

import { test, expect } from '@playwright/test';

test('user can sign in', async ({ page }) => {
  await page.goto('http://localhost:3000');
  // a stored selector: breaks when the class
  // or DOM structure of this button changes
  await page.click('button.btn-primary[data-testid="login"]');
  await page.fill('#email', 'demo@example.com');
  await page.fill('#password', 'hunter2');
  await page.click('form > div.actions > button.submit');
  await expect(page.locator('.dashboard-header')).toBeVisible();
});

The equivalent Assrt Case is 36% fewer lines and contains zero selectors.

How a Case finds the button without a stored selector

This is the part you cannot read off a feature grid, so here is the actual mechanism. Before every action, the agent asks the page what is on it. The runtime is real Playwright driven through @playwright/mcp, and the agent has roughly eighteen tools it can call (navigate, snapshot, click, type_text, scroll, press_key, evaluate, assert, and more). The one that matters here is snapshot: it returns the page's accessibility tree with short handles like [ref=e5] attached to each interactive element.
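As a rough illustration of what working with such a snapshot could look like, the sketch below parses a few lines of accessibility-tree text and looks up a ref by accessible name. The line layout (`- role "name" [ref=eN]`), the sample page, and the names parseSnapshot and resolveRef are assumptions made for this example; the exact snapshot format and the agent's real matching logic are not shown in this article, and the exact-name match here is only a stand-in for the agent's interpretation of your English.

```typescript
// One interactive element recovered from a snapshot line.
interface SnapshotNode {
  role: string;
  name: string;
  ref: string;
}

// Assumed line shape: - button "Sign in" [ref=e5]
const SNAPSHOT_LINE = /^\s*-\s+(\w+)\s+"([^"]*)".*\[ref=(\w+)\]/;

function parseSnapshot(snapshot: string): SnapshotNode[] {
  const nodes: SnapshotNode[] = [];
  for (const line of snapshot.split("\n")) {
    const match = SNAPSHOT_LINE.exec(line);
    if (match) {
      const [, role = "", name = "", ref = ""] = match;
      nodes.push({ role, name, ref });
    }
  }
  return nodes;
}

// Stand-in for the agent's matching step: exact, case-insensitive name match.
function resolveRef(nodes: SnapshotNode[], label: string): string | undefined {
  const wanted = label.trim().toLowerCase();
  for (const node of nodes) {
    if (node.name.trim().toLowerCase() === wanted) {
      return node.ref;
    }
  }
  return undefined;
}

// Hypothetical snapshot of a sign-in page.
const sampleSnapshot = [
  '- textbox "Email" [ref=e3]',
  '- textbox "Password" [ref=e4]',
  '- button "Sign in" [ref=e5]',
  '- link "Forgot password?" [ref=e6]',
].join("\n");

const snapshotNodes = parseSnapshot(sampleSnapshot);
const signInRef = resolveRef(snapshotNodes, "Sign in"); // "e5"
```

The point of the handle is that it is issued fresh on every snapshot, so nothing in your repo ever needs to mention it.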

The agent matches your English ("Click Sign in") to a ref and clicks it. If that ref has gone stale because the DOM changed between snapshot and click, the click handler in assrt-mcp/src/core/browser.ts (lines 526 to 576) does not give up. It scans every clickable element on the page, scores each by how well its text overlaps your instruction, and clicks the best match. No CSS path is ever pinned, which is the concrete reason a renamed class usually does not break the test.

One click, resolved at runtime

This is what "self-healing" means when you read the code instead of the marketing: not a model guessing, but a deterministic fallback from a stale reference handle to a text-scored scan of the live page. The selector was never the source of truth. The page was.
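The fallback described above can be sketched in a few lines. This is not the code from assrt-mcp/src/core/browser.ts; it is a minimal illustration of the general idea, assuming a simple scoring rule (the fraction of instruction words that also appear in an element's text). The names Clickable, tokenize, overlapScore, and bestMatch are hypothetical.

```typescript
// A clickable element reduced to the one thing the fallback scores: its text.
interface Clickable {
  text: string;
}

// Lowercase and split on anything that is not a letter or digit.
function tokenize(s: string): string[] {
  return s.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Fraction of instruction tokens that also appear in the element's text.
function overlapScore(instruction: string, elementText: string): number {
  const wanted = tokenize(instruction);
  if (wanted.length === 0) {
    return 0;
  }
  const present = tokenize(elementText);
  const hits = wanted.filter((t) => present.indexOf(t) !== -1).length;
  return hits / wanted.length;
}

// Deterministic scan: highest score wins, earliest element wins ties,
// and an element with no overlap at all is never chosen.
function bestMatch(instruction: string, candidates: Clickable[]): Clickable | undefined {
  let best: Clickable | undefined;
  let bestScore = 0;
  for (const candidate of candidates) {
    const score = overlapScore(instruction, candidate.text);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}

// Hypothetical clickable elements found on the live page.
const clickables: Clickable[] = [
  { text: "Sign up" },
  { text: "Sign in to your account" },
  { text: "Forgot password?" },
];

const chosen = bestMatch("Click Sign in", clickables);
// chosen is the "Sign in to your account" element: 2 of 3 instruction words match,
// versus 1 of 3 for "Sign up" and 0 for "Forgot password?".
```

Because the scan is a plain loop over the live page with a fixed tie-break, the same page and the same instruction always produce the same click, which is what makes the fallback deterministic rather than a guess.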

Where your tests live when you stop paying

The portability question is the one Group 2 tools answer worst, and it is worth asking before you adopt anything. With a code-first framework, walking away costs nothing: the code is already yours. With a cloud-recorded platform, walking away can mean re-authoring the entire suite, because the recording lived in a format you could not read on infrastructure you did not own.

Assrt is built to be left. Each plan is written to /tmp/assrt/scenario.md as plain Markdown #Case blocks with a UUID, ready to copy into your repository and review in a pull request. The execution is standard Playwright. So the worst case of abandoning Assrt is that you keep a folder of readable test files and an MIT-licensed runner. That is the opposite of lock-in, and it is by design.

# run the agent-resolved approach on your own app

$ npx assrt-mcp

# add it to your Claude Code or Cursor config, then ask it to test a URL

# or a one-shot CLI run without the MCP layer:

$ npx assrt run --url http://localhost:3000 \
    --plan "sign in and reach the dashboard"

Specific questions about the categories, storage, and lock-in

What are the main categories of e2e testing tools?

Three, sorted by how a test gets authored and what is written to disk. Code-first frameworks (Playwright, Cypress, Selenium, WebdriverIO) ask you to write selectors and assertions in code; the file you commit contains CSS or XPath locators. Cloud-recorded platforms (testRigor, Mabl, Rainforest QA, and managed services like QA Wolf) capture a click-through recording or a plain-English script and store the steps in the vendor's own format inside their cloud. Agent-resolved runners (Assrt) store a plain-English Case in your repo and resolve each element at runtime against the live page, so no selector is pinned anywhere. The decision is less about feature checklists and more about which of these three storage models matches who writes your tests and how often your UI moves.

Why group e2e testing tools by 'what gets stored' instead of by features?

Because features converge and storage models do not. Playwright, Cypress, WebdriverIO, and most commercial platforms all ship parallel browsers, auto-waiting, screenshots, and CI hooks; the differences between their feature grids are mostly stylistic. What does not converge is the artifact you are left holding after you author a test. A hand-written selector breaks the moment a developer renames a class. A recording locked in a vendor cloud breaks when your contract lapses or the export is lossy. A plain-English Case re-resolved at runtime survives small UI edits and stays in your repo. The artifact, not the feature list, is what determines your maintenance bill six months in.

Is Assrt a real e2e testing tool or a wrapper?

It is a real runner whose execution layer is Playwright. Under the hood the agent drives a Chromium instance through @playwright/mcp, so every action it takes (navigate, click, type_text, scroll, press_key, evaluate, and a dozen more) is a real Playwright action against a real browser. What is different is the authoring and storage layer above that engine: instead of writing page.click with a selector string, you write an English Case, and the agent decides which element to act on at runtime from the page's accessibility tree. You get Playwright's execution reliability without writing or maintaining the selectors yourself.

How does an agent-resolved tool click the right button without a stored selector?

It asks the page what is on it, every run. Before acting, the agent calls a snapshot tool that returns the accessibility tree with short reference handles like [ref=e5] attached to each interactive element. The agent matches your English instruction ("click Sign in") to a ref and clicks it. If that ref has gone stale because the DOM changed, the code in assrt-mcp/src/core/browser.ts (lines 526 to 576) falls back to scanning every clickable element and scoring them by text overlap with your instruction, then clicks the best match. Nothing is pinned to a brittle CSS path, which is the mechanical reason a small UI tweak usually does not break the test.

Which e2e testing tools lock you in, and which let you leave?

Code-first frameworks never lock you in: the test code is yours and runs on any CI. Cloud-recorded platforms vary, but the recording usually lives in their format inside their dashboard, and leaving means re-authoring; managed services additionally own the infrastructure the tests run on. Assrt is built to be leavable on purpose: the plan is written to /tmp/assrt/scenario.md as plain Markdown Case blocks you can copy into your repo, and the execution is standard Playwright, so the worst case of walking away is that you keep a folder of readable test files and an open-source MIT runner.

Does skipping the selector mean I lose repeatability?

No. Every plan is saved as Markdown to /tmp/assrt/scenario.md with a UUID, and you re-run the exact same text by passing that scenario id back to the test tool. The Markdown is deterministic and commit-ready, so you diff it and review it in a pull request like any other artifact. What is re-resolved on each run is only the element lookup, against the live accessibility tree. The plan itself stays fixed; the binding from instruction to element is what stays flexible.

When is a code-first framework still the better choice?

When a seasoned QA team is maintaining a large, mature suite. Playwright in experienced hands is still the right default once you are several hundred scenarios deep and want byte-level control over fixtures, network mocking, and parallelization. The agent-resolved approach wins earlier: for a founder or PM writing the first smoke tests, for an app whose UI changes weekly, or for teams that keep skipping e2e entirely because the selector-maintenance tax is too high. Many teams run both, with Assrt covering breadth and a hand-written Playwright suite covering the deep paths.

What does Assrt cost compared with commercial e2e platforms?

The runner, the CLI, and the MCP server are MIT-licensed and free. The only usage-based cost is the LLM API calls the agent makes to plan and interpret your Cases; by default it uses claude-haiku-4-5-20251001 with your own Anthropic key, and a plan-then-run loop typically costs cents. Commercial platforms in the cloud-recorded group price as per-seat subscriptions or annual managed contracts, and the spend continues whether or not you can export your tests. There is an optional hosted dashboard at app.assrt.ai for sharing runs, free for individual use.

How do I try the agent-resolved approach on my own app?

Run the MCP server with npx assrt-mcp and add it to your Claude Code or Cursor config, then ask the assistant to plan and run tests against a URL. If you prefer a one-shot CLI run without the MCP layer, use npx assrt run --url http://localhost:3000 --plan "sign in and reach the dashboard". Either path launches a real browser, resolves elements at runtime, writes the plan to disk as Markdown, and reports pass or fail with screenshots. The generated files are yours to keep.