E2E testing best practices

Nine e2e testing best practices, one root cause, and the rule every SERP misses.

Read any top-ten list on e2e testing best practices and you will see the same nine rules: use stable selectors, adopt the Page Object Model, avoid hard-coded waits, isolate your tests, fix flakes first, run in parallel, write from the user's perspective, pick data-testid over CSS, run on every PR. Every one of them is correct. Every one of them is patching the same thing: a selector persisted in a test file. Pull that root cause out and eight rules change shape and one disappears.

Matthew Diakonov
12 min read
4.8 from Assrt MCP users
Selectors resolved at runtime, zero persisted in the plan
Real Playwright underneath, not a proprietary YAML
Open-source and self-hosted, no vendor account required

Best practice zero

Nothing in your test file should outlive a single run.

Every rule on the top-ten list (data-testid, POM, explicit waits, isolated fixtures, quarantine) is a mitigation for the same underlying failure: a string in a test file gets out of sync with the app. Drop the persisted selector and the nine rules collapse into one primitive: resolve against the live accessibility tree on every click, then throw it away.

The nine practices every e2e testing guide agrees on

Scraped from the top-ranked 2026 guides on this keyword (Leapwork, Keploy, Bunnyshell, Katalon, Shiplight, Virtuoso, OneUptime). The nine below are endorsed by at least seven of the ten. Your mileage may vary on ordering; the list itself is stable.

The consensus list

  • Keep E2E to 5-10% of the test suite; cover only critical user journeys.
  • Use stable selectors (data-testid), never CSS classes or XPath.
  • Adopt the Page Object Model so locators live in one place.
  • Isolate tests: every test creates and cleans up its own data.
  • Never use hard-coded sleeps; prefer explicit waitFor conditions.
  • Fix flaky tests before writing new ones; quarantine what you can't fix.
  • Run in parallel across workers; shard across CI nodes.
  • Write tests from the user's perspective, not the DOM's.
  • Integrate into CI on every PR, in an ephemeral preview env.

Read them as a bundle and they look like a checklist of independent practices. Read them as a taxonomy and they are variations on one theme: something your tests depend on (a selector, a timeout, a fixture, a worker count) does not stay true between when you wrote it and when the test next runs. The app drifts. Each rule is a particular shape of that same drift.

Six drift surfaces, six runtime alternatives

Each card below is a best practice written as a drift surface, paired with what replaces it when your plan is English and your runtime resolves selectors on the fly.

Stable selectors

Classical mitigation: pick data-testid attributes and pray the frontend team preserves them. Drift surface: every testid string in your repo is a thing that can disappear in a redesign. Runtime-resolved alternative: the agent calls snapshot before each click, reads the accessibility tree, matches role + name from your English step. Nothing to preserve.
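The role + name matching described above amounts to a walk over the snapshot tree. A minimal sketch of the idea, with a hypothetical node shape (`AxNode` and `resolveRef` are illustrative names; Assrt's real snapshot format from Playwright MCP may differ):

```typescript
// Hypothetical accessibility-node shape; the real snapshot format
// Assrt receives from Playwright MCP may differ.
interface AxNode {
  role: string;
  name: string;
  ref: string;
  children?: AxNode[];
}

// Resolve an English step like "Click Get started" against the live
// tree: first node whose role matches and whose name contains the text.
function resolveRef(node: AxNode, role: string, nameContains: string): string | null {
  const needle = nameContains.toLowerCase();
  if (node.role === role && node.name.toLowerCase().includes(needle)) {
    return node.ref;
  }
  for (const child of node.children ?? []) {
    const hit = resolveRef(child, role, nameContains);
    if (hit) return hit;
  }
  return null; // nothing matched: the agent re-snapshots and retries
}
```

Because the `ref` comes out of a fresh snapshot each time, there is nothing to persist between runs.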

Page Object Model

Classical mitigation: a pages/ directory so locators live in one place. Drift surface: the POM files themselves. When the UI refactors, you touch two code bases. Runtime alternative: no POM; the plan is the page object, expressed as an English sentence per step.

Explicit waits, never sleeps

Classical mitigation: await locator.waitFor({ state: 'visible', timeout: 15000 }). Drift surface: timeout budgets you tune per test. Runtime alternative: wait_for_stable injects a MutationObserver into the page and returns as soon as the DOM has been silent for N seconds. Fast pages finish fast, slow pages block longer, you never pick a number.
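The silence-detection idea can be sketched outside the browser. This is not Assrt's implementation: the MutationObserver tally is abstracted behind a `getMutationCount` callback (an assumption for testability) so the reset-on-mutation loop is visible:

```typescript
// Sketch of "return after N seconds of DOM silence". getMutationCount
// stands in for a MutationObserver's running mutation tally.
async function waitForStable(
  getMutationCount: () => number,
  stableMs = 2000,   // default mirrors the 2-second silence window
  timeoutMs = 30000, // outer budget so a chatty page cannot block forever
  pollMs = 50,
): Promise<boolean> {
  const start = Date.now();
  let lastCount = getMutationCount();
  let quietSince = Date.now();
  while (Date.now() - start < timeoutMs) {
    const count = getMutationCount();
    if (count !== lastCount) {
      lastCount = count;
      quietSince = Date.now(); // mutation seen: reset the silence clock
    } else if (Date.now() - quietSince >= stableMs) {
      return true; // DOM has been silent long enough
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  return false; // outer timeout hit while the page was still mutating
}
```

A fast page returns as soon as its last mutation is `stableMs` old; a slow page simply keeps resetting the clock until it settles.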

Isolated, self-cleaning tests

Classical mitigation: fixtures that create data up front and delete it after. Drift surface: the fixture code plus the schema it assumes. Runtime alternative: create_temp_email for signup flows, evaluate for JS-side setup, and a shared-browser opt-in for when you want auth to carry across cases.

Quarantine flakes

Classical mitigation: a flakiness dashboard and a quarantine list. Drift surface: the quarantine itself. Runtime alternative: the system prompt's Error Recovery block tells the agent to re-snapshot on failure, try a different ref, scroll and retry, before marking failed. Stale-ref flake disappears because refs are per-turn.

User perspective

Classical mitigation: coding the perspective via byRole and byLabel queries. Drift surface: the queries themselves. Runtime alternative: the plan is the user perspective, verbatim. 'Click Get started' is the sentence your PM would write, and it is also what ships.

The anchor fact: the four-paragraph rule that removes persisted selectors

Here is the actual system prompt Assrt sends the model on every test run. The three blocks below (CRITICAL Rules, Selector Strategy, Error Recovery) are what replace the nine best practices at runtime. You can verify this in the source: assrt-mcp/src/core/agent.ts:207-225.

assrt-mcp/src/core/agent.ts:207-225

Snapshot first, ref from the accessibility tree, re-snapshot if stale. That is the entire selector strategy.

assrt-mcp/src/core/agent.ts, SYSTEM_PROMPT

Notice what is not in that prompt. No locator syntax, no data-testid convention, no Page Object advice, no wait timeout recommendation. The model is told: call snapshot, read refs, click with a ref, re-snapshot on failure. Four rules. Everything else on a classical best-practices list is downstream.

Same flow, same Playwright: persisted selectors vs runtime-resolved

Below, the same flow two ways: first, a typical Playwright checkout spec that follows the classical best practices; second, the same journey as a Markdown scenario.md Assrt executes against the same Playwright underneath. Count the strings in the spec that outlive the run and have no equivalent in the Markdown.

persisted vs runtime

// e2e/checkout.spec.ts
import { test, expect } from "@playwright/test";

test("add to cart and checkout", async ({ page }) => {
  await page.goto("/products/42");

  // Practice: "use stable selectors", "prefer data-testid"
  // Reality: five places you now have to keep in sync with the app.
  await page.locator('[data-testid="add-to-cart"]').click();
  await page.locator('[data-testid="cart-badge"]').waitFor();
  await page.locator('[data-testid="open-cart"]').click();
  await page.locator('[data-testid="checkout-button"]').click();
  await page
    .locator('[data-testid="checkout-email"]')
    .fill("a@b.com");

  // Practice: "avoid hard-coded waits"
  // Reality: you still guess a number, and it is always wrong
  // in one direction on one day.
  await page.waitForTimeout(1500);

  await expect(
    page.locator('[data-testid="order-confirmation"]')
  ).toBeVisible();
});
76% fewer lines
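The Markdown half of the comparison did not survive extraction here. The sketch below is a reconstruction of what that scenario.md side looks like; the exact step wording and #Case header syntax are assumptions inferred from the format described elsewhere on this page:

```markdown
# Checkout

#Case: add to cart and checkout
- Go to /products/42
- Click "Add to cart"
- Open the cart and click "Checkout"
- Enter a@b.com as the email
- Pass if the order confirmation is visible
```

No selector, testid, or timeout number appears anywhere; each line is resolved against the live accessibility tree at run time.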

The spec file is not bad code. It follows every rule from the top-ten SERPs. It is still six selectors and one timeout number you own forever. The Markdown plan is the same journey with zero selectors and zero timeout numbers. When the product team renames a button, the spec breaks and the plan does not, because the accessibility tree still exposes a button whose name contains the English word you wrote.

Four numbers to keep in mind

Everything above is derivable from these four. Eighteen tools is the total action surface of a test run. Zero is how many selectors sit in scenario.md. Two is the default seconds-of-DOM-silence for wait_for_stable. Seven is the number of verification-code patterns wait_for_verification_code matches in priority order before giving up.

18
Tools the agent can call
0
Selectors in scenario.md
2
MutationObserver target (seconds silent)
7
OTP regex fallbacks, priority-ordered
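The priority-ordered OTP matching can be sketched as a first-match-wins scan. The three sample patterns below are illustrative assumptions, not the seven real entries in assrt-mcp:

```typescript
// Illustrative patterns only; the real, seven-entry priority list lives
// in assrt-mcp. Most-specific patterns come first, so an explicitly
// labeled code beats a bare digit run.
const CODE_PATTERNS: RegExp[] = [
  /verification code\D*(\d{4,8})/i, // explicit label wins
  /\b(\d{6})\b/,                    // bare 6-digit code
  /\b(\d{4})\b/,                    // 4-digit last-resort fallback
];

function extractCode(emailBody: string): string | null {
  for (const pattern of CODE_PATTERNS) {
    const match = emailBody.match(pattern);
    if (match && match[1]) return match[1]; // first pattern to hit wins
  }
  return null; // nothing matched; the caller keeps polling the inbox
}
```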

Plan, drift pressure, runtime resolution

The diagram below traces the pipeline. On the left, the three things a reader controls: the English plan, the list of pass criteria, and any variables for parameterization. Everything on the right is a runtime output: the Playwright call, the assert record, the result file, and the video. The middle is the only stateful layer, and it holds nothing between runs.

Plan in, assertions out, nothing persisted in the middle

scenario.md + passCriteria + variables → agent loop → Playwright MCP → assert records → results/latest.json + recording.webm

What lives in your repo, before and after

A classical e2e suite that follows every best practice ends up as several folders of fixtures, page objects, spec files, and configuration. Every file is a drift surface. The runtime-resolved equivalent is a single Markdown file.

Repo layout for one checkout journey

your-app/e2e/
├─ fixtures/
│  ├─ inbox.ts            (42 lines)
│  ├─ auth.ts             (38 lines)
│  └─ seed.ts             (90 lines)
├─ pages/
│  ├─ ProductPage.ts      (74 lines)
│  ├─ CartPage.ts         (51 lines)
│  └─ CheckoutPage.ts     (88 lines)
├─ specs/
│  └─ checkout.spec.ts    (142 lines with locators)
├─ playwright.config.ts   (91 lines)
└─ package.json           (+ 6 devDeps)

Total: 626 lines of test code. Every testid, page object, timeout and fixture is a drift surface tied to the current UI.

  • Six files of locators and fixtures
  • One playwright.config.ts and six devDeps
  • 626 lines that must track the UI

What a run actually looks like

The trace below is the eight-line plan from the previous section, executed headless against a local dev server. Note every click line is followed by a ref like [ref=e23] that came out of the snapshot one step before. That ref did not exist before the run and will not exist after it.

assrt run (trimmed)

Best practice, classical answer, runtime-resolved answer

Nine practices; same Playwright underneath either way. The right column is Assrt.

| Practice | Classical Playwright suite | Assrt (runtime-resolved) |
| --- | --- | --- |
| Stable selectors (data-testid) | Every clickable element gets a testid; coordinated with frontend | No testids needed; role + name from the accessibility tree, per click |
| Page Object Model | pages/ directory of classes wrapping locators | Plan is the page object; no wrapper layer |
| Explicit waits | waitForSelector / waitForLoadState with a timeout number | wait_for_stable: MutationObserver, returns on DOM silence |
| Isolated test data | Per-test fixtures + teardown hooks | create_temp_email + evaluate + shared-browser opt-in |
| Flake quarantine | Dashboard, retry flag, quarantine list | Per-turn re-snapshot; refs resolve to the live DOM every time |
| Parallel / sharded | Worker count tuning, shard index config | Each #Case is a conversation; run them in parallel processes |
| User-perspective assertions | byRole / byLabel queries you author | English assertion in the #Case; agent matches role + name at runtime |
| CI integration | Playwright config + CI YAML + env-specific URLs | Any URL; --json stdout; self-hosted runner, no vendor account |
| Vendor lock-in | Framework neutral; high lock-in for SaaS runners ($7.5K/mo+) | Open-source, self-hosted, real Playwright underneath |

The practice every SERP misses

None of the top ten guides on this keyword mention it, so it gets to be the one original idea on this page: write your test plan so nothing in it is specific to the current UI. That means no selectors, no testids, no timeout numbers, no worker count, no POM classes, no fixture shapes, no config files. If the only thing in your repo is an English statement of intent, then the day the app changes you do not own a migration; the accessibility tree changed shape and the runtime resolved the new shape for you. This is not an argument to throw away every classical test. It is an argument that the default unit of work for e2e should be an English sentence, and the spec file should exist only where determinism is contractual (legal audit, pixel regression, sub-90-second CI gates).

Everything else on a best-practices list is a cope for the fact that you chose to persist. Stop persisting and most of the list evaporates.

See your app run with zero persisted selectors

Twenty minutes. Bring one of your real flows. We show you the eight-line plan equivalent of your current spec file, executed live, and hand you the scenario.md at the end.

Book a call

Questions the top-ten e2e testing best practices guides leave unanswered

Which e2e testing best practice is universally agreed on but actually wrong?

The universal rule is 'use stable selectors, preferably data-testid.' It is not wrong in a classical suite, but it is treating a symptom. The disease is that a selector is persisted in your test file at all: once persisted, any app change can break it and now you own the drift. Assrt's system prompt at assrt-mcp/src/core/agent.ts:213 tells the agent to call the snapshot tool before every interaction, read ref IDs like ref="e5" from the accessibility tree at that moment, and then use those refs to click. No data-testid is needed in the app and no selector is written into the plan file. When the UI refactors, the accessibility tree still exposes a button with role=button and name="Get started", and the same plan still clicks it. Stable selectors solve drift by picking better-typed drift; runtime-resolved refs eliminate the drift surface.

How does wait_for_stable replace 'avoid hard-coded waits' as a best practice?

Hard-coded waits are forbidden for the same reason hard-coded selectors are: you pick one number and live with it. Even the 'explicit wait' substitute (waitForSelector with a timeout) still requires you to pick a timeout. wait_for_stable in Assrt (see agent.ts tool definition lines 186-194) injects a MutationObserver into the page, counts DOM mutations, and returns as soon as there have been zero mutations for stable_seconds (default 2, max 10) or the outer timeout_seconds (default 30, max 60). The agent calls it after any action that kicks off async work. Pages that finish in 200ms return in 200ms; pages that finish in 4 seconds block for 4 seconds. You never pick a number.

If there is no Page Object Model, where do you put shared steps?

Nowhere. The reason POM exists is that locators are written in multiple test files and you want one place to change them when the UI changes. If no locators are written anywhere, there is nothing to share. Instead, steps are English sentences in scenario.md; if three #Cases all 'open the cart', that English repetition is fine because the agent re-resolves the ref from the live DOM on each run. When an app refactors the cart icon, the classical POM needs an update; the Markdown plan does not.

What is the exact list of tools the agent can call during a test?

Eighteen, defined in assrt-mcp/src/core/agent.ts starting at line 16: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, wait_for_stable. That closed set caps what any test can do. For best-practice reasoning this matters: classical suites cover 'isolate your tests', 'avoid hard-coded waits', 'fix flakes' as rules in docs, but the tools the test has access to are unbounded. Here they are bounded, so the rules are enforceable by inspection.

How do you run e2e tests in parallel when there is only one scenario.md?

Each #Case block is independent of the others within a file, and each file is independent of other files, so parallel execution means running as many npx assrt run processes as you have CPUs, each with its own plan. Because the runner is self-hosted and open source (no SaaS connection required), there is no account-level worker cap. Browsers are per-process, so there is no cross-test state leakage unless you opt into shared auth with --no-isolated. The one tradeoff: agent-picked tool calls add a model-inference round trip per step, so a compiled Playwright spec is faster per run. In most product test suites the model latency is dominated by page latency anyway.
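The fan-out itself is plain child processes. A sketch under assumptions: `runPlansInParallel` is a hypothetical helper, the command and arguments are placeholders for your real invocation, and a production version would cap concurrency at CPU count rather than launching everything at once:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Launch one runner process per plan file and collect pass/fail.
// cmd/args are placeholders for the actual invocation (e.g. something
// like `npx assrt run <plan>` per this page).
async function runPlansInParallel(cmd: string, args: string[], plans: string[]) {
  const settled = await Promise.allSettled(
    plans.map((plan) => run(cmd, [...args, plan]))
  );
  return settled.map((result, i) => ({
    plan: plans[i],
    passed: result.status === "fulfilled", // a non-zero exit code rejects
  }));
}
```

Because each process owns its own browser, this is the whole sharding story: no worker-count tuning, no shard index configuration.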

Does following e2e testing best practices actually prevent flakiness, or just move it?

In a classical suite, most 'best practices' are flakiness-shaped: explicit waits, stable selectors, isolated fixtures, retry on failure. Each one fixes a specific flake mode. But the rules do not stack cleanly; an app that adds a new async animation breaks a test that used waitForLoadState('networkidle'), and now you need a new rule. Runtime-resolved refs and MutationObserver stability cover most of those flake modes at once, because the primitives themselves adapt to the page. The remaining flake surface is genuine: flakes from actual race conditions in the app, which you want to see because they are real bugs.

Is any of this compatible with the Playwright tests I already have?

Yes. Assrt runs on real Playwright via Playwright MCP over stdio (see assrt-mcp/src/core/browser.ts for the launch path). Your existing spec files keep running the way they always did. The suggested adoption path is: keep stable legacy specs as-is, write new user flows as Markdown #Cases, and let the two coexist in the same repo. You can also graduate a high-value #Case into a deterministic spec when you need repeatable per-millisecond behavior. The only thing that changes is the default unit of work above the Playwright runtime: English instead of TypeScript.

What counts as a 'user journey' worth covering with e2e, and what should stay in unit tests?

Standard guidance is that 5-10% of your test suite should be e2e, focused on revenue-impacting journeys: signup, checkout, upgrade, core CRUD. That guidance still applies. What changes under Assrt is the cost per journey: instead of 150 lines of TypeScript plus fixtures, a journey is 5 lines of Markdown. The 5-10% budget becomes a line-count budget, not a maintenance budget, so teams can cover more journeys without the traditional ceiling. The things that stay in unit tests: pure functions, reducer logic, date math, parsing, anything where there is no browser surface to observe.

Where do evidence, screenshots and videos live after a run?

Three files on the local filesystem. /tmp/assrt/scenario.md is the plan, watched for edits and auto-synced to cloud storage after a 1-second debounce (see scenario-files.ts lines 96-103). /tmp/assrt/results/latest.json holds the TestReport with scenarios, passedCount, failedCount, per-assertion evidence, and per-step timings. /tmp/assrt/results/<runId>.json is the immutable per-run copy. If --video is passed, a WebM plus a player.html with 1x through 10x playback speed live next to the run. Nothing leaves your machine unless you opt into cloud sync. CI integrations typically read --json stdout and skip the files.
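A CI gate only needs the counters from latest.json. A sketch assuming only the field names this page states (passedCount, failedCount, scenarios); the rest of the TestReport shape, and the `gate` helper itself, are assumptions:

```typescript
// Only passedCount/failedCount/scenarios are taken from this page;
// any other TestReport fields are unknown here.
interface TestReport {
  passedCount: number;
  failedCount: number;
  scenarios: unknown[];
}

// Summarize a run and signal CI failure when anything failed.
function gate(report: TestReport): { summary: string; exitCode: number } {
  const total = report.passedCount + report.failedCount;
  return {
    summary: `${report.passedCount}/${total} passed across ${report.scenarios.length} scenarios`,
    exitCode: report.failedCount > 0 ? 1 : 0,
  };
}
```

In CI you would JSON.parse the contents of /tmp/assrt/results/latest.json (or the --json stdout) and exit with gate(report).exitCode.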
