E2E testing guide, Markdown edition

E2E testing guide where the plan is a Markdown file, not a code file.

Most e2e testing guides walk you through installing a framework, writing a spec file in TypeScript, wiring up fixtures, tuning timeouts, and then arguing about page objects. This one does something else. The plan here is a scenario.md with #Case N: headings, one parser regex splits it, and an agent picks from a closed set of exactly 18 tools to execute it against real Playwright.

Matthew Diakonov
11 min read · rated 4.8 by Assrt MCP users

  • 18 tools in the closed set the agent can call
  • Plan is Markdown, runtime is real Playwright MCP
  • Self-hosted, open-source, no proprietary YAML

One sentence, for impatient readers

The test plan is a .md file. The runtime is Playwright. The glue is one regex and 18 tools.

You write English. The parser at agent.ts:621 splits on #Case. The agent loops tool calls until a scenario completes. The browser is a real Chrome over stdio-managed Playwright MCP.

Why this e2e testing guide is different from the other ten on page one

Read the top results for this keyword. IBM, BrowserStack, Katalon, Testim, the usual vendor roundups. They cover the same shape: the testing pyramid, what e2e means relative to unit and integration, why e2e catches real bugs, a list of frameworks with a features table, a checklist for selecting a runner, and a CI snippet. Genuinely useful context. But every one of them assumes you will finish reading, open a new repo, run npm init playwright@latest, and start writing test code in a TypeScript file.

This guide covers the pattern none of them mention: keep the runner, drop the spec file. Your plan is the content. The spec file shape (imports, describe, it, locators, awaits) is a runtime concern the reader never writes. The full surface area between your English and a real Playwright action is one regex and eighteen tool names. You can read both in under thirty seconds and audit everything the system can do.

Same three cases, two files

Two files, same user journey: homepage, signup with OTP, dashboard. First, the typical Playwright spec file you would write after reading a classical e2e testing guide; then the Markdown scenario.md that Assrt runs through the same Playwright underneath.

plan: code vs content

// e2e/signup.spec.ts
import { test, expect } from "@playwright/test";
import { createTempInbox, waitForOtp } from "./fixtures/inbox";

test.describe("Signup", () => {
  test("homepage loads", async ({ page }) => {
    await page.goto("/");
    await expect(
      page.getByRole("heading", { name: /Ship/i })
    ).toBeVisible();
  });

  test("sign up with email", async ({ page }) => {
    const inbox = await createTempInbox();
    await page.goto("/");
    await page.getByRole("button", { name: "Get started" }).click();
    await page
      .getByLabel("Email")
      .fill(inbox.address);
    await page
      .getByRole("button", { name: "Send code" })
      .click();
    const code = await waitForOtp(inbox, 60_000);
    await page
      .getByLabel("Verification code")
      .fill(code);
    await page
      .getByRole("button", { name: "Verify" })
      .click();
    await expect(page).toHaveURL(/\/app/);
  });

  test("dashboard renders for new user", async ({ page }) => {
    await page.waitForLoadState("networkidle");
    await expect(page.getByText(/welcome/i)).toBeVisible();
    await expect(
      page.getByRole("button", { name: "Create project" })
    ).toBeVisible();
  });
});
73% fewer lines
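For comparison, here is an illustrative scenario.md for the same journey, written in the #Case format the parser accepts (the step wording is invented for the example; any 1-to-5-line imperative phrasing works):

```md
#Case 1: Homepage loads
Open the homepage.
Verify a heading mentioning "Ship" is visible.

#Case 2: Sign up with email
Create a temp email inbox.
Click "Get started", enter the temp address, and click "Send code".
Wait for the verification code and enter it.
Verify the URL now contains /app.

#Case 3: Dashboard renders for new user
Wait for the page to go stable.
Verify a welcome message and a "Create project" button are visible.
```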

Notice what went away: two imports, a custom inbox fixture, a describe wrapper, three await page.getByRole(...).click() calls, one hard-coded 60-second OTP timeout, one waitForLoadState. None of that disappeared by magic. It moved out of your repo and into the runtime.

The anchor fact: one regex is the parser

The whole surface between your Markdown plan and the execution loop is the regex below. If your plan has one marker, it runs as one scenario. If it has five, they run as five scenarios in the same browser session with cookies and auth carrying over between them. Scenarios can be named #Case 1:, Scenario 2., or Test 3: interchangeably.

assrt-mcp/src/core/agent.ts:620-631

/(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi

That parser is the reason this guide does not need a "how to organize your test files" section. There are no describe nests, no test.each arrays, no conftest.py, no config cascade. One file per journey is fine. Five cases in one file is fine. Thirty files with one case each is fine. The parser treats them all the same and the browser is reused inside a single file.
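To make the mechanics concrete, here is a minimal sketch of a parseScenarios-style splitter. The regex is the one quoted in the FAQ below; the surrounding function is illustrative, not the actual agent.ts implementation.

```typescript
// The marker regex quoted in the FAQ; matches "#Case 1:", "Scenario 2.",
// "Test 3:" and variants.
const MARKER = /(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi;

interface Scenario {
  name: string;  // text between the marker and the first newline
  steps: string; // everything until the next marker
}

function parseScenarios(plan: string): Scenario[] {
  return plan
    .split(MARKER)                       // String.split discards the markers
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0) // drop any text before the first marker
    .map((chunk) => {
      const nl = chunk.indexOf("\n");
      return nl === -1
        ? { name: chunk, steps: "" }
        : { name: chunk.slice(0, nl).trim(), steps: chunk.slice(nl + 1).trim() };
    });
}
```

One file with five markers yields five Scenario objects; one marker yields one. Nothing else about file layout matters to the loop.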

The other anchor fact: 18 tools, nothing else

Every tool the agent can call is listed below. This is not a partial list or a subset of an SDK. It is the entire set defined in assrt-mcp/src/core/agent.ts from line 16 to line 196, and the model literally cannot invent a 19th. When you are reasoning about what a test could have possibly done, this list is the whole story.

navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, wait_for_stable

Navigation and inspection

navigate, snapshot, screenshot, scroll, wait. snapshot returns the accessibility tree with stable ref IDs like ref="e5" that the agent reuses for the next click, so it does not have to write a selector.

Interaction

click, type_text, select_option, press_key. The agent passes a human-readable element description plus a ref from the last snapshot; there is no locator string to maintain.

Waiting that adapts to the page

wait_for_stable injects a MutationObserver and returns once the DOM has been silent for N seconds. wait can target a text string instead of a timeout. You never pick a number in milliseconds.
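The timing logic behind wait_for_stable can be sketched as follows. This is an illustration with the MutationObserver factored out so the clock logic is testable: resolve "stable" once no mutations have arrived for stableMs, or "timeout" at the hard cap. In the real tool this runs inside the page and onMutation is wired to an actual MutationObserver callback.

```typescript
// Sketch of the wait_for_stable timing logic (not the real implementation).
function waitForStable(stableMs: number, timeoutMs: number) {
  let settle!: (v: "stable" | "timeout") => void;
  const done = new Promise<"stable" | "timeout">((res) => (settle = res));

  let quiet = setTimeout(() => settle("stable"), stableMs);
  const hardStop = setTimeout(() => settle("timeout"), timeoutMs);

  // Each reported mutation restarts the quiet-period clock.
  const onMutation = () => {
    clearTimeout(quiet);
    quiet = setTimeout(() => settle("stable"), stableMs);
  };

  // A promise settles only once, so the first of the two timers wins;
  // clean up both either way.
  done.then(() => {
    clearTimeout(quiet);
    clearTimeout(hardStop);
  });

  return { done, onMutation };
}
```

Fast pages resolve almost immediately; slow pages block until they quiet down or hit the cap. No per-test number to tune.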

Email and OTP

create_temp_email spins up a disposable inbox via temp-mail.io. wait_for_verification_code polls up to 120s and returns the digits, matching seven OTP patterns in priority order (code, verification, OTP, PIN, 6-digit, 4-digit, 8-digit).
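A priority-ordered matcher like the one described could look like this. The seven pattern families mirror the order above; the exact regexes are assumptions for illustration, not a copy of email.ts.

```typescript
// Illustrative OTP extraction: first pattern in priority order wins.
const OTP_PATTERNS: RegExp[] = [
  /code[:\s]+(\d{4,8})/i,         // "code: 123456"
  /verification[:\s]+(\d{4,8})/i, // "verification: 123456"
  /OTP[:\s]+(\d{4,8})/i,
  /PIN[:\s]+(\d{4,8})/i,
  /\b(\d{6})\b/,                  // any bare 6-digit number
  /\b(\d{4})\b/,                  // any bare 4-digit number
  /\b(\d{8})\b/,                  // any bare 8-digit number
];

function extractOtp(body: string): string | null {
  for (const pattern of OTP_PATTERNS) {
    const match = body.match(pattern);
    if (match) return match[1];
  }
  return null;
}
```

The ordering matters: a labeled "code: 1234" wins over an incidental 6-digit order number elsewhere in the email body.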

Assertions and completion

assert records a pass/fail with evidence. complete_scenario closes a #Case. suggest_improvement lets the agent file a UX bug against the app it just tested.

External APIs

http_request fires a 30-second-timeout fetch, used to verify integrations (Telegram Bot API, webhooks, Slack, anything the app pushes out) without leaving the test runner.

Arbitrary JavaScript

evaluate runs a JS expression inside the page. The agent uses it to paste into split OTP inputs (parent element, DataTransfer, ClipboardEvent) when single-character fields defeat normal typing.

How one #Case runs, end to end

Seven steps, same order every time. If you are used to Playwright, steps 1, 3, and 6 are familiar; the other four are where the Markdown workflow diverges.

1. Parse the plan

parseScenarios splits scenario.md at every #Case N:, #Scenario N., or #Test N:. Name = text between the marker and the first newline. Steps = everything until the next marker.

2. Preflight the URL

A HEAD request with an 8-second timeout confirms the target is reachable. A wedged dev server fails here with an actionable message instead of hanging Chrome launch for minutes.

3. Launch the browser

launchLocal spawns a local Playwright MCP over stdio, navigates to the URL with a 30-second nav timeout, injects the cursor overlay, and emits a screencast URL for live viewing.

4. Snapshot before every action

The agent calls snapshot first, reads refs like [ref=e5] out of the accessibility tree, and uses them when it clicks or types. If the ref goes stale, snapshot again.

5. Pick one of 18 tools

The model sees only the closed set: navigate, click, type_text, select_option, scroll, press_key, wait, wait_for_stable, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, snapshot.

6. Assert and complete

Each assert call writes a {description, passed, evidence} record. complete_scenario closes the #Case. The browser stays alive between cases so cookies and auth carry over.

7. Emit the report

TestReport with per-case pass/fail, duration, assertions, and steps is written to /tmp/assrt/results/latest.json. The plan sits at /tmp/assrt/scenario.md and is watched for edits.
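Step 2's preflight can be sketched as a HEAD request under an AbortController. The function name, default, and error text are illustrative; only the 8-second HEAD-check behavior comes from the description above.

```typescript
// Illustrative preflight: abort the HEAD request after timeoutMs so a
// wedged dev server fails fast with a readable message instead of
// hanging the browser launch.
async function preflight(url: string, timeoutMs = 8_000): Promise<void> {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), timeoutMs);
  try {
    const res = await fetch(url, { method: "HEAD", signal: ctrl.signal });
    if (!res.ok) {
      throw new Error(`Target answered ${res.status}; is the app actually up at ${url}?`);
    }
  } catch (err) {
    throw new Error(`Preflight failed for ${url}: ${(err as Error).message}`);
  } finally {
    clearTimeout(timer);
  }
}
```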

Plan, parse, act, assert

The data flow below is the whole runtime in one picture. Markdown on the left, assertions on the right, the hub is the agent loop that keeps picking tool calls until every #Case calls complete_scenario.

Markdown plan to Playwright actions

scenario.md (with passCriteria and variables) → agent loop → Playwright MCP → assert records → results/latest.json + video.webm

What the output looks like

The three-case plan shown above, run headless against a local dev server. Note how the tool call sequence for each case is different because the model picked tools based on what the page showed at snapshot time, not a fixed recipe.

assrt run (trimmed)

The sequence is not hard-coded

For Case 1, three tools. For Case 2, six. For Case 3, four. The model chose each one after reading the accessibility tree snapshot at that moment. If your signup flow changes tomorrow and now requires a country dropdown, the same Markdown plan adds a select_option call automatically because the agent sees the new field in the snapshot. You do not update the plan.

Numbers worth pinning to the inside of your head

The four numbers below are the whole product, condensed. Every one of them is checkable in the source; see the file paths in the FAQ below if you want to verify.

18 tools in the closed set
7 OTP regex fallbacks
2147483647 z-index for the injected overlay
0 selectors you write

18 tools, 7 OTP patterns in priority order, an overlay pinned at the max z-index (2147483647), and zero selector strings in your repo. These four numbers are all you need to explain the runtime to someone new on your team.

Markdown-first e2e testing vs classic framework-first

Same Playwright underneath. Different unit of work above it.

| Feature | Typical Playwright repo | Assrt (Markdown #Case) |
| --- | --- | --- |
| What is in the repo | *.spec.ts files with imports, fixtures, page objects | scenario.md with #Case N: blocks, checked in as content |
| How you target elements | Locators: getByRole, getByLabel, CSS, XPath | Plain English; agent matches against the accessibility tree ref |
| How you wait | await page.waitForSelector / timeout number you tune | wait_for_stable: MutationObserver, returns when DOM quiets for N seconds |
| How you do OTP signup | Bring your own Mailosaur / Mailslurp fixture | create_temp_email + wait_for_verification_code, built in |
| How you debug a failure | Parse selector timeout, read HTML report, trace.zip | Watch the WebM video with a 20px red cursor on every click |
| When selectors drift | Rewrite the locator, push a new spec | Plan still says "Click Get started"; agent re-resolves ref at runtime |
| Where the tests run | Your machine, your CI (Playwright binary, Node) | Same Playwright binary, via Playwright MCP over stdio, locally |
| Vendor lock-in | None on the framework; lots on proprietary SaaS runners | Open-source, self-hosted; the .md plan and the Playwright are yours |

When you should still write a spec file

Markdown plans are not a universal upgrade. Three places where the classical approach still wins, and you should stay there:

  • Deterministic pixel-level assertions (visual regression diffs, canvas games, precise drag distances). A plan that reads "Drag 100 pixels right" is ambiguous to an agent; an await locator.dragTo is not.
  • Hot paths in CI that must run in under 90 seconds per test. Agent-picked tool calls add a model-inference round trip per step; a compiled Playwright spec is faster. For a smoke suite that runs on every PR, keep the spec.
  • Strict regulated flows where the test itself is an audit artifact and the legal team needs to read exactly what was asserted. A Markdown plus a model is two artifacts; a spec file is one.

Everywhere else, especially in product-facing teams who change flows weekly, the Markdown pattern removes most of the maintenance cost and all of the selector drift.

Try it against your own app in five minutes

You do not need to register, install a framework, or hand over any code. Three commands, one Markdown file.

quickstart.sh

Watch the video that auto-opens. The red cursor is the agent. Every place it clicked came from a sentence you wrote.

See the Markdown workflow run against your app

Twenty minutes, one of your real flows, a #Case plan on screen. We show you the agent loop live and hand you the scenario.md afterwards.

Book a call

Questions other e2e testing guides do not answer

Frequently asked questions

What exactly is inside an e2e testing plan file when using this Markdown format?

A text file with one or more #Case N: headings. Each heading is followed by 1 to 5 imperative English lines describing what to do and what to verify. No imports, no fixtures, no describe/it blocks, no TypeScript. The parser at assrt-mcp/src/core/agent.ts:621 splits the file on the regex /(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi, so #Case 1:, Scenario 2., and Test 3: all work interchangeably. Everything between two markers is one scenario. The name of the case is the text between the marker and the first newline; the steps are the rest. The file lives at /tmp/assrt/scenario.md after the first run and can be edited in place; a file watcher debounces at 1 second and syncs the change back to cloud storage.

If the plan is English, how does the agent actually click the right button?

It does not guess. Before every interaction, the agent calls the snapshot tool, which returns the page's accessibility tree with per-element reference IDs like [ref=e5]. The model reads your English step ('Click Get started'), scans the tree for a matching node (role=button, name contains 'Get started'), and passes both a human description and the ref to the click tool. If the ref goes stale because the page mutated, the system prompt instructs the agent to call snapshot again for fresh refs. That is why the plan never needs a CSS or XPath selector: the accessibility tree is the selector, resolved at runtime by the model from your sentence.
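A toy version of that resolution step, under the assumption that the snapshot flattens to ref/role/name triples (the real accessibility tree is richer, and the real agent matches with the model rather than string search):

```typescript
// Toy ref resolution against a flattened snapshot.
interface AxNode {
  ref: string;  // e.g. "e5", reused by the next click/type_text call
  role: string; // e.g. "button"
  name: string; // accessible name, e.g. "Get started"
}

function resolveRef(step: string, tree: AxNode[]): AxNode | undefined {
  const lower = step.toLowerCase();
  // Naive matching: first node whose accessible name appears in the English
  // step. The model does this fuzzily, which is what lets loose phrasing
  // still land on the right node.
  return tree.find((n) => n.name.length > 0 && lower.includes(n.name.toLowerCase()));
}
```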

What is the exact set of tools the agent can use during a test?

Eighteen, defined in assrt-mcp/src/core/agent.ts starting at line 16: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, wait_for_stable. The model cannot invent new tool names or reach outside this set. That closed set is a deliberate choice: it caps the blast radius of what a test can do, makes reasoning about failures easy, and keeps the cost low (each run is a conversation, each tool call is one turn).

How does the OTP signup flow work without plugging in my own email provider?

The create_temp_email tool calls temp-mail.io's internal API (api.internal.temp-mail.io/api/v3/email/new) and returns an address plus a token. You use that address in the signup form. The wait_for_verification_code tool then polls the same inbox every 3 seconds for up to 120 seconds and matches the body against seven verification-code patterns in priority order: 'code: 123456', 'verification: 123456', 'OTP: 123456', 'PIN: 123456', any 6-digit number, any 4-digit number, any 8-digit number. Source: assrt-mcp/src/core/email.ts lines 82 to 129. If the form uses split single-character inputs (common OTP pattern), the agent pastes via a specific evaluate expression that dispatches a ClipboardEvent on the parent element; it does not type digit by digit.

How does this avoid the classic e2e testing problem of picking a wait timeout?

With wait_for_stable, which is step-level instead of step-global. It injects a MutationObserver into the page, counts DOM mutations, and returns as soon as there have been zero mutations for stable_seconds (default 2, max 10) or the outer timeout_seconds (default 30, max 60) is hit. The agent calls it after any action that might start async work: form submission, AI chat response, search results loading. Pages that finish fast return fast; pages that finish slow block longer. You never hard-code a number in milliseconds, and you never maintain a per-test wait budget.

Can this really replace my Playwright + Cypress + Selenium suites for a production app?

For AI-native teams and for teams who write mostly small journey tests, yes. For legacy suites with thousands of specs, fixtures, and CI budgets tuned over years, probably not as a one-time migration. But the underlying runner is Playwright. Your org does not buy into a proprietary recorder or a YAML-based DSL; it writes Markdown and runs Playwright. When the plan is checked into git alongside the code, it reads like a product requirement. When it is checked into /tmp/assrt/scenario.md, it is editable in place and auto-syncs. A reasonable adoption path is to let engineers write new flows as Markdown #Cases while leaving stable legacy specs alone.

Does the agent re-run a flaky test automatically, or do I have to re-kick it?

It does not magically retry a failing scenario. Flakiness is usually pushed upstream into the tools themselves: snapshot is called before every interaction so stale-ref flakiness disappears, wait_for_stable replaces wait(ms) so timeout flakiness disappears, ref-based clicks are more stable than CSS selectors against animated pages. When an action does fail, the agent is instructed to snapshot, look at the new accessibility tree, and try a different ref or approach before giving up. Only after three genuine attempts on one step does it mark the scenario failed. You can still build retry into CI around the whole run if you want classical retries.

Where does the evidence live after a run, and what does a CI integration look like?

Three artifacts, all on the local filesystem. scenario.md holds the plan. /tmp/assrt/results/latest.json holds the TestReport (scenarios, passedCount, failedCount, per-assertion evidence, per-step timing). /tmp/assrt/results/<runId>.json is the immutable copy for history. If --video is passed, there is a WebM plus a player.html with keyboard shortcuts for 1x/2x/3x/5x/10x speed. In CI, pass --json to stdout so the pipeline can parse pass/fail without reading files, or just grep results/latest.json. No cloud account is required to run; the optional cloud sync uploads artifacts and returns shareable URLs, but the runner itself is self-hosted and open-source.
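A CI step could consume that JSON with a shape like this. The field names are inferred from the fields named above (scenarios, passedCount, failedCount, per-assertion evidence, per-step timing); they are a guess, not the real schema.

```typescript
// Hypothetical TestReport shape for results/latest.json.
interface AssertionRecord {
  description: string;
  passed: boolean;
  evidence: string;
}

interface CaseResult {
  name: string;
  passed: boolean;
  durationMs: number;
  assertions: AssertionRecord[];
}

interface TestReport {
  scenarios: CaseResult[];
  passedCount: number;
  failedCount: number;
}

// A pipeline can gate on the counts without reading per-case detail.
function ciPassed(report: TestReport): boolean {
  return report.failedCount === 0 && report.scenarios.length > 0;
}
```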

What is the difference between this and an agentic Playwright wrapper that generates code?

Generators emit *.spec.ts once and call it a day; the generated file is then your problem to maintain. Assrt is the other direction: the plan stays in plain English forever, and the agent resolves it into tool calls on every run. That is why selector drift does not require a regeneration step: the agent re-reads the accessibility tree next time and finds the same 'Get started' button even if its class changed. You also get real Playwright at the bottom, not a proprietary YAML; when you want to graduate a #Case into a deterministic Playwright spec, you can. When you want to stay in Markdown, you stay in Markdown.
