A test automation guide for the era when the test is a plan, not a script
Most guides on this topic are a framework shopping list. This one isn't. It's a walkthrough of the pattern that's actually changing how tests get written: a plain-English plan, a runner that re-discovers the page on every step, and artifacts you can still read in five years. The working example is Assrt, because the source is open and the file paths are specific.
Three layers, one loop
Every automated test, regardless of framework, has the same three layers. What you write. What runs it. What you keep after it's done. The frameworks you've heard of differ mostly in how much of layer two they hand to you, and how ugly the thing in layer three is when you try to read it back six months later. The shift worth writing a guide about is that layer one is collapsing into plain text and layer two is picking up almost everything that used to be the writer's problem.
The plan → runner → artifacts loop
Layer one: the plan
In the scripted era, a test looked like this: await page.click('[data-testid="signup-btn"]'). You named a selector, you trusted the DOM would still match, and you paid when it didn't. In the plan-driven shape, a test looks like the file below. There are no selectors, no timeouts, and no imports. It's a literal markdown file a human can open, edit, and save while the test is running. The runner's file watcher picks up the change within a second.
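For instance, a plan file might read like this (the wording is hypothetical; only the #Case header and the bulleted steps are load-bearing):

```markdown
#Case 1: New user signs up and reaches the dashboard
Pass when the dashboard greets the new user by name.
- Open {{BASE_URL}} and start the signup flow
- Sign up with a fresh email address
- Enter the verification code from the inbox
- Confirm the dashboard shows a welcome heading
```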
The #Case header is the only syntax rule. Everything under it is a bulleted plan the model reads as intent, not as a command. The runner enforces nothing about wording. What matters is that a new teammate could read the case and understand what passing means.
What separates a plan that runs from one that flakes
- One scenario per outcome you care about. Not per click.
- Verbs a new teammate would understand on first read.
- Pass criteria stated up front so assertions are not optional.
- Variables for anything that changes per environment or per run.
- At most one session boundary per plan. Log in once, test many.
- No selectors. No waits in seconds. Let the runner handle both.
Layer two: the runner
The runner is where the scripted-era complexity went. It's now the runner's job to find the element, wait for the page, retry on flakes, handle the OTP, and decide when a scenario is done. The writer's job shrinks. The runner's job grows. Assrt's loop, cut down to its essentials:
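A minimal sketch of that loop in TypeScript. The tool names follow the article; the types, the fake runner, and the function names are invented for the example and are not Assrt's actual code.

```typescript
// Illustrative sketch of the snapshot → act → verify loop.
type Snapshot = { refs: Map<string, string> }; // ref -> accessible name
type Action = { tool: "click" | "type_text"; ref: string; text?: string };

interface Runner {
  snapshot(): Snapshot;                           // fresh accessibility tree, every step
  act(action: Action): void;                      // fire one tool call
  verify(step: string, after: Snapshot): boolean; // did the step's outcome hold?
}

function runCase(
  steps: string[],
  runner: Runner,
  pick: (step: string, snap: Snapshot) => Action,
): { passed: boolean; summary: string } {
  for (const step of steps) {
    const before = runner.snapshot(); // re-discover the page
    runner.act(pick(step, before));   // the model names an element by ref
    const after = runner.snapshot();  // refs from `before` are now discarded
    if (!runner.verify(step, after)) {
      return { passed: false, summary: `failed at: ${step}` };
    }
  }
  return { passed: true, summary: "all steps verified" }; // ~ complete_scenario(summary, passed)
}

// A trivially passing fake, just to show the shape of a call.
const fake: Runner = {
  snapshot: () => ({ refs: new Map([["e42", "Sign in button"]]) }),
  act: () => {},
  verify: () => true,
};
const result = runCase(
  ["click the Sign in button"],
  fake,
  (_step, snap) => ({ tool: "click", ref: [...snap.refs.keys()][0] }),
);
```

The point of the shape: the writer's plan is just `steps`; everything else is layer two.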
What happens between invoking the runner and the first assertion
- Plan: scenario.md
- Preflight: 8 s HEAD
- Browser: @playwright/mcp
- Agent loop: snapshot → act → verify
- Stability: MutationObserver
- Artifacts: webm + json + pngs
The five things the runner does for every #Case
Write the plan
One file, scenario.md, with #Case blocks. Each case is a short plain-English description of a flow plus bullet steps. Variables interpolate via {{KEY}}.
Length target: three to seven steps. A plan that spills past ten is a signal to split the scenario.
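The {{KEY}} interpolation can be sketched in a few lines (a hypothetical helper, not Assrt's implementation):

```typescript
// Replace {{KEY}} placeholders in a plan with per-run values.
// Unknown keys are left intact so a missing variable stays visible in the run log.
function interpolate(plan: string, vars: Record<string, string>): string {
  return plan.replace(/\{\{(\w+)\}\}/g, (match, key) =>
    key in vars ? vars[key] : match,
  );
}

const step = "- Log in as {{USER_EMAIL}} on {{BASE_URL}}";
const filled = interpolate(step, {
  USER_EMAIL: "trial@example.com",
  BASE_URL: "https://staging.example.com",
});
// filled === "- Log in as trial@example.com on https://staging.example.com"
```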
Hand it to the runner
The runner spawns @playwright/mcp in a subprocess, launches Chrome (headless or headed), preflights the target URL with an 8-second HEAD, then walks the plan case by case.
The agent picks actions
On each step it calls snapshot() to get a fresh accessibility tree, picks an element by reference, fires a tool (click, type_text, scroll, press_key, evaluate), and re-asserts.
Artifacts stream as it runs
Video starts with the browser. Screenshots drop after every visual action. events.json gets appended per tool call. execution.log gets a timestamped line for each step. Nothing waits until the end.
complete_scenario closes the case
When the agent decides a case is done, it calls complete_scenario(summary, passed). The runner writes the per-case report and moves to the next #Case, inheriting cookies and localStorage from the one that just ran.
The anchor: one editable file, one watcher
The plan file lives at /tmp/assrt/scenario.md. The runner starts a fs.watch() on it with a one-second debounce at scenario-files.ts:97 through 103. If a human (or another agent) edits the file between steps, the next scenario sees the new content. If the plan changes mid-case, the runner also syncs the new text back to the cloud store via updateScenario() so the same scenarioId is consistent across runs. This is the difference between a test file that's a fossil and a test file that's a live document.
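A debounced watcher of that shape looks roughly like this (illustrative; Assrt's version lives in scenario-files.ts with a one-second debounce):

```typescript
import fs from "node:fs";

// fs.watch can fire several events for a single save, so the timer
// collapses them into one reload of the plan file.
function watchPlan(
  path: string,
  onChange: (text: string) => void,
  debounceMs = 1000,
): fs.FSWatcher {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return fs.watch(path, () => {
    clearTimeout(timer);
    timer = setTimeout(() => onChange(fs.readFileSync(path, "utf8")), debounceMs);
  });
}
```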
Self-healing, as an implementation detail
"Self-healing" is the buzzword most of the existing write-ups on this topic lead with, and most of them don't explain what it actually means. Concretely: the runner never stores a selector between two actions. Every step starts with snapshot(), which produces an accessibility tree with integer refs like [ref=e42]. The model names an element by description and ref, the action fires, the ref is discarded. Between step two and step three, the page can re-render completely, and step three still works, because the next snapshot finds whatever's there now. There's no CSS selector that went stale because no CSS selector was stored.
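The mechanism fits in a dozen lines. Types and data here are invented for the example (real refs look like [ref=e42] in the accessibility tree); the point is that the lookup happens per step, by name, against whatever the page is now.

```typescript
// Why a full re-render doesn't strand step three: the element is re-found by
// accessible name in a fresh snapshot, and the previous ref is never reused.
type AxNode = { ref: string; name: string };

function refFor(snapshot: AxNode[], name: string): string {
  const hit = snapshot.find((node) => node.name === name);
  if (!hit) throw new Error(`no element named "${name}" in this snapshot`);
  return hit.ref; // valid for exactly one action, then thrown away
}

// The page re-renders between steps; the framework mints a new ref.
const beforeRender: AxNode[] = [{ ref: "e42", name: "Sign in button" }];
const afterRender: AxNode[] = [{ ref: "e97", name: "Sign in button" }];

const firstClick = refFor(beforeRender, "Sign in button");  // "e42"
const secondClick = refFor(afterRender, "Sign in button");  // "e97"
```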
“The ref you clicked a step ago is already gone by the time the next step runs.”
agent.ts SYSTEM_PROMPT, lines 208-227
Layer three: the artifacts
The output of a run is the part most write-ups skip. "Green dot, red dot, move on." In practice, a large share of the work of test automation is reading a failed run six weeks later and understanding what broke. A runner that only gives you pass/fail is a poor assistant. On every run, Assrt drops four things into /tmp/assrt/results/<runId>.json and its companion directory: a WebM video, a JSON report listing each assertion with its evidence, a per-step event trace, and a screenshot for every visual action.
Scripts go stale in weeks
A CSS selector baked into a spec file on Monday breaks the first time a designer restyles a button. Many Playwright suites hit their first mass selector rewrite within weeks of shipping.
Plans go stale in minutes, and you can edit them live
The plan is a file at /tmp/assrt/scenario.md. Edit a line while the runner is mid-case and fs.watch() picks it up in under a second. No restart, no recompile.
Waits are a runner problem, not a writer problem
wait_for_stable uses a MutationObserver with a 500 ms poll. You never type setTimeout(2000) into a test again.
Selectors don't survive a snapshot
Every action starts with a fresh accessibility tree. The ref you clicked a step ago is already gone. Stale-selector bugs don't have a place to live.
Artifacts are readable in five years
A WebM video, a JSON event trace, screenshots, and an assertion log. Open the run folder in 2031 and the story still reads. No proprietary run viewer required.
What kinds of flows this pattern handles well
Not everything should be end-to-end. The shape of flow the plan-plus-runner pattern is best at is the one where a browser has to interact with a third party, wait for something asynchronous, and then verify the result: signup with email verification, checkout, and webhook or billing callbacks are what teams point this pattern at first.
Numbers that let you size the pattern
Four numbers from the Assrt runner that constrain and calibrate the design. If you're comparing runners, these are the shape of knob you should be asking about.
- 500 ms: MutationObserver poll interval inside wait_for_stable
- 1600x900: WebM recording resolution, hard-coded in browser.ts
- 8 seconds: preflight HEAD timeout before the browser even launches
- A hard cap on pages auto-discovered per run, in the background
Scripted vs plan-driven, line by line
| Feature | Hand-written scripts | Plan-driven (Assrt) |
|---|---|---|
| How a test is written | Typed out as code. Selectors, waits, assertions, all by hand. | A plan in plain English. #Case blocks, editable mid-run. |
| What happens when a selector changes | The spec breaks. A human rewrites the selector string. | Nothing. The next snapshot finds the new ref. No selector was stored. |
| How waits are expressed | page.waitForSelector, sleep(2000), hand-tuned per flake. | wait_for_stable. MutationObserver decides when the page is quiet. |
| Who writes the test | Whoever knows the framework, usually a QA engineer. | Whoever understands the flow. The plan is English. |
| What you own after the run | Vendor cloud dashboard, proprietary YAML scripts. | A local .md plan, a WebM video, JSON report. MIT license. |
| CI integration | Framework-specific reporters, flaky retry logic. | Single CLI flag --json. Upload the .webm as an artifact. |
| Cost for a team | $7.5K / month per seat on closed enterprise runners. | Free. Bring your own Anthropic or Google key. |
The comparison assumes you have a working CI and a stable dev server. Nothing fixes a wedged preview URL for you.
Where to actually start
If you're automating the first test for a product, don't start with a framework pick. Start with a plan file. Open a new scenario.md. Write one #Case: the most expensive-if-broken flow in your product. Three to five bullet steps. A line at the top that says what "passed" means. Then hand it to a runner and see what it does. If the runner needs you to name selectors or type setTimeout, that's your signal: the layer-two complexity hasn't been paid for yet, and you're about to pay it by hand. A plan-driven runner takes the plan as-is.
The Assrt CLI runs it with:
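The flags mirror the CI invocation quoted later in the FAQ; the URL and plan path here are placeholders.

```shell
npx @assrt-ai/assrt run --url https://staging.example.com --plan-file scenario.md
```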
Everything else, from the accessibility snapshot to the video to the assertion log, is the runner's problem. You write the plan once and read the artifacts forever.
Want to port your first scripted suite into a plan file?
Thirty minutes with the founder. Bring one flaky flow; we'll rewrite it as a plan, run it live, and leave you the artifacts.
Frequently asked questions
What is test automation, in one paragraph, without the framework shopping list?
Test automation is the practice of turning 'did this feature work' into a check that a machine can repeat without you watching. In the scripted era, that check lived as code: selectors, waits, assertions, typed out by a person and re-typed every time the UI shifted. In 2026, the check is shifting to a plan: a short plain-English description of the scenario, executed by a runner that re-discovers the page on every step and heals stale selectors on its own. Assrt is an implementation of the second shape. Plans live at /tmp/assrt/scenario.md as lines like '#Case 1: log in as the trial user and confirm the dashboard loads'. A runner built on top of @playwright/mcp reads that plan, snapshots the DOM, clicks by accessibility reference, and records a video of the whole thing.
What should actually be automated first?
Automate the paths that would cost you the most if they silently broke. For most products that's three small things, not a suite of a hundred: sign up and log in, core happy-path conversion (checkout, send, create, whatever your primary verb is), and webhook or billing callbacks that come from a third-party and you can't easily manually retry. The reason to start there is cost of miss, not ease of automation. Everything else (admin panels, edge cases, rare flows) is a second pass. If your first automated scenario isn't 'new user signs up, gets a verification email, reaches the first save state', you are automating the wrong thing.
Plan-driven runs sound fine, but how do they handle dynamic pages that finish rendering at unpredictable times?
The runner's answer is DOM stability, not time-based waits. The Assrt agent exposes a tool called wait_for_stable (agent.ts:956 through 1009) that injects a MutationObserver into window, counts DOM mutations on a 500 ms poll, and returns as soon as the count stops changing for N consecutive seconds (default 2 s of quiet), with a hard cap of 30 s. The scenario writer never names a selector to wait for and never hard-codes a sleep. A fast page resolves in about two seconds; a streaming AI response resolves whenever the stream stops. Both paths use the same line in the plan.
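The quiet-DOM wait can be sketched with the mutation source abstracted out, so it runs outside a browser. In the real tool the counter is a MutationObserver injected into window; the 500 ms poll, 2 s quiet window, and 30 s cap follow the article, but the code itself is illustrative, not Assrt's source.

```typescript
// Poll a mutation counter until it stops changing for `quietMs`,
// giving up after `capMs`.
async function waitForStable(
  mutationCount: () => number, // stand-in for the injected observer
  pollMs = 500,
  quietMs = 2000,
  capMs = 30000,
): Promise<boolean> {
  const deadline = Date.now() + capMs;
  let last = mutationCount();
  let quietSince = Date.now();
  while (Date.now() < deadline) {
    await new Promise<void>((r) => setTimeout(r, pollMs));
    const now = mutationCount();
    if (now !== last) {
      last = now;                 // page still mutating: reset the quiet window
      quietSince = Date.now();
    } else if (Date.now() - quietSince >= quietMs) {
      return true;                // page settled
    }
  }
  return false;                   // hit the hard cap while still mutating
}
```

A fast page returns as soon as the quiet window elapses; a page that never settles returns false at the cap instead of hanging.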
How does a test agent 'heal' selectors? Isn't that just another locator library?
The agent never stores a selector between steps. On every action, it calls snapshot() to get a fresh accessibility tree, each element tagged with an integer reference like [ref=e42]. The model says click(element='Sign in button', ref='e42'), the action fires, and the ref is thrown away. If the page re-renders and the next ref is different, the next snapshot finds the new one. There's no CSS selector that went stale, because there was no CSS selector to go stale. The self-healing is structural, not fuzzy-match. Source: agent.ts, TOOLS array lines 16 through 196, SYSTEM_PROMPT lines 208 through 227.
What's the deal with storing plans as markdown? Why not YAML or a DSL?
Because a human edits them. When a test fails halfway through, you open /tmp/assrt/scenario.md, rewrite a line, save, and the runner picks it up via fs.watch() with a one-second debounce (scenario-files.ts:97 through 103). The file also syncs back to the cloud store so the next invocation on the same scenario id sees your edit. Markdown means the diff is readable in a PR, the plan is readable without a parser, and the model can propose a corrected case using the same syntax the human uses. YAML adds nothing here: there's no nested structure beyond #Case lines and bullet steps. A DSL adds a learning cost for zero gain over plain text with a one-line header.
How do I run automated tests against a flow that requires signing up with a real email?
The runner exposes two tools for this. create_temp_email opens a disposable mailbox through mail.tm and returns the address; the agent fills your signup form with it. Then wait_for_verification_code polls that inbox for up to 120 seconds, extracts the code with a regex, and pastes it into whatever OTP input the page has. For multi-field OTP layouts (six separate one-char inputs), the agent uses a hard-coded DataTransfer paste expression so React's onPaste handler distributes the digits in one render instead of losing focus between fields. See agent.ts lines 850 through 891 for the email tools and SYSTEM_PROMPT lines 234 through 236 for the paste expression.
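The extraction step is the easy half. A hypothetical helper of that shape (the real tool's regex may differ; this shows the idea, not Assrt's source):

```typescript
// Pull a 4-8 digit verification code out of an email body.
// Returns null rather than guessing when no code-like token appears.
function extractCode(emailBody: string): string | null {
  const m = emailBody.match(/\b(\d{4,8})\b/);
  return m ? m[1] : null;
}

const body =
  "Welcome! Your verification code is 482913. It expires in 10 minutes.";
const code = extractCode(body); // "482913" ("10" is too short to match)
```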
What does a good test artifact look like? What should I be able to open after a run?
Four things at minimum. One: a WebM video of the browser at 1600x900, recorded end to end via Playwright's devtools capability and served by a local HTTP server with Range support for seeking. Two: a structured JSON report with the list of assertions, each carrying a description, a passed boolean, and an evidence string. Three: a per-step event trace, so you can replay the agent's decisions without rewatching the video. Four: screenshots, one per visual action, indexed by step number. Assrt writes all four to /tmp/assrt/results/{runId}.json and opens the video in a player with keyboard controls at 5x playback by default. If your automation doesn't produce artifacts you can actually read later, it's a checkmark, not a test.
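As a type, the report might look like this. The field names are inferred from the four artifacts above and are illustrative, not Assrt's actual schema; the file paths are placeholders.

```typescript
// A sketch of the per-run JSON report's shape.
interface Assertion {
  description: string;
  passed: boolean;
  evidence: string;
}
interface ToolEvent {
  step: number;
  tool: string;
  detail: string;
  timestamp: string;
}
interface RunReport {
  runId: string;
  videoPath: string;      // the WebM, 1600x900
  screenshots: string[];  // one per visual action, indexed by step
  events: ToolEvent[];    // the per-step trace
  assertions: Assertion[];
  passed: boolean;
}

const example: RunReport = {
  runId: "run-001",
  videoPath: "/tmp/assrt/results/run-001/recording.webm",
  screenshots: ["/tmp/assrt/results/run-001/step-1.png"],
  events: [{ step: 1, tool: "click", detail: "Sign in button", timestamp: "2026-01-01T00:00:00Z" }],
  assertions: [{ description: "dashboard loads", passed: true, evidence: "heading 'Welcome' visible" }],
  passed: true,
};
```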
How do you keep automated tests stable across a moving UI?
Three practices matter more than any specific framework choice. First, assert on outcomes, not layouts: 'the user reaches /dashboard and sees a heading that says Welcome' beats 'the third button in the second column gets clicked'. Second, use accessibility names, not CSS classes: the accessibility tree is what a screen reader sees and is the least-churn surface of a modern web app. Third, keep tests short. A plan with three to five steps is easy to rewrite when the feature moves. A 400-line Playwright spec with a nested page-object model ends up abandoned. A plan-driven runner makes the second and third practices the default, because the plan is three lines and the runner can only reach elements that are in the accessibility tree.
Does test automation actually belong in CI, or does a local dev loop cover enough?
Both, but with different plans. In local dev the runner is an agent you talk to: 'run the signup flow, the session should stick, report back'. In CI it's a headless batch: on every pull request, replay the five smoke cases against the preview URL, fail the PR if any fail, upload the WebM artifacts to the run so a human can click through them later. The same plan file drives both. For Assrt the CI call is npx @assrt-ai/assrt run --url $PREVIEW_URL --plan-file tests/smoke.md --json --video --no-auto-open, and the output is a JSON report plus a .webm per scenario. If you want one knob instead of two, that's fine, but you pay for it either in slow local iteration or in flaky CI.
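Wired into a pipeline, that invocation might sit in a job like the following. This is a hypothetical GitHub Actions fragment: the step id `deploy`, its `url` output, and the artifact name are all placeholders for whatever your pipeline actually produces.

```yaml
- name: Run smoke plans against the preview
  run: npx @assrt-ai/assrt run --url "$PREVIEW_URL" --plan-file tests/smoke.md --json --video --no-auto-open
  env:
    PREVIEW_URL: ${{ steps.deploy.outputs.url }}

- name: Upload run artifacts
  if: always()   # keep the WebM and JSON even when a case fails
  uses: actions/upload-artifact@v4
  with:
    name: assrt-run
    path: /tmp/assrt/results/
```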
What about unit and integration tests? Is this replacing them?
No. Unit tests verify that a pure function does what the name says; integration tests verify that two modules cooperate; end-to-end tests verify that a real browser and a real backend together produce the state a user expects. They're different instruments. Agent-executed plans are an end-to-end tool, optimized for flows that cross UI, auth, email, and billing. Unit and integration tests remain faster, cheaper, and more precise for anything inside one service. The mistake is using the wrong tool at each layer: end-to-end for a date-formatting helper is painful, and unit tests for 'can a new user reach checkout in one session' are a fiction.