QA engineer automation in 2026: writing plans an agent runs, not selectors
Most articles on this topic stop at the same shape: a definition, a tool list (Selenium, Cypress, Playwright), and a roadmap that ends with “learn a programming language.” That cut dates from 2019 and quietly assumes the deliverable is a typed file of locators. In 2026, on a working agent stack, the deliverable is something else: a Markdown file of `#Case` blocks. The agent does the typing.
This page is a practitioner cut. It describes the actual day-shape of a QA engineer running automation through Assrt: the file the engineer writes, the 18 tools the agent can call, the artifacts that land on disk, and the work that survives once the clicking is automated. Every claim points at a specific file in the open-source repo so you can verify it.
“The agent in Assrt can only call 18 named tools, defined in assrt-mcp/src/core/agent.ts lines 16 to 196. No invented Playwright APIs, no shell exec, no surprise calls. The QA engineer never names a tool: the surface is the markdown plan.”
Verifiable in the open-source repo
The job shape, before and after the agent
The pre-agent QA engineer sat between the product team and a wall of locator strings. Half the work was authoring the test, half was repairing it the next time a designer renamed a button. The post-agent QA engineer writes outcome assertions in English and lets the runtime resolve elements off a live accessibility tree. The work that survives is the part agents are bad at: deciding what edge cases matter and reading run evidence.
Before the agent vs after
Same job title, two different deliverables. The artifact you write changes; the loop you run changes; the cost of a UI rename changes.
| Feature | QA engineer with classic Selenium/Playwright | QA engineer with agent runtime (Assrt) |
|---|---|---|
| Authoring surface | TypeScript or Python file with locator chains | Markdown file with `#Case N:` blocks in plain English |
| Element identification | CSS selectors, xpaths, or testids written by hand | Accessibility-tree refs (e5, e12) resolved at run time |
| API surface the engineer manages | Hundreds of Playwright/Selenium methods | Zero. The engineer never names a tool. The agent picks one of 18. |
| What rots when the UI changes | Selectors, locator chains, fixture helpers | Outcome assertions only (and only if the outcome itself moved) |
| Pull request review surface | .spec.ts diff. Selector audits. Mock review. | Markdown diff. Read intent in English, watch a 5x video. |
| Failure triage | Open trace viewer, hunt for the line that broke | Call assrt_diagnose, get verdict + corrected #Case |
| Where the test artifact lives | Vendor cloud editor or proprietary YAML in cloud DB | /tmp/assrt/scenario.md and tests/scenarios/*.md in your repo |
| Cost model | Per-run pricing, seat fees, $7.5K/mo for some vendors | Open source. Self-hosted. Tests are yours. |
The 18-tool agent loop, in a single diagram
The diagram below is literally how every step of a run is shaped. On the left, the inputs an engineer can write or set. In the middle, the constrained tool surface the agent can call. On the right, the artifacts that land on disk. Because the middle is a schema, not a code generator, the agent cannot invent a Playwright method. If a model tried, the MCP server would reject the call.
scenario.md -> 18-tool MCP schema -> artifacts on disk
The 18 tools, named
This is the entire surface the agent can act on. Every step of every run is one of these eighteen: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, and wait_for_stable. The QA engineer never names one directly. They are listed here only so a reviewer can audit a tool-call trace later and confirm nothing was invented.
What a plan you write looks like
The file below is what you actually deliver. Three cases, fifteen steps, zero CSS. Save it as `scenario.md` (or `tests/scenarios/checkout.md` once it stabilizes) and the runtime will pick it up. There is no DSL to learn and nothing to compile.
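A plausible plan of that shape, assuming a small storefront under test; the flow names, promo code, and dollar amounts are illustrative, not copied from the repo:

```markdown
#Case 1: Guest adds an item and applies a promo
1. Open the product listing page
2. Click the first product card
3. Click Add to cart
4. Click Apply promo and type WELCOME10
5. Assert: subtotal reads $24.99

#Case 2: Signed-in user completes checkout
1. As a signed-in user, open the cart
2. Click Checkout
3. Type a shipping address into the address form
4. Click Place order
5. Assert: the URL ends with /thanks

#Case 3: Empty cart shows the empty state
1. Open the cart with no items added
2. Assert: the page shows an empty-cart message
3. Click Continue shopping
4. Assert: the product listing page is visible
5. Assert: the cart badge reads 0
```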
What the agent actually emits at run time
The trace below is the JSON form of one case from the plan above. Note the alternation: snapshot, click, snapshot, type, click. Every touch of the page starts with a fresh accessibility-tree snapshot so the ref the agent passes to click is always a ref it just saw on the live DOM. The QA engineer never wrote `e5`; the runtime discovered it.
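A sketch of that shape for the promo case above. The tool names are real (all from the 18 in agent.ts); the surrounding keys (`tool`, `args`, `ref`, `text`, `expected`) and the specific ref values are hypothetical stand-ins for whatever latest.json actually uses:

```json
[
  { "tool": "snapshot",  "args": {} },
  { "tool": "click",     "args": { "ref": "e5", "element": "Apply promo" } },
  { "tool": "snapshot",  "args": {} },
  { "tool": "type_text", "args": { "ref": "e12", "text": "WELCOME10" } },
  { "tool": "click",     "args": { "ref": "e14", "element": "Apply" } },
  { "tool": "assert",    "args": { "expected": "subtotal reads $24.99" } }
]
```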
The six verbs that replace writing selectors
A QA engineer's day on Assrt collapses to six verbs. None of them involve typing a locator. Each one maps to a specific file or command, so handoff between teammates is just a matter of pointing at the right path on disk.
Write the plan
Open scenario.md (or a scratch file). Add a `#Case N: name` header per user flow. Number each step. Describe outcomes in English: `Click Apply promo`, `Assert: subtotal reads $24.99`. Stop the moment a CSS selector enters your hands.
Run the agent
`npx assrt run --url <staging-url> --plan-file ./tests/scenarios/checkout.md`. The agent loops snapshot -> reason -> act, recording a WebM. Each tool call is one of 18 names from agent.ts. There is no .spec.ts being generated under the hood.
Watch the run, not the code
Open recording.webm. Default playback is 5x; spacebar pauses, arrows seek 5s, digit keys set speed. A 30-second test plays back in 6 seconds. The cursor lands on real elements, the keystroke toast shows what got typed.
Verify the assertion evidence
`jq '.cases[].assertions' /tmp/assrt/results/latest.json`. Each assertion carries an `evidence` string (`text $24.99 found near ref=e22`). Evidence in free text beats a green dot in a dashboard, because you can read it.
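One entry of that array might look like the sketch below; only the `evidence` string is documented above, and the other keys are assumptions for illustration:

```json
{
  "step": 5,
  "assertion": "subtotal reads $24.99",
  "passed": true,
  "evidence": "text $24.99 found near ref=e22"
}
```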
Diagnose any red
Call `assrt_diagnose` on a failing run. The prompt is configured as a senior QA persona. It returns a verdict (app bug, test flaw, env issue), a recommended fix, and a corrected `#Case` you can paste back into scenario.md.
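The exact return shape is configured in server.ts; what you read out of it is three things, roughly along these lines (wording illustrative):

```
Verdict: test flaw. The plan asserts the subtotal before the promo has finished applying.
Recommended fix: wait for the order summary to update before asserting.
Corrected #Case:
  #Case 1: Guest adds an item and applies a promo
  ...
  4. Click Apply promo and type WELCOME10
  5. Wait until the order summary updates
  6. Assert: subtotal reads $24.99
```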
Commit the markdown
Copy the stable plan into `tests/scenarios/<name>.md` and `git add` it. Pull request review is a Markdown diff. The next contributor inherits a runnable English plan, not a wall of selectors.
A morning of automation work, in a terminal
One file authored, one command run, one video reviewed, one commit pushed. No vendor dashboard, no proprietary editor, no paid seat. Open source, self-hosted, all artifacts on your disk.
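A representative session; the staging URL, file names, and editor invocation are placeholders, and every assrt command is one the sections above already name:

```bash
# 1. Author the plan in any editor
$EDITOR tests/scenarios/checkout.md

# 2. Run the agent against staging (snapshot -> reason -> act, WebM recorded)
npx assrt run --url https://staging.example.com --plan-file ./tests/scenarios/checkout.md

# 3. Review the evidence: watch the recording, then read the assertion strings
open /tmp/assrt/<runId>/video/recording.webm   # or xdg-open on Linux
jq '.cases[].assertions' /tmp/assrt/results/latest.json

# 4. Commit the plan so the next contributor inherits it
git add tests/scenarios/checkout.md
git commit -m "test: checkout flow as an agent-runnable plan"
```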
The five artifacts that fall out of every run
Each path is fixed and predictable, which means hand-off and review are cheap. A teammate who has never seen your test can open the same paths and audit it.
/tmp/assrt/scenario.md
The plan. Plain Markdown with `#Case N: name` headers and numbered steps in English. Save it with any editor; an fs.watch in scenario-files.ts re-syncs after a 1000ms debounce.
/tmp/assrt/scenario.json
Metadata only: { id, name, url, updatedAt }. The id is a UUID that doubles as the cloud-sync access token if you opt in.
/tmp/assrt/results/latest.json
Last run as JSON. Per-case timing, per-assertion `evidence` strings, the model used, the tool-call trace. Pipe through jq for ad-hoc reporting.
/tmp/assrt/<runId>/video/recording.webm
Standalone WebM with a painted cursor, click ripples, and keystroke toasts. Plays in VLC, Chrome, or ffmpeg. No trace viewer required.
/tmp/assrt/<runId>/screenshots/NN_stepN_action.png
Zero-padded PNG per step (00_step1_init.png, 01_step2_click.png). Filenames carry the tool name so a file manager is enough to audit a run.
Numbers that pin the loop down
A few concrete values from the source. Everything below is verifiable in the open repo.
196 is the last line of the TOOLS array in assrt-mcp/src/core/agent.ts. Open the file at line 196 and you will see `];` closing the entire schema.
$0 is the license cost for the agent runtime. Compare that to closed competitors at four to five figures monthly. Tests, plans, and recordings stay on your hardware.
Five steps to your first run
Two minutes if you already have Node and Chrome installed. The setup command registers Assrt as an MCP server so Claude Code or Cursor can drive it; the run command is a one-shot CLI you can use without an editor at all.
From zero to first commit
1. Install: `npx assrt setup` (registers MCP for Claude/Cursor)
2. Plan: write the first `#Case N:` block in scenario.md
3. Run: `npx assrt run --url <staging> --plan-file ...`
4. Review: open recording.webm, then jq the results
5. Commit: `git add tests/scenarios/<name>.md`
What “done” looks like for a single case
A concrete checklist for the engineer reviewing their own work before pushing the markdown to a branch. Every box answers a question that selector-only review surfaces could not.
Per-case readiness check
- Every step in scenario.md describes an outcome a customer would notice
- No CSS selector, xpath, or testid string appears anywhere in the plan
- Every assertion in latest.json has an `evidence` string you can read
- The recording shows the cursor on real elements (not blank canvas)
- Every tool call in the trace is one of the 18 names in agent.ts lines 16-196 (a spot-check command follows this list)
- The plan is committed under tests/scenarios/ as plain Markdown
- Failures are routed to assrt_diagnose before a human rewrites the case
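One way to spot-check the tool-call box above, assuming the trace entries in latest.json expose the tool name under a `tool` key (the actual key may differ; adjust the filter if so):

```bash
# Print every distinct tool name used in the last run, then compare by eye
# against the 18 names in assrt-mcp/src/core/agent.ts lines 16-196.
jq -r '.. | .tool? // empty' /tmp/assrt/results/latest.json | sort -u
```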
Where the load-bearing files live
Everything above is verifiable in two open-source repos: assrt and assrt-mcp. Three files do most of the heavy lifting.
- `assrt-mcp/src/core/agent.ts`: lines 16 to 196 hold the entire `TOOLS` array. Counting the `name:` lines gives 18. This is the agent's vocabulary.
- `assrt-mcp/src/core/scenario-files.ts`: lines 16 to 19 pin the disk paths (`/tmp/assrt/scenario.md`, `/tmp/assrt/scenario.json`, `/tmp/assrt/results/`). Line 102 sets the 1000ms debounce on the fs.watch that re-syncs your edits.
- `assrt-mcp/src/mcp/server.ts`: holds the MCP tool definitions Claude Code or Cursor sees: assrt_test, assrt_plan, assrt_diagnose, and the optional assrt_analyze_video. The same file is where the senior-QA persona prompt for diagnose is configured.
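For readers who have not opened agent.ts yet, a constrained tool entry generally takes a shape like the one below. This is a generic sketch, not a copy of the repo's code; the exact field names and schema format in the TOOLS array may differ:

```typescript
// Generic sketch of a constrained tool definition (illustrative, not from agent.ts).
interface ToolDefinition {
  name: string;                          // one of the 18 allowed names, e.g. "click"
  description: string;                   // what the model reads before choosing the tool
  inputSchema: Record<string, unknown>;  // JSON Schema constraining the arguments
}

const clickTool: ToolDefinition = {
  name: "click",
  description: "Click an element identified by a ref from the latest accessibility snapshot.",
  inputSchema: {
    type: "object",
    properties: {
      ref: { type: "string", description: "Ref from the most recent snapshot, e.g. e5" },
    },
    required: ["ref"],
  },
};
```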
What the QA engineer keeps doing
The work an agent does badly is the work a senior QA engineer has always done well. Three things matter more in this model, not less.
Choosing what to test. The agent will obediently run any plan you write. It will not tell you that you forgot the empty-cart case, the unicode address, the back-button regression. Test selection is a product call, and it is yours.
Writing assertions a customer would care about. `Assert: subtotal reads $24.99` is a customer-grade assertion. `Assert: the third div has class active` is not. The agent will execute either; only the engineer can tell which one tells you something true.
Calling failures correctly. When a case goes red, was it the app or the test? `assrt_diagnose` will offer a first-pass verdict, but final classification still happens in a human head. Weighing the cost of a wrong call (a 3am page for a flake, or a shipped regression you blamed on the test) is the judgment senior QA work has always been about.
Want to see your own QA workflow on this stack?
Bring one feature you currently test by hand. Walk away with a runnable scenario.md, a recorded WebM, and a plan to commit it to your repo.
Frequently asked questions
What does QA engineer automation actually mean in 2026?
The job has split. The execution layer (clicking, typing, asserting) is now an agent that drives a real browser through Playwright MCP. The authoring layer (deciding what to verify and how to phrase it) is a human writing plain-English steps in a Markdown file. A QA engineer in this model spends almost no time on locators and most of their time on `#Case` blocks, expected outcomes, and edge cases. It is closer to writing a good bug report than writing code.
If the agent does the clicking, what stops it from inventing a Playwright API?
The agent in Assrt cannot call arbitrary Playwright. It can only call 18 named tools defined in `assrt-mcp/src/core/agent.ts` lines 16 to 196: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, and wait_for_stable. Anything else is a schema error rejected by the MCP server. The QA engineer never picks one of these; the agent does. The QA engineer only writes the English in `#Case` blocks.
How are elements identified if there are no CSS selectors in the plan?
Each step that needs to touch an element starts with a `snapshot` call. Playwright MCP returns the page as an accessibility tree and tags every focusable element with an opaque ref like `e5` or `e12`. The agent then calls click or type_text passing the ref it just saw. The lookup happens on the live page, so a UI redesign that keeps the same accessibility role does not break the test. There is nothing in the plan that says `.btn-primary` or `[data-testid=submit]`. The plan says `Click the Sign In button` and the runtime resolves that string to an `e7` at run time.
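Roughly what the agent sees after a snapshot call: an abbreviated, hand-written excerpt in the spirit of an accessibility-tree snapshot, not actual tool output (the real thing is longer and formatted slightly differently):

```
- heading "Welcome back" [level=1]
- textbox "Email" [ref=e5]
- textbox "Password" [ref=e6]
- button "Sign In" [ref=e7]
```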
Where does the QA engineer's work actually live on disk?
Four paths, all under `/tmp/assrt/` by default. `/tmp/assrt/scenario.md` is the plan, written in Markdown with `#Case N: name` blocks. `/tmp/assrt/scenario.json` carries the metadata (id, name, url, updatedAt). `/tmp/assrt/results/latest.json` is the most recent run, with per-assertion evidence strings. `/tmp/assrt/results/<runId>.json` archives history. Save scenario.md in any editor and an fs.watch in `scenario-files.ts` re-syncs the plan after a 1000ms debounce. There is no proprietary editor between the engineer and the file.
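The re-sync described there is a standard debounced watcher. A minimal sketch of that pattern, assuming only the behavior stated above (illustrative, not the code in scenario-files.ts):

```typescript
import { watch, readFileSync } from "node:fs";

// Watch the plan file and re-sync only after edits have been quiet for 1000ms.
const PLAN_PATH = "/tmp/assrt/scenario.md";
const DEBOUNCE_MS = 1000;

let timer: ReturnType<typeof setTimeout> | undefined;

watch(PLAN_PATH, () => {
  if (timer) clearTimeout(timer);
  timer = setTimeout(() => {
    const plan = readFileSync(PLAN_PATH, "utf8");
    // ...re-parse the #Case blocks and hand the updated plan to the runtime
    console.log(`re-synced ${plan.length} bytes from ${PLAN_PATH}`);
  }, DEBOUNCE_MS);
});
```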
How is this different from Selenium, Cypress, or vanilla Playwright?
Three concrete differences. First, the artifact: Selenium and Playwright deliver `.spec.ts` or `.py` files; Assrt delivers Markdown. Second, the resolver: Selenium and Playwright bind to selectors at write time; Assrt binds to accessibility refs at run time. Third, the loop: traditional automation runs deterministic code; an agent loop reads the page after every action and decides the next move. The classic stacks are still useful (Assrt drives Playwright underneath, after all), but the QA engineer's authoring surface is no longer a typed locator API.
Does this replace senior QA work, or shift it?
It shifts it. The work that survives is the part agents are bad at: deciding what edge cases matter, writing assertion text the customer would care about, classifying a failure as app-bug versus test-flaw, and deciding when a flake is real. Assrt has a dedicated tool for the last one called `assrt_diagnose` that returns a senior-QA-style verdict (bug in app, flaw in test, or environment) plus a corrected `#Case` you can paste back. The mechanical click-and-type work is what the agent absorbs.
How does the QA engineer write a good #Case?
Write outcomes, not interactions. `Click Add to cart` is fine; `Click button.btn[data-testid=add]` is wrong because the engineer is now doing the agent's job. State preconditions explicitly (`as a signed-in user`), name expected results in plain text (`subtotal shows $24.99`, `the URL ends with /thanks`), and prefer one verifiable assertion per step over a long chain of clicks ending in a single check. The goal is for any contributor to read the case and understand both intent and expected behavior without opening the app.
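Putting those rules together, a case written this way might read as follows (the flow and dollar amount are illustrative):

```markdown
#Case 4: Signed-in user checks out a single item
1. As a signed-in user, open the product page for any in-stock item
2. Click Add to cart
3. Click Checkout
4. Click Place order
5. Assert: the URL ends with /thanks
6. Assert: the order confirmation shows subtotal $24.99
```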
What about flaky tests, retries, and waiting?
Two of the 18 tools exist for exactly this: `wait` and `wait_for_stable`. `wait_for_stable` is the better default because it polls until the DOM stops mutating for N seconds, rather than sleeping for a hard-coded interval. The engineer does not call these directly. Instead, the plan describes the user-visible cue (`Wait until the order summary appears`) and the agent picks the appropriate tool. If a case still flakes, `assrt_diagnose` looks at the run trace and proposes the right wait strategy.
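In practice the split looks like this: the engineer writes the user-visible cue, the agent picks the tool.

```
Plan step (written by the engineer):
  6. Wait until the order summary appears

Run-time choice (made by the agent, never the engineer):
  wait_for_stable -> poll until the DOM stops mutating, then take a fresh snapshot
  wait            -> only when the plan genuinely asks for a fixed pause
```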
Can I commit scenario.md to git and review it like code?
Yes. The format is plain Markdown, so `git diff` shows exactly what changed between revisions. Many teams copy the plan out of `/tmp/assrt/scenario.md` into a versioned `tests/scenarios/` directory once it stabilizes. Pull request review then becomes a Markdown review: did the case capture the intent? Are the expected outcomes specific? Code review for QA goes from selector audits to product review, which is what most senior reviewers wanted to do anyway.
Where do I run this without sending tests to a vendor cloud?
Locally. Assrt is open source, self-hosted, and the agent runs entirely on your machine through `npx assrt`. Tests, plans, screenshots, and recordings stay in `/tmp/assrt/` and `~/.assrt/` unless you sync them yourself. Cloud sync is opt-in and uses UUIDs as access tokens, not account-bound storage. Compare to closed-source tools that price per-test-run and own the resulting fixtures: with Assrt the markdown is yours, the WebM is yours, the JSON report is yours.
Does any of this work with the test suite I already have?
Yes. Assrt does not replace an existing Playwright suite, it runs alongside. Keep the deterministic `.spec.ts` files for the flows the team has already invested in. Use Assrt for the work that benefits from English-first authoring: exploratory tests, regression cases that change frequently, customer-reported flows that need to be turned into verification quickly. Both stacks ultimately drive Playwright; only the authoring surface differs.
Related guides that pair well with this one
Adjacent reading
AI-generated Playwright tests review: watch the run, not the .spec.ts
The reviewer-first guide. Four artifacts on disk, three minutes per test, zero hallucinated selectors to audit by eye.
AI-generated end-to-end testing without the YAML lock-in
Why proprietary YAML and visual recorders create vendor lock-in, and how a markdown plan + Playwright MCP avoids it entirely.
AI in automation testing: where the agent helps and where it does not
An honest cut: agentic browsers are great at click-and-type and bad at deciding what edge cases matter. The QA engineer keeps the second job.