From an X thread on flaky E2E suites

Playwright test maintenance vs generation: it is one decision, not two stages

Teams keep asking whether to spend their effort generating tests or maintaining them. It is the wrong split. Generation and maintenance are welded together by a single thing: the artifact your generator hands you. Pick the artifact and you have already picked your maintenance bill.

Direct answer · verified 18 May 2026

Generation is a one-time cost. Maintenance is the recurring one, and on a locator-bound suite it dominates the total. But the two are not opposites: the artifact your generator emits decides the maintenance bill. Playwright's Generator agent compiles a Markdown plan into TypeScript with selectors baked in, so its Healer agent then maintains that compiled spec for the life of the project. Assrt never compiles: the plain-English #Case file is executed live on every run, so generating and maintaining a test become the same operation.

Playwright's three-agent model is documented at playwright.dev/docs/test-agents. Assrt's generator and maintenance tools are open source in assrt-mcp.

Two activities people keep filing separately

Generation and maintenance feel like different jobs, so most tooling treats them as different jobs. They are not. Here is what each one actually is, and the line that connects them.

Generation

Turning a flow into a test

A walkthrough, a recording, or an autonomous crawl becomes an executable test. You pay this once per scenario. It is the part everyone demos.

Maintenance

Keeping that test true as the app moves

Selectors drift, waits get re-tuned, setup code rots. You pay this every sprint, forever. It is the part nobody puts on the roadmap.

The line between them is the artifact. Generation produces a file. Maintenance edits that file. If the file is TypeScript with concrete selectors, every layout change is a future edit. If the file is plain-English intent that something re-resolves at run time, most layout changes never reach it. You are not choosing between two activities. You are choosing what the first activity leaves behind.
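To make the contrast concrete, here is a minimal sketch in TypeScript. It does not use Playwright or Assrt. The `UiNode` type, the two toy pages, and both lookup functions are hypothetical stand-ins for a page and for the two kinds of artifact, under the assumption that a redesign changes an element's id but not its visible label.

```typescript
// Hypothetical model of a page: each node has a DOM id plus the role/name
// pair that an accessibility tree would expose.
type UiNode = { id: string; role: string; name: string };

// Locator-bound artifact: the selector was fixed when the test was generated.
function findBySelector(page: UiNode[], id: string): UiNode | undefined {
  return page.find((node) => node.id === id);
}

// Intent-bound artifact: role and name are re-resolved against whatever
// tree exists at run time.
function findByIntent(page: UiNode[], role: string, name: string): UiNode | undefined {
  const wanted = name.toLowerCase();
  return page.find(
    (node) => node.role === role && node.name.toLowerCase().includes(wanted),
  );
}

// The same button before and after a redesign that renamed its id only.
const pageBefore: UiNode[] = [{ id: "btn-submit", role: "button", name: "Sign up" }];
const pageAfter: UiNode[] = [{ id: "cta-primary", role: "button", name: "Sign up" }];

const selectorBefore = findBySelector(pageBefore, "btn-submit") !== undefined;
const selectorAfter = findBySelector(pageAfter, "btn-submit") !== undefined;
const intentBefore = findByIntent(pageBefore, "button", "sign up") !== undefined;
const intentAfter = findByIntent(pageAfter, "button", "sign up") !== undefined;

console.log({ selectorBefore, selectorAfter, intentBefore, intentAfter });
```

The sketch also shows the limit of the idea: if the label changed to something the intent no longer describes, the intent lookup would miss as well, and that is the point at which the intent file itself needs an edit.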

What the planner, generator, healer diagram leaves out

Playwright now ships three test agents, and almost every write-up presents them as a tidy pipeline. A Planner explores your app and writes a Markdown plan. A Generator turns that plan into Playwright test scripts in TypeScript, picking selectors and adding assertions as it goes. A Healer runs failing tests in debug mode, inspects console logs and snapshots, and patches the broken parts.

It is a genuinely good toolset. But notice the quiet step in the middle. The Generator performs a one-way compile: a Markdown plan goes in, TypeScript with hard selectors comes out. After that, the Healer maintains the TypeScript. The plan that the Planner so carefully wrote is now stale documentation that nothing executes. The artifact you generated and the artifact you maintain have split into two, and the maintained one is the brittle one.

That compile step is not a detail. It is the exact moment the maintenance bill is created. Generation did not save you from maintenance; it handed you the thing you will maintain.

In Assrt, the generator and the healer write the same file

Assrt's MCP server exposes a generator, assrt_plan, and a maintenance tool, assrt_diagnose. They are two separate Claude Haiku calls with two different system prompts. Open assrt-mcp/src/mcp/server.ts and read both prompts side by side. They end at the same instruction.

assrt_plan · the generator

// assrt-mcp/src/mcp/server.ts
// PLAN_SYSTEM_PROMPT, lines 219-236

You are a Senior QA Engineer generating
test cases for an AI browser agent.

## Output Format
Generate test cases in this EXACT format:

  #Case 1: [short action-oriented name]
  [step-by-step instructions to execute]

## CRITICAL Rules for Executable Tests
1. Each case must be SELF-CONTAINED
2. Be specific about selectors
4. Keep cases SHORT (3-5 actions max)
6. Generate 5-8 cases max

assrt_diagnose · the maintenance tool

// assrt-mcp/src/mcp/server.ts
// DIAGNOSE_SYSTEM_PROMPT, lines 240-268

You are a senior QA engineer and debugging
expert, given a failing test case report.

1. Diagnose the root cause: an app bug, a
   flawed test, or an environment issue?
3. Provide a corrected test scenario if the
   test itself needs adjustment.

## Output Format
### Corrected Test Scenario
  #Case 1: [corrected case name]
  [corrected steps that will pass]

Both prompts terminate at #Case 1:. The generator produces a #Case block. The maintenance tool produces a corrected #Case block. There is no second, compiled grammar for either one to drift into. The third tool, assrt_test, reads that same scenario.md and drives the run by calling snapshot on the live accessibility tree before each action (see assrt-mcp/src/core/agent.ts, the snapshot-first rules around lines 207-218). The plan is never compiled away. It stays the test.
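One practical consequence of a shared output format is that a single reader can handle both tools' output. The sketch below is an illustration only: `parseCases`, the `TestCase` type, and both sample strings are hypothetical and are not taken from assrt-mcp. It assumes a case starts at a `#Case N:` line and that every non-empty line after it is a step.

```typescript
// Hypothetical parser for the #Case format quoted in the prompts above.
type TestCase = { index: number; name: string; steps: string[] };

function parseCases(markdown: string): TestCase[] {
  const cases: TestCase[] = [];
  let current: TestCase | null = null;

  for (const rawLine of markdown.split("\n")) {
    const line = rawLine.trim();
    const header = /^#Case\s+(\d+):\s*(.+)$/.exec(line);
    if (header) {
      const [, index = "", name = ""] = header;
      current = { index: Number(index), name, steps: [] };
      cases.push(current);
    } else if (current && line.length > 0) {
      // Lines before the first #Case header (such as a "###" heading) are ignored.
      current.steps.push(line);
    }
  }
  return cases;
}

// Sample generator-style output (made up for this example).
const generated = `#Case 1: Sign up with email
Open the signup page
Fill in the email field and submit`;

// Sample diagnose-style output (made up for this example).
const corrected = `### Corrected Test Scenario
#Case 1: Sign up with email
Open the signup page
Fill in the email field
Accept the terms checkbox and submit`;

const generatedCases = parseCases(generated);
const correctedCases = parseCases(corrected);
console.log(generatedCases[0]?.steps.length, correctedCases[0]?.steps.length);
```

Because both strings go through the same function, a corrected case can replace a generated one without any conversion step in between.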

1 file

“assrt_plan generates it, assrt_diagnose rewrites it, assrt_test runs it. The generator, the maintenance tool, and the runner all operate on the same plain-English scenario.md. There is no compiled artifact for a UI change to rot.”

assrt-mcp/src/mcp/server.ts

One test, watched across a month

Here is the same test through three real events. Notice how few of them are maintenance.

Generate once, run live, maintain only on real change


Day 1: assrt_plan writes scenario.md

The generator crawls the app and produces six #Case blocks of plain-English steps. No selectors, no TypeScript, no compile.

Locator-bound generation vs intent that runs live

The honest comparison is not feature counts. It is what artifact each approach leaves between the generator and the future.

Feature	Codegen / Generator agent (compiled .spec.ts)	Assrt (#Case, no compile)
What the generator emits	A TypeScript spec with selectors and assertions baked in	A #Case block of plain-English steps in scenario.md
What actually runs in CI	The compiled .spec.ts, exactly as it was generated	The #Case file, re-derived against the live page on every run
What a button rename costs	A failed run, then a Healer pass that edits the spec	Nothing. The next snapshot is a new tree; the intent still matches
What the word maintenance means	Editing generated code: selectors, waits, setup blocks	Rewriting a #Case only when the app's behavior actually changed
Generator vs healer output	Different artifacts: plan in, TypeScript out, patched TypeScript after	One artifact. assrt_plan and assrt_diagnose both emit #Case Markdown
Cost per run	Zero LLM calls. The spec is static and fully deterministic	Each run spends Claude Haiku tokens to drive the agent live
Where the test lives	Standard Playwright TypeScript, committed to your repo	Plain Markdown in your repo, run by an MIT-licensed open-source CLI

A compiled .spec.ts wins on determinism and per-run cost. Assrt wins when your UI changes often enough that maintenance, not running, is the expensive part.

Where a compiled spec still wins

Collapsing generation and maintenance into one operation is not free. Because Assrt re-derives the executable test from the live page on every run, every run spends Claude Haiku tokens and needs the agent present. A compiled Playwright spec spends nothing per run. The file is static and identical byte for byte between runs, it executes the same steps each time, and it needs no model call and no API key. If your UI is stable and a given test almost never changes, that compiled file is cheaper to run for years.

So the choice is honest and situational. Test generation becomes worth rethinking at the exact point where keeping tests true starts costing more than writing them did. If your flows churn, if every design pass spawns a batch of red builds, the compile step is working against you and an intent file that runs live will save real hours. If your flows are frozen, a generated spec you maintain by hand is perfectly reasonable. Pick the artifact that matches how fast your app actually moves.
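That break-even point can be estimated with simple arithmetic. The sketch below is a rough model, not a measurement: every field name and every number is a hypothetical input, amounts are in cents to keep the arithmetic exact, and it deliberately ignores costs both approaches share, such as rewriting a test when behavior really changes.

```typescript
// All figures are hypothetical inputs in cents, not measured data.
type CostInputs = {
  runsPerMonth: number;
  llmCostPerRun: number; // live intent file: model tokens spent per run
  layoutChangesPerMonth: number; // layout-only changes that break a selector
  repairCostPerChange: number; // compiled spec: engineer time per repair
};

// Monthly cost of each approach under this simplified model.
function monthlyCost(inputs: CostInputs): { compiled: number; live: number } {
  return {
    compiled: inputs.layoutChangesPerMonth * inputs.repairCostPerChange,
    live: inputs.runsPerMonth * inputs.llmCostPerRun,
  };
}

// Selector-breaking changes per month at which both approaches cost the same.
function breakEvenChanges(inputs: CostInputs): number {
  return (inputs.runsPerMonth * inputs.llmCostPerRun) / inputs.repairCostPerChange;
}

const example: CostInputs = {
  runsPerMonth: 600,
  llmCostPerRun: 5,
  layoutChangesPerMonth: 2,
  repairCostPerChange: 2500,
};

const costs = monthlyCost(example);
const threshold = breakEvenChanges(example);
console.log(costs, threshold);
```

With these made-up inputs the compiled spec costs 5000 cents a month in repairs against 3000 cents of model usage, and the two approaches break even at 1.2 selector-breaking changes per month. Plugging in your own run volume and repair time is the useful part.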

Maintenance vs generation, answered

Is test generation or test maintenance the bigger cost?

Generation is a one-time cost per test. Maintenance is recurring, and on a locator-bound suite it dominates the total over the life of the project, because every UI change can break a selector that was correct when it was generated. The useful reframe is that the two are not separate problems. The artifact your generator emits decides your maintenance bill. If generation produces TypeScript with selectors baked in, you have signed up for maintenance the moment the file is written. If generation produces plain-English intent that gets re-derived against the live page on every run, there is almost nothing to maintain.

How is this different from Playwright's Generator and Healer agents?

Playwright ships three test agents: a Planner that explores the app and writes a Markdown plan, a Generator that compiles that plan into Playwright test scripts in TypeScript, and a Healer that runs failing tests in debug mode and patches them. The Generator step is a one-way compile: Markdown plan goes in, TypeScript with concrete selectors comes out. From that point the Healer maintains the compiled TypeScript, not the plan. The plan becomes stale documentation. Assrt removes the compile step entirely. The plain-English #Case file is the test, and assrt_test executes it directly every run.

Does Assrt produce a .spec.ts file I can commit to my repo?

No, and that is deliberate. Assrt's test artifact is a #Case block in plain-English Markdown, written to /tmp/assrt/scenario.md and meant to be checked into your repo as Markdown. There is no generated TypeScript spec sitting between generation and maintenance to go stale. If you specifically want a static, agent-free .spec.ts that runs with zero LLM calls, that is what playwright codegen and the Generator agent give you, and Assrt does not. The trade is covered in the comparison table on this page.

What exactly do assrt_plan and assrt_diagnose output?

Both emit the same grammar. assrt_plan, the generator, runs on a system prompt that says to generate test cases in an exact format starting with #Case 1:, keep each case to 3 to 5 actions, and produce 5 to 8 cases. assrt_diagnose, the maintenance tool, runs on a different system prompt that classifies the failure as an app bug, a flawed test, or an environment issue, then emits a Corrected Test Scenario in the exact same #Case format. Both prompts live in assrt-mcp/src/mcp/server.ts and both call Claude Haiku. Two tools, two separate model calls, one output format.

If there is no compiled spec, when do I ever maintain a test?

Only when the app's actual behavior changed in a way your intent no longer describes. A renamed button, a re-wrapped element, or a swapped data-testid is not a maintenance event for Assrt, because the agent re-snapshots the accessibility tree before every action and matches the step's intent against the roles and labels it finds there. You maintain a #Case when, say, signup grows a new required step. Then you call assrt_diagnose, it returns a corrected #Case, and you paste it back into scenario.md. That edit is the only maintenance, and it is in the same format the generator produced.
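The paste-back step is small enough to script. The helper below is a hypothetical sketch, not part of Assrt: `replaceCase`, the sample scenario, and the sample corrected case are all invented for illustration, and it assumes each case runs from its `#Case N:` header to the next header.

```typescript
// Hypothetical helper: swap one #Case block in a scenario file for a corrected one.
function replaceCase(scenario: string, correctedBlock: string): string {
  const headerPattern = /^#Case\s+(\d+):/;
  const correctedLines = correctedBlock.trim().split("\n");
  const target = headerPattern.exec(correctedLines[0] ?? "");
  if (!target) throw new Error("corrected block must start with a #Case header");

  const output: string[] = [];
  let skipping = false;
  let replaced = false;
  for (const line of scenario.split("\n")) {
    const header = headerPattern.exec(line.trim());
    if (header) {
      // Skip the old block when its case number matches the corrected one.
      skipping = header[1] === target[1];
      if (skipping) {
        output.push(...correctedLines, "");
        replaced = true;
      }
    }
    if (!skipping) output.push(line);
  }
  if (!replaced) throw new Error(`no #Case ${target[1]} found in scenario`);
  return output.join("\n").trimEnd() + "\n";
}

// Made-up scenario file and corrected case for the example.
const scenarioFile = `#Case 1: Sign up
Open the signup page
Submit the form

#Case 2: Log in
Open the login page
Submit valid credentials
`;

const correctedCase = `#Case 1: Sign up
Open the signup page
Accept the terms checkbox
Submit the form`;

const updatedScenario = replaceCase(scenarioFile, correctedCase);
console.log(updatedScenario);
```

Only the matching case is rewritten; every other case in the file passes through untouched, which keeps the resulting diff as small as the behavior change that caused it.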

Does running tests this way cost money per run?

Yes. Because Assrt re-derives the executable test from the live page on every run, each run spends Claude Haiku tokens to drive the agent. A compiled Playwright spec spends nothing per run: it is static and deterministic. That is a real advantage of the compiled approach and you should weigh it honestly. Assrt earns its keep when your UI changes often enough that maintenance, not running, is your cost center. If your UI is frozen and a test almost never changes, a compiled spec is cheaper to run forever.

Can I use Assrt alongside an existing Playwright suite?

Yes. Assrt runs on Playwright under the hood, driving a real Chromium process through the Playwright MCP bridge, so a scenario.md file can sit next to a Playwright project with no conflict. A common pattern is to keep your stable, rarely-changing flows as compiled specs and move the flows that churn most, the ones generating the maintenance PRs, into #Case files. You are comparing two artifacts, not two test runners.