Sentence to Playwright test generator: a runtime, not a compiler
Every other tool in this category turns your English into a .spec.ts file. Assrt does not. Your sentences stay sentences. The translation to Playwright happens once per run, against the page that actually exists.
Direct answer (verified 2026-05-11)
With Assrt you write 3-5 imperative sentences per #Case block in scenario.md, and an LLM agent dispatches each sentence at run time against @playwright/mcp. No .spec.ts file is generated. The dispatch logic is at src/core/agent.ts:693 in the open source repo.
Compilers vs runtimes for sentence-to-Playwright
Search for "sentence to playwright test generator" and you mostly find Playwright Codegen, which is the opposite shape (browser actions in, TypeScript out), plus a handful of AI tools that read your English description and emit a .spec.ts file. Those AI tools are compilers. Sentence in, code out, code is what runs.
The category has a quiet failure mode: as soon as the file is generated, the file becomes the artifact. Your sentence is now a comment at the top of a TypeScript test that nobody re-reads. The locator strings inside the file are frozen against the DOM that existed at generation time. When the design system bumps a data-testid from v3 to v4, the test goes red and a human (or a self-healing model) edits the locator. The sentence is unchanged but the file is now stale.
Assrt picks the other shape. It treats the sentence as a runtime instruction, not as compiler input. The artifact you commit is the sentence file itself. CI reads the same file, calls snapshot() on the live page, hands the sentence to a model with the live tree, and dispatches one of 17 typed tools. Nothing in between is allowed to drift, because nothing in between is written down.
The two shapes side by side
Other tools in this category compile your sentence into a .spec.ts file you commit. The locator becomes part of the artifact. When the UI changes, the compiled selector is stale and the test fails. Self-healing tools then try to guess the new selector from the failure. The compiled test is the source of truth; the sentence is a comment that nobody reads.
- sentence -> .spec.ts at generation time
- locator strings frozen in the test file
- UI change breaks the file, then 'self-heals'
- the file, not the sentence, is what CI runs
What happens to the sentence "Click the Sign Up button"
Walk through one specific sentence on a real run. You write it as step 4 of #Case 1 in scenario.md. You start the suite. The runner does five things in order, none of which involve generating code.
- Split the file. `parseScenarios` at src/core/agent.ts:620 runs one regex, `/(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi`, and turns your file into named blocks. The block whose name is "A new user signs up and lands on the dashboard" arrives at the run loop intact.
- Read the live page. The runner calls `this.browser.snapshot()`. The accessibility tree comes back as text with reference attributes: `[ref=e17] button "Sign Up"`. The number 17 is whatever Playwright assigned this run; next run it might be 23. Nothing depends on the number being stable across runs.
- Hand the sentence to a model. The runner posts a message containing the screenshot, the tree text, the case name, and the sentence body. System prompt: `SYSTEM_PROMPT` at agent.ts:198. Tools available: `TOOLS` at agent.ts:16-196. The model returns one tool_use block: `click(element="Sign Up button", ref="e17")`.
- Run the click. The switch statement at agent.ts:766 matches `case "click"`, which calls `this.browser.click("Sign Up button", "e17")`. That hits @playwright/mcp, which translates the ref into the actual DOM node and clicks it. Real Playwright, real Chromium, no proprietary runner in the way.
- Move on. The next sentence ("Wait for the page to stabilize") gets the same treatment, but the model now picks `wait_for_stable` instead of `click`. Same loop, same tools, no compilation step ever. The mapping from English to Playwright is decided fresh, every step.
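The split in step one can be sketched in a few lines. This is an illustrative re-implementation using the regex quoted above, not the actual code at src/core/agent.ts:620 (the function name is borrowed for clarity):

```typescript
// Illustrative sketch of the scenario split described above. The regex
// is the one quoted in the text; everything else is an assumption.
const CASE_MARKER = /(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi;

function splitScenarios(markdown: string): string[] {
  return markdown
    .split(CASE_MARKER)              // one entry per #Case block
    .map((block) => block.trim())
    .filter((block) => block.length > 0);
}

const file = [
  "#Case 1: A new user signs up and lands on the dashboard",
  "Navigate to /signup.",
  "Click the Sign Up button.",
  "#Case 2: Login rejects a wrong password",
  "Navigate to /login.",
].join("\n");

const blocks = splitScenarios(file);
// blocks[0] begins with the first case's name, sentences intact
```

Note the split keeps the sentences verbatim; nothing is rewritten before the block reaches the run loop.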
One sentence, dispatched at run time
What you actually commit
Compare the two artifacts for the same five-sentence flow. Left is what a compiler emits and you have to maintain. Right is the file Assrt reads at run time. The right side is the source of truth and the test runner; there is no separate compiled file.
Same flow, two artifacts
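The right-hand artifact is scenario.md itself. A plausible sketch of it for this flow (the wording is illustrative, not actual generator output):

```markdown
#Case 1: A new user signs up and lands on the dashboard
Navigate to /signup.
Create a temporary email address and type it into the Email field.
Type a strong password into the Password field.
Click the Sign Up button.
Assert: the URL path is /dashboard and a heading reads "Dashboard".
```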
```typescript
// generated_signup_test.spec.ts
// Compiled from a single sentence: "A new user signs up
// and lands on the dashboard." Now you maintain this file.
import { test, expect } from "@playwright/test";

test("A new user signs up and lands on the dashboard", async ({ page }) => {
  await page.goto("https://example.com/signup");
  await page
    .locator('input[data-testid="signup-email-v3"]')
    .fill("test+1747900000@assrt.dev");
  await page
    .locator('input[data-testid="signup-password-v3"]')
    .fill("Hunter2-Hunter2-Hunter2");
  await page
    .getByRole("button", { name: /^Sign up$/ })
    .click();
  await page.waitForURL(/.*\/dashboard$/, { timeout: 10_000 });
  await expect(
    page.getByRole("heading", { name: /Dashboard/i, level: 1 }),
  ).toBeVisible();
});
// Ship a v4 design system and 'signup-email-v3' is gone.
// This file goes red until someone bumps every locator.
```

The 17 tools your sentence can become
The translator is small. There are exactly 17 tool definitions in src/core/agent.ts, lines 16-196. Every sentence in your scenario.md becomes a sequence of one or more of these. The toolset is small enough that you can read it in a sitting and build a mental model of which sentence shapes the agent will dispatch correctly.
| Tool | When the sentence becomes this |
|---|---|
| navigate | Sentence names a URL or path. Wraps page.goto. |
| snapshot | Always called first. Returns the accessibility tree with [ref=eN]. |
| click | Sentence names an affordance. Maps the visible name to a ref via the snapshot. |
| type_text | Sentence says 'Type X into Y'. Clears existing content first. |
| select_option | Sentence picks values from a dropdown. |
| scroll | Sentence asks to scroll to bring something into view. |
| press_key | Sentence presses Enter, Tab, Escape, etc. |
| wait | Sentence waits for visible text or a fixed duration up to 10s. |
| wait_for_stable | Sentence waits for streaming or async content. Injects a MutationObserver. |
| screenshot | Sentence asks for visual evidence. JPEG bytes back to the model. |
| evaluate | Sentence requires a one-liner of JavaScript (e.g. paste an OTP). |
| create_temp_email | Sentence needs a disposable inbox before filling a signup form. |
| wait_for_verification_code | Sentence waits for the OTP at the disposable address. Polls 60s. |
| check_email_inbox | Sentence inspects the disposable inbox for a magic link or code. |
| assert | Sentence makes a verification claim. Records pass/fail with evidence. |
| http_request | Sentence verifies an external side effect (a webhook, a bot message). |
| complete_scenario | Sentence ends the case with a summary and overall pass/fail. |
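To make the table concrete, here is a hedged sketch of what one tool definition might look like in the Anthropic tool-use format. The field names `element` and `ref` come from the tool_use example in the text; everything else is an assumption, not the actual schema at src/core/agent.ts:16-196:

```typescript
// Hypothetical shape of the click tool definition, in the Anthropic
// tool-use format. Field names follow the { element, ref } tool_use
// example quoted in the article; the real schemas may differ.
interface ToolDefinition {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

const clickTool: ToolDefinition = {
  name: "click",
  description: "Click a visible affordance identified in the last snapshot.",
  input_schema: {
    type: "object",
    properties: {
      element: { type: "string", description: "Visible name, e.g. 'Sign Up button'" },
      ref: { type: "string", description: "Snapshot reference, e.g. 'e17'" },
    },
    required: ["element", "ref"],
  },
};
```

The point of typed schemas is that the model cannot emit a free-form Playwright call; it can only fill in the blanks of one of these definitions.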
One tool is intentionally omitted from the table above: suggest_improvement, which the agent uses to flag UX bugs it noticed that are not part of the test. It is not a sentence target; it is a side channel for the agent to write back.
A live trace, sentence by sentence
What you see in your terminal when you run a #Case. Every line maps to either a sentence boundary or one of the 17 tool calls the dispatcher just emitted. Notice that the second sentence ("Wait for the page to stabilize") chose wait_for_stable and the third (an assert) chose the assert tool. Both decisions were made at run time by the model reading the sentence, not by anything pre-compiled.
Why this matters the day someone bumps the design system
The reason a runtime beats a compiler for this category is the quiet, ongoing maintenance bill. A compiled Playwright test is bound to a specific DOM. The bind is implicit (a locator string in your test file), and the bind is what breaks. Even tools that advertise self-healing selectors are still managing the bind, just in a less obnoxious way: the bind exists, the bind goes stale, the model guesses the new bind, the test goes green again.
With Assrt the bind does not exist between runs. Your sentence "Click the Sign Up button" refers to a visible affordance named "Sign Up". As long as that affordance is still present in some form on the page (a button, a link styled as a button, a card with the right label), the snapshot will surface it and the model will dispatch a click on it. The new ref will be different from the old ref. Your sentence does not care.
The sentence only fails when the visible affordance the sentence describes is gone. That is the failure you actually wanted: the test should go red when the user-facing flow breaks, not when an internal data-testid gets renamed in a refactor that changed nothing for the user.
Frequently asked questions
What is a sentence to Playwright test generator and what makes Assrt different?
A sentence to Playwright test generator takes plain English input and produces a runnable Playwright test. Most tools in this category are compilers: you describe a flow in English, the tool emits a .spec.ts file with locators and expect chains, and that file is what runs. Assrt is not a compiler. The tool stores your sentences in a Markdown file at /tmp/assrt/scenario.md, splits them into named #Case blocks via a regex (parseScenarios at src/core/agent.ts:620), then hands each block to an LLM along with a fresh accessibility snapshot. The LLM reads the sentence, looks at the live page tree, and chooses one of 17 typed tools (snapshot, click, type_text, scroll, assert, and so on, defined at agent.ts:16-196) for every step. There is no .spec.ts on disk. The translation from sentence to Playwright tool call happens once per run, against the page that exists right now.
Where is the actual line of code that turns a sentence into a Playwright tool call?
The dispatch loop is at src/core/agent.ts lines 693-747. Each iteration sends the current message stack to Claude with system: SYSTEM_PROMPT (line 198 in the same file) and tools: TOOLS, gets back zero or more tool_use blocks, and runs them. The tools are concrete wrappers around @playwright/mcp primitives: case 'click' on line 784 calls this.browser.click(element, ref); case 'type_text' on line 791 calls this.browser.type(element, text, ref); case 'snapshot' on line 778 calls this.browser.snapshot(). The model is the only translator. Your sentence 'Click the Sign Up button' is never compiled into a string like page.getByRole('button', { name: 'Sign Up' }).click(). It becomes a tool_use with name='click' and input={ element: 'Sign Up button', ref: 'e17' }, where 'e17' came from the snapshot the model just read.
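As a sketch only, the dispatch this answer describes reduces to a switch over tool names. The `browser` interface below stands in for the @playwright/mcp wrapper; its method names follow the calls quoted above (click, type, snapshot), but the types are assumptions, not the repo's actual interfaces:

```typescript
// Minimal sketch of the dispatch described above. The real loop
// (agent.ts:693-747) handles all 17 tools, errors, and the message
// stack; this covers three cases to show the shape.
type ToolUse =
  | { name: "click"; input: { element: string; ref: string } }
  | { name: "type_text"; input: { element: string; text: string; ref: string } }
  | { name: "snapshot"; input: Record<string, never> };

interface Browser {
  click(element: string, ref: string): Promise<void>;
  type(element: string, text: string, ref: string): Promise<void>;
  snapshot(): Promise<string>;
}

async function dispatch(tool: ToolUse, browser: Browser): Promise<string> {
  switch (tool.name) {
    case "click":
      await browser.click(tool.input.element, tool.input.ref);
      return `clicked ${tool.input.element}`;
    case "type_text":
      await browser.type(tool.input.element, tool.input.text, tool.input.ref);
      return `typed into ${tool.input.element}`;
    case "snapshot":
      return browser.snapshot(); // accessibility tree text goes back to the model
  }
}
```

The return value becomes the tool_result message for the model's next turn, which is how the loop stays grounded in the live page.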
If nothing is compiled, what artifact do I commit to my repo?
scenario.md, the same Markdown file the generator wrote. The on-disk layout is defined at /Users/matthewdi/assrt-mcp/src/core/scenario-files.ts: scenario.md is the plan, scenario.json is metadata (UUID and name), results/latest.json is the most recent run report. You can grep scenario.md, diff it on a pull request, render it on GitHub, or paste half of it into a different project. There is no generated TypeScript next to it that needs to stay in sync. When CI runs your suite, the runner reads the same file, gets a fresh accessibility snapshot from the live app, and re-decides every tool call. So the artifact you commit and the artifact CI runs are literally one file of plain English.
What are the 17 tools the model can pick for any given sentence?
navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, wait_for_stable, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, http_request, complete_scenario. (suggest_improvement also appears in the toolset, but as a side channel for flagging UX issues rather than a sentence target.) Full schemas are at src/core/agent.ts lines 16-196. Three of those (create_temp_email, wait_for_verification_code, check_email_inbox) wire a disposable inbox so signup flows can complete an OTP loop without a human. wait_for_stable injects a MutationObserver into the page (agent.ts:956 onward) and waits until DOM mutations stop, which is what an English sentence like 'Wait for the search results to load' becomes at runtime. http_request lets the agent verify external side effects (a webhook fired, a Telegram bot got the message). The toolset is small enough to read in one sitting and big enough that almost any user-facing sentence has a tool that fits.
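The MutationObserver itself runs browser-side, but the waiting logic it implies can be sketched generically: poll a readout of the page and resolve once it stops changing for a quiet period. This illustrates the idea only; it is not the code at agent.ts:956:

```typescript
// Generic "wait until it stops changing" loop. In Assrt the readout
// would be DOM mutation counts from an injected MutationObserver;
// here it is any function returning a comparable value.
async function waitForStable<T>(
  read: () => T,
  quietMs = 500,
  timeoutMs = 10_000,
  pollMs = 50,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  let last = read();
  let quietSince = Date.now();
  while (Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, pollMs));
    const current = read();
    if (current !== last) {
      last = current;
      quietSince = Date.now(); // page mutated; restart the quiet window
    } else if (Date.now() - quietSince >= quietMs) {
      return current;          // no changes for quietMs: stable
    }
  }
  throw new Error(`not stable within ${timeoutMs}ms`);
}
```

This is why a sentence like 'Wait for the search results to load' needs no hard-coded selector or sleep: stability is defined by the page going quiet, not by any named element.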
What happens to my sentences when the UI changes? Don't they break the same way locators do?
No, and this is the load-bearing reason a runtime beats a compiler for this category. A compiled test contains a frozen reference: page.locator('button[data-testid=signup-v3]'). When the build pipeline ships a new design system and that data-testid becomes signup-v4, the compiled test breaks and your CI goes red. Self-healing tools paper over this by guessing the right new selector after the failure. With Assrt, your sentence 'Click the Sign Up button' contains no selector. At run time the agent calls snapshot() on the new page, gets the new accessibility tree (which still has a button with the visible text 'Sign Up'), and dispatches click with the new ref. There is nothing to heal because nothing was ever bound. The only way the sentence breaks is if the visible affordance the sentence describes is genuinely gone from the page, in which case the test should fail because the user-facing flow is gone.
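For simple cases, the name-to-ref resolution the model performs can be approximated mechanically. A hedged sketch, assuming the snapshot line format quoted earlier (`[ref=e17] button "Sign Up"`); the real resolution is the model reading the whole tree, not a regex:

```typescript
// Naive resolution of a visible name to a snapshot ref. Handles only
// the exact line format quoted in the article; the model handles
// synonyms, nesting, and ambiguity that this sketch cannot.
function findRef(snapshotText: string, role: string, name: string): string | null {
  const line = new RegExp(`\\[ref=(e\\d+)\\]\\s+${role}\\s+"${name}"`);
  const match = snapshotText.match(line);
  return match ? match[1] : null;
}

const tree = [
  '[ref=e12] textbox "Email"',
  '[ref=e17] button "Sign Up"',
].join("\n");

// The ref is whatever Playwright assigned this run; the sentence
// only ever carries the visible name "Sign Up".
const ref = findRef(tree, "button", "Sign Up"); // "e17"
```

On the next run the same call might return "e23"; nothing is stored between runs, so there is nothing to go stale.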
How is this different from Playwright Codegen, which also generates tests for you?
Codegen is the opposite shape. You drive a real browser by hand, Codegen records your clicks and keystrokes, and it emits TypeScript with locators inferred from your actions. The input is not sentences, it is a recording. The output is a .spec.ts file that lives in your repo and that you maintain forever. Both Codegen and Assrt write tests so you do not have to, but they pick different inputs and different artifacts. Use Codegen when you have a flow you can demonstrate but cannot articulate, and you want a TypeScript spec to commit. Use Assrt when you can describe a flow in English and you want the description itself to be the durable test, with no compiled artifact in between.
What does the model actually see for a single sentence at run time?
On the first sentence of a scenario it gets one user message containing the initial page screenshot (base64 JPEG), the full accessibility tree from the first snapshot call, the scenario name, the full sentence body, and (optionally) a Pass Criteria block and any test variables. The exact assembly is at agent.ts:679. After that, it gets back tool_use blocks, the runner executes them, and the results (snapshot text, navigation outcome, click confirmation, screenshot bytes) come back as tool_result messages on the next loop turn. The model is never asked to memorize selectors between turns; every snapshot is a fresh reading of the live page. The reason this is cheap enough to run in CI is that snapshots are accessibility text, not screenshots: a 41-node accessibility tree fits in a few hundred tokens, while a screenshot is JPEG-compressed and only emitted after visual actions to keep the round trip lean.
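A hedged sketch of that first-turn message in the Anthropic content-block shape. The ordering and labels are placeholders; the real assembly at agent.ts:679 may differ:

```typescript
// Hypothetical first user message for a scenario. The content-block
// shapes follow the Anthropic Messages API; the assembly order and
// text labels are assumptions.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "image"; source: { type: "base64"; media_type: "image/jpeg"; data: string } };

function firstTurn(
  caseName: string,
  sentences: string,
  tree: string,
  screenshotB64: string,
): ContentBlock[] {
  return [
    { type: "image", source: { type: "base64", media_type: "image/jpeg", data: screenshotB64 } },
    { type: "text", text: `Accessibility tree:\n${tree}` },
    { type: "text", text: `Case: ${caseName}\n${sentences}` },
  ];
}
```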
How do I write a sentence the model will reliably dispatch correctly?
Three rules, all enforced by the PLAN_SYSTEM_PROMPT at src/mcp/server.ts:219-236 even when you write the sentences yourself. First, name the affordance by what a user sees, not what the developer named it: 'Click the Sign Up button' beats 'Click the primary CTA in the hero'. Second, verify observable things: 'Assert: the URL path is /dashboard' or 'Assert: the heading reads Welcome' both work; 'Assert: the button has the right CSS color' will fail because the agent has no CSS introspection tool. Third, keep cases short: 3-5 sentences per #Case is the sweet spot, because each sentence costs a model round trip plus a snapshot. A single 20-step #Case spends more wall-clock time and is harder to diagnose when it fails. The same prompt also caps you at 5-8 cases per scenario for the same reason. None of this is enforced by a parser; the prompt sets the norm and the model executes accordingly.
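Applied together, the three rules produce cases like this hypothetical block (the flow and wording are illustrative, not generator output):

```markdown
#Case 2: Login rejects a wrong password
Navigate to /login.
Type "nobody@example.com" into the Email field.
Type "wrong-password" into the Password field.
Click the Log In button.
Assert: an error message containing "Incorrect" is visible and the URL path is still /login.
```

Every affordance is named by its visible text, and the single assertion checks only things the agent can observe from a snapshot.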
Can I edit the sentences after they're generated, or is scenario.md write-once?
Edit freely. The MCP server's instructions explicitly call this out: 'Read /tmp/assrt/scenario.md to see the current plan; Edit /tmp/assrt/scenario.md to modify test cases; changes auto-sync to cloud storage.' Each test run is auto-saved with a UUID (visible in scenario.json), and you can re-run the same scenario later with assrt_test({url, scenarioId: '...'}). The common workflow is: generate a draft from a URL, open scenario.md in your editor, delete the cases you do not care about, rewrite a sentence that the model phrased awkwardly, add a new case the generator missed (often something behind auth that the URL-only generator could not see). Save, run. The runner picks up the new file by mtime.
Want to see this on your real app?
A 20-minute call: bring a URL, we run assrt_plan and assrt_test on it live, you keep the scenario.md.