Assertion coverage in generated Playwright tests: the one rule that turns click logs into tests
Most AI-generated Playwright tests are click logs with zero or weak assertions. Assrt fixes that with 7 lines of system prompt: every Verify/Check/Assert/Confirm/Ensure bullet in the plan must produce exactly one assert(description, passed, evidence) call. Impossible to verify becomes passed=false, not silent omission.
Assertion coverage in Assrt-generated Playwright tests is enforced by a section of the agent's system prompt titled Assertion Coverage (CRITICAL — non-negotiable), at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 256-262. Every scenario bullet beginning with Verify, Check, Assert, Confirm, or Ensure must produce exactly one assert(description, passed, evidence) tool call. Skipping is forbidden; an impossible-to-verify bullet becomes passed=false with an evidence string describing what was missing. The same rule is re-stated in the per-run user prompt at line 731. Repo: github.com/assrt-ai/assrt-mcp.
Click logs are not tests
If you ask a generic coding agent to write a Playwright test for a signup flow, you usually get back a sequence of page.fill and page.click calls ending in waitForURL, and zero expect calls. The file looks like a test, the CI runs it, the green check lands. Then the dashboard renders blank for new users for three days and nobody notices because the test never opened it.
The shape of the bug is consistent across every AI test-generation product I have run side by side. The model is good at recording user actions. It is bad at deciding what to check after each action. Asked to verify, it produces a vague matcher; asked nothing, it produces nothing. The path of least resistance is to skip the assertion. The training signal points at code that runs, not at code that catches the thing you actually wanted to check.
Compare the two outputs for the same signup flow, starting with the click-log shape.
Same flow, two shapes
A typical AI-generated Playwright test, the click-log shape:
await page.goto('/signup');
await page.fill('input[type=email]', 'a@b.c');
await page.fill('input[type=password]', 'hunter2');
await page.click('button:has-text("Create account")');
await page.waitForURL(/dashboard/);
// Done. Pass. Zero assert calls.
The test passes if the navigation lands. It passes if the dashboard is broken and shows a blank screen. It passes if the welcome message says someone else's name. The model wrote a click log and called it a test.
- Zero assert calls; the test passes if navigation lands
- Dashboard could be broken and the test still goes green
- No record of what was actually on the page at any step
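One cheap way to catch the click-log shape before it reaches CI is to count assertion calls in the generated spec source. The sketch below is a hypothetical lint, not part of Assrt: countExpectCalls, countPageActions, and isClickLog are illustrative names, and the regexes are a rough text match rather than a real parser (a commented-out expect would still be counted).

```typescript
// Hypothetical lint (not part of Assrt): flag generated specs that act but never assert.
// Regexes over source text are a rough heuristic; a real check would parse the AST.

/** Count `expect(` call sites in a spec's source text. */
function countExpectCalls(source: string): number {
  const matches = source.match(/\bexpect\s*\(/g);
  return matches === null ? 0 : matches.length;
}

/** Count common Playwright page actions such as page.click or page.fill. */
function countPageActions(source: string): number {
  const matches = source.match(/\bpage\.(goto|fill|click|press|check|selectOption)\s*\(/g);
  return matches === null ? 0 : matches.length;
}

/** A spec is a click log when it performs actions but makes no assertions. */
function isClickLog(source: string): boolean {
  return countPageActions(source) > 0 && countExpectCalls(source) === 0;
}

// The click-log spec from the example above.
const clickLogSpec = `
await page.goto('/signup');
await page.fill('input[type=email]', 'a@b.c');
await page.fill('input[type=password]', 'hunter2');
await page.click('button:has-text("Create account")');
await page.waitForURL(/dashboard/);
`;

console.log(isClickLog(clickLogSpec)); // true
```

A check like this only tells you that assertions exist, not that they cover what you meant to check. That gap is what the rule in the next section addresses.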
The 7-line rule, verbatim
The mechanism is small enough that you can read every line. From /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 256-262, lifted verbatim from the system prompt the agent sees on every run:
## Assertion Coverage (CRITICAL — non-negotiable)
Every line in the scenario steps that starts with "Verify", "Check", "Assert", "Confirm", or "Ensure" is a MANDATORY assertion. You MUST produce exactly one assert tool call for each such line.
- Do NOT silently merge two bullets into one assert call.
- Do NOT skip a bullet because it seems redundant or hard to check. If verifying it is genuinely impossible, call assert with passed=false and evidence describing what was missing.
- Do NOT add extra assertions that were not in the scenario steps.
- Before calling complete_scenario, re-read each Verify-class bullet and confirm you have made one corresponding assert call.
- The description field of each assert call MUST closely match the wording of the bullet it covers.
The same rule is re-asserted in the per-run user prompt at line 731 of the same file, so the agent encounters it twice per scenario before it sees the page. The redundancy is on purpose; the system prompt is long, and the per-run reminder lands the rule closer to where the agent is reasoning.
What the rule actually enforces
Coverage rules the agent reads before every run
- Every Verify/Check/Assert/Confirm/Ensure line produces exactly one assert tool call. The agent cannot silently merge two bullets into one call.
- A bullet cannot be skipped because it looks redundant or hard to verify. The agent must attempt it; failure to verify becomes passed=false with evidence describing what was missing.
- The agent cannot add assertions outside the scenario. Over-asserting is a separate failure mode and the prompt rejects it for the same reason.
- Before complete_scenario, the agent must re-read the bullets and confirm one-to-one coverage. Missing assertions get added in the same conversation turn.
- The description field of each assert call must closely match the bullet wording so a reviewer can match assertions to bullets line-by-line.
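Because descriptions must track bullet wording, a reviewer-side script can pair the two lists mechanically. The following is a minimal sketch of that idea, not Assrt's implementation: the word-overlap score, the 0.6 threshold, and every name in it (pairBullets, overlap, PairingResult) are assumptions chosen for illustration.

```typescript
// Hypothetical reviewer-side check; names and the overlap threshold are assumptions.

interface AssertRecord {
  description: string;
  passed: boolean;
  evidence: string;
}

interface PairingResult {
  uncovered: string[];       // bullets with no matching assert call
  unplanned: AssertRecord[]; // assert calls that match no bullet
}

/** Lowercased word set, ignoring punctuation. */
function words(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0));
}

/** Fraction of the bullet's words that also appear in the description. */
function overlap(bullet: string, description: string): number {
  const bulletWords = words(bullet);
  const descWords = words(description);
  if (bulletWords.size === 0) return 0;
  let shared = 0;
  bulletWords.forEach((w) => { if (descWords.has(w)) shared += 1; });
  return shared / bulletWords.size;
}

/** Greedily pair each bullet with its best-scoring unused assert call. */
function pairBullets(bullets: string[], asserts: AssertRecord[], threshold = 0.6): PairingResult {
  const used = new Set<number>();
  const uncovered: string[] = [];
  for (const bullet of bullets) {
    let best = -1;
    let bestScore = 0;
    asserts.forEach((a, i) => {
      const score = overlap(bullet, a.description);
      if (!used.has(i) && score > bestScore) { best = i; bestScore = score; }
    });
    if (best >= 0 && bestScore >= threshold) used.add(best);
    else uncovered.push(bullet);
  }
  return { uncovered, unplanned: asserts.filter((_, i) => !used.has(i)) };
}

const planBullets = [
  "Verify the dashboard heading appears",
  "Confirm the success toast renders for 3 seconds",
];
const recorded: AssertRecord[] = [
  { description: "Verify the dashboard heading appears", passed: true, evidence: "Heading found" },
  { description: "Page background is white", passed: true, evidence: "Computed style checked" },
];

const pairing = pairBullets(planBullets, recorded);
console.log(pairing.uncovered.length, pairing.unplanned.length); // 1 1
```

In this example the toast bullet was skipped and the background check was never in the plan, so the sketch reports one violation in each direction.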
The assert primitive is three fields, all required
The tool schema is at agent.ts lines 132-143. Three required fields, no optional ones:
{
  name: "assert",
  description: "Make a test assertion about the current page state.",
  input_schema: {
    type: "object",
    properties: {
      description: { type: "string", description: "What you are asserting" },
      passed: { type: "boolean", description: "Whether the assertion passed" },
      evidence: { type: "string", description: "Evidence for the result" },
    },
    required: ["description", "passed", "evidence"],
  },
}
There is no expect(locator) here, no matcher family, no timeout argument on the assert call itself. The agent has already waited via wait_for_stable and taken a fresh accessibility-tree snapshot before it asserts; the assert call records the verdict and the evidence the agent saw. The structured record lands in /tmp/assrt/results/latest.json as one entry in the scenarios[].assertions array, with a timestamp and the exact arguments the agent passed.
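If you consume these records in your own tooling, the three required fields map onto a small TypeScript type. The guard below is a sketch that mirrors the schema above; AssertArgs and isAssertArgs are illustrative names, not exports of the Assrt package.

```typescript
// Sketch of a runtime guard mirroring the schema's three required fields.
// `AssertArgs` and `isAssertArgs` are illustrative names, not Assrt exports.

interface AssertArgs {
  description: string; // what is being asserted
  passed: boolean;     // the verdict
  evidence: string;    // what the agent saw
}

/** True only when all three fields are present with the schema's types. */
function isAssertArgs(value: unknown): value is AssertArgs {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["description"] === "string" &&
    typeof v["passed"] === "boolean" &&
    typeof v["evidence"] === "string"
  );
}

// A record with no evidence is rejected, matching the "no optional fields" schema.
console.log(isAssertArgs({ description: "Heading appears", passed: true })); // false
```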
“A test with 12 navigations and 0 assertions is a click log. Mandatory one-to-one coverage is the smallest fix that turns it back into a test.”
agent.ts:256-262
How to measure assertion coverage from a run report
After a run, two files matter. /tmp/assrt/scenario.md is the plan the agent executed. /tmp/assrt/results/latest.json is the structured run report. To compute assertion coverage as a number:
- Parse scenario.md and count lines whose first word is Verify, Check, Assert, Confirm, or Ensure. Call this expected.
- In latest.json, read scenarios[i].assertions.length. Call this actual.
- Coverage is actual / expected. Because the rule forces one-to-one mapping, the ratio should be exactly 1.0. Any value lower than 1.0 is a bug in the run, not an acceptable percentage.
- The pass/fail mix is a separate axis. passed=false assertions still count as covered; they are honest failures, not skips.
This is the inversion that the conventional Playwright assertion guides do not cover. They explain how each individual expect works and tell you to write more of them. They do not give you a way to know, at generation time, that the assertions in front of you cover the intent you wrote down. The 1:1 rule is that way.
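The counting steps above can be sketched as a pair of pure functions. Several things here are assumptions rather than documented behavior: that plan bullets may carry a list marker such as "-" or "1.", that the plan text looks like the sample, that the report has the shape { scenarios: [{ assertions: [...] }] }, and that summing across all scenarios is the right aggregation for a multi-case plan. The function names are hypothetical.

```typescript
// Sketch under assumptions: plan format and report shape are inferred, not documented.

const VERIFY_WORDS = ["verify", "check", "assert", "confirm", "ensure"];

/** Count plan lines whose first word is a Verify-class keyword. */
function countVerifyBullets(planText: string): number {
  return planText.split(/\r?\n/).filter((line) => {
    // Drop an optional list marker like "-", "*", "1." or "1)".
    const stripped = line.trim().replace(/^(?:[-*]|\d+[.)])\s+/, "");
    const firstWord = (stripped.split(/\s+/)[0] ?? "").toLowerCase().replace(/[^a-z]/g, "");
    return VERIFY_WORDS.indexOf(firstWord) >= 0;
  }).length;
}

interface RunReport {
  scenarios: { assertions: { description: string; passed: boolean; evidence: string }[] }[];
}

/** actual / expected; a ratio above 1 means assertions outside the plan. */
function assertionCoverage(planText: string, report: RunReport): number {
  const expected = countVerifyBullets(planText);
  const actual = report.scenarios.reduce((n, s) => n + s.assertions.length, 0);
  if (expected === 0) return actual === 0 ? 1 : Infinity;
  return actual / expected;
}

const plan = `
#Case 1: Signup
- Go to /signup
- Fill in the email and password
- Verify the dashboard heading appears
- Confirm the welcome message shows the new user's name
`;

const report: RunReport = {
  scenarios: [
    { assertions: [{ description: "Verify the dashboard heading appears", passed: true, evidence: "Heading found" }] },
  ],
};

console.log(assertionCoverage(plan, report)); // 0.5
```

Under the one-to-one rule, the 0.5 here would indicate a broken run: one Verify-class bullet never produced an assert call.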
What the rule does not solve
One-to-one mapping makes the assertions you ask for verifiable. It does not make the assertions you ask for good. If you write a plan with three Verify bullets that all check incidental properties (the button color, the loading spinner duration, the favicon), Assrt will faithfully produce three assert calls and tell you whether each one passed. The mechanism guarantees coverage of stated intent; the quality of stated intent is on you.
It also does not give you pixel-level visual regression. toHaveScreenshot and toHaveCSS are deliberately out of scope. If you need those, keep them in a separate Playwright spec; Assrt does not monopolise the browser. The split is: Assrt covers behavior-level assertions; classic Playwright covers pixel diffs and performance budgets.
Finally, the rule does not bypass the model's reasoning. The agent still has to identify what to look for, find it in the accessibility tree, and write evidence. The rule changes the default behavior from skip-on-uncertainty to assert-with-passed=false-on-uncertainty. That is a meaningful shift, but it is not the same as having a deterministic verifier in the loop.
“The thing nobody flags about LLM-generated Playwright is that they will produce a 50-line spec that asserts on nothing and call it a green build. A rule that forces 1:1 between scenario bullets and assert calls is the actual fix.”
Try it on your own app
One command runs the discovery agent against your URL, writes a #Case plan to /tmp/assrt/scenario.md, and executes it under the rule:
npx @m13v/assrt discover https://your-app.com
Open the resulting scenario.md and count Verify-class bullets. Open latest.json and count assertions. The counts match. Pick a bullet and read its assert description and evidence. The description matches the bullet wording closely enough that you can read the report next to the plan and follow along. Edit the plan, re-run; the counts still match. That is what assertion coverage in AI-generated Playwright tests looks like when it is enforced at the system-prompt layer instead of left to the model's judgement.
Want to see the rule in action on your staging URL?
20-minute call. You bring the URL and a flow you currently do not assert on; we run discover live and walk through the resulting scenario.md and latest.json side by side. Open source, MIT, no SaaS dashboard.
Frequently asked questions
What does assertion coverage mean for AI-generated Playwright tests, exactly?
Assertion coverage is the share of intent statements in your test plan that produce a verifiable check against the page at run time. Not source-code coverage (lines of your app executed) and not selector coverage (UI areas the agent visited). The distinction matters because most AI test generators are great at coverage of the first two and terrible at the third. They click through a flow, the page changes, they call it pass, and you never see a single expect() or assert(). A test with 12 navigations and 0 assertions is a click log, not a test. Assertion coverage is the fraction of 'I should check that X' statements in the plan that turn into an actual recorded assertion in the run report.
Why do AI-generated Playwright tests usually have weak assertion coverage?
Three reasons compound. First, the model is rewarded by training data for producing 'working' code; tests that pass with fewer assertions look cleaner. Second, asserting on a UI state requires the model to understand what specifically to check, which is harder than knowing what to click; vague plans like 'verify it works' resolve to vague or skipped assertions. Third, when the model is unsure whether a check is possible (the element it expected is not there), the path of least resistance is to omit the check rather than to flag it as failed. Almost every AI-driven test runner inherits at least two of these failure modes. The fix has to happen at the prompt and runtime layer, not at the model layer.
How does Assrt force one-to-one mapping between scenario bullets and assert calls?
With a section in the system prompt titled 'Assertion Coverage (CRITICAL — non-negotiable)' at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 256 to 262, plus a per-run reinforcement in the user prompt at line 731. The rule opens with a mandate: every line in the scenario steps that starts with Verify, Check, Assert, Confirm, or Ensure is a mandatory assertion, and the agent must produce exactly one assert tool call per such line. Five bullets follow. Two bullets cannot be silently merged into one assert. A bullet cannot be skipped because it seems redundant or hard to check; if verifying is genuinely impossible (the element is missing), the agent must call assert with passed=false and explain what was missing as evidence. The agent cannot add extra assertions outside the scenario. Before calling complete_scenario, the agent must re-read each Verify-class bullet and confirm it has made one corresponding assert call. The description field of each assert call must closely match the wording of the bullet, so a reviewer can match assertions to bullets one-to-one.
Show me what an assert tool call looks like in practice.
The assert tool has three required fields: description (what you are asserting), passed (boolean), and evidence (free-text rationale). The schema is at agent.ts lines 132 to 143. A real call for a successful 'Verify the dashboard heading appears' bullet looks like: description='Dashboard heading appears after login', passed=true, evidence='Heading element with role=heading and accessible name "Welcome back, Sarah" found at ref=e14 after wait_for_stable settled at 2.4s'. A real call for a failed bullet, where the element was missing: description='Confirm the success toast renders for 3 seconds', passed=false, evidence='No element with role=alert or text matching "saved" found in accessibility tree across 3 snapshots over 4.1s. Snapshot ref=e8 was the closest candidate but its accessible name was "Loading".' Both calls land in /tmp/assrt/results/latest.json as structured records you can grep.
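Written out as data, the two calls above look like this. The field values are copied from the examples in this answer; the AssertCall type name and the filtering code are illustrative, a sketch of how you might pull failures out of the records rather than anything Assrt ships.

```typescript
// The two example calls as typed records. `AssertCall` is an illustrative name.

interface AssertCall {
  description: string;
  passed: boolean;
  evidence: string;
}

const calls: AssertCall[] = [
  {
    description: "Dashboard heading appears after login",
    passed: true,
    evidence:
      'Heading element with role=heading and accessible name "Welcome back, Sarah" found at ref=e14 after wait_for_stable settled at 2.4s',
  },
  {
    description: "Confirm the success toast renders for 3 seconds",
    passed: false,
    evidence:
      'No element with role=alert or text matching "saved" found in accessibility tree across 3 snapshots over 4.1s. Snapshot ref=e8 was the closest candidate but its accessible name was "Loading".',
  },
];

// Failed assertions are still recorded; they are filtered, not missing.
const failures = calls.filter((c) => !c.passed);
for (const f of failures) {
  console.log(`FAIL: ${f.description}\n  ${f.evidence}`);
}
```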
What stops the agent from claiming passed=true on bullets it never actually checked?
Two things. First, the evidence field is required and reviewable; the agent has to write a sentence describing what it observed, and 'looks good' is a tell on review. Second, the assertions array in the run report (assertions.push at agent.ts line 949) is a structured record of every assert call with timestamps; the per-step PNG and the WebM recording let you confirm that what the agent claims it saw was actually on the page at the assert moment. The structural defense against fake passes is that the assertion was claimed against a specific accessibility-tree snapshot you can replay. The cultural defense is that the description and evidence are short prose a reviewer can read in a few seconds per bullet, not a 200-line .spec.ts they have to skim.
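The "looks good" tell can be partly automated as a first-pass review aid. The sketch below is hypothetical and deliberately crude: the phrase list and the 30-character floor are arbitrary assumptions, and it cannot tell whether specific-sounding evidence is actually true. That still takes the snapshot and recording.

```typescript
// Hypothetical review aid: the phrase list and length floor are arbitrary assumptions.

const VAGUE_PHRASES = ["looks good", "looks fine", "seems ok", "works as expected", "as expected"];

/** Flag evidence that is too short or leans on a stock phrase instead of an observation. */
function isWeakEvidence(evidence: string, minLength = 30): boolean {
  const text = evidence.trim().toLowerCase();
  if (text.length < minLength) return true;
  return VAGUE_PHRASES.some((phrase) => text.indexOf(phrase) >= 0);
}

console.log(isWeakEvidence("looks good")); // true
console.log(
  isWeakEvidence('No element with role=alert or text matching "saved" found in accessibility tree across 3 snapshots over 4.1s.')
); // false
```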
Does this work with Playwright's web-first expect() matchers, or does it replace them?
It is a different layer that wraps real Playwright underneath. Web-first matchers like expect(locator).toBeVisible() retry inside the matcher when the page is in a loading state but the selector is fixed. Assrt's assert tool works the other way around: the agent calls wait_for_stable first to let the DOM settle (default 30s timeout, 2s stable window), then takes a fresh accessibility-tree snapshot, then calls assert with evidence drawn from what it just saw. If you want the pixel-level toHaveScreenshot or computed-style toHaveCSS matchers, keep them in a separate spec; Assrt does not monopolise the browser. The split is: Assrt covers behavior-level assertions (element visible, text appeared, URL changed, input accepted, modal dismissed); keep classic Playwright for pixel diffs and performance budgets.
What does the run report look like, and how do I measure assertion coverage from it?
Each run writes /tmp/assrt/results/latest.json. The scenarios[].assertions array contains one entry per assert call: { description, passed, evidence }. To compute assertion coverage, parse the scenario plan from /tmp/assrt/scenario.md, count the lines that start with Verify/Check/Assert/Confirm/Ensure, and compare to assertions.length. Because the agent is required to emit exactly one assert per such line, the expected coverage is 100%. Any mismatch is a bug in the run, not a tolerated drift. The mismatch usually shows up as a complete_scenario call that the agent made before covering all the bullets, which surfaces in the JSON as a partial scenario. The pipeline treats partial coverage as a failed scenario, not as a passing one with a footnote.
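The two axes, coverage and pass/fail, combine into a single verdict roughly as follows. This is a sketch of the policy described above, not the pipeline's code: scenarioVerdict is a hypothetical name, and treating extra assertions as a failure is an assumption extrapolated from the rule's ban on unplanned asserts.

```typescript
// Sketch of the verdict policy; `scenarioVerdict` is a hypothetical name.

type Verdict = "passed" | "failed";

/**
 * A scenario passes only when every Verify-class bullet has exactly one
 * assert call and every one of those calls has passed=true.
 */
function scenarioVerdict(expectedBullets: number, assertions: { passed: boolean }[]): Verdict {
  // Partial coverage fails. Excess coverage failing too is an assumption.
  if (assertions.length !== expectedBullets) return "failed";
  return assertions.every((a) => a.passed) ? "passed" : "failed";
}

console.log(scenarioVerdict(2, [{ passed: true }]));                   // failed (partial coverage)
console.log(scenarioVerdict(2, [{ passed: true }, { passed: false }])); // failed (honest failure)
console.log(scenarioVerdict(2, [{ passed: true }, { passed: true }]));  // passed
```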
What if my scenario has no Verify lines at all? Does Assrt force me to invent assertions?
No. The rule is symmetric: the agent must produce one assert per Verify-class bullet, and it must not add extra assertions outside the plan. A pure exploratory scenario with no Verify bullets produces zero asserts and one complete_scenario call with a written summary, which the reviewer can read to decide whether to add Verify bullets next time. The point is to align the assertion count with the plan's stated intent. Over-asserting is a different failure mode (it makes the test brittle to incidental UI changes) and the prompt rejects it for the same one-bullet-one-assert reason.
Can I see the actual prompt lines you are citing? Are they really hardcoded?
Yes, the prompt is a string literal in the open source TypeScript file. The repo is https://github.com/assrt-ai/assrt-mcp. The 7 lines from agent.ts:256-262 (system prompt) plus the user-prompt reinforcement at line 731 are the entire mechanism. There is no separate config, no enterprise tier feature, no plugin to install. If you fork the repo, the rule comes with it. If you write your own plan in plain English with Verify-class bullets, the rule applies. If you remove the section from the prompt, you lose the guarantee but the tool surface still works; that is your choice. The whole point of putting it in plain text in the system prompt is that you can read it, evaluate it, and change it.
Where does this leave teams currently using QA Wolf, Momentic, or hand-written Playwright?
QA Wolf at roughly $7.5K per month gives you human QA engineers who write Playwright with rich assertions; coverage is high because humans are in the loop. Momentic uses proprietary YAML where assertion mechanics are abstracted away, which makes coverage harder to audit because you cannot easily diff a YAML scenario against a list of assert calls. Hand-written Playwright is whatever your team's review discipline produces; coverage depends on the reviewer. Assrt's lane is teams who want AI to write the plan and execute the test but still want a verifiable coverage guarantee. The 7-line rule is the smallest thing that makes that lane real. Open source, free, MIT license, no SaaS dashboard.