Playwright Automated Testing: The Failure Triage Step Everyone Skips
Search "playwright automated testing" and every top result teaches you the same arc: install, write test('thing', ...), add assertions, run in CI. None of them cover the step you actually hit first on a real project: the run is red and you do not know whose fault it is. This guide is about that step. It walks through a reproducible way to classify every failed Playwright case as one of three things (application bug, flawed test, environment issue) and get back a corrected scenario you can re-run.
“Diagnose the root cause — is it a bug in the application, a flawed test, or an environment issue?”
Assrt diagnose system prompt, src/mcp/server.ts:239
1. The Gap: A Red Run Is Not a Diagnosis
A Playwright run fails. You get a test name, a line number, a screenshot, and an assertion message. That tells you a state you expected did not happen. It does not tell you whether:
- The product is broken and a human needs to fix the app.
- The product is fine and the test encoded a wrong assumption (wrong button text, a flow that moved, a selector that matched two things).
- Neither: the preview URL is stale, a seed user was cleaned up, a CI cache is serving an old bundle.
In practice teams route every red run to one of those three piles, but the routing happens in someone's head, in a Slack thread, or in a ticket comment. It is not reproducible, and for flaky long-tail cases the triage cost swamps the fix cost.
The gap in every "playwright automated testing" tutorial on the first search page is this: they stop at "run the test." They do not give you a pattern for what happens next.
2. Three Buckets, Picked By a Fixed Prompt
Assrt ships this triage step as an MCP tool called assrt_diagnose. It takes three strings: the URL under test, the scenario that failed (in plain English or the Assrt #Case format), and the failure description. It returns a structured diagnosis in one LLM call, using claude-haiku-4-5-20251001 (see src/mcp/server.ts:755).
The output is not free-form. The system prompt at src/mcp/server.ts:237 constrains it to four sections in this exact order: Root Cause, Analysis, Recommended Fix, Corrected Test Scenario. The Root Cause is the classification. The model is told to pick one of three buckets and nothing else.
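Because the output is constrained to four fixed sections, a downstream consumer can treat it as structured data rather than free prose. As an illustration only (this parser, the type names, and the keyword matching are assumptions, not part of assrt-mcp), a client could split the response on its headings and map the Root Cause sentence to one of the three buckets:

```typescript
// Sketch: parsing the four-section diagnosis into a typed result.
// The section names come from the prompt; everything else here is
// illustrative, not assrt-mcp code.

type Verdict = "application bug" | "flawed test" | "environment issue";

interface Diagnosis {
  rootCause: string;
  analysis: string;
  recommendedFix: string;
  correctedScenario: string; // the #Case block; empty if none was returned
}

function parseDiagnosis(text: string): Diagnosis {
  // Split on the "### " headings the prompt mandates.
  const grab = (name: string): string => {
    const match = text.match(
      new RegExp(`### ${name}\\n([\\s\\S]*?)(?=\\n### |$)`)
    );
    return match ? match[1].trim() : "";
  };
  return {
    rootCause: grab("Root Cause"),
    analysis: grab("Analysis"),
    recommendedFix: grab("Recommended Fix"),
    correctedScenario: grab("Corrected Test Scenario"),
  };
}

function classify(d: Diagnosis): Verdict {
  // The Root Cause sentence names exactly one of the three buckets.
  const rc = d.rootCause.toLowerCase();
  if (rc.includes("flawed test") || rc.includes("test is flawed")) {
    return "flawed test";
  }
  if (rc.includes("environment")) {
    return "environment issue";
  }
  return "application bug";
}
```

The fixed section order is what makes a parser this naive viable: there is nothing to search for, only four known headings to split on.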
3. The Actual Prompt, Annotated
Here is the real prompt, verbatim from src/mcp/server.ts:237-265 in the assrt-mcp repo:
You are a senior QA engineer and debugging expert.
You are given a failing test case report from an
automated web testing agent. Your job is to:
1. Diagnose the root cause — is it a bug in the
application, a flawed test, or an environment
issue?
2. Provide a fix — give a concrete, actionable
solution:
- If the app has a bug: describe what the app
should do differently
- If the test is flawed: provide a corrected
test scenario in the exact #Case format
- If it's an environment issue: explain what
needs to change
3. Provide a corrected test scenario if the test
itself needs adjustment
## Output Format
### Root Cause
[1-2 sentences identifying the core issue]
### Analysis
[3-5 sentences explaining what went wrong]
### Recommended Fix
[Concrete steps to fix the issue]
### Corrected Test Scenario
#Case 1: [corrected case name]
[corrected steps that will pass]

Two design decisions matter here. First, the three buckets are enumerated in the task description, not just the output format. That blocks a common failure mode where a model sees an ambiguous error and invents a fourth category ("possibly a race condition", "might be timing") rather than committing to one of the three. Second, the Corrected Test Scenario section uses the same #Case N: format as the input, which is what lets the output loop straight back into the next run without a translation step.
Run the triage loop against your own Playwright failures
npx @m13v/assrt, then call assrt_diagnose from any MCP-capable client with {url, scenario, error}. Three buckets, corrected #Case block, one Haiku call.
Get Started →

4. Closing the Loop: Corrected Scenario Back In
The scenario file lives at /tmp/assrt/scenario.md. When diagnose verdicts "flawed test," the returned Corrected Test Scenario is a drop-in replacement for the failing #Case block in that file. The flow looks like this:
1. assrt_test(url, scenarioId)      → Case 3 fails
2. assrt_diagnose(url, scenario,    → Root Cause: "flawed test"
   error)                           → Corrected #Case 3: ...
3. Edit /tmp/assrt/scenario.md,     (agent or human paste;
   replacing the old Case 3.        auto-syncs by UUID)
4. assrt_test(url, scenarioId)      → Case 3 passes

When the verdict is "application bug," the loop stops. The Recommended Fix section becomes the body of a ticket; the scenario stays as-is because it correctly encodes the expected behavior the app is missing. When the verdict is "environment issue," neither the app nor the test changes; the Recommended Fix tells you whether to reseed a user, invalidate a cache, or retry against a different preview.
5. Why This Does Not Overlap With Playwright Retries
Playwright's built-in retries (retries: 2 in playwright.config) exist for one thing: non-deterministic failures. A network blip, an animation race, a toast that came a hundred milliseconds late. Retry the case twice; if one attempt passes, call it green. That works because the assumption is "the case is correct and the world is noisy."
The diagnose loop is for the opposite assumption: the case failed the same way three times in a row, so the world is not noisy, something is actually wrong, and the question is where. Retry cannot help you here because retry cannot distinguish "wrong button name" from "missing feature." You want a classifier.
In a typical run, retries run first on the Playwright MCP side. Only a case that survives all retries gets routed to assrt_diagnose. The two layers compose; they do not compete.
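For reference, the retry layer described above is a few lines of standard Playwright configuration; nothing here is Assrt-specific:

```typescript
// playwright.config.ts — the flake-handling layer the diagnose loop
// sits on top of. Plain Playwright configuration.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  // Re-run a failed case up to twice; one green attempt means
  // "flaky", not "broken", and the case never reaches diagnose.
  retries: 2,
  use: {
    // Capture a trace once a case starts flaking, so a case that
    // survives all retries arrives at triage with evidence attached.
    trace: "on-first-retry",
  },
});
```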
6. Using Diagnose on Hand-Written Playwright Failures
You do not have to move your whole suite to Assrt scenarios to use this step. The diagnose tool only requires three strings. If your existing suite is TypeScript Playwright, call it like this from any MCP client:
assrt_diagnose({
url: "https://staging.acme.com/checkout",
scenario: "Fill the shipping form with a US address, " +
"pick Standard shipping, click Continue, and " +
"verify the order summary shows $4.99 shipping.",
error: "Expected '$4.99' but found '$9.99'. " +
"Screenshot: shipping picker shows 'Express' " +
"selected instead of 'Standard'."
})

The returned Root Cause will tell you whether the radio button IDs changed (flawed test), the default shipping method flipped (app bug), or the CI seed's postcode does not qualify for Standard (env issue). The Corrected Test Scenario comes back as a #Case block; a human or a coding agent translates the English back to page.getByRole calls. The value is the classification, not the translation.
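That translation step might look like this. Everything below (the URL, the selector names, the test-id) is an assumption about the example checkout page, not output of assrt_diagnose:

```typescript
// Hypothetical translation of a corrected #Case block back into a
// hand-written Playwright test. Selectors and URL are assumptions.
import { test, expect } from "@playwright/test";

test("US address with Standard shipping shows $4.99", async ({ page }) => {
  await page.goto("https://staging.acme.com/checkout");
  // Corrected step: explicitly pick Standard instead of relying on
  // the page's default selection, which the failure showed was Express.
  await page.getByRole("radio", { name: "Standard" }).check();
  await page.getByRole("button", { name: "Continue" }).click();
  await expect(page.getByTestId("order-summary")).toContainText("$4.99");
});
```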
7. Where the Classifier Is Wrong or Unhelpful
The diagnose step is not magic. Three known failure modes:
Evidence-starved inputs
If the error string is one line ("Case 3 failed"), the model has nothing to classify on and defaults to "flawed test" because that is the safest guess. Always include the assertion message, the last visible URL, and a paragraph of what the screenshot shows. The Assrt runner does this automatically; hand-written callers have to assemble it themselves.
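The assembly a hand-written caller needs is small. buildErrorReport below is a hypothetical helper with illustrative field names; the point is simply that the error string should carry all three pieces of evidence:

```typescript
// Sketch: assembling an evidence-rich error string for diagnose.
// Hypothetical helper; field names are illustrative.

interface FailureEvidence {
  assertion: string;       // raw assertion message from the reporter
  lastUrl: string;         // URL the page was on when the case failed
  screenshotNotes: string; // human/agent description of the screenshot
}

function buildErrorReport(e: FailureEvidence): string {
  // One multi-line string: enough signal for the classifier to pick
  // a bucket instead of defaulting to "flawed test".
  return [
    `Assertion: ${e.assertion}`,
    `Last URL: ${e.lastUrl}`,
    `Screenshot: ${e.screenshotNotes}`,
  ].join("\n");
}
```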
Ambiguous app changes
When a product intentionally changes copy or a flow, the classifier will report "flawed test" because, from the outside, an intentional redesign and a regression look identical. The triage is still correct in the sense that the test needs updating; the label is just not as satisfying. A product changelog in the scenario context helps here.
Environment issues the page cannot show
A 500 from a downstream service that the app swallows into a generic toast is hard to call as env without logs. The diagnose loop sees the same thing a user would and classifies accordingly. If your runner captures backend errors in the failure report, feed them in; the third bucket gets sharper.
Frequently Asked Questions
What does a 'diagnose loop' add to playwright automated testing?
A classifier on top of failure output. When a Playwright run goes red, a compiled test throws a line number and a stack trace. That is enough to know something broke, but not whether to file a bug, patch the test, or blame a flaky environment. The Assrt diagnose loop wraps the failure in a fixed prompt (src/mcp/server.ts:237) that forces one of three verdicts (application bug, flawed test, or environment issue) and returns a rewritten #Case block if the test was wrong.
Which model runs the diagnose step, and why that one?
claude-haiku-4-5-20251001, called from assrt_diagnose in src/mcp/server.ts:755. Haiku is cheap enough to call on every failed case without thinking about cost, and the diagnose prompt is short (a failure report plus the scenario text) so a frontier model would be overkill. The prompt does all the lifting by constraining output to four sections: Root Cause, Analysis, Recommended Fix, Corrected Test Scenario.
How is the corrected test scenario fed back into the next run?
The diagnose tool emits a #Case N: block in the exact same markdown format Assrt consumes from /tmp/assrt/scenario.md. You (or your coding agent) paste the corrected case over the flawed one and re-run assrt_test with the same scenarioId. There is no format translation layer; input and output share the same grammar.
Does this replace the Playwright retry/flake handling I already have?
No. Playwright's retries handle genuinely non-deterministic failures (network blips, animation races). The diagnose loop handles the orthogonal problem: a reproducible failure that could be the app's fault or the test's fault. In a typical Assrt run, Playwright retries happen first; if the case still fails after retries, diagnose runs against the stable evidence.
Can I run diagnose against a failure from my existing Playwright suite, not Assrt scenarios?
Yes. assrt_diagnose takes three strings: url, scenario, and error. The 'scenario' can be the body of a hand-written Playwright test in plain English, and 'error' can be the assertion failure or stack trace from Playwright's reporter. The corrected output will still come back in #Case format; you translate back to TypeScript by hand. The classification itself (app bug vs flawed test vs env) is the reusable part.
What counts as an 'environment issue' in this classifier?
Anything the prompt cannot map to app or test: a stale preview URL, a seed user that was cleaned up, a cookie banner from a geo the CI node hits, a CDN caching an old bundle. The DIAGNOSE_SYSTEM_PROMPT at src/mcp/server.ts:243 specifically lists this as the third bucket so the model stops trying to force env failures into the first two.
Is this open-source so I can see the exact prompt?
Yes. The prompt lives at src/mcp/server.ts:237-265 in the public assrt-mcp repo. Read it, fork it, change the three buckets to five if your team needs finer granularity. The whole MCP server is MIT-licensed; there is no 'diagnosis engine' sitting on a paid API you cannot inspect.
Add a Diagnose Step to Your Playwright Suite
Assrt ships assrt_diagnose as an MCP tool: three strings in, a three-way verdict and a corrected #Case block out. Open-source, runs on Claude Haiku 4.5, no vendor lock-in.