QA Automation Test: The Diagnose-and-Rerun Loop Most Tools Skip

Most QA automation guides stop at the same place. They show you how to write a test, how to run it, and how to read a pass or fail. They do not show you the part that actually consumes engineering time, which is the loop between a failed run and a corrected test. This guide is about that loop. It walks through the three MCP tools that ship with Assrt, shows the exact handoff between a failure and a fix, and explains why the diagnosis step belongs in a separate tool from the runner.

By Assrt · Published 2026-04-12 · Updated 2026-04-12 · ~9 min read

$0/mo

Generates real Playwright code, not proprietary YAML. Open-source and free vs $7.5K/mo competitors.

Assrt vs competitors

1. The Loop Most QA Automation Tools Skip

The lifecycle of a QA automation test in practice is not write, run, pass. It is write, run, fail, debug, fix, rerun, fail again, debug again. The hard part is not the first arrow. It is every arrow after it. A flake masquerades as a real failure for an hour. A real failure masquerades as a flake until production breaks. A test passes once and never gets touched until a selector silently drifts six months later.

Most QA automation platforms ship a runner and a dashboard. The dashboard tells you the test failed and shows you a stack trace. The translation from stack trace to corrected test is up to a human. That human is usually the engineer on call, reading a screenshot, guessing whether the form changed or the back end is slow, then editing the test. This is where engineering hours go.

The interesting design question is whether the failure-to-fix step can be its own thing, with a structured input and a structured output, that an agent can call. If yes, then the loop closes without a human in the middle for the easy two-thirds of failures, and the human handles only the cases the diagnosis step flags as ambiguous. That is the design Assrt ships.

2. Three MCP Tools, One Feedback Cycle

Assrt exposes its testing capability over the Model Context Protocol as three discrete tools. The split is deliberate. Each tool has one job, a small input, and a structured output that the next tool can consume.

assrt_plan      // url -> generated #Case scenario
assrt_test      // url + plan|scenarioId -> pass/fail + screenshots + video
assrt_diagnose  // url + scenario + error -> diagnosis + corrected scenario

All three are declared in src/mcp/server.ts in the open-source assrt-mcp repository. assrt_plan reads a page and emits cases in the #Case N: name format. assrt_test drives a real browser via @playwright/mcp and writes results to /tmp/assrt/results/latest.json. assrt_diagnose takes a failure and returns a corrected scenario string ready to hand straight back to assrt_test.
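An agent or script that wants to consume that results file directly could read it along these lines. This is a minimal sketch: only the `/tmp/assrt/results/latest.json` path comes from the article, and the field names (`passed`, `failed`, `cases`) are assumptions made up for illustration.

```typescript
// Sketch of consuming the structured results file the runner writes.
// NOTE: field names here are hypothetical; only the path is documented.
import { readFileSync } from "node:fs";

type CaseResult = { name: string; ok: boolean };
type RunResult = { passed: number; failed: number; cases: CaseResult[] };

// Parse a results file; tests can pass an alternate path.
function readLatest(path = "/tmp/assrt/results/latest.json"): RunResult {
  return JSON.parse(readFileSync(path, "utf8")) as RunResult;
}

// Pull out only the failing cases, ready to hand to a diagnose step.
function failingCases(run: RunResult): CaseResult[] {
  return run.cases.filter((c) => !c.ok);
}
```

Because the file is plain JSON on disk, anything from a shell script to a CI step can make the same read without touching the MCP server.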

From an agent inside Claude Code, Cursor, or any MCP client, the cycle reads as three tool calls in a row. Plan, test, diagnose. If the diagnosis says the test was wrong, run again with the corrected scenario. If the diagnosis says the application is wrong, surface that to the human and stop. The runner does not have to know the difference. The diagnose tool does.
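The cycle above can be sketched end to end. Everything in this block is an illustrative stand-in, not the real Assrt API: the stubbed tools mimic a run that fails on a stale selector, gets a corrected scenario from diagnosis, and passes on the rerun.

```typescript
// Sketch of the plan -> test -> diagnose cycle as a calling agent drives it.
// The tool implementations below are stubs invented for illustration.

type TestResult = { passed: boolean; error?: string };
type Diagnosis = { verdict: "test_wrong" | "app_wrong"; correctedScenario?: string };

const tools = {
  assrt_plan: (_url: string): string => "#Case 1: login succeeds",
  // Fails unless the scenario was corrected to use the new selector.
  assrt_test: (_url: string, scenario: string): TestResult =>
    scenario.includes("#signin-btn")
      ? { passed: true }
      : { passed: false, error: "locator('#login-btn') not found" },
  assrt_diagnose: (_url: string, scenario: string, _error: string): Diagnosis => ({
    verdict: "test_wrong",
    correctedScenario: scenario + "\nclick #signin-btn",
  }),
};

function runCycle(url: string): TestResult {
  const scenario = tools.assrt_plan(url);
  let result = tools.assrt_test(url, scenario);
  if (!result.passed && result.error) {
    const dx = tools.assrt_diagnose(url, scenario, result.error);
    if (dx.verdict === "test_wrong" && dx.correctedScenario) {
      // Test was wrong: rerun with the corrected scenario.
      result = tools.assrt_test(url, dx.correctedScenario);
    }
    // app_wrong: stop here and surface the failure to a human.
  }
  return result;
}

console.log(runCycle("https://example.test").passed); // true
```

The shape to notice is that the loop lives entirely in the caller; each tool stays a pure call over strings.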

3. Anatomy of an `assrt_diagnose` Call

Open src/mcp/server.ts at line 725 and the diagnose tool is twelve lines of declaration followed by a focused agent loop. The signature is small on purpose:

server.tool(
  "assrt_diagnose",
  "Diagnose a failed test scenario...",
  {
    url: z.string(),
    scenario: z.string(),
    error: z.string(),
  },
  async ({ url, scenario, error }) => {
    // debugPrompt is assembled from url, scenario, and error (elided here)
    const response = await anthropic.messages.create({
      model: "claude-haiku-4-5-20251001",
      max_tokens: 4096,
      system: DIAGNOSE_SYSTEM_PROMPT,
      messages: [{ role: "user", content: debugPrompt }],
    });
    // diagnosis is extracted from response.content (elided here)
    return { content: [{ type: "text",
      text: JSON.stringify({ diagnosis, url, scenario }) }] };
  }
);

Three things matter about this shape. First, the input is exactly the trio you have at the moment a test fails: the URL, the scenario text, the error. Nothing else. There is no session ID, no run ID, no auth handshake. You can call diagnose from anywhere, including a script, a CI runner, or a separate agent that did not run the original test.

Second, the model is claude-haiku-4-5-20251001 with a 4096 token cap. Haiku 4.5 is fast and cheap, which makes the loop affordable to run on every failure, not just the ones a human escalates. A 4096 token cap is enough for a paragraph of root-cause analysis plus a rewritten scenario block, and not much more, which keeps the output focused.

Third, the response is JSON with a diagnosis field. The agent on the other end parses it, decides whether the diagnosis is "test was wrong" or "app is wrong", and either feeds the corrected scenario back into assrt_test or surfaces the failure. There is no shared mutable state. The loop is pure functions over text.
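A minimal sketch of how a calling agent might parse that payload and branch on it. The three field names follow the article; the `shouldRerun` heuristic is invented for illustration, since the real agent reads the free-form diagnosis text itself.

```typescript
// Parse the diagnose tool's text payload and decide whether to rerun.
// The classifier below is a hypothetical heuristic, not the Assrt logic.

type DiagnosePayload = { diagnosis: string; url: string; scenario: string };

function parseDiagnose(raw: string): DiagnosePayload {
  return JSON.parse(raw) as DiagnosePayload;
}

// Illustrative: rerun if the diagnosis blames the test, escalate otherwise.
function shouldRerun(d: DiagnosePayload): boolean {
  return /test was wrong|stale selector|corrected scenario/i.test(d.diagnosis);
}

const raw = JSON.stringify({
  diagnosis: "Stale selector; corrected scenario below.",
  url: "https://example.test",
  scenario: "#Case 1: login",
});

console.log(shouldRerun(parseDiagnose(raw))); // true
```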

Run the diagnose loop from your IDE

Install the Assrt MCP server in Claude Code or Cursor. Three tools, no dashboard, full source on disk.

Get Started

4. Why Diagnosis Lives In Its Own Tool

A reasonable question is why diagnosis is not folded into assrt_test. A single tool that runs the test, sees the failure, fixes itself, and reruns sounds simpler from the outside. In practice, splitting them is what makes the loop robust.

Folding diagnosis into the runner means every test run pays the diagnosis cost, even when nothing failed, or pays the cost of a long rerun chain when the diagnosis is wrong. Splitting them lets the calling agent decide. Run once. If it failed, decide whether this failure is worth diagnosing, or whether you would rather page a human. If you diagnose, decide whether to apply the correction or surface it as a suggestion. Each decision is explicit and the agent can be conservative.

The split also lets a different agent do the diagnosis from the one that did the run. A CI bot can run the test, drop the failure into a queue, and a separate diagnosis worker can pick it up later. The diagnosis tool only needs three strings. It does not need access to the browser, the network, or the test runtime.

5. The Video Channel: A Second Source Of Truth

Sometimes the error message is not enough. The test failed at step 4 but the real cause was a modal that flashed at step 2. For that, Assrt records a webm video of every test run and ships an optional fourth tool, assrt_analyze_video, that hands the recording to Gemini for visual analysis. The tool is gated on a GEMINI_API_KEY env var, so it is opt-in.

if (process.env.GEMINI_API_KEY) {
  server.tool(
    "assrt_analyze_video",
    "Analyze the most recent test recording video using Gemini vision...",
    { prompt: z.string(), videoPath: z.string().optional() },
    ...
  );
}

The model is gemini-3.1-flash-lite-preview. The prompt is free-form: "Did the login form appear?", "Was there a visual error?", "Summarize what happened in step 2". Combine this with assrt_diagnose and the agent has both a textual error trace and a visual trace of the same run. When the two disagree, you have learned something useful about the failure.
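One way an agent might exploit that disagreement signal is a simple cross-check between the two channels. The heuristic below is entirely hypothetical, just to make the idea concrete: the error trace claims an element was missing while the video analysis says it was visible.

```typescript
// Hypothetical cross-check of the textual and visual traces of one run.
// The keyword matching is illustrative, not part of Assrt.
function channelsDisagree(errorTrace: string, videoSummary: string): boolean {
  const traceSaysMissing = /not found|timeout/i.test(errorTrace);
  const videoSawIt = /appeared|visible/i.test(videoSummary);
  return traceSaysMissing && videoSawIt;
}

console.log(channelsDisagree(
  "locator('#login-btn') not found",
  "The login form appeared and was visible at step 2",
)); // true
```

A disagreement like this usually means a timing or selector problem rather than a broken feature, which is exactly the case worth flagging for the diagnose step.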

The recording also lives on disk, the same way the scenario does. After a run, an HTML player is generated alongside the webm with playback speed controls and pass/fail counts in the header. You can scrub the run yourself if the agents disagree, or share the directory with a teammate.

6. What To Look For In Any QA Automation Test Tool

Whether or not you end up using Assrt, the diagnose-and-rerun loop is the lens worth applying to any QA automation test tool you evaluate. Five questions cut through most marketing pages:

  1. When a test fails, does the tool produce a structured failure object, or only a stack trace? Structured failures can be passed to a diagnosis step. Stack traces require a human.
  2. Can the diagnosis step run independently of the runner? If the only way to get a fix is to rerun the entire suite inside the vendor's UI, you cannot script the loop.
  3. Is the test artifact something a coding agent can edit and write back? A markdown file, yes. A binary blob inside a hosted DB, no.
  4. Is the underlying browser runtime open-source and standard? Playwright, Selenium, and Cypress all qualify. A proprietary runtime is a future migration cost.
  5. What is the per-failure cost of running the diagnose step? If diagnosis costs as much as a full suite, you will skip it and the loop stays open.

Assrt's answer to all five is in the source: structured JSON failures in /tmp/assrt/results/latest.json, a standalone assrt_diagnose tool, markdown scenarios on disk, Playwright as the runtime, and a 4096-token Haiku call as the diagnose cost. Whatever tool you pick, ask the same five questions.

7. FAQ

What exactly does `assrt_diagnose` return?

A JSON object with three fields: diagnosis (free-form root-cause text and a corrected scenario block from Claude Haiku 4.5), url, and scenario (the original failing scenario, echoed back so the calling agent has full context). See src/mcp/server.ts line 773 in the assrt-mcp repo.

Which model powers the diagnosis step?

claude-haiku-4-5-20251001 with a 4096 token cap, called via the Anthropic SDK. Haiku is chosen for cost so the diagnose step can run on every failure, not just escalated ones.

Can I run the diagnose tool without running the test first?

Yes. The signature is just {url, scenario, error}. You can paste a stack trace and a scenario from your CI logs into a separate Claude Code session and call assrt_diagnose directly. The tool has no dependency on prior state.

How does Assrt avoid infinite diagnose-rerun loops?

The loop is driven by the calling agent, not the tool. Both assrt_diagnose and assrt_test are pure tool calls; the agent decides whether to retry. In Claude Code, the conversation length acts as a natural bound. In CI, you wrap the loop in a max-attempts counter.
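A CI wrapper with such a max-attempts counter might look like this sketch, where `runTest` and `diagnose` are stand-ins for the two tool calls:

```typescript
// Bounded diagnose-rerun loop: the caller, not the tool, enforces the cap.
// runTest and diagnose are hypothetical stand-ins for assrt_test / assrt_diagnose.

type Outcome = { passed: boolean; error?: string };

function boundedLoop(
  runTest: (scenario: string) => Outcome,
  diagnose: (scenario: string, error: string) => string, // returns corrected scenario
  scenario: string,
  maxAttempts = 3,
): { passed: boolean; attempts: number } {
  let current = scenario;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = runTest(current);
    if (result.passed) return { passed: true, attempts: attempt };
    if (attempt < maxAttempts && result.error) {
      current = diagnose(current, result.error); // one diagnose per failure
    }
  }
  return { passed: false, attempts: maxAttempts };
}

// Stub that passes on the second attempt after one correction.
const out = boundedLoop(
  (s) => (s.includes("fixed") ? { passed: true } : { passed: false, error: "timeout" }),
  (s) => s + " fixed",
  "#Case 1: checkout",
);
console.log(out); // { passed: true, attempts: 2 }
```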

Does `assrt_analyze_video` work without a Gemini key?

No. The tool is conditionally registered: the file checks process.env.GEMINI_API_KEY at startup, and only adds the tool to the MCP server if the key is present. Without the key, the test still runs and the video is still recorded; only the analysis tool is hidden.

Is the runner code open-source?

Yes. The MCP server is in the assrt-mcp repo and uses @playwright/mcp as the browser runtime. Generated tests are standard Playwright. There is no proprietary glue layer to migrate off later.

How does this compare to enterprise QA SaaS?

Several enterprise QA platforms charge $7,500 per month per seat and store scenarios as opaque blobs you cannot read outside their UI. Assrt is free, self-hostable, and stores scenarios as markdown on your disk. The trade-off is that you bring your own MCP client and your own CI plumbing.

Close the loop on your QA automation tests

Install the Assrt MCP server. Get three tools and one optional fourth. Watch failures turn into corrected scenarios without a human in the middle.

Free, open-source, self-hostable. No vendor lock-in.