AI test discovery and generation, in two LLM calls with different budgets

Every other write-up on this topic gives you the marketing version: a Planner agent, a Generator agent, maybe a Healer. The honest version is two API calls with two different system prompts, two different token budgets, and a small set of constants that decide when discovery stops. This page shows those calls, those prompts, and the rationale for why they are sized so differently. The examples come from the Assrt source at github.com/assrt-ai/assrt-mcp, which is MIT-licensed and runnable with npx @m13v/assrt.

Matthew Diakonov
9 min read
Direct answer, verified 2026-05-11

AI test discovery and generation is two cooperating LLM calls. The first loads the landing URL of your app, sees the page through screenshots plus an accessibility tree, and writes a comprehensive opening plan of 5 to 8 test cases. The second fires whenever the running agent navigates to a previously unseen URL during a test, sees only that page, and writes 1 to 2 additional cases for it. Both calls emit the same structured markdown format (#Case N: name, followed by 1-5 lines of steps), so the executor handles their output identically. The split exists because the cost shape is asymmetric: the one upfront call can afford to be deep; every in-execution call has to be cheap.

Sourced from the Assrt MCP source at github.com/assrt-ai/assrt-mcp and verified against playwright.dev/docs/test-agents for the official Test Agents framing.

The two calls and what each one sees

The diagram below traces a single run. You hand the CLI a URL and a request to test the app. The upfront call fires once. The executor starts working through the cases it returned. Every time the agent lands on a URL it has not seen before, the in-execution discovery call fires for that URL. New cases append to the same scenario buffer the executor is reading from.

One assrt run, two discovery calls

CLI → Plan LLM:           URL + 3 screenshots + 8000-char tree
Plan LLM → CLI:           5-8 cases (4096 max_tokens)
CLI → Executor:           run plan, open browser
Executor → Discovery LLM: new URL reached, queue 1 of 20
Discovery LLM → Executor: 1-2 cases (1024 max_tokens)
Executor → Discovery LLM: another new URL, queue 2 of 20
Discovery LLM → Executor: 1-2 cases (1024 max_tokens)
Executor → CLI:           report.json + recording.webm

Phase 1: the upfront plan

This call runs once per invocation of the assrt_plan MCP tool. The CLI launches a real Chrome via Playwright MCP, navigates to the URL, takes a screenshot, scrolls 800 pixels, takes another, scrolls 800 more, takes a third. It concatenates the accessibility tree at each scroll position and slices the combined string to 8000 characters. All three screenshots and the sliced tree go into a single Claude Haiku call with a 4096-token budget.
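
The capture loop is simple enough to sketch. The version below drives plain Playwright rather than the Playwright MCP subprocess the real CLI spawns, and buildPlanContent is a made-up name; read it as the shape of the step, not a quote from server.ts.

// A sketch of the capture step, assuming plain Playwright instead of the
// Playwright MCP subprocess the real CLI drives. buildPlanContent is a
// hypothetical name; ariaSnapshot() is a rough stand-in for the MCP
// accessibility snapshot.
import { chromium } from "playwright";

async function buildPlanContent(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  const screenshots: string[] = [];
  let tree = "";
  for (let i = 0; i < 3; i++) {
    screenshots.push((await page.screenshot()).toString("base64"));
    tree += await page.locator("body").ariaSnapshot();
    await page.mouse.wheel(0, 800); // scroll 800px between captures
  }
  await browser.close();

  // Anthropic multimodal content blocks: three images, then the sliced tree.
  return [
    ...screenshots.map((data) => ({
      type: "image" as const,
      source: { type: "base64" as const, media_type: "image/png" as const, data },
    })),
    { type: "text" as const, text: tree.slice(0, 8000) },
  ];
}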

The system prompt frames the model as a Senior QA Engineer and gives it six rules. The most important one for output volume is rule 6: generate 5 to 8 cases max, focused on the most important user flows visible on the page.

// /Users/matthewdi/assrt-mcp/src/mcp/server.ts, line 219
const PLAN_SYSTEM_PROMPT = `You are a Senior QA Engineer generating
test cases for an AI browser agent. The agent can: navigate URLs, click
buttons/links by text or selector, type into inputs, scroll, press keys,
and make assertions. It CANNOT: resize the browser, test network errors,
inspect CSS, or run JavaScript.

## Output Format
Generate test cases in this EXACT format:

#Case 1: [short action-oriented name]
[Step-by-step instructions the agent can execute. Be SPECIFIC about
what to click, what to type, and what to verify.]

#Case 2: [short action-oriented name]
[Step-by-step instructions...]

## CRITICAL Rules for Executable Tests
1. **Each case must be SELF-CONTAINED** ...
2. **Be specific about selectors** ...
3. **Verify observable things** ...
4. **Keep cases SHORT** — 3-5 actions max per case.
5. **Avoid testing what you can't see** ...
6. **Generate 5-8 cases max** — focused on the MOST IMPORTANT user
   flows visible on the page.`;

// later, when the tool fires:
const response = await anthropic.messages.create({
  model: model || "claude-haiku-4-5-20251001",
  max_tokens: 4096,                  // <-- the upfront budget
  system: PLAN_SYSTEM_PROMPT,
  messages: [{ role: "user", content: contentParts }],
});

The reason this call can afford 4096 tokens is that it runs exactly once. Even on a 100-page app, the upfront cost is fixed. If the model returns 7 cases averaging 60 tokens each, the response is 420 tokens, well under the cap. The cap exists for worst-case complex apps where the model wants to write longer step lists.
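
Put the two budgets side by side and the asymmetry is easy to quantify; these ceilings follow directly from the constants on this page.

// Worst-case output-token ceilings implied by the two budgets:
//   plan:        1 call   x 4096 max_tokens =  4,096 tokens per run
//   discovery: <=20 calls x 1024 max_tokens = 20,480 tokens per run
// The discovery ceiling is 5x the plan ceiling, but only a run that
// actually reaches 20 new URLs ever pays it, and responses rarely hit
// the cap anyway.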

Phase 2: discovery during execution

Every time the agent calls navigate on a URL it has not seen, the agent code calls queueDiscoverPage(url). The URL is normalized (protocol plus pathname, trailing slash stripped), checked against three deduplication gates (already seen, queue full, skip-pattern matched), and pushed to a pending buffer. Between scenarios, the agent calls flushDiscovery(), which fires up to 3 concurrent discovery calls.
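
In sketch form the queueing path looks like this. The constants are the real ones quoted a few lines below; the function body paraphrases agent.ts rather than quoting it.

// Paraphrased sketch of queueDiscoverPage's three gates. The two caps and
// SKIP_URL_PATTERNS are the real constants quoted below; everything else
// here is illustrative.
const seenUrls = new Set<string>();
const pendingDiscoveries: string[] = [];

function queueDiscoverPage(rawUrl: string): void {
  // normalization described above: protocol + pathname, trailing slash stripped
  const u = new URL(rawUrl);
  const key = (u.protocol + u.pathname).replace(/\/$/, "");
  if (seenUrls.has(key)) return;                               // gate 1: already seen
  if (seenUrls.size >= MAX_DISCOVERED_PAGES) return;           // gate 2: queue full
  if (SKIP_URL_PATTERNS.some((re) => re.test(rawUrl))) return; // gate 3: skip pattern
  seenUrls.add(key);
  pendingDiscoveries.push(rawUrl);
}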

Each discovery call sees one accessibility tree (sliced to 4000 characters) and one screenshot. The system prompt is tighter: quick test cases, 3-4 actions max per case, 1-2 cases total. The model gets 1024 tokens of response budget. The output streams back as it generates, so partial cases are visible mid-call.

// /Users/matthewdi/assrt-mcp/src/core/agent.ts, line 256
const DISCOVERY_SYSTEM_PROMPT = `You are a QA engineer generating quick
test cases for an AI browser agent that just landed on a new page. The
agent can click, type, scroll, and verify visible text.

## Output Format
#Case 1: [short name]
[1-2 lines: what to click/type and what to verify]

## Rules
- Generate only 1-2 cases
- Each case must be completable in 3-4 actions max
- Reference ACTUAL buttons/links/inputs visible on the page
- Do NOT generate login/signup cases
- Do NOT generate cases about CSS, responsive layout, or performance`;

const MAX_CONCURRENT_DISCOVERIES = 3;
const MAX_DISCOVERED_PAGES       = 20;
const SKIP_URL_PATTERNS = [
  /\/logout/i, /\/api\//i, /^javascript:/i,
  /^about:blank/i, /^data:/i, /^chrome/i,
];

// later, when each new URL is reached during a run:
const stream = this.anthropic.messages.stream({
  model: this.model,
  max_tokens: 1024,                  // <-- the in-execution budget
  system: DISCOVERY_SYSTEM_PROMPT,
  messages: [{ role: "user", content }],
});

The 20-page and 3-concurrent caps exist because a run that discovers 200 pages would either stall the executor waiting for generation or fan out beyond what a single API key can sustain. 20 is enough to cover the surface a typical first-pass test should reach. If the agent navigates to a 21st new URL, the discovery is silently skipped and the executor keeps moving.
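
The dispatcher itself is a few lines. The sketch below paraphrases it rather than quoting it: drain the pending buffer from the earlier sketch in batches no wider than the concurrency cap, so the executor never waits on more than three completions at once. discoverPage is a hypothetical wrapper around the streaming call shown above.

// Paraphrased sketch of flushDiscovery, not a quote from agent.ts.
// discoverPage (hypothetical) runs the streaming call above, parses its
// 1-2 cases, and appends them to the scenario buffer.
async function flushDiscovery(): Promise<void> {
  while (pendingDiscoveries.length > 0) {
    const batch = pendingDiscoveries.splice(0, MAX_CONCURRENT_DISCOVERIES);
    await Promise.all(batch.map((url) => discoverPage(url)));
  }
}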

The full workflow in order

  1. Submit URL: the CLI receives the target URL, the model preference, and the optional --extension flag for using a real logged-in Chrome.

  2. Plan call: one 4096-token Claude Haiku call with three screenshots and 8000 chars of accessibility tree. Returns 5-8 #Case markdown entries.

  3. Execute + discover: the agent works through the plan. Each new URL queues a 1024-token discovery call, capped at 20 pages and 3 concurrent calls.

  4. Append new cases: discovered cases stream back into the same scenario buffer the executor reads from. They run after the original plan finishes.

  5. Report and recording: events.json, screenshots/, video/recording.webm, and an HTML player land on your disk for inspection.

The parser that lets both phases share output

The single biggest design decision here is that the two phases emit the same format. The plan call produces a long markdown string with 5-8 #Case headers; the discovery call produces a short markdown string with 1-2 #Case headers. The executor never cares which phase a case came from because the parser is identical.

// /Users/matthewdi/assrt-mcp/src/core/agent.ts, line 620
private parseScenarios(text: string): { name: string; steps: string }[] {
  const scenarioRegex = /(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi;
  const parts = text.split(scenarioRegex).filter((s) => s.trim());
  if (parts.length > 1) {
    const names = text.match(scenarioRegex) || [];
    return parts.map((steps, i) => ({
      name: (names[i] || `Case ${i + 1}`)
              .replace(/^#\s*/, "")
              .replace(/[:.]\s*$/, "")
              .trim(),
      steps: steps.trim(),
    }));
  }
  return [{ name: "Test Scenario", steps: text.trim() }];
}

That regex accepts #Case 1:, Case 2., Scenario 3:, and several other shapes, because the original input could come from the plan call, the discovery call, a markdown file a human edited by hand, or a JSON payload an MCP client sent. The format is permissive on input, strict on output.
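
To see the permissiveness concretely, here is what the parser does with two differently shaped headers; the driver below is illustrative, not from the repo.

// Illustrative input mixing two header shapes the regex accepts:
const sample =
  "#Case 1: Click the pricing link\n" +
  "Click Pricing in the nav, verify the URL is /pricing.\n\n" +
  "Scenario 2. Open the docs\n" +
  "Click Docs, verify the heading reads Documentation.";

// parseScenarios(sample) yields two entries. The regex only matches up to
// the colon or period, so the names come back as "Case 1" and "Scenario 2";
// the descriptive title stays as the first line of each entry's steps.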

What both prompts deliberately exclude

Six categories of cases are unwelcome in either phase. The plan prompt rules them out explicitly; the discovery prompt repeats the most expensive ones. Excluding them in the prompt is cheaper than excluding them in a post-filter, because the model never spends output tokens on cases the executor cannot run.

Cases the prompts will not generate

  • Login or signup flows from scratch (the agent has no built-in identity store; if a flow needs auth, the test author has to supply credentials or a disposable email).
  • CSS or visual-style assertions (the agent has no DOM-style introspection tool; it sees text and roles, not computed pixels).
  • Responsive layout testing (the agent cannot resize the viewport mid-run; the viewport is fixed at 1600x900 by browser.ts, as sketched after this list).
  • Performance or load measurement (no timing tool exposed to the model).
  • Network-error simulation (the agent does not control the network; it makes real requests).
  • Cases longer than 3 to 5 actions (focused tests that pass are preferred over long tests that fail halfway).
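
The viewport exclusion deserves one concrete line. A Playwright viewport is pinned when the browser context is created, and none of the tools exposed to the model can change it afterwards. The snippet below shows the general Playwright pattern; it is illustrative, not a quote from browser.ts.

// Illustrative: a viewport fixed at context creation, the general pattern
// behind the 1600x900 figure above. No tool call can resize it later,
// which is why responsive cases are excluded up front.
import { chromium } from "playwright";

const browser = await chromium.launch();
const context = await browser.newContext({
  viewport: { width: 1600, height: 900 },
});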
A run with discovery enabled produces roughly twice to three times as many cases on a typical 6-page app: 5-8 from the upfront plan plus 1-2 each from the first 5 new URLs the agent reaches, for 10-18 total against 5-8 without discovery. That ratio is set by the two budgets, not by the model.

Assrt MCP source (agent.ts:256-617, server.ts:219-852), verified 2026-05-11

Why the asymmetry is the interesting part

Anyone can build a single AI test generator: take a URL, give a screenshot to GPT or Claude, ask for cases. That generator either fires once (and misses every URL the user reaches after the landing page) or it fires every navigation (and burns budget on deep generation for trivial pages). The interesting design is recognizing that the cost-per-call and the value-per-call are different at the two moments, and pricing them accordingly.

Microsoft has shipped a similar idea inside the official Playwright Test Agents (Planner, Generator, Healer), and that framing is becoming standard. The structural detail that varies tool to tool is which prompts run where, what they cap output at, and what they refuse to generate. Most write-ups on this topic gloss over those details. Without them, two products that look identical on a feature matrix can produce very different runs against the same app.

The honest comparison to make is not feature-to-feature. It is prompt-to-prompt, budget-to-budget, cap-to-cap. The same way you would compare two databases by their isolation level and lock granularity, not by their landing-page bullet points.

Reading the source yourself

Three files in github.com/assrt-ai/assrt-mcp are enough to verify everything on this page:

  • src/mcp/server.ts — PLAN_SYSTEM_PROMPT at line 219 and the upfront call at line 829 with max_tokens: 4096.
  • src/core/agent.ts — DISCOVERY_SYSTEM_PROMPT and the three caps at lines 256-271, the queue at 555, the dispatcher at 564, the per-page LLM call at 585 with max_tokens: 1024.
  • src/core/agent.ts line 620 — the parseScenarios regex that lets the same parser handle both phases.

If you want to run this locally, the install command is npx @m13v/assrt. The MCP server is registered globally on first run; thereafter you can call assrt_plan and assrt_test from Claude Code or any other MCP-aware client. Discovery is on by default during assrt_test runs and you do not have to enable it explicitly.

Want a walkthrough of these prompts in your codebase?

Bring a real web app and we will run discovery against it, read the cases together, and decide which ones to keep.

More on AI test discovery and generation

What does AI test discovery and generation mean, in one sentence?

A program walks through your running web app, captures the accessibility tree of each page plus a screenshot, sends both to an LLM with a structured prompt that asks for short test cases in a fixed markdown format, and a second pass then executes those cases through a real browser (typically Playwright). Discovery is the find-flows half; generation is the write-cases half; serious implementations do both with two different prompts.

Why split discovery and generation into two LLM calls instead of one?

Cost and shape. The upfront call has a generous budget because it only fires once per run and needs to produce a comprehensive plan: Assrt allocates 4096 tokens and feeds it three screenshots taken at different scroll depths plus 8000 characters of accessibility tree. The in-execution call fires every time the agent navigates to a new URL during the run, so it has to be cheap: Assrt allocates 1024 tokens, one screenshot, and 4000 characters of tree, and asks for 1-2 cases per page instead of 5-8. Without the split you either pay the upfront cost on every navigation or you under-allocate on the initial plan.

Where exactly are the two prompts in Assrt source?

PLAN_SYSTEM_PROMPT lives at src/mcp/server.ts, line 219 of the assrt-mcp repo. It opens with 'You are a Senior QA Engineer generating test cases for an AI browser agent.' DISCOVERY_SYSTEM_PROMPT lives at src/core/agent.ts, line 256. It opens with 'You are a QA engineer generating quick test cases for an AI browser agent that just landed on a new page.' The two prompts differ in role, in expected output volume (5-8 cases vs 1-2), and in detail level, but they share the same #Case markdown format so the executor handles their output identically.

What stops in-execution discovery from running forever on a big site?

Two constants in src/core/agent.ts: MAX_DISCOVERED_PAGES = 20 and MAX_CONCURRENT_DISCOVERIES = 3. The first caps how many distinct URLs get queued for discovery in one run. The second caps how many discovery LLM calls run at the same time, so the agent does not stall waiting for ten parallel completions. There is also a SKIP_URL_PATTERNS list that excludes /logout, /api/, javascript:, about:blank, data:, and chrome: URLs before they ever hit the queue, so the agent does not waste a discovery call on something un-testable.

What format does the LLM actually return?

Plain markdown with #Case headers. Example output from a discovery call: '#Case 1: Click the pricing link\nClick the Pricing link in the nav, verify the URL changes to /pricing and the heading reads Pricing.' Both prompts enforce the same shape: #Case N: short action-oriented name on the header line, then 1-2 lines of step-by-step instructions referencing actual elements visible on the page. The agent's parser splits on the #Case regex, so a fresh case can drop into the same scenario list as a manually written one.

Why does the discovery prompt explicitly exclude login, CSS, and responsive layout cases?

Because a 1024-token in-execution call has to spend every token on something the agent can actually verify with the 18 tools it has access to (navigate, click, type, scroll, snapshot, screenshot, assert, and so on). The agent has no way to introspect computed CSS, no way to resize the viewport mid-run, and no reliable way to handle a fresh signup flow without coordinating disposable email. Asking the model for CSS or responsive cases would burn output tokens on cases the executor would fail on. Excluding them by prompt is cheaper than excluding them by post-filter.

How is the upfront plan different in practice?

Three screenshots instead of one (the page is scrolled twice before each capture, so you get the top, mid, and bottom of the landing page in the same call), 8000 characters of accessibility tree instead of 4000, and a Senior-QA-Engineer framing in the system prompt instead of a quick-cases-on-a-new-page framing. The output target is 5-8 self-contained cases instead of 1-2, and the model can spend up to 4096 tokens on the response. The model used here is claude-haiku-4-5 by default, overridable per call.

Does Assrt write Playwright .spec.ts files, or just markdown?

The generated cases land in /tmp/assrt/scenario.md as plain markdown with #Case headers. When you run them, the agent loop translates each case into Playwright MCP tool calls (snapshot, click by accessibility ref, type, assert, etc.) and the run produces a recording at video/recording.webm, screenshots in screenshots/, events in events.json, and an HTML player at player.html. The agent talks to a Playwright MCP subprocess, so the underlying engine is real Playwright, not a proprietary executor. If you want the cases in your repo as TypeScript Playwright code rather than markdown, the events.json output is the canonical artifact to convert from; that conversion is not yet a first-class CLI command in the @m13v/assrt package.

How does this compare to Playwright's official Test Agents?

The Microsoft Playwright Test Agents (Planner, Generator, Healer) are conceptually the same workflow at the marketing layer: explore the app, generate a plan, write the tests. The visible structural differences are where the prompts live (Playwright's are inside MCP server tool definitions exposed to your editor's agent; Assrt's are in the Node.js process you run via npx) and how the discovery half behaves (Playwright's planner is a one-shot exploration; Assrt's discovery is a continuous side effect of every test run, so cases accumulate as the agent reaches more URLs). Neither is the right answer for every team. Use the Microsoft tools when you already live in VS Code with Copilot and want the agent inline; use Assrt when you want a CLI that any CI system can call and a Markdown plan you can edit by hand.

Is this safe to run against production?

Discovery is read-only against the URL you point it at, but the executor is not: it clicks, types, and follows redirects, which means a discovery run against a real production URL can submit forms, send emails, and trigger purchases. SKIP_URL_PATTERNS catches /logout and /api/ but not /checkout or /delete-account. Run discovery against a staging environment, against a localhost dev server, or against a production URL only after passing a synthetic identity that cannot trigger real billing or notifications. The Assrt CLI offers a --extension mode that uses your real logged-in Chrome session, which is the highest-risk setup; use it only when you fully understand what the proposed cases will click.
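
If you fork the repo to point discovery at anything production-adjacent, widening the skip list is the cheapest extra guardrail. The first six patterns below are the upstream ones quoted earlier; the last three are suggestions, not upstream code.

// Upstream patterns plus three suggested guards for riskier targets.
const SKIP_URL_PATTERNS = [
  /\/logout/i, /\/api\//i, /^javascript:/i,
  /^about:blank/i, /^data:/i, /^chrome/i,
  /\/checkout/i, /\/delete-account/i, /\/unsubscribe/i, // suggested additions
];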
