A runnable tutorial, not a glossary

AI testing tutorial: one #Case, 18 tools, run it in your terminal

The ten highest-ranking tutorials for this phrase define AI testing, list benefits, and recommend a SaaS. None of them ships a format you can type into a file, a command you can run, and a directory you can list. This one does. The plan is a plaintext `#Case` block. The model is pinned at a specific id. The tool set is exactly 18 named primitives. Every layer has a line number you can grep.

$ npx assrt-mcp --url https://demo.app --plan scenario.md

Assrt Engineering
  • Default model is Claude Haiku 4.5 (agent.ts line 9)
  • 18 Playwright MCP tools listed in agent.ts between lines 18 and 186
  • Scenario regex at agent.ts line 569 parses any `#Case N:` file
  • Every run writes video, screenshots, events, and results under /tmp/assrt/<runId>/

What a real AI testing tutorial owes you

An AI testing tutorial that stops at “describe what you want to test in plain English” is a sales deck with better CSS. The format has to be named. The model has to be named. The agent's tool surface has to be enumerated. And the output has to land in a place you can open. Everything that follows is that contract in four concrete layers, each with a file you can read.

Layer 1 · Format
`#Case N: name` blocks

Plain English grouped by a header regex. No YAML, no DSL. Lives in scenario.md.

Layer 2 · Model
claude-haiku-4-5-20251001

Pinned at agent.ts line 9 as DEFAULT_ANTHROPIC_MODEL. Override via `--model`.

Layer 3 · Tools
18 Playwright MCP primitives

Listed between lines 18 and 186 of agent.ts. Each tool has a JSON schema the model obeys.

Layer 4 · Artifacts
/tmp/assrt/<runId>/

ASSRT_DIR is fixed at scenario-files.ts line 16. Five files per run, every run.

Step 1. Write one `#Case` block

Save this as scenario.md in your current directory. The header shape is what the regex expects: a hash, the word Case, a number, a colon, then a name. Body lines are whatever a person would tell a careful QA tester.

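A minimal plan in that shape might look like this (the app, field names, and credentials are illustrative, not from the Assrt repo):

```
#Case 1: Log in and reach the dashboard
1. Open the login page
2. Type demo@example.com into the Email field
3. Type the password hunter2 into the Password field
4. Click the Sign in button
5. Assert that the dashboard heading is visible
```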

Intentionally boring. No selectors, no waits, no assertions-as-code. The agent decides which of the 18 tools to call for each body line. If you can dictate this to a junior tester, you can run it as an AI test.

Step 2. Run it and watch the tool calls

One command. The runner spawns @playwright/mcp under the hood, and Claude Haiku 4.5 calls tools in a loop until the case is marked complete. Every tool call is logged. Every log line is plain text, not a JSON blob, so you can read it in your terminal without a reporter UI.


The pipeline, drawn once

Three inputs on the left (the plan, the URL, and any variables), one hub in the middle (the planner), and a tool surface on the right that ends in files on your disk. If you remember one diagram from this tutorial, make it this one: the hub is what turns English into Playwright calls, and it is a specific model with a specific id.

Plan and URL → Claude Haiku 4.5 → Playwright MCP tools

  scenario.md · target URL · variables
                  │
                  ▼
        Claude Haiku 4.5 (planner)
                  │
                  ▼
  navigate / snapshot · click / type_text · assert
                  │
                  ▼
          artifacts on disk

Step 3. Open the run directory

The directory name is the run id. Inside it you get a structured events log, a folder of numbered screenshots, a WebM recording, and a self-contained player. The screenshot filenames follow NN_stepN_<action>.png so the filesystem is already a table of contents. The player defaults to 5x playback because nobody wants a real-time video.

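An illustrative layout of one run directory, assembled from the files this tutorial names; the runId, the step names, and the exact video and result filenames are placeholders, not verified against the source:

```
/tmp/assrt/<runId>/
├── screenshots/
│   ├── 01_step1_navigate.png
│   ├── 02_step2_click.png
│   └── ...
├── events.json      structured timeline, one row per tool call
├── recording.webm   video of the run (exact filename assumed)
├── player.html      self-contained player, defaults to 5x
└── result.json      pass/fail JSON (exact filename assumed)
```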

Want to try it on your own app?

Install assrt-mcp, paste a #Case block, and you are running AI tests in under three minutes. No account, no cloud lock-in.

Install assrt-mcp

The 18 tools the agent picks from

Every tool is a function the planner can call in the middle of a scenario. The full list and schemas live in assrt-mcp/src/core/agent.ts between lines 18 and 186. Grouped by job:

Page primitives

navigate, snapshot, scroll, press_key, wait, screenshot, evaluate. The agent calls snapshot first to get a typed accessibility tree with [ref=eN] ids, then operates on refs instead of CSS selectors.

navigate(url)
snapshot()
scroll(x, y)
screenshot()

Interaction

click, type_text, select_option. Each takes an element description plus an optional ref. Preferred path: pass the ref from the last snapshot; human description is for the log.
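In the same shorthand as the earlier snippets, one interaction sequence might read as follows; the argument shapes are an assumption, since only "a description plus an optional ref" is specified above:

click("Sign in button", ref=e12)
type_text("Email field", ref=e7, "demo@example.com")
select_option("Country dropdown", ref=e21, "Canada")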

Waiting

wait (text or ms), wait_for_stable (DOM settles). Use the stable variant after triggering async work so the agent does not assert against a loading state.
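In the same shorthand (parameter shapes assumed; the text above only says wait accepts text or milliseconds):

wait("Order confirmed")
wait(2000)
wait_for_stable()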

Disposable email

create_temp_email, wait_for_verification_code, check_email_inbox. Covers signup flows without inventing fixture emails or polluting your inbox. Codes arrive at a Mailinator-style address.

create_temp_email() => rC2q9@mailinator.net
wait_for_verification_code(60) => 4821

Verification

assert (description, passed, evidence), complete_scenario (summary, passed), suggest_improvement. Evidence is the English sentence the model wrote about the page state; that is what you read when a run fails.
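With the parameter names listed above, a passing finish might be recorded as (the evidence and summary strings are illustrative):

assert("Dashboard heading visible", true, "Page shows a 'Welcome back' heading")
complete_scenario("Login flow passed end to end", true)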

External and timing

http_request for verifying webhooks and third-party APIs (Telegram, Slack, GitHub), wait_for_stable for adaptive timing. Two tools that turn a browser agent into an integration tester.

A flow that uses 5 of these is typical. A login tutorial uses 4. A signup flow with OTP uses 7. If you ever feel you need a 19th tool, usually `evaluate` already does what you want — it runs arbitrary JavaScript in the page and returns the result.

  • 18 Playwright MCP tools the agent can call
  • 1 regex that parses the plan (agent.ts line 569)
  • 5 artifacts per run (plan, video, shots, events, result)
  • 0 lines of proprietary DSL the plan depends on

Step 4. Write a second, harder `#Case`

The first case proves the pipeline. The second case proves your head is in the right shape. Each of these sub-steps uses one more tool from the catalog. None of them requires writing code.

1. Add a second #Case block below the first

Each case is parsed independently by the regex at agent.ts line 569. Cases share browser state (cookies, localStorage), so case 2 can assume case 1's login, but verdicts are isolated: a failing case 1 does not abort case 2.

2. Lean on create_temp_email for the signup path

For any `#Case` that would otherwise need a fixture email, write the plan as `1. Call create_temp_email to get a throwaway address 2. Type that email into the Email field ...`. The agent wires the temporary inbox to the verification step automatically.

3. Use http_request for webhooks and third-party integrations

After connecting, for example, a Telegram bot in the UI, add a step like `Poll https://api.telegram.org/bot<token>/getUpdates and assert that a message with text 'connected' arrived`. The agent runs that with the http_request tool without needing a stub server.

4. Parameterize with variables instead of hardcoding

Pass a variables map to assrt_test and reference it inline: `Type {{EMAIL}} into the Email field`. The same plan file now runs against localhost, staging, and prod without edits. Variables resolve before the regex splits the file.

5. Run it, then read /tmp/assrt/<runId>/

That directory is the whole debugger: video, numbered screenshots, events.json timeline, and the pass/fail JSON. Tar it, attach to the bug, move on. There is no separate UI to learn.

Why this tutorial runs on your laptop

Every other AI testing tutorial at this keyword quietly assumes a SaaS subscription. The plan format is their DSL, the runner is their cloud, and the artifacts live behind their dashboard. The `#Case` + local runner shape is a different trade: you own the plan file, you bring your own LLM key, and the whole pipeline is three repos you can fork.

| Feature | Typical SaaS AI tester | Assrt (#Case) |
| --- | --- | --- |
| Plan format | Proprietary YAML or visual DSL | Plaintext `#Case N:` blocks |
| Model the plan runs through | Unnamed vendor model | Claude Haiku 4.5 (pinned at agent.ts line 9) |
| Agent tool surface | Opaque action catalog | 18 named tools (agent.ts lines 18-186) |
| Plan portability | Unexportable after cancel | scenario.md stays in your repo |
| Run artifacts | Behind a rented dashboard | Video, screenshots, events.json on disk |
| Cost to start the tutorial | Sales call + subscription | `npx assrt-mcp` + an LLM key |
| Software cost at team scale | $7.5K/mo typical | $0 plus LLM tokens |

You have finished the tutorial when

If you can tick every item, you understand the entire stack — plan format, planner model, tool surface, artifact layout — at the level of a line number in the source. That is usually the difference between someone who has “used AI testing” and someone who can debug it.

Self-check

  • I can find the `#Case` regex at agent.ts line 569 and explain what it matches
  • I know the default model and its id: claude-haiku-4-5-20251001 (agent.ts line 9)
  • I can name 6 of the 18 tools without looking them up
  • My run directory under /tmp/assrt/<runId>/ contains video, screenshots, events.json, results
  • My `#Case` passes on rerun with the same plan, on a cold browser
  • I have wired an assertion that would fail if the app silently breaks
  • I can tar the run directory and attach it to a bug report without screenshots from my desktop

Frequently asked questions

What is an AI testing tutorial supposed to teach you?

A real AI testing tutorial should walk you from zero to a runnable test in one session: show the format the plan is written in, the model that interprets it, the tools the model can call, and the artifacts the run leaves behind. The tutorials that rank on Google for this phrase today skip all four. They define AI testing, list benefits, and stop. This guide uses Assrt as the concrete example because every layer of its pipeline is a file or a line of code you can read: the plan is a plaintext `#Case` block, the model is pinned at agent.ts line 9 to claude-haiku-4-5-20251001, the tools are enumerated from agent.ts lines 18 to 186, and the artifacts are written to /tmp/assrt/<runId>/ every run.

What is a #Case and how does it differ from a Playwright spec?

A `#Case` is a named block of plain English inside a scenario file. The header takes the shape `#Case N: name` (the regex at agent.ts line 569 is `/(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi`), the body is a numbered list of actions in regular prose. A Playwright spec is TypeScript: `await page.goto(...); await page.getByRole('button').click();`. Both drive the same browser engine (Assrt spawns @playwright/mcp under the hood), but a `#Case` is written by a person describing intent, while a spec is written by a programmer describing mechanics. When you run the plan, an LLM reads the `#Case` text and decides which Playwright MCP tool to call next.

Which model reads the plan, and why does that matter for a tutorial?

The default is Claude Haiku 4.5 (model id `claude-haiku-4-5-20251001`, pinned at assrt-mcp/src/core/agent.ts line 9 as DEFAULT_ANTHROPIC_MODEL). It matters because most AI testing tutorials treat the model as a black box; knowing which one you are running lets you reason about cost and latency. Haiku 4.5 is the smallest current Claude model, and one `#Case` typically costs fractions of a cent per run. You can override with `--model` to a larger model like Sonnet 4.6 if your flow is complex, or Gemini 3.1 Pro via the same flag. The override mechanism is a thin switch in agent.ts; nothing about the plan file has to change.

How many browser tools can the agent actually call?

Exactly 18. Listed top to bottom in assrt-mcp/src/core/agent.ts between lines 18 and 186: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, wait_for_stable. Ten of those are standard Playwright MCP primitives (navigate through evaluate). Three handle disposable email (create_temp_email, wait_for_verification_code, check_email_inbox). Three are test-specific (assert, complete_scenario, suggest_improvement). Two deal with integrations and timing (http_request, wait_for_stable). For most tutorials you will use about half of them; the rest are there so signup flows and third-party integrations do not need a custom runner.

What does the agent write to disk after a run, and where?

Everything lives under /tmp/assrt/, pinned in assrt-mcp/src/core/scenario-files.ts at ASSRT_DIR on line 16. After a run you will find: scenario.md (the plan you just ran), scenario.json (metadata: id, name, url), results/latest.json (the most recent pass/fail), results/<runId>.json (historical runs), and a per-run directory with screenshots/, a WebM recording, and a self-contained player.html that opens in any browser without network access. The paths are deliberately boring filesystem paths so you can tar the whole directory into a CI artifact or attach it to a bug report.

Do I need to own an LLM key to follow this AI testing tutorial?

Yes, one. Assrt is BYO-LLM: you provide either an Anthropic or a Google Gemini key, stored in macOS Keychain via the built-in `keychain` helper in assrt-mcp/src/core/keychain.ts, and the agent makes the inference calls from your machine. There is no vendor cloud. This is the difference between this tutorial and a Testim or mabl tutorial: those platforms charge roughly $7.5K per month at team scale because you rent their cloud execution and their test format. In the tutorial below, every step runs on your laptop, every file stays on your disk, and the `#Case` format has zero lock-in.

How do I know the #Case parser will handle my file the same way?

Read the regex. At agent.ts line 569 the scenario parser is exactly `/(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi`. It splits the plan text on any of `Scenario`, `Test`, or `Case` headers, optionally preceded by `#`, optionally followed by a number, followed by `:` or `.`. Everything between headers becomes the body of one scenario. The regex is deterministic; the same input produces the same scenario list on every machine. If your plan has multiple cases and the runner only picks up one, look at the header formatting first — a missing colon or a stray word in the header line is almost always the cause.
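You can check the split behavior yourself in Node. The regex below is copied verbatim from the text above; the two-case plan it splits is made up for the demonstration:

```javascript
// The exact header regex quoted above, copied verbatim.
const HEADER = /(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi;

// A made-up two-case plan in the #Case shape.
const plan = [
  "#Case 1: login",
  "1. Open the login page",
  "2. Click Sign in",
  "#Case 2: logout",
  "1. Click the avatar menu",
].join("\n");

// String.split on the regex drops the header matches;
// each remaining chunk is one case name plus its body.
const cases = plan.split(HEADER).filter((c) => c.trim().length > 0);

console.log(cases.length);            // 2
console.log(cases[0].split("\n")[0]); // "login"
```

Note that a body line containing a bare `case 3:` mid-sentence would also match, which is exactly why the FAQ above tells you to look at header formatting first when the runner picks up the wrong number of cases.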

What does a failing #Case look like, and how do I debug it?

Open the run directory at /tmp/assrt/<runId>/. Start with the video: player.html opens the WebM at 5x by default (the keyboard binding is defined inline in assrt-mcp/src/mcp/server.ts around lines 96-107 — press 1, 2, 3, 5, or 0 to rebind speed). In 30 seconds of watching, the wrong click is usually visible. Then open screenshots/; filenames are `NN_stepN_<action>.png` so `ls` is already a table of contents. Finally, read events.json for the structured timeline: every assertion row has {description, passed, evidence}, where evidence is the English sentence Haiku 4.5 wrote about what it saw on the page. That narrative is the hard part to get from a raw Playwright trace.

Is this tutorial tied to Assrt, or does it generalize to other AI testing tools?

The general shape generalizes: plaintext plan, LLM planner, browser tool set, on-disk artifacts. The named file paths, the `#Case` regex, and the tool list are Assrt specifics. If you are evaluating a different tool, apply the same lens — ask it to show you where the regex lives, which model is pinned, how many tool primitives the agent can call, and what files land on your disk after a run. A tool that cannot answer those four questions in under five minutes is renting you a dashboard, not teaching you AI testing.

How long does the whole tutorial take to run end to end?

Writing the `#Case` block: under two minutes, because it is English. First run: about a minute while the agent takes snapshots and picks selectors. Reading the artifact tree: about 30 seconds. Total: under three minutes from an empty file to a green `PASS` plus a video, a sequence of screenshots, and a JSON result file on your disk. The point of the `#Case` format is that writing the plan is the only step that requires thought; the rest is execution and file I/O.

Can I run the same #Case headless in CI?

Yes. Pass `--headed=false` (the default) and the runner launches Chromium through @playwright/mcp in headless mode. In GitHub Actions, add a step that calls the CLI with the scenario file, then upload /tmp/assrt/<runId>/ as a workflow artifact. Every artifact — plan, video, screenshots, events, results — is a portable file, so the post-run analysis is the same whether you ran locally or in CI. No dashboard, no cloud account, no extra vendor tokens.
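A minimal GitHub Actions job in that shape. The checkout, Node setup, and artifact upload use standard actions; the ANTHROPIC_API_KEY environment variable is an assumption for Linux runners, since this tutorial only documents macOS Keychain storage for the key, so verify the key mechanism in the assrt-mcp docs before copying this:

```yaml
name: ai-tests
on: [push]

jobs:
  scenario:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Assumption: the runner falls back to an env var for the LLM key.
      - run: npx assrt-mcp --url https://staging.example.com --plan scenario.md
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      # Upload the whole artifact tree, pass or fail.
      - if: always()
        uses: actions/upload-artifact@v4
        with:
          name: assrt-run
          path: /tmp/assrt/
```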

What should I test first if this is my first AI test ever?

The narrowest flow that returns a visible success state. For a SaaS dashboard, that is usually: open the login page, type a known email and password, click sign in, assert the dashboard heading appears. Three `#Case` lines. If that passes on the first try, expand to the next flow (signup with a disposable email via `create_temp_email`, then profile update, then a destructive action like cancel subscription). The mistake that costs new teams a day is starting with a flow that has six steps and three edge cases; a single-case plan that passes teaches you more about the pipeline than a plan that never finishes.
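Written as a `#Case` block, that narrowest flow is only a few body lines (the heading text and field names are placeholders for your app's):

```
#Case 1: Known-good login
1. Open the login page
2. Type a known email and password into the sign-in form
3. Click the Sign in button
4. Assert that the Dashboard heading appears
```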
