Testing AI without flaky assertions

Every guide about testing AI products tells you to "use semantic evaluation" and points at a $7.5K per month vendor. This one shows the 54 open-source lines that actually make it work: a DOM-aware wait that survives token streams, and an assert whose evidence field is written by the same Haiku 4.5 that read the page.

Matthew Diakonov · 11 min read
Open-source MCP server used by Claude Code and Cursor to test AI UIs
Adaptive wait, evidence-backed assert, disposable email — one agent
Every run drops events.json, a WebM recording, and numbered screenshots on disk

The two structural problems

An AI product does two things a deterministic test runner was not designed for. It streams its output over several seconds instead of rendering atomically, and it paraphrases the same information on every run. A selector-based wait triggers on the first token and captures an empty bubble. A string-equality assertion fails the second the model rephrases yesterday's correct answer. Both failures look like a flake and get marked as "investigate later," which is how real regressions ship.

The fixes are not philosophical, they are structural. Replace the selector wait with a DOM-mutation wait that returns when the stream actually stops. Replace the string-equality assert with a meaning-space assert that lets an LLM judge the output. The rest of this page walks through how Assrt implements both, where the code lives, and what it looks like on a real chat UI.

The shape of the fix

`waitForSelector('.assistant-message')` + `expect(text).toContain('5 to 7 business days')`. Triggers on the first token, breaks on paraphrase. Flakes 1 in 12 runs at temperature 0.8.

  • Fires on the empty streaming bubble
  • String-equality dies on valid paraphrases
  • Timeout is a fixed sleep, not adaptive

The 54 lines that make it work

The wait primitive is one case block inside the main tool-dispatch switch in assrt-mcp/src/core/agent.ts. Lines 956 through 1009. No hidden helpers, no vendored dependency, no remote service. When the Haiku driver picks the tool, the handler injects a MutationObserver, polls a counter, resets on every change, returns when the counter has been quiet for stable_seconds, and cleans up after itself.

assrt-mcp/src/core/agent.ts
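The logic is simple enough to sketch. The following is a hypothetical reconstruction of the poll loop, not the actual agent.ts code: the mutation counter (in the real tool, `window.__assrt_mutations`, incremented by the injected MutationObserver) is abstracted behind a `readCount` callback so the sketch is self-contained and runnable.

```typescript
// Hypothetical sketch of the wait_for_stable poll loop (not the literal agent.ts code).
// readCount stands in for reading window.__assrt_mutations out of the page.
type StableResult = { stabilized: boolean; seconds: number; mutations: number };

async function waitForQuiet(
  readCount: () => Promise<number>,
  stableSeconds = 2,    // DOM must be quiet this long before we return
  timeoutSeconds = 30,  // outer ceiling so a hung stream cannot wedge the runner
  pollMs = 500,         // counter poll interval
): Promise<StableResult> {
  const start = Date.now();
  let last = await readCount();
  let quietSince = Date.now();

  while (Date.now() - start < timeoutSeconds * 1000) {
    await new Promise((r) => setTimeout(r, pollMs));
    const now = await readCount();
    if (now !== last) {
      // Stream still going: remember the new count and reset the quiet timer.
      last = now;
      quietSince = Date.now();
    } else if (Date.now() - quietSince >= stableSeconds * 1000) {
      return { stabilized: true, seconds: (Date.now() - start) / 1000, mutations: now };
    }
  }
  return { stabilized: false, seconds: timeoutSeconds, mutations: last };
}
```

In the real tool the callback would be a `page.evaluate` that reads the injected counter, and the handler disconnects the observer and deletes the globals afterwards.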

The defaults are deliberately conservative. Two seconds of DOM quiet is enough to clear a typing indicator, long enough to ride out a streaming paragraph, short enough that a snappy UI does not pay for stability time. The thirty-second outer ceiling keeps the runner from wedging on a broken backend. Both can be bumped per tool call when the model decides a longer wait is warranted.

54 lines of source that make wait_for_stable work
18 tools the Haiku driver can call
2 default seconds of DOM quiet before the wait returns
30 max seconds before the outer timeout trips

The second primitive: evidence-backed assert

Settling the DOM only buys you a reliable moment to read. The assertion itself still has to judge meaning. The assert tool declared at agent.ts:133-144 takes three fields: the claim, the passed boolean, and an English evidence string. The Haiku driver reads the snapshot, writes the sentence, sets the boolean. That record lands in the assertions array at line 897 and is serialized to disk after the scenario.

assrt-mcp/src/core/agent.ts
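The record shape is the three fields the paragraph names. A minimal sketch, assuming only that shape: the `recordAssertion` helper and the standalone array are illustrative plumbing, not the actual agent.ts internals.

```typescript
// Illustrative sketch of the evidence-backed assert record. The field names
// come from the tool contract; the helper around them is hypothetical.
interface AssertionRecord {
  description: string; // the claim, stated in meaning-space
  passed: boolean;     // the driver's verdict
  evidence: string;    // English sentence written by the same model that read the page
}

const assertions: AssertionRecord[] = [];

function recordAssertion(description: string, passed: boolean, evidence: string): AssertionRecord {
  const record: AssertionRecord = { description, passed, evidence };
  assertions.push(record); // in Assrt, this array is serialized to disk after the scenario
  return record;
}

recordAssertion(
  "the reply explains the refund policy",
  true,
  "the reply states refunds land in 5 to 7 business days",
);
```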

The model is both the driver and the judge, in the same tool call. This is the design choice that actually matters. An architecture with a separate driver and a separate LLM judge doubles your inference cost and doubles the surface area for disagreement. Here the tester's view of the page is the same view it uses to reason about the claim. When it writes "the reply states refunds land in 5 to 7 business days", that sentence is anchored to the DOM it just snapshotted.

One agent, two primitives, one audit trail

Message typed → Send clicked → Stream begins → wait_for_stable → Stream ends → snapshot → assert + evidence → events.json
What a real chatbot case looks like

The plan is plaintext. The parser regex at agent.ts:621 splits on headers of the form #Case N: name. Each case is a numbered list of actions in regular English. The Haiku driver picks the tool for each line from the 18-item palette. Most chat regressions fit in six to ten lines.

scenario.md
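A hypothetical plan in that format; the route and wording are invented for illustration, but the shape (one `#Case` header, numbered English actions) matches the parser's expectations as described above.

```
#Case 1: refund policy answer survives the stream
1. Navigate to /chat.
2. Type "What is your refund policy?" into the message input.
3. Click Send.
4. Call wait_for_stable with stable_seconds=3.
5. Assert the assistant response describes the refund policy.
6. Assert the message footer shows at least one source URL.
```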

Run it from the CLI. The runner drops a run directory under /tmp/assrt/<runId>/, streams events to your terminal, and writes a WebM recording plus numbered screenshots. The evidence sentences are inline in the output and also stored in events.json for later review.

assrt-mcp run, one case with a streaming assistant

Token stream → stable DOM → evidence-backed assert

Lanes: #Case plan · Haiku 4.5 · Browser · events.json
plan line: Assert reply explains refund → type_text("What is your refund policy?") → click(Send) → wait_for_stable(stable_seconds=3) → stabilized after 3.4s, 412 mutations → snapshot() → accessibility tree with reply text → assert { description, passed, evidence }

Five failure modes you stop seeing

Every deterministic suite pointed at a streaming UI eventually collects the same five flake patterns. Swapping the wait and the assert primitives eliminates four of them outright and turns the fifth into a legitimate bug you can file.

1

The first-token trap

Selectors match the empty chat bubble the moment the assistant starts typing. Your assertion compares against a placeholder and passes when it should fail.

Fix: swap the selector wait for wait_for_stable. Pass stable_seconds=3 for typical chat UIs; raise it to 5 if the model streams more than 200 tokens on your slow test environment.

2

The paraphrase trap

assertEqual on AI output is a flake waiting to ship. The answer today will be syntactically different from the answer tomorrow, even on the same prompt with temperature 0.2.

Fix: call assert with a description in meaning-space ("the response explains how to reset a password"). The Haiku driver reads the snapshot and decides; you read its evidence sentence if it failed.

3

The infinite-spinner trap

A keep-alive token from a broken backend keeps the typing indicator moving. The stream never ends, the fixed sleep expires, the assertion fires on empty text.

Fix: wait_for_stable returns a mutation count when it times out. A chat that kept mutating past the ceiling is a legitimate product bug, and now you have a metric (mutations over seconds) to file it with.

4

The silent regression trap

The chatbot answers, but it stops citing sources, or it starts hallucinating a refund window that no longer matches policy. Deterministic tests pass; your customers file tickets.

Fix: add a second assert per case checking the structural guarantees, not just the content. "Assert the message footer shows at least one source URL." That structure is stable even when the prose is not.

5

The temperature drift trap

Your team raises temperature to 0.8 for personality. Every deterministic assertion now flakes 1 run in 12, and CI starts feeling unreliable enough that failures get ignored.

Fix: keep the plan in meaning-space and run the tester at the same temperature the product uses. The Haiku driver is the judge; it tolerates the same variance your users tolerate.

The rest of the agent, at a glance

The two primitives are the load-bearing pieces, but they sit inside a larger loop with sixteen more tools. Three matter most for AI products: the disposable-email trio for gated signup flows, the http_request tool for verifying RAG and vector-store behavior from the same loop that drove the UI, and the plaintext plan format that keeps the test in your repo instead of a vendor cloud.

Token-aware wait

wait_for_stable watches DOM mutations and returns the instant the stream goes quiet. No fixed sleep, no selector poll, no race on the first token.

stable_seconds=2 (default, max 10)
timeout_seconds=30 (default, max 60)
poll_interval=500ms

Evidence-backed assert

Every assertion stores an English sentence explaining why it passed or failed. That sentence is written by the same Haiku that read the page.

Disposable email

create_temp_email + wait_for_verification_code handle magic links for AI products that gate features behind signup. No fixture inboxes.

http_request tool

Hit your RAG backend, a vector store, or the observability dashboard directly from the same loop that drove the chat UI. Integration tests and browser tests share one agent.

http_request("/api/rag/debug?q=refund")
→ assert top retrieved doc is policies/refunds.md

Plan is plaintext

A regression test for your chatbot is a `#Case` block in English. The parser regex is at agent.ts:621. No YAML, no visual recorder, no proprietary format.

Artifacts on disk

Every run drops scenario.md, events.json, numbered screenshots, a WebM recording, and a self-contained player.html under /tmp/assrt/<runId>/. Tar it, attach it, move on.


"The evidence field is the first time I've seen an AI test runner store its own rationale. Half my bug reports now cite the model's own sentence."

Staff engineer, AI product startup

How it stacks against the category

The commercial AI-testing tools ship a similar pattern, but the machinery hides behind a dashboard and a proprietary DSL. You pay for a driver model plus a judge model, and the assertion rationale lives inside their cloud. Here every piece is a file on disk and every line is readable.

| Feature | Typical AI-testing platform | Assrt (open source) |
| --- | --- | --- |
| Wait primitive for streaming output | Proprietary "smart wait" behind a dashboard | `wait_for_stable` (MutationObserver, 54 lines, agent.ts:956) |
| How assertions judge meaning | Separate judge model, billed per assertion | `assert.evidence` field written by the same Haiku that read the page |
| Plan format | Proprietary YAML or visual DSL | Plaintext `#Case N:` blocks |
| Where the code lives | Behind a SaaS control plane | Open source on GitHub, grep-friendly |
| Where run artifacts land | Vendor cloud, unexportable after cancel | /tmp/assrt/<runId>/ on your disk |
| Cost per assertion | Separate driver + judge, multiple calls | One Haiku 4.5 call, fractions of a cent |
| Software cost at team scale | $7.5K/mo typical | $0 + LLM tokens |

Adopting the pattern in your repo

You can run this whole loop with npx assrt-mcp plus an Anthropic key. You can also cherry-pick the primitive. The MutationObserver block is copy-pasteable into a plain Playwright test, and the evidence-field pattern is a contract you can impose on your own runner in an afternoon. The point of the page is the primitives, not the tool you run them with.

Five steps from zero to a passing chat test

The shortest path from an empty scenario.md to a green assertion on a streaming assistant. Each step maps to exactly one primitive in the agent loop.

1

Write a plaintext `#Case` block

One header line and a numbered body. The regex at agent.ts:621 splits on the header so you can stack many cases in one scenario.md.
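The split can be sketched in a few lines. The regex below is an assumption modeled on the `#Case N: name` header format described in the text, not the literal expression at agent.ts:621.

```typescript
// Hypothetical case splitter for plaintext plans (the regex is illustrative,
// not the actual agent.ts:621 expression).
interface PlanCase { name: string; body: string }

function splitCases(plan: string): PlanCase[] {
  const cases: PlanCase[] = [];
  // Match headers like "#Case 1: refund policy" at the start of a line.
  const header = /^#Case\s+(\d+):\s*(.+)$/gm;
  const matches = [...plan.matchAll(header)];
  matches.forEach((m, i) => {
    const start = m.index! + m[0].length;
    const end = i + 1 < matches.length ? matches[i + 1].index! : plan.length;
    cases.push({ name: m[2].trim(), body: plan.slice(start, end).trim() });
  });
  return cases;
}
```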

2

Pass the prompt, click Send, call wait_for_stable

The three primitives that replace a deterministic selector wait. The MutationObserver takes over until the token stream actually stops.

3

Assert in meaning-space, not string-space

`Assert the response describes the refund policy` beats `expect(text).toContain("5 to 7 business days")`. The Haiku driver writes the evidence sentence explaining what it saw.

4

Read events.json after the run

Every assertion is stored as {description, passed, evidence}. When a test fails, that evidence sentence is often the only debugging artifact you need.
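Pulling the failing evidence out of a run is a few lines. The `{description, passed, evidence}` shape comes from the text; the exact top-level layout of events.json is not specified here, so this sketch takes the parsed assertions array directly rather than guessing at file structure.

```typescript
// Sketch: surface the evidence sentences for failed assertions.
// Record shape is from the tool contract; parsing the file is left to you.
interface AssertionRecord { description: string; passed: boolean; evidence: string }

function failureReport(records: AssertionRecord[]): string[] {
  return records
    .filter((a) => !a.passed)
    .map((a) => `FAIL: ${a.description} (${a.evidence})`);
}
```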

5

Re-run on the same browser profile

Scenarios share browser state. Case 1 logs in, Case 2 asks the chatbot a question, Case 3 tests the admin view. Zero fixture setup between cases.

Quality bar for an AI product test suite

  • wait_for_stable has an outer timeout so a hung stream cannot wedge the runner
  • The MutationObserver is disconnected and globals are deleted after every call
  • Every assertion is stored with a description, a passed boolean, and an evidence sentence
  • The tester model is the same Haiku 4.5 that read the page, not a second judge
  • The scenario plan stays in plaintext under your repo, with no hidden vendor DSL
  • Every run artifact lands as a file under /tmp/assrt/<runId>/ that you can tar and attach

Walk through your chat UI with us

Thirty minutes, shared screen, one `#Case` against your streaming assistant. See the evidence field land in events.json in real time.

Testing AI products, answered

Why do deterministic test runners break on AI products?

Two reasons, both structural. First, streaming: an LLM chat reply arrives as tokens over two to five seconds, so a selector-based wait triggers on the first token and your test asserts against a half-finished sentence. Second, non-determinism: the same prompt returns a different paraphrase on every run, so `expect(text).toBe("…")` fails even when the answer is correct. The two problems have two fixes: an adaptive wait that measures DOM quiet instead of elapsed time, and a semantic assertion that judges meaning instead of comparing strings. Assrt builds both into the same Haiku 4.5 agent loop so you do not need a second model acting as judge.

Where is the streaming-aware wait primitive in the Assrt source?

assrt-mcp/src/core/agent.ts, lines 956 to 1009. It is the `case "wait_for_stable"` block inside the main tool switch. The tool itself is declared earlier at lines 186 to 195, with two parameters: `timeout_seconds` (default 30, clamped to 60) and `stable_seconds` (default 2, clamped to 10). At runtime, the agent injects `window.__assrt_mutations = 0` and a `new MutationObserver` that increments that counter on every DOM change, then polls the counter every 500 ms. A quiet period is declared when the counter has not changed for stable_seconds. When the page goes quiet, the observer is disconnected and both globals are deleted. You can clone assrt-mcp and grep for `wait_for_stable` to see the whole thing on one screen.

How is the semantic assertion different from `expect(text).toContain()`?

The `assert` tool, declared at agent.ts:133-144 and implemented at 893-904, takes three fields: `description` (what you claim is true), `passed` (boolean), and `evidence` (the model's English sentence describing what it saw). The Haiku 4.5 driver reads the page snapshot, decides whether the claim holds, and writes evidence like "the assistant response describes the refund timeline as 5 to 7 business days, which matches the documented policy." That sentence is stored alongside the pass/fail bit in the run's events.json. When you say `Assert the bot explained the refund policy`, the model decides, and you read its rationale afterward instead of debugging a brittle regex.

Do I still need a second LLM as a judge?

No. The driver and the judge are the same model in this design. Haiku 4.5 (pinned at agent.ts:9 as `DEFAULT_ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"`) performs the click, reads the snapshot, writes the evidence, and sets the passed boolean in one tool call. The alternative architecture, where a driver model records the conversation and a separate judge model scores it, doubles your LLM cost and adds a second round of hallucination. Collapsing the two roles keeps latency low (one Haiku request per assertion) and makes the full audit trail visible in events.json.

What does wait_for_stable produce when the page actually stabilizes?

A string of the form `Page stabilized after 3.4s (412 total mutations)`, written to the step log at agent.ts:998. If it times out instead, the string is `Timed out after 30s (page still changing, 1874 mutations)`. Both results are stored in the step record that lands in events.json under the run directory. That counter, the total mutation count, is the single most useful diagnostic for testing AI UIs: a chat response that keeps mutating past the timeout is almost always one of three things — a typing indicator that never stops, a retry loop firing on a 5xx, or a streaming endpoint with no end-of-stream signal.

How does this interact with create_temp_email for signup flows?

The three temp-email tools (create_temp_email, wait_for_verification_code, check_email_inbox, declared at agent.ts:114-131) run independently of the stability wait. A typical AI product signup looks like: create a throwaway address, type it into the signup form, click continue, call wait_for_stable because the magic-link email takes a variable amount of time to land, then wait_for_verification_code with its own 60 s poll, then assert. The stable wait protects you when the signup UI streams a welcome message; the verification wait is a separate poll against the disposable inbox. Either can run inside a single `#Case` block without you having to juggle promises.

Does the evidence field actually get stored, or is it just a log line?

Stored. At agent.ts:897 the handler pushes `{description, passed, evidence}` into the `assertions` array for the current scenario, and that array is serialized into the ScenarioResult that lands in /tmp/assrt/results/latest.json and /tmp/assrt/results/<runId>.json after the run. The record is durable: you can cat the JSON a week later and still see the exact sentence the model wrote about the page. For testing AI features, that sentence is often more useful than the screenshot — it captures the model's understanding of what the assistant said, which is the thing you actually wanted to verify.

What if the AI response never stabilizes?

The wait block hits its timeout_seconds ceiling and returns the `Timed out after ...` string. Control passes back to the driver model, which then calls snapshot and decides what to do. Usually it calls assert with passed=false and evidence like "the response was still streaming after 30 s, which is outside the product's documented 10 s SLA." That failure mode is a legitimate bug in the AI product — the whole point of the primitive is to turn a fuzzy "it feels slow" complaint into a numeric fact (mutation count over elapsed seconds) that you can attach to the bug report.

Can I use this approach without Assrt?

Yes, the primitive is 54 lines of TypeScript and has no Assrt-specific dependencies inside the evaluate() block. You can copy the MutationObserver expression from agent.ts into a Playwright test and call it from your own runner. What Assrt adds on top is the loop: an LLM driver that decides when to call wait_for_stable, what to assert about afterwards, and what to write into the evidence field. Rebuilding that loop yourself is the bulk of the work. But the browser-side primitive is literally copy-paste, and nothing prevents you from using only that piece.

How does this compare with $7.5K/mo AI testing platforms?

Category tools like mabl and Testim ship a similar pattern behind a proprietary DSL: you describe the assertion in English, they route it through an internal judge model, and you pay per-test execution. The resulting artifacts (videos, assertions, model outputs) live behind their dashboard. Assrt gives you the same pattern in open source: the tool list is 18 entries in agent.ts, the MutationObserver is a copy-pasteable block, the evidence field is a plain string in JSON on disk, and the cost is the Anthropic tokens for one Haiku 4.5 call per assertion. For an eight-case suite, you are looking at fractions of a cent per run versus an annualized contract.

What does an end-to-end case look like for a chat UI?

A complete `#Case` for testing a RAG chatbot looks like this: `1. Navigate to /chat. 2. Type "What's your refund policy?" into the message input. 3. Click Send. 4. Call wait_for_stable with stable_seconds=3 to ride out the stream. 5. Assert the assistant response describes the refund policy, and provide the response text as evidence. 6. Assert the response cites at least one source URL visible in the message footer.` Six plaintext lines, one `#Case` header, zero TypeScript. The Haiku driver decides which of the 18 tools to call for each line. The run artifacts land under /tmp/assrt/<runId>/ and include the WebM recording so you can watch the stream replay at 5x.

Is wait_for_stable safe to call on pages that never go quiet?

Safer than a fixed wait, bounded by design. The outer timeout (default 30 s, max 60 s) always fires even if the DOM never stabilizes, so a hung app cannot wedge the runner. After the timer or the stability condition returns, the cleanup block at lines 990-994 unconditionally disconnects the observer and deletes `window.__assrt_mutations` and `window.__assrt_observer`, so a subsequent navigation does not inherit a stale observer. A chat UI that genuinely streams forever (say, a broken backend that keeps emitting keep-alive tokens) will cost you exactly one timeout's worth of time and leave no residue.
