Testing for AI writing

AI writing features break normal end-to-end tests in three specific ways. The response time is variable. The output is never byte-equal across runs. The DOM mutates the entire time tokens are streaming in. A spinner-then-static-text mental model does not survive contact with a real chat reply, an inline AI rewrite, or a generated subject line. This guide walks through the three primitives Assrt uses to handle each one, with the exact source lines so you can read them yourself.

Matthew Diakonov
9 min read

Direct answer (verified 2026-05-01)

How do you E2E test an AI writing feature?

Three primitives. (1) A MutationObserver-based wait_for_stable that adapts to actual streaming time, defined in /Users/matthewdi/assrt-mcp/src/core/agent.ts at line 186. (2) Free-text passCriteria instead of string equality, taken by assrt_test at server.ts line 343 and asserted on by the agent at agent.ts line 133. (3) Accessibility-tree refs instead of CSS selectors, served by @playwright/mcp, so the ref to a streaming region survives every token tick. Drive all three from any MCP client (Claude Code, Cursor, plain CLI).

Verified by reading the linked file paths in the open-source repo at github.com/assrt-ai/assrt-mcp.

Why this is different from any other E2E test you have written

A normal end-to-end test has a clear shape. Click a button. Wait for a known thing to appear. Assert the page now equals a known value. Order placed. Email sent. Cart total $42.99. Every step can be expressed as a transition between two static states, and every assertion can be expressed as exact equality.

An AI writing feature has none of those properties. Click Generate and the response begins arriving as a Server-Sent Events stream. The DOM updates token by token: an empty paragraph, then one word, then a sentence fragment, then a full sentence, then more. The total time depends on prompt length, model load, and network jitter. The text itself is the output of a language model that almost never produces the exact same string twice. And the whole thing happens inside a region whose descendants are being rewritten while your test is trying to find them.

Feature | Normal feature | AI writing feature
Response time | Response is constant; a 200ms wait is fine | Response time is variable, from 1s to 30s+; needs an adaptive wait
Output equality | expect(text).toBe('Order placed') | Output is never byte-equal across runs; needs semantic criteria
DOM during the response | DOM transitions once, from loading to final | DOM mutates token by token; selectors flicker mid-stream
Loading affordances | A spinner appears, then disappears | A Stop button is hot during the stream, cold after; testable as a state machine
Failure modes | Network 500, validation error, redirect | All of the above plus: rate-limit, refused content, prompt-injection echo, model timeout, partial stream
Selector stability | CSS selectors survive most refactors | CSS selectors break on every frame of streaming; need accessibility-tree refs
Reasonable assertion | Exact string equality | Quote the rendered text and judge against a free-text criterion
Flake budget | Network jitter, rare race conditions | All of the above plus: stochastic model output, streaming chunk size, refused-content guardrails

The test most teams write first

Most engineers reach for the same template. A 5-second timeout. A string-equality assertion against a sample response. It looks reasonable in review and ships green on the first run. Then the model gets faster on Tuesday and the timeout becomes wasteful; the model gets slower on Friday and the timeout races the stream; a routine prompt-template tweak changes the wording and every equality assertion in the suite goes red. The test is technically correct and operationally useless. Below is the same feature expressed two different ways.

Same feature, two test shapes

A fixed waitForTimeout and a toHaveText against a recorded sample. Race condition on slow runs, stale equality on fast prompt edits, no leverage on the streaming UI primitives that should be testable.

  • page.waitForTimeout(5000) - races the stream
  • toHaveText('Cold-water swimming is...') - never byte-equal twice
  • Selector .response-text resolves mid-stream and goes stale
  • Loading and Stop button states are not asserted at all

From flaky to stable, line for line

// Hand-written Playwright test against an AI writing feature.
// Looks reasonable. Flakes on roughly half of CI runs.

import { test, expect } from "@playwright/test";

test("generate draft", async ({ page }) => {
  await page.goto("http://localhost:3000/editor");
  await page.getByRole("button", { name: "Generate" }).click();
  await page.fill("textarea[name=prompt]",
    "write a short blog intro about cold-water swimming");
  await page.keyboard.press("Enter");

  // Pick a number, any number.
  await page.waitForTimeout(5000);

  // Race condition: the stream may still be writing when this runs.
  await expect(
    page.locator(".response-text"),
  ).toHaveText(
    "Cold-water swimming is the act of immersing yourself...",
  );
});
The stable version, the scenario plan shown later in this guide, comes in at 25% fewer lines and zero brittle selectors.

Primitive 1: a stability wait that adapts to streaming time

The first primitive sits at /Users/matthewdi/assrt-mcp/src/core/agent.ts line 186. The tool is called wait_for_stable. Its description, served to the LLM driving the test, names the case it was built for: chat AI responses, loading states, and other async DOM activity. The system prompt at line 250 is even more explicit: “When the page has loading states, streaming AI responses, or async content, use wait_for_stable to wait until the DOM stops changing. This is better than wait with a fixed time because it adapts to actual load speed.”

The implementation lives at line 956. It injects a MutationObserver onto the page, observing document.body with childList: true, subtree: true, characterData: true. It increments a counter on every mutation, polls that counter every 500ms, and exits as soon as the count has been stable for the requested window. The result string returned to the agent looks like Page stabilized after 12.4s (847 total mutations) on success or Timed out after 30s (page still changing, 1924 mutations) when the model is genuinely stuck.

What wait_for_stable actually observes

  • childList: any element added to or removed from the response region
  • subtree: yes, observe nested children too (token spans nest deep)
  • characterData: yes, watch text-node updates inside spans
  • Poll cadence: 500ms, reading window.__assrt_mutations
  • Default budget: 30s timeout, 2s of zero mutations to call it stable
  • Hard caps: 60s timeout, 10s stability (agent.ts lines 957 and 958)
  • Cleanup: observer.disconnect() and delete the window globals on exit
  • Result: returns 'Page stabilized after 12.4s (847 total mutations)'
The poll cadence is 500ms, read off window.__assrt_mutations (agent.ts line 979, inside the wait_for_stable case block). That is lower than the typical token chunk interval, so the check never misses a stream that is still writing.
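
If you want the shape of that logic without opening the repo, here is a minimal sketch of the same technique written against Playwright's Page API. It is not the agent.ts source: the helper name waitForStable and its structure are illustrative, while the knobs (the 500ms poll, window.__assrt_mutations, the result strings) follow the description above.

// Minimal sketch of the wait_for_stable technique, assuming Playwright's Page API.
// Not the actual agent.ts implementation; the defaults mirror the ones described above.
import type { Page } from "@playwright/test";

async function waitForStable(page: Page, timeoutMs = 30_000, stableMs = 2_000): Promise<string> {
  const start = Date.now();

  // Install an observer that counts every DOM mutation under document.body.
  await page.evaluate(() => {
    const w = window as any;
    w.__assrt_mutations = 0;
    w.__assrt_observer = new MutationObserver(() => { w.__assrt_mutations++; });
    w.__assrt_observer.observe(document.body, {
      childList: true,
      subtree: true,
      characterData: true,
    });
  });

  let lastCount = -1;
  let stableSince = Date.now();

  // Poll the counter every 500ms; declare stability once it stops moving for stableMs.
  while (Date.now() - start < timeoutMs) {
    await page.waitForTimeout(500);
    const count = await page.evaluate(() => (window as any).__assrt_mutations as number);
    if (count !== lastCount) {
      lastCount = count;
      stableSince = Date.now();
    } else if (Date.now() - stableSince >= stableMs) {
      await cleanup(page);
      const secs = ((Date.now() - start) / 1000).toFixed(1);
      return `Page stabilized after ${secs}s (${count} total mutations)`;
    }
  }

  await cleanup(page);
  return `Timed out after ${timeoutMs / 1000}s (page still changing, ${lastCount} mutations)`;
}

async function cleanup(page: Page): Promise<void> {
  // Disconnect the observer and delete the window globals, as the real tool does on exit.
  await page.evaluate(() => {
    const w = window as any;
    w.__assrt_observer?.disconnect();
    delete w.__assrt_observer;
    delete w.__assrt_mutations;
  });
}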

The practical effect is that one line in your scenario plan, “wait for the page to stop changing”, replaces the entire conversation about how long is long enough. A 1.2-second response returns control after about 3 seconds (1.2s of stream plus 2s of stability margin). A 22-second response returns control after about 24 seconds. Same plan text, same assertion afterwards, no knob to tune per feature.

Primitive 2: free-text criteria, not string equality

The second primitive is the assert tool, defined at agent.ts line 133. It takes three fields: description (what you are asserting, in English), passed (a boolean), and evidence (a quote or observation from the page that justifies the boolean). At the test-runner layer, the same idea is exposed through passCriteria, a free-text field on assrt_test at server.ts line 343. You write the criterion the way you would describe it to a teammate, the agent reads the page, and the test fails if any criterion is not met.

For an AI writing feature this is the difference between a test that is meaningful and a test that ships. Equality says “the response must be exactly this paragraph.” That assertion is false on the very next run because language models are stochastic. Free-text criteria say “the response is non-empty, mentions the topic of the prompt, contains at least three sentence-ending punctuation marks, does not contain Lorem ipsum or visible [object Object], and ends in a way that suggests it finished on purpose rather than mid-word.” That set of criteria is satisfied by an entire family of valid generations and violated by every realistic regression: a server returning a stub, a template token leaking through, a stream cut off halfway, a guardrail returning a refusal in the middle of an editor.
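
In tool-call terms the contrast is small but decisive. Here is a sketch of one assert call the agent might emit for the rewrite feature; the three field names come from agent.ts line 133, and the values are invented for illustration.

// Illustrative assert call. Field names (description, passed, evidence) are the
// ones defined at agent.ts line 133; the text values here are made up.
const assertCall = {
  description:
    "Response is non-empty, mentions cold-water swimming, and has at least " +
    "three sentence-ending punctuation marks",
  passed: true,
  evidence:
    "Cold-water swimming wakes up every nerve you own. The first thirty " +
    "seconds are the worst. After that, something in you settles.",
};

The report keeps a clean boolean per criterion, but the criterion itself stays readable by the teammate who will triage the failure.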

Primitive 3: accessibility-tree refs that survive token churn

The third primitive is not unique to Assrt; it is the accessibility-tree mode that @playwright/mcp uses by default. But it is load-bearing here. A snapshot of the page returns elements as roles plus accessible names plus stable refs like ref=e12. A button is identified as [Button] “Generate” ref=e7, a streaming region as [region] “Response” ref=e22.

During a stream, the tokens land as text-node updates inside that region. The region itself does not change role or name; only its descendants churn. A CSS selector tied to a class name on a generated paragraph element is stale by the next frame. The accessibility ref to the region as a whole stays valid for the full lifetime of the response. After wait_for_stable returns, the agent calls snapshot one more time and reads the final text out of the same ref it had at the start. That sequence, get ref, trigger, wait for stability, re-read same ref, is the canonical shape of every AI-writing test.
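
To make that concrete, here is what two snapshots taken around a stream might look like in accessibility-tree form. The refs, names, and layout are invented for illustration, not real @playwright/mcp output; the point is that ref=e22 identifies the same region in both snapshots, even though every descendant changed in between.

// Snapshot before clicking Generate (illustrative)
[button] "Generate"   ref=e7
[region] "Response"   ref=e22    (empty)

// Snapshot after wait_for_stable returns
[button] "Generate"   ref=e7
[button] "Stop"       ref=e31    (disabled)
[region] "Response"   ref=e22
  [paragraph] "Cold-water swimming wakes up every nerve you own. ..."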

What you actually write

A scenario file. The agent does the rest. There is no Playwright config to learn, no selector helpers to install, no special streaming-test library to import. The plan below is what landed in /tmp/assrt/scenario.md when I tested an AI rewrite button on a Markdown editor; it has stayed green across three model upgrades and one full-page redesign because none of the steps depend on the appearance or the wording of the output.

#Case 1: AI rewrite produces an on-topic, well-formed result
1. Open /editor and click "Sign in" if the auth gate appears.
2. Click "New document" and paste this seed text into the body:
   "the meeting was good and we talked about things"
3. Select the seed text and click the "Rewrite with AI" button.
4. wait_for_stable (timeout 30s, stable 2s).
5. Assert: the body contains the word "meeting", has at least
   two sentence-ending punctuation marks, is between 40 and 400
   characters, and does not contain "Lorem ipsum" or
   "[object Object]". Quote the actual rendered text as evidence.
6. Assert: the "Stop" button is no longer highlighted and the
   "Accept rewrite" button is enabled.

#Case 2: AI rewrite refused content surfaces as a visible error
1. Open /editor on the same session.
2. Paste a clearly disallowed prompt into the body.
3. Select it and click "Rewrite with AI".
4. wait_for_stable.
5. Assert: a visible region with role=alert appears containing
   the word "refused" or "cannot". The body text is unchanged.
6. Assert: the original text is still present byte-for-byte.

Run it from any MCP client (Claude Code, Cursor, plain npx) with assrt_test url plan. The agent walks the steps, calls the primitives at the right moments, and returns a structured report with one boolean per assertion plus the actual quoted text the model produced as evidence. Re-run on every commit. The plan does not need to change when the model changes, because no assertion depends on the exact wording.
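
For reference, here is a sketch of what that call carries. The argument names url, plan, and passCriteria follow this article (server.ts line 343); whether plan takes a file path or inline Markdown, and the exact schema, are assumptions to check against the version you have installed.

// Illustrative assrt_test invocation from an MCP client. Argument names follow
// the article; treat the plan-as-path detail as an assumption.
const call = {
  tool: "assrt_test",
  arguments: {
    url: "http://localhost:3000/editor",
    plan: "/tmp/assrt/scenario.md",
    passCriteria:
      "Every #Case passes: the rewrite is on-topic and well-formed, and the " +
      "refused case shows a visible alert while leaving the original text untouched.",
  },
};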

Where this breaks (be honest with yourself)

Three failure modes are worth naming up front. First, agent judgement failures. The free-text criteria are evaluated by an LLM, and an LLM can occasionally read a real bug as a pass or a real pass as a fail. The mitigation is to write criteria that are mechanical wherever possible (length, punctuation count, presence of literal substrings) and reserve subjective ones for guardrail detection rather than primary signal.

Second, infinitely-streaming pages. If your feature has an ambient typing indicator that pulses every second forever, wait_for_stable never declares stability and times out at 30 seconds. The fix is to scope the observer to the response region, or to assert on a deterministic affordance like the Stop button cooling down, rather than on quiescence of the entire body.
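
If you hand-roll the same wait for such a page, the scoped variant looks like this. It is a fragment meant to live inside a Playwright test body, and the selector is an assumption about your markup; wait_for_stable itself observes the whole body.

// Hand-rolled variant for pages that never go fully quiet: watch only the
// response region instead of document.body. The selector is assumed markup.
await page.evaluate(() => {
  const region = document.querySelector('[role="region"][aria-label="Response"]');
  if (!region) throw new Error("response region not found");
  (window as any).__region_mutations = 0;
  new MutationObserver(() => { (window as any).__region_mutations++; })
    .observe(region, { childList: true, subtree: true, characterData: true });
});
// ...then poll (window as any).__region_mutations exactly as in the earlier sketch.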

Third, model regressions that pass your criteria but degrade quality. A test that says “at least three sentences” will not catch a model upgrade that turns coherent paragraphs into bulleted noise. Quality eval is a separate problem from integration testing; ship the assertions you can mechanise here and run a smaller, slower quality-eval suite on a different cadence.

Get a working AI-writing test suite on your repo by Friday

30 minutes. We pair on writing the first scenario.md against your real feature, run it end-to-end with the live model, and leave you with a green CI job that survives the next prompt-template change.

Common questions about testing AI writing features

Why do normal Playwright tests fail on AI writing features?

Three reasons that compound. First, the response time is variable. A short completion lands in 1.2 seconds; a 600-token essay can take 18. A fixed page.waitForTimeout(5000) either races the response or wastes thirteen seconds on every run. Second, the output is never byte-equal to the previous run. expect(locator).toHaveText('A short story about a robot...') will not match the new short story the model just generated. Third, the DOM mutates the entire time tokens are streaming in. Selectors that resolved at frame 0 are stale by frame 3, and any assertion taken mid-stream sees a half-finished sentence. The fix is not a smarter wait or a better selector library; it is a different shape of primitive at all three layers.

What is wait_for_stable and why does it matter for AI writing?

wait_for_stable is a tool registered at /Users/matthewdi/assrt-mcp/src/core/agent.ts line 186. Its description names the cases it was built for: chat AI responses, loading states, and other async DOM activity. The implementation at line 956 injects a MutationObserver onto document.body observing childList: true, subtree: true, characterData: true, polls window.__assrt_mutations every 500ms, and exits as soon as the mutation count has been stable for the requested window. Default is 30 seconds total budget with 2 seconds of stability required; both are capped (60 and 10 respectively, see lines 957 and 958). The result is that a 1.2-second response returns control after about 3 seconds and a 22-second response returns control after about 24, with the same code.

How do I write an assertion that does not depend on exact text?

Use the assert tool at agent.ts line 133 or pass a passCriteria string to assrt_test (server.ts line 343). Both take free-form natural language: description plus passed plus evidence for assert, and a free-text criteria string for passCriteria. The agent reads the page, makes a judgement, and reports whether the criterion was met with a quote of the relevant text as evidence. So your test asks for 'the response is at least three sentences long, mentions the word ocean, and ends with a question mark' rather than expect(text).toBe('Yes, ...'). The assert is a structured tool call, so the report still has a clean passed boolean and a citation, but the criterion itself is the kind of thing a human reviewer would say out loud.

Why accessibility-tree refs instead of CSS selectors for AI output?

Token streaming changes innerHTML on every frame. A test that relies on .response-text > p:nth-child(2) sees a different DOM tree at every snapshot during the stream. Assrt drives Playwright through @playwright/mcp, which exposes the page as an accessibility tree with stable refs like ref=e12 that map to roles and accessible names. The Submit button is still 'Submit' even when the page is mid-render. The streaming response container is still the region with role=region name=Response even when its descendants are being rewritten character by character. Once you have wait_for_stable to tell you the stream finished, you can take a fresh snapshot and assert on the now-static text, but the ref to that region was valid the whole time.

Does this only work for chat-style features, or also for AI completions in forms?

Both. The primitives do not assume a chat shape. They assume an action that triggers async DOM activity, a stability window after that activity ends, and a region whose final state needs to be judged. That fits a chat reply, a 'rewrite this paragraph' button on a Substack-style editor, an inline AI suggestion in a comment field, a generated subject line in an email composer, a code-completion ghost text in an IDE-style editor, or a generated alt-text on an image upload. The plan you write in /tmp/assrt/scenario.md changes per feature; the three primitives do not.

What does an actual #Case for an AI writing feature look like?

It is plain Markdown, three to five steps, no selectors. Example: '#Case 1: Generated draft is on-topic and at least three sentences. 1. Open /editor and click the Generate button. 2. Type the prompt: write a short blog intro about cold-water swimming. 3. Wait for the response area to stop changing. 4. Assert: the response area contains the word swimming, has at least three sentence-ending punctuation marks, and contains no Lorem ipsum.' Assrt's agent reads that, snapshots the page, finds the Generate button by name, types the prompt, calls wait_for_stable, snapshots again, and emits one assert call with passed plus a quote from the actual generated text as evidence. The full machinery sits in agent.ts; you write a plan a teammate could read.

How do I avoid testing the model itself instead of the feature?

Pin your assertions to product invariants, not output quality. Good criteria: the response area is non-empty, the loading spinner disappears before the response appears, the copy-to-clipboard button is enabled only after streaming finishes, the token counter increments while streaming, the Stop button is hot during the stream and cold after, the same prompt produces a response that satisfies the schema you ship to your users (length, language, profanity flag). Bad criteria: the response is well-written, the response is correct, the response is funny. The first set verifies the integration. The second set tries to certify the model and will flake on every run because language models are stochastic by design.

What about cost: do I really want to call a real LLM in CI?

You have two options and Assrt supports both. Option A, hit the live model. This catches integration regressions (broken streaming, dropped events, expired API keys, prompt-template drift) but burns tokens and adds variance. Cap it at one or two scenario passes per CI run on a tiny prompt and put the suite behind a label so you only pay for it when you ask. Option B, mock the model at the network boundary using a stub that returns a recorded transcript with the same Server-Sent Events timing. Wire the stub through your normal MSW or Pact setup; assrt_test does not care whether the bytes came from OpenAI or from your fixture, only that the DOM eventually stabilizes and the criteria match. Most teams run mocked tests on every commit and a tiny live suite on main.
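
Here is a minimal sketch of Option B using MSW v2, replaying a recorded transcript as Server-Sent Events with realistic pacing. The /api/generate path and the chunk contents are assumptions about your app; the Node interceptor shown here fits when your own server makes the model call, and the browser worker is the equivalent for client-side calls.

// Stub the model at the network boundary with MSW, streaming a recorded
// transcript as SSE. The endpoint path and chunks are assumptions about your app.
import { http, HttpResponse } from "msw";
import { setupServer } from "msw/node";

const recordedChunks = ["Cold-water ", "swimming wakes ", "up every nerve you own.", "[DONE]"];

export const server = setupServer(
  http.post("/api/generate", () => {
    const stream = new ReadableStream({
      async start(controller) {
        const encoder = new TextEncoder();
        for (const chunk of recordedChunks) {
          controller.enqueue(encoder.encode(`data: ${chunk}\n\n`));
          await new Promise((r) => setTimeout(r, 150)); // keep SSE-like pacing
        }
        controller.close();
      },
    });
    return new HttpResponse(stream, {
      headers: { "Content-Type": "text/event-stream" },
    });
  }),
);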

Where do the test artifacts live so I can debug a flaky AI writing test?

Three places. The test plan persists at /tmp/assrt/scenario.md so you can re-read or edit it. The last run's structured report is at /tmp/assrt/results/latest.json with passedCount, failedCount, per-assertion evidence, and screenshot paths. Every run also captures a video, and assrt_test returns a videoPlayerUrl pointing to a local HTML5 player on port 8081 with 1x to 10x playback and frame seeking. For a stream that looks fine but fails an assertion, the video plus the JSON evidence usually tells you whether the model produced bad output (model problem) or whether your criteria were too tight (test problem) within about thirty seconds of opening the player.
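
A quick triage sketch for that JSON. passedCount and failedCount are named above; the assertions array and its field names are assumptions, so adjust to the shape you actually see in latest.json.

// Print the failed assertions from the last run. Field names beyond
// passedCount/failedCount are assumptions about the report shape.
import { readFileSync } from "node:fs";

const report = JSON.parse(readFileSync("/tmp/assrt/results/latest.json", "utf8"));
console.log(`${report.passedCount} passed, ${report.failedCount} failed`);
for (const a of report.assertions ?? []) {
  if (!a.passed) console.log(`FAIL: ${a.description}\n  evidence: ${a.evidence}`);
}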
