When the primitive is wrong

AI output snapshot regression testing: why the literal-byte snapshot is the wrong primitive, and what to do instead

Snapshot tests pin a literal string. Your LLM feature re-rolls that string on every run. The two are incompatible by construction, and the team-wide answer of "just update the snapshot" is how silent regressions ship to users. This guide walks through why the primitive is wrong, what replaces it, and the exact regex Assrt uses to make the replacement enforceable.

Matthew Diakonov, Written with AI

Published May 21, 202612 min read

Direct answer (verified 2026-05-21)

AI output snapshot regression testing is regression testing applied to LLM-generated content in a web app. The classical primitive (a saved literal string compared byte-for-byte) does not work because LLM outputs vary across runs, model versions, and providers. The working replacement is a list of semantic verify-bullets evaluated against the live DOM, plus a coverage gate that fails the run if any bullet has no matching assertion. In Assrt the bullets live in /tmp/assrt/scenario.md, the coverage gate is a regex (/^[\s\-*\d.()]*(?:Verify|Check|Assert|Confirm|Ensure)\b[^\n]*/gim) plus a two-keyword overlap rule, and streaming responses are stabilised by wait_for_stable before any assertion runs.

Source: src/core/agent.ts lines 1141 to 1175 and src/core/scenario-files.ts in the open-source assrt-ai/assrt-mcp repo.

The thesis: literal snapshots assume determinism your AI feature does not have

Snapshot testing was invented for deterministic functions. A pure function with no I/O returns the same value every time you call it with the same input, so saving that value and diffing against it on the next run is a fair test. The diff is loud when the function changes, silent when it does not, and the engineer running CI can read the change line by line.

None of those assumptions hold for an LLM call. Sampling temperature greater than zero re-rolls the output. Sampling at temperature zero still re-rolls because providers re-tokenize, swap model versions, and tune safety filters between deploys. The literal text drifts. The diff is red on run two even when nothing in your code changed, and red on run three because someone re-trained the model. Engineers learn to update the snapshot reflexively. The test becomes a write-once log of whatever the model said most recently, and the regression it was supposed to catch (a quietly broken citation pipeline, a chatbot that stopped saying the user's name) lands in production unnoticed.

This is not a problem with how snapshots are configured. It is a problem with the primitive itself. The thing being pinned (a stochastic string) is not the kind of thing that can be pinned by literal comparison. The only options are to pick a different thing to pin, or to pick a different comparison. AI output snapshot regression testing is the practice of doing both.

Where literal snapshots fail on AI output

Temperature > 0 sampling changes the literal output on every run, so the snapshot diff is red on run two.
Even at temperature 0, the provider re-tokenizes, swaps model versions silently, and tweaks safety filters; the literal output drifts across days.
Application chrome (timestamps, citation IDs, streamed token order) shows up in the snapshot before the model output does, so failures land on the wrapper, not the answer.
Updating the snapshot becomes a habit. The team accepts the new output without reading it, and the test stops catching real regressions.
Multi-line outputs make diffs unreadable in CI logs; engineers rubber-stamp updates instead of grading what changed.
The literal-snapshot file grows to thousands of lines and starts behaving like a write-once log, not a test.

What replaces the literal snapshot: verify-bullets in plain markdown

The Assrt replacement primitive is a list of one-line semantic properties in a plain markdown file. Each property is a thing the AI output must satisfy. The properties are written by the human who knows what the feature is supposed to do, not by the model. They live in /tmp/assrt/scenario.md. The file is editable, gittable, greppable, and diffable in your normal editor. An fs.watch loop syncs every save back to central storage on a one-second debounce so a team can collaborate on the same scenario without copy-pasting strings around.

A scenario looks like this. The lines starting with Verify are the bullets that get enforced.

#Case 1: Chatbot answers a real question

Steps:
- Open the dashboard at /chat
- Type "What is my last invoice amount?"
- Press Enter
- Wait for the assistant message to finish streaming

- Verify the assistant message mentions the dollar amount from invoice 4821
- Verify the assistant message does not include any URL outside our domain
- Verify the assistant message renders inside the .assistant-message container
- Verify the response completes within 6 seconds of the user message
- Verify no other invoices are mentioned in the same reply

Five bullets, five properties. Run after run, the literal text the model produces is allowed to drift. The properties are what get checked, and the properties are not stochastic: either the dollar amount from invoice 4821 is mentioned or it is not.

What happens when the agent runs a scenario

The coverage gate: a regex and a two-keyword rule

A verify-bullet list is only useful if someone proves every bullet was actually checked. Otherwise it degrades into a slightly fancier comment. The piece that makes the pattern enforceable lives in src/core/agent.ts of the open-source assrt-mcp repo, around line 1141. It is a regex and a coverage rule. The regex extracts every line that starts with one of five verbs:

// agent.ts:1141
const verifyBulletRegex = /^[\s\-*\d.()]*(?:Verify|Check|Assert|Confirm|Ensure)\b[^\n]*/gim;

const verifyBullets = (scenarioSteps.match(verifyBulletRegex) || [])
  .map((s) => s.replace(/^[\s\-*\d.()]+/, "").trim())
  .filter((s) => s.length > 0);

// agent.ts:1146 -- normalize a bullet to its content keywords
const normalize = (s) =>
  s.toLowerCase()
   .replace(/[^a-z0-9 ]+/g, " ")
   .split(/\s+/)
   .filter((w) => w.length > 3);

// agent.ts:1149 -- a bullet is covered if some assert description shares
// at least 2 of its content keywords (or all of them when the bullet is short)
for (const bullet of verifyBullets) {
  const bulletKeywords = normalize(bullet);
  if (bulletKeywords.length === 0) continue;
  const covered = assertionDescriptions.some((desc) => {
    const matched = bulletKeywords.filter((kw) => desc.includes(kw)).length;
    return matched >= Math.min(2, bulletKeywords.length);
  });
  if (!covered) droppedAssertions.push(bullet);
}

if (droppedAssertions.length > 0) scenarioPassed = false;

That is the whole contract. Every line in the scenario that starts with Verify, Check, Assert, Confirm, or Ensure (with any leading whitespace, bullet character, or number) is an assertion the agent owes. The agent produces one assert tool call per bullet. After the run, the coverage gate normalises each bullet to its words longer than three characters, then looks for any assert description that contains at least two of those words. Misses get logged into droppedAssertions and the scenario fails.

The rule is intentionally loose on phrasing (the agent can rephrase a bullet, and the keyword match still finds the connection) and intentionally strict on coverage (every bullet has to be accounted for, no exceptions). Loose phrasing keeps the agent from being penalised for using synonyms; strict coverage keeps the agent from quietly skipping the one bullet that would have caught the regression.

2 keywords

“One bullet, one assert. Anything uncovered fails the scenario, with the dropped bullet logged so a human can read which property the model declined to grade.”

src/core/agent.ts, lines 1141 to 1163

Streaming responses, wait_for_stable, and the assertion timing problem

Snapshot tests do not have a timing problem. The function returns and the snapshot is checked. AI output regression tests do. If the agent fires an assertion against the assistant message while tokens are still streaming in, the assertion sees a partial response. The next run sees a different partial response. The test is flaky, the bullet is real, but the timing layer is broken.

Assrt's answer is a tool named wait_for_stable defined at src/core/agent.ts lines 186 to 195. It polls the DOM and returns once no mutations have happened for a configurable stable period (default two seconds), up to a configurable timeout (default thirty). The agent prompt at src/core/agent.ts lines 249 to 254 explicitly tells the model to call wait_for_stable after triggering any AI response, before any assertion runs. The model does not get to skip this step; if it does, the assertion sees half a token and the run fails the bullet, which surfaces to the engineer in the dropped-assertion log.

The timing dance for one assertion on a streaming AI response

User input fires

Agent triggers AI feature

wait_for_stable (2s quiescence)

snapshot the DOM

assert against verify-bullet

✅

coverage gate marks bullet covered

A worked example: catching a quietly broken citation pipeline

Imagine your support chatbot is supposed to cite at least one document from the user's knowledge base on every answer. The classical snapshot test would pin the literal answer, fail on every run that produces different phrasing, and get updated until nobody reads the diff. Then your retrieval pipeline silently breaks. The model still produces fluent answers, just without citations. The snapshot updates accept the new (non-citing) text. The regression ships.

The verify-bullet equivalent has one line in scenario.md:

- Verify the response contains at least one citation chip with class .citation

The agent runs the scenario, waits for the message to stabilise, snapshots the DOM, and asks: does the rendered response contain an element matching .citation? If yes, fire assert(passed=true, evidence="found 2 citations: invoice-4821, contract-9.pdf"). If no, fire assert(passed=false, evidence="no .citation elements in .assistant-message after wait_for_stable"). Either way, the bullet is covered, the coverage gate passes that bullet, and the test fails or passes on the actual property the human cared about.

The model is now allowed to phrase its answer any way it likes between runs. The literal text can drift. The thing being graded is not the literal text. The thing being graded is the structural property the engineer wrote down, and that property is checkable on any run because it depends on the DOM, not on a model sample.

Jest-style snapshot vs verify-bullet for the same AI feature

expect(rendered).toMatchSnapshot(). One literal string saved on first run. The string is the entire AI response: greeting, body, citations, sign-off, timestamp. Any change anywhere (a comma, a tokenizer swap, a re-roll) reds the test. The team learns to run jest -u and accepts whatever the model said this time. The 'pin' is functionally a write-once log.

Pins a literal string
Fails on every model re-roll
Updates accepted reflexively
Real regressions hide inside accepted updates

The counterargument: when literal snapshots still earn their keep

None of the above means literal snapshots are useless. They earn their keep on the deterministic surfaces around the AI feature. The JSON shape of the request body your front end posts to the LLM provider, the schema of the response after your code parses and normalises it, the structural skeleton of the UI before tokens stream in, the email template you generate after the model returns: all of those are deterministic and a literal-byte snapshot is the right tool for each of them. The mistake is pointing a literal snapshot at the part of the surface that is by construction non-deterministic.

A team that does both is the team that catches the most regressions. Literal snapshots on the deterministic edges. Verify-bullets on the model output itself. Pixel diffs on the rendered UI for the things that have to look right. Three layers, each catching what the other two miss, none of them carrying the weight that is wrong for them to carry.

Where the verify-bullet pattern is still wrong (be honest about it)

The two-keyword overlap rule has a quiet failure mode. A bullet that uses uncommon vocabulary the agent does not echo can be marked uncovered even though the agent did the work. The fix is to write bullets in the vocabulary you expect the agent to produce, which is a small editorial discipline, not a structural change. You learn the shape after one or two failing runs and the bullets get tighter.

A bigger failure mode: the agent only sees what is rendered in the DOM. If your front end strips a citation array before rendering, or summarises a structured response into a paragraph, the verify-bullet on the original structure cannot be checked. The fix is a debug payload: a hidden div on the page that renders the raw model response when a test cookie is set. The bullet then points at the debug payload, not at the prettified UI. Most teams resist this until they have shipped one regression and then add it the next sprint.

The last failure mode is the one no test framework solves: an LLM-grader that confidently fakes an assertion. The mitigation is the evidence field on every assert call (defined at agent.ts line 140), which is logged into /tmp/assrt/results/<runId>.json. A human reviewing the artifacts can spot a passed assertion whose evidence does not actually show the property holding. This is a manual safety net; it is also the only one that catches a sufficiently motivated grader.

Where this pattern lives in code (so you can read it yourself)

None of this is a vendor abstraction. The repo is open source. The pieces sit in three short files of the assrt-mcp repository:

src/core/agent.ts lines 1141 to 1175: the verifyBulletRegex, the normalize helper, and the coverage check that marks the scenario failed when a bullet has no matching assertion.
src/core/agent.ts lines 186 to 195: the wait_for_stable tool definition, with the two-second default quiescence and the thirty-second default timeout the agent uses on streaming AI output.
src/core/scenario-files.ts lines 16 to 20, 42 to 48, and 90 to 111: the file paths (/tmp/assrt/scenario.md, /tmp/assrt/scenario.json, /tmp/assrt/results/<runId>.json) and the fs.watch loop that syncs your local edits back to central storage on a one-second debounce.

The whole regression-detection contract for AI output sits in those three files. The total surface is about three hundred lines of TypeScript. If you do not want to use Assrt, you can lift the pattern: a markdown file with verify-bullets, a regex on those bullets, a coverage check on the assertion list, a stable-DOM wait helper, and an evidence field on every assertion. The shape is what does the work, not the framework.

Want to walk through verify-bullets on your own AI feature?

Show me the AI feature in your app you wish you could regression-test without breaking on every model re-roll. I will walk through the bullets I would write for it, the wait_for_stable hook, and how the coverage gate would catch the silent regressions snapshot tests miss.

AI output snapshot regression testing FAQ

What is AI output snapshot regression testing?

It is the practice of pinning the output of an AI feature in your app (a chat reply, a generated summary, a model-produced label, a structured JSON returned by an LLM call) and re-checking on later runs that the output has not drifted in a way that breaks the user experience. The catch is that classical snapshot testing pins a literal string, and LLM outputs vary across runs even when the prompt and model are unchanged, so the literal-byte primitive produces false failures on the second run. AI output snapshot regression testing is the renaming of the same goal (catch drift) with a different mechanism (semantic checks, not literal diffs).

Why does toMatchSnapshot() not work for LLM responses?

Three reasons. First, temperature greater than zero means the model samples a different completion every time, so the literal text changes. Second, even at temperature zero, providers re-tokenize, swap models silently behind the same model name, and ship safety filters that change phrasing without notice, so the literal text still changes across days. Third, the application layer almost always adds non-deterministic chrome (timestamps, citation IDs, ordering of streamed tokens) that the snapshot diff catches first, so the test fails on the wrapper, not the model output. The result is a green-yellow-red cycle where the snapshot is right twice and wrong thirty times.

What replaces literal snapshots for AI output?

Semantic verify-bullets. Instead of pinning the output to a string, you pin it to a list of properties the output must satisfy: 'mentions the user by name', 'cites at least one of the documents we provided', 'does not contain the word price', 'renders inside the .assistant-message container within 6 seconds'. Each property is a one-liner. A test runner evaluates each property against the live DOM (or the parsed response) by reading the rendered output and asking a model whether the property holds. The pass criterion is that every bullet was checked, not that the output matched a saved string.

How does Assrt enforce that every verify-bullet was actually checked?

Via a regex applied to scenario.md and a coverage rule applied to the agent's assertion list. The regex /^[\s\-*\d.()]*(?:Verify|Check|Assert|Confirm|Ensure)\b[^\n]*/gim extracts every line in the scenario that starts with one of those five verbs. The coverage rule then walks each extracted bullet, normalizes it to its content words (lowercased, alphanumeric only, words longer than three characters), and looks for at least one agent-produced assert call whose description contains at least two of those keywords. If any bullet is uncovered, the scenario is marked failed and the dropped bullet is logged. The check lives in src/core/agent.ts lines 1141 to 1158 of the open-source assrt-mcp repo.

Why a regex on scenario.md instead of a vendor DSL?

Two reasons. The first is that the scenario is plain markdown on disk at /tmp/assrt/scenario.md, so you can grep it, diff it, edit it in your editor, and the fs.watch loop syncs your edits back to the cloud copy in one second. A vendor DSL would mean the verify list lives in a database row you cannot diff. The second reason is that the regex makes the coverage contract legible: anyone reading the agent's source can predict which lines will be counted as assertions. A custom parser would hide that contract behind a parser tree, and you would not know if a line was being skipped until your test passed on a missing check.

Why are bullets matched on at least two keywords longer than three characters?

Because anything looser produces false positives (a verify bullet 'check the form' matches any assertion that contains the word 'form') and anything stricter produces false negatives (the agent rephrases 'submit button is enabled' as 'submit CTA is clickable' and the bullet would be considered uncovered). Two distinct content words is the empirically right floor. Short words (articles, prepositions, and verbs like 'show' or 'see') are filtered out before matching, which keeps the match focused on the nouns and modifiers that actually describe what the bullet is about. The exact normalization is in normalize() at agent.ts:1146.

What about streaming AI output? Snapshots taken too early would catch a partial response.

Assrt ships a wait_for_stable tool that polls the DOM and returns once no mutations have happened in N seconds (default 2 seconds of quiescence, 30 second timeout). The agent prompt explicitly tells the model to call wait_for_stable after triggering any AI response, and the assert calls come after that. That sidesteps the classic flaky-streaming bug where the test runs while the response is still being typed in, the assertion fires against half a token, and the next run sees a different half-token. The implementation is in src/core/browser.ts and the tool definition is in src/core/agent.ts lines 186 to 195.

Is this not just LLM-as-judge with extra steps?

It uses an LLM to evaluate each verify-bullet, which is the same primitive LLM-as-judge frameworks use. The difference is the coverage gate. LLM-as-judge frameworks usually let the judge model decide which properties it wants to grade, which is exactly the wrong control flow: if the judge skips the one property that actually matters, the test passes for the wrong reason. The verify-bullet coverage gate inverts that. The human writes the properties in scenario.md; the agent is told one bullet equals one assert call; a regex-driven check at the end fails the run if any bullet is unaccounted for. The model is grading; the human is auditing the grader.

What about visual regression on the rendered AI output? Pixel diffs would still catch some drift.

Visual regression catches drift that is visible in pixels but cannot catch drift that is hidden in the text (a chatbot that quietly stopped citing sources, a summary that swapped a metric for the wrong one, a JSON response that flipped a boolean). For AI output, pixel diffs are a useful but secondary check. The primary check should be on the text and structure: does the response say the things it must say, does it not say the things it must not say, does it render in the expected container within the expected latency. Pixel diffs sit on top of that as a guard against unintended UI changes that the text-level checks would miss.

Can the verify-bullet pattern catch hallucinations?

Yes, if the bullet is written as a negative property and the source of truth is available to the agent. Example: 'Verify the AI response does not mention any URL that is not in the document set we provided'. The agent reads the rendered response, extracts URLs, and checks each one against the document set (which can be loaded via http_request). If a fabricated URL is in the response, the assert call fires with passed=false and the scenario fails. The pattern is not magic, it is the same shape as any other domain check, but it does mean that hallucination tests live in the same file as the rest of your end-to-end tests and run on every push.

Where does the verify-bullet pattern actually fall down?

Three places. First, the coverage check looks for two-keyword overlap, so a bullet that uses uncommon vocabulary the agent does not echo can be marked uncovered even when it was substantively checked; the fix is to write bullets in the same vocabulary the model produces. Second, the agent only sees what is rendered in the DOM, so anything the front end strips or summarizes is invisible; the fix is to expose a debug payload in the DOM (a hidden div with the raw response) when testing. Third, an agent that confidently fakes an assertion is hard to catch; the mitigation is the evidence field on every assert call, which the human reviews in the per-run JSON results file at /tmp/assrt/results/<runId>.json.

Where in the Assrt repo can I read this code myself?

The coverage check is in src/core/agent.ts of the assrt-mcp repository, lines 1141 to 1175, around the verifyBulletRegex constant and the droppedAssertions accumulator. The scenario file paths and the fs.watch sync loop are in src/core/scenario-files.ts. The wait_for_stable tool definition lives in src/core/agent.ts around lines 186 to 195. All three are open source under the project's repo at github.com/assrt-ai/assrt-mcp.

Related guides on AI testing

Regression

AI-generated regression tests, the file the vendors will not show you

Where the generated test actually lives on disk, why scenario.md plus fs.watch beats a database row, and what survives when you switch tools.

Read

QA platforms

Self-healing vs regression detection: do not confuse them

Self-healing is about not failing on irrelevant change. Regression detection is about failing on relevant change. Conflating them is how teams ship broken AI features green.

Read

Visual regression

AI visual regression testing without the snapshot graveyard

Pixel diffs are useful but blind to text drift. How to layer visual regression on top of semantic verify-bullets so each catches what the other misses.

Read

AI output snapshot regression testing: why the literal-byte snapshot is the wrong primitive, and what to do instead

The thesis: literal snapshots assume determinism your AI feature does not have

What replaces the literal snapshot: verify-bullets in plain markdown

The coverage gate: a regex and a two-keyword rule

Streaming responses, wait_for_stable, and the assertion timing problem

A worked example: catching a quietly broken citation pipeline

Jest-style snapshot vs verify-bullet for the same AI feature

The counterargument: when literal snapshots still earn their keep

Where the verify-bullet pattern is still wrong (be honest about it)

Where this pattern lives in code (so you can read it yourself)

Want to walk through verify-bullets on your own AI feature?

AI output snapshot regression testing FAQ

Related guides on AI testing

AI-generated regression tests, the file the vendors will not show you

Self-healing vs regression detection: do not confuse them

AI visual regression testing without the snapshot graveyard

Comments (••)

Comments ()