Find AI fallback bugs end to end without mocking the model
The interesting bugs in AI features almost never live in the prompt or the parser. They live in the wiring between the model response and the UI, where a thrown exception flips isLoading back to false but forgets to set the error state, the spinner spins forever on a silent timeout, or the fallback message gates on the wrong flag and never paints. None of this is visible to a unit test. None of it is visible to a mocked integration test. You only catch it by driving a real browser through a real failure and asserting on what the user actually sees.
Most articles about this are eval-harness pieces: golden sets, prompt unit tests, RAG faithfulness scoring. Those are necessary and they cover the happy path. This is the unhappy path: a MutationObserver that watches the DOM through the failure transition, an HTTP fault injector you flip from the agent itself, a four tool loop that records what the user got. Every claim points at a file path and line number in the open source Assrt agent. The MutationObserver is eight lines.
The class of bug we are after
AI features fail differently from CRUD apps. The model can return 200 with garbage. It can take 90 seconds to answer a question that should take three. It can refuse politely, and your code can render the refusal as content. It can stop streaming halfway through. Every one of those outcomes leaves the application in a UI state that the happy path never tested, because the happy path returns clean structured output every time. The five recurring shapes look like this.
Silent timeout
Provider never replies. The application has no upper bound on the model call. The spinner spins forever and no error message paints. Caught by wait_for_stable timing out at 60s plus a negative assert that no progressbar is left in the DOM.
Malformed JSON
Provider returns 200 with truncated content. The parser throws inside an effect. isLoading goes false, the error state stays at its initial value, the message container renders empty. Caught by an assert that the assistant turn contains a non-empty fallback message.
Rate limit (429)
Provider returns 429 with a retry-after header. The application either ignores the header and retries immediately, or sets the wrong copy on the user-facing message. Caught by an assert that the visible text mentions wait or retry, paired with an http_request to confirm the request rate respects the header.
Content filter rejection
Provider returns a structured refusal. The application's success path renders the refusal as if it were content, leaking provider-specific phrasing into your UI. Caught by an assert that the visible text does not contain provider tell-tales like "As an AI language model" or "I cannot generate".
Partial response then stall
Streaming starts, a few tokens arrive, the connection drops. The success UI renders the half message and the spinner is left active alongside it. Caught by a positive assert on the partial content plus a negative assert that the spinner is gone.
These are not theoretical. Every team that has shipped a model-backed feature has lived through at least three of them. Most of the resulting bug reports read like "the screen just sat there" or "the spinner never stopped" or "the message was empty". The user is not lying and your unit tests are not lying; the bug is in a layer none of those tests cover.
Why end to end is the right level
Consider what a unit test for the parser checks: given a malformed JSON string, the parser throws. That is true and it is unhelpful. The bug is not that the parser throws. The bug is that the catch handler one layer up sets isLoading=false and then returns without updating any user-visible state. The unit test for the catch handler would have to reach into a render context, simulate the parser throw, and assert on a virtual DOM that approximates the real one. That assertion is fragile and it is testing your test renderer as much as your code.
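The shape of that bug, stripped to its essentials, is small enough to show. A minimal, framework-free sketch (the state object and render function are illustrative, not from any real codebase):

```javascript
// Minimal reproduction of the bug class: the catch handler clears the
// loading flag but never sets an error, so the render path paints an
// empty container with no spinner and no message.
const state = { isLoading: false, error: null, message: null };

function render() {
  if (state.isLoading) return '<spinner/>';
  if (state.error) return `<alert>${state.error}</alert>`;
  return `<message>${state.message ?? ''}</message>`; // empty when message is null
}

async function askModel(fetchModel) {
  state.isLoading = true;
  try {
    const res = await fetchModel();
    state.message = JSON.parse(res).content;
  } catch (e) {
    // BUG: no `state.error = ...` here; the UI has nothing to paint
  } finally {
    state.isLoading = false;
  }
}

// Drive it with a malformed 200 body, then inspect what the UI would paint:
askModel(async () => '{"content": trunc').then(() => {
  console.log(render()); // "<message></message>" — loading cleared, no error shown
});
```

The parser unit test passes (JSON.parse correctly throws), yet the user sees a dead, empty container. Only an assertion on the rendered page catches it.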
End to end skips that whole layer. Drive a real Chrome, point it at a real failure, watch the actual rendered page settle, read the accessibility tree, assert on text and roles. The accessibility tree is not the DOM and it is not a screenshot; it is the same data a screen reader would announce, which is the most truthful representation of what the application is currently saying to a user. If the tree contains no error message and no fallback text, that bug is real for everyone, not just for users with assistive tech.
Where the fallback transition lives in the agent surface
The Assrt agent has 18 fixed tools. Four of them carry the load for fallback bug detection: wait_for_stable, http_request, snapshot, and assert. The other 14 (click, type_text, navigate, evaluate, screenshot, and so on) are conveniences for the user flow that triggers the model call. The four below are what makes the failure observable.
The four tool loop the agent runs through every fallback scenario
wait_for_stable
agent.ts:186-195 (definition), 956-1008 (runtime). Injects a MutationObserver on document.body, polls window.__assrt_mutations every 500ms, breaks once the count is unchanged for stable_seconds (default 2, max 10) within timeout_seconds (default 30, max 60). The only tool in the surface that reasons about change rate.
http_request
agent.ts:172-184, runtime at 925-955. Plain fetch with a 30 second AbortController timeout. Returns status, statusText, and up to 4000 bytes of body. The agent uses it to flip the fault injector, verify a backend log, or hit an external API to confirm a side effect.
assert
agent.ts:133-144. Description, passed boolean, evidence string. Each call appends to the scenario's assertion record, which lands in the final TestReport with screenshots attached. Negative asserts (no spinner, no stack trace) catch inconsistent fallback states.
snapshot
agent.ts:27-30. Reads the live accessibility tree of the page. Not a DOM dump, not a screenshot, not cached. Roles and names for every interactive element plus visible text. The basis for every assert that follows wait_for_stable.
Why these four are sufficient
wait_for_stable observes the failure transition. http_request triggers the failure. snapshot tells you what landed. assert records what you expected. The remaining 14 tools are conveniences for driving the user flow; the fallback bug detection loop is built from these four.
wait_for_stable, the load bearing tool
Other waiting primitives ask "is this thing true yet?" (text present, element visible, response received). AI fallback paths violate the assumption that you know which thing to wait for. The success element might never render. The text you would key on might never paint. A wait_for_response fires the moment the 200 lands, even when the body is malformed. wait_for_stable answers a different question: has the page stopped changing? It does not care which UI state you land in.
The implementation is eight lines of injected JavaScript. A MutationObserver on document.body with childList: true, subtree: true, characterData: true increments a counter on every mutation. The Node side polls the counter every 500ms. As soon as the count is unchanged for stable_seconds (default 2, capped at 10) within timeout_seconds (default 30, capped at 60), the loop breaks and the observer is cleaned up so the next scenario does not see a stale instance.
That is the entire mechanism. Eight lines of injected browser code, one polling loop, one cleanup. It is committed in the public repo. You can lift it into a Playwright helper if you would rather not run the agent at all.
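Lifted out of the agent, the mechanism can be sketched like this. The browser-side snippet is held as a string the way injected code would be, and the polling loop is written against a generic counter reader so it runs anywhere. Beyond `window.__assrt_mutations`, the names are illustrative, not the agent's actual identifiers:

```javascript
// Browser-side half: the observer that counts mutations. In a Playwright
// helper this string would go through page.evaluate() or addInitScript().
const observerSnippet = `
  window.__assrt_mutations = 0;
  const obs = new MutationObserver(() => { window.__assrt_mutations++; });
  obs.observe(document.body, { childList: true, subtree: true, characterData: true });
  window.__assrt_cleanup = () => obs.disconnect();
`;

// Node-side half: poll the counter until it is unchanged for stableMs,
// or give up after timeoutMs. readCount abstracts over page.evaluate().
async function waitForStable(readCount, { stableMs = 2000, timeoutMs = 30000, pollMs = 500 } = {}) {
  const start = Date.now();
  let last = await readCount();
  let stableSince = Date.now();
  while (Date.now() - start < timeoutMs) {
    await new Promise((r) => setTimeout(r, pollMs));
    const current = await readCount();
    if (current !== last) {
      last = current;            // page is still changing
      stableSince = Date.now();  // reset the stability clock
    } else if (Date.now() - stableSince >= stableMs) {
      return { settled: true, mutations: current };
    }
  }
  return { settled: false, mutations: last }; // wedged: never stopped changing
}
```

In a Playwright suite, readCount would be something like `() => page.evaluate('window.__assrt_mutations')` after evaluating observerSnippet once per navigation. The `{ settled: false }` branch is what turns the silent-timeout failure shape into evidence instead of a hang.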
http_request, the fault injector hook
You cannot find a fallback bug without triggering the fallback. There are three honest ways to do that without mocking inside your application. First, point the test at a development environment whose AI provider key has been revoked or rate-limited on purpose. Second, set AI_PROVIDER_BASE_URL to a small fault injector you run alongside your dev server. Third, use the agent's http_request tool to flip a flag on that injector before each scenario. The third option keeps the test self-contained.
The runtime case is short. Plain fetch with a 30 second AbortController timeout, custom headers merged on top of the default Content-Type, and the first 4000 bytes of the response returned to the agent. That is enough to flip the injector, verify a backend log was written, or check an external API confirms a side effect. None of it is mocked; the application makes its real call to whatever URL it makes.
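That runtime shape can be sketched in a few lines, assuming Node 18+ for the global fetch (the function and parameter names here are illustrative, not the agent's actual code at agent.ts:925-955):

```javascript
// Sketch of the http_request runtime: plain fetch, a 30 second
// AbortController bound, custom headers merged over the default
// Content-Type, and the response body capped at 4000 characters.
async function httpRequest(url, { method = 'GET', headers = {}, body } = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 30_000); // 30s upper bound
  try {
    const res = await fetch(url, {
      method,
      headers: { 'Content-Type': 'application/json', ...headers },
      body,
      signal: controller.signal,
    });
    const text = await res.text();
    return { status: res.status, statusText: res.statusText, body: text.slice(0, 4000) };
  } finally {
    clearTimeout(timer); // always clean up, success or abort
  }
}
```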
A 30 line fault injector that covers most of the failure modes
The injector does not have to be sophisticated. It needs three things: a way to flip its mode (the /__inject endpoint), a few canned failure modes (timeout, malformed JSON, 429, optionally content filter), and a passthrough to the real provider in ok mode. A node script with no dependencies fits in 30 lines.
Set AI_PROVIDER_BASE_URL=http://localhost:4000 in your dev environment and your application talks to the injector instead of the real provider. The agent flips the mode through http_request at the start of each scenario and clears it at the end. The injector itself is the only piece that lives outside the scenario file.
End to end, in one sequence
Here is the full data flow for a single fallback scenario. The agent flips the injector, drives the user flow, waits for the page to settle, reads the accessibility tree, asserts on text and roles, then clears the injector. No Playwright code in your repo, no mocked model, no internal state assertions.
Timeout fallback, end to end, no mocked model
The six step loop, with what each step is responsible for
Six numbered steps, one per agent action. Each one fires a single tool call from the bounded surface. The agent does not write Playwright; it picks tools from the schema and the schema rejects anything else.
Flip the fault injector
http_request POST to your fault injector with the mode you want to exercise: timeout, malformed_json, rate_limited, content_filter, partial. The agent runs this before navigating, so the very first request your AI feature makes is already broken in the way you intend.
Drive the user flow that triggers a model call
navigate, type_text, click. The agent uses the accessibility tree refs from snapshot, not CSS selectors. The application code path is unchanged. Every line of code that ships in production runs during the scenario.
wait_for_stable while the failure resolves
MutationObserver counts every childList, subtree, and characterData mutation on document.body. The agent breaks out of the loop the instant the count has been unchanged for stable_seconds, or returns a timeout result if the page never settles. This is the load bearing tool. Definition at agent.ts:186-195, runtime at agent.ts:956-1008.
snapshot to read the rendered fallback UI
Accessibility tree, not the DOM. Roles, accessible names, focus state, ARIA attributes. This is what a real user sees if they have a screen reader, and it is the most truthful representation of what your fallback UI is communicating. The agent reads it cold each scenario; nothing is cached.
assert on what is and is not in the tree
The agent records description, passed, evidence triplets via the assert tool (agent.ts:133-144). The useful asserts here are positive (the error message is present, the retry button is enabled) and negative (the spinner is gone, no leftover progressbar role, no raw JSON visible). Negative asserts catch the inconsistent state bug other approaches miss.
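As a concrete shape, the recorded triplets for the timeout case might read like this. Only the three fields the article names are assumed; the values are illustrative, not agent output:

```javascript
// Illustrative assertion records: one positive, one negative. The negative
// assert is the one that catches the inconsistent-state bug.
const assertions = [
  {
    description: 'Retry UI is present with an actionable message',
    passed: true,
    evidence: 'Snapshot shows alert role with text "The model timed out. Try again." and a button labeled Retry',
  },
  {
    description: 'No progressbar role left in the DOM after settling',
    passed: true,
    evidence: 'Accessibility tree contains zero nodes with role=progressbar',
  },
];
console.log(assertions.every((a) => typeof a.passed === 'boolean')); // true
```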
Clear the injector and complete the scenario
One more http_request to put the injector back in pass-through mode. complete_scenario records the outcome with the asserts attached. The video file lands in the run directory. The next scenario in the same #Case file inherits a clean injector.
What you actually write
The scenario file is plain Markdown. Two cases below: the timeout path and the malformed JSON path. The shape is identical; only the inject body and the asserted text change. A new contributor can read both cases on day one and reason about what they cover without opening a single TypeScript file.
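A sketch of what those two cases can look like, following the #Case shape the FAQ quotes; the injector URL, the /chat route, and the asserted copy are assumptions about your application:

```markdown
#Case 1: Model timeout shows the retry UI with an actionable message
- http_request POST to http://localhost:4000/__inject with body {"mode":"timeout"}
- Navigate to /chat, type "Summarize the doc" into the message input, click Send
- wait_for_stable with timeout 60 and stable 3
- Assert the page contains text matching try again, retry, or timed out
- Assert no progressbar role is left in the DOM
- http_request POST to http://localhost:4000/__inject with body {"mode":"ok"}

#Case 2: Malformed JSON renders a human readable fallback, not an empty turn
- http_request POST to http://localhost:4000/__inject with body {"mode":"malformed_json"}
- Navigate to /chat, type "Summarize the doc" into the message input, click Send
- wait_for_stable with timeout 60 and stable 3
- Assert the assistant turn contains a non-empty fallback message
- Assert no raw JSON or stack trace is visible on the page
- http_request POST to http://localhost:4000/__inject with body {"mode":"ok"}
```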
What it looks like running, including a real failure
Here is the transcript of running the two cases against a feature with a real bug in the malformed JSON path. The timeout case settles at 47.2 seconds (the application correctly bounded the wait at 45 plus paint time) and the asserts pass. The malformed JSON case fails on the assistant turn assertion: the parser throws, isLoading goes false, the container renders empty. The video file shows the dead state for as long as a teammate cares to watch.
Side by side against what you would otherwise write
Six rows that fit on one screen. The left column is the shape of a typical Playwright spec for this kind of test, with the helpers, fixtures, and mock setup that grow per failure mode. The right column is the same coverage expressed as agent primitives. The trade-off is real and worth stating: the spec gives you full Playwright API access, including custom fixtures and storage state juggling. The agent gives you a bounded surface that handles the common path with little maintenance.
| Feature | Typical Playwright spec with mocks | Assrt agent (Markdown #Case + fault injector) |
|---|---|---|
| What you wait for | A specific element to appear, or a network response to fire. Both fail when the failure path renders a different UI than the success path. | wait_for_stable observes mutation rate on document.body. The page settles whether you land in success, error toast, retry button, or fallback message. |
| How you trigger the failure | Mock the model in the dev harness. Tests pass against mocks that have nothing to do with the wiring between the response and the UI. | http_request flips a fault injector that sits in front of the real provider. The application code path that ships is the one under test. |
| What you assert on | Internal component state through a test renderer. Catches the parser, misses the dead UI when isLoading goes false but the error state never sets. | Accessibility tree of the live rendered page. Roles, names, visible text. The truth a real user gets, including the user with a screen reader. |
| How the failure case is encoded | A 60+ line .spec.ts per fallback variant: imports, fixtures, helpers, a custom waitForFunction that polls a mutation predicate. | A six bullet #Case block per variant. Same wait_for_stable, same http_request, same assert. Maintenance cost scales sublinearly with the number of failure modes. |
| What the failure evidence looks like | A screenshot at the failing assert and a trace zip file. Both require tooling to read. | Full WebM video of the run, screencast-streamed during the run, plus a screenshot per visual action and a TestReport with assertion descriptions and evidence strings. |
| What survives if you switch tools tomorrow | A folder of spec files coupled to a test runner, fixtures, and helpers. | A folder of Markdown #Case files plus a 30 line fault injector. Both readable in a code review without context. |
What this approach does not do
It is not an LLM evaluation harness. It does not score the model on a fixed prompt set. It does not measure faithfulness or hallucination rates or grounding quality. Those are different problems and they need different tooling. The unit of value here is specifically the failure path coverage that offline scoring cannot give you, because offline scoring sees the model output, not the application's response to the model output.
It is also not a replacement for monitoring. A Sentry breadcrumb that fires when the parser throws in production is still valuable, and it tells you something the test never can: how often this is happening to real users. The test tells you whether the fallback is correct when it does happen. Pair them.
Run this on your AI feature today
Nine steps. None require a vendor account, an API key, or a credit card beyond your existing AI provider. The whole thing runs against your local development server in your real Chrome.
Add fallback bug detection to your AI feature in one afternoon
- Read agent.ts:186-195 (wait_for_stable tool definition).
- Read agent.ts:956-1008 (the MutationObserver runtime).
- Read agent.ts:172-184 and 925-955 (the http_request tool).
- Stand up a 30 line fault injector that responds to /__inject.
- Write one #Case for the timeout path. Confirm wait_for_stable settles on the retry UI.
- Add a #Case for malformed JSON. Assert the visible message is human readable.
- Add a negative assert: no progressbar role left in the DOM after settling.
- Run with --extension to watch the failure path play out in your real Chrome.
- Commit the #Case file. It is now your regression test for this fallback shape.
The numbers to take with you
The MutationObserver that catches every shape of AI fallback bug above is eight lines of JavaScript. The fault injector that triggers them is 30 lines of node. The agent loop that runs them end to end is four tools. If you can read 38 lines of code, you can verify the entire claim before you decide whether to use it. The recipes are committed in the open source repo whether you run the agent or lift them into your own Playwright suite.
Got an AI feature whose fallback path nobody has tested?
Bring the feature. We will write the fault injector, the #Case file, and the assertions on a call and watch the failure path play out in your real Chrome.
Frequently asked questions
Why is end to end the right level to find AI fallback bugs at all? Why not unit test the prompt or the parser?
Because most AI fallback bugs are not in the prompt or the parser. They are in the wiring between the model response and the UI. The classic ones look like this: the model returns malformed JSON, the parser throws, the catch block sets isLoading back to false but forgets to set the error state, and the user is left looking at an empty container with no message. The unit test for the parser passes (it correctly throws). The integration test for the API route passes (it correctly returns a 500). Only an end to end test that drives the real browser through the real failure and asserts on the visible DOM catches the dead state. The Assrt agent is built for exactly that level: it talks to a real Chrome through Playwright MCP, watches DOM mutations through wait_for_stable, and asserts on the rendered text the user sees. Every other layer of testing is necessary. None of them are sufficient.
What kinds of fallback bugs does this approach actually catch that mocked tests miss?
Five recurring shapes. First, the silent timeout: the model never replies, the spinner never resolves, no error message paints, the user thinks the app is alive but it is wedged. Second, the swallowed error: the parser throws, isLoading goes false, the container stays empty. Third, the fallback message that never paints because the error state is set but the conditional render guards on a different flag. Fourth, the retry loop that fires forever because the retry counter is reset on each render. Fifth, the inconsistent state: the success UI renders the partial response, the error toast also renders, both are visible at once. None of these show up in a unit test because the unit test mocks the network layer and asserts on internal state, not on what the DOM contains. The Assrt agent reads the accessibility tree of the actual rendered page and asserts on text and roles, so all five surface as scenario failures with screenshots and a video of the run.
How do you trigger the failure in the first place if you are not mocking the model?
Three options that all work without modifying the application. The first and simplest is to point the test at a development environment whose AI provider key has been revoked or rate limited on purpose, so every model call returns a 401 or 429. The second is to use the agent's http_request tool (agent.ts:172-184) to hit a fault injector you run alongside your dev server, flipping a flag like inject_timeout or inject_malformed_json before the scenario starts and clearing it after. The third is to set the environment variable AI_PROVIDER_BASE_URL to a local proxy (toxiproxy, mitmproxy, or a 50-line node script) that drops the request, returns malformed JSON, or holds the response for 90 seconds. None of these are mocks. The application code path is the one that ships. The model surface is what is broken. That distinction is the entire point: mocked tests verify the parser handles a bad response, end to end tests verify your UI is honest about it.
What does wait_for_stable actually do, and why does it matter for AI flows specifically?
It injects a MutationObserver onto document.body, counts how many DOM mutations happen, and breaks out of its polling loop only after the mutation count is unchanged for stable_seconds (default 2, max 10) within timeout_seconds (default 30, max 60). The implementation is at agent.ts:956-1008. It matters for AI flows specifically because every other waiting tool waits for a thing to be true (text present, element visible, network response received). AI fallback paths violate those expectations. The text you would wait for might never arrive. The success element might never render. The network call returns 200 with malformed JSON, so a wait_for_response fires too early. wait_for_stable does not assume which UI state you will land in. It just waits for the page to stop changing, then snapshots, then asserts on what is actually there. That works whether the final state is the success UI, the error toast, the retry button, or the fallback message. It also catches the wedged case: if the page never stops changing for 60 seconds, the tool returns a timeout result with the mutation count, and the agent can mark the scenario failed with that evidence.
How do I assert that the fallback UI is actually correct, not just that something rendered?
After wait_for_stable settles, the agent calls snapshot to read the accessibility tree of the page (the live one, not a cached representation). The tree contains roles and accessible names for every interactive element and a flat list of visible text. The agent then issues an assert call (agent.ts:133-144) with a description, a passed boolean, and an evidence string, and the runtime records that assertion against the scenario. The shape of a useful fallback assertion looks like: description='Error toast appears with an actionable retry message', passed=tree contains text matching /try again|retry|something went wrong/i, evidence='Snapshot shows alert role with text "The model is busy. Try again." and a button labeled Retry'. The agent can also call evaluate to check internal state (e.g. that a global error counter incremented) and it can call http_request to verify a backend log was written. The combination is what catches inconsistent state: the success container is gone, the error toast is present, no spinner is left over.
What does the scenario file look like for an AI fallback test?
Plain Markdown in the same #Case format Assrt uses for every other scenario. A working example: '#Case 1: Model timeout shows the retry UI with an actionable message. - http_request POST to http://localhost:4000/__inject with body {"mode":"timeout"}. - Navigate to /chat. - Type "Summarize the doc" into the message input. - Click Send. - wait_for_stable with timeout 60 and stable 3. - Assert that the page contains text matching the words try again, retry, or timed out. - Assert that the send button is enabled, not stuck disabled. - Assert that no spinner is left in the DOM.' That is all that lives in your repo. The agent fills in the rest: which selector to use, how to read the accessibility tree, when to take a screenshot, where to store the video. The scenario file is a contract; the agent figures out the implementation each run.
Does it record the failure run somewhere I can show a teammate?
Every scenario runs in a real Chrome with screencast frames streaming over WebSocket and (in local mode) a video file written to disk. The video captures the full run: the model call going out, the spinner spinning, the failure path firing, the fallback UI rendering or failing to render, the assertions evaluating. The agent also captures a screenshot after every visual action and attaches it to the assertion record. When a teammate asks why a scenario failed, the answer is a 30 to 90 second video that shows them. There is no log diving, no separate APM tooling, no guess about whether the bug was a flake. The video is the evidence.
I already have an LLM evaluation harness. Where does this fit?
The eval harness verifies the model produces good output on a fixed input set. That is necessary, and it has nothing to do with what the user sees when the model is unavailable. You need both. The eval harness scores the happy path. The Assrt scenarios cover the unhappy paths: timeout, malformed JSON, 429, 500, content filter rejection, partial response. They are different test types serving different goals. The same agent can also run end to end happy path tests, so you do not need a second tool, but the unit of value here is specifically the failure path coverage you cannot get from offline scoring.
Why not just write Playwright tests directly for the fallback UI?
You can. The reason this approach exists is that the same scenario file works for every variant of the failure. A handwritten Playwright test for the timeout case is 40 to 80 lines: imports, fixtures, a helper that talks to your fault injector, a wait_for_function with a custom mutation predicate, multiple expect calls for different aspects of the fallback UI. Tomorrow you add the malformed JSON case and write a second 40 line file. Then the 429 case. Then the partial response case. The agent collapses the same surface into a six bullet Markdown file, with the wait_for_stable, http_request, and assert tools doing the work that would otherwise be helpers in your test repo. If you would rather own the Playwright code, the open source Assrt repo shows you exactly the recipes (the MutationObserver, the http_request shape, the snapshot read) you can lift into your own test suite. Either way, the artifacts that matter (which assertions, what evidence) live in one place.
What is the smallest setup that lets me try this against my own AI feature today?
Three steps. Install the Assrt MCP (npx @m13v/assrt-mcp@latest install). Stand up a one file fault injector next to your dev server that returns a 504 or malformed JSON when a query parameter is present (a 30 line node script is enough). Write a six line #Case file that flips the injector on, navigates to your AI feature, sends a request, calls wait_for_stable, and asserts the page text contains a clear error message. Run the agent against your real Chrome with the --extension flag so you can watch the full failure path play out live. If the assertion fails, you have a real bug; the video plays back the moment the spinner went stale or the fallback message never painted. If it passes, commit the scenario; it is now your regression test for that fallback path.