The evidence contract

Your AI's try/catch returns []. The unit test passes. The user sees nothing.

Defensive AI code hides bugs because unit tests assert on what a function returned, not on what the user saw. The only assertion that catches a swallowed error is one that quotes the live page. Below is the 12-line tool definition in the open-source assrt-mcp repo that makes that contract mechanical.

Matthew Diakonov
9 min read
4.9 from developers running Assrt locally on shipped AI code
Every passed assertion stores a quoted page snippet as evidence
Real Chrome, real network, real defensive-path detection
MIT licensed. Tests live in your repo as plain text. No vendor lock-in.

The pattern that ships almost every day

Pick a recent PR from a Cursor or Claude Code agent that touched an API call. Odds are excellent the diff added a try/catch around the fetch, an optional-chaining default on the response, and a falsy guard in the JSX. None of those choices are wrong in isolation. Together they form a perfect cover for a real 500: every layer returns a valid type, the linter is silent, the unit test (which mocked the fetch) is green, and the page renders an empty container the user has no way to interpret.

The standard advice is to let exceptions propagate. That advice is correct and also useless when the same agent is going to write the next PR the same way an hour from now. What you need is a test contract the AI literally cannot satisfy with a silent fallback. Not a policy. A schema.

What the AI wrote (in production right now)

The anchor: 12 lines of tool schema in agent.ts

When the Assrt browser agent wants to mark a step passed: true, it has to call the assert tool. That tool's input schema is below, copied verbatim from the public repo. Three fields are marked required: a description of what is being asserted, the boolean result, and an evidence string. There is no way to call this tool with a missing evidence value. The Anthropic and Gemini SDKs both reject the call before it leaves the model.

src/core/agent.ts (assrt-mcp, MIT licensed)
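The screenshot is not reproduced here, so as a stand-in, here is a hedged reconstruction of what a tool definition with those three required fields looks like. The field names come from the prose; the descriptions and surrounding shape are assumptions, not the repo's verbatim code.

```typescript
// Hedged reconstruction of the assert tool definition the article points at
// (src/core/agent.ts:132-144). Only the three required field names are from
// the prose; everything else here is an illustrative assumption.
const assertTool = {
  name: "assert",
  description: "Record a pass/fail verdict backed by quoted page text",
  input_schema: {
    type: "object",
    properties: {
      description: { type: "string" }, // what is being asserted
      passed: { type: "boolean" },     // the verdict
      evidence: { type: "string" },    // quoted snippet from the live snapshot
    },
    required: ["description", "passed", "evidence"], // the contract itself
  },
} as const;
```

Because `evidence` sits in `required`, an SDK that validates tool calls against the schema rejects any call that omits it before the call ever executes.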

The handler that runs when the tool fires is just as small. It stores the triple in order. The result line that gets surfaced to the agent in the next turn literally interpolates the evidence string. So the agent sees its own evidence claim echoed back, which is a useful guard against drift over a long scenario.

src/core/agent.ts:893-903 (assertion handler)
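The handler itself is shown as a screenshot; a hedged sketch of what it plausibly does, based only on the description above (the names here are assumptions, not the repo's code):

```typescript
// Hedged sketch of the assertion handler the prose describes (the real one
// lives at src/core/agent.ts:893-903). Names and shape are assumptions.
interface Assertion {
  description: string;
  passed: boolean;
  evidence: string;
}

const assertions: Assertion[] = [];

function handleAssert(input: Assertion): string {
  assertions.push(input); // store the triple in order
  // The evidence is interpolated into the result line, so the agent sees its
  // own claim echoed back on the next turn.
  return `${input.passed ? "PASSED" : "FAILED"}: ${input.description} | evidence: ${input.evidence}`;
}
```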
Three required fields on the assert tool: description, passed, evidence. The schema is the contract.

src/core/agent.ts:142 — required: ['description', 'passed', 'evidence']

Why the unit test goes green and the browser scenario fails

The diagram below is the same defensive function evaluated by both layers. The unit test never reaches the catch branch because it mocked the fetch. The real browser hits the real 500, the catch branch fires, the page renders an empty container, and the evidence-required assertion has nothing to quote.

Two layers, one defensive function, opposite verdicts

AI agent: wraps fetch in try/catch, return []
Unit test: calls getOrders() with mocked fetch → returns [] (mock path, never throws) → expect(orders).toBeInstanceOf(Array) PASSES
Real browser: hits real /api/orders (500) → returns [] (catch branch fires)
Evidence assertion: snapshot of rendered /orders page → no row with "#ORD-" → assertion FAILS
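The two verdicts can be reproduced in a few lines. This is an illustrative sketch, not the repo's code: one defensive function, two fetch behaviors, opposite outcomes.

```typescript
// One defensive function, evaluated under a mocked fetch (unit-test layer)
// and a failing fetch (browser layer). Illustrative only.
async function getOrders(
  fetchImpl: (url: string) => Promise<Response>
): Promise<{ id: string }[]> {
  try {
    const res = await fetchImpl("/api/orders");
    if (!res.ok) return []; // the real 500 lands here
    const data = await res.json();
    return data?.orders ?? [];
  } catch {
    return []; // a network failure collapses into the same empty list
  }
}

// Unit-test layer: the mock always succeeds, so the defensive branches are dead code.
const mockedFetch = async () =>
  new Response(JSON.stringify({ orders: [{ id: "ORD-1" }] }));

// Browser layer: the real endpoint 500s; the same function returns a valid-looking [].
const failingFetch = async () => new Response("boom", { status: 500 });
```

A type-level assertion (`is it an array?`) passes in both cases; only an assertion that looks for an "#ORD-" row on the rendered page can tell them apart.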

Six AI defensive patterns, one test contract that catches all of them

The six tiles below are the patterns that keep showing up in AI-authored diffs. Each card states the pattern and the specific evidence the assertion must require. Notice that the right-hand side is always a quotable thing on the rendered page, never a return value.

Catch-all try/catch returning a default

The function wraps everything in try/catch and returns [], null, or {} from the catch. Type system is satisfied. UI renders an empty container. Evidence required: at least one row visible with a real ID.

Optional chaining masking missing data

data?.user?.profile?.name ?? "" hides a 500 from the API as an empty string. Header renders blank. Evidence required: the header text contains the seeded user's known name.

Conditional render on falsy state

if (!items?.length) return null bails before the loading or error UI ever renders. The user sees a fully blank section. Evidence required: either a row, a named loading state, or the literal empty-state copy is visible.

Generic error boundary swallowing the cause

A catch-all <ErrorBoundary fallback="Something went wrong" /> swallows every render error into one string. Evidence required: the boundary fallback text does NOT appear under conditions where the page should render normally.

Server route returning {error: 'failed'}

Next.js route handler catches the DB exception and returns NextResponse.json({error: "failed"}). Client UI renders nothing or a generic toast. Evidence required: the URL becomes /orders/ORD-... or a confirmation row appears with a real ID.

Loading flag flipped in the catch branch

useEffect sets isLoading false on rejection without surfacing the error. Spinner disappears, content never appears. Evidence required: after wait_for_stable, the page contains rendered data, not the post-spinner blank state.

How the evidence contract closes the loop

Three inputs feed the assertion. Six patterns feed the defensive code path. One schema in the middle. The output is binary and quotable.

Inputs, the schema, and the verdict

Plain-text #Case + real browser snapshot + AI-generated diff → assert tool → quoted page text (pass) or no matching text (fail) → agent re-edits

What the test file actually looks like

No DSL, no YAML, no proprietary format. Two cases in plain text, committed next to the code they test. The same agent that wrote the defensive function writes the case in the same turn. The case names the user-visible outcome; the assertion the agent will fire mid-run is implicit in the prose.

tests/orders.txt
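The file itself appears as a screenshot in the original. A hedged sketch of what two such cases plausibly look like, given the prose's description of #Case blocks; the exact routes, copy, and seeded data are assumptions:

```
#Case: Orders page shows real rows for a signed-in user
Sign in as the seeded user and go to /orders.
At least one row containing "#ORD-" and a non-zero price is visible.

#Case: Submitting an order surfaces a confirmation
Fill the order form and submit.
The URL becomes /orders/ORD-... or a confirmation row with a real order ID appears.
```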

What you see when the defensive path fires

Below is the literal output of npx assrt run against a dev server where getOrders returned an empty array because the catch branch fired. The evidence line is the part to read carefully: the agent could not find a row matching #ORD- in the snapshot, so the assertion failed with the exact reason.

assrt run output (screenshot)

The before-and-after the AI agent reads

Once the report comes back with the failed evidence, the coding agent reads it the same way a human would. The fix is almost always to remove the defensive return and let the caller decide what to do with the error. Below: the same function before and after, with the line counts that matter.

getOrders, before and after the evidence-required test fired

export async function getOrders(userId: string) {
  try {
    const res = await fetch(`/api/orders/${userId}`);
    if (!res.ok) return [];
    const data = await res.json();
    return data?.orders ?? [];
  } catch (err) {
    console.error("[orders] fetch failed:", err);
    return [];
  }
}
45% fewer lines
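The "after" version appears only as a screenshot in the original. A hedged sketch of what it plausibly looks like once the defensive returns are removed, with the error message wording as an assumption:

```typescript
// Hypothetical "after": the defensive fallbacks are gone, so a bad response
// propagates to the caller instead of rendering an empty page.
export async function getOrders(userId: string) {
  const res = await fetch(`/api/orders/${userId}`);
  if (!res.ok) {
    throw new Error(`GET /api/orders/${userId} failed with status ${res.status}`);
  }
  const data = await res.json();
  return data.orders; // no ?? [] fallback: missing data surfaces instead of hiding
}
```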

How big the gap actually is

Three numbers worth keeping in mind. The first is the required-fields count on the assert tool: three. The second is the line count for the entire tool definition that enforces it: twelve. The third is the number of distinct AI defensive patterns those twelve lines catch when the assertion targets visible page state instead of return values: six (the cards above), and growing as the next AI training cohort ships.

3 required fields on the assert tool
12 lines in src/core/agent.ts that enforce it
6 defensive patterns one contract catches
0 vendor SDKs to install

Patterns we keep seeing in shipped diffs

try { fetch } catch { return [] }
data?.user?.name ?? ""
if (!items?.length) return null
<ErrorBoundary fallback="Something went wrong" />
NextResponse.json({ error: "failed" })
setLoading(false) in catch
result?.items?.filter(Boolean) ?? []
if (err) return defaultConfig
Promise.allSettled().then(rs => rs.filter(ok))

How to write a test that has to read the page

Five steps. None of them require Playwright knowledge or a new test framework. The whole loop runs inside the agent that just edited the code.

Evidence-first test writing

1

Pick the user-visible outcome

Not the function's return type. The thing the user reads on the page if it works: a price, an order ID, a confirmation banner with the right text. That is the assertion target.

2

Write the assertion as a sentence

Plain text in a #Case block. "After submit, a row containing #ORD- and a non-zero price is visible." That sentence becomes the description field. The evidence field gets filled at run time from the live snapshot.

3

Run it against the real dev server

assrt run hits localhost:3000 in a real Playwright browser. No mocks. The defensive try/catch fires the same way it would in production. The agent inspects the rendered page after the action.

4

Read the evidence in the report

Pass: the report cites the exact text the agent saw. Fail: the report cites what was on the page instead. There is no ambiguity about what the user would have experienced.

5

Hand the failure back to the agent

Same turn, no hand-off. The Claude Code or Cursor agent that wrote the defensive try/catch reads the failed evidence and re-edits the function to let the error propagate. Loop closes inside one prompt.

Two related guards in the same agent loop

The evidence contract is the load-bearing piece, but two other tools in the same agent loop are worth knowing about because they catch defensive patterns the single assertion misses.

wait_for_stable: catches “loading flag flipped in catch”

Defined at agent.ts:956-1009. Injects a MutationObserver into the page and polls until the DOM stops changing. If the AI defensive code flips isLoading=false in a catch branch without ever rendering the data, the spinner disappears but no expected text shows up. The next assert against that text fails with evidence that names the post-spinner blank state.
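A simplified, testable sketch of the wait-for-stable idea: the real tool injects a MutationObserver into the page, while this approximation polls a snapshot function until two consecutive reads are identical. The function name, callback, and timing defaults here are illustrative assumptions.

```typescript
// Poll until the snapshot stops changing, or time out. The real agent.ts
// version (956-1009) uses a MutationObserver; this is a stand-in sketch.
async function waitForStable(
  getSnapshot: () => string,
  { intervalMs = 50, timeoutMs = 2000 } = {}
): Promise<string> {
  const deadline = Date.now() + timeoutMs;
  let prev = getSnapshot();
  while (Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
    const next = getSnapshot();
    if (next === prev) return next; // the page stopped changing
    prev = next;
  }
  throw new Error("page never stabilized within timeout");
}
```

The key property is the same either way: a spinner that disappears without the data ever rendering still stabilizes, and the stable snapshot is what the next assert has to quote from.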

suggest_improvement: catches passing-but-broken UX

Defined at agent.ts:158-170 and handled at agent.ts:914-924. Even on an otherwise passing scenario, the agent can flag a UX regression with a severity. A defensive try/catch that makes the form “succeed” but shows no confirmation to the user gets logged here, with the same evidence-style description, so the agent that wrote the form sees it on the next turn.

Want to see this contract running on your code, live?

20-minute call. Bring a recent AI-authored PR. We will run the evidence-required assertion against your dev server and see what falls out.

Frequently asked questions

Why do unit tests miss AI defensive fallback bugs in the first place?

Unit tests assert on what a function returns. The defensive try/catch returns a valid-looking value (an empty array, null, a default object), so the assertion passes. The function honored its type signature. The catch branch never threw. From the test's point of view nothing went wrong. The bug only exists at the rendered-page layer, which the unit test never sees. This is the core mismatch: the AI hardened the wrong layer, and the only test that can catch it is one that asserts on visible browser state.

What does an evidence-required assertion actually look like in code?

Open the assrt-mcp repo and read src/core/agent.ts lines 132 to 144. The assert tool's input_schema declares three required parameters: description, passed (boolean), and evidence (string). The agent's underlying LLM cannot serialize a tool call with passed:true and an empty evidence field — the schema rejects it. The assertion handler at line 897 stores the triple verbatim in the scenario record. So every green check you see in a report is backed by a quoted snippet of what was on the page at that moment. If the AI's defensive code rendered a blank list or a generic toast, there is no quotable evidence for the success state and the assertion has to fail.

How is this different from snapshot testing or visual regression?

Snapshot tests compare DOM trees or pixel diffs to a saved baseline. They flag any change, including legitimate ones, so teams either freeze them and lose signal, or update them constantly and lose the catch. Evidence-required assertions are semantic: they say 'a row containing the order ID must be visible after submit.' If the defensive fetch returns an empty cart, no order ID is visible, and the named assertion fails with a precise message instead of a noisy diff. There is no baseline to maintain.

Will this slow down my test suite the way Cypress or Selenium did?

Each scenario runs against a real Chromium via Playwright MCP, which is fast enough to run inline during a coding agent's turn. The Assrt agent loop drives the browser through the accessibility tree (snapshot refs, not pixel hunting) and only takes screenshots after visual actions, so token use stays low and runs finish in seconds for the typical 3 to 5 step #Case. You can absolutely run a 30-case suite in CI; you can also run a single targeted case after each AI edit and never wait more than a few seconds.

Does the AI agent ever cheat by writing fake evidence?

The agent's evidence string is verified against the latest accessibility snapshot the browser handed back. Snapshot text comes directly from Playwright MCP and is logged with each step (see the screenshot+snapshot loop at agent.ts:649-654 and the per-step screenshot policy at agent.ts:1024). When the run is reviewed, the evidence string is sitting next to the snapshot it was supposedly drawn from. If the AI hallucinates 'Order #1234 visible' but the snapshot has no such text, the discrepancy shows up in the report.
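The core of that review-time check can be stated as one predicate. This helper is an illustration of the idea, not the repo's code; the whitespace normalization is an assumption.

```typescript
// Evidence is only trustworthy if it can actually be found in the snapshot
// it was supposedly drawn from (whitespace-normalized for readability).
function evidenceIsQuotable(evidence: string, snapshot: string): boolean {
  const norm = (s: string) => s.replace(/\s+/g, " ").trim();
  return norm(snapshot).includes(norm(evidence));
}
```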

What about errors the AI swallows on the server, before any rendering?

Server-side defensive code (a Next.js route handler that catches a database error and returns {items: []}) renders an empty page in the browser. That empty page has no visible item rows, no totals, no IDs. The visible-evidence assertion 'at least three product cards with non-zero prices appear' fails the same way it would for a client-side fallback. The assertion does not care which layer swallowed the error; it only cares about the user-visible result. That is the entire point of asserting at the rendered layer.
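A framework-agnostic sketch of that server-side swallow (the names and shape are illustrative, not a real Next.js handler): the DB failure is caught and returned as a "successful" empty payload, which is exactly what the browser then renders as a blank page.

```typescript
// The 500-worthy DB failure becomes a 200 with an empty list; the client
// renders a blank page with nothing for an evidence assertion to quote.
async function ordersHandler(
  fetchOrdersFromDb: () => Promise<string[]>
): Promise<{ status: number; body: { items: string[] } }> {
  try {
    return { status: 200, body: { items: await fetchOrdersFromDb() } };
  } catch {
    // Defensive: the error is swallowed server-side, before any rendering.
    return { status: 200, body: { items: [] } };
  }
}
```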

Is assrt-mcp open source and free to run locally?

Yes. The MCP server, CLI, and agent loop are MIT-licensed and ship as @assrt-ai/assrt on npm. One command (npx @assrt-ai/assrt setup) registers the MCP server with Claude Code, Cursor, and other tool-using IDEs. No cloud account is required to run tests locally. Tests live in your repo as plain text. The optional cloud sync is opt-in. Compared to vendor platforms in this space charging four-figure or five-figure monthly minimums, the cost-of-entry difference is the entire reason the evidence-contract design exists out in the open: you can read agent.ts yourself.

Can a single assertion really catch every AI defensive pattern?

No, and that is the right framing. One evidence-required assertion catches the patterns that affect one user-visible outcome. The recipe is one assertion per outcome that matters: a price renders, an ID appears in the URL, a confirmation toast contains the right text, a list has at least N rows. A handful of those, written once per critical flow, cover the entire surface area where a swallowed error would have been visible to a user. That is a lot smaller than a typical unit test suite and a lot more focused than a visual snapshot baseline.