AI code defensive fallback tests: make "silent pass" impossible at the schema level
Everyone agrees that AI-written code loves fallbacks. Hardcoded defaults on inference failure, bare except clauses, deprecated API rescues, "N/A" strings that make broken UI look healthy. Almost no one talks about the mirror problem: the same model writes your tests and inherits the same risk-averse posture. A test that wraps every step in a try/except and ends with expect(true).toBe(true) passes forever, and the fallback in the app under test never gets exercised. This guide is about the one rule that breaks the pattern: require evidence on every assertion, and enforce it in the tool schema so the model cannot forget.
What every article about AI fallbacks misses
Search the keyword and the consensus is loud. Sean Goedecke wrote a good piece about agents relying too much on fallbacks. Testkube has a piece on testing AI-generated code. SoftwareSeni, Medium, Nobl9 cover variations. They all describe the same problem: AI writes defensive code that silently rescues errors, and your production signal goes quiet. They stop there. None of them ask the next question, which is: what about the test code the same AI also wrote to guard that app?
A model that adds except Exception: pass to your product code is the same model that adds try around every page.click in your test file. The test passes, not because the feature works, but because nothing in the runner forced the test to prove anything. The fix is not a better prompt. It is a contract the model cannot route around.
- Prompt-level: tell the AI "do not add try/except". This works partially; the model still writes fallbacks when it feels unsure, and enforcement relies on good behavior.
- Schema-level: declare evidence as required in the assert tool's input_schema. The LLM provider rejects any call that omits it. Enforcement is not a convention; it is the shape of the tool.
The schema that makes silent pass impossible
This is the anchor of the whole argument. The assert tool shipped in Assrt is defined at /Users/matthewdi/assrt-mcp/src/core/agent.ts, lines 132-144: three properties, three required fields. The third field is the one nobody else enforces.
The required: ["description", "passed", "evidence"] array is the contract. Any LLM calling the assert tool must return all three fields. A missing evidence string is not a warning, a hint, or a suggestion. It is a rejected tool call. This is why prompt engineering for defensive code is a weaker primitive than schema engineering: the prompt is advice, the schema is a gate.
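Sketched from the three field names cited above (this is a reconstruction, not a verbatim copy of agent.ts), the tool definition looks like:

```typescript
// Hedged sketch of an evidence-required tool definition. Only the field
// names description/passed/evidence come from the source; everything else
// is illustrative.
const assertTool = {
  name: "assert",
  description: "Record a pass/fail verdict with on-page evidence.",
  input_schema: {
    type: "object",
    properties: {
      description: { type: "string" }, // what the agent claims to have tested
      passed: { type: "boolean" },     // the verdict
      evidence: { type: "string" },    // a concrete quote of page state
    },
    required: ["description", "passed", "evidence"], // the contract
  },
} as const;

// A provider validating against this schema rejects any call missing a
// required field before the tool handler ever runs.
function isValidAssertCall(input: Record<string, unknown>): boolean {
  return assertTool.input_schema.required.every((field) => field in input);
}
```

The `required` array is the whole trick: the prompt can be ignored, the schema cannot.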
“Every assertion carries description, passed, and evidence. The third field is the one that breaks defensive test patterns.”
assrt-mcp/src/core/agent.ts, lines 133-143
The same AI, two different test rigs
Below is what the model produces when you hand it Playwright and let it improvise: two nested try blocks, a silent console.log, and a trivially-true expect. Every branch survives. Under Assrt, the same plan runs through the assert tool, which forces the model to name what it saw.
The gap is the evidence field
// What the AI writes by default in a traditional Playwright test.
// Every single branch survives silently.
import { test, expect } from "@playwright/test";

test("dashboard loads", async ({ page }) => {
  try {
    await page.goto("/dashboard");
    try {
      const name = await page.locator("[data-testid=name]").textContent();
      if (!name) {
        console.log("name missing, continuing"); // silent fallback
      }
    } catch (_) {
      // swallow, keep going
    }
    expect(true).toBe(true); // "passes" with no check
  } catch (e) {
    console.log("dashboard check failed softly"); // silent fallback
  }
});

Six fallback patterns the evidence field surfaces
None of these are rare; they show up in nearly every AI-generated diff. Evidence-first assertions do not prevent the fallback from shipping; they make sure it cannot ship in silence, because the evidence string literally quotes the fallback back to you.
Hardcoded defaults on inference failure
The model wraps the call to your own AI endpoint and returns a baked-in string when it throws. The UI still renders. Your test says 'it rendered'. Nobody finds out until a user asks why every summary says 'Summary unavailable.'
Bare except clauses
except Exception: pass is by far the most common fallback AI writes. It exists because the training data is full of tutorial code that wanted to keep going at any cost. In production code it mutes real signals.
Deprecated API fallbacks
When an API migration breaks, the AI adds a try block around the new call that falls back to the deprecated one. The migration 'passes'. Deprecation warnings accrue. Eventually the old API is removed and everything collapses at once.
String defaults like 'N/A' or '—'
The riskiest fallback. The UI looks plausible. Evidence-first assertions catch it by quoting the exact string back to you in the evidence field.
Keyword-matcher fallbacks
When the clever path fails the AI drops to a literal keyword check. Tests that only verify the happy path pass forever; the fallback never gets exercised. Evidence assertions force you to describe which path you saw.
Silent console.log('…failed softly')
The trick is the 'softly'. Your log pipeline sees a benign info line. Your test sees a completed function. Your user sees a blank section. Only an evidence-first test catches this because the evidence field cannot be 'softly'.
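For concreteness, here is what two of these patterns typically look like in AI-written product code. The function names are invented for illustration:

```typescript
// Illustrative only: a hardcoded default on failure and a string default,
// as they commonly appear in AI-generated frontend code. fetchSummary and
// profilePlan are hypothetical names.
async function fetchSummary(
  load: () => Promise<string>, // the real API call, injected here for clarity
): Promise<string> {
  try {
    return await load();
  } catch {
    console.log("summary fetch failed softly"); // the benign-looking log line
    return "Summary unavailable."; // hardcoded default on inference failure
  }
}

function profilePlan(plan: string | null): string {
  return plan ?? "N/A"; // string default that makes broken UI look healthy
}
```

A test that only checks "the card rendered" passes against both of these; an evidence-first assertion has to quote "N/A" or "Summary unavailable." back at you.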
Watching the contract fire
Here is an actual run against a dashboard that silently renders "N/A" in a profile card when the backend returns a 500. Step 3 passes because the user name is really visible. Step 4 fails because the evidence field has to quote the card text, and the card text is "N/A". The fallback surfaced itself. No human had to suspect it.
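Those two steps, expressed as the assert tool calls the agent emits, look roughly like this. The evidence strings here are hypothetical; the three field names come from the tool's schema:

```typescript
// Hedged sketch of the two assert calls described above. Only the field
// names are from the source; the evidence text is illustrative.
type AssertCall = {
  description: string;
  passed: boolean;
  evidence: string; // required: a concrete quote of on-page state
};

const step3: AssertCall = {
  description: "User name is visible on the dashboard",
  passed: true,
  evidence: 'Heading "Welcome, John" visible at top of main',
};

const step4: AssertCall = {
  description: "Profile card shows real account data",
  passed: false,
  evidence: 'Card text: "N/A"', // the fallback quotes itself back to you
};
```

There is no silent branch to hide in: omitting evidence does not typecheck here, and at runtime the provider rejects the call against the schema.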
Open events.json for the run and the same record is there in machine-readable form. One jq query gives you every assertion the agent made, with its evidence string.
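If you prefer a script to a one-liner, the same extraction is a few lines of TypeScript. This is a minimal sketch, assuming the events.json shape implied by the jq queries in this guide: a top-level "steps" array whose entries may carry an evidence string next to description and passed.

```typescript
// Assumed shape of a step record in events.json (not the literal schema).
type Step = { description?: string; passed?: boolean; evidence?: string };

// Every assertion the agent made, with its evidence string.
function assertions(run: { steps: Step[] }): Step[] {
  return run.steps.filter((s) => typeof s.evidence === "string");
}

// TypeScript equivalent of: jq '.steps[] | select(.evidence | contains("N/A"))'
function evidenceContaining(run: { steps: Step[] }, needle: string): Step[] {
  return assertions(run).filter((s) => (s.evidence as string).includes(needle));
}
```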
No silent retries, no silent catches
Evidence on every assertion is half the story. The other half is that a failed action has nowhere to hide. Assrt's system prompt spells out the recovery protocol explicitly. Snapshot, try a different approach, cap at three attempts, and then commit to a verdict. A retry loop is itself a fallback. It masks intermittency. A three-attempt ceiling is the floor that makes the evidence field meaningful.
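The protocol in prose maps onto a loop like this. All names here are hypothetical sketches of the described behavior, not Assrt's actual internals:

```typescript
// Illustrative sketch of the recovery protocol: fresh snapshot before each
// attempt, a hard ceiling of 3, then a committed verdict with evidence.
type Verdict = { passed: boolean; evidence: string };

async function attemptWithCeiling(
  tryStep: (snapshot: string) => Promise<string>, // returns evidence on success
  takeSnapshot: () => Promise<string>,
  maxAttempts = 3,
): Promise<Verdict> {
  let lastError = "no attempt made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const snapshot = await takeSnapshot(); // fresh snapshot before each retry
    try {
      return { passed: true, evidence: await tryStep(snapshot) };
    } catch (e) {
      lastError = String(e); // recorded, never swallowed
    }
  }
  // The ceiling forces a real verdict instead of an open-ended retry loop.
  return { passed: false, evidence: `failed after ${maxAttempts} attempts: ${lastError}` };
}
```

Each failed attempt becomes its own visible step, and the final verdict carries the last error as evidence instead of averaging it away.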
The checklist for an anti-defensive test rig
If you want to implement the same pattern inside another harness (Playwright, Cypress, Puppeteer, anything) these are the eight rules that matter. Miss any one and the defensive patterns find the gap. Assrt enforces all eight by default.
Requirements a test rig must enforce to block AI defensive fallbacks
- Every assert call must carry a concrete evidence string
- Evidence must quote real on-page text, URL, or ref value
- No expect(true).toBe(true) escape hatches in the runner
- No silent try/catch blocks wrapping test steps
- Failed action triggers a fresh snapshot before verdict
- Screenshot auto-attaches to every state-changing call
- Retry ceiling: 3 attempts, then fail with evidence
- events.json records the pass/fail verdict and the evidence together
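As a sketch of how another harness might enforce the first two rules inside its own assert handler (a belt-and-braces check on top of the schema; all names here are hypothetical):

```typescript
// Hypothetical second line of defense: reject evidence that is empty or
// that quotes nothing actually present in the page state or URL.
function validateEvidence(evidence: string, pageText: string, url: string): void {
  const trimmed = evidence.trim();
  if (trimmed.length === 0) {
    throw new Error("assert rejected: evidence is empty");
  }
  // Crude grounding check: the evidence should quote the page verbatim,
  // match the URL, or at least share a substantive word with the page.
  const grounded =
    pageText.includes(trimmed) ||
    url.includes(trimmed) ||
    trimmed.split(/\s+/).some((w) => w.length > 3 && pageText.includes(w));
  if (!grounded) {
    throw new Error("assert rejected: evidence does not match page state");
  }
}
```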
Five steps to put this in your repo this afternoon
The whole pattern fits in a half-day. Install the runner, write one plan file, run it once, read events.json, commit the plan. Your CI now catches silent fallbacks the same way a type system catches null derefs, by refusing to compile the silent version.
From zero to evidence-enforced tests
Write the case in English, not in try/except
Save a scenario.md with a single #Case block. Describe the flow the way you would dictate it to a human tester. No selectors, no waits, no error handlers. The AI planner reads the plan and picks tool calls.
Let the schema enforce evidence on every assertion
When the agent calls assert, the LLM provider validates against input_schema.required = ['description', 'passed', 'evidence']. A call missing evidence is rejected before the tool handler runs. You cannot skip it from the prompt and you cannot forget it in the code, because the schema is the contract.
Run against the real app with one command
npx assrt-mcp --url https://app.example.com --plan tests/scenario.md. The runner spawns real Playwright MCP. Every tool call is printed live. Every state change is screenshotted. Every assertion lands in events.json with its evidence string.
Grep events.json for fallback strings
jq '.steps[] | select(.evidence | contains("N/A"))' events.json. One command finds every assertion whose evidence quotes a known-bad placeholder. The fallback your frontend silently renders becomes a test failure because the evidence field is forced to contain it.
Keep scenario.md under version control
The plan file is plain Markdown. It diffs like code. When a PR changes a fallback string, the matching #Case in scenario.md updates in the same commit. No dashboard to sync, no vendor state to migrate, no $7.5K/mo lock-in.
Default Playwright vs Assrt, on the fallback axis
Playwright is the best browser test framework in the world. That is not the critique. The critique is that its primitives do not force the human (or AI) writing the test to prove anything. Assrt is a thin layer on top of real Playwright that does.
| Feature | Plain Playwright (AI-written) | Assrt |
|---|---|---|
| Assertion shape | expect(condition) with no contextual record | assert { description, passed, evidence } — all three required |
| Defensive test pattern | try/catch around steps is legal and common | No try/catch in the plan; errors surface as failed steps |
| Silent pass possible? | Yes (expect(true).toBe(true) is still a pass) | No — missing evidence rejects the tool call at schema level |
| Retry loop | Configurable, often unbounded | 3 attempts max, then fail with evidence (agent.ts:220-226) |
| Auto-screenshot per action | Opt-in, usually off in headless CI | On by default for every non-inspection tool call |
| Plan format | Proprietary YAML or vendor dashboard | scenario.md plain text, lives in your repo |
| Runner | Closed cloud service | npx assrt-mcp, self-hosted, $0 + LLM tokens |
| Tests survive a vendor cancellation | No — tests live in their cloud | Yes — scenario.md is already in your git history |
Bring an AI-written repo, leave with an evidence-first test plan
Thirty minutes. You pick one flow you suspect hides a fallback. We draft the #Case block live and run it against your staging branch.
Book a call →

FAQ on AI code defensive fallback tests
What is a defensive fallback in AI-generated code, and why does it matter for testing?
A defensive fallback is any branch the model inserts that catches an error and returns a placeholder, a default value, or 'unknown' instead of failing. It shows up in production code as catch-all try/except blocks, hardcoded responses when inference fails, or string defaults that hide a missing upstream field. The danger is not that it crashes; it is that it never crashes. Your logs look clean. Your tests pass. And somewhere a user is staring at a UI that says 'N/A' because the model quietly rescued a broken API call. The test layer has to be the thing that proves the real path ran. Every assertion needs to cite a concrete on-page fact, not just 'passed: true'. That is what the evidence field in Assrt's assert tool at agent.ts lines 133-143 exists for.
Why not just tell the AI not to write defensive fallbacks?
Because the same model that writes your app tends to write your tests, and it defaults to the same risk-averse posture. Prompts like 'do not add try/except' reduce fallbacks but do not eliminate them. What works is making the test rig itself refuse a silent success. Assrt's approach is structural: the assert tool's input_schema.required array contains 'evidence' as a mandatory string, so any LLM calling the tool has to return a concrete piece of page state (a heading, a URL fragment, a DOM attribute) or the tool call is rejected by the schema. You can read this at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 133-143. It is not a convention; it is a contract.
How is 'evidence' different from an expect() call in Playwright?
An expect() call checks a condition and throws if it fails. That is good, but the AI can still write expect(page.locator('body')).toBeVisible() and call the test passed. Evidence is a separate field that says 'here is what I saw that made me decide pass or fail.' It lives alongside description and passed in the assert tool. When Assrt replays the run, every assertion has a human-readable record of why the agent believed it. If an AI agent returns evidence: 'dashboard heading Welcome, John visible at top of main' and it turns out the page actually said 'Loading...', you catch the false pass instantly by reading events.json. Without the evidence field you would have to replay video frame by frame to know what the agent saw.
What stops the AI from just making up evidence?
Two things. First, every tool call that mutates the page (click, type, navigate, scroll) emits an automatic screenshot in the agent run loop; see agent.ts around the 'screenshot' branch. If the AI claims to see 'Welcome' and the screenshot shows a blank page, a human reviewer or a downstream check can flag it. Second, the accessibility ref protocol in the system prompt, lines 207-218, requires the model to call snapshot first and use the resulting ref ids. The refs are grounded in the actual DOM. It is much harder to lie about an element that was returned by the accessibility tree seconds ago than to lie about a hand-picked selector. The combination is not bulletproof, but it raises the cost of a silent pass high enough that most LLMs stop attempting it.
How does this interact with fallback patterns in the app code itself?
Evidence-first assertions surface product fallbacks as a side effect. If your frontend silently renders 'N/A' when the profile API 500s, the agent has to type 'N/A' into the evidence field when asked about the profile card. You read events.json, see evidence: 'Card text: N/A', and immediately know something upstream is swallowing an error. Compare to a traditional test that would have asserted the card exists and moved on. The test passes, the card is a lie, and you find out weeks later. A test layer that forces evidence exposes fallbacks in the product because the evidence quotes the fallback string back to you in plain English.
Does Assrt generate real Playwright code or its own YAML format?
Real Playwright. Assrt spawns @playwright/mcp under the hood and drives a real Chromium (or Firefox, or WebKit) process through real Playwright APIs. Your plan file is plaintext #Case blocks in scenario.md, not proprietary YAML. The test artifacts that land in /tmp/assrt/<runId>/ are standard Playwright traces, screenshots, and videos. You can open the video in any browser, inspect events.json with jq, and replay the exact tool call sequence without any vendor dashboard. That matters for defensive-test conversations because you can audit what the agent did; closed YAML formats and vendor dashboards do not give you that audit trail.
What does a failing assertion look like in Assrt compared to a false-positive in a vendor tool?
A failing assertion records description, passed: false, evidence: '<what was actually on the page>', and triggers a final screenshot plus a fresh accessibility snapshot before the scenario ends (see error-recovery branch at agent.ts lines 1012-1015). The run directory contains events.json with the failed step, the screenshot that shows the real DOM, and the video showing the sequence that led there. A vendor tool with closed tests gives you a red checkmark and a generic screenshot viewer. The difference matters when the failure is a fallback the AI wrote: you need to see the exact string the UI displayed, because that string is often the only hint that something deeper broke.
Can this run in CI, or does it require a dashboard?
CI only. There is no dashboard. One command: npx assrt-mcp --url https://your.app --plan tests/scenarios/critical-path.md. Your LLM key comes from ANTHROPIC_API_KEY or GEMINI_API_KEY in env. The runner writes /tmp/assrt/<runId>/ with the video, screenshots, events.json, and results.json. You upload that directory as a GitHub Actions artifact on failure and move on. No seats, no billing per test, no vendor lock-in. The tests live in your repo as Markdown. When you cancel (or switch LLM providers) the tests come with you.
How many regex patterns does the OTP extractor try before it falls back to raw digits?
Four keyword-anchored patterns, then three raw-digit fallbacks, in that order, at email.ts lines 101-109. The keyword patterns look for 'code', 'verification', 'OTP', and 'pin' followed by a colon or whitespace and 4 to 8 digits. Only after all four fail does the extractor try \b(\d{6})\b, then \b(\d{4})\b, then \b(\d{8})\b. The ordering matters for defensive-test discussions because raw-digit fallbacks are themselves a fallback risk: any stray 6-digit number in the email body (an order ID, a timestamp) would match. By preferring keyword-anchored patterns, the extractor fails loudly on bad templates instead of silently pulling the wrong number.
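The ordering described can be sketched like this. This is a reconstruction from the patterns named above, not the literal email.ts source:

```typescript
// Sketch of the described ordering: four keyword-anchored patterns first,
// three raw-digit patterns as last resort. Regexes reconstructed from the
// prose above, not copied from email.ts.
const keywordPatterns = [
  /code[:\s]+(\d{4,8})/i,
  /verification[:\s]+(\d{4,8})/i,
  /OTP[:\s]+(\d{4,8})/i,
  /pin[:\s]+(\d{4,8})/i,
];
const rawDigitFallbacks = [/\b(\d{6})\b/, /\b(\d{4})\b/, /\b(\d{8})\b/];

function extractOtp(body: string): string | null {
  for (const re of [...keywordPatterns, ...rawDigitFallbacks]) {
    const m = body.match(re);
    if (m) return m[1]; // first pattern wins, in declared order
  }
  return null; // fail loudly: no silent default
}
```

Because the keyword patterns run first, a stray order ID never outranks a string like "code: 987654"; the raw-digit patterns only fire when no keyword anchor exists at all.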
Does the agent silently retry on failure, or does it always surface the error?
It surfaces. The error-recovery section of the system prompt (agent.ts lines 220-226) says: take a fresh snapshot, try a different ref or approach, and if stuck after 3 attempts, mark the scenario as failed with evidence and call complete_scenario. There is no silent retry loop. Every attempt shows up in the run log as its own step with a status of failed or running. This matters for defensive-code conversations because a retry loop is itself a kind of fallback; it masks intermittent failures by papering over them. Assrt retries at most three times and then forces the agent to commit to a pass or fail verdict with evidence. You get a real answer, not an averaged one.
What are the five tool fields I should know about to audit a run for defensive behavior?
description (what the agent claimed to test), passed (the verdict boolean), evidence (a concrete quote from the page), plus the auto-attached screenshot and the fresh accessibility snapshot taken after any failed action. All five live in events.json for a given run. When you're triaging a red build, grep events.json for passed:false, read the evidence field, then open the matching screenshot to cross-check. If the evidence says 'Welcome, John' but the screenshot shows a loading spinner, the AI hallucinated a pass; your fallback-catcher test worked. If the evidence and screenshot agree, the app is actually broken.
Is the evidence requirement enforced at the schema level or the prompt level?
Schema. agent.ts lines 135-143 define input_schema with required: ['description', 'passed', 'evidence']. The LLM provider (Anthropic, Gemini, whoever) rejects any tool call that omits required fields before the call even reaches the tool handler. This is important: prompt-level rules can be ignored by a distracted model. Schema-level rules are a hard constraint. The same pattern is used for the complete_scenario tool (required: ['summary', 'passed']) and suggest_improvement (required: ['title', 'severity', 'description', 'suggestion']). Every tool that records a verdict or a bug report has its evidentiary fields declared required in the schema so the model cannot skip them.