AI agent browser automation reliability is not one prompt. It is five recovery primitives, named and open.

Most guides on this topic treat reliability as a prompt-engineering problem. It is not. Real runs fail five specific ways: the ref goes stale, the DOM keeps churning, the model API returns 529, the OTP field is split across six inputs, and a tool throws. This page walks through the five primitives Assrt ships for those five failures, with the file, the line number, and the exact code for each.

Matthew Diakonov
11 min read
4.9 from live end-to-end runs
Open source: one 1,087-line agent file, five named primitives
Every primitive has a file, line number, and code snippet
Provider-agnostic: same stack works on Claude and Gemini
Scenarios live as plain markdown on your disk
5 recovery primitives
500ms DOM-mutation poll
2000 chars of tree on throw
4 API-overload retries

The reframe: reliability is the recovery matrix

Every article about this topic leans on the same framing: a reliable agent is one that succeeds more often. That is the measurement, not the mechanism. The mechanism is what the agent does in the 10% of runs where something goes wrong, and that is where most tools ship nothing at all. Below is the one picture that matters: five failure classes fan into a single agent, which fans out to five specific primitives. The rest of this page walks each edge.

Five failures, one agent, five primitives

Failures in: stale ref · DOM churn · 529 / 429 · multi-field OTP · tool throw
    ↓
Assrt agent
    ↓
Primitives out: fresh tree injection · MutationObserver poll · 5s * attempt backoff · DataTransfer paste · safe-cut sliding window

The common framing vs. what actually ships

Most existing guides on this topic talk about reliability at the prompt level. Model choice, locator strategy, temperature. Those matter, but they are not where real runs live or die. The contrast below shows the shift.

Reliability is a prompt quality problem. Use a better model. Write more detailed instructions. Add more examples. Turn down temperature. If a scenario flakes, rewrite the prompt and hope.

  • Treats reliability as a single dial
  • No response to stale refs beyond "try again"
  • Fixed sleeps for DOM settling; either too short or too long
  • No distinction between a retryable 529 and a fatal invalid_request
  • Multi-field OTP inputs are typed per-digit; flaky by design

Primitive 1: the fresh tree on throw

The single most common failure in AI agent browser automation is a ref that was correct two steps ago and is not correct now, because the page re-rendered. The textbook response is to retry. Assrt does the opposite. The moment any tool call throws, the agent calls snapshot() again, slices the fresh accessibility tree to 2000 characters, and inlines it into the failed tool's tool_result string. The model's next turn sees a tree that reflects the actual page, picks a new ref, and the scenario keeps moving. No retry loop on stale state.

assrt/src/core/agent.ts

When any tool throws, the agent re-snapshots and inlines the fresh accessibility tree into the next tool_result string, truncated to 2000 characters. That is how stale refs stop being a retry problem.

agent.ts lines 927-937
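The shape of that catch block can be sketched in a few lines. This is an illustrative reconstruction from the description above, not the literal agent.ts source; the helper name and parameters are invented for the sketch, while the 2000-character limit and the message format follow the text.

```typescript
// Sketch of the fresh-tree-on-throw pattern. Names are illustrative;
// only the truncation limit and message shape come from the article.
const TREE_LIMIT = 2000;

async function runToolWithFreshTree(
  toolName: string,
  exec: () => Promise<string>,       // the tool call itself
  snapshot: () => Promise<string>,   // re-reads the accessibility tree
): Promise<string> {
  try {
    return await exec();
  } catch (err) {
    const msg = err instanceof Error ? err.message : String(err);
    // Re-snapshot so the model's next turn sees the page as it is now.
    const tree = (await snapshot()).slice(0, TREE_LIMIT);
    return (
      `Error: ${msg}\n\nThe action "${toolName}" failed. ` +
      `Current page accessibility tree:\n${tree}\n\n` +
      `Please call snapshot and try a different approach.`
    );
  }
}
```

The key property: the error path returns a tool_result string instead of rethrowing, so the model always gets a next turn with fresh state.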

Primitive 2: the MutationObserver wait, not a sleep

Every modern web app lies about "loaded." Network idle fires while a streaming LLM response is still painting. document.readyState is useless for a React app that has already mounted the shell. Fixed wait(2000) calls either cut off a response mid-token or burn eight seconds on a fast page. Assrt's wait_for_stable tool installs a real MutationObserver on document.body, polls the mutation counter every 500 ms, and unblocks only after the configured stable window. The cleanup at the end disconnects the observer and deletes the globals so nothing leaks between steps.

assrt/src/core/agent.ts

Why this specific observer scope

The observe call uses childList: true, subtree: true, characterData: true. That covers node insertions, node removals, and text edits anywhere under document.body, which is what you care about for a signup flow, a streaming chat panel, or a dashboard populating with data. Attribute mutations are excluded on purpose, because libraries like Framer Motion toggle inline styles on a 60fps timer and would never let the counter plateau. The whole thing is about fifty lines. Every choice in it is visible and editable.
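A minimal sketch of both halves follows: the observer script that runs inside the page, and the agent-side stability check over successive counter readings. The observer options and the window.__assrt_mutations global come from the article; the helper function and observer variable name are illustrative, not the actual agent.ts code.

```typescript
// Page-side script (illustrative): count mutations under document.body.
const OBSERVER_SCRIPT = `
  window.__assrt_mutations = 0;
  window.__assrt_observer = new MutationObserver(() => {
    window.__assrt_mutations++;
  });
  window.__assrt_observer.observe(document.body, {
    childList: true,      // node insertions and removals
    subtree: true,        // anywhere under document.body
    characterData: true,  // text edits (streaming responses)
    // attributes intentionally excluded: animation libraries toggle
    // inline styles every frame, so the counter would never plateau
  });
`;

// Agent-side pure helper: given counter readings sampled every pollMs,
// how long has the counter been flat at the tail?
function stableFor(readings: number[], pollMs: number): number {
  let flatMs = 0;
  for (let i = readings.length - 1; i > 0; i--) {
    if (readings[i] !== readings[i - 1]) break;
    flatMs += pollMs;
  }
  return flatMs;
}
```

The loop then reduces to: sample window.__assrt_mutations every 500 ms and unblock once stableFor(samples, 500) reaches stable_seconds * 1000, after which cleanup disconnects the observer and deletes both globals.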

Primitive 3: the retry-vs-fatal matrix

When Anthropic or Google return a 529, a 429, or a 503, the correct response is a backoff. When the API returns invalid_request because a tool_use block got separated from its tool_result, retries will loop forever. The two cases look similar in a crash trace and very different in behavior. Assrt splits them with two explicit regexes.

assrt/src/core/agent.ts
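The matrix can be sketched as a small pure function. The two regexes, the (attempt + 1) * 5s backoff, and the four-attempt cap come from the article; the function and type names are illustrative, not the actual agent.ts internals.

```typescript
// Retryable: provider load problems that backoff can fix.
const RETRYABLE = /529|429|503|overloaded|rate/i;
// Fatal: message-shape problems that retries would loop on forever.
const FATAL = /tool_use|tool_result|invalid_request/i;

type Verdict =
  | { kind: "retry"; delayMs: number } // sleep, then try again
  | { kind: "fatal" }                  // end the scenario cleanly
  | { kind: "rethrow" };               // unknown error: bubble up

function classifyApiError(
  message: string,
  attempt: number,        // zero-based attempt counter
  maxAttempts = 4,
): Verdict {
  if (FATAL.test(message)) return { kind: "fatal" };
  if (RETRYABLE.test(message) && attempt + 1 < maxAttempts) {
    return { kind: "retry", delayMs: (attempt + 1) * 5000 }; // 5s, 10s, 15s
  }
  return { kind: "rethrow" };
}
```

Checking the fatal regex first matters: an invalid_request error that happens to mention a rate limit must still end the scenario rather than loop.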

Primitive 4: the DataTransfer paste recipe

A six-box OTP field is the hardest shape in the entire consumer web. <input maxlength="1" /> repeated six times, with per-field focus handoff and an autosubmit that fires on the last digit. Per-field typing breaks in a hundred ways: the agent mistypes one digit, the focus jumps past a field, the autosubmit fires while the fifth box is still empty. Assrt's system prompt pins one specific JavaScript expression. The agent is told, verbatim, not to modify it. Every OTP scenario goes through this one path.

assrt/src/core/agent.ts
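A browser-side sketch of that recipe, under the assumptions the article states: the recipe Assrt actually pins lives verbatim in the system prompt, so the function below is an illustrative reconstruction, and pasteOtp is meant to run inside the page, where document, DataTransfer, and ClipboardEvent exist as globals.

```typescript
// Illustrative sketch of the single-paste OTP recipe (not the pinned
// system-prompt expression). Browser globals are accessed untyped so the
// sketch compiles outside a DOM environment.
function pasteOtp(code: string): void {
  const g = globalThis as any;
  // Find one single-character box and target its parent, so one paste
  // event reaches the whole group at once.
  const box = g.document.querySelector('input[maxlength="1"]');
  const dt = new g.DataTransfer();
  dt.setData("text/plain", code);
  box.parentElement.dispatchEvent(
    new g.ClipboardEvent("paste", { clipboardData: dt, bubbles: true }),
  );
}

// Agent-side guard before dispatching: a six-box group expects exactly
// six digits, nothing else.
function isSixDigitCode(code: string): boolean {
  return /^\d{6}$/.test(code);
}
```

One tool call, one event, six fields; the per-digit focus-tracking problem never arises.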

Primitive 5: the sliding window that preserves tool_use adjacency

Scenarios in the real world are long. Twenty steps with a screenshot after each visual action turn the message array into the single biggest source of tokens in the run. The obvious fix is to slice off the oldest messages once the array gets big. The less-obvious trap is that this can split an assistant message containing a tool_use block from the user message containing its matching tool_result, which the API will reject with invalid_request, immediately hitting primitive 3's fatal branch and ending the scenario. Assrt's slicer walks forward from the desired cut index until it finds an assistant or model boundary, then cuts there. Every tool_use stays adjacent to its tool_result.

assrt/src/core/agent.ts
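The boundary walk can be sketched as follows. This is the shape of the technique as described above, not the literal agent.ts:976 implementation; the function name and message type are illustrative.

```typescript
// "assistant" is Anthropic's role name, "model" is Gemini's; either one
// opens a new turn, so cutting there never severs a tool_use from the
// tool_result that follows it in the next user message.
type Role = "user" | "assistant" | "model";
type Msg = { role: Role; content: string };

function slideWindow(messages: Msg[], keepRecent: number): Msg[] {
  if (messages.length <= keepRecent + 1) return messages;
  let cut = messages.length - keepRecent;
  // Walk forward until the cut lands on the start of a turn.
  while (
    cut < messages.length &&
    messages[cut].role !== "assistant" &&
    messages[cut].role !== "model"
  ) {
    cut++;
  }
  // Always keep the first user message: it carries the scenario plan.
  return [messages[0], ...messages.slice(cut)];
}
```

Note the trade: walking forward drops slightly more history than a naive slice would, in exchange for never emitting a message array the API will reject.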

How the five primitives hand off mid-scenario

None of the primitives live in isolation. A real run chains them: a click throws, primitive 1 fires, the next turn needs to wait for React to finish re-rendering so primitive 2 fires, the model provider briefly overloads so primitive 3 fires, the OTP comes in and primitive 4 fires, the message history gets long enough that primitive 5 fires. The sequence below is what one full handoff looks like with five actors: the scenario plan, the agent loop, the browser, the model API, and the disposable inbox.

One recovery cycle across five primitives

Plan → Agent: click "Sign up"
Agent → Browser: browser_click ref=e91
Browser → Agent: throw: ref stale
Agent → Browser: snapshot() → tree
Browser → Agent: fresh tree (2000 chars)
Agent → Model: tool_result w/ tree
Model → Agent: click ref=e114
Agent → Browser: wait_for_stable
Browser → Agent: mutations quiet 2s
Agent → Inbox: create_temp_email
Inbox → Agent: address + token
Agent → Model: next turn
Model → Agent: 529 overloaded
Agent → Model: retry after 5s
Agent → Browser: DataTransfer paste

A real run through the five primitives

The best way to see why this pattern matters is to watch one scenario trip every failure class in a single pass. The log below is the kind of output Assrt emits to stdout when you run npx assrt run against a localhost dev server. Every entry corresponds to a real line of console.log in the file you can read.

scenario.md run

Five failure classes, five primitives

Each primitive has a specific trigger, a specific response, and a specific cleanup. Read the walkthrough below one class at a time; every step names the file and line that does the work.


Failure class 1: stale ref after a re-render

The agent picked ref=e91 from a snapshot two steps ago. Between then and now, React re-rendered and e91 is gone. browser_click throws. The catch at agent.ts:927 calls snapshot() and inlines the fresh tree, truncated to 2000 chars, into the tool_result string under the prefix "Current page accessibility tree:". The model's very next turn sees the new refs and picks e114.


Failure class 2: DOM still mutating

The agent clicked submit and needs to assert on the dashboard. A fixed sleep either cuts off a streaming response or burns eight seconds on a page that finished in one. wait_for_stable installs a MutationObserver, polls window.__assrt_mutations every 500ms, and unblocks only after stable_seconds of zero delta. Cleanup disconnects the observer and deletes the globals. Adaptive, not guessed.


Failure class 3: the model provider overloads

Anthropic returns 529 during a long run. Most agents crash; Assrt's retry loop regex-matches /529|429|503|overloaded|rate/ against the error message, sleeps (attempt + 1) * 5 seconds, and retries up to 4 times. A disjoint regex for invalid_request marks the error as fatal and ends the scenario cleanly so the pass/fail verdict stays honest.


Failure class 4: multi-field OTP inputs

A six-box OTP field (input[maxlength="1"] x 6) is where per-field typing falls apart: wrong focus, wrong digit count, autosubmit fires too early. The system prompt at agent.ts:234 pins one DataTransfer recipe. The agent dispatches a single paste event on the parent element; browsers fan the digits out across the six inputs in one tick. No per-field typing, no off-by-one.


Failure class 5: the message history outgrows the context window

A 40-step scenario plus screenshots at every visual action can blow through the context. Assrt's sliding window keeps the first user message plus the most recent turns, but walks forward from the cut point to an assistant/model boundary before slicing. This preserves every tool_use / tool_result adjacency, which is the one thing the API will reject if you get wrong.

Assrt vs. the common advice

Most of what shows up when you look up this topic is generic prompt hygiene. Here is the diff between that and a reliability stack built from named primitives.

Response to a stale element reference
  • Typical advice: Retry the same ref or reprompt with a generic "try again"
  • Assrt: Immediate this.browser.snapshot() and inline the fresh tree into the next tool_result (agent.ts:932)

Waiting for dynamic DOM to finish
  • Typical advice: Fixed sleeps, network-idle heuristics, or long timeouts
  • Assrt: Injected MutationObserver on document.body + 500ms poll of window.__assrt_mutations until stable_seconds of silence (agent.ts:878)

Handling a model API 529 / 429 mid-scenario
  • Typical advice: Crash the run, or retry indefinitely without distinguishing the error class
  • Assrt: Regex match on /529|429|503|overloaded|rate/ triggers 5s * attempt backoff up to 4 tries; invalid_request is treated as fatal and ends the scenario cleanly (agent.ts:642)

Multi-field OTP inputs (six separate <input maxlength="1">)
  • Typical advice: Type each digit into each field; off-by-one and focus-jump bugs
  • Assrt: System prompt pins one DataTransfer paste recipe; a single ClipboardEvent fills all fields (agent.ts:234)

Long scenarios that exceed the context window
  • Typical advice: Arbitrary truncation that can split a tool_use from its tool_result, which the API then rejects
  • Assrt: Sliding window walks forward to an assistant/model boundary before cutting, never severing a tool_use / tool_result pair (agent.ts:976)

Deterministic re-run of a failed scenario
  • Typical advice: Proprietary YAML on a cloud dashboard, cannot inspect or replay locally
  • Assrt: Plain scenario.md on your disk; pointer at localhost; the entire agent loop is 1,087 lines in one file you can step through with node --inspect

Why this story outlives a model upgrade

Every six months a new model generation arrives and half the advice on this topic goes out of date. Prompt patterns change. Tool-call shapes change. Temperature defaults change. A reliability story tied to the model is a story you rewrite every quarter. The five primitives on this page are different. Fresh-tree injection is a browser-side pattern; it runs the same on Claude Haiku 4.5 and on Gemini 3.1. The MutationObserver recipe is DOM-level. The retry matrix regexes against HTTP error strings. The DataTransfer recipe is browser-API. The sliding window walks roles, not content. None of them change when you change models. This is why an open, named, file-and-line-numbered reliability stack is the right level of abstraction for AI agent browser automation; you write it once.

Want to see the five primitives fire on your scenario?

Bring a real flow you are having trouble automating. We will walk it through the stack on a call and show you which primitive handles which failure in your run.

Book a call

Frequently asked questions

What does "reliability" actually mean for an AI agent driving a browser?

It is the probability that the agent finishes a real end-to-end scenario without human intervention, given that the page will re-render, the DOM will mutate, the model API will occasionally 529, OTP inputs will be split across six fields, and at least one tool call will throw. A reliable agent does not just get lucky on the happy path; it has a named response for each of those failure classes. Assrt ships five: fresh-tree injection on tool throw (agent.ts:932), a MutationObserver stability primitive (agent.ts:878), a retry-vs-fatal matrix keyed on error-message regex (agent.ts:642), a DataTransfer paste recipe pinned in the system prompt (agent.ts:234), and a sliding-window message pruning rule that preserves tool_use / tool_result adjacency (agent.ts:976). Reliability is the sum of those five, not the quality of one prompt.

How is the stale-ref problem actually handled when a tool throws?

The switch statement that dispatches every tool call is wrapped in a try/catch. On throw (any cause: the ref is gone, the element moved, the selector ambiguous), the catch at agent.ts:927-937 calls this.browser.snapshot() one more time, slices the fresh accessibility tree to 2000 chars, and formats a single string: "Error: {msg}\n\nThe action \"{toolCall.name}\" failed. Current page accessibility tree:\n{tree}\n\nPlease call snapshot and try a different approach." That string becomes the tool_result the model sees on the next turn. In practice the model then re-picks a ref from the fresh tree instead of retrying the stale one. No explicit retry counter, no hand-tuned selector healing; the recovery mechanism is: tell the model exactly what is on screen right now and let it re-plan.

Why use a MutationObserver instead of Playwright's built-in wait_for_load_state?

Because network idle and load events tell you the network is quiet, not that the page is visually ready. A streaming LLM response, an optimistic UI that swaps once the real data arrives, or a virtualized list that keeps mounting rows can all complete their first render long before the visible content settles. wait_for_stable (agent.ts:872-925) sidesteps this by measuring the one thing that actually matters: mutations against document.body. It injects the observer, polls window.__assrt_mutations every 500ms, and only unblocks after stable_seconds (default 2) of true silence. The observer is disconnected and the globals are deleted when done, so a scenario that calls wait_for_stable fifty times does not leak state. This is the single biggest reason Assrt's AI agent browser automation reliability holds up against modern SPAs; fixed sleeps and network-idle both lie.

What happens if Anthropic returns 529 Overloaded in the middle of a scenario?

At agent.ts:642-660 there is a two-regex matrix around the provider call. /529|429|503|overloaded|rate/i means retryable: the agent sleeps (attempt + 1) * 5000 ms (so 5s, 10s, 15s across four total attempts) and emits a reasoning event so the run log shows the backoff. /tool_use|tool_result|invalid_request/i means fatal: these errors are almost always a message-shape problem that a retry cannot fix, so the scenario ends cleanly with scenarioPassed = false and summary = "API error: ...". Anything else throws back up to the outer try/catch which logs and moves to the next scenario. The practical effect: your CI job does not hang for 30 minutes waiting on an overloaded API, and it does not flip a real functional failure into a "flaky retry" ticket either.

What is the DataTransfer paste recipe and why is it pinned in the system prompt?

A typical six-box OTP input (six separate <input maxlength="1">) is one of the worst shapes for an AI agent to type into because per-field typing requires tracking focus, digit index, and autosubmit timing, all of which drift by one if the scenario runs against a slightly slower page. agent.ts:234-237 pins the exact recipe in the system prompt: find an input with maxlength="1", take its parentElement, construct a DataTransfer with the full code, and dispatch a single ClipboardEvent of type "paste" on the parent. Browsers fan the digits across all six inputs in one event. The prompt explicitly tells the model not to modify the expression. The result is that OTP handling is deterministic: one tool call, one event, six fields filled.

The agent runs tens of steps. How does it avoid outgrowing the context window?

Every turn appends both the assistant's response and the user-side tool_result block. In a 40-step scenario with screenshots after each visual action, that grows fast. The sliding window at agent.ts:976-996 keeps the first user message (which contains the scenario plan) and the most recent turns. The trick is the cut point: naively slicing by index can split an assistant block that contains a tool_use from the user block that contains its tool_result, which fails the next API call with invalid_request. The code walks forward from the initial cut index until it reaches an assistant or model message (the start of a new turn), and cuts there. No tool_use is ever orphaned. Long scenarios stay stable without tripping the retry-vs-fatal matrix.

Is any of this specific to Claude, or does the same reliability story work with Gemini?

The five recovery primitives are provider-agnostic because they operate below the model layer. Fresh-tree-on-throw, MutationObserver stability, and the DataTransfer paste recipe are browser-side. The retry matrix regexes on error strings, and 529 / 429 / 503 are standard HTTP so the same matcher fires for Gemini's rate limits. The sliding-window rule cuts at assistant/model boundaries (model is the Gemini role name, assistant is the Anthropic name), so both shapes work. What does change per provider is the tool-schema translation: the same TOOLS array is remapped into Gemini function declarations at agent.ts:278-301. The rest of the reliability stack, the part that makes AI agent browser automation actually survive in CI, is the same code path regardless of which API you point it at.
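The tool-schema translation is the one provider-specific seam, and its shape is simple. A sketch under the stated assumptions: Anthropic-style tool definitions carry an input_schema field while Gemini function declarations call the same JSON schema parameters; the function name below is illustrative and the actual remapping at agent.ts:278-301 may differ in detail.

```typescript
// One Anthropic-style tool definition.
type AnthropicTool = {
  name: string;
  description: string;
  input_schema: Record<string, unknown>; // JSON schema for the arguments
};

// Remap into the Gemini function-declaration shape: same name and
// description, schema renamed to "parameters".
function toGeminiDeclaration(tool: AnthropicTool) {
  return {
    name: tool.name,
    description: tool.description,
    parameters: tool.input_schema,
  };
}
```

Everything else in the stack stays shared; only this thin adapter runs per provider.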

Can I see a single scenario exercise all five primitives?

Yes. A signup flow against localhost that uses a disposable inbox does it in one pass. The agent navigates, snapshot returns a tree, and the first click either lands clean or throws (primitive 1). Form submission triggers React re-renders that wait_for_stable handles (primitive 2). The LLM is doing serious planning and may occasionally 529 (primitive 3). The verification code lands in a six-input OTP box (primitive 4). If it is a long multi-case run with screenshots, the message array eventually hits the sliding-window cut (primitive 5). The terminal snippet on this page shows exactly this flow; every line is the kind of log Assrt emits. The whole thing runs on your machine, against your localhost, with your scenario.md on your disk. Nothing is mocked, nothing is cloud-tied.

Why is this a better story than "our agent uses a smarter model"?

Because a smarter model does not change the failure classes; it just reshuffles which ones it hits more often. Stale refs come from the page, not the model. DOM churn is the page. 529s are the provider's rate-limit behavior, not the model's IQ. Six-field OTPs are a design choice by whatever site you are testing against. Message-history corruption is a protocol constraint. A reliability story that hinges on a model upgrade evaporates the moment the next model version ships with a different prompt style or tool-call shape. A reliability story built from named, versioned, file-and-line-numbered primitives lives longer than any single model generation. You can literally git blame every one of Assrt's.

Is the whole reliability stack actually open source?

Yes. The file referenced throughout this page (/Users/matthewdi/assrt/src/core/agent.ts, mirrored at /Users/matthewdi/assrt-mcp/src/core/agent.ts) is plain TypeScript, checked into the repo. Every line number cited is an actual line number you can open and read. There is no closed core, no proprietary backend that owns "the reliability logic." The optional hosted app at app.assrt.ai stores run artifacts for sharing; the test engine itself does not depend on it and can run the whole stack against localhost forever. Compare to closed-SaaS agent testing tools that charge $7,500 a month and lock your generated tests inside their dashboard; here, the tests are markdown on your disk and the engine is yours to fork.
