Automated QA automation: the four babysitting loops Assrt removes in code you can read
Every guide at the top of the search results for this phrase uses it to mean "tests that run on a schedule." That is not automation. Automation is when the run does not need a human in the loop. Four specific loops still need a human in almost every Playwright suite, and all four are fixable with code you can open. This page points at each one by file and line number inside /Users/matthewdi/assrt-mcp/src/core/.
What every top result gets wrong about the phrase
Search this keyword and the first ten results are essentially the same article. BrowserStack, Katalon, Wikitechy, TestingXperts, AutomatePro, TestDevLab. They describe "automated QA automation" as running tests on a schedule, wiring Selenium into Jenkins, or asking an LLM to scaffold a .spec.ts from a user story. Those are real things. They are also the beginning of the loop, not the end.
The interesting question is what happens at 03:47 on Saturday when the nightly run hits a streaming chat response, a stale ref after a modal animation, a Stripe checkout OTP, or a product page whose DOM is 340,000 characters long. Those four moments are when a human normally gets paged. Automated QA automation means none of them pages anyone. The rest of this page shows, with file and line numbers, how Assrt handles each one.
Loop 1: fixed-time waits
The classic sleep(3000) guarded by a comment that says "this should be enough." It is not. wait_for_stable replaces it with a MutationObserver quiet-period algorithm at agent.ts lines 906-959. 500ms poll, 2s default quiet period, 30s default timeout. All three bounds are configurable per call, capped by Math.min so no one sets a 5-minute quiet period by accident.
Loop 2: stale selectors
A normal retry picks the same stale ref and fails again. Assrt catches every tool error at agent.ts:962 and splices a fresh 2000-char accessibility snapshot into the LLM's next turn. The model re-locates the element from the up-to-date DOM instead of re-firing the old ref. You never write a retry; recovery is the default.
Loop 3: OTP handoffs
create_temp_email, wait_for_verification_code, and check_email_inbox are tools on the agent (agent.ts:115-130). Disposable mailbox via temp-mail.io, 10-char prefix per run, 7 regex patterns in priority order for code extraction. No SMTP stub, no fixture pool, no manual copy-paste from your inbox.
Loop 4: snapshot overflow
Enterprise apps return accessibility trees in the multi-hundred-kilobyte range. The constant SNAPSHOT_MAX_CHARS at browser.ts:500 is 120,000 (roughly 30,000 tokens). Anything above that is sliced at a clean line break with a literal truncation footer. The agent still sees valid refs in the first section and can scroll to re-snapshot the rest. The context window does not explode.
Where each loop sits in the pipeline
One agent in the middle. The four loops on the right are not optional features you turn on; they are the path every tool call is already walking. You do not import them or configure them. You write a plan in English, and the agent runs that plan through them.
scenario.md → Assrt agent → four unattended mechanisms
Loop 1. wait_for_stable: MutationObserver, not sleep
agent.ts:906-959

The most common cause of flaky tests is the wait. A hardcoded sleep is wrong on every page that has a streaming response, a React Query refetch, or a progressively rendered list. The algorithm Assrt actually uses is a three-step dance: install a MutationObserver, poll a counter, and return when the counter stops changing for a configurable quiet period.
Read the bounds. The quiet period defaults to 2 seconds and caps at 10. The timeout defaults to 30 seconds and caps at 60. Math.min on both lines is how the tool stays honest: nobody can set a 5-minute quiet period by mistake, and nobody can leave a test hanging for an hour.
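The clamp-then-poll loop can be sketched in TypeScript. Everything here is illustrative naming, not Assrt's actual exports: `clampBounds`, `waitForStable`, and the counter-reading callback are stand-ins for the real implementation at agent.ts:906-959.

```typescript
// Hedged sketch of the quiet-period wait. In the real tool the counter
// lives at window.__assrt_mutations, fed by an injected MutationObserver;
// here it is abstracted as a callback so the loop is testable on its own.
type MutationCounter = () => number;

const POLL_MS = 500;

// Math.min caps keep callers honest, as the text describes:
// quiet period defaults to 2s (cap 10s), timeout defaults to 30s (cap 60s).
function clampBounds(timeoutSeconds = 30, stableSeconds = 2) {
  return {
    timeoutMs: Math.min(timeoutSeconds, 60) * 1000,
    stableMs: Math.min(stableSeconds, 10) * 1000,
  };
}

// Resolves true once the counter has been silent for the quiet period,
// false if the overall timeout elapses first.
async function waitForStable(
  readCounter: MutationCounter,
  timeoutSeconds?: number,
  stableSeconds?: number,
): Promise<boolean> {
  const { timeoutMs, stableMs } = clampBounds(timeoutSeconds, stableSeconds);
  const deadline = Date.now() + timeoutMs;
  let last = readCounter();
  let quietSince = Date.now();

  while (Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, POLL_MS));
    const current = readCounter();
    if (current !== last) {
      last = current;
      quietSince = Date.now(); // DOM mutated: restart the quiet window
    } else if (Date.now() - quietSince >= stableMs) {
      return true; // counter silent long enough
    }
  }
  return false; // never settled before the timeout
}
```

The same loop handles a 200ms static page and a 12-second streaming answer: the quiet window restarts on every mutation, so the function returns as soon as the page actually stops changing, not when a guessed number runs out.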
Loop 2. Auto fresh-snapshot on every caught error
agent.ts:962-970

A classic retry re-fires the same stale ref and fails again. Assrt's catch block does something different: it re-runs snapshot, slices the first 2000 characters of the fresh accessibility tree, and appends them to the error the LLM will see on its next turn. The model picks the new ref from the fresh tree. There is no retry logic in the plan, and none anywhere in the product; recovery is the default path.
Why only 2000 characters? Because the next turn still has the step's conversation history behind it. 2000 chars is roughly 500 tokens: enough to show the local region around whichever element just failed, small enough that the LLM does not get distracted or waste its turn re-reading the whole page. If the model needs more, it can call snapshot itself; the reminder is literally in the error string ("Please call snapshot and try a different approach.").
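As a sketch, the spliced error string might be assembled like this. `augmentError` is a hypothetical helper name; the string shape follows the wording quoted above, not the exact source at agent.ts:962-970.

```typescript
// Illustrative sketch of the error-augmentation step: cap the fresh
// accessibility tree at 2000 chars (~500 tokens) and splice it into the
// message the LLM reads on its next turn.
const SNAPSHOT_SLICE_CHARS = 2000;

function augmentError(
  original: Error,
  action: string,
  freshSnapshot: string,
): string {
  const slice = freshSnapshot.slice(0, SNAPSHOT_SLICE_CHARS);
  return (
    `Error: ${original.message}. The action ${action} failed. ` +
    `Current page accessibility tree: ${slice}. ` +
    `Please call snapshot and try a different approach.`
  );
}
```

The key property is that the model never sees the stale tree again: whatever the page looks like after the modal animated in is what lands in the next prompt.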
Loop 3. OTP handoffs, without a mail server
agent.ts:115-130, email.ts:101-109

Automation dies the moment a scenario needs a real email. That is why so many nightly suites end at login with a fixture user. Assrt ships three tools that make signup and OTP tractable without an SMTP stub: create_temp_email, wait_for_verification_code, and check_email_inbox. The mailbox is a 10-character prefix on temp-mail.io (email.ts:49). Codes come out of a seven-pattern regex extractor (email.ts:101-109). You write two English lines; the agent does the rest.
Math.min at agent.ts:810 bounds the polling interval for wait_for_verification_code: tight enough to catch fast providers, wide enough not to rate-limit temp-mail.io when a dozen runs poll in parallel.
The extractor tries keyword-anchored matches first (code, verification, OTP, PIN), then falls back to raw digit-length patterns (6, 4, then 8 digits). The pattern that matched is logged, so a false match is a grep away.
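A priority-ordered extractor of that shape can be sketched as below. The seven patterns here are illustrative stand-ins, not the real list at email.ts:101-109; only the ordering principle (keywords before bare digit runs) and the logged pattern name come from the text.

```typescript
// Sketch of a priority-ordered OTP extractor: keyword-anchored patterns
// first, raw digit-length fallbacks last. First match wins.
const CODE_PATTERNS: Array<{ name: string; re: RegExp }> = [
  { name: "keyword-code", re: /\bcode\b[^0-9]{0,20}(\d{4,8})/i },
  { name: "keyword-verification", re: /verification[^0-9]{0,20}(\d{4,8})/i },
  { name: "keyword-otp", re: /\bOTP\b[^0-9]{0,20}(\d{4,8})/i },
  { name: "keyword-pin", re: /\bPIN\b[^0-9]{0,20}(\d{4,8})/i },
  { name: "bare-6-digit", re: /\b(\d{6})\b/ },
  { name: "bare-4-digit", re: /\b(\d{4})\b/ },
  { name: "bare-8-digit", re: /\b(\d{8})\b/ },
];

function extractCode(body: string): { code: string; pattern: string } | null {
  for (const { name, re } of CODE_PATTERNS) {
    const match = body.match(re);
    // The matched pattern name is what gets logged, so a false positive
    // is a grep away instead of a mystery.
    if (match) return { code: match[1], pattern: name };
  }
  return null;
}
```

Ordering matters: "Your order #482913 shipped" should never win over "Your verification code is 482913" in the same inbox, which is exactly what keyword anchoring buys before the bare fallbacks run.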
This loop has its own dedicated tutorial at /t/qa-automation-tutorial if you want the full plan file and every regex spelled out.
Loop 4. SNAPSHOT_MAX_CHARS and the context-window trap
browser.ts:500, browser.ts:358

Accessibility trees on real SaaS apps are bigger than people assume. A virtualized list with 1,000 rows, a dense marketing page with every hero variant rendered, a dashboard with nested tab panels: all of these regularly produce 300k+ characters. Hand any of them to an LLM as a single prompt and the context window explodes before the first click.
Two details matter. First, the slice happens at the nearest newline so the truncated region ends on a clean element, not mid-token. Second, the footer is literal text the model sees: it knows the tree was cut and it knows to scroll plus re-snapshot if the target is below the cut.
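Both details can be sketched in a few lines. The function name is hypothetical and the footer wording is approximated from the description above; the real code lives at browser.ts:500.

```typescript
// Hedged sketch of newline-aligned truncation with a visible footer.
const SNAPSHOT_MAX_CHARS = 120_000;

function truncateSnapshot(tree: string, maxChars = SNAPSHOT_MAX_CHARS): string {
  if (tree.length <= maxChars) return tree;
  // Detail one: cut at the last newline before the ceiling, so the
  // truncated region ends on a clean element line, never mid-token.
  const cut = tree.lastIndexOf("\n", maxChars);
  const head = tree.slice(0, cut > 0 ? cut : maxChars);
  // Detail two: the footer is literal text the model sees, telling it
  // the tree was cut and that scrolling + re-snapshot reaches the rest.
  const footer =
    `\n[Snapshot truncated: showing ${Math.round(maxChars / 1000)}k of ` +
    `${Math.round(tree.length / 1000)}k chars. ` +
    `Use element refs visible above to interact.]`;
  return head + footer;
}
```

Because every ref above the cut line is intact, the agent can still act on the visible region immediately; only targets below the cut cost an extra scroll-and-snapshot turn.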
The sibling detail at browser.ts:358 is TOOL_TIMEOUT_MS = 120,000. The MCP SDK default is 60s. That is fine for a fast localhost run; it is insufficient for a cold-start enterprise SPA. The 120s ceiling is the survival margin, and the transport reset at line 392 means a transient timeout does not end the run.
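One common way to get that behavior is a Promise.race around every tool invocation. This sketch assumes names (`callWithTimeout`, `ToolTimeoutError`, the `onTimeout` hook) that are not in the source; only the 120s constant and the mark-dead-then-reconnect behavior are described in the text.

```typescript
// Hedged sketch of a per-call timeout that survives a transient hang.
const TOOL_TIMEOUT_MS = 120_000;

class ToolTimeoutError extends Error {}

async function callWithTimeout<T>(
  call: () => Promise<T>,
  onTimeout: () => void, // e.g. mark the MCP client dead so the next call reconnects
  timeoutMs = TOOL_TIMEOUT_MS,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new ToolTimeoutError("tool call timed out")),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([call(), timeout]);
  } catch (err) {
    // A timeout resets the transport instead of killing the run;
    // any other error propagates untouched.
    if (err instanceof ToolTimeoutError) onTimeout();
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```

The point of the hook is that the failure is scoped to one tool call: step 34 times out, the client reconnects, and step 35 runs against a fresh transport.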
What all four loops look like in one run
One real scenario: delete a workspace. Step 3 fails because ref e22 went stale when a modal animated in. The catch block at agent.ts:962 takes a fresh snapshot. Step 5 clicks the correct new ref. Step 6 uses wait_for_stable to ride out a backend refetch instead of guessing at a sleep. Ten tool calls, one auto-recovery, nine seconds, no human.
How to move your suite toward this
You do not need to migrate. These are five plan-level changes you can make today, one by one. None of them require writing code; all of them are English you type into scenario.md.
Take the sleep out
Replace every fixed-time wait in your plan with wait_for_stable. Your English step is literally "wait for the page to settle." The tool handles the rest: MutationObserver installed, 500ms poll, 2 seconds of DOM silence. A streaming chat that took 12 seconds yesterday and 3 seconds today settles cleanly in both cases, with no hard-coded number anywhere.
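In a #Case block, the swap might read like this. The exact plan syntax here is an assumption; only the "wait for the page to settle" step is quoted from the text above.

```markdown
#Case: streaming answer renders fully
- Navigate to /assistant
- Type "summarize the release notes" into the prompt box and press Enter
- Wait for the page to settle
- Assert the answer text is visible and non-empty
```

No sleep, no duration, no guess: the third line is the same English whether the answer streams for 3 seconds or 12.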
Stop writing retry blocks
Delete your try/catch/re-query scaffolding. The agent catches every tool call at agent.ts:962 and hands the fresh accessibility tree to the LLM. If a ref goes stale between snapshot and click, the next turn succeeds with the new ref. You see one red step in the log and one green step after it, and the run keeps going.
Stop standing up mail servers for OTP
Write "call create_temp_email" into the plan. The tool returns a fresh disposable address with a 10-character prefix on temp-mail.io. Write "call wait_for_verification_code" after the submit. The regex extractor at email.ts:101-109 pulls the code out. There is no fourth thing to do.
Trust the 120k ceiling
On any page with a big virtualized list or a long marketing layout, SNAPSHOT_MAX_CHARS stops the accessibility tree at 120,000 characters. The plan does not need to know. The agent keeps running with valid refs from the top of the tree and can scroll to reach the rest.
Run it, attach the run directory on failure
npx assrt-mcp --plan scenario.md is the whole command. Every artifact lands in /tmp/assrt/<runId>/ including video/player.html at 5x default playback. In CI, upload that folder on failure and the next engineer to look at it has everything they need without reproducing the environment.
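In GitHub Actions, the failure-upload step might look like this minimal sketch. The step names, secret names, and artifact name are assumptions; `actions/upload-artifact` and the `if: failure()` condition are standard Actions features.

```yaml
# Hedged sketch of a CI job fragment, not Assrt-shipped config.
- name: Run Assrt plan
  run: npx assrt-mcp --plan scenario.md
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

- name: Upload run directory on failure
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: assrt-run
    path: /tmp/assrt/
```

A red build then carries the video, events.json, and player.html as one downloadable artifact, which is the whole debugging surface.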
The four loops, vendor vs Assrt
Everything in the vendor column is solvable by hand. Plenty of teams ship custom Playwright helpers that do each of these things. The Assrt argument is not that the loops are uniquely hard; it is that they are uniquely worth not maintaining yourself.
| Feature | Typical vendor / hand-rolled Playwright | Assrt |
|---|---|---|
| Waiting for async content | sleep(3000) guessed by a developer | MutationObserver quiet period, 500ms poll, 2s default silence (agent.ts:906-959) |
| Stale ref recovery | Manual retry block around every flaky step | Fresh snapshot spliced into LLM prompt on every caught error (agent.ts:962-970) |
| Per-call tool timeout | Whatever the SDK default is (often 60s) | TOOL_TIMEOUT_MS = 120_000 with transport reset on timeout (browser.ts:358, 392) |
| Accessibility tree overflow | Context window blows up, run dies | SNAPSHOT_MAX_CHARS = 120_000, truncated at nearest newline, refs preserved (browser.ts:500) |
| OTP / email verification | Mock SMTP, Gmail IMAP, vendor add-on | Three tools on the agent, temp-mail.io under the hood, 7 regex patterns (agent.ts:115-130, email.ts:101-109) |
| Plan format | Proprietary YAML or dashboard recorder | scenario.md #Case blocks, plain English, lives in your repo |
| Debug surface when it fails | Vendor dashboard behind a login | /tmp/assrt/<runId>/ with video, events.json, scenario.md, player.html at 5x |
| Monthly cost at team scale | $7,500/mo typical (Testim, mabl, QA Wolf) | $0 + LLM tokens, open source |
Want to see the four loops running on your app?
A 20-minute call where we point Assrt at your staging URL, run one #Case block, and walk through the MutationObserver, auto-snapshot, and snapshot-ceiling paths as they fire in real time.
Book a call →

FAQ: automated QA automation, in concrete terms
What does 'automated QA automation' actually mean?
The phrase is redundant on the surface, which is why most top results hand-wave it. Taken seriously, it means automating the automation: the parts of a browser-automation run that a human still has to babysit. Those parts are not the test steps themselves (any framework can click a button). They are the four failure modes that force a human back in the loop: the test sleeps too long and gets killed, a retry picks a stale selector, an OTP arrives in a real inbox, or the accessibility tree is too large for the model. Assrt removes each one in code you can read: wait_for_stable at agent.ts lines 906-959, auto fresh-snapshot at agent.ts lines 962-970, the three email tools between agent.ts lines 115 and 130, and SNAPSHOT_MAX_CHARS = 120_000 at browser.ts line 500.
Why does wait_for_stable use a MutationObserver instead of a fixed timeout?
A fixed timeout is a guess at how long your content takes to load. The guess is wrong on every page that has a streaming response, a React Query refetch, or a progressively-rendered list. wait_for_stable injects `window.__assrt_observer = new MutationObserver(...)` at agent.ts line 915, sets it to watch the entire body with childList, subtree, and characterData turned on, and then polls a counter at window.__assrt_mutations every 500ms. When the counter stops incrementing for the configured quiet period (default 2 seconds, capped at 10), the function returns. This adapts automatically to both a 200ms static page and a 12-second streaming answer. The default timeout is 30s, capped at 60s, so the run never hangs indefinitely.
What happens when an individual tool call fails mid-run?
The catch block at agent.ts lines 962-970 runs. It takes a fresh accessibility snapshot of the current page, slices the first 2000 characters, and appends them to the error message that goes back to the LLM: 'Error: <original message>. The action X failed. Current page accessibility tree: <tree>. Please call snapshot and try a different approach.' The model now sees the up-to-date DOM structure without you writing a single line of retry logic. This is the single mechanism that turns most transient selector failures into a one-step recovery and the reason the run continues unattended through modals, navigation, and late-loading elements.
Why is the snapshot truncated at 120,000 characters?
Accessibility trees can grow into the millions of characters on apps with long virtualized lists or dense marketing pages. browser.ts line 500 defines SNAPSHOT_MAX_CHARS = 120_000, which is roughly 30,000 tokens with Claude's tokenizer. When a snapshot exceeds that, resolveAndTruncate at browser.ts line 506 slices to the nearest newline and appends the literal footer `[Snapshot truncated: showing 120k of <original>k chars. Use element refs visible above to interact.]`. The LLM still sees the first section of the tree with valid element refs, so it can scroll and re-snapshot the region it needs instead of dying on a context-window overflow. Without this ceiling, the run falls apart on any real e-commerce listing page.
What is TOOL_TIMEOUT_MS and why is it 120,000 not 60,000?
browser.ts line 358 declares `private static readonly TOOL_TIMEOUT_MS = 120_000;` and passes it as the per-call timeout when the agent invokes any underlying Playwright MCP tool (line 381). The Model Context Protocol SDK defaults to 60s. Real navigations on enterprise apps, cold starts, and heavy dashboards routinely exceed that. The 120s ceiling is a survival margin. If a tool still times out, the agent doesn't hard crash: it marks the MCP client as dead (line 392), clears the transport, and the next call reconnects. This is why a network blip on step 34 of 60 does not end your run.
Does automated QA automation require generating test code?
No, and the Assrt design argues that generating code is a side-effect, not the goal. A #Case block in scenario.md is English: 'Navigate to /settings, click Danger Zone, confirm the dialog, assert the row is gone.' The agent translates that into a chain of real Playwright MCP calls (the 18 tools registered at agent.ts lines 16-196). If you need the code out, you can log the chain. If you only need the run to pass, you never look at it. That is the actual 2026 change: the code becomes a byproduct of a reliable run, not a deliverable you maintain.
How does this work in CI if the agent needs an LLM key?
Same as any other test: pass ANTHROPIC_API_KEY or GEMINI_API_KEY as an environment variable. The default model is claude-haiku-4-5-20251001 (agent.ts line 9) which is fast and cheap enough to run on every commit. Playwright MCP launches Chromium headless by default. The run directory at /tmp/assrt/<runId>/ contains the video, events.json, and every tool-call snapshot, so a failing GitHub Actions job uploads one artifact and anyone on the team can replay the run locally with player.html.
How is the MutationObserver cleaned up after wait_for_stable returns?
Explicitly: agent.ts lines 940-944 run a second evaluate call: `if (window.__assrt_observer) { window.__assrt_observer.disconnect(); } delete window.__assrt_mutations; delete window.__assrt_observer;`. This matters on long runs because the observer would otherwise accumulate listeners across every subsequent page. The explicit disconnect and delete keep the window clean for the rest of the scenario; a new wait_for_stable call in the next step gets a fresh observer on the fresh page.
Why should I care about auto-snapshot on failure if my tests are stable?
Stability is a property of the run, not a property of the suite. The same test is stable on Monday and flaky on Thursday when a deployment changes an aria-label. Auto-snapshot on failure is the difference between a flake that ends your CI job and a flake that costs one extra LLM turn. In a 40-step scenario, a single transient failure costs about 2-3 seconds of agent time (one snapshot, one re-read, one fresh click) and the run continues green. Without it, you are paging an engineer to read a trace viewer. This is the core of what 'automated QA automation' should mean in 2026.
Can I tune the quiet period and timeout for wait_for_stable?
Yes. Both parameters are optional on every call. timeout_seconds defaults to 30 and is capped at 60 by Math.min at agent.ts line 907. stable_seconds defaults to 2 and is capped at 10 at line 908. For a streaming chat where tokens arrive over 8-10 seconds, raise the quiet period to 4 or 5 so the observer does not return between bursts. For a static page that should settle in under a second, lower it to 1. The cap prevents anyone from setting a 60-second quiet period that would stretch one run into forever.
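A call with a raised quiet period for a streaming page might carry arguments like this. The parameter names `timeout_seconds` and `stable_seconds` come from the text; the JSON wire shape is an assumption for illustration.

```json
{
  "tool": "wait_for_stable",
  "arguments": {
    "timeout_seconds": 45,
    "stable_seconds": 5
  }
}
```

With these values the run waits up to 45 seconds overall, returning as soon as the DOM has been silent for 5 consecutive seconds, so token bursts 2-3 seconds apart no longer trigger an early return.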
What is in the run directory, and why does it matter for unattended runs?
/tmp/assrt/<runId>/ contains events.json (every tool call with arguments and timing), execution.log (the raw step stream), results.json (pass or fail with evidence), scenario.md (a snapshot of the plan you ran), screenshots/ (one PNG per visible action), and video/ (recording.webm plus player.html at 5x default playback). For an unattended run, this folder is the entire debugger. Tar it, upload it as a GitHub Actions artifact, and a failing build has everything a human needs to diagnose the failure from a different continent without reproducing the environment.
Is this better than building retry-and-wait logic into my own Playwright suite?
You can absolutely build a MutationObserver quiet-period and an auto-snapshot helper into your own Playwright suite. The question is whether you want to own and maintain the helper forever. Every one of these mechanisms is about 30-60 lines of code; together they take an engineer a sprint to build, then ongoing time to keep working as Playwright evolves. Assrt's value is that it is open-source code you can read, not a vendor service you rent, so the choice is between writing it yourself and copying someone else's working version with the commit history intact. Both beat paying a closed-source vendor $7,500 a month for the same primitives.