Automated UI testing flakes because the selector layer is wrong, not because the framework is
Direct answer: automated UI testing is having software drive the same UI a human would (clicks, typing, navigation, assertions) without a human in the loop. For it to be reliable on a modern app, the locators must be re-resolved per step from a fresh accessibility tree, not committed to disk at record time and read at run time against a DOM that has changed. The rest of this page is the argument, walked against the open Assrt source so every claim is checkable.
The agent source referenced here is at github.com/assrt-ai/assrt-mcp. If a line number looks specific, it is because the file is open.
The real failure mode: locators committed at write time, read against a DOM that changed
When a team says "our automated UI tests are flaky", the conversation usually goes to the framework: Playwright is broken, Cypress is broken, Selenium is unreliable. It is almost never the framework. The same engine that flakes on a real product passes its own example app a hundred runs in a row.
What actually fails is the selector layer. A test written six months ago references page.locator(".btn-primary:nth-of-type(2)"). At write time, that was a unique reference to the "Create" button. Today, after a styling refactor that moved the same button into a flex container with a new wrapper, it points at a different element, or no element at all. The test fails. The DOM is "different". The cause of the difference is two lines of unrelated CSS.
The same problem in a different costume: a generator records a click, emits page.getByRole("button", { name: "New" }).click(), and on the next run a streamed-in Suspense boundary inserts a second button labelled "New" above the first. The locator is no longer unique. Playwright resolves "first match". The test clicks the wrong button.
Both failures have the same shape: a snapshot of the DOM at one moment in time was written to disk as a selector, and a different moment in time was used to look it up. The selector layer assumed the DOM would not change in the meantime. That assumption is wrong on every app that ships more than once a week.
Where the selector lives
The locator is written into a .spec.ts and read at run time, regardless of whether the DOM still matches.
- page.locator('.btn-primary:nth-of-type(2)') hardcoded in the test file
- Test depends on the DOM being identical to record time
- A styling refactor or a streamed-in card changes the match
- The fix is to rewrite the locator
Three ways to drive the UI, ranked by what survives a redeploy
Setting aside the testing vocabulary for a moment, there are three mechanical ways to make a browser click the button labelled "Create". Each one trades off setup speed against robustness.
1. Pixel coordinates
The recorder captures x=412, y=271 and the runner sends a mouse event at that point. Fastest to record, breaks on the next viewport change, the next browser zoom, the next CSS tweak that pushes the button down by 16 pixels. Almost nobody runs pixel automation for UI testing anymore, except in the rare case where the rendered UI is a canvas with no accessible tree (game UIs, specific data-viz tools). For HTML applications, pixel-driven tests die at the first redeploy.
2. CSS or XPath selectors
The recorder captures div.sidebar > button.btn-primary:nth-of-type(2) and writes it into the spec. The runner queries the DOM at run time, gets back the element (or doesn't), and clicks. This is the default mode of every popular framework, and it is the mode where most flake comes from. Selectors derived from class names, sibling order, or DOM structure encode incidental facts about the markup that you do not actually intend to guarantee. Every refactor breaks them.
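For concreteness, here are the two styles side by side in a plain Playwright spec. This is an illustrative sketch, not code from any repo discussed on this page; the URL and element names are invented.

```typescript
import { test, expect } from '@playwright/test';

test('open the create dialog', async ({ page }) => {
  await page.goto('https://app.example.com/projects'); // hypothetical app

  // Mode 2: committed at record time. Encodes a class name and sibling
  // order, incidental facts a styling refactor is free to change.
  // await page.locator('div.sidebar > button.btn-primary:nth-of-type(2)').click();

  // Intent-level alternative: role + accessible name, resolved against the
  // live page. It survives the wrapper-div refactor that kills the selector
  // above, though as a static locator it can still collide when a second
  // "Create" button streams in.
  await page.getByRole('button', { name: 'Create' }).click();

  await expect(page.getByRole('dialog', { name: 'New project' })).toBeVisible();
});
```

The commented-out line is the mode most flake comes from. The getByRole line is better, but it is still written once and replayed forever, which is the limitation the third path addresses.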
3. Accessibility tree, re-resolved per step
The runner asks the browser for the live accessibility tree before each action: every interactive element with its role (button, link, textbox), its accessible name ("Create", "Pricing", "email"), and a temporary [ref=eN] id good for this step only. The intent in the test ("click the Create button") is resolved against that snapshot, not against a snapshot taken six months ago. This is the surface a screen reader uses. It is the layer the W3C standardized for accessibility (ARIA roles, accessible names), and it is more stable across refactors than the markup it describes because changing it is a deliberate semantic act, not a side effect of a styling tweak.
The third path is what Assrt does. The tool declaration is at assrt-mcp/src/core/agent.ts:27-29: a snapshot tool described as "Get the accessibility tree of the current page. Returns elements with [ref=eN] references you can use for click/type. ALWAYS call this before interacting with elements." That last sentence is the discipline the whole design enforces.
“ALWAYS call snapshot FIRST to get the accessibility tree with element refs. Use the ref IDs from snapshots (e.g. ref='e5') when clicking or typing. This is faster and more reliable than text matching. After each action, call snapshot again to see the updated page state. If a ref is stale (action fails), call snapshot again to get fresh refs.”
System prompt at agent.ts:207-218, the runner's hard rule
What the snapshot actually returns
Words like "accessibility tree" do real work in this argument, so it is worth looking at one. The snippet below is the shape Assrt's browser_snapshot call returns (via Playwright MCP, wired at browser.ts:589). It is YAML-shaped, indented by parent-child role, and every interactive node has a [ref=eN] id the runner can target.
The refs are not part of the DOM. They are not attributes on real HTML elements. They are an in-memory mapping Playwright MCP maintains for the duration of one snapshot. Two consequences. First, your selectors expire by design: ref "e125" is valid until the next snapshot, then it is gone. Second, there is no temptation (or possibility) to commit them to disk; the only thing the test file holds is intent. "Click the Start free link" gets translated to "click [ref=e125]" on this run, "click [ref=e98]" on the next, "click [ref=e144]" on the run after that. The intent is stable. The refs are temporary by construction.
When a test does fail, the failure mode is informative: the agent sees the snapshot does not contain a node matching the intent, and fails with a description of what it saw instead. Compared to page.locator(...) timing out with no context, that is a step up: you do not have to load up the trace viewer to figure out what was on the page when the assertion fired.
The runner loop: snapshot, resolve, click, snapshot
The agent loop is small enough to sketch in full. For every step in the test plan, the runner asks for a fresh snapshot, finds the node that matches the intent, sends the click or the type through Playwright's standard locator engine, and asks for a fresh snapshot again to see what changed. There is no mode where a previous ref is reused across a navigation.
One step of the runner loop
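A minimal TypeScript sketch of that step. The interface and helper names here are stand-ins, not the types in assrt-mcp; the real dispatch is cited just below.

```typescript
interface SnapshotNode {
  role: string;
  name: string;
  ref?: string;
  children?: SnapshotNode[];
}

interface BrowserLike {
  snapshot(): Promise<SnapshotNode[]>;
  click(element: string, ref?: string): Promise<void>;
}

// Depth-first search of the snapshot for a role + accessible-name match.
function findByRoleAndName(
  nodes: SnapshotNode[],
  role: string,
  name: string,
): SnapshotNode | undefined {
  for (const n of nodes) {
    if (n.role === role && n.name === name) return n;
    const hit = findByRoleAndName(n.children ?? [], role, name);
    if (hit) return hit;
  }
  return undefined;
}

async function runStep(browser: BrowserLike, role: string, name: string): Promise<void> {
  const tree = await browser.snapshot(); // 1. fresh tree; refs valid for this step only
  const node = findByRoleAndName(tree, role, name); // 2. resolve intent against live page
  if (!node?.ref) {
    throw new Error(`No ${role} named "${name}" in the current snapshot`);
  }
  await browser.click(`${role} "${name}"`, node.ref); // 3. act via Playwright's engine
  await browser.snapshot(); // 4. observe the result; the old refs are now dead
}
```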
The dispatch is at agent.ts:778-797: the snapshot case calls this.browser.snapshot(), the click case forwards the human-readable element description and the optional ref to this.browser.click(element, ref), and the browser wrapper sends browser_click through to Playwright MCP at browser.ts:600. There is nothing else in the click path. Nothing intercepts the ref, nothing rewrites the selector. Playwright resolves it the way Playwright always resolves a ref.
What reliable automated UI testing actually requires
Pull this thread to the end and the list of what a working automated UI test suite needs is shorter than most articles make it. The properties below are necessary; the rest (dashboards, scheduled runs, screenshot diffing, parallel sharding) are nice but not load-bearing.
The properties a UI test suite has to have
- Locators resolved from the live page, not committed to disk and read at run time
- Stability wait that watches the DOM (MutationObserver), not the network (waitForLoadState)
- Auth and storage state shared across scenarios so each test does not re-login
- Failures that name what the agent actually saw, not just a selector-timed-out message
- Test plan in your repo, in a format you can grep, diff, and review in a PR
- Runner that returns the same JSON shape on success, failure, and timeout, so CI does not have to special-case outcomes (a sketch follows this list)
- No proprietary file formats you cannot leave; the artifact is the artifact
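Here is one way the uniform-shape property can look in practice. The field names are invented for illustration, not assrt-mcp's actual schema.

```typescript
// Invented result shape: the point is that passed, failed, and timed-out
// runs all parse as one structure, so CI branches on `status` and nothing else.
type RunStatus = 'passed' | 'failed' | 'timeout';

interface CaseResult {
  status: RunStatus;
  case: string;      // the #Case title from the plan file
  steps: number;     // steps executed before the run stopped
  seen?: string;     // on failure: what the last snapshot actually contained
  durationMs: number;
}
```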
Those seven cover the common failure modes. They do not cover taste: which scenarios to test, what counts as a regression, how to keep the suite from rotting into a maintenance liability. Those are judgment calls and they remain judgment calls regardless of the tool. The argument here is just that the tooling should not be the part that fails.
The fastest way to try this on your own app
If you want to see whether the accessibility-tree approach holds on your real product, the open path is one command. Assrt's generator hits the URL, snapshots the page at three scroll positions, asks Claude Haiku for five to eight #Case blocks in plain English, then the runner takes over and executes them through Playwright. The plan file goes to /tmp/assrt/scenario.md where you can read, edit, or check it in. Generated tests are not locked to a SaaS; the runner that executes them is the same @playwright/mcp you would call yourself.
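What a plan file can look like: the exact #Case syntax is not reproduced on this page, so the block below illustrates the idea (one plain-English scenario per block) rather than the canonical format.

```markdown
#Case: Sign up with a fresh email
- Open the homepage and click the "Start free" link
- Type a unique email into the email field
- Click the "Create account" button
- Expect the page to show the new workspace
```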
Whether this beats the suite you already have depends on what you already have. If the answer to "where do our UI tests live?" is "committed selectors in a Cypress repo we have not opened in three months", the accessibility-tree path will probably feel like a relief. If you have a healthy Playwright suite with stable getByRole locators and reasonable waits, the win from switching is smaller; the win from layering AI generation on top of it is the real upgrade.
Either way, the artifact is text in your repo, and the runner is open at github.com/assrt-ai/assrt-mcp. If a sentence on this page seems implausible, the file path is there to check.
Walk through a flaky suite with us
Bring one scenario from your own app that flakes more than you'd like. We'll trace where the locator layer is fighting your DOM, and whether a per-step snapshot path actually helps.
Frequently asked questions
What is automated UI testing, in one sentence?
Automated UI testing is having software drive the same user interface a human would (clicks, typing, navigation, selection, scrolling, file uploads, assertions on visible state) without a human in the loop. It is one slice of test automation, distinct from unit testing (which calls functions directly) and from API testing (which hits endpoints without rendering a page). The point is end-to-end confidence: if the test passes, the app loaded, the JavaScript executed, the network requests resolved, the state updated, and the result rendered the way a real user would experience it.
Why do automated UI tests flake on modern web apps?
Three reasons, in order of frequency. First, the locator layer is wrong: most test frameworks ask you to commit a CSS selector or XPath at write time, then look it up at run time against a DOM that has changed (a streamed-in Suspense boundary, a hydrated client component, a re-rendered list). Second, the wait primitive is wrong: waitForLoadState('networkidle') was designed for a 2014 page that finishes a single XHR, not a 2026 page that keeps the response open while React Server Components stream chunks. Third, the data isn't reset between runs: a test that signed up user42@example.com on Tuesday hits a 'user exists' branch on Wednesday. Frameworks themselves are not broken. The way most teams wire them up is.
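The third failure has the cheapest fix, sketched here with an invented URL: generate the identity per run instead of resetting the database.

```typescript
import { test } from '@playwright/test';

// Mitigation for the Tuesday/Wednesday problem: never reuse an identity.
// A per-run suffix keeps every signup on the "new user" branch.
test('signup always uses a fresh email', async ({ page }) => {
  const email = `user+${Date.now()}@example.com`;
  await page.goto('https://app.example.com/signup'); // hypothetical URL
  await page.getByRole('textbox', { name: 'email' }).fill(email);
  await page.getByRole('button', { name: 'Sign up' }).click();
});
```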
What is the difference between automated UI testing and end-to-end testing?
Overlap, but not synonyms. End-to-end testing means exercising the whole system top to bottom: frontend, API, database, background jobs, external integrations. Automated UI testing is the slice of end-to-end testing that drives the UI. You can do end-to-end testing without touching the UI (call the API, inspect the database). You can do UI testing without going end-to-end (mock the API, drive the rendered frontend). In practice, the test pyramid groups them together because they share the same infrastructure (a headless browser, a runner, an assertion library). The distinction matters when you are deciding what to mock.
Which tools do automated UI testing today?
The honest landscape: Playwright (Microsoft, open source) and Cypress (open source with a closed cloud) are the two most-used modern choices. Selenium remains in legacy stacks and CI configs that have not been migrated. Puppeteer is in the same family as Playwright but narrower. On top of those engines, AI-layer tools like QA Wolf (managed service, around $7,500/month publicly), Momentic, Testim, Mabl, and Rainforest QA add generation and dashboards. Assrt sits in the AI layer too, but writes its output as a Markdown #Case plan plus a real Playwright runner that re-resolves locators per step, all open source at github.com/assrt-ai/assrt-mcp. None of these are the right answer for every team. The choice that survives a year is the one whose generated tests you can actually read.
Is record-and-playback automated UI testing a good idea in 2026?
It is a fine starting point and a bad ending point. Recording your clicks and replaying them works the first time and breaks the second time because the recorder captured a snapshot of the DOM that no longer exists. The pattern survives only if the runner re-resolves the recorded intent against the live page each step, instead of replaying recorded coordinates or recorded CSS paths. The mechanical operation looks the same to the person recording. The difference is whether the artifact is 'click element [ref=e5]' (one-shot) or 'click the button labelled New project' (re-resolved every run). Treat record-and-playback as a way to draft test cases, not as a way to maintain them.
How do you wait for the page to be ready without flaky waits?
Stop waiting for the network and start waiting for the DOM. Network-based waits (networkidle, idleConnections, no-pending-fetches) were appropriate when a page made one or two XHRs after load. A modern app keeps the response open while React Server Components stream, opens a websocket for live updates, and re-renders after every hydration tick. The signal that the page is actually settled is that no DOM mutations have arrived for a configurable quiet window. Assrt's runner injects a MutationObserver on document.body, polls window.__assrt_mutations every 500ms, and returns when the count has been quiet for stable_seconds (default 2s, cap 10s) or the timeout (default 30s, cap 60s) elapses. The full mechanism is at agent.ts:962-994 in the open-source repo.
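A sketch of that wait, reconstructed from the description above rather than copied from agent.ts. The global counter name matches the one the answer cites; everything else is a plain Playwright reimplementation.

```typescript
import type { Page } from '@playwright/test';

async function waitForDomQuiet(
  page: Page,
  stableMs = 2_000,   // quiet window (the text: default 2s, cap 10s)
  timeoutMs = 30_000, // overall budget (default 30s, cap 60s)
): Promise<void> {
  // Inject a mutation counter on document.body.
  await page.evaluate(() => {
    (window as any).__assrt_mutations = 0;
    new MutationObserver(() => {
      (window as any).__assrt_mutations += 1;
    }).observe(document.body, { childList: true, subtree: true, attributes: true });
  });

  const deadline = Date.now() + timeoutMs;
  let last = -1;
  let quietSince = Date.now();

  while (Date.now() < deadline) {
    await page.waitForTimeout(500); // poll cadence from the description
    const count = await page.evaluate(() => (window as any).__assrt_mutations as number);
    if (count !== last) {
      last = count;
      quietSince = Date.now(); // mutations arrived; restart the quiet window
    }
    if (Date.now() - quietSince >= stableMs) return; // settled
  }
  // Budget elapsed while the page was still mutating; callers decide
  // whether that is an error or a "proceed anyway".
}
```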
Do automated UI tests replace QA engineers?
No, and the framing is wrong. Automated UI tests replace the part of QA that is repetitive and mechanical: regression coverage of flows you have already verified. The part of QA that is creative (figuring out which flow the customer actually walks, designing the test cases, deciding what 'this looks broken' even means) is still a human job. The realistic effect of good automated UI testing is that the same QA engineer covers a system three times bigger without the test count exploding into a maintenance liability. The unrealistic claim, that you can fire QA and let the machine sort it out, is the one that produces the test suites everyone hates.
How do you keep automated UI tests fast enough to run on every push?
Three levers. Parallelize at the worker level (Playwright workers, Cypress parallel containers, separate browser contexts per scenario); your suite is as slow as your slowest scenario, not the sum. Reuse auth state across scenarios so each test does not log in from scratch (Assrt's agent shares cookies and storage between #Case blocks for this reason, see agent.ts:692-747). And keep the wait primitives dynamic, not fixed: a 500ms sleep multiplied by 200 assertions across 60 scenarios is the difference between a five-minute CI and a thirty-minute one. None of these three are exotic. They are just the ones most suites skip until the build time forces it.
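The first two levers are a few lines of Playwright config; the values below are illustrative.

```typescript
// playwright.config.ts sketch. Worker count and auth path are invented.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  workers: 4, // lever 1: parallel workers; suite time ≈ slowest scenario
  use: {
    // lever 2: log in once in a setup project, then every scenario starts
    // from the saved cookies/localStorage instead of the login form
    storageState: '.auth/user.json',
  },
});
```

The third lever lives in the tests themselves: replace fixed sleeps with condition-based waits like expect(locator).toBeVisible(), which return as soon as the condition holds instead of burning the full interval every time.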
Open source vs. managed automated UI testing: how do you decide?
Decide on three axes. Output portability: do the generated tests live in your repo, or in a vendor dashboard? If you can't grep them, you can't audit them. Hosting cost: do you pay per scenario per month, or do you run on your own CI minutes? Managed services are often the right answer at low scale and a tax at high scale. Lock-in if you leave: if the vendor goes away tomorrow, do you have to rewrite the suite or can you keep running it? Assrt's bias is open source on all three axes (Markdown #Case files in your repo, run on your CI, no vendor to leave). Managed services like QA Wolf have a real dashboard and managed maintenance that an open tool does not match. The honest answer depends on which axis hurts more for your team.
Where is the source so I can verify these claims?
The MCP server, agent loop, browser wrapper, and CLI live at github.com/assrt-ai/assrt-mcp. The snapshot/click/type pipeline is at src/core/agent.ts (tool declarations at lines 27-53, dispatch at lines 778-797) and src/core/browser.ts (browser_snapshot at line 589, browser_click at 600, browser_type at 610). The MutationObserver stability wait is at agent.ts:956-1009. The npm package is @m13v/assrt; npx @m13v/assrt discover https://your-app.com generates a #Case plan and runs it. Every file and line cited on this page is checkable on the main branch.
Adjacent reading
Accessibility tree end-to-end testing: drive the same surface a screen reader sees
Deeper look at why the accessibility tree is the right level to test against, and what it gives you that pixel matching and CSS selectors don't.
Playwright AI test agents, explained
What the runner actually does between the snapshot and the click. Traced line by line: tool_use, tool_result, stop_reason: end_turn.
Playwright test generator for Next.js apps
Why npx playwright codegen on an App Router app emits a spec.ts that breaks the next time RSC streams in a different order, and the design that sidesteps it.