Self-healing test maintenance hours, the math nobody publishes
A senior QA engineer on a team running a real Playwright or Cypress suite spends roughly 42 hours a month on test maintenance, and about 80% of that, roughly 34 hours, is selector and locator churn. That is the headline number, and it shows up consistently across vendor surveys and developer studies. The whole self-healing category sells against it. Mabl publishes "up to 95% of test maintenance eliminated". Virtuoso publishes 83% less effort. Functionize, Testsigma, and the rest sit in the same band. They are all selling a better repair pipeline for your locators.
This page is the math for the option none of them publish, the one where the repair pipeline does not exist because no locator was stored. The number it returns is not 95% of 34. It is 34.
Direct answer (verified 2026-05-07)
Self-healing testing saves roughly 24-32 hours per month per QA in published vendor case studies, by patching 70-95% of broken locators automatically and leaving 2-10 hours per month of heal-review work. A no-locator architecture (re-snapshot the accessibility tree before every action, no stored selector) returns the full ~34 hours, because there is nothing to patch.
Sources verified today against mabl.com/auto-healing-tests, momentic.ai, and QA Flow on Medium.
Where the 42 hours actually go
The 42-hour number is a composite, not a survey response from one source. Webomates publishes an 80%-of-test-cost-is-maintenance figure. QA Flow on Medium puts it at 50% of total QA time. Mabl, Virtuoso, and Testim quote case studies in the 200-hours-per-quarter and 40-hours-per-month range. They roughly agree on the bucket sizes.
| Activity | Hours / mo | What it actually is |
|---|---|---|
| Selector / locator churn | ~34 | Update getByRole names, fix data-testid drift, rewrite XPath when a parent div gets added |
| Timing flake debugging | ~4 | Pass-locally-fail-on-CI loops, hardcoded waits, retry-3x bandaids |
| Test data and env drift | ~3 | Seeded user expired, staging DB reset, third-party API contract drift |
| Genuine product evolution | ~1 | Flow changed, the test now describes a feature that no longer exists, rewrite the case |
| Total | ~42 | Per senior QA, monthly, suite of ~200 cases |
The line items are not equally addressable. The bottom one (genuine product evolution) is real work; no automation strategy returns that hour, and pretending otherwise is dishonest. The other three are the structural budget self-healing tries to attack, and the one a no-stored-locator architecture removes outright.
“Up to 95% of test maintenance is eliminated by auto-healing. The remaining work is reviewing the proposed locator change before it ships.”
mabl.com/auto-healing-tests, accessed 2026-05-07
The 34 hours is the bucket Mabl, Virtuoso, Functionize, and Testsigma all aim at. The disagreement is over what fraction comes back. The unspoken question is what the bucket would look like if the model underneath did not store a locator at all.
Two architectures, two hour curves
The toggle below shows the same monthly maintenance budget under two different architectures, against the same fictional 200-case suite. Numbers in the "repair pipeline" column are the published best case from the SaaS side. Numbers in the "no locator stored" column are derived from the actual snapshot-per-action loop the Assrt agent runs.
Where the hours land
You author tests with stored locators (page.locator, getByRole, data-testid). Tests run, locators occasionally drift, the vendor's ML service detects the drift and proposes a patch. You review the patch in a dashboard, approve or reject, ship.
- ~3-7 hrs/mo: review heal proposals, decide if the new locator matches intent
- ~1-2 hrs/mo: false-heal regressions where the model patched around a real bug
- ~1 hr/mo: data and env drift that healing does not address
- ~1 hr/mo: genuine product evolution
- Total residual: 6-11 hrs/mo, plus per-seat subscription cost
The nine words that erase the bucket
The mechanism is three lines of system prompt and one error-recovery path. It is not magic and it is not a model upgrade. It is a choice about where the locator gets resolved. In a stored-locator world, resolution happens at authoring time and the storage rots. In a snapshot-first world, resolution happens at the moment the action fires and there is nothing to rot.
The nine words that do the work are "ALWAYS call snapshot FIRST to get the accessibility tree". The rest of the rules block enforces that this happens before every action and that a stale ref triggers a fresh snapshot rather than a retry against the old tree. The agent never accumulates a cache of selectors. Every interaction is resolved against a tree that did not exist a tick ago.
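For concreteness, here is a minimal sketch of what a rules block like that can look like. It paraphrases the two instructions this page quotes from agent.ts; the exact wording, the middle rule, and the constant name are illustrative assumptions, not the literal source.

```typescript
// Hedged paraphrase of the snapshot-first rules described above.
// The constant name and exact phrasing are illustrative, not the real agent.ts text.
const SNAPSHOT_RULES = [
  "ALWAYS call snapshot FIRST to get the accessibility tree with element refs.",
  "Interact only through refs taken from the most recent snapshot; never reuse a ref from an earlier one.",
  "If a ref is stale (an action fails), call snapshot again to get fresh refs before retrying.",
].join("\n");
```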
The recovery path is the normal path
Most self-healing products structure recovery as a separate pipeline: action fails, fault detection fires, a model is called to propose a new locator, the proposal is logged, a human reviews it. Each of those steps has a cost line. Assrt collapses recovery into the normal control flow at agent.ts line 933. When an action fails, the runner inlines the first 2,000 characters of a fresh accessibility tree directly into the failed-tool result that the model sees on its next turn. The model does not get told "a heal happened". It gets told the action failed and here is the page right now. Continue.
The 2,000-char slice is the entire heal protocol. There is no confidence score, no proposed-locator audit trail, no separate model call. The next turn either succeeds against the new tree, or fails again, in which case the agent re-snapshots, tries a different approach, and after roughly three failed attempts marks the case as failed for human review. That last hour is real and honest. It is the only hour the snapshot-first architecture spends on locator failures.
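As a sketch of that control flow, assuming the runner builds tool results as plain strings (the function and parameter names below are illustrative, not the actual agent.ts code):

```typescript
// Hedged sketch of the failed-action recovery path described above.
// Names are illustrative; only the behavior (fresh snapshot, first 2,000 chars
// inlined into the failed-tool result) comes from the text.
async function buildFailedToolResult(
  error: Error,
  takeSnapshot: () => Promise<string>,
): Promise<string> {
  // Re-snapshot the page the moment the action fails.
  const snapshot = await takeSnapshot();
  // Inline the first 2,000 characters so the model's next turn sees the page
  // as it is right now. No heal proposal, no confidence score, no review queue.
  return [
    `Action failed: ${error.message}`,
    "Current page (fresh accessibility tree, truncated):",
    snapshot.slice(0, 2000),
  ].join("\n");
}
```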
What disappears from your weekly maintenance loop
Concrete tasks, not abstractions. If you have run a Playwright suite past 100 cases, you have done some version of every line below inside the last 90 days. Every one of these is a budget item that does not exist in the no-locator-stored model.
Maintenance work that vanishes
- Update getByRole(role, { name }) locators when copy gets rewritten
- Fix XPath when a marketing redesign adds a wrapper div
- Coordinate with frontend on adding or removing data-testid attributes
- Migrate locators between Cypress and Playwright syntax
- Re-record a Codegen run because the page-object got out of date
- Triage flaky-locator alerts in CI Slack at 9pm
- Approve or reject AI heal proposals in a vendor dashboard
- Audit false heals where the model patched around a real bug
- Maintain a heal-confidence threshold per environment
- Keep a stale-selector-of-the-week running joke
Wait timing is the other big bucket and it is also structural
If your team spends fewer than 4 hours a month on timing flakes, either the suite is small or someone has done careful work on waits. Most have not. The standard pattern is await page.waitForTimeout(5000), which makes the test slow on fast pages and still flaky on slow ones, or await expect(...).toBeVisible({ timeout: 10000 }), which is better but still a fixed ceiling that you tune by superstition.
The wait_for_stable tool at agent.ts:872-925 measures actual DOM quiet rather than wall-clock time. It injects a MutationObserver on document.body, polls every 500ms, and exits the moment 2 seconds pass with zero new childList, subtree, or characterData mutations. The 30-second ceiling is there to prevent infinite loops on a page that never settles, not as the typical wait. A typical wait on a Next.js dashboard with a streaming widget is under 2 seconds plus the 2-second quiet window. A typical wait on a static page is under 500ms. You do not write any number into your scenario; the page tells the test how long to wait.
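To see the shape of that primitive, here is a minimal re-implementation of the same idea against a raw Playwright Page. It mirrors the behavior described above (500ms polling, 2-second quiet window, 30-second ceiling) but is a sketch, not the actual agent.ts:872-925 code; the helper name and the window property it uses are assumptions.

```typescript
import type { Page } from "playwright";

// Hedged sketch of a DOM-quiet wait in the spirit of wait_for_stable.
// Behavior mirrors the description above; implementation details are illustrative.
async function waitForStable(page: Page, quietMs = 2_000, ceilingMs = 30_000): Promise<void> {
  // Install a MutationObserver once; it stamps the time of the last DOM mutation.
  await page.evaluate(() => {
    const w = window as unknown as { __lastMutation?: number };
    if (w.__lastMutation !== undefined) return; // observer already installed
    w.__lastMutation = Date.now();
    new MutationObserver(() => {
      w.__lastMutation = Date.now();
    }).observe(document.body, { childList: true, subtree: true, characterData: true });
  });

  const deadline = Date.now() + ceilingMs;
  while (Date.now() < deadline) {
    const last = await page.evaluate(
      () => (window as unknown as { __lastMutation: number }).__lastMutation,
    );
    // Exit the moment the page has been quiet for the full window.
    if (Date.now() - last >= quietMs) return;
    await page.waitForTimeout(500); // poll every 500ms
  }
  // Ceiling reached: the page never settled; the caller decides whether that is a failure.
}
```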
The honest residual
A page that claims zero maintenance hours would be lying, so here is what stays. About one hour per month goes into env and data drift, the same as under any approach: a seeded user expires, a staging DB resets, a third-party API contract changes. About one more hour goes into genuine product evolution, where a flow your scenario describes no longer exists in the product. You read the failure, rewrite the #Case block, ship. assrt_diagnose will emit a corrected #Case block in Markdown if you want a head start, but the rewrite itself is the actual work, and there is no way to automate that without telling the test what the new flow is. Total residual on a mature suite: about two hours a month per QA, against the 6 to 11 hours of residual on the SaaS self-healing path.
The other thing that goes away is the per-seat subscription cost that closed AI QA platforms charge. Comparable products price in the $7,500 per month per seat range for the bundle that includes AI execution. The runner under @m13v/assrt is open source and self-hosted; the bill is whatever Anthropic charges for the model calls your runs consume, and you can swap to a local proxy via the ANTHROPIC_BASE_URL environment variable if you want the bill at zero.
How to verify the math on your own suite this week
Before you trust any vendor (this one included) on hour-savings claims, measure your own baseline. Three commands plus an afternoon of attention is enough to know whether your team sits at 10 hours a month or 50.
- Run git log --oneline --since="90 days ago" -- tests/ and grep for "flaky", "selector", "locator", "testid", "timing". Multiply the count by 20 minutes (the rough cost of a fix including review). That is your floor; a code sketch of this check follows the list.
- In CI, query for tests that failed and went green on retry within the same week without a code change to the test file. That is stale-selector or timing flake mass. Multiply by 15 minutes per investigation.
- For one week, ask every engineer who touches a test file to tag the commit either "repair" or "extend". Sum repair hours. Multiply by 4 to project monthly. This usually matches vendor case studies within 10-20%.
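A sketch of the first check, assuming Node and git are available in the repo; the 20-minutes-per-fix figure and the grep terms are the same ones the list uses, and the monthly division assumes the 90-day window maps to three months.

```typescript
import { execSync } from "node:child_process";

// Hedged baseline estimator for check one above. The path and the cost-per-fix
// assumption come straight from the list; adjust both for your repo.
const log = execSync('git log --oneline --since="90 days ago" -- tests/', {
  encoding: "utf8",
});
const repairTerms = /flaky|selector|locator|testid|timing/i;
const repairCommits = log.split("\n").filter((line) => repairTerms.test(line)).length;
const hoursPerMonth = (repairCommits * 20) / 60 / 3; // 20 min per fix, 90 days ≈ 3 months
console.log(`${repairCommits} repair-flavored commits ≈ ${hoursPerMonth.toFixed(1)} hrs/mo floor`);
```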
If your team comes in under 10 hours a month, you have a stable suite or a very small one and any self-healing product (this one included) will be marginal. If you come in over 30 hours a month you have the bucket the whole category aims at, and the question becomes whether you want a 95% repair pipeline or no pipeline at all.
Want to see this run on your suite?
A 20-minute call where we point Assrt at one of your real flows and watch the snapshot-per-action loop on your DOM, not on a marketing demo.
Frequently asked questions
What is the actual hourly cost of E2E test maintenance for a senior QA engineer today?
The recurring numbers across vendor reports and developer surveys land in roughly the same range. Senior QA engineers spend about 42 hours per month on test maintenance once a suite passes 200 cases. Roughly 80% of that, about 34 hours, is selector and locator churn (the rest is environment drift, data setup, and assertion updates). On top of the IC time, anyone who has been on a flaky-suite team will recognize the 3 to 5 hours per engineer per week of investigation time when a CI run goes red and someone has to figure out whether the app broke or the test broke. That investigation tax is not always assigned to QA, which is why it tends to disappear from roadmap planning. It still costs hours.
How much does mainstream self-healing actually cut, and what does the residual hour look like?
Vendor-published numbers cluster around 70 to 95% reduction in locator-maintenance time. Mabl publishes 'up to 95% of test maintenance eliminated'. Virtuoso publishes 83% less maintenance effort. Functionize and Testsigma sit in the same band. The residual is real, not zero. After healing happens, someone reviews the proposed locator change, decides if it is a legitimate UI evolution or a regression masquerading as drift, then either approves the heal or files a bug. That review is roughly 1.7 to 10 hours per month for the QA who owns the suite, plus the implicit cost of trusting the heal-confidence score and the cost of false heals (where the model patches around a real bug and the test goes green for the wrong reason).
Why can a no-locator architecture claim zero maintenance hours instead of 5% of 34?
Because the maintenance budget exists only because a stored selector can drift. Drop the storage step and the budget item disappears. Assrt's runner does this concretely: the system prompt in agent.ts:206-218 says 'ALWAYS call snapshot FIRST to get the accessibility tree with element refs' and 'If a ref is stale (action fails), call snapshot again to get fresh refs'. Refs like ref="e5" are issued by the live accessibility tree at the moment of the click, used once, and discarded. There is no .locator string in your spec to maintain, no test-id to coordinate with the frontend team, no XPath to update when a designer rewraps a button. The math is not 'we heal 95% of locator failures', it is 'no locator was ever stored, so nothing can drift'.
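As an illustration of that lifecycle, a snapshot excerpt and the click that consumes a ref might look roughly like this; the snapshot syntax is an approximation of what the Playwright MCP bridge emits, and the page content is invented.

```typescript
// Hedged illustration of single-use refs; snapshot format approximate, content made up.
const snapshotExcerpt = `
- heading "Create your account" [ref=e2]
- textbox "Email" [ref=e4]
- button "Create account" [ref=e5]
`;

// The next tool call targets a ref from THIS snapshot; nothing is stored afterwards.
const clickCall = {
  tool: "browser_click",
  arguments: { element: 'button "Create account"', ref: "e5" },
};
```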
What about waits and timing flakes, isn't that half the maintenance budget?
It is the other big bucket, and most of it is also addressable structurally rather than via heal logic. Assrt ships a wait_for_stable tool implemented at agent.ts:872-925. It injects a real MutationObserver onto document.body, polls every 500ms, and exits once 2 seconds pass with zero new childList, subtree, or characterData mutations (ceiling 30 seconds). A fast SPA settles in 400ms; a streaming chat UI might churn for 4 seconds. Both are handled by the same primitive, and neither requires the test author to guess a timeout. The hours that traditionally go into 'why did this pass locally and fail on CI' shrink considerably when timing is measured against actual DOM silence rather than a sleep(5000).
How do I sanity-check the 34 hours/month number for my own team?
Three checks that take less than an afternoon. First, run git log --oneline -- tests/ for the last 90 days and count commits whose message contains 'fix selector', 'update locator', 'flaky', or 'broken test'. Multiply average commit time (15-30 min including review) by that count. Second, in CI, query for tests that failed and were retried as green within the same week without a code change to the test; that is your stale-selector or timing flake. Third, do a one-week tally where every engineer notes when they touched a test file specifically to repair it (not extend it). The first method usually undercounts; the third method usually matches vendor surveys. If your team comes in under 10 hours a month, you either have a small stable suite or the cost is hiding in 'engineering' instead of 'QA'.
What is the catch with re-snapshot per action, doesn't that cost CPU and tokens?
Yes, snapshots and screenshots are not free, and that is the trade. Each step pays a snapshot call against the Playwright MCP bridge and the agent reads back an accessibility tree. The runner caps that tree at 120k characters fed to the model for context, and the failed-action recovery path inlines just snapshot.slice(0, 2000) at agent.ts:933 so the next turn sees the fresh tree without blowing the context window. In wall-clock time you spend a fraction of a second per action on snapshot. In tokens you spend Anthropic-priced inference. The total bill scales linearly with steps, not with how many of your selectors broke this sprint, which is the point: the cost becomes predictable infrastructure spend rather than unpredictable engineering hours.
Where do my tests live, and what happens to the maintenance burden when I want to leave?
Tests live in /tmp/assrt/scenario.md as plain Markdown #Case blocks. You can check that file into your repo next to a normal Playwright project; grep, diff, and git blame all work on it. The runner is open source under @m13v/assrt and the MCP server is open source under @assrt-ai/assrt-mcp. If you decide tomorrow that you want to leave for a different runner, your scenarios are 30 lines of English per test and the equivalent Playwright spec is a finger exercise to write. Compare with closed-source AI QA platforms where tests live in a vendor dashboard or proprietary YAML and migration is a multi-month project. The maintenance cost has a fixed ceiling because the lock-in cost is approximately zero.
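To make the exit cost concrete, a hand translation of a hypothetical #Case into a plain Playwright spec is on the order of the sketch below; the URL, labels, and assertion are invented for illustration, not taken from any real suite.

```typescript
import { test, expect } from "@playwright/test";

// Hypothetical translation of a scenario.md #Case ("sign up with a temp email")
// into a vanilla Playwright spec. Everything page-specific here is made up.
test("sign up with a temp email", async ({ page }) => {
  await page.goto("https://staging.example.com/signup");
  await page.getByLabel("Email").fill("qa+temp@example.com");
  await page.getByRole("button", { name: "Create account" }).click();
  await expect(page.getByText("Check your inbox")).toBeVisible();
});
```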
Doesn't a snapshot-first agent still need maintenance when intent itself changes?
Yes, and this is the honest residual. If the team renames "Sign up" to "Create account", no selector strategy needs to change, because the agent reads the live tree and matches the button by role and label. If the team replaces signup-with-email with passkey-only, your scenario.md still says 'Type a temp email into the email field' and there is no email field. The test should fail, you read the failure, and you rewrite the #Case block. That is genuine product evolution, not maintenance. Calling assrt_diagnose returns a corrected #Case in the same Markdown format, which you paste back. The hours that go here are usually under one per month per team, because product flows do not change that often. The 34 hours that vanish are the hours your team currently spends on selectors that no longer match the DOM, not on flows that no longer match the product.
Other angles on the same architecture
Adjacent reading
AI Playwright test maintenance: the locator-less approach that cannot rot
The system-prompt-level deep dive on snapshot-first execution, with the full agent.ts rule block.
Self-healing tests guide: the category above selector patching
Why prose tests against a fresh accessibility tree are not the same product as locator-patching SaaS, even when the marketing pages overlap.
Reduce test maintenance costs
The cost-side framing: where the per-sprint maintenance dollars actually go and which buckets are addressable structurally.