AI QA platform self-heal vs regression detection. Gate auto-decisions on signal type and write every one of them to an audit log.
Self-heal is correct on selector drift; it is dangerous on business-logic deviation. Saw it happen twice last quarter, both times the platform silently routed around a removed flow and reported pass. The fix is not turning self-heal off. The fix is gating it on signal type (selector drift heals, fuzzy match goes pending, removed flow fails) and writing every auto-decision to a single log a human can review. Seven signal types, three policies, one audit file.
The principles you cannot ship without
The leak: self-heal that silently routes around a real regression
A scenario asks the agent to click Upgrade. A pre-AI test framework would throw a selector error and CI would go red. An ungated AI QA platform notices that Downgrade is nearby, clicks it, and reports green. If the button had merely been renamed, that heal is exactly what you wanted. If the button was removed because a feature flag regressed and the only visible target is Downgrade, the test still passes. The user-visible flow is broken. CI is green. The platform did exactly what it advertised.
The leak is not the heal itself. The leak is that the heal is ungated and unlogged. There is no record of the deviation, no review queue, no signal in CI that something other than the literal scenario got executed. Green stops meaning what green used to mean. Over months, the meaning of a passing test erodes to 'the agent found something close enough to click'. Saw this happen twice last quarter, both times in production AI QA platforms, both times nobody noticed until a customer complained.
Where self-heal helps
Selector drift, layout drift, label trims
These changes happen on most front-end PRs. The cost of a human review on every one is too high; teams stop using the platform. Auto-heal inside a tolerance band is the right default.
Where self-heal hurts
Business logic, numeric values, removed flows
These are the signals the test exists to catch. Healing them is healing the bug. The right default is fail_fast, with a tolerance band where the contract allows one.
Decision 1: declare a per-signal policy file
Seven signal types, three policies (auto_heal, heal_with_pending, fail_fast), explicit tolerance bands. The platform can no longer decide what to do; the team decided in code review. The policy file is committed to the repo and reviewed alongside any change to the scenarios.
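A minimal sketch of the shape such a policy file can take, written here as a TypeScript literal so the structure is explicit. The signal names and band fields (maxLabelDistance, maxDomPathChange, maxTimeoutFactor) are illustrative, not a published schema.

```ts
// Hypothetical shape for a repo-tracked policy file such as .qa/policy.json,
// expressed as a TypeScript literal. Signal and field names are illustrative.
type Policy = "auto_heal" | "heal_with_pending" | "fail_fast";

interface SignalRule {
  policy: Policy;
  // Optional numeric band; outside the band the signal escalates (see the
  // tolerance-band step later in this page).
  tolerance?: Record<string, number>;
}

const policyFile: Record<string, SignalRule> = {
  selector_drift:           { policy: "auto_heal",         tolerance: { maxLabelDistance: 0.25 } },
  layout_drift:             { policy: "auto_heal",         tolerance: { maxDomPathChange: 2 } },
  timeout_extension:        { policy: "auto_heal",         tolerance: { maxTimeoutFactor: 3 } },
  fuzzy_text_match:         { policy: "heal_with_pending", tolerance: { maxSynonymDistance: 0.30 } },
  model_fallback:           { policy: "heal_with_pending" },
  business_logic_deviation: { policy: "fail_fast" },
  removed_flow:             { policy: "fail_fast" },
};
```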
Decision 2: write every auto-decision to a single log
One JSON record per decision. Step number, intent, signal type, policy applied, old value, new value, reason in plain English, review status. A different human can read the log six months later and reconstruct what happened. Without this, a green check has no audit trail.
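One way such a record could be shaped, with field names taken from the list above; nothing here is a fixed schema, only the minimum a reviewer needs to reconstruct the decision later.

```ts
// One hypothetical audit record per auto-decision. Field names are illustrative.
interface HealRecord {
  runId: string;      // timestamp-based run identifier
  scenario: string;   // path plus case name
  step: number;
  intent: string;     // the human-readable instruction from the scenario
  signal: string;     // e.g. "selector_drift", "business_logic_deviation"
  policy: "auto_heal" | "heal_with_pending" | "fail_fast";
  oldValue: string;
  newValue: string;
  distance?: number;  // metric checked against the tolerance band
  reason: string;     // plain-English explanation
  review: "auto_accepted" | "pending" | "rejected";
}

const example: HealRecord = {
  runId: "2026-02-11T09:14:02Z",
  scenario: "billing/upgrade.md#monthly-to-annual",
  step: 4,
  intent: "Click the Upgrade button",
  signal: "selector_drift",
  policy: "auto_heal",
  oldValue: "button.upgrade-cta",
  newValue: "button[data-testid='upgrade']",
  distance: 0.12,
  reason: "Label unchanged, class renamed in the same DOM position.",
  review: "auto_accepted",
};
```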
The review CLI that closes the loop
Pending heals accumulate. The team needs a five-minute ritual that processes them before they pile up and silently degrade green. A single CLI command shows what is pending, lets you accept, reject, or edit, and writes the decision to the overrides file.
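A sketch of what the core of that command could do, assuming newline-delimited JSON records in the heal log and a repo-tracked overrides file. The paths and function names are invented for illustration, not the actual CLI.

```ts
// Hypothetical core of the review ritual: walk pending heals from the log and
// persist each verdict in a repo-tracked overrides file. Paths are illustrative.
import { readFileSync, appendFileSync } from "node:fs";

type Verdict = "accept" | "reject" | "edit";

function pendingHeals(logPath = ".qa/heal.log") {
  return readFileSync(logPath, "utf8")
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line))
    .filter((rec) => rec.review === "pending");
}

function recordVerdict(
  rec: { runId: string; step: number },
  verdict: Verdict,
  note: string,
  overridesPath = ".qa/overrides.jsonl",
) {
  appendFileSync(
    overridesPath,
    JSON.stringify({ runId: rec.runId, step: rec.step, verdict, note }) + "\n",
  );
}

// The real CLI would prompt interactively; the shape of the persisted decision
// is the point here.
for (const rec of pendingHeals()) {
  recordVerdict(rec, "accept", "Label rename confirmed by the design PR.");
}
```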
The flow of a gated AI QA platform
Scenario in, agent runs, every deviation gets classified, the policy file decides whether to heal silently, queue a pending review, or fail the run. The audit log persists across runs, the override file persists across model upgrades.
Scenario → agent → classify → gate → log
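A minimal sketch of the gate stage in that flow, assuming classification has already happened upstream. The CI states and the default for unknown signals are choices made for illustration, not something a specific platform guarantees.

```ts
// Hypothetical gate: a classified deviation meets the policy file, every
// outcome is logged, and the run gets a CI state that matches what happened.
type Policy = "auto_heal" | "heal_with_pending" | "fail_fast";
type CiState = "green" | "yellow" | "red";

interface Deviation { signal: string; oldValue: string; newValue: string; distance?: number }

function gate(dev: Deviation, policies: Record<string, Policy>, log: (r: object) => void): CiState {
  const policy = policies[dev.signal] ?? "fail_fast"; // unknown signals never heal silently
  log({ ...dev, policy, at: new Date().toISOString() });
  switch (policy) {
    case "auto_heal":         return "green";  // heal, proceed, audit trail only
    case "heal_with_pending": return "yellow"; // heal, proceed, queue a human review
    case "fail_fast":         return "red";    // abort the run at this step
  }
}
```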
The seven design choices that make signal-type gating real
Configurable settings buried in vendor UIs do not count. Each of these is a structural choice about which signals heal, which fail, and how a human stays in the loop without becoming the bottleneck.
Selector drift is allowed to heal
When the only thing that changed is a label trim, a class rename, or a sibling DOM node, the platform should heal and move on. That is what self-heal was designed for. The cost of human review on every pixel-perfect change is too high; teams stop using the platform.
Business logic deviation must fail
If a price changes, a count changes, an email subject changes, a status changes, the platform must not paper over it. These are the signals the test was actually designed to catch. The default for any numeric or text contract should be fail_fast with a tolerance band.
Fuzzy match is heal_with_pending
When the platform matches 'invoice' to a scenario asking for 'receipt', it might be right (the rebrand might have shipped) or the email might be wrong. The honest answer is to run the test, mark it pending, and put a human in the loop within 24 hours.
Removed flows always fail
If the scenario says click Upgrade and there is no upgradeable target on the page (no role=button with any upgrade-shaped label, no link path that leads to a checkout), the platform must not invent a substitute. That is the regression-masking failure mode in its purest form.
Every auto-decision is logged
Selector swap, retry, timeout extension, fuzzy match, layout heal, model fallback, all of them. One JSON record per decision, one log file per run. The log is the audit trail; without it, a green check means whatever the platform wants it to mean.
Pending heals are yellow, not green
CI dashboards need a third state. Hard fail is red, full pass is green, and pass-with-pending-heal is yellow. Yellow blocks release until reviewed. Most teams pick lenient yellow for early adoption and migrate to strict yellow once the queue is small.
Override path lives in your repo
Reject a heal in the vendor UI and you cross your fingers it survives the next platform upgrade. Reject it in a repo-tracked file and the override survives a vendor migration, a model change, and a tool swap. One file, one git history, one diff at review time.
Ungated self-heal vs signal-type gated self-heal
Both have self-heal. The difference is what happens after the heal. An ungated platform mutates the run silently and reports green. A gated platform classifies the deviation, applies a per-signal policy, writes a record, and emits a CI state that matches what actually happened.
| Feature | Ungated AI QA platform | Signal-type gated platform (Assrt is one option) |
|---|---|---|
| Default behavior on selector drift | Auto-heal, no log entry, run reports green. | Auto-heal within tolerance band, JSON log entry with old and new ref, run reports green with audit trail. |
| Default behavior on business logic deviation | Auto-heal if it can find any element matching loosely; the price went from $190 to $228 and the test still passes. | Fail fast. The numeric contract is not negotiable. Run reports red, scenario step 7 names the expected vs observed value. |
| Where the policy lives | Inside vendor settings, configurable per scenario, opaque to source review. | In .qa/policy.json, repo-tracked, reviewed in pull requests. |
| Audit log of auto-decisions | Marketed as 'AI healed your test', no per-decision record. | One JSON record per decision per step per run, queryable, diffable, persists across runs. |
| Pending heals in CI | No third state. Pass or fail. | Yellow state. Pending heals block release until reviewed. The qa review CLI walks the queue in batch. |
| Override path | Vendor UI; survival across upgrades is uncertain. | Repo-tracked overrides file; survives upgrades, model swaps, vendor migrations. |
| Tolerance bands on auto-heal | Hidden inside the model. | Declared per signal: max label distance 0.25, max DOM path change 2, max timeout factor 3. |
| Behavior on removed flows | Finds the closest button shape and clicks it. Test reports green. | Fails fast. A removed Upgrade button cannot be substituted by a Downgrade button under any signal policy. |
The competitor column describes the shape of a typical ungated agentic testing product, not any single vendor. The gated column maps to how Assrt is wired today; you can grep agent.ts in the public repo to confirm the model and tool surface.
Six structural steps to wire signal-type gating
Walk a candidate platform through these six checks. If it fails any of them, the platform cannot be gated meaningfully; it can only be rented. That may still be the right trade for some teams, but the trade should be visible.
Inventory the signal types your platform fires
Before you can gate, you need to enumerate. Read the platform's documentation or source for every classification the AI can return: selector drift, layout drift, fuzzy text match, timeout extension, model fallback, missing target, removed flow, value out of tolerance. Anything not on the list will end up ungated by default.
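One way to make that inventory concrete is a closed union, so a signal the team forgot to classify shows up at type-check time instead of silently defaulting to ungated. The spellings below come from the list above; a real platform will have its own.

```ts
// The classifications named above, written as a closed union. Mapping this
// union exhaustively (e.g. Record<Signal, Policy>) turns a forgotten signal
// into a compile error rather than an ungated default.
type Signal =
  | "selector_drift"
  | "layout_drift"
  | "fuzzy_text_match"
  | "timeout_extension"
  | "model_fallback"
  | "missing_target"
  | "removed_flow"
  | "value_out_of_tolerance";
```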
Pick the policy per signal
Three policies cover most cases. auto_heal proceeds and logs without a review. heal_with_pending proceeds and queues a human review. fail_fast aborts the run. Map each signal to one policy. Selector drift maps to auto_heal. Business logic deviation maps to fail_fast. Most others map to heal_with_pending.
Set tolerance bands
The auto_heal policy needs limits. Label distance under 0.25, layout DOM path changes under 2, timeout factor under 3, synonym distance under 0.30. Outside those bands the signal escalates: a selector_drift with distance 0.45 falls through to heal_with_pending; a fuzzy_text_match with distance 0.50 falls through to fail_fast.
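A sketch of that escalation, using the thresholds from this step. The one-level escalation order (auto_heal → heal_with_pending → fail_fast) mirrors the two examples above; it is an assumption for illustration, not a library default.

```ts
type Policy = "auto_heal" | "heal_with_pending" | "fail_fast";

const order: Policy[] = ["auto_heal", "heal_with_pending", "fail_fast"];

// A signal whose distance falls outside its band is handled by the
// next-stricter policy instead of being healed silently.
function escalate(base: Policy, distance: number, band: number): Policy {
  if (distance <= band) return base;
  return order[Math.min(order.indexOf(base) + 1, order.length - 1)];
}

// selector_drift at distance 0.45 against a 0.25 band -> "heal_with_pending"
console.log(escalate("auto_heal", 0.45, 0.25));
// fuzzy_text_match at distance 0.50 against a 0.30 band -> "fail_fast"
console.log(escalate("heal_with_pending", 0.50, 0.30));
```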
Wire the audit log
Every decision the platform considered, including the ones it auto_healed, lands in .qa/heal.log with the run id, the scenario, the step, the signal type, the input, the output, the policy, and the review status. The point of the log is that a different human, six months later, can reconstruct what happened.
Make pending heals a yellow CI state
Either fail the build until pending heals are reviewed (strict), block release but allow merge (lenient), or batch on a daily cadence. The wrong default is treating pending as green; the meaning of green erodes within weeks.
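A sketch of how the third state can be computed and enforced. Strict and lenient here are the two modes this step describes; the function names are invented for illustration.

```ts
// Hypothetical CI gating: pending heals produce a third state that always
// blocks release; whether they also block merge depends on the chosen mode.
type RunState = "green" | "yellow" | "red";
type Mode = "strict" | "lenient";

function ciState(hardFailures: number, pendingHeals: number): RunState {
  if (hardFailures > 0) return "red";
  return pendingHeals > 0 ? "yellow" : "green";
}

function canMerge(state: RunState, mode: Mode): boolean {
  if (state === "red") return false;
  if (state === "yellow") return mode === "lenient"; // strict mode blocks merge too
  return true;
}

function canRelease(state: RunState): boolean {
  return state === "green"; // yellow blocks release until the queue is reviewed
}
```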
Run a weekly heal review
30 minutes, the team lead, the qa review CLI. Walk every pending heal. Accept the obvious ones, reject the ambiguous ones with a note explaining what the scenario expects. Patterns emerge: most rejects come from the same two scenarios; fix the scenarios and the rejects stop.
What a gated run looks like in your shell
One CI run, one auto-heal, one fail_fast on a price deviation. The decisions land in a file you can grep, diff, and revert. Tomorrow's run is governed by today's policy, not by an unspecified inference.
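Under the assumptions used in the earlier sketches, the two decisions from a run like that could land in the log as records like these, one JSON object per line, which is what makes grep and diff work on them. The price values echo the example from the comparison table.

```ts
// Two hypothetical heal-log lines from the run described above: one silent
// auto-heal, one fail_fast on the price contract. All values are illustrative.
const runLines = [
  {
    step: 2, signal: "selector_drift", policy: "auto_heal",
    oldValue: "button.upgrade-cta", newValue: "button[data-testid='upgrade']",
    review: "auto_accepted",
  },
  {
    // The run aborted at this step; no heal was attempted.
    step: 7, signal: "business_logic_deviation", policy: "fail_fast",
    oldValue: "$190", newValue: "$228",
    reason: "Observed total is outside the numeric contract.",
  },
].map((record) => JSON.stringify(record));
```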
The reframing
The point of self-heal is not to remove humans. It is to remove the boring 80 percent and route the real 20 percent to a queue.
Selector drift, layout shuffles, label trims, all of those should heal automatically inside a tolerance band. Removed buttons, wrong prices, broken email contracts, all of those should fail fast and reach a human within the same day. The platform's job is not to decide which is which without telling you; the platform's job is to classify the deviation, apply the policy the team wrote, and put the call in front of someone who knows the product.
Assrt is one platform built on these principles: scenarios in your repo, every auto-decision logged with a signal type, one overrides file, model and tool surface readable in agent.ts. There are others. The principles outlast any single tool.
“Self-heal is correct on selector drift; it is dangerous on business-logic deviation. The audit log of every auto-decision, classified by signal type, is what separates a useful platform from a regression-masking machine.”
Self-heal design notes, 2026
Want a signal-type gating review of your current QA platform?
Bring the dashboard, we will walk through the seven design choices in 30 minutes. No pitch, you keep the notes.
Frequently asked questions
How is self-heal a regression-masking failure mode?
Self-heal becomes a regression-masking failure mode when it patches a real product bug and reports a green check. Imagine the Upgrade button is removed because of a feature-flag regression and the only nearby button is Downgrade. An ungated self-heal will find 'something close enough', click it, and pass. The CI dot is green. The deploy lands. The user-visible upgrade flow is broken. The leak is the green check that did not need to be green. Saw this happen twice last quarter on real products; both times the platform was working as advertised, and that was the problem.
Why not just turn self-heal off?
Because selector drift is real, frequent, and not interesting. UI teams rename labels, swap class names, restructure DOM trees on a weekly cadence. If every one of those changes fails the test suite, the team disables the suite within a sprint. The right answer is to gate self-heal on the signal type. Allow it where the cost of a false positive is low (selector drift, layout drift). Forbid it where the cost is high (business logic, removed flows, numeric values).
What signals should always heal automatically?
Selector drift inside a tight label distance band, layout drift inside a small DOM-path-change budget, timeout extension up to a small multiple of the original timeout. These are the changes a test framework cannot reasonably ask a human to review every time, because they happen on most front-end PRs and almost never indicate a regression. Setting a numeric tolerance band on each one keeps the auto-heal honest.
What signals should always fail fast?
Business logic deviation outside a tolerance band (a price, a count, a status). Missing assertion targets that the scenario named explicitly (a div with id 'order-total'). Removed flows where the agent cannot find any element with the right role and a label even loosely related to the intent. Email subjects, since the subject is the contract with the user, not a free-text approximation. These are the signals the test exists to catch; healing them defeats the purpose.
What does heal_with_pending mean operationally?
The platform proceeds with the heal (so CI does not block on every borderline call) but writes a record to the heal queue with a 24 hour SLA on human review. A pending heal blocks release, not merge. After the SLA, the platform either auto-rejects (strict) or lets the heal stand (lenient), depending on team policy. Most teams start lenient and tighten as confidence grows.
What does the audit log actually contain?
One JSON record per decision. Run id (timestamp), scenario name (path plus case), step number, intent (the human-readable instruction the scenario expressed), signal type (selector_drift, business_logic_deviation, fuzzy_text_match, etc.), policy applied, old value, new value, distance metric, reason in plain English, review status (auto_accepted, pending, rejected). The records persist across runs in a directory committed to the repo or pushed to an artifact store the team controls.
How does this work in an AI QA platform like Assrt?
Assrt is one option that exposes the agent's tool surface in source so policy gates can be wired explicitly. Scenarios live in your repo as Markdown, the agent reads them and executes them with an LLM-backed tool loop, and every retry, selector swap, fuzzy match, or timeout extension is recorded. The policy file lives next to the scenario; the override file lives next to the policy; the heal log writes per run. Other platforms can offer the same shape if they expose enough of the agent surface to gate. The principle outlives any single tool.
What if the platform vendor will not expose signal types?
That is the strong negative signal. If you cannot enumerate the classifications the AI returns, you cannot gate. You are at the vendor's mercy on every borderline call. The right reply is to ask for the signal taxonomy in writing and the policy hooks per signal. If the vendor declines, you are buying a black box. That may still be the right trade for some teams, but the trade should be visible.
How does signal-type gating relate to flake?
Most flakes show up as low-confidence signals: a fuzzy match at distance 0.40, a layout drift with three DOM path changes, a timeout extension to 4x the budget. Gating forces these into the heal_with_pending queue rather than silent auto-heal. The flake stops being a mystery in CI and becomes a queued review. Either the underlying issue gets fixed, or the policy band gets tuned, or the scenario gets rewritten. All three are progress; silent retry is not.
How big is the queue in practice?
On a 200-scenario suite running on every push, expect 5 to 15 pending heals per day in the first month, dropping to 1 to 3 per day by month three as scenarios stabilize. Total review cost is 20 to 40 minutes per week for the team lead. The investment pays for itself the first time a real regression that would have been auto-healed instead lands as a fail_fast and gets caught before deploy.
Should I treat pending heals as fail or pass for release gating?
Block release. Allow merge. The point is that a pending heal is information, not a verdict; the team has a window to look at it. Blocking merge creates a queue of stuck PRs and the team starts auto-accepting to clear the queue, which defeats the purpose. Blocking release lets feature work move forward while the qa review CLI processes the queue in batch.
How does this compare to traditional record-and-replay self-heal?
Traditional record-and-replay self-heal patches the test file in place, reports green, and offers no per-decision record. The trust model is 'we know better than you'. Signal-type gating inverts this: the platform classifies the deviation, applies a per-signal policy, and writes the decision to a log. The trust model becomes 'we noticed something, here is what, here is the policy we applied, you can override'. The cost is one weekly ritual. The benefit is that green stays meaningful and regressions stop hiding behind heals.