Build vs buy QA platforms in the AI era

Co-create the QA platform. Make every auto-decision readable and overridable in one place, not just configurable.

The build vs buy debate for QA platforms changed shape once self-healing arrived. The risk is no longer that tests break, it is that they silently keep passing while patching over a real regression. The honest answer is co-create: own the scenarios and the overrides, lease the runtime, and require every auto-decision (selector swap, retry, skip, fuzzy match, model fallback) to be readable and overridable in one place. Configurable is not enough.

Matthew Diakonov
11 min read
Design choices that survive a vendor change
Scenarios live as Markdown in your repo, not in a vendor database
Every auto-decision lands in .assrt/heal.log with reason and override path
One overrides file per scenario, committed to git, reviewed in PRs
Model and tool surface readable in agent.ts, no proprietary engine claims

The principles you cannot ship without

co-create · readable decisions · overridable in one place · scenario as source of truth · explicit self-heal log · regression vs drift · no silent patches · auditable model calls · one file to grep · no vendor escrow

The leak: self-heal that silently patches a real regression

A scenario asks the agent to click the Checkout button. The button is not there. A pre-AI test framework throws a selector error. A modern AI-driven test framework finds 'Buy now' instead, clicks it, and marks the test green. If the rename was real, that behavior is what you wanted. If the original button was removed because a feature flag regressed and 'Buy now' is actually the wishlist save button, the test still passes. The user-visible flow is broken. CI is green. The platform did exactly what it advertised.

The leak is not the heal itself. The leak is that the heal is invisible. There is no record, no review queue, no signal in CI that something other than the literal scenario was executed. Green stops meaning what green used to mean. Over months, the meaning of a passing test erodes to 'the agent found something close enough to click'. That is not what you bought a QA platform for.

The promise

Tests survive UI changes

When the button is renamed, the test still finds it. Selector maintenance disappears. Marketing pages talk about hours saved per week per engineer.

The leak

Tests survive real regressions too

The platform cannot tell a redesign from a bug. Without a review loop, it heals both. Green means 'the agent found something', not 'the user-visible flow worked'.

Decision 1: scenarios stay in your repo

checkout/scenario.md

Plain Markdown. Reviewable in PRs. Diffable when product changes the flow. The platform reads this file; it does not own it. The scenario is the contract: any heal that deviates from it requires a human decision before it becomes the new normal.
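To make that concrete, here is a minimal sketch of what checkout/scenario.md might contain. The step wording and layout are illustrative, not a schema the platform mandates:

```markdown
# Checkout happy path

1. Open /cart with one item already in the basket.
2. Click the "Checkout" button.
3. Fill the shipping form with the staging test address.
4. Click "Pay now" and wait for the confirmation page.
5. Assert the page shows an order number and the text "Thank you".
```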

Decision 2: every auto-decision is logged

.assrt/heal.log (after a run)

One JSON record per decision. Step number, intent, old value, new value, reason, auto-action, review status, override path. Anyone on the team can read this file in the morning and say 'that was a real regression' or 'codify it, the rename is intentional'. The record is the audit trail; the heal is not invisible.
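A sketch of one such record, using the fields listed above; the exact key names and selector syntax are illustrative:

```json
{
  "scenario": "checkout",
  "step": 2,
  "intent": "click the Checkout button",
  "old": "button:has-text(\"Checkout\")",
  "new": "button:has-text(\"Buy now\")",
  "reason": "exact match not found; closest accessible name chosen by fuzzy match",
  "action": "selector-swap",
  "review": "pending",
  "override": ".assrt/overrides/checkout.json"
}
```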

Decision 3: every decision is overridable in one place

.assrt/overrides/checkout.json (committed to your repo)

Reject means the next run fails on that step until a human fixes the scenario or the DOM. Accept means the heal becomes the new baseline. One file, one path, one git history. Survives platform upgrades, survives a model change, survives a vendor migration.
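A sketch of what that file could look like after two review decisions; the key names are illustrative:

```json
{
  "scenario": "checkout",
  "decisions": [
    {
      "step": 2,
      "heal": "selector-swap: Checkout -> Buy now",
      "verdict": "accept",
      "note": "intentional rename shipped with the spring redesign"
    },
    {
      "step": 5,
      "heal": "skip: confirmation-text assertion",
      "verdict": "reject",
      "note": "real regression; fix the feature flag before re-running"
    }
  ]
}
```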

The review CLI: how a human actually closes the loop

Pending heals accumulate. The team needs a five-minute ritual that processes them before they pile up and silently degrade green. A single CLI command shows what is pending, lets you accept, reject, or edit, and writes the decision to the overrides file.

review session at the end of the day
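The assrt review command is named later in this article; the prompts and output below are an illustrative sketch, not the tool's exact interface:

```console
$ assrt review
2 pending heals in .assrt/heal.log

[1/2] checkout, step 2: selector-swap "Checkout" -> "Buy now"
      reason: exact match not found; closest accessible name
      (a)ccept / (r)eject / (e)dit? a
      -> written to .assrt/overrides/checkout.json

[2/2] signup, step 4: skipped assertion "welcome banner visible"
      (a)ccept / (r)eject / (e)dit? r
      -> next run fails on signup step 4 until the scenario or the DOM is fixed
```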

The flow of a co-created QA platform

Scenario in, agent and tools execute, every deviation lands in a review queue, the human decision feeds back into the next run. The platform learns by writing to a file, not by mutating a test silently in the background.

Scenario → agent → log → review → next run: scenario.md and overrides.json feed the Agent loop, the Agent loop drives the DOM under test, each run emits a result plus heal.log, and pending heals sit in the review queue until a human decision governs the next run.

The seven design choices that make co-create real

Configurable settings buried in vendor UIs do not count. Each of these is a structural choice about who owns what artifact, where it lives, and how a human stays in the loop without becoming the bottleneck.

Scenario lives in your repo

The scenario is the contract. It lives next to the code as plain Markdown, gets reviewed in pull requests, gets diffed when product changes the flow. The platform reads it; the platform does not own it. If the platform shuts down tomorrow, the scenarios stay where they are.

Every auto-decision is logged

Selector swap, retry, skip, fuzzy match, timeout extension, model fallback. Every one of them lands in a single .assrt/heal.log file with the step, the reason, the new value, and the previous value. No silent patches.

Every decision is overridable in one place

A single overrides file (per scenario or per repo) lets you accept, reject, or edit any auto-decision. Reject means 'this run is now failing again because the heal was wrong'. Accept means 'the platform was right, codify it'. The override survives version upgrades.

Regression vs drift is your call

The platform cannot know whether a renamed button is a UI redesign or a removed feature. It can only flag the discrepancy and propose a heal. The human reviewer decides whether the heal becomes the new normal or the heal becomes a failed test.

Model and tool surface are readable

You should be able to grep the source for the model name and the list of actions the agent can take. In Assrt that is agent.ts: claude-haiku-4-5-20251001 on line 9, and 18 tool definitions between lines 16 and 196. Closed platforms refuse the same disclosure.
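Auditable here means literally greppable. The commands below use the paths and counts the article names, though the matched lines are illustrative, not copied from the repo:

```console
$ grep -n "claude-haiku" agent.ts
9:const DEFAULT_MODEL = "claude-haiku-4-5-20251001";

$ sed -n "16,196p" agent.ts | grep -c "name:"
18
```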

Costs land on your invoice

Co-create means you bring the API key. The model invoice lands at Anthropic, priced per token. There is no opaque per-seat fee that hides the actual marginal cost of a test run from the budget owner.

No vendor escrow on tests

Tests live as files on disk. They run inside your CI on your hardware. You can grep them, diff them, port them. The platform is the runtime; the tests are yours.

100% of auto-decisions written to one log
1 file path to override any decision
18 tools the agent can call in agent.ts
0 vendor-private decisions hidden from review

Closed AI testing platform vs co-create

Both have self-heal. The difference is what happens after the heal. A closed platform mutates the test in memory and reports green. A co-create platform records the heal, runs the new value, and tags the result for review. The user-facing CI dot looks the same on the first day. By the third month, the closed platform's green has lost most of its meaning, and the co-create platform's green still means what it always meant.

| Feature | Closed AI testing platform | Co-create platform (Assrt is one option) |
| --- | --- | --- |
| Where the scenario lives | In the vendor's database, edited through the vendor's UI. | In your repo as plain Markdown. PR-reviewable, diffable, portable. |
| Self-heal behavior on selector drift | Patches the running test in memory, marks the run green. | Writes the heal to .assrt/heal.log, runs the new selector, marks the run 'pass with pending heal'. |
| Where to override an auto-decision | Vendor UI, sometimes with per-test config; survival across upgrades is uncertain. | One repo-tracked overrides file. Reject blocks the next run on that step until fixed. |
| How regressions vs UI drift are distinguished | The platform decides. You see only the green check. | The platform proposes. The human in code review decides. |
| Model and tool surface disclosure | Marketed as 'proprietary AI engine'. | agent.ts line 9 names the default model. Lines 16 to 196 list all 18 tools the agent can call. |
| Who pays for inference | Bundled in a per-seat SaaS subscription. | You bring the Anthropic API key. Token-priced. Itemized invoice. |
| What the CI exit code means | Green means passed. You do not know what got patched on the way. | Green means passed with no pending heals. Yellow means heals waiting for review. Red means a hard failure. |
| If the vendor sunsets the product | Tests live in their database. Migration is a project. | MIT-licensed runtime. Tests, overrides, and results are files on your disk. Run offline forever. |

The competitor column describes the shape of a typical closed agentic testing product, not any single vendor. The co-create column maps to how Assrt is wired today; you can grep agent.ts in the public repo to confirm the model and tool surface.

Six structural choices to make before you adopt any AI QA platform

Walk a candidate platform through these six checks. If it fails any of them, the platform cannot be co-created with; it can only be rented. That may still be the right trade for some teams, but the trade should be visible.

1. Make the heal land in a log, not in the test

When a selector drifts, the platform must not silently patch the running test in memory and move on. It must write the heal to a log, run the test with the new selector, and tag the result as 'pass with pending heal' rather than 'pass'. The CI green light should be qualified, not unconditional.

2. Make every kind of auto-decision human-readable

Not just selector swaps. Retries, timeouts, skipped assertions, fuzzy matches, model fallback to a different provider, every decision the platform made that was not 1:1 with the scenario. One JSON record per decision, one log file per run.

3. Make every decision overridable in one place

Pile all overrides into a single repo-tracked file (or one file per scenario), not in vendor settings. Reject means the next run fails on that step until a human fixes the underlying issue. Accept means the heal becomes the new baseline.

4. Make 'pending heal' a CI gate

Treat any pass-with-pending-heal as yellow, not green. Block the merge until a reviewer accepts or rejects. Or batch them into a daily review. Either way, the heals do not accumulate silently.
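One way to wire that gate, sketched in TypeScript under the assumption that heal.log is JSON lines with a review field like the record shown earlier; the script name and --strict flag are hypothetical, not Assrt's own API:

```ts
// ci-gate.ts: fail or warn on pending heals (a sketch, not Assrt's CLI)
import { existsSync, readFileSync } from "node:fs";

type Heal = { scenario: string; step: number; review: "pending" | "accepted" | "rejected" };

const strict = process.argv.includes("--strict");
const logPath = ".assrt/heal.log";

// Collect every heal that no human has accepted or rejected yet.
const pending: Heal[] = existsSync(logPath)
  ? readFileSync(logPath, "utf8")
      .split("\n")
      .filter(Boolean)
      .map((line) => JSON.parse(line) as Heal)
      .filter((h) => h.review === "pending")
  : [];

if (pending.length === 0) {
  console.log("green: no pending heals");
} else if (strict) {
  console.error(`red: ${pending.length} pending heal(s) block this merge`);
  process.exit(1); // strict mode: pending heals fail the build
} else {
  console.warn(`yellow: ${pending.length} pending heal(s) awaiting review`);
  // lenient mode: allow the merge, gate the release elsewhere
}
```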

5. Make the surface auditable in source

The model name and the tool list should be greppable strings in source you can audit before adoption. If the platform vendor cannot or will not show you those, you are not co-creating, you are renting an opaque opinion.

6. Make the tests portable

Scenarios, results, and overrides all live as files on your disk. If the vendor disappears, you keep the artifacts and run them with a different runtime. Lock-in is the failure mode of the previous decade of test platforms; do not repeat it.

What a co-created run actually feels like in your shell

One CI run, two pending heals, a five-minute review session. The decisions land in a file you can grep, diff, and revert. Tomorrow's run is governed by today's decisions, not by a vendor's silent inference.

a normal day with a co-create QA platform
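Sketched below under the assumption of an assrt run subcommand (only assrt review is named in this article); the output lines are illustrative:

```console
$ assrt run checkout signup
checkout .... pass (1 pending heal)
signup ...... pass
1 heal waiting for review; CI status: yellow

$ assrt review
[1/1] checkout, step 2: selector-swap "Checkout" -> "Buy now"
      (a)ccept / (r)eject / (e)dit? a

$ git add .assrt/overrides/checkout.json
$ git commit -m "codify checkout rename as the new baseline"
```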

The reframing

The point of self-heal is not to remove humans. It is to remove the boring 80 percent and surface the real 20 percent.

Selector renames, layout shuffles, copy edits, all of those should heal automatically. Removed buttons, broken flows, regressed feature flags, all of those should reach a human inbox within the same day. The platform's job is not to decide which is which; the platform's job is to flag the deviation, run the tentative path, and put the call in front of someone who knows the product. Co-create is the architecture that makes that possible.

Assrt is one platform built on these principles: scenarios in your repo, every auto-decision logged, one overrides file, model and tool surface readable in agent.ts. There are others. The principles outlast any single tool.


Configurable means a setting before the run. Reviewable means a record after the run. The QA platform you can trust six months from now is the one that gives you both, in one file, in your repo.

Co-create design notes, 2026

Want a co-create review of your current QA platform?

Bring the dashboard and we will walk through the seven design choices in 30 minutes. No pitch; you keep the notes.

Frequently asked questions

What does 'co-create a QA platform' actually mean in practice?

It means you own the parts of the platform whose value compounds (the scenarios, the overrides, the result history) and you lease the parts that change quickly and benefit from a vendor's investment (the runtime, the model integration, the browser automation). The litmus test is portability: if the vendor disappears, can you keep the work? Co-creating means yes. Pure buy means no. Pure build means you own everything including the parts that you should not be writing yourself.

Why is self-heal a leak rather than a feature?

Self-heal becomes a leak when it patches a real regression and reports green. Imagine the checkout button was renamed from 'Checkout' to 'Buy now'. A self-heal can flexibly find the new button and click it, the test passes, the deploy lands. Now imagine the checkout button was removed because of a bug in a feature flag and the only visible button is 'Save for later'. The self-heal still finds 'something close enough' and clicks it. The test passes. The deploy lands. The user-facing checkout flow is broken. The leak is the green check that did not need to be green.

Is configurable self-heal enough?

No. Configurable means there is a setting you can tweak before the run. Reviewable means there is a record after the run that names every decision the platform made and lets you accept or reject each one. Most platforms offer the first; almost none offer the second. Configurable is necessary, not sufficient. The decision needs to be readable in retrospect by a human who is not the person who set the config.

What does a readable auto-decision look like?

A JSON record with the scenario name, the step number, the intent, the previous value, the new value, the reason the platform made the change, and a path to override it. Specifically not 'AI healed your test', specifically not a green check with no detail. The point of the record is that anyone on the team can read it next morning and say 'that was a real regression' or 'that was a UI rename, codify it'.

Does this slow down adoption of the platform?

It changes the adoption story. Instead of 'install and your tests heal themselves', you get 'install and your tests heal themselves, with a queue of decisions to review at the cadence of your team'. Most teams find that pace easier to adopt than fully automatic, because it preserves the trust contract: green means green. Pending means the platform did something, here is what, please confirm.

How does Assrt actually implement this?

Scenarios live in your repo as Markdown. The agent (an LLM-driven loop with 18 tools, declared in agent.ts lines 16 to 196) executes them against a real Playwright browser. Every retry, every selector swap, every fuzzy match, and every timeout extension is recorded with the step number and the reason. The override file lives next to the scenario in the repo, and the assrt review CLI walks through pending decisions for batch acceptance or rejection. The default model is claude-haiku-4-5-20251001, named on line 9 of agent.ts, and you bring your own API key so the cost lands on your Anthropic invoice rather than a per-seat SaaS line.

Where should the overrides file actually live?

In the same directory as the scenario file, committed to git, reviewed in pull requests like any other source. .assrt/overrides/checkout.json or checkout/.overrides.json depending on layout taste. The point is that overrides are part of the codebase, not part of the vendor's database. They survive a vendor migration. They survive a tool change. They survive an LLM upgrade.

What is the right CI failure semantic for a pending heal?

There are three reasonable choices. Strict: pending heals fail the build, force daily review. Lenient: pending heals show as yellow on the dashboard, allow merge but block release until reviewed. Batched: pending heals accumulate over a window (a day, a sprint), get reviewed together, are then enforced strictly. Most teams pick lenient for early adoption and migrate to strict once the heal queue is small. The wrong choice is treating pending heals as green, which silently erodes the meaning of a passing test.

How does this compare to traditional record and replay tools that 'heal selectors'?

Traditional self-heal patches the test file. The selector changes from #checkout-btn to #buy-now-btn and the next run uses the new value. There is rarely a record, almost never a review queue, and usually no override path. The trust model is 'we know better than you'. Co-create platforms invert this: 'we noticed something, we made a tentative call, you decide if it sticks'. The cost is one human moment per heal. The benefit is that green stays meaningful.

Is build vs buy still a real debate, or is the answer always co-create?

Build still wins for teams with extreme platform surface area or extreme privacy constraints; you maintain the runtime because the alternatives do not exist for your workload. Buy still wins for teams with no QA engineering capacity at all and a willingness to trust the vendor's defaults; the platform does what it does, you accept the trade. Co-create wins everywhere in between, which is most teams. The question is not whether to co-create but on which axes to draw the line. Scenarios, overrides, and results stay yours. Runtime, model integration, and tooling stay vendor.

What if our platform vendor refuses to print the model and tool list?

That is a strong signal that you cannot co-create with them, because you cannot reason about what the platform will do tomorrow. A model swap on the vendor side, a new tool added to the agent, a change to the system prompt, all of these change the behavior of every test you run, and you have no audit trail. Pick a vendor whose source you can read or whose disclosures you can pin to specific lines.

How do we measure whether co-create is working?

Three numbers. First, the heal queue depth: how many pending decisions are waiting for review. It should be small and stable, not growing. Second, the override-to-scenario edit ratio: how often a rejected heal triggers a scenario update versus a code fix. This tells you whether the platform is catching real regressions. Third, the time from heal to decision: a few hours is healthy; weeks means the queue is being ignored and green is silently degrading.
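All three fall out of the heal log. A sketch, assuming the JSON-lines format shown earlier; the healedAt, decidedAt, and fixedBy fields are assumptions added for the illustration, not documented Assrt schema:

```ts
// heal-metrics.ts: the three co-create health numbers (a sketch)
import { readFileSync } from "node:fs";

type Heal = {
  review: "pending" | "accepted" | "rejected";
  healedAt: string;              // ISO timestamp, assumed field
  decidedAt?: string;            // ISO timestamp, assumed field
  fixedBy?: "scenario" | "code"; // what a rejection triggered, assumed field
};

const heals: Heal[] = readFileSync(".assrt/heal.log", "utf8")
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line) as Heal);

// 1. Queue depth: should stay small and stable.
const queueDepth = heals.filter((h) => h.review === "pending").length;

// 2. Of the rejected heals, how many led to scenario edits vs code fixes.
const rejected = heals.filter((h) => h.review === "rejected");
const scenarioEdits = rejected.filter((h) => h.fixedBy === "scenario").length;

// 3. Median hours from heal to decision: hours is healthy, weeks is not.
const hours = heals
  .filter((h) => h.decidedAt)
  .map((h) => (Date.parse(h.decidedAt!) - Date.parse(h.healedAt)) / 36e5)
  .sort((a, b) => a - b);
const medianHours = hours.length ? hours[Math.floor(hours.length / 2)] : 0;

console.log({ queueDepth, rejected: rejected.length, scenarioEdits, medianHours });
```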
