
Test coverage gaps after an AI rewrite: the three places they actually come from

Direct answer (verified 2026-05-18)

After an AI rewrite, coverage gaps come from three specific places: new branches the rewrite added that no test reaches, old branches it removed leaving orphaned tests that still light up green against unrelated code, and reshaped branches where the lines look covered but the side-effect ordering quietly changed. Line and branch coverage tools (Istanbul, c8, JaCoCo) only surface the first one. The only check that catches all three is to re-derive the test set from the live running app, not from the source the AI just rewrote. Authoritative reference for runtime behavior assertions: playwright.dev/docs/test-assertions.

Matthew Diakonov

The conversation about AI and testing is mostly stuck on one case: the AI wrote new code, the new code has no tests, write some. That case is real but easy. The case nobody writes about is harder and more dangerous: the AI rewrote code that already had tests, the tests still pass, and the new behavior is silently wrong. Green checkmark. Shipped to staging. Three days later a user reports something subtle that nobody can repro from the test names.

This page is about why that happens. Three specific gap sources, the reason coverage tools miss two of them, and the only check that actually surfaces all three.

The trap: tests as constraint, not contract

When you ask a coding agent to rewrite a file, the agent loads the file, the surrounding files, and the test file. The test file is input, not oracle. The agent reads the assertions, understands them as constraints on the output it has to produce, and writes code that satisfies those constraints. That is not the same as understanding the behavior the tests were trying to capture. The assertions are a finite, lossy projection of intent; the rewrite is fitted to the projection, not the intent.

So when the rewrite passes all the tests, you have learned exactly one thing: the new code satisfies the same finite set of assertions the old code did. You have not learned that the new code preserves the rest of the old behavior. The rest of the old behavior is, by definition, the part the tests did not pin down. That is where the gaps live.

There are three categories of gap. They look similar in a code review and they sit in completely different blind spots in your tooling.

Gap 1: new branches the rewrite added that no test reaches

The rewrite is rarely a pure structural restatement. Coding agents tend to add safety: a retry on a network call that did not retry before, a fallback when a field is missing, a guard against an empty list. Sometimes the additions are correct and welcome. Sometimes they change observable behavior. Either way, the branches are new. No test was written for them because they did not exist when the test suite was written.

This is the one gap that line and branch coverage tools will actually catch. The new branch shows up as uncovered, coverage drops a point or two, and a reviewer who reads the diff carefully can ask the obvious question: what tests do we need for the new behavior? The bad news is that most teams either do not look at coverage deltas on a rewrite (the file is mostly new, so the delta is noisy) or they treat the gap as cosmetic and merge anyway.
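A minimal sketch of what this gap looks like, using a hypothetical display-name helper (the function names and the fallback value are invented for illustration, not taken from any real codebase). The existing test passes against both versions, and the new fallback branch is never reached.

```typescript
// Hypothetical example: names and behavior are illustrative only.
type User = { name?: string };

// Before the rewrite: one path, fully exercised by the existing test.
function displayNameOld(user: User): string {
  return user.name!.trim();
}

// After the rewrite: the agent added a fallback for a missing name.
function displayNameNew(user: User): string {
  if (user.name === undefined) {
    // New branch. The existing suite never builds a user without a name,
    // so a branch coverage report would show this arm as uncovered.
    return "Anonymous";
  }
  return user.name.trim();
}

// The only existing test. It passes against both versions.
function existingTest(displayName: (u: User) => string): boolean {
  return displayName({ name: "  Ada " }) === "Ada";
}
```

The old version throws on a user with no name and the new one returns a placeholder. Whether that change is welcome is a product decision, which is why the uncovered branch deserves a question in review and not just a merge.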

Gap 2: branches the rewrite removed, with orphaned tests that still pass

This is the dangerous one. The old code had a defensive guard, an obscure error path, or a feature flag branch. The rewrite did not preserve it. The test that covered it still exists, still runs, and still passes. It passes because the input it constructs no longer reaches the removed branch; it hits some other path in the new code that happens to satisfy the same assertion. The test name says it tests behavior X. Behavior X is gone. The test reports green and nobody is the wiser.

Coverage tools cannot help here. From their perspective the test file runs to completion and asserts something that is true about the new code. Coverage is a measurement of what executes, not a measurement of whether the right thing executes for the right reason. A green orphaned test reports as covered code; the gap is invisible.

The only way to find these is to read every test that touches the rewritten file and ask, with the new code in front of you, whether the test still tests what its name claims. On a non-trivial rewrite that is hours of work, and it is the kind of work AI agents themselves are bad at, because the agent that just wrote the rewrite is the most biased reader you could put on the task.
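To make the orphaned-test case concrete, here is a sketch with a hypothetical coupon function. All names, amounts, and the fixture are assumptions made up for the example. The test is named after the guard, the guard is gone, and the test stays green because its fixture lands on a path that returns the same number.

```typescript
// Hypothetical coupon example; names and amounts are illustrative only.
type Coupon = { active: boolean; percentOff: number };

// Old version: inactive coupons are ignored by an explicit guard.
function applyCouponOld(priceCents: number, coupon: Coupon): number {
  if (!coupon.active) return priceCents; // the guard the test was written for
  return priceCents - (priceCents * coupon.percentOff) / 100;
}

// Rewritten version: the guard was not preserved.
function applyCouponNew(priceCents: number, coupon: Coupon): number {
  return priceCents - (priceCents * coupon.percentOff) / 100;
}

// Existing test, named "ignores inactive coupons". Its fixture happens to use
// percentOff: 0, so the discount path returns the same value the guard did.
function ignoresInactiveCoupons(
  apply: (priceCents: number, coupon: Coupon) => number
): boolean {
  return apply(10000, { active: false, percentOff: 0 }) === 10000;
}

// A fixture that actually distinguishes the two versions.
const inactiveTwentyOff: Coupon = { active: false, percentOff: 20 };
```

With `inactiveTwentyOff`, the old version returns 10000 and the new one returns 8000. Every line of the new function runs under the existing test, so a coverage report has nothing to flag.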

Gap 3: reshaped branches where the lines look covered but the timing shifted

The rewrite kept the same inputs and the same outputs, but it reordered the steps in between. A call that used to happen before state X was set now happens after. A side effect that used to fire once per request now fires once per retry. A read that used to be synchronous is now wrapped in a promise. The unit tests still pass because they pin inputs to outputs and the outputs are unchanged. Coverage still reports 100% on the file. Anything downstream that depended on the old ordering is now wrong.

This gap is invisible to anything that measures the code in isolation. The lines run, the assertions hold, the file looks green. The only place the gap is detectable is from outside the file, at the level of the running system, where the ordering actually matters. A user reproduction of the regression will hit it. A unit test usually will not.
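A small sketch of the reordering case, assuming a hypothetical checkout function with two side effects (the `Effects` interface and event names are invented for the example). An input-to-output test cannot tell the versions apart; a recorded trace of the effects can.

```typescript
// Hypothetical checkout example; the interface and names are illustrative only.
interface Effects {
  charge(orderId: string): void;
  sendReceipt(orderId: string): void;
}

// Old version: charge first, then send the receipt.
function checkoutOld(orderId: string, fx: Effects): string {
  fx.charge(orderId);
  fx.sendReceipt(orderId);
  return "confirmed";
}

// Rewritten version: same calls, same return value, different order.
function checkoutNew(orderId: string, fx: Effects): string {
  fx.sendReceipt(orderId);
  fx.charge(orderId);
  return "confirmed";
}

// Run a checkout against recording effects and return both the result
// and the ordered list of side effects it produced.
function trace(
  checkout: (orderId: string, fx: Effects) => string
): { result: string; events: string[] } {
  const events: string[] = [];
  const fx: Effects = {
    charge: (id) => { events.push("charge:" + id); },
    sendReceipt: (id) => { events.push("receipt:" + id); },
  };
  const result = checkout("o1", fx);
  return { result, events };
}
```

Comparing `trace(checkoutOld).result` with `trace(checkoutNew).result` shows no difference, and both functions execute every line. Only the `events` arrays differ, which is the kind of signal an assertion on the running system records and a return-value assertion does not.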

What "all tests pass" actually means before and after a rewrite

The phrase reads the same. The information content is different.

The same green checkmark, two very different claims

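Before the rewrite: the tests were written against the code as it exists. Every assertion was authored by a human (or AI) who could see the implementation it targeted. The set of assertions reflects what someone thought was worth pinning. If the tests are green, the code matches the assertions and the assertions were intentional.

  • Assertions and implementation were authored in conversation with each other
  • Coverage tools measure a meaningful surface
  • Green = code matches what someone thought mattered

After the rewrite: the implementation was fitted to the assertions, not the other way round. If the tests are green, the new code satisfies the same finite set of assertions the old code did, and that is all the checkmark tells you.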

Why line and branch coverage is a 33% signal here

Of the three gap sources above, coverage tools (Istanbul, c8, JaCoCo, Coverage.py) detect one. They detect new uncovered branches because that is what they measure: which lines and branches your test runs walked. New code with no test produces a visible uncovered region. Useful.

They do not detect orphaned tests because the test still runs and the assertion still passes. The branch the test was originally trying to exercise is gone, but the coverage tool only sees that some code path ran end to end. Whether the path that ran is the path the test was named after is invisible to it.

They do not detect reshaped paths because the same lines run, in the same module, with the same final return value. The reordering that broke a downstream dependency does not change any of the numbers a coverage tool collects.
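For reference, the one signal coverage does give you is a hit count per branch arm. The sketch below assumes an Istanbul-style report (the `b` map in `coverage-final.json`, keyed by branch id with one hit count per arm). The type here is a trimmed-down assumption, not the full schema, and the sample data is made up.

```typescript
// Assumed shape: a subset of one file's entry in an Istanbul-style
// coverage-final.json. Real entries carry more fields than this.
type FileCoverage = {
  path: string;
  b: Record<string, number[]>; // branch id -> hit count for each arm
};

// List every branch arm that no test executed, as "branchId:armIndex".
function uncoveredBranchArms(cov: FileCoverage): string[] {
  const missing: string[] = [];
  for (const id of Object.keys(cov.b)) {
    cov.b[id].forEach((hits, arm) => {
      if (hits === 0) missing.push(id + ":" + arm);
    });
  }
  return missing;
}

// Made-up sample: branch 0 has one arm that never ran (a gap-1 candidate),
// branch 1 is fully exercised.
const afterRewrite: FileCoverage = {
  path: "src/displayName.ts",
  b: { "0": [0, 3], "1": [2, 1] },
};
```

Note what the numbers cannot express: an orphaned test and a reordered side effect both leave every arm with a nonzero count, so this list comes back empty for gaps 2 and 3.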

You can build mutation-testing setups (Stryker, Pitest) that catch more of this, and they help, but they are slow, noisy, and they test the same source the AI rewrote against the same assertions the AI fit to. The deeper problem is that all source-anchored measurement is downstream of the rewrite. You need a signal that does not start from the source.

The fix: re-derive the test set from the running app, not the rewritten source

If the problem is that all three gaps live in places no source-anchored measurement can see, the fix is to anchor somewhere else: the live, deployed, running application. Walk it the way a user would. Record what actually happens. Generate assertions against the observed behavior. The assertions now reflect the running system, not the rewritten file.

Two practical consequences. First, any new branch the rewrite added that produces a new user-visible state shows up as a new scenario; the gap moves from invisible to nameable. Second, any old behavior the rewrite removed that used to be reachable from the UI either still works (the test is real, the rewrite preserved it) or it fails to reproduce (the test is genuinely orphaned, and you now know). The third gap, the reordering one, is the one this approach is best at: any side-effect timing that mattered to a downstream surface (network calls, DOM transitions, error messages) gets recorded and asserted, because the test was authored against the live trace, not against the file.

This is the mechanism Assrt's discovery loop implements, and it is worth looking at the actual code, because there is a real difference between a tool that claims to do this and a tool whose source you can read.

The anchor fact: Assrt's discovery walks the live app

In ~/assrt-mcp/src/core/agent.ts, lines 593 to 666, there is a method called queueDiscoverPage and a partner method generateDiscoveryCases. The loop is short. For each new URL the agent encounters, it calls browser.snapshot() against the live page, captures the accessibility tree as it currently renders, and asks a model to produce #Case scenarios against that visible state.

The discovery system prompt at agent.ts:264 reads, verbatim: "Reference ACTUAL buttons/links/inputs visible on the page." The cap on the walk is set by MAX_DISCOVERED_PAGES = 20 at agent.ts:278. The skip list (/logout, /api/, javascript:, etc.) is two lines below.

The point is that there is no AST parse of the rewritten file anywhere in this loop. The discovery model never sees the source. It sees what a user sees. The test set it produces is a function of the deployed behavior, not the file the AI just wrote.
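To show the shape of such a loop, here is a minimal sketch of a capped walk with a skip list. This is not Assrt's code. The `LiveBrowser` and `PageSnapshot` interfaces and the `discoverPages` function are hypothetical, the walk is synchronous only to keep the example short, and the only details borrowed from the description above are the cap of 20 pages and the three skip patterns it names.

```typescript
// Illustrative sketch only; not taken from agent.ts.
const MAX_DISCOVERED_PAGES = 20; // cap quoted in the text above
const SKIP_PATTERNS = ["/logout", "/api/", "javascript:"]; // the patterns the text names

// Hypothetical stand-ins for a real browser driver.
interface PageSnapshot {
  tree: string;    // rendered accessibility tree, as text
  links: string[]; // URLs reachable from this page
}
interface LiveBrowser {
  snapshot(url: string): PageSnapshot;
}

// Breadth-first walk over the live app, starting at startUrl.
// The input is what the browser renders; no source file is read.
function discoverPages(
  browser: LiveBrowser,
  startUrl: string,
  maxPages: number = MAX_DISCOVERED_PAGES
): { url: string; snapshot: PageSnapshot }[] {
  const queue: string[] = [startUrl];
  const seen: string[] = [startUrl];
  const pages: { url: string; snapshot: PageSnapshot }[] = [];

  while (queue.length > 0 && pages.length < maxPages) {
    const url = queue.shift() as string;
    const snapshot = browser.snapshot(url);
    pages.push({ url, snapshot });

    for (const link of snapshot.links) {
      const skipped = SKIP_PATTERNS.some((p) => link.includes(p));
      if (!skipped && !seen.includes(link)) {
        seen.push(link);
        queue.push(link);
      }
    }
  }
  return pages;
}
```

Each collected snapshot would then be handed to a model to propose scenarios against the visible state. That step is omitted here because its prompt and output format are specific to the tool.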

The output is real Playwright code, written to your repo. You can read it, edit it, commit it, run it in any CI pipeline. No YAML, no proprietary format, no vendor lock-in. The full source is open at github.com/assrt-ai/assrt-mcp.

A practical playbook for a rewrite review

Concrete order of operations when an AI agent has just rewritten a file or a module and the existing test suite reports green.

After an AI rewrite, before you merge

  • Run the old test suite. Note any tests that flip from pass to fail. These are the obvious regressions.
  • Run coverage. Look at the new uncovered branches. These are gap-source 1; ask for tests on the new behavior.
  • Re-derive the test set from the running new version on staging. Walk the user flows the rewrite touches. Generate fresh scenarios.
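  • Diff the fresh scenarios against the existing test names. A scenario that appears only in the fresh set is a gap-source 1 or 3 candidate (new or changed observable behavior). An existing test with no counterpart in the fresh set is a gap-source 2 candidate (possibly orphaned).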
  • For each delta, decide intentional or unintentional. Intentional changes get a new test or an updated test name. Unintentional ones revert.
  • Commit both the new generated tests and the updated old tests. The old suite stays as a regression boundary, not as proof of safety.
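The diff step in the list above can be sketched as a comparison of two lists of names. This assumes, optimistically, that existing test names and generated scenario names are comparable after light normalization; in practice a human will need to match some of them by hand. The function and field names are hypothetical.

```typescript
// Illustrative sketch; names are hypothetical.

// Reduce a test or scenario name to lowercase words so trivial
// differences in casing and punctuation do not count as a delta.
function normalize(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Compare the existing suite's test names with freshly discovered scenarios.
function diffScenarios(
  existingTests: string[],
  discoveredScenarios: string[]
): { missingFromDiscovery: string[]; newInDiscovery: string[] } {
  const existing = existingTests.map(normalize);
  const discovered = discoveredScenarios.map(normalize);
  return {
    // Old tests whose behavior did not reproduce on the running app:
    // candidates for orphaned tests.
    missingFromDiscovery: existing.filter((n) => !discovered.includes(n)),
    // Observed behavior that no existing test names:
    // candidates for new or changed behavior.
    newInDiscovery: discovered.filter((n) => !existing.includes(n)),
  };
}
```

Both lists are questions, not verdicts. Each entry still needs the intentional-or-unintentional decision from the playbook.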

Honest counterargument: this is more work, not less

The whole appeal of an AI rewrite is that it is fast. The whole appeal of green tests is that they let you ship. This page is arguing that on a rewrite specifically, you should treat green tests as one input and add a separate runtime-anchored signal on top. That is more work than "tests pass, merge".

The pragmatic version is to scope this work to the rewrites that matter. A 30-line helper function does not need this treatment. A payment flow rewrite does. A migration from one framework to another, where every file is rewritten, absolutely does. The cost of running a discovery pass against a staging deploy is measured in minutes, not hours, and you only pay it when the rewrite touches a surface that humans interact with.

The version that does not work is treating the existing green suite as sufficient evidence on a non-trivial rewrite, then patching the post-deploy regression as a separate ticket. That is the path that produces the "the AI passed all the tests but broke the checkout" story your team will tell at the next retro.

So: where do the gaps come from, and what catches them?

Recap, because the three gap sources are the load-bearing claim of this page and they tend to blur together in conversation.

  • Gap 1 (new branches). Caught by line and branch coverage if you actually look at the delta after a rewrite. Easy to spot, easy to fix, often ignored.
  • Gap 2 (orphaned tests). Invisible to coverage. Found only by reading every test that touches the rewritten code against the new implementation, or by re-deriving the test set from runtime behavior and noticing what is missing from the new derivation.
  • Gap 3 (reshaped timing). Invisible to coverage and to most unit tests. Found by asserting against the live system, where ordering and side effects are observable. This is where runtime-anchored discovery does the most work.

The throughline: source-anchored measurement is the wrong tool after an AI rewrite, because the source is the artefact you are trying to verify. Move the anchor to the running app and most of these gaps stop being silent.

Just shipped an AI rewrite and not sure what you missed?

Bring the file (or the module, or the whole migration). We will run discovery against your staging deploy, diff the generated scenarios against your existing tests, and walk through what the gap actually contains before it ships to prod.

Common questions

Why are tests passing after an AI rewrite not enough?

Because the AI saw the tests as part of the input. When a coding agent is asked to rewrite a module, it reads the surrounding test file as a constraint, then writes code that makes those exact assertions pass. That is not the same as preserving behavior. Any behavior the tests did not pin down is fair game for the rewrite to change, and any behavior the tests indirectly relied on (call ordering, side effects, intermediate state) can shift without flipping a single assertion red. The green checkmark proves the assertions still match the implementation; it does not prove the implementation still matches what users do.

Where do new coverage gaps come from specifically after an AI rewrite?

Three places. One, paths the rewrite quietly added (a new fallback, a new retry branch, a new auth code path) that no test covers, because the old version did not have them when the suite was written. Two, paths the rewrite quietly removed (an obscure error case, a defensive guard, a feature flag branch), so the tests that covered them are now orphaned but still pass, because their inputs land on some other path that happens to satisfy the same assertion. Three, paths the rewrite kept but reshaped (same inputs, same outputs, different sequence of side effects), so coverage tools count them as covered but the real-world behavior diverges.

Will a coverage tool like Istanbul or c8 catch these gaps?

Only the first one, partially. Line and branch coverage tools count which statements your tests execute. If the rewrite added a new branch that no test reaches, coverage drops and the dashboard flags it. The other two gaps are invisible to coverage: orphaned tests covering removed paths register as covered code (the test runs and the assertion passes against new code that happens to match), and reshaped paths register at the same coverage number because the same lines run. Coverage is a necessary but very partial signal here, and it is the signal teams overweight after a rewrite.

What is the actual fix for AI-rewrite coverage gaps?

Re-derive the test set from the running application, not from the source the AI just rewrote. Walk the real user flows on the deployed app, record what actually happens (network calls, DOM transitions, error states), and turn those into assertions. Anything the rewrite changed that the user can observe will show up in the diff. Anything the rewrite added that the user can reach will get a new scenario. The test surface follows real behavior instead of inheriting the old implementation's shape.

How does Assrt help with this?

Assrt has a discovery loop that walks the live deployed app, not the source code. The loop is in ~/assrt-mcp/src/core/agent.ts around lines 593 to 666. It calls browser.snapshot() against each page, captures the accessibility tree of what is actually rendered right now, and asks a model to generate Playwright test cases against that visible state. The discovery prompt at agent.ts:264 reads, literally: 'Reference ACTUAL buttons/links/inputs visible on the page.' The constant MAX_DISCOVERED_PAGES = 20 at agent.ts:278 caps the walk so it is finite. After an AI rewrite, you point Assrt at the new staging deploy and it generates a test set anchored to the new live behavior, not the AI's view of the old source.

Should we keep the old tests after an AI rewrite, throw them out, or both?

Keep them as a regression boundary, run them, and then assume they cover less than they did before. The old suite is still useful for spotting cases where the rewrite broke a user-visible contract the AI was supposed to preserve. It is not useful as evidence that the rewrite is safe. Pair it with a freshly-discovered scenario set against the running new version and treat any test that exists only in one of the two suites as a question to answer, not a result to ignore.

Does this same problem apply to AI refactors, not just full rewrites?

Yes, and arguably more often: refactors are smaller, so the team trusts the green checkmark more. The same three gap types appear. A refactor that consolidates two helpers into one can add an implicit branch (one of the original behaviors becomes the new default, the other becomes a flag). A refactor that 'simplifies' an error path can remove a guard. A refactor that reorders operations changes when side effects happen. Coverage moves a percent or two and CI stays green. The fix is the same: verify against runtime behavior, not source structure.

How is this different from regular AI-generated-code testing advice?

Most writing on AI and testing assumes the AI wrote new code that has no tests yet, so the advice is 'generate tests alongside the code'. The rewrite case is harder because the code already had tests, the tests still pass, and that combination is the trap. The advice has to be different: do not just generate more tests, regenerate the test set from the new running behavior so the old test shape stops being the ceiling on what you can detect.
