Testing Guide

Catching Bugs in AI-Generated Code: Why Automated Testing Beats Manual Review

The real skill that erodes when AI writes your code is not typing speed or syntax recall. It is the intuition for knowing whether code is correct. When you write code yourself, you build a mental model of where things are likely to break. When AI writes it, you get something that looks right, passes a quick review, and then fails in production three weeks later in a way you never would have introduced yourself. The solution is not to stop using AI. It is to build an automated verification layer that catches what your eyes cannot.


1. The Intuition Problem with AI-Generated Code

When you write code by hand, each decision leaves a trace in your memory. You remember choosing between two approaches and picking the one that handles null values correctly. You remember adding that extra check because the API sometimes returns an empty array instead of null. These micro-decisions accumulate into a gut feeling for what is likely to break.

AI-generated code bypasses this process entirely. The code arrives fully formed. You did not make the decisions, so you do not have the memory of making them. You cannot feel that something is off because you never had the context that would make it feel off. This is not a flaw in AI coding tools. It is a fundamental consequence of outsourcing decision-making.

The danger is subtle. The code looks correct. It often is correct for the happy path. But the failure modes are invisible to you because you never reasoned through the edge cases that would have surfaced them. A developer who wrote the same function manually would have at least considered the error path, even if they chose not to handle it. With AI-generated code, the consideration never happened.

2. Why Code Review Fails for AI Output

Traditional code review works by pattern matching. Reviewers scan for patterns they recognize as problematic: SQL injection, missing null checks, incorrect loop boundaries, leaked secrets. These patterns are learned through experience and are mostly visual. You see the pattern, you flag it.

AI-generated code rarely triggers these visual patterns. The code is syntactically clean, follows conventions, and handles the obvious cases. The problems are in what the code does not do, and missing behavior is much harder to spot in a review than incorrect behavior. A reviewer can see a wrong comparison operator. A reviewer cannot see the missing validation function that should have been called before the database write.

| Bug type | Caught by review? | Caught by automated E2E? |
| --- | --- | --- |
| Syntax error | Yes | Yes (build fails) |
| Missing error handling | Sometimes | Yes (test the error path) |
| Race condition | Rarely | Yes (concurrent test scenarios) |
| Integration mismatch | No | Yes (tests real user flows) |
| State leakage across sessions | No | Yes (multi-session tests) |
| Works locally, breaks in production | No | Yes (tests against real environments) |

The pattern is clear: code review catches syntactic and stylistic issues. Automated E2E testing catches behavioral issues. For human-written code, the balance between these two is roughly even. For AI-generated code, the balance shifts dramatically toward behavioral issues, which means the testing layer matters far more than the review layer.
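To make the distinction concrete, here is a minimal TypeScript sketch. The function names and the empty-array scenario are hypothetical, but they mirror the kind of bug described above: code that reads cleanly in review and only fails when the edge case is actually exercised.

```typescript
// Hypothetical AI-generated helper that would likely pass review:
// syntactically clean, correct for the happy path.
interface Order { total: number }

function latestTotalNaive(orders: Order[]): number {
  // Crashes when the upstream API returns an empty array instead of null.
  return orders[orders.length - 1].total;
}

// Behavior-tested version: the edge case is handled explicitly.
function latestTotal(orders: Order[] | null): number {
  if (!orders || orders.length === 0) return 0;
  return orders[orders.length - 1].total;
}

// A behavioral check exercises the path a reviewer cannot "see":
function throws(fn: () => unknown): boolean {
  try { fn(); return false; } catch { return true; }
}

console.log(throws(() => latestTotalNaive([]))); // true: naive version breaks
console.log(latestTotal([]));                    // 0
console.log(latestTotal([{ total: 5 }, { total: 9 }])); // 9
```

The wrong comparison operator from the example above is visible in a diff; the missing empty-array branch is not, which is why only a test that feeds in the empty array will catch it.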

3. The New Hire Framework: Treating AI Code with Healthy Skepticism

One of the most effective mental models for working with AI-generated code is to treat it exactly like code from a capable but unproven new hire. A new hire writes code that is often technically correct but may not account for the specific constraints, edge cases, and tribal knowledge of your codebase. You would not merge their first PR without thorough testing. The same standard should apply to AI output.

This means every AI-generated change gets an automated test run before anyone even looks at the code. Not a unit test run, because the AI probably generated the unit tests too and they share the same blind spots. An E2E test run that exercises the real user flows the change is supposed to affect. If the E2E tests pass, the code is worth reviewing. If they fail, the code goes back for revision regardless of how clean it looks.

This inverts the traditional PR workflow. Normally you review first, then test. With AI code, you test first, then review only what passes. The reasoning is simple: the tests catch more bugs per minute than your eyes do when reviewing AI-generated code, because the bugs are behavioral, not visual.

Run automated tests on every AI-generated change

Assrt auto-discovers test scenarios from your running app. Point it at your URL and it generates Playwright tests for every user flow it finds. No codebase understanding required.

4. Building an Automated Testing Layer for AI Code

The testing layer for AI-generated code needs to be automated because the volume of AI output makes manual testing impossible to sustain. A developer using AI coding tools might generate 10x the code they would write manually. That 10x increase in output requires a corresponding increase in verification capacity, and the only way to get that is automation.

Start with user flow coverage, not code coverage

Code coverage metrics are misleading for AI-generated code. An AI can generate code with 90% line coverage where the tests pass but the application is broken for real users. User flow coverage, the percentage of real user journeys that have at least one E2E test, is a far better metric. If a user can sign up, create a project, invite a collaborator, and export their data, each of those flows needs a test. The fact that the underlying code has 45% line coverage is irrelevant if all four flows work correctly.
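The metric itself is trivial to compute once you maintain a list of flows. A minimal sketch (flow and test names are illustrative, not the output of any real tool):

```typescript
// User flow coverage: percentage of known user journeys that have
// at least one E2E test, regardless of line coverage underneath.
function flowCoverage(flows: string[], testedFlows: Set<string>): number {
  if (flows.length === 0) return 100;
  const covered = flows.filter((f) => testedFlows.has(f)).length;
  return Math.round((covered / flows.length) * 100);
}

const flows = ['sign up', 'create project', 'invite collaborator', 'export data'];

console.log(flowCoverage(flows, new Set(flows)));       // 100
console.log(flowCoverage(flows, new Set(['sign up']))); // 25
```

The hard part is not the arithmetic but keeping the flow list honest, which is exactly what the auto-discovery approach in the next step addresses.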

Auto-discover flows instead of manually mapping them

For codebases where you did not write the code (which is increasingly true when AI writes it), auto-discovery tools that crawl the running application and map its user flows are invaluable. They find flows you did not know existed, including ones the AI added without you noticing. Running a discovery pass after each significant AI generation session is a good practice to catch scope creep and unintended functionality.

Use self-healing selectors

AI-generated frontends often change their DOM structure between generation sessions. A test that targeted a specific CSS class yesterday may break today because the AI chose a different class name. Self-healing selectors that adapt to structural changes without manual updates keep your test suite stable even when the AI regenerates components with slightly different markup.
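The core idea can be sketched without a browser. In this hypothetical helper, a test keeps an ordered list of candidate selectors and resolves the first one that still matches; the query function is abstracted here so the sketch stays runnable, but in a real suite it would wrap something like a Playwright locator lookup:

```typescript
// Does this selector match anything in the current DOM?
type Query = (selector: string) => boolean;

// Resolve the first candidate that still matches, so a regenerated
// component with different markup does not break the test.
function resolveSelector(candidates: string[], matches: Query): string {
  for (const sel of candidates) {
    if (matches(sel)) return sel;
  }
  throw new Error(`No candidate matched: ${candidates.join(', ')}`);
}

// Yesterday the AI emitted .btn-submit; today it regenerated the form,
// so the first candidate fails and the second "heals" the test.
const domToday = new Set(['[data-testid="submit"]', 'form button[type="submit"]']);
const picked = resolveSelector(
  ['.btn-submit', '[data-testid="submit"]', 'form button[type="submit"]'],
  (sel) => domToday.has(sel),
);
console.log(picked); // '[data-testid="submit"]'
```

Stable attributes such as data-testid make good first candidates, with structural selectors as fallbacks for markup the AI regenerates without them.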

5. Test Before Review: Inverting the Traditional Workflow

The traditional development workflow is: write code, review code, test code, deploy. For AI-generated code, a more effective workflow is: generate code, run automated tests, review only what passes, deploy. This reordering saves significant time because most AI-generated regressions are caught by tests, not by reviewers. Having humans review code that is already known to break user flows is a waste of reviewer time.

In practice, this means your CI pipeline runs the full E2E suite before the PR is even marked as ready for review. If tests fail, the PR stays in draft. The developer (or the AI) fixes the issues and pushes again. Only when all tests pass does the PR enter the review queue. Reviewers then focus on architecture, security, and maintainability rather than functional correctness, because functional correctness has already been verified.
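As one way to wire that up, here is a minimal GitHub Actions sketch, assuming a Playwright suite that runs with npx playwright test. The workflow name and steps are placeholders; the rule that actually blocks merging on a failed check is branch protection, configured in the repository settings rather than in this file.

```yaml
# Hypothetical CI sketch: run the E2E suite on every PR before review.
# Pair with a branch protection rule that requires the "e2e" check to pass.
name: e2e-gate
on: [pull_request]
jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test
```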

Teams that adopt this workflow report that review times drop by 40 to 60 percent, because reviewers are no longer debugging functional issues in their heads while trying to assess code quality. This separation of concerns (tests verify behavior, reviewers verify design) makes both activities more effective.

6. Tools for Automated AI Code Verification

The tooling landscape for testing AI-generated code is evolving rapidly. Here is how the major approaches compare for the specific challenge of verifying AI output.

| Approach | Strengths | Limitations | Cost |
| --- | --- | --- | --- |
| Manual Playwright | Full control, precise assertions | Slow to write, brittle selectors | Free (time cost) |
| Assrt | Auto-discovery, self-healing, real Playwright output | Best for web apps | Free, open-source |
| QA Wolf | Managed service, dedicated QA team | Closed-source, vendor lock-in | ~$7,500/mo |
| Momentic | AI-powered, visual testing | Proprietary YAML, Chrome only | Paid plans |
| Unit tests (AI-generated) | Fast, high coverage numbers | Share blind spots with AI code | Free |

The key insight is that tools which auto-discover test scenarios from the running application are uniquely suited to AI-generated codebases. You did not write the code, so you may not know all the flows that exist. Discovery tools find them for you and generate tests that document current behavior, correct or not. From there you can decide what to keep, what to fix, and what to remove.

7. Building the Habit: Making Testing Automatic

The hardest part of testing AI-generated code is not the tooling. It is the discipline. When AI generates code ten times faster, the temptation is to skip testing and move to the next feature. The developers who avoid production incidents with AI code are the ones who treat every AI-generated change as untrusted until proven otherwise.

Practical steps to build the habit: add a pre-commit hook that runs your E2E suite. Configure CI to block merges on test failure. Set a team rule that no AI-generated code ships without at least one E2E test covering the affected user flow. Make it easier to test than to skip testing, and the habit will follow.
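A pre-commit hook along those lines can be as small as this sketch, assuming an npm script named test:e2e that runs your suite (the script name is a placeholder; substitute whatever invokes your Playwright tests):

```shell
#!/bin/sh
# Hypothetical .git/hooks/pre-commit: block the commit unless E2E passes.
# "npm run test:e2e" is an assumed script name for your Playwright suite.
npm run test:e2e || {
  echo "E2E tests failed; commit blocked." >&2
  exit 1
}
```

For long-running suites, many teams run only a smoke subset in the hook and leave the full suite to CI, which keeps the hook fast enough that nobody is tempted to bypass it.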

AI coding tools are not going away. The developers who thrive with them will be the ones who pair fast generation with fast verification. The intuition you lose by not writing the code yourself can be replaced by an automated testing layer that is more thorough and more consistent than gut feeling ever was. The key is to build that layer before the first production incident forces you to.

Verify AI-Generated Code Without Reading Every Line

Assrt auto-discovers test scenarios from your running app and generates real Playwright tests for every flow it finds. Run it on every AI-generated PR and catch behavioral bugs before they reach production.

$ npx @m13v/assrt discover https://your-app.com