Testing Guide

E2E Testing vs Unit Tests for AI-Generated Code: Why PRs Pass Review but Break in Production

Teams that went all-in on AI code generation are discovering a pattern: code quality looks fine, unit tests pass, code review approves, and then users cannot complete basic flows in production. The code is structured in ways no human would choose, and it behaves wrong under conditions nobody thought to test. The problem is not the AI. The problem is that unit tests and code review were designed for human-written code, and AI code breaks in fundamentally different ways.


1. The "Plausible Looking Code" Problem

AI-generated code has a distinctive quality that makes it unusually hard to review: it looks correct. The variable names are sensible. The patterns are familiar. The structure follows conventions. Reviewers scan for things that look wrong, and AI code rarely triggers that signal.

The bugs in AI code are not in what you can see. They are in what is missing. A checkout flow that handles the happy path but has no error recovery when the payment processor returns an unexpected response. A form validation that checks format but not business rules. A state update that works in isolation but creates a race condition when two components update simultaneously.

Teams that have audited six months of PRs after adopting AI code generation consistently report the same finding: the code got worse in ways nobody predicted. Not worse in terms of readability or structure. Worse in terms of behavioral correctness under real-world conditions. The PRs looked better than ever. The production incidents told a different story.

2. Why Unit Tests Miss AI-Introduced Bugs

Unit tests are excellent at verifying that individual functions return expected values for given inputs. They are the foundation of most testing strategies. But for AI-generated code, they have a critical blind spot: unit tests verify that the function does what the test expects, not that the user can complete their task.

When AI generates code and the corresponding unit tests, both come from the same model with the same assumptions. The test mocks the external dependency the same way the code uses it. The test provides inputs that match the happy path the code was built for. The test asserts on the return value, not on the downstream behavior. Everything passes. Everything is wrong.
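That failure mode is easy to reproduce in miniature. Here is a sketch in plain TypeScript, where submitPayment, the Charge type, and the status strings are all hypothetical stand-ins for a real payment integration:

```typescript
// Hypothetical checkout handler: names and status values are illustrative only.
type PaymentResult = { status: string };
type Charge = (amountCents: number) => Promise<PaymentResult>;

async function submitPayment(charge: Charge, amountCents: number): Promise<string> {
  const result = await charge(amountCents);
  // Happy-path assumption baked in at generation time: status is binary.
  return result.status === "succeeded" ? "order confirmed" : "payment failed";
}

// The AI-generated unit test mocks the processor with the same assumption...
const mockCharge: Charge = async () => ({ status: "succeeded" });

// ...but a real processor can also return intermediate states the code never models.
const realCharge: Charge = async () => ({ status: "requires_action" });

async function main() {
  console.log(await submitPayment(mockCharge, 4999)); // "order confirmed" — unit test passes
  console.log(await submitPayment(realCharge, 4999)); // "payment failed" — a valid card is rejected
}

main();
```

The mock and the handler encode the same wrong assumption, so the test cannot disagree with the code. An E2E test against a sandbox processor, which returns the full range of real statuses, is where the gap surfaces.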

| Bug type | Unit test catches it? | E2E test catches it? |
| --- | --- | --- |
| Function returns wrong value | Yes | Yes |
| User cannot complete checkout | No (mock passes) | Yes |
| Race condition on double-click | No (runs sequentially) | Yes |
| Error message never shown | No (tests return values) | Yes |
| State leaks across navigation | No (isolated context) | Yes |
| API timeout not handled | No (mock always responds) | Yes (with network simulation) |

The pattern is clear: unit tests pass because the function returns what the mock expects. E2E tests fail because the user cannot actually complete the flow. For AI-generated code, this gap is not a minor testing technicality. It is the difference between shipping working software and shipping something that looks like it works.
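The double-click row deserves a closer look, because it shows how the test runner's own structure hides the bug. A self-contained sketch, with all names hypothetical and no framework assumed:

```typescript
// Simulates a submit handler with and without an in-flight guard.
// Everything here is a toy model; real code would call a real API.
async function simulateClicks(guarded: boolean): Promise<number> {
  let ordersCreated = 0;
  let isSubmitting = false;

  const createOrder = async () => {
    await new Promise((resolve) => setTimeout(resolve, 10)); // simulated network latency
    ordersCreated += 1;
  };

  const handleSubmitClick = async () => {
    if (guarded && isSubmitting) return; // the guard AI-generated handlers typically omit
    isSubmitting = true;
    try {
      await createOrder();
    } finally {
      isSubmitting = false;
    }
  };

  // Two clicks before the first request resolves, as a real user might produce.
  await Promise.all([handleSubmitClick(), handleSubmitClick()]);
  return ordersCreated;
}

simulateClicks(false).then((n) => console.log("unguarded:", n)); // 2 — duplicate order
simulateClicks(true).then((n) => console.log("guarded:", n));   // 1 — guard holds
```

A sequential unit test awaits one call and sees one order, so it passes either way. An E2E test that issues two rapid clicks observes the duplicate directly.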

3. What E2E Tests Catch That Nothing Else Does

E2E tests operate at the user level. They open a real browser, click real buttons, type into real forms, and assert on what appears on screen. This means they catch every bug that affects the user experience, regardless of where in the stack the bug originates.

For AI-generated code specifically, E2E tests are valuable because they test the integration between components. AI tends to generate each component in isolation, optimizing for that component's local correctness. The bugs appear when components interact: when a form component passes data to an API handler that passes it to a database layer, and somewhere in that chain an assumption breaks.

Multi-step flow failures

A checkout flow has five steps. The AI generated each step correctly in isolation. But step three assumes the user came from step two (which sets a state variable), and if the user navigates directly to step three via URL, the state variable is undefined and the page crashes. No unit test catches this because each step is tested individually. An E2E test that navigates the full flow catches it immediately.
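The bug reduces to a few lines. In this sketch the step functions and the shippingAddress field are hypothetical, but the shape is the one described above: shared state written by one step and blindly read by the next:

```typescript
// Minimal sketch of the deep-link bug; step names and fields are hypothetical.
type CheckoutState = { shippingAddress?: string };

function stepTwo(state: CheckoutState): void {
  state.shippingAddress = "123 Main St"; // step two writes the state step three reads
}

function stepThree(state: CheckoutState): string {
  // Generated in isolation, step three assumes step two always ran first.
  return "Shipping to " + state.shippingAddress!.toUpperCase();
}

// In-order flow: works, and a unit test that seeds the state also passes.
const flowState: CheckoutState = {};
stepTwo(flowState);
console.log(stepThree(flowState)); // "Shipping to 123 MAIN ST"

// Deep link: the user lands on step three directly, so step two never ran.
try {
  stepThree({});
} catch (err) {
  console.log("step three crashed:", (err as Error).message);
}
```

A per-step unit test seeds the state itself and never exercises the empty case; an E2E test that navigates straight to the step's URL hits it on the first run.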

Visual regression

AI-generated CSS often works in the default viewport but breaks at edge sizes. A modal that renders correctly at 1440px but overflows at 375px. A table that looks fine with three rows but becomes unreadable with fifty. E2E tests with screenshot comparison catch these regressions automatically.

Discover every flow in your AI-generated app

Assrt crawls your running app and generates Playwright tests for every user flow it finds. No codebase knowledge required. Point it at a URL and get a full test suite.

Get Started

4. How to Audit Your PR Pipeline After Adopting AI

If your team has been using AI code generation for several months without E2E testing, you likely have a backlog of behavioral bugs that passed code review. Here is how to audit and fix your pipeline:

Step 1: Identify AI-generated PRs

Look at your recent PRs and identify which ones contain significant AI-generated code. You probably know which ones they are. If not, look for the telltale signs: unusually consistent style across a large changeset, comprehensive but generic error handling, and code that follows patterns you have never seen anyone on your team use.

Step 2: Test the user flows those PRs affected

For each AI-generated PR, identify the user flows it touched. Then manually test those flows with particular attention to error cases, edge inputs, and multi-step sequences. Document every bug you find. This gives you a concrete picture of the quality gap between what code review approved and what actually works.

Step 3: Build E2E tests for those flows

Turn every bug you found into an E2E test. Then turn every flow you tested (even the ones that worked) into an E2E test. This creates a regression safety net that prevents those bugs from recurring and catches similar bugs in future AI-generated PRs.
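As a sketch of what such a regression test can look like, here is a Playwright test for a deep-link bug like the checkout example in section 3. The route and the accessible names are placeholders, not a real app's:

```typescript
import { test, expect } from "@playwright/test";

// Regression test for "direct navigation to step three crashes the page".
// The URL path and the text below are hypothetical; substitute your app's own.
test("checkout step three survives direct navigation", async ({ page }) => {
  await page.goto("/checkout/step-3"); // deep link, skipping steps one and two

  // The page must either render usable content or redirect to an earlier step;
  // it must not crash to a blank screen or an error boundary.
  await expect(page.getByRole("heading", { name: /checkout/i })).toBeVisible();
  await expect(page.getByText(/something went wrong/i)).not.toBeVisible();
});
```

Note what the test does not know: nothing about the state variable, the components, or the fix. It asserts on what the user sees, which is why it keeps catching regressions even after the implementation is rewritten.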

Step 4: Make E2E tests a merge requirement

Add the E2E test suite to your CI pipeline and block merges when tests fail. This is the structural change that prevents the problem from recurring. Code review remains valuable for architecture and intent, but correctness verification shifts to the automated pipeline where it belongs.
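The wiring is minimal: CI runs npx playwright test and a non-zero exit blocks the merge. A sketch of a CI-oriented playwright.config.ts, where the directory, port, and reporter choices are suggestions rather than requirements:

```typescript
// playwright.config.ts — illustrative CI-oriented settings; adjust to your setup.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  testDir: "./e2e",
  forbidOnly: !!process.env.CI,    // fail CI if a stray test.only slipped into a PR
  retries: process.env.CI ? 2 : 0, // retry flaky runs in CI only
  reporter: process.env.CI ? "github" : "list",
  use: {
    baseURL: process.env.BASE_URL ?? "http://localhost:3000", // staging URL in CI
    trace: "on-first-retry",       // keep a trace for debugging CI failures
  },
});
```

With retries and traces configured, a red pipeline usually means a real behavioral regression rather than flake, which makes blocking merges on it politically sustainable.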

5. Unit Tests vs E2E Tests: A Practical Comparison for AI Code

| Factor | Unit tests | E2E tests |
| --- | --- | --- |
| Speed | Milliseconds per test | Seconds per test |
| Coverage of AI blind spots | Low (tests what AI expects) | High (tests what users experience) |
| Integration bugs | Misses (isolated by design) | Catches (tests the full stack) |
| Setup complexity | Low | Medium (needs running app) |
| Maintenance cost | Low (stable APIs) | Medium (UI changes break selectors) |
| Value for AI code review | Necessary but insufficient | Critical; catches what review misses |

The answer is not to replace unit tests with E2E tests. Both serve a purpose. The shift is in recognizing that for AI-generated code, E2E tests provide disproportionate value because they test at the level where AI bugs manifest. Run both. But if you can only add one type of test to your AI code review pipeline, make it E2E.

6. Building E2E Coverage for AI-Heavy Teams

The biggest barrier to E2E testing is the upfront effort of writing tests. For teams already stretched thin, writing Playwright tests for every user flow feels like a luxury. Three approaches make this practical:

Auto-discovery tools

Instead of manually identifying and scripting every flow, use tools that crawl your running application and discover flows automatically. Assrt does this by navigating your app, finding every clickable path, and generating Playwright test files for each flow. Run npx @m13v/assrt discover https://your-app.com against your staging environment and you get a baseline test suite in minutes that covers every flow the tool can find. The generated tests are real Playwright code you can inspect, modify, and commit to your repo. No proprietary format, no vendor lock-in.

Self-healing selectors

The biggest maintenance cost of E2E tests is broken selectors when the UI changes. AI-generated UIs change frequently because each generation might produce slightly different markup. Self-healing selectors adapt to these changes automatically, reducing the maintenance burden that makes teams abandon their E2E suites.
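The idea reduces to a ranked list of fallback strategies. A toy model in TypeScript, where the selector syntax is illustrative and real tools are considerably more sophisticated:

```typescript
// Toy model of self-healing: try a ranked list of selector strategies and use
// the most stable one that still matches, so tests survive regenerated markup.
type Page = Set<string>; // stand-in for a DOM: the selectors that currently match

function resolve(candidates: string[], page: Page): string | null {
  for (const selector of candidates) {
    if (page.has(selector)) return selector; // most stable strategy that still matches
  }
  return null; // every strategy failed: the test genuinely needs a human look
}

const submitButton = [
  '[data-testid="submit"]',     // most stable, if the app sets test ids
  'role=button[name="Submit"]', // accessible-role fallback
  "text=Submit",                // visible-text fallback, most fragile
];

// Before regeneration the test id exists; afterwards only the visible text survives.
console.log(resolve(submitButton, new Set(['[data-testid="submit"]', "text=Submit"])));
console.log(resolve(submitButton, new Set(["text=Submit"])));
```

The key design choice is the ranking: prefer attributes that encode intent (test ids, roles) and fall back to presentation (text) only when nothing better survives the regeneration.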

Incremental coverage

You do not need 100% E2E coverage on day one. Start with the five most critical user flows. Add a test every time a bug reaches production. Within a few weeks, you will have meaningful coverage of the flows that matter most, and each AI-generated PR will run against that growing safety net.

The teams that avoid the worst outcomes from AI code generation are not the ones who stopped using AI. They are the ones who added E2E testing to every PR, not just unit tests. The automated pipeline catches the class of bugs that code review and unit tests structurally cannot see, and that is the class of bugs AI code introduces most often.

Audit Your AI-Generated Codebase

Assrt auto-discovers test scenarios from your running app and generates real Playwright tests for every user flow. Build the E2E safety net that catches what unit tests and code review miss. Open-source, free, no vendor lock-in.

$ npx @m13v/assrt discover https://your-app.com