Why E2E Testing Catches What Code Review Misses with AI-Generated Code
A developer on r/webdev recently shared the results of auditing six months of PRs after their team went all-in on AI code generation. The findings were striking: code review caught almost none of the bugs that made it to production. The code was clean, well-structured, and logically sound in isolation. But when real users interacted with it, flows broke. Checkout failed on declined cards. Auth tokens expired mid-session without re-prompting. Data saved on one page disappeared on another. These are the bugs that only end-to-end testing catches.
“After 6 months of AI PRs, 78% of production bugs were in flows that unit tests never touched.”
Engineering team audit of AI-generated code
1. Why AI Code Passes Review but Fails in Production
Code review is a pattern-matching exercise. Reviewers look at the code and ask: does this look right? Is it structured well? Does it follow our conventions? AI-generated code is optimized for exactly this kind of evaluation. It uses good variable names, follows conventions, and handles the main case correctly. It looks like code a senior developer would write.
The problem is that code review evaluates code in isolation. A reviewer reads a function and decides whether it is correct. But production bugs rarely come from individual functions being wrong. They come from interactions between functions, between services, between the frontend and the backend. AI-generated code is especially vulnerable to these interaction bugs because each piece was generated independently.
Consider a simple example: the AI generates a form submission handler and a separate API endpoint. The handler sends data as JSON. The endpoint expects form-encoded data. Both pieces of code are individually correct and pass code review. But the form silently fails when submitted because the content type mismatch causes the server to receive an empty body.
2. Unit Tests vs. E2E Tests: What Each Actually Catches
Unit tests verify that individual functions work correctly with controlled inputs. They use mocks and stubs to isolate the code under test from its dependencies. This is useful for catching logic errors, but it fundamentally cannot catch integration problems because the mocks replace the exact components where integration bugs hide.
E2E tests verify that real user flows work from start to finish. They open a browser, interact with the application the way a user would, and verify the results. No mocks. No stubs. The real frontend talks to the real backend, which talks to the real database. Every integration point is exercised.
| Bug category | Unit tests | E2E tests |
|---|---|---|
| Logic errors in pure functions | Catches reliably | Catches indirectly |
| API contract mismatches | Misses (mocked away) | Catches reliably |
| Auth flow failures | Misses (mocked tokens) | Catches reliably |
| Checkout/payment errors | Misses (mocked payment API) | Catches reliably |
| Silent data loss on save | Misses (mocked database) | Catches reliably |
| CSS/layout regressions | Does not test | Catches with visual assertions |
| Multi-step form workflows | Misses step interactions | Catches reliably |
For AI-generated code, the rightmost column matters most. The bugs that escape to production are overwhelmingly in the categories where unit tests say "Misses" and E2E tests say "Catches reliably." This is not a coincidence. AI generates code module by module, and the bugs hide at the seams between modules.
E2E tests catch the bugs that code review misses
Assrt auto-generates Playwright E2E tests from plain English descriptions. Test checkout flows, auth, and multi-step forms without writing browser automation code. Free and open-source.
Get Started →

3. AI-Generated Bug Patterns That Only E2E Catches
Certain bug patterns show up consistently in AI-generated codebases. Knowing them helps you write targeted E2E tests.
Checkout flow: success UI on payment failure
AI-generated checkout code frequently shows a success message to the user regardless of whether the payment API actually succeeded. The happy path is coded thoroughly. The error callback either does not exist or catches the error without updating the UI. An E2E test that submits a test card configured to decline immediately catches this. A unit test with a mocked payment API will pass because the mock returns whatever you tell it to.
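A sketch of such a test, assuming hypothetical selectors and a payment provider test card that always declines (Stripe, for example, documents 4000 0000 0000 0002 for this purpose; check your provider's docs):

```typescript
import { test, expect } from "@playwright/test";

// Selectors, routes, and copy are illustrative; adapt to your app.
test("declined card shows an error, not a success message", async ({ page }) => {
  await page.goto("/checkout");
  await page.fill('[data-testid="card-number"]', "4000000000000002"); // decline card
  await page.fill('[data-testid="card-expiry"]', "12/30");
  await page.fill('[data-testid="card-cvc"]', "123");
  await page.click('[data-testid="pay-button"]');

  // The assertion that catches the bug: failure must NOT look like success.
  await expect(page.getByText("Payment failed")).toBeVisible();
  await expect(page.getByText("Thank you for your order")).not.toBeVisible();
});
```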
Auth flow: token refresh race condition
AI often generates auth code that handles initial login correctly but fails on token refresh. When a token expires mid-session, the refresh logic might fire multiple times, use a stale refresh token, or redirect to login unnecessarily. An E2E test that logs in, waits for the token to expire (or manipulates the token expiry), and then continues using the application catches this pattern.
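One way to sketch this without waiting out a real expiry window, assuming the app keeps its access token in localStorage (the storage key, routes, and selectors below are all illustrative):

```typescript
import { test, expect } from "@playwright/test";

test("expired token is refreshed without bouncing to login", async ({ page }) => {
  await page.goto("/login");
  await page.fill('[name="email"]', "test@example.com");
  await page.fill('[name="password"]', "test-password");
  await page.click('button[type="submit"]');
  await expect(page).toHaveURL(/\/dashboard/);

  // Force the expiry instead of waiting for it: swap in a stale token.
  await page.evaluate(() => {
    localStorage.setItem("access_token", "expired.token.value");
  });

  // Keep using the app; the refresh logic should recover silently.
  await page.goto("/settings");
  await expect(page).toHaveURL(/\/settings/); // not redirected to /login
});
```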
Form data: saves on the client but not the server
A common AI-generated pattern: the form updates local state optimistically and shows a success toast, but the API call to persist the data either fails silently or was never implemented. The user sees "Saved!" but the data is gone on page reload. An E2E test that fills a form, submits it, reloads the page, and checks that the data persisted catches this immediately. Unit tests cannot catch this because they test the form component in isolation with mocked persistence.
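The fill-submit-reload-verify pattern is mechanical enough to sketch directly; the field names and routes below are assumptions to adapt:

```typescript
import { test, expect } from "@playwright/test";

test("profile edits survive a page reload", async ({ page }) => {
  await page.goto("/profile/edit");
  await page.fill('[name="displayName"]', "Ada Lovelace");
  await page.click('button[type="submit"]');
  await expect(page.getByText("Saved")).toBeVisible(); // the optimistic toast

  // The step unit tests cannot take: throw away client state and re-fetch.
  await page.reload();
  await expect(page.locator('[name="displayName"]')).toHaveValue("Ada Lovelace");
});
```

If the "Saved!" toast was backed only by optimistic local state, the final assertion fails on the reloaded page.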
Navigation: broken back button and deep links
AI-generated single-page applications frequently break browser navigation. The back button might skip pages, deep links might render blank screens, and refreshing the page might lose application state. These bugs only appear when you test in a real browser with real navigation, which is exactly what E2E tests do.
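A sketch of a navigation test using Playwright's real browser history (the routes are illustrative; pick a list page, a detail page, and a deep link from your own app):

```typescript
import { test, expect } from "@playwright/test";

test("back button and deep links behave like a normal site", async ({ page }) => {
  await page.goto("/products");
  await page.click('a[href="/products/42"]');
  await expect(page).toHaveURL(/\/products\/42/);

  // Back button should return to the list, not skip or blank out.
  await page.goBack();
  await expect(page).toHaveURL(/\/products$/);

  // A deep link opened cold should render content, not an empty shell.
  await page.goto("/products/42");
  await expect(page.locator("main")).not.toBeEmpty();
});
```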
4. Setting Up E2E Testing in CI for Every PR
E2E tests are only effective if they run on every PR. Running them manually means they get skipped under time pressure, which is exactly when AI-generated bugs are most likely to be merged.
GitHub Actions setup
Create a workflow that starts your application, waits for it to be ready, and runs the Playwright test suite. Playwright provides official GitHub Actions support with browser caching that keeps CI times under five minutes for most applications. Set the workflow as a required check for PR merges so that no AI-generated code can be merged without passing E2E tests.
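A minimal workflow sketch. The start command, port, and the use of the `wait-on` package to block until the app answers are assumptions to adapt to your project:

```yaml
name: e2e
on: pull_request

jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      # Start the app in the background and wait until it answers.
      - run: npm run start &
      - run: npx wait-on http://localhost:3000
      - run: npx playwright test
```

Marking the `e2e` job as a required status check happens in the repository's branch protection settings.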
Handling flaky tests
Flaky tests undermine the entire E2E testing effort because developers learn to ignore failures. Use Playwright's built-in retry mechanism to handle intermittent failures, but investigate any test that fails more than once in ten runs. Flakiness usually indicates a real timing issue in the application, not a test problem. Fix the application, not the test.
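Playwright's retries are set in `playwright.config.ts`. One common sketch is to retry only in CI so that local failures stay loud and get investigated:

```typescript
// playwright.config.ts — retry only in CI so local failures stay loud.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  retries: process.env.CI ? 2 : 0,
  // Keep a report so tests that needed retries get investigated, not ignored.
  reporter: [["list"], ["html", { open: "never" }]],
});
```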
Test parallelization
Playwright supports sharding tests across multiple CI workers. For large test suites, this keeps the total CI time reasonable. Start with a single worker and add sharding when your suite grows beyond ten minutes. The goal is to keep E2E tests fast enough that developers do not see them as a bottleneck.
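Sharding plugs into the GitHub Actions workflow above via a matrix; the shard count of four is an arbitrary starting point:

```yaml
  e2e:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      # ...checkout, install, and app startup as before...
      - run: npx playwright test --shard=${{ matrix.shard }}/4
```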
5. Comparing E2E Testing Tools for AI-Generated Code
| Tool | AI features | Output | CI integration | Cost |
|---|---|---|---|---|
| Playwright | None (manual authoring) | Standard JS/TS files | Excellent (native GH Actions) | Free |
| Assrt | AI test generation from English | Standard Playwright files | Excellent (standard Playwright) | Free (open-source) |
| Cypress | Limited AI suggestions | Cypress-specific JS | Good (Cypress Cloud for parallelism) | Free tier, paid for cloud |
| QA Wolf | Fully managed AI testing | Managed Playwright | Managed (they run tests for you) | From $7,500/mo |
| Momentic | AI test creation, self-healing | Proprietary YAML | Via Momentic platform only | From $200/mo |
For teams using AI code generation, the most important factor is how quickly you can build E2E coverage for new features. Manual Playwright authoring gives maximum control but takes the most time. AI-assisted tools like Assrt generate initial tests faster, and the output is still standard Playwright code that runs in your existing CI pipeline. Managed services like QA Wolf remove the authoring burden entirely but at significant cost and with platform dependency.
6. Building E2E Coverage for an Existing AI-Generated Codebase
If your team has been using AI code generation without E2E tests, you have a coverage gap to close. Here is a practical approach that does not require stopping feature development.
Week 1: Cover revenue-critical flows
Start with the flows that cost money when they break. Checkout, signup, subscription management, and payment processing. These are the flows where AI-generated bugs are most damaging and where E2E tests provide the most immediate value. Write one test for the happy path and one for the most common error case for each flow.
Week 2: Cover user-facing CRUD flows
Add tests for creating, reading, updating, and deleting the core entities in your application. These tests catch the "saves on client, not on server" pattern that is pervasive in AI-generated code. Each test should create data, reload the page, and verify the data persisted.
Week 3 and beyond: expand coverage with AI assistance
Once the critical flows are covered, use AI test generation tools to expand coverage to secondary flows. Describe the user flow in plain English, generate the test, review the output, and add it to your suite. This is where tools like Assrt, which generates standard Playwright files, save significant time. The generated tests integrate directly into your existing test suite and CI pipeline.
7. The Team Workflow That Actually Works
Teams that use AI code generation at scale have converged on a clear workflow pattern for keeping quality high without slowing down.
Step one: the developer uses AI to generate the feature implementation.
Step two: before opening a PR, the developer writes (or generates and reviews) at least one E2E test for the new feature.
Step three: the PR includes both the implementation and the test.
Step four: CI runs the full E2E suite and blocks merge if any test fails.
Step five: code review focuses on the test first, because the test describes what the feature should do, and then reviews the implementation.
This workflow inverts the traditional approach. Instead of reviewing the code and hoping it works, you verify it works (via E2E tests) and then review the code to understand how. The E2E tests become the source of truth for what the feature does. The code review becomes an exercise in verifying that the implementation is reasonable, not in predicting whether it will work.
Teams that adopt this workflow consistently report fewer production bugs from AI-generated code, faster code reviews (because the test explains the feature), and higher confidence in shipping. The upfront cost is writing E2E tests for every feature. The return is that production does not become the testing environment.
Stop Reviewing AI Code and Start Verifying It
Assrt generates real Playwright E2E tests from plain English. Catch checkout bugs, auth failures, and silent data loss before your users do. Free and open-source.