Why E2E Tests Catch What Code Review Misses in AI-Generated PRs
Teams that go all-in on AI code generation discover a pattern within a few months: the code looks increasingly polished in PRs, but production incidents involving subtle behavioral bugs start climbing. Code reviewers scan for patterns they recognize as wrong, and AI-generated code rarely looks wrong. It just behaves wrong under conditions nobody thought to test manually. The fix is not better reviewers. It is E2E tests that exercise real user flows on every PR, catching the class of bugs that human review fundamentally cannot detect.
1. The Plausible Code Problem
AI-generated code has a unique quality that makes it dangerous in code review: it is plausible. The variable names are sensible. The structure follows established patterns. The logic reads like it should work. Reviewers trained by years of spotting human errors, things like off-by-one loops, incorrect null checks, and missing `await` keywords, do not find what they are looking for, because the AI rarely makes those mistakes.
The mistakes AI makes are different. The code handles the input it was designed for and silently does the wrong thing with inputs the prompt never described. The function returns the correct result in isolation but produces incorrect state when called in the sequence the application actually uses. The API integration passes local testing because the mock and the real API have diverged in ways the mock does not capture.
These are not the kinds of bugs a reviewer spots by reading a diff. They are the kinds of bugs that surface when a real user clicks through a real workflow in a real browser. That is exactly what E2E tests do.
2. The Unit Test Trap with AI-Generated Code
When teams adopt AI code generation, unit test coverage often improves dramatically. AI is excellent at generating unit tests: it reads the function, understands the expected behavior, and writes a test that passes. The coverage number goes up. Everyone feels good.
The trap is that unit tests pass because the function returns what the mock expects. This is a tautology when the same AI generated both the function and the test. The AI's understanding of the expected behavior is encoded in both artifacts. If the AI's understanding is wrong, both the code and the test are wrong in the same way, and the test passes.
| Test type | What it verifies | AI code blind spot |
|---|---|---|
| Unit test | Function returns expected output | Mock and code share the same wrong assumption |
| Integration test | Components work together | Still uses mocked external services |
| E2E test | User can complete real workflows | Catches real behavioral failures |
The E2E test is the only level that breaks the tautology. It does not care what the function is supposed to return according to the AI. It cares whether the user can actually complete checkout. That is the test that matters.
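A minimal sketch of the tautology in plain TypeScript. The function name, the pricing rule, and every number here are invented for illustration; the shape of the failure is the point. The implementation and its AI-written unit test share the assumption that tax is applied before the discount, so the test passes even though the business rule is discount first.

```typescript
// Hypothetical AI-generated implementation: applies tax first, then subtracts
// the discount. The prompt never specified the order, so this is a guess.
function totalWithDiscount(price: number, taxRate: number, discount: number): number {
  return price * (1 + taxRate) - discount;
}

// Hypothetical AI-generated unit test: encodes the identical guess,
// so it passes by construction.
const unitTestResult = totalWithDiscount(100, 0.5, 10); // 100 * 1.5 - 10 = 140
console.assert(unitTestResult === 140, "unit test passes");

// What users actually expect: discount first, then tax.
const expectedByUsers = (100 - 10) * 1.5; // 135
console.assert(unitTestResult !== expectedByUsers,
  "the code and its test agree with each other, not with the user");
```

Both artifacts encode the same wrong assumption, so nothing at the unit level can disagree. Only a test that asserts on what the user sees at checkout can.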
Break the mock tautology
Assrt generates E2E tests from your running app, not from your code. It tests what users actually experience, catching the integration failures that unit tests and code review miss.
Get Started →

3. How E2E Tests Catch Integration Failures
Integration failures are the signature bug class of AI-generated codebases. Each component works in isolation. The checkout form validates inputs correctly. The payment API processes charges correctly. The order confirmation page displays data correctly. But when a user goes through the full flow, something breaks: the payment processes but the confirmation page shows the wrong order, or the form submits successfully but the backend receives different data than what was displayed on screen.
These bugs happen because each component was generated in a separate conversation with the AI. The AI had no global view of how data flows through the system. It made locally reasonable decisions in each component that are globally inconsistent. The same field is named `amount` in the form, `price` in the API payload, and `total` on the confirmation page. Each component handles its own naming correctly. The mismatches only surface when data crosses component boundaries.
E2E tests catch these by definition: they exercise the full flow from input to output, crossing every component boundary along the way. A test that fills out the checkout form, submits it, waits for the confirmation page, and asserts that the displayed amount matches what was entered will catch any data transformation bug in the pipeline.
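A Playwright sketch of that checkout test. The URL, labels, and test IDs (`staging.example.com`, `Amount`, `order-total`) are placeholders for your app's own; the structure is what matters: enter a value, cross every component boundary, and assert on what the user finally sees.

```typescript
import { test, expect } from '@playwright/test';

// Hypothetical checkout flow test; selectors and routes are assumptions.
test('checkout total survives every component boundary', async ({ page }) => {
  await page.goto('https://staging.example.com/checkout');

  // Fill the form with a known amount and submit.
  await page.getByLabel('Amount').fill('42.50');
  await page.getByRole('button', { name: 'Place order' }).click();

  // Wait for the confirmation page to render.
  await page.waitForURL('**/confirmation');

  // The displayed total must match what the user entered, no matter how
  // the value was renamed (amount/price/total) along the pipeline.
  await expect(page.getByTestId('order-total')).toHaveText('$42.50');
});
```

This single assertion covers the form, the API payload, the backend, and the confirmation render in one pass, which is exactly the path no unit test crosses.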
4. Testing Real User Flows, Not Function Signatures
The most common mistake teams make when adding tests to AI-generated codebases is testing what the code does instead of what users do. An AI-generated function that sorts a list might have ten unit tests, all passing. But if the sorted list is rendered in a table that a user needs to paginate, and the pagination resets to page 1 after every sort, the user experience is broken despite all tests passing.
Real user flow testing means writing tests that describe a user goal and the steps to achieve it. "User sorts the product table by price and navigates to page 3 of results, then sorts by name and remains on page 3." This test does not care about the sort function implementation. It cares about whether the feature works as a user expects it to.
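That scenario translates almost word for word into a Playwright test. Again the route and locators are hypothetical stand-ins for your app's actual markup; note that nothing here references the sort implementation.

```typescript
import { test, expect } from '@playwright/test';

// Hypothetical user-flow test for the scenario above; selectors are assumptions.
test('re-sorting the table preserves the current page', async ({ page }) => {
  await page.goto('https://staging.example.com/products');

  await page.getByRole('columnheader', { name: 'Price' }).click(); // sort by price
  await page.getByRole('link', { name: 'Page 3' }).click();        // go to page 3

  await page.getByRole('columnheader', { name: 'Name' }).click();  // sort by name

  // Assert the user expectation, not the sort function's return value.
  await expect(page.getByTestId('current-page')).toHaveText('3');
});
```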
For AI-generated codebases, this approach has an additional benefit: the test descriptions serve as executable documentation of product requirements. When the AI regenerates a component and the test fails, you know exactly which user expectation was violated, not which function changed its return value.
5. A PR Testing Strategy for AI-Heavy Teams
Teams that successfully integrate AI code generation into their workflow without sacrificing quality follow a consistent pattern:
- Every PR triggers the full E2E suite before it is marked ready for review. Failed suites block the review queue entirely.
- New features require new E2E tests written against the running app, not against the code. The test is part of the PR, not a follow-up task.
- Reviewers focus on design, not behavior, because behavioral correctness is already verified by tests. This reduces review time by 40 to 60 percent.
- Weekly discovery runs auto-scan the app for new flows that do not have test coverage. This catches functionality the AI added that nobody explicitly asked for.
This strategy works because it separates concerns cleanly. Tests verify that the app works. Reviewers verify that the code is maintainable. Neither job is diluted by trying to do the other.
6. Tools That Make This Practical
Writing E2E tests manually for every AI-generated PR is not scalable. The tools that make this practical are the ones that reduce the effort per test to near zero while maintaining meaningful coverage.
| Tool | Auto-discovery | Self-healing | Output format | Price |
|---|---|---|---|---|
| Assrt | Yes | Yes | Standard Playwright | Free (open-source) |
| Manual Playwright | No | No | Standard Playwright | Free |
| QA Wolf | Yes | Managed | Proprietary | ~$7,500/mo |
| Momentic | Partial | Yes | Proprietary YAML | Paid |
The deciding factor for most teams is output format. Tools that generate standard Playwright code give you tests you can inspect, modify, and run in any CI pipeline without vendor lock-in. Tools that use proprietary formats create a dependency that becomes painful if you ever need to switch.
7. Measuring Whether Your Testing Is Working
The metric that matters is not test count or code coverage. It is the ratio of bugs caught in CI versus bugs found in production. A healthy AI-heavy codebase has a ratio of at least 10:1, meaning for every bug that reaches production, the E2E suite caught ten others in the PR pipeline. If your ratio is below 3:1, your test suite has significant gaps in user flow coverage.
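As a sketch, the ratio check is a one-liner; the thresholds mirror the ones above, and the bug counts are illustrative stand-ins for numbers you would pull from your issue tracker.

```typescript
// CI-vs-production bug ratio: bugs caught by the E2E suite in PRs
// divided by bugs that reached production in the same period.
function bugCatchRatio(caughtInCI: number, reachedProduction: number): number {
  if (reachedProduction === 0) return Infinity; // nothing escaped this period
  return caughtInCI / reachedProduction;
}

// Illustrative month: 34 bugs caught in CI, 3 escaped to production.
const ratio = bugCatchRatio(34, 3); // ~11.3, above the 10:1 target
console.log(ratio >= 10 ? 'healthy' : ratio >= 3 ? 'watch' : 'gaps in flow coverage');
```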
Track this ratio monthly. When it drops, run a discovery scan to find flows that have been added or changed without corresponding test updates. When a bug reaches production, write the E2E test that would have caught it before fixing the bug. Over time, this feedback loop makes the test suite an increasingly accurate model of what real users do and what can go wrong.
AI code generation is a permanent shift in how software gets written. The teams that thrive with it are not the ones that write the most code. They are the ones that verify the most code. E2E testing is the verification layer that makes high-volume AI code generation safe, sustainable, and production-ready.
Catch the Bugs Code Review Misses
Assrt auto-discovers test scenarios from your running app and generates real Playwright tests for every user flow. Run it on every AI-generated PR to catch behavioral bugs that code review cannot detect.