AI Code Review & Testing
How to Audit AI-Generated PRs: E2E Testing Strategies That Catch What Code Review Misses
Six months of AI-assisted PRs taught us something uncomfortable: the bugs are not obvious. They look right. They pass review. Then a real user hits a specific sequence of actions, and nothing works. Here is how to build a testing process that catches the failure modes code review cannot.
“I audited 6 months of PRs after my team went all-in on AI code generation. The code got worse in ways none of us predicted. It looked right at a glance but behaved wrong under conditions nobody tested.”
r/ExperiencedDevs
2. Why Code Reviewers Miss AI Bugs
Code review is pattern recognition. Experienced reviewers scan for patterns they have learned to associate with bugs: off-by-one errors, missing null checks, incorrect async handling, improper state mutation. They catch these reliably because they have seen the failure modes before.
AI-generated code tends to be structurally sound. The patterns that reviewers have learned to flag are largely absent. The null checks are there. The async handling is correct. The variable names are descriptive and consistent. This is exactly why review feels productive on AI code: there is a lot less to complain about on the surface.
The problem is that the bugs in AI code are not structural. They are behavioral. They live in the interaction between this function and the state that existed before it ran. They live in the assumption that certain preconditions will always hold, without any guard for when they do not. A reviewer reading the function in isolation cannot see the behavioral bug because the function itself is not where the bug is. The bug is in the gap between what the function assumes and what the system actually guarantees.
Code review is not designed to find this class of bug. It was never intended to simulate a user navigating a complex stateful application through a sequence of unexpected actions. That is what tests are for. Specifically, that is what end-to-end tests are for.
3. The Specific Class of Bugs AI Code Produces
After auditing six months of AI-generated PRs, a clear taxonomy of failure modes emerges. Understanding these categories helps you design tests that are actually likely to catch them.
Happy path completeness. AI models are trained on examples that demonstrate success. They generate code that handles the successful case fluently. Error paths, partial states, and edge inputs receive less attention. The result is code that works when everything goes right and fails silently when anything goes wrong.
Missing state precondition guards. AI code often assumes that the caller has set up the required state. When that assumption is violated, because a user navigated differently or because another piece of code changed state unexpectedly, the function fails in an unhandled way. The code is correct for the assumed precondition. The precondition is just not always true.
Structurally unusual approaches. Sometimes AI generates code that achieves the correct result through an unusual mechanism. This code passes tests because the output is right, but the approach has unexpected side effects that only appear under specific conditions. Reviewers miss these because the output matches expectation. The implementation detail that will cause problems later goes unremarked.
Interaction with existing state. AI code is generated in context windows that have limits. The model that wrote the function may not have had full visibility into how similar functions elsewhere in the codebase manage shared state. Conflicts between new AI-generated code and existing state management logic produce bugs that only appear when both paths are exercised in the same session.
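To make the precondition failure mode concrete, here is a hedged sketch. The `Cart` shape, both functions, and the deep-link scenario are invented for illustration; nothing here comes from a real codebase:

```typescript
// Hypothetical illustration -- the Cart shape and functions are invented.
type Cart = { items: { price: number }[]; subtotal: number };

// AI-generated style: correct whenever a cart exists, unguarded when it does
// not (e.g. the user deep-links straight to /checkout before a cart loads).
function applyDiscount(cart: Cart | undefined, percent: number): number {
  return cart!.subtotal * (1 - percent / 100); // throws TypeError on undefined
}

// The fix is to turn the assumed precondition into an explicit, testable guard.
function applyDiscountSafe(cart: Cart | undefined, percent: number): number {
  if (!cart) throw new Error("cart must be loaded before applying a discount");
  return cart.subtotal * (1 - percent / 100);
}
```

Read in isolation, `applyDiscount` looks fine; the bug only exists relative to a navigation path the reviewer never sees.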
4. Unit Tests vs. E2E Tests for AI-Generated Code
The instinct when a team identifies a testing gap is to add more unit tests. This is the wrong response for AI code, and understanding why is important for building an audit process that actually works.
Unit tests verify that a function produces the expected output for a given input. If AI code generates a function with the wrong behavior, the unit test that was generated alongside it will test for the wrong output and pass. The AI wrote both the function and the test with the same misunderstanding. The test suite goes green. Users cannot complete checkout.
This is not hypothetical. It is the pattern that emerges when teams rely on AI-generated unit tests to verify AI-generated logic. The tests confirm that the code does what the code does. They do not confirm that the code does what users need it to do. The unit test suite becomes a confidence artifact rather than a safety mechanism.
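A hedged sketch of that dynamic, with the discount rule and both functions invented for illustration: suppose the product spec says only the single best discount code applies per order, but the model stacks every code and then writes a test asserting the stacked result:

```typescript
// Hypothetical: assume the spec says only the single BEST code applies.
// The generated code stacks every code instead -- wrong per that spec.
function totalAfterDiscounts(subtotal: number, percents: number[]): number {
  return percents.reduce((total, p) => total * (1 - p / 100), subtotal);
}

// The unit test generated alongside it encodes the same misunderstanding:
// two 50% codes "should" stack 100 down to 25. It passes and the suite stays
// green, even though the correct total under the assumed spec is 50.
function generatedUnitTest(): boolean {
  return totalAfterDiscounts(100, [50, 50]) === 25;
}
```

An E2E test written from the spec ("apply two codes, expect only the better one") would fail here; the generated unit test never can.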
End-to-end tests are different in a fundamental way. They do not care how the code is structured. They start a real browser, log in as a user, add items to a cart, apply a discount code, and attempt to complete a purchase. Either the purchase completes or it does not. The test does not know that the discount code logic was AI-generated. It only knows whether a real user could accomplish the task.
This is why E2E tests catch what unit tests miss for AI-generated code. The unit test validates the AI's implementation against the AI's expectation. The E2E test validates the user experience against reality. For auditing AI PRs, E2E tests are the layer that actually matters.
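In Playwright terms, the cart-and-discount flow above is a short spec. This is a sketch only: the routes, selectors, and discount code are placeholders to adapt to your application.

```typescript
import { test, expect } from "@playwright/test";

// Sketch -- routes, accessible names, and the discount code are hypothetical.
test("user can apply a discount and complete checkout", async ({ page }) => {
  await page.goto("/products/example-item");
  await page.getByRole("button", { name: "Add to cart" }).click();
  await page.goto("/checkout");
  await page.getByLabel("Discount code").fill("WELCOME10");
  await page.getByRole("button", { name: "Apply" }).click();
  await page.getByRole("button", { name: "Place order" }).click();
  // The only assertion that matters: a real user reached confirmation.
  await expect(
    page.getByRole("heading", { name: "Order confirmed" })
  ).toBeVisible();
});
```

Nothing in this test knows or cares which functions were AI-generated; it passes or fails on what a user can accomplish.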
5. Setting Up E2E Testing on Every PR
The goal is to run a set of E2E tests against a preview deployment of every PR before it can be merged. This creates the feedback loop that catches behavioral regressions before they reach production. Here is the practical setup.
Preview deployments. Most modern hosting platforms (Vercel, Netlify, Railway, Render) create automatic preview URLs for each PR. These are the targets for your E2E tests. The test suite needs a URL to run against, and preview deployments provide one automatically. If your infrastructure does not support this natively, setting up ephemeral deployments via Docker in CI is the standard alternative.
Test execution in CI. Playwright runs in GitHub Actions, GitLab CI, CircleCI, and every other major CI platform. The standard setup installs Playwright, installs browser binaries, runs your test suite against the preview URL, and posts results back to the PR as a status check. Failed tests block merge. This is the behavior you want. A green PR gate means real user flows were verified in a real browser.
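As one concrete shape for this, here is a sketch of a GitHub Actions job. The preview URL pattern, the `@smoke` tag convention, and the `BASE_URL` variable are assumptions; substitute whatever your hosting platform actually exposes for each PR.

```yaml
# Hedged sketch of a PR-gating E2E job. The preview URL pattern is a
# hypothetical placeholder for your platform's per-PR deployment URL.
name: e2e-smoke
on: pull_request

jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps chromium
      # Run only tests tagged @smoke against the PR's preview deployment.
      - run: npx playwright test --grep @smoke
        env:
          BASE_URL: https://pr-${{ github.event.number }}.preview.example.com
```

Marking the job as a required status check in your branch protection settings is what turns a red suite into a blocked merge.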
Test scope. Running your full E2E suite on every PR can be slow. The practical approach is a tiered strategy: run a fast smoke suite on every PR (covering the five to ten most critical user flows, completing in under three minutes) and run the full regression suite nightly or on merges to main. The smoke suite catches the regressions that matter most. The full suite catches everything else before it ships to users.
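The tiered strategy can be expressed directly in Playwright's project configuration. The `@smoke` title-tag convention and the `BASE_URL` variable are conventions we are assuming here, not Playwright defaults:

```typescript
// playwright.config.ts -- sketch of a tiered smoke/full setup.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  use: { baseURL: process.env.BASE_URL ?? "http://localhost:3000" },
  projects: [
    // PR gate: only tests whose titles contain @smoke; keep it under ~3 min.
    { name: "smoke", grep: /@smoke/ },
    // Nightly / merge-to-main: the whole suite, smoke tests included.
    { name: "full" },
  ],
});
```

The PR pipeline then runs `npx playwright test --project=smoke` and the nightly job runs `--project=full`.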
Authentication and test data. E2E tests against preview deployments need to handle authentication. The cleanest approach is seeded test accounts with known credentials that are available in your CI environment. Playwright's storage state feature lets you log in once and reuse the authenticated session across test files, which keeps the suite fast and avoids flakiness from repeated login flows.
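The storage state pattern looks roughly like this; the login route, field labels, and environment variable names are placeholders for your seeded test account:

```typescript
// auth.setup.ts -- sketch of Playwright's log-in-once storage state pattern.
import { test as setup, expect } from "@playwright/test";

const authFile = ".auth/user.json";

setup("authenticate", async ({ page }) => {
  await page.goto("/login");
  // Seeded test-account credentials supplied by the CI environment.
  await page.getByLabel("Email").fill(process.env.TEST_USER_EMAIL!);
  await page.getByLabel("Password").fill(process.env.TEST_USER_PASSWORD!);
  await page.getByRole("button", { name: "Sign in" }).click();
  await expect(page).toHaveURL(/dashboard/);
  // Persist cookies and local storage for every later test to reuse.
  await page.context().storageState({ path: authFile });
});
```

Pointing `use: { storageState: ".auth/user.json" }` at this file in your config lets every subsequent test start already logged in.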
6. Comparison: Code Review Only, Unit Tests, Integration Tests, E2E
Each layer of the testing strategy catches different things. This table summarizes what each approach finds and what it misses for AI-generated code specifically.
| Approach | What It Catches | What It Misses (for AI Code) |
|---|---|---|
| Code review only | Structural antipatterns, obvious logic errors, style issues | Behavioral bugs, state interaction failures, untested paths |
| Unit tests | Function-level correctness for specified inputs | AI-generated tests verify AI assumptions; misses user flow gaps |
| Integration tests | Service boundaries, API contract mismatches, data layer issues | Frontend state, UI behavior, multi-step user flows |
| E2E against real user flows | Full user journeys, stateful sequences, happy path gaps, UI regressions | Deep security audits, performance regressions (use separate tooling) |
The practical conclusion is that code review and unit tests are necessary but not sufficient for AI-generated code. Integration tests add value at service boundaries. E2E tests are the layer that validates whether a user can actually accomplish what the product promises. For AI PRs specifically, E2E is the highest-leverage addition to an existing review process.
7. Building an AI Code Audit Process That Actually Works
A functioning audit process for AI PRs combines the layers above into a workflow that runs automatically without slowing down development velocity. Here is what the process looks like in practice.
Start with user flow inventory. Before writing any tests, list the user flows that matter most to your product. For an e-commerce app, this is signup, product search, add to cart, checkout with payment, order confirmation, and account management. For a SaaS tool, it is onboarding, core feature usage, settings changes, and billing. This inventory becomes the specification for your E2E smoke suite.
Require tests for AI PRs. If a PR was generated with an AI assistant, require that the PR description includes an explicit statement of which user flows were verified. This surfaces the question every reviewer should be asking: did anyone check whether real users can still accomplish the core tasks? You do not need a policy document. You need a PR template field that asks the question.
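A minimal version of that template field, with the wording here as a suggestion rather than a standard:

```markdown
<!-- .github/pull_request_template.md (hypothetical fields) -->
## AI assistance
- [ ] This PR was generated or heavily assisted by an AI tool.

## User flows verified
<!-- List the flows exercised against a preview deploy, e.g.
     "checkout with discount code", "login -> settings -> save". -->
-
```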
Flag stateful changes for deeper review. The AI bugs that cause the most damage are the ones involving shared state: changes to authentication logic, session handling, cart state, user preferences. Establish a convention where changes to these areas trigger an automatic request for E2E test evidence. The developer runs the relevant E2E suite against their branch and attaches the results to the PR.
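One lightweight way to wire up that convention is a CODEOWNERS file, which auto-requests review from people who will ask for E2E evidence whenever a stateful area changes. The paths and team name below are hypothetical:

```
# .github/CODEOWNERS -- route changes in stateful areas to reviewers
# who require attached E2E results (paths and team are placeholders).
/src/auth/      @your-org/platform-reviewers
/src/cart/      @your-org/platform-reviewers
/src/session/   @your-org/platform-reviewers
```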
Automate what you can, review what you cannot. The goal of the automated E2E suite is to reduce the surface area that human reviewers need to think about. When the E2E suite passes, reviewers can focus their attention on the areas that tests cannot cover: architectural decisions, security implications, performance characteristics, and long-term maintainability. This makes review more valuable because it is focused on judgment rather than behavioral verification.
The teams that have successfully adapted to AI code generation share a common characteristic: they stopped treating code review as the primary safety mechanism and started treating it as one layer in a system where automated behavioral testing carries most of the safety load. This shift is not about trusting AI less. It is about building systems that catch the specific failure modes that AI-generated code produces, regardless of who wrote the code.
Add E2E Testing to Your AI PR Workflow
The process above works. The bottleneck is usually test authoring: writing Playwright tests for all your critical user flows takes time that most teams do not have when they are actively shipping.
Assrt is one option that addresses this specifically. It is an open-source AI testing framework built on Playwright that discovers testable scenarios from your running application and generates test files you own and can modify. There is no proprietary test format and no vendor lock-in. The output is standard Playwright code that runs in any CI pipeline.
Other options include writing tests manually using the Playwright docs and Playwright Codegen for initial scaffolding, or tools like Octomind and QA Wolf if you prefer a managed service. The specific tool matters less than the decision to make E2E testing a required part of your AI PR process. Start with your five most important user flows. Get those running in CI. Then expand coverage incrementally.