AI Code Review & Testing

How to Audit AI-Generated PRs: E2E Testing Strategies That Catch What Code Review Misses

Six months of AI-assisted PRs taught us something uncomfortable: the bugs are not obvious. They look right. They pass review. Then a real user hits a specific sequence of actions, and nothing works. Here is how to build a testing process that catches the failure modes code review cannot.

"I audited 6 months of PRs after my team went all-in on AI code generation. The code got worse in ways none of us predicted. It looked right at a glance but behaved wrong under conditions nobody tested."

— r/ExperiencedDevs

1. The Hidden Problem With AI-Generated PRs

When a developer writes code manually, they develop a mental model while writing it. They consider the edge cases because they are thinking through the logic. They catch subtle interactions because they recently touched the related code. The understanding and the implementation arrive together.

AI-generated code arrives fully formed. It is syntactically correct, it handles the obvious input, and it passes any tests that were written for the obvious input. What it often lacks is the consideration of everything else: what happens when the user is logged in with an expired session, what happens when they have items in their cart from a previous visit, what happens when a third-party API returns a 429 instead of a 200.

The code looks finished because it is complete in structure. What it lacks is the depth of consideration that developers build by working through code slowly. This produces a specific pattern in PR reviews: everything checks out, nobody flags a problem, and two weeks later a user reports that they cannot complete checkout after applying a discount code.

The discount code path was never exercised. Not in development, not in review, not in any automated test. The happy path worked. The rest was untested assumption.

2. Why Code Reviewers Miss AI Bugs

Code review is pattern recognition. Experienced reviewers scan for patterns they have learned to associate with bugs: off-by-one errors, missing null checks, incorrect async handling, improper state mutation. They catch these reliably because they have seen the failure modes before.

AI-generated code tends to be structurally sound. The patterns that reviewers have learned to flag are largely absent. The null checks are there. The async handling is correct. The variable names are descriptive and consistent. This is exactly why review feels productive on AI code: there is a lot less to complain about on the surface.

The problem is that the bugs in AI code are not structural. They are behavioral. They live in the interaction between this function and the state that existed before it ran. They live in the assumption that certain preconditions will always hold, without any guard for when they do not. A reviewer reading the function in isolation cannot see the behavioral bug because the function itself is not where the bug is. The bug is in the gap between what the function assumes and what the system actually guarantees.

Code review is not designed to find this class of bug. It was never intended to simulate a user navigating a complex stateful application through a sequence of unexpected actions. That is what tests are for. Specifically, that is what end-to-end tests are for.

3. The Specific Class of Bugs AI Code Produces

Audit enough AI-generated PRs and a clear taxonomy of failure modes emerges. Understanding these categories helps you design tests that are actually likely to catch them.

Happy path completeness. AI models are trained on examples that demonstrate success. They generate code that handles the successful case fluently. Error paths, partial states, and edge inputs receive less attention. The result is code that works when everything goes right and fails silently when anything goes wrong.

Missing state precondition guards. AI code often assumes that the caller has set up the required state. When that assumption is violated, because a user navigated differently or because another piece of code changed state unexpectedly, the function fails in an unhandled way. The code is correct for the assumed precondition. The precondition is just not always true.

Structurally unusual approaches. Sometimes AI generates code that achieves the correct result through an unusual mechanism. This code passes tests because the output is right, but the approach has unexpected side effects that only appear under specific conditions. Reviewers miss these because the output matches expectation. The implementation detail that will cause problems later goes unremarked.

Interaction with existing state. AI code is generated within a context window of limited size. The model that wrote the function may not have had full visibility into how similar functions elsewhere in the codebase manage shared state. Conflicts between new AI-generated code and existing state management logic produce bugs that only appear when both paths are exercised in the same session.
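The precondition-guard failure mode above can be made concrete with a minimal TypeScript sketch. The `Cart` shape and function names are invented for illustration:

```typescript
type Cart = { items: { price: number }[] } | null;

// As often generated: assumes a cart was already created this session.
// Throws if the user reaches checkout with no cart (e.g. via a deep link).
function subtotalUnsafe(cart: Cart): number {
  return cart!.items.reduce((sum, item) => sum + item.price, 0);
}

// With the precondition made explicit: the "impossible" state is handled.
function subtotal(cart: Cart): number {
  if (!cart || cart.items.length === 0) return 0;
  return cart.items.reduce((sum, item) => sum + item.price, 0);
}
```

Both versions are correct for the assumed precondition. Only a test that deep-links into checkout with an empty session exposes the first one.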

4. Unit Tests vs. E2E Tests for AI-Generated Code

The instinct when a team identifies a testing gap is to add more unit tests. This is the wrong response for AI code, and understanding why is important for building an audit process that actually works.

Unit tests verify that a function produces the expected output for a given input. If AI code generates a function with the wrong behavior, the unit test that was generated alongside it will test for the wrong output and pass. The AI wrote both the function and the test with the same misunderstanding. The test suite goes green. Users cannot complete checkout.

This is not hypothetical. It is the pattern that emerges when teams rely on AI-generated unit tests to verify AI-generated logic. The tests confirm that the code does what the code does. They do not confirm that the code does what users need it to do. The unit test suite becomes a confidence artifact rather than a safety mechanism.
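To make the point concrete, here is a minimal sketch. The pricing rule, the numbers, and the function names are all invented for illustration:

```typescript
// A hypothetical AI-generated pricing function and its companion test.
// Both encode the same misunderstanding: the discount is applied after
// tax, while the (assumed) spec says it should be applied before tax.
function totalWithDiscount(subtotal: number, taxRate: number, discount: number): number {
  return subtotal * (1 + taxRate) - discount; // spec: (subtotal - discount) * (1 + taxRate)
}

// The AI-generated "verification": asserts the wrong total, so it passes.
function aiGeneratedTest(): boolean {
  return totalWithDiscount(100, 0.25, 10) === 115; // 125 - 10
}

// What the spec actually requires for the same inputs: (100 - 10) * 1.25
const expectedBySpec = 112.5;
```

The unit suite is green, yet every user is overcharged. A test that compares the displayed checkout total against the price a user was promised would fail immediately.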

End-to-end tests are different in a fundamental way. They do not care how the code is structured. They start a real browser, log in as a user, add items to a cart, apply a discount code, and attempt to complete a purchase. Either the purchase completes or it does not. The test does not know that the discount code logic was AI-generated. It only knows whether a real user could accomplish the task.

This is why E2E tests catch what unit tests miss for AI-generated code. The unit test validates the AI's implementation against the AI's expectation. The E2E test validates the user experience against reality. For auditing AI PRs, E2E tests are the layer that actually matters.
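For the discount-code scenario described earlier, such a test might look like the following Playwright sketch. The routes, labels, and discount code are placeholders for your own application's flow:

```typescript
import { test, expect } from '@playwright/test';

test('checkout succeeds after applying a discount code', async ({ page }) => {
  await page.goto('/products/example-item');
  await page.getByRole('button', { name: 'Add to cart' }).click();

  await page.goto('/cart');
  await page.getByLabel('Discount code').fill('SAVE10');
  await page.getByRole('button', { name: 'Apply' }).click();
  await expect(page.getByText('Discount applied')).toBeVisible();

  // The only assertion that matters: a real user reached confirmation.
  await page.getByRole('button', { name: 'Checkout' }).click();
  await expect(page.getByRole('heading', { name: 'Order confirmed' })).toBeVisible();
});
```

Nothing in this test knows or cares how the discount logic is implemented, which is exactly why it catches the class of bug described above.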

Run E2E tests on every AI-generated PR

Assrt generates and runs real browser tests against your application. Open-source, built on Playwright, no signup required.

Get Started

5. Setting Up E2E Testing on Every PR

The goal is to run a set of E2E tests against a preview deployment of every PR before it can be merged. This creates the feedback loop that catches behavioral regressions before they reach production. Here is the practical setup.

Preview deployments. Most modern hosting platforms (Vercel, Netlify, Railway, Render) create automatic preview URLs for each PR. These are the targets for your E2E tests. The test suite needs a URL to run against, and preview deployments provide one automatically. If your infrastructure does not support this natively, setting up ephemeral deployments via Docker in CI is the standard alternative.

Test execution in CI. Playwright runs in GitHub Actions, GitLab CI, CircleCI, and every other major CI platform. The standard setup installs Playwright, installs browser binaries, runs your test suite against the preview URL, and posts results back to the PR as a status check. Failed tests block merge. This is the behavior you want. A green PR gate means real user flows were verified in a real browser.
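A minimal GitHub Actions sketch of that setup follows. The job name, Node version, and preview URL are assumptions; every hosting platform exposes the per-PR URL through a different mechanism:

```yaml
# .github/workflows/e2e-smoke.yml — illustrative sketch
name: e2e-smoke
on: pull_request

jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      # Assumes a Playwright project named "smoke" exists in your config.
      - run: npx playwright test --project=smoke
        env:
          # Substitute your platform's preview URL for this PR.
          PREVIEW_URL: https://pr-123.preview.example.com
```

A failed step marks the status check red on the PR, which is what blocks the merge.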

Test scope. Running your full E2E suite on every PR can be slow. The practical approach is a tiered strategy: run a fast smoke suite on every PR (covering the five to ten most critical user flows, completing in under three minutes) and run the full regression suite nightly or on merges to main. The smoke suite catches the regressions that matter most. The full suite catches everything else before it ships to users.
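One way to express the tiered split in Playwright's own configuration, assuming smoke tests are tagged with `@smoke` in their titles:

```typescript
// playwright.config.ts — a sketch; the @smoke tag convention is an assumption
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: { baseURL: process.env.PREVIEW_URL },
  projects: [
    // PR gate: only tests whose titles contain @smoke,
    // e.g. test('checkout with discount @smoke', ...)
    { name: 'smoke', grep: /@smoke/ },
    // Full regression suite, run nightly or on merge to main.
    { name: 'full' },
  ],
});
```

CI then runs `npx playwright test --project=smoke` on pull requests and `--project=full` on the nightly schedule.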

Authentication and test data. E2E tests against preview deployments need to handle authentication. The cleanest approach is seeded test accounts with known credentials that are available in your CI environment. Playwright's storage state feature lets you log in once and reuse the authenticated session across test files, which keeps the suite fast and avoids flakiness from repeated login flows.
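A sketch of the storage-state pattern follows. The login route, field labels, environment variable names, and file path are placeholders:

```typescript
// auth.setup.ts — log in once, persist the session for all other tests
import { test as setup } from '@playwright/test';

setup('authenticate', async ({ page }) => {
  await page.goto('/login');
  await page.getByLabel('Email').fill(process.env.TEST_USER_EMAIL!);
  await page.getByLabel('Password').fill(process.env.TEST_USER_PASSWORD!);
  await page.getByRole('button', { name: 'Log in' }).click();
  await page.waitForURL('**/dashboard');

  // Save cookies and local storage; other test projects reuse this file
  // via `use: { storageState: 'playwright/.auth/user.json' }`.
  await page.context().storageState({ path: 'playwright/.auth/user.json' });
});
```

Tests that start from the saved state skip the login UI entirely, which removes the single most common source of E2E flakiness.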

6. Comparison: Code Review Only, Unit Tests, Integration Tests, E2E

Each layer of the testing strategy catches different things. This table summarizes what each approach finds and what it misses for AI-generated code specifically.

| Approach | What It Catches | What It Misses (for AI Code) |
| --- | --- | --- |
| Code review only | Structural antipatterns, obvious logic errors, style issues | Behavioral bugs, state interaction failures, untested paths |
| Unit tests | Function-level correctness for specified inputs | AI-generated tests verify AI assumptions; misses user flow gaps |
| Integration tests | Service boundaries, API contract mismatches, data layer issues | Frontend state, UI behavior, multi-step user flows |
| E2E against real user flows | Full user journeys, stateful sequences, happy path gaps, UI regressions | Deep security audits, performance regressions (use separate tooling) |

The practical conclusion is that code review and unit tests are necessary but not sufficient for AI-generated code. Integration tests add value at service boundaries. E2E tests are the layer that validates whether a user can actually accomplish what the product promises. For AI PRs specifically, E2E is the highest-leverage addition to an existing review process.

7. Building an AI Code Audit Process That Actually Works

A functioning audit process for AI PRs combines the layers above into a workflow that runs automatically without slowing down development velocity. Here is what the process looks like in practice.

Start with user flow inventory. Before writing any tests, list the user flows that matter most to your product. For an e-commerce app, this is signup, product search, add to cart, checkout with payment, order confirmation, and account management. For a SaaS tool, it is onboarding, core feature usage, settings changes, and billing. This inventory becomes the specification for your E2E smoke suite.

Require tests for AI PRs. If a PR was generated with an AI assistant, require that the PR description includes an explicit statement of which user flows were verified. This surfaces the question every reviewer should be asking: did anyone check whether real users can still accomplish the core tasks? You do not need a policy document. You need a PR template field that asks the question.
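A template field along these lines is enough; the path and wording are just one possibility:

```markdown
<!-- .github/pull_request_template.md (illustrative) -->
## AI assistance
- [ ] Parts of this PR were AI-generated

## User flows verified
<!-- List the flows exercised end to end, e.g. "checkout with discount code applied" -->
-
```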

Flag stateful changes for deeper review. The AI bugs that cause the most damage are the ones involving shared state: changes to authentication logic, session handling, cart state, user preferences. Establish a convention where changes to these areas trigger an automatic request for E2E test evidence. The developer runs the relevant E2E suite against their branch and attaches the results to the PR.

Automate what you can, review what you cannot. The goal of the automated E2E suite is to reduce the surface area that human reviewers need to think about. When the E2E suite passes, reviewers can focus their attention on the areas that tests cannot cover: architectural decisions, security implications, performance characteristics, and long-term maintainability. This makes review more valuable because it is focused on judgment rather than behavioral verification.

The teams that have successfully adapted to AI code generation share a common characteristic: they stopped treating code review as the primary safety mechanism and started treating it as one layer in a system where automated behavioral testing carries most of the safety load. This shift is not about trusting AI less. It is about building systems that catch the specific failure modes that AI-generated code produces, regardless of who wrote the code.

Add E2E Testing to Your AI PR Workflow

The process above works. The bottleneck is usually test authoring: writing Playwright tests for all your critical user flows takes time that most teams do not have when they are actively shipping.

Assrt is one option that addresses this specifically. It is an open-source AI testing framework built on Playwright that discovers testable scenarios from your running application and generates test files you own and can modify. There is no proprietary test format and no vendor lock-in. The output is standard Playwright code that runs in any CI pipeline.

Other options include writing tests manually using the Playwright docs and Playwright Codegen for initial scaffolding, or tools like Octomind and QA Wolf if you prefer a managed service. The specific tool matters less than the decision to make E2E testing a required part of your AI PR process. Start with your five most important user flows. Get those running in CI. Then expand coverage incrementally.

E2E testing for every AI-generated PR

Assrt discovers your user flows, generates Playwright tests, and runs them in CI on every pull request.

$ npx @assrt-ai/assrt discover https://your-app.com