How to Review AI-Generated Code with Automated Testing
The skill that atrophies fastest when you use AI coding tools is not writing code. It is knowing whether the code is correct. When you type it yourself, you build intuition for what will break. When AI writes it, you get something that looks right, passes a quick glance, and then fails in production three weeks later in a way you never would have introduced yourself. The solution is not to stop using AI. It is to treat every AI-generated change the way you would treat code from a talented but untested new hire: run automated tests before you even look at it.
1. The Trust Problem with AI-Generated Code
AI-generated code has a unique failure mode that human-written code rarely exhibits: it looks plausible. A junior developer writing bad code usually writes obviously bad code. The variable names are confusing, the structure is awkward, the logic has visible gaps. You catch these issues in code review because they look wrong.
AI code is different. It uses good variable names, follows consistent patterns, and reads like production-quality code. But underneath that surface quality, it often contains subtle logic errors, missing edge cases, and assumptions about state that do not hold in your specific application context. These bugs pass code review because reviewers scan for patterns that look wrong, and AI code rarely looks wrong at a glance.
The result is a new category of technical debt: code that everyone on the team approved in code review, that nobody fully understood when it shipped, and that breaks under conditions nobody tested manually. This is not a failure of AI or a failure of the developers. It is a gap in the process that automated testing fills.
2. The New Hire Mental Model
The most useful way to think about AI-generated code is as code from a talented new hire on their first week. The new hire is smart, writes clean code, and knows the language well. But they do not know your codebase, your business rules, your edge cases, or the history of why certain things are done a certain way.
You would never merge a new hire's first PR without running the test suite. You would never skip CI checks because the code "looks good." You would review their work more carefully than a senior team member's work, not less. The same discipline should apply to AI-generated code.
| Scenario | Senior dev code | AI-generated code |
|---|---|---|
| Happy path | Works, battle-tested patterns | Works, clean implementation |
| Edge cases | Handled from experience | Often missing entirely |
| Error handling | Specific, actionable messages | Generic catch blocks, silent failures |
| Business logic | Knows the constraints | Implements what was asked, not what was meant |
| Integration bugs | Rare, caught by experience | Common, invisible in code review |
This mental model changes your workflow in a concrete way. Instead of reviewing AI code and then running tests, you run tests first and only review the code if the tests pass. The automated test suite becomes your first reviewer. Your human code review becomes the second pass, focused on architecture and intent rather than correctness.
3. Why Code Review Alone Fails for AI Code
Code review has a well-documented attention bias: reviewers focus on what looks unusual or unfamiliar. AI-generated code exploits this bias unintentionally because it always looks familiar. The patterns are common, the naming is conventional, and the structure follows best practices. There is nothing for the reviewer's eye to snag on.
Consider a real scenario. An AI generates a checkout flow that validates payment details, processes the order, and sends a confirmation email. The code looks correct. It follows the same patterns as the existing checkout code. But the AI assumed that the payment processor always returns a status field, and in the edge case where the processor times out and returns an empty response, the code crashes silently instead of showing an error to the user. No reviewer catches this because the code reads correctly line by line. The bug is in what the code does not do, not in what it does.
Teams that have audited their PRs after adopting AI code generation consistently report the same finding: the bugs are not in the code quality but in the behavioral gaps. The function returns what the mock expects, but the user cannot actually complete the flow. This is precisely the class of bug that automated E2E testing catches and code review does not.
Catch what code review misses
Assrt auto-discovers test scenarios from your running app and generates real Playwright tests. No need to read the codebase to verify AI-generated changes actually work.
Get Started →
4. Building an Automated Test Pipeline for AI Changes
The most effective pipeline for reviewing AI-generated code puts automated tests before human review. Here is how to structure it:
Step 1: Run existing tests immediately
Before looking at the diff, run your full test suite against the AI-generated changes. This catches regressions instantly. If the AI broke something that was working, you know within minutes instead of discovering it in production days later. Make this automatic: configure your CI to run on every commit, not just on PR creation.
Step 2: Run E2E tests against user flows
Unit tests verify individual functions. E2E tests verify that users can complete actual workflows. For AI-generated code, E2E tests are more valuable because they catch the integration-level bugs that the AI introduced without realizing it. A unit test will pass because the function returns the expected value. An E2E test will fail because the user cannot actually complete checkout, and that is closer to the real class of bugs AI introduces.
Step 3: Human review focused on intent
Once the tests pass, your code review can focus on higher-level concerns: does this change match the intended behavior? Is the architecture appropriate? Are there security implications? You are no longer wasting human attention on correctness checking that a machine does better and faster.
Step 4: Block merges on test failure
This is the critical discipline. No AI-generated change merges unless the full test suite passes. No exceptions for "it's just a small change." No overrides because "the tests are flaky." If the tests are flaky, fix the tests. The pipeline is only as strong as your commitment to not bypassing it.
5. What to Test and What to Skip
Not all tests provide equal value when reviewing AI-generated code. Focus your testing effort on the areas where AI code is most likely to fail:
| High priority | Why AI gets this wrong |
|---|---|
| User flow completion | AI optimizes individual steps, not end-to-end journeys |
| Error recovery paths | AI focuses on happy paths; error handling is often a placeholder |
| State across page navigations | AI generates code for a single-page context, not multi-step flows |
| Concurrent operations | AI writes sequential logic that breaks under parallelism |
| Boundary values | AI tests with typical values, not edge cases (empty, max, null) |
Lower priority for AI-specific review: pure utility functions, straightforward CRUD operations with no business logic, and styling changes. These are areas where AI code quality is typically fine and your existing tests (plus visual review) are sufficient.
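The boundary-value row in the table can be made concrete with a plain function. This is a hedged TypeScript sketch; `validateQuantity` and the `MAX_QUANTITY` limit are hypothetical, but the pattern of probing empty, minimum, maximum, and non-integer inputs applies to any validator an AI generates:

```typescript
const MAX_QUANTITY = 99; // hypothetical cart limit

// Returns an error string, or null when the quantity is valid.
function validateQuantity(input: string | null): string | null {
  if (input === null || input.trim() === "") return "Quantity is required.";
  const n = Number(input);
  if (!Number.isInteger(n) || n < 1) return "Quantity must be a positive whole number.";
  if (n > MAX_QUANTITY) return `Quantity cannot exceed ${MAX_QUANTITY}.`;
  return null;
}

// AI-generated tests usually cover the typical value...
console.assert(validateQuantity("3") === null);
// ...but the boundaries are where the bugs live:
console.assert(validateQuantity(null) !== null);  // missing input
console.assert(validateQuantity("") !== null);    // empty string
console.assert(validateQuantity("0") !== null);   // below minimum
console.assert(validateQuantity("99") === null);  // exactly at max
console.assert(validateQuantity("100") !== null); // just past max
console.assert(validateQuantity("2.5") !== null); // non-integer
```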
6. Tools and Workflow for Automated AI Code Review
The practical workflow for treating AI code like new hire code requires three things: a test suite that covers user flows, a CI pipeline that blocks on failure, and a discovery tool that helps you find gaps in coverage.
Playwright for E2E test execution
Playwright is the standard for browser-based E2E testing. It supports multi-browser testing, has excellent async handling, and produces human-readable test files. The key advantage for AI code review is that Playwright tests operate at the user level: they click buttons, fill forms, and assert on what the user sees. This is exactly the level where AI-introduced bugs manifest.
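Here is a minimal sketch of what such a test looks like. The URL, field labels, and button names are placeholders for your own app, not a working example against a real site:

```typescript
import { test, expect } from "@playwright/test";

// Operates at the user level: the test only sees what a user sees,
// so it fails on exactly the bugs that pass code review.
test("user can complete checkout", async ({ page }) => {
  await page.goto("https://your-app.com/cart"); // placeholder URL
  await page.getByRole("button", { name: "Checkout" }).click();
  await page.getByLabel("Card number").fill("4242 4242 4242 4242");
  await page.getByRole("button", { name: "Pay" }).click();
  // Asserts on the visible outcome, not on an internal return value.
  await expect(page.getByText("Order confirmed")).toBeVisible();
});
```

Note that nothing in this test references the implementation. That is what makes it useful for AI-generated code: it verifies behavior regardless of who, or what, wrote the code underneath.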
Auto-discovery for test coverage gaps
One of the hardest parts of testing AI-generated code is knowing what to test when you did not write the code yourself. Auto-discovery tools help here by crawling your running application and finding every user flow that exists. Assrt, for example, takes a URL, discovers the flows users can navigate, and generates Playwright tests for each one. You run `npx @m13v/assrt discover https://your-app.com` and get a test suite that covers every flow the tool can find, including flows you might not have known existed in the AI-generated codebase.
The review workflow in practice
- AI generates code: Accept the change into a branch
- CI runs automatically: Unit tests, E2E tests, linting
- All tests pass: Human reviewer focuses on architecture and intent
- Tests fail: Fix the issue before any human review time is spent
- New bug found later: Write a test, add it to the suite, it catches the bug next time
This workflow catches more bugs than code review ever did because it tests behavior, not appearance. The AI code that looks perfect but breaks checkout? The automated test tries to check out and fails. The race condition that no reviewer would spot by reading the diff? The E2E test clicks the button twice and catches the duplicate. Building this habit is the single most effective response to the trust problem with AI-generated code.
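The double-click duplicate is easy to reproduce without a browser. The sketch below uses an in-memory store as a stand-in for a real backend, and the in-flight guard shown is one common fix, not the only one:

```typescript
// Simulated backend with artificial latency; stands in for a real API.
const orders: string[] = [];
async function createOrder(cartId: string): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 10)); // network delay
  orders.push(cartId);
}

// Without the guard, two clicks before the first request resolves
// would create two orders. The guard tracks in-flight requests per cart.
const inFlight = new Set<string>();
async function submitOrder(cartId: string): Promise<boolean> {
  if (inFlight.has(cartId)) return false; // duplicate click ignored
  inFlight.add(cartId);
  try {
    await createOrder(cartId);
    return true;
  } finally {
    inFlight.delete(cartId);
  }
}

// An E2E test that clicks the button twice maps to two concurrent calls:
async function demo() {
  const [first, second] = await Promise.all([
    submitOrder("cart-1"),
    submitOrder("cart-1"),
  ]);
  console.log(first, second, orders.length); // true false 1
}
demo();
```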
Build Your AI Code Review Pipeline
Assrt auto-discovers test scenarios from your running app and generates real Playwright tests. Use it to build the automated safety net that catches what code review misses. Open-source, free, no vendor lock-in.