Testing Guide

How to Review AI-Generated Code with Automated Testing

The skill that atrophies fastest when you use AI coding tools is not writing code. It is knowing whether the code is correct. When you type it yourself, you build intuition for what will break. When AI writes it, you get something that looks right, passes a quick glance, and then fails in production three weeks later in a way you never would have introduced yourself. The solution is not to stop using AI. It is to treat every AI-generated change the way you would treat code from a talented but untested new hire: run automated tests before you even look at it.


1. The Trust Problem with AI-Generated Code

AI-generated code has a unique failure mode that human-written code rarely exhibits: it looks plausible. A junior developer writing bad code usually writes obviously bad code. The variable names are confusing, the structure is awkward, the logic has visible gaps. You catch these issues in code review because they look wrong.

AI code is different. It uses good variable names, follows consistent patterns, and reads like production-quality code. But underneath that surface quality, it often contains subtle logic errors, missing edge cases, and assumptions about state that do not hold in your specific application context. These bugs pass code review because reviewers scan for patterns that look wrong, and AI code rarely looks wrong at a glance.

The result is a new category of technical debt: code that everyone on the team approved in code review, that nobody fully understood when it shipped, and that breaks under conditions nobody tested manually. This is not a failure of AI or a failure of the developers. It is a gap in the process that automated testing fills.

2. The New Hire Mental Model

The most useful way to think about AI-generated code is as code from a talented new hire on their first week. The new hire is smart, writes clean code, and knows the language well. But they do not know your codebase, your business rules, your edge cases, or the history of why certain things are done a certain way.

You would never merge a new hire's first PR without running the test suite. You would never skip CI checks because the code "looks good." You would review their work more carefully than a senior team member's work, not less. The same discipline should apply to AI-generated code.

| Scenario | Senior dev code | AI-generated code |
| --- | --- | --- |
| Happy path | Works, battle-tested patterns | Works, clean implementation |
| Edge cases | Handled from experience | Often missing entirely |
| Error handling | Specific, actionable messages | Generic catch blocks, silent failures |
| Business logic | Knows the constraints | Implements what was asked, not what was meant |
| Integration bugs | Rare, caught by experience | Common, invisible in code review |

This mental model changes your workflow in a concrete way. Instead of reviewing AI code and then running tests, you run tests first and only review the code if the tests pass. The automated test suite becomes your first reviewer. Your human code review becomes the second pass, focused on architecture and intent rather than correctness.

3. Why Code Review Alone Fails for AI Code

Code review has a well-documented attention bias: reviewers focus on what looks unusual or unfamiliar. AI-generated code exploits this bias unintentionally because it always looks familiar. The patterns are common, the naming is conventional, and the structure follows best practices. There is nothing for the reviewer's eye to snag on.

Consider a real scenario. An AI generates a checkout flow that validates payment details, processes the order, and sends a confirmation email. The code looks correct. It follows the same patterns as the existing checkout code. But the AI assumed that the payment processor always returns a status field, and in the edge case where the processor times out and returns an empty response, the code crashes silently instead of showing an error to the user. No reviewer catches this because the code reads correctly line by line. The bug is in what the code does not do, not in what it does.
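The checkout bug above can be made concrete with a small sketch. The names here (`PaymentResult`, `describeOutcome*`) are invented for illustration and do not correspond to any real payment processor API; the shape simply mirrors the scenario described:

```typescript
// A processor response where `status` is optional: on a timeout the
// processor may return an empty object.
type PaymentResult = { status?: "ok" | "declined" };

// AI-generated version: assumes `status` is always present.
// Crashes with a TypeError when the processor times out.
function describeOutcomeUnsafe(res: PaymentResult): string {
  return res.status!.toUpperCase();
}

// Defensive version: a missing status becomes an explicit, testable state
// instead of a silent crash.
function describeOutcomeSafe(res: PaymentResult): string {
  if (!res.status) return "PAYMENT_UNCONFIRMED";
  return res.status.toUpperCase();
}
```

Line by line, the unsafe version reads correctly; only an input the reviewer never imagined (the empty response) exposes it. That is exactly the input an automated test can supply cheaply.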

Teams that have audited their PRs after adopting AI code generation consistently report the same finding: the bugs are not in the code quality, they are in the behavioral gaps. The function returns what the mock expects, but the user cannot actually complete the flow. This is precisely the class of bug that automated E2E testing catches and code review does not.

Catch what code review misses

Assrt auto-discovers test scenarios from your running app and generates real Playwright tests. No need to read the codebase to verify AI-generated changes actually work.

Get Started

4. Building an Automated Test Pipeline for AI Changes

The most effective pipeline for reviewing AI-generated code puts automated tests before human review. Here is how to structure it:

Step 1: Run existing tests immediately

Before looking at the diff, run your full test suite against the AI-generated changes. This catches regressions instantly. If the AI broke something that was working, you know within minutes instead of discovering it in production days later. Make this automatic: configure your CI to run on every commit, not just on PR creation.

Step 2: Run E2E tests against user flows

Unit tests verify individual functions. E2E tests verify that users can complete actual workflows. For AI-generated code, E2E tests are more valuable because they catch the integration-level bugs that the AI introduced without realizing it. A unit test will pass because the function returns the expected value. An E2E test will fail because the user cannot actually complete checkout, and that is closer to the real class of bugs AI introduces.

Step 3: Human review focused on intent

Once the tests pass, your code review can focus on higher-level concerns: does this change match the intended behavior? Is the architecture appropriate? Are there security implications? You are no longer wasting human attention on correctness checking that a machine does better and faster.

Step 4: Block merges on test failure

This is the critical discipline. No AI-generated change merges unless the full test suite passes. No exceptions for "it's just a small change." No overrides because "the tests are flaky." If the tests are flaky, fix the tests. The pipeline is only as strong as your commitment to not bypassing it.

5. What to Test and What to Skip

Not all tests provide equal value when reviewing AI-generated code. Focus your testing effort on the areas where AI code is most likely to fail:

| High priority | Why AI gets this wrong |
| --- | --- |
| User flow completion | AI optimizes individual steps, not end-to-end journeys |
| Error recovery paths | AI focuses on happy paths; error handling is often placeholder |
| State across page navigations | AI generates code for a single-page context, not multi-step flows |
| Concurrent operations | AI writes sequential logic that breaks under parallelism |
| Boundary values | AI tests with typical values, not edge cases (empty, max, null) |
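The boundary-value row deserves a concrete illustration. Here is a hedged sketch of the kind of checks to write against AI-generated input handling; `parseQuantity` is a hypothetical parser for a cart quantity field, not code from any real library:

```typescript
// Parse a quantity field defensively: empty, null, non-numeric, and
// out-of-range inputs all resolve to a safe value instead of NaN.
function parseQuantity(raw: string | null, max = 99): number {
  if (raw === null || raw.trim() === "") return 1; // default, not NaN
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 1) return 1;     // reject junk and zero
  return Math.min(n, max);                          // clamp at the maximum
}

// Boundary cases first, the typical value AI *does* test last.
const cases: Array<[string | null, number]> = [
  [null, 1],   // missing input
  ["", 1],     // empty string
  ["0", 1],    // below minimum
  ["999", 99], // above maximum, clamped
  ["3", 3],    // typical value
];
```

Reviewing AI-generated input handling against a table like `cases` takes minutes and reliably surfaces the empty/max/null gaps the model skipped.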

Lower priority for AI-specific review: pure utility functions, straightforward CRUD operations with no business logic, and styling changes. These are areas where AI code quality is typically fine and your existing tests (plus visual review) are sufficient.

6. Tools and Workflow for Automated AI Code Review

The practical workflow for treating AI code like new hire code requires three things: a test suite that covers user flows, a CI pipeline that blocks on failure, and a discovery tool that helps you find gaps in coverage.

Playwright for E2E test execution

Playwright is the standard for browser-based E2E testing. It supports multi-browser testing, has excellent async handling, and produces human-readable test files. The key advantage for AI code review is that Playwright tests operate at the user level: they click buttons, fill forms, and assert on what the user sees. This is exactly the level where AI-introduced bugs manifest.
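To make "operates at the user level" concrete, here is an illustrative Playwright test. It assumes `@playwright/test` is installed and the app is running locally; the URL, labels, and button names are placeholders, not values from any real app:

```typescript
import { test, expect } from "@playwright/test";

test("user can complete checkout", async ({ page }) => {
  await page.goto("http://localhost:3000/cart");
  await page.getByRole("button", { name: "Checkout" }).click();
  await page.getByLabel("Card number").fill("4242 4242 4242 4242");
  await page.getByRole("button", { name: "Pay" }).click();
  // Assert on what the user sees, not on internal state.
  await expect(page.getByText("Order confirmed")).toBeVisible();
});
```

Note that nothing in this test depends on how the checkout code is written, which is the point: it keeps passing or failing on user-visible behavior no matter how much of the implementation the AI rewrites.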

Auto-discovery for test coverage gaps

One of the hardest parts of testing AI-generated code is knowing what to test when you did not write the code yourself. Auto-discovery tools help here by crawling your running application and finding every user flow that exists. Assrt, for example, takes a URL, discovers the flows users can navigate, and generates Playwright tests for each one. You run npx @m13v/assrt discover https://your-app.com and get a test suite that covers every flow the tool can find, including flows you might not have known existed in the AI-generated codebase.

The review workflow in practice

In practice, the pipeline from section 4 runs in order: CI executes the full unit and E2E suite the moment the AI-generated change lands, the merge stays blocked until every test passes, and only then does a human review the diff for intent and architecture.

This workflow catches more bugs than code review ever did because it tests behavior, not appearance. The AI code that looks perfect but breaks checkout? The automated test tries to check out and fails. The race condition that no reviewer would spot by reading the diff? The E2E test clicks the button twice and catches the duplicate. Building this habit is the single most effective response to the trust problem with AI-generated code.

Build Your AI Code Review Pipeline

Assrt auto-discovers test scenarios from your running app and generates real Playwright tests. Use it to build the automated safety net that catches what code review misses. Open-source, free, no vendor lock-in.

$ npx @m13v/assrt discover https://your-app.com