AI Verification Engineering
How to Verify AI-Generated Code Actually Works Before It Reaches Your Users
The question keeps coming up in developer communities: are we losing our ability to actually understand the code we ship? The real answer is not about coding skills atrophying. It is about verification. AI writes plausible code. Plausible is not the same as correct.
“68% of developers report merging AI-generated code they could not fully reason through, citing time pressure and the code looking syntactically correct on review.”
Developer Productivity and AI Survey, 2025
1. The Plausibility Problem: Why AI Code Looks Right But Breaks
Large language models are trained to produce text that looks correct to the human reader. In code, that means proper indentation, sensible variable names, recognizable patterns, and no obvious syntax errors. The output has the surface characteristics of correct code regardless of whether the underlying logic is actually correct.
This is a fundamental property of how these models work. They are optimizing for the next token being plausible given everything that came before. Plausibility and correctness overlap heavily in most cases, which is why AI-generated code often works fine. But they diverge exactly in the cases that are hardest to test manually: edge cases, race conditions, off-by-one errors in boundary conditions, incorrect assumptions about API response formats, and subtle logic inversions that pass smoke tests but fail under specific inputs.
A concrete example: an AI-generated authentication check that reads `if (!user.role === "admin")` instead of `if (user.role !== "admin")` looks reasonable at a glance, raises no syntax error, and even passes a quick manual test if you happen to test with an admin user. The bug only surfaces when a non-admin reaches the protected route and is incorrectly admitted.
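The inversion can be sketched in a few lines. The guard function names here are hypothetical, but the failure mechanics are exactly as described: `!user.role` evaluates to a boolean, which never equals the string `"admin"`, so the buggy guard never rejects anyone.

```javascript
// Minimal sketch of the inversion bug (hypothetical route guard names).
// Buggy version: `!user.role` is a boolean, never equal to "admin",
// so this returns false for every user -- nobody is ever blocked.
function isBlockedBuggy(user) {
  return !user.role === "admin";
}

// Intended logic: block whenever the role is anything other than "admin".
function isBlockedCorrect(user) {
  return user.role !== "admin";
}

// A single test with a non-admin user exposes the difference immediately:
const viewer = { role: "viewer" };
console.log(isBlockedBuggy(viewer));   // false -- viewer slips through
console.log(isBlockedCorrect(viewer)); // true  -- viewer is blocked
```

Note that testing only with an admin user returns `false` from both versions, which is why the quick manual test described above passes.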
The thread title asks whether we are losing coding ability because of AI. A more useful framing: we are losing the natural checkpoints that came from writing code slowly. When you type every character yourself, you reason through the logic in real time. When AI generates a hundred lines in seconds, that reasoning does not happen automatically. It has to be deliberately reintroduced through systematic verification.
2. Why Code Review Fails for AI-Generated Output
Code review was designed for a world where code was written character by character by a human who understood every decision. The reviewer's job was to catch mistakes that the author missed because they were too close to the problem. This model assumes the code reflects deliberate, reasoned choices.
AI-generated code undermines the premise. The code is syntactically correct and stylistically clean, which means the reviewer's pattern-matching brain sees what looks like well-written code and approves it faster than it would approve hand-typed code. Research on code review suggests that humans are significantly worse at spotting bugs in code that looks polished than in code that looks rough. AI code always looks polished.
There is also a volume problem. If your team's code output increases fivefold because of AI tools, but the number of reviewers stays constant, each reviewer now handles five times the surface area. Reviewers adapt by spending less time per line. The bugs that require careful reading are exactly the ones that get through.
Code review remains valuable for design decisions, architectural consistency, and catching obviously wrong logic. It is not a substitute for automated testing when verifying that AI-generated code does what it is supposed to do in all the cases that matter.
3. The New Hire Mental Model: Untrusted Until Tested
The most practical mental model for working with AI-generated code is to treat it exactly as you would treat code from a talented but new team member. This is not an insult to the AI. It is a realistic framing of the trust relationship.
When a new hire submits their first pull request, you do not rubber stamp it because their resume was impressive. You review it carefully. You ask questions about their reasoning. You make sure the tests exist and actually pass. You verify that the edge cases are handled. If the tests are missing, you ask for them before merging. The new hire may write excellent code, but trust is earned through a track record, not assumed from competence signals.
AI tools never accumulate a track record in your codebase. Every generation is effectively a first pull request. The model does not know your application's invariants, your production environment's quirks, or the specific ways your users interact with the system. It makes reasonable guesses based on patterns from training data. Those guesses need verification the same way a new hire's guesses do.
In practice this means establishing a policy: no AI-generated code merges without automated test coverage for the paths it introduces. Not coverage in the abstract sense of a percentage number, but actual tests that exercise the feature in a way that would catch the most likely failure modes. The policy does not need to be draconian. It needs to be consistent.
Teams that have adopted this model report an interesting side effect. It forces them to articulate what "working correctly" means for every feature before or at the time of implementation, rather than after a production incident. That discipline improves the quality of the AI prompts, the quality of the review conversations, and the quality of the resulting code.
4. Which Tests Catch AI-Specific Bugs (E2E vs Unit)
Not all tests are equally effective at catching the bugs that AI code tends to introduce. Understanding which test types address which failure modes helps you allocate testing effort where it actually matters.
Unit tests verify individual functions in isolation. They are fast, stable, and excellent for pure logic. But AI-generated code rarely fails in pure logic. The function itself usually does what the docstring says. The failure is in how that function interacts with the rest of the system. A unit test for a payment calculation function will not catch that the function is never called because an AI-generated event handler has the wrong event name.
Integration tests verify that components interact correctly at the API level. They catch more AI-specific bugs than unit tests, particularly around incorrect assumptions about API contracts and data formats. They are slower and require more setup, but they test the integration layer where LLM-generated code most often has subtle errors.
End-to-end tests in real browsers are the most effective verification layer for AI-generated frontend and full-stack code. They exercise the actual user paths through the running application, including DOM rendering, network requests, state management, and UI interactions. This is the layer where AI code that "looks correct in isolation" most often falls apart. A component that renders correctly in Storybook can still fail when it depends on context that is only present in the full application.
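As a sketch of what this looks like in practice, here is a Playwright test for the authorization scenario from earlier. The URLs, labels, and redirect behavior are assumptions about a hypothetical application, not a real one:

```javascript
// Hypothetical Playwright E2E test: a non-admin must not reach /admin.
const { test, expect } = require("@playwright/test");

test("non-admin user cannot reach the admin dashboard", async ({ page }) => {
  await page.goto("https://staging.example.com/login");
  await page.getByLabel("Email").fill("viewer@example.com");
  await page.getByLabel("Password").fill("correct-horse-battery");
  await page.getByRole("button", { name: "Sign in" }).click();

  // Navigate directly to the protected route, as a real user might.
  await page.goto("https://staging.example.com/admin");

  // The running application decides the outcome -- exactly the layer
  // where the inverted guard from section 1 would be caught.
  await expect(page).toHaveURL(/\/login|\/403/);
});
```

This test would have caught the `!user.role === "admin"` inversion regardless of how plausible the code looked in review, because it asserts on observable behavior rather than on the source.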
The hierarchy for AI-generated code verification is roughly: E2E tests for user-facing flows, integration tests for API boundaries, and unit tests for complex pure logic. The typical testing pyramid (many units, some integration, few E2E) inverts for AI-generated code because the failure modes concentrate at the integration and interaction layer, not in individual functions.
Property-based testing is worth adding for any AI-generated code that handles input validation, data transformation, or numerical computation. Tools like fast-check (JavaScript) or Hypothesis (Python) generate hundreds of randomized inputs automatically, finding the edge cases that neither you nor the AI thought to test explicitly. This directly addresses the class of AI bugs that only surface under specific inputs.
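The core idea behind property-based testing can be sketched in plain JavaScript: generate many random inputs and assert an invariant for each. The function and invariant below are illustrative; fast-check does the same thing far more thoroughly, including shrinking a failure to a minimal reproducing input:

```javascript
// Hand-rolled property check (illustrative only).
// Invariant: normalizing whitespace twice gives the same result as once
// (idempotence) -- a common property for AI-generated sanitizers.
function normalizeWhitespace(s) {
  return s.trim().replace(/\s+/g, " ");
}

// Crude random input generator over whitespace-heavy strings.
function randomString() {
  const chars = " \t\nabc ";
  let out = "";
  const len = Math.floor(Math.random() * 20);
  for (let i = 0; i < len; i++) {
    out += chars[Math.floor(Math.random() * chars.length)];
  }
  return out;
}

let failures = 0;
for (let i = 0; i < 500; i++) {
  const once = normalizeWhitespace(randomString());
  if (normalizeWhitespace(once) !== once) failures++;
}
console.log(failures); // 0 -- the invariant held for every random input
```

With fast-check, the same property is roughly `fc.assert(fc.property(fc.string(), s => normalizeWhitespace(normalizeWhitespace(s)) === normalizeWhitespace(s)))`, with the added benefit that any counterexample is automatically shrunk before being reported.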
5. Setting Up Automated Verification in Your Workflow
The goal is to make verification automatic rather than optional. If testing requires a developer to remember to run it, it will be skipped under deadline pressure. The verification system needs to run whether or not anyone thinks about it.
PR-level verification gates. Every pull request containing AI-generated code should trigger an automated test run before merge is allowed. Configure your CI system (GitHub Actions, GitLab CI, CircleCI, or similar) to run your E2E test suite against a preview deployment of the branch. The PR cannot merge if tests fail. This is not optional and it is not a suggestion in the PR template. It is a hard gate.
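A minimal GitHub Actions gate might look like the sketch below. The job name, scripts, and the `PREVIEW_URL` secret are assumptions to adapt to your own pipeline; the essential part is marking the job as a required status check in branch protection so a failing run actually blocks the merge:

```yaml
# Hypothetical PR gate: the E2E suite must pass before merge is allowed.
name: pr-verification
on: [pull_request]

jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      # BASE_URL points the suite at the preview deployment for this branch.
      - run: npx playwright test
        env:
          BASE_URL: ${{ secrets.PREVIEW_URL }}
```

Without the required-status-check setting, this workflow only reports failures; with it, the red check is a hard gate rather than a suggestion.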
Pre-merge smoke tests against staging. After a PR merges to main but before it deploys to production, run a targeted smoke test against your staging environment. This catches the class of issues that only appear when the full application is assembled: environment variable differences, database schema mismatches, and service dependency conflicts. Keep this test suite fast (under five minutes) so it does not create a deployment bottleneck.
Post-deploy verification runs. After every production deployment, run a verification suite against your live environment. This catches the subset of issues that only appear in production conditions: CDN caching behavior, third-party script loading, SSL certificate handling, and real API rate limits. If the post-deploy verification fails, trigger an immediate rollback rather than waiting for user reports.
Keeping tests current with the codebase. The biggest practical challenge is test maintenance. As AI tools accelerate code changes, the surface area that tests need to cover grows rapidly. Tests that target specific CSS selectors or DOM structure break whenever the UI changes. Write tests against stable attributes (data-testid, aria-label, semantic HTML roles) rather than structural selectors. Some teams use AI-assisted test generation to keep tests in sync with the codebase, generating new tests for new features in the same pass that generates the feature code.
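The difference between structural and stable selectors is easiest to see side by side. A sketch, with hypothetical element names:

```javascript
// Brittle: coupled to DOM structure and styling. Breaks when a designer
// reorders the nav or renames a CSS class.
await page.click("div.nav > ul li:nth-child(3) button.btn-primary");

// Stable: tied to semantics and explicit test hooks, which survive
// refactors that leave the feature's behavior unchanged.
await page.getByRole("button", { name: "Checkout" }).click();
await page.getByTestId("checkout-button").click();
await page.getByLabel("Email address").fill("user@example.com");
```

The `getByRole` and `getByLabel` locators have a second benefit: they fail when the accessibility attributes are missing, so the test suite doubles as a basic accessibility check.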
The workflow described here is not novel. It is standard CI/CD practice applied consistently to AI-generated code. The difference from typical team practice is the word "consistently." Most teams have these pipelines configured but allow exceptions. With AI-generated code, exceptions compound quickly because the volume of changes is higher than manual development produced.
6. Comparison: Manual Review, Unit, Integration, and E2E Testing
Each testing approach has a different cost, speed, and coverage profile. For AI-generated code specifically, some approaches are significantly more valuable than others. Here is how they compare across the dimensions that matter most.
| Approach | Speed | AI Bug Coverage | Maintenance Cost | Scales with AI Velocity? |
|---|---|---|---|---|
| Manual Code Review | Slow | Low | High (human time) | No |
| Unit Tests | Fast | Low to Medium | Low | Partially |
| Integration Tests | Medium | Medium | Medium | Partially |
| E2E Tests (real browser) | Slower | High | Medium (stable selectors) | Yes, with AI generation |
| Property-based Tests | Medium | High for logic bugs | Low | Partially |
The table shows why E2E testing is the highest-leverage investment for teams using AI coding tools heavily. It has the highest coverage of the failure modes that AI code tends to introduce, and with AI-assisted test generation, the maintenance cost no longer scales linearly with the test surface area.
Manual code review is not eliminated by this framework. It still catches architectural problems, security issues, and design decisions that automated tests cannot evaluate. But treating it as a verification mechanism for correctness is expecting it to do a job it was not designed to do, especially at the velocity that AI-assisted development creates.
A practical starting configuration for most teams: require E2E tests for every user-facing flow that involves more than a single page view. Require unit tests for any pure logic with more than two conditional branches. Use property-based tests for any AI-generated input validation or data transformation. Everything else gets covered by integration tests as part of the normal PR process.
Several tools exist to help with each layer. For E2E testing, Playwright and Cypress are the dominant frameworks, with Playwright becoming the more common choice for new projects due to its auto-wait behavior and multi-browser support. Tools like Assrt can generate Playwright tests automatically from plain English descriptions or by crawling your application to discover user flows. For property-based testing, fast-check is the most actively maintained option for JavaScript and TypeScript. The specific tools matter less than the discipline of running them automatically on every significant code change.