Auditing AI-Generated Code: Why E2E Tests Are the Best Documentation You Can Write
Someone built an app with AI. Maybe they want to know if it actually works. Maybe they are handing it to a dev team. Maybe they are charging money for it and a user just reported a bug they cannot reproduce. You have been hired to audit it. The codebase was generated across dozens of AI prompts, organized by whatever structure the AI chose, and the original developer may not fully understand what any of it does. Static code review will take forever and miss the most important questions. Here is a better approach.
1. Framing the Audit: Behavior Verification Over Code Comprehension
Reading AI-generated code line by line is often a trap. The code may be internally consistent within each file but architecturally inconsistent across files, because each file was generated in a separate prompt context. State management patterns might differ between modules. Error handling might be thorough in one service and completely absent in another, depending on whether the developer happened to ask for it.
Spending days trying to comprehend this kind of codebase often produces a detailed map of how it is organized without answering the more important question: does it actually do what its creator thinks it does, across all the scenarios that matter?
A more effective approach starts with behavior. Define what the app is supposed to do, write automated tests that verify each behavior, and run them. The tests will immediately surface the most important gaps. Broken flows are visible in minutes instead of days of code reading. You end up with documentation that is both accurate (because it runs against the actual code) and durable (because it keeps working after you leave).
This is not an argument against code review. Security vulnerabilities, dependency issues, and architectural problems still require reading the code. But for an AI-generated codebase, behavior verification via E2E tests should come first. It shows you where to focus your code review rather than forcing you to review everything equally.
2. What AI-Generated Code Typically Gets Wrong
Certain failure patterns show up consistently in AI-generated codebases. Knowing them before you start the audit tells you where to look first.
Error handling is an afterthought
AI models optimize for the happy path. They generate code that works for the expected case. Error states, empty states, timeout scenarios, and invalid input are frequently missing or superficial. The app looks great in a demo and breaks immediately when anything goes slightly wrong.
Silent failures at the data layer
LLM-generated code sometimes swallows errors silently. A form submission returns a success response even when the backend write failed. A file sync skips entries without logging which ones were skipped. A search returns an empty result when the underlying query threw an exception. These failures are invisible during manual testing and extremely damaging in production.
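This failure mode is easiest to see in code. The sketch below is illustrative, not taken from any real codebase: the names (`saveBookmarkSilent`, `saveBookmark`, `ApiResponse`) are invented for the example.

```typescript
// A minimal sketch of the silent-failure pattern and its fix.
// All names here are invented for illustration.

interface ApiResponse {
  ok: boolean;
  status: number;
}

type SaveResult = { ok: boolean; error?: string };

// Anti-pattern: the catch block swallows the failure, so the caller (and the
// user) sees success even when the backend write never happened.
async function saveBookmarkSilent(
  post: () => Promise<ApiResponse>
): Promise<SaveResult> {
  try {
    await post();
    return { ok: true };
  } catch {
    return { ok: true }; // failure is invisible to everyone
  }
}

// Fix: propagate both thrown errors and non-2xx responses to the caller,
// so the UI can render an error state instead of a false success.
async function saveBookmark(
  post: () => Promise<ApiResponse>
): Promise<SaveResult> {
  try {
    const res = await post();
    if (!res.ok) return { ok: false, error: `server responded ${res.status}` };
    return { ok: true };
  } catch (e) {
    return { ok: false, error: e instanceof Error ? e.message : String(e) };
  }
}
```

Both versions type-check and both "work" in a demo, which is exactly why this bug survives manual testing: only a test that forces the request to fail can tell them apart.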
Inconsistent state across prompt boundaries
Each AI generation session is internally coherent, but the overall application may have multiple conflicting approaches to the same problem. Two features might each implement their own session token handling. Three components might each maintain their own copy of user preferences. These inconsistencies create race conditions and stale state bugs that only appear under specific sequences of user actions.
Security defaults that are permissive
Input validation, CORS configuration, and authentication checks require explicit prompting to appear in AI-generated code. If the developer did not ask for them, they are likely missing. Sensitive data in local storage, hardcoded API keys in frontend code, and missing rate limiting are common findings.
Phantom features
The UI often includes buttons, settings, and navigation items whose underlying logic is incomplete or entirely stubbed. The developer may not know these features do not work because the AI generated plausible-looking placeholder code and the UI did not throw an obvious error.
3. Using E2E Tests to Discover What the App Actually Does
When you start writing E2E tests for an unknown codebase, something useful happens: the tests fail in ways that tell you exactly what is broken and what never worked. This is more efficient than manual exploration because it forces precision. You cannot write a test for "login works" without specifying what "works" means, and that precision surfaces ambiguities immediately.
Start by listing every user flow the app is supposed to support. For a media player application like an audiobook client, this might include: account login and registration, library browsing and search, playback with chapter navigation, progress syncing, bookmark creation, offline download, and settings changes. Write smoke tests for each flow before going deeper.
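A smoke test at this stage can be very short. The sketch below assumes a standard Playwright project; the routes, labels, and credentials are placeholders for whatever the app under audit actually uses.

```typescript
// Audit-phase smoke tests: one minimal test per critical flow.
// Routes, labels, and account names below are placeholders.
import { test, expect } from '@playwright/test';

test('user can log in', async ({ page }) => {
  await page.goto('/login');
  await page.getByLabel('Email').fill('audit-user@example.com');
  await page.getByLabel('Password').fill('test-password');
  await page.getByRole('button', { name: 'Log in' }).click();
  await expect(page).toHaveURL(/library/);
});

test('playback starts from the library', async ({ page }) => {
  await page.goto('/library');
  await page.getByRole('link', { name: 'First audiobook' }).click();
  await page.getByRole('button', { name: 'Play' }).click();
  // A phantom play button fails here: the UI never reflects the playing state.
  await expect(page.getByRole('button', { name: 'Pause' })).toBeVisible();
});
```

Neither test asserts anything subtle. That is the point: at smoke level you are only asking whether the flow completes at all.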
What smoke test failures reveal
A smoke test failure is always informative. If the login test fails because the form does not submit, you have found a silent error handler swallowing a network error. If the playback test fails because the play button does not change state, you have found a phantom feature. If the progress sync test fails because the API returns 200 but the database write never happens, you have found silent data loss. These are the most important bugs to document.
Using test failures as a code review guide
Once you have a set of failing smoke tests, you know exactly which parts of the codebase to read. The stack traces and network logs from Playwright show you which functions and API calls are involved in each failure. This transforms code review from a "read everything" exercise into a targeted investigation of the specific code paths that are broken.
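Playwright exposes that network evidence directly in test code. One way to collect it during a failing flow (the route is a placeholder):

```typescript
// Capture every failed or error-status request while driving a broken flow,
// so code review can start from the exact endpoints involved.
import { test } from '@playwright/test';

test('record failing API calls during the sync flow', async ({ page }) => {
  const failures: string[] = [];

  page.on('response', (response) => {
    if (response.status() >= 400) {
      failures.push(
        `${response.status()} ${response.request().method()} ${response.url()}`
      );
    }
  });
  page.on('requestfailed', (request) => {
    failures.push(
      `FAILED ${request.method()} ${request.url()}: ${request.failure()?.errorText}`
    );
  });

  await page.goto('/library');
  // ...drive the flow under investigation here...

  // The list of broken endpoints is the code review reading list.
  console.log(failures.join('\n'));
});
```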
4. Building a Regression Net During the Audit
The most valuable artifact you can leave behind after an audit is not a report. It is a running test suite. A good regression net means the next developer who touches the codebase, whether human or AI, will know within minutes if they have broken something that previously worked.
Layer 1: Smoke tests for all critical flows
Cover every flow that matters with a minimal test that verifies the flow completes at all. Smoke tests are fast to write and fast to run. They form the foundation of the regression net and should be the first thing a new developer runs when setting up the project.
Layer 2: Negative tests for the failure modes you found
Every bug you find during the audit should have a corresponding test that reproduces it. Write the test, confirm it fails against the current code, fix the bug, and confirm the test now passes. This ensures the bug cannot silently reappear in a future change. It also creates a useful record of every known failure mode.
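A negative test of this kind usually forces the failure condition explicitly. The sketch below assumes Playwright; the bug number, route, and message are stand-ins for the real finding.

```typescript
// A negative test pinned to a specific audit finding. Route and bug
// description are placeholders for the real ones.
import { test, expect } from '@playwright/test';

// Audit finding (example): bookmark creation reported success even when the
// API call failed. Abort the request and require a visible error instead.
test('bookmark creation surfaces an error when the API is unreachable', async ({ page }) => {
  await page.route('**/api/bookmarks', (route) => route.abort('connectionfailed'));
  await page.goto('/player/some-book');
  await page.getByRole('button', { name: 'Add bookmark' }).click();
  await expect(page.getByRole('alert')).toBeVisible();
});
```

Before the fix, this test fails because the app shows a success state; after the fix, it passes, and the failure mode can never silently return.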
Layer 3: Data integrity tests for persistence operations
AI-generated code is especially prone to silent failures at the persistence layer. Write tests that verify data actually saves and loads correctly: create a record, reload the page, confirm the record is still there. Update a record, reload, confirm the update persisted. These tests are simple to write and catch the most damaging class of bugs in AI-generated codebases.
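The reload is what makes these tests honest: nothing survives it except what the backend actually persisted. A minimal sketch, assuming Playwright and a seeded audit account (the route and test id are placeholders):

```typescript
// Layer 3 data integrity test: create, reload, verify persistence.
import { test, expect } from '@playwright/test';

test('a created bookmark is still there after a full page reload', async ({ page }) => {
  await page.goto('/player/some-book');
  await page.getByRole('button', { name: 'Add bookmark' }).click();

  // Reload so only server-persisted state remains; a silently failed
  // backend write means the bookmark vanishes here.
  await page.reload();
  await expect(page.getByTestId('bookmark-list')).toContainText('Bookmark');
});
```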
Wiring the suite to CI
A regression net is only useful if it runs automatically. Set up a GitHub Actions workflow (or equivalent) that runs the full test suite on every push to the main branch. The setup takes under an hour for a Playwright-based suite and ensures the regression net actually catches regressions instead of sitting unused in a scripts folder.
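For a Playwright suite, the workflow can be as small as the following sketch (job names and report paths are illustrative; adjust Node version and branch names to the project):

```yaml
name: e2e
on:
  push:
    branches: [main]

jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test
      - uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: playwright-report
          path: playwright-report/
```

Uploading the HTML report on failure means the next developer gets traces and screenshots without rerunning anything locally.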
5. Tools and Tradeoffs for Audit-Phase Testing
Audit engagements run under time pressure: you need useful test coverage quickly, so the tool tradeoffs differ from those of a long-running project.
| Tool | Audit fit | Tradeoffs |
|---|---|---|
| Playwright (manual) | Excellent for custom flows | Slower to write, most flexible output |
| Assrt (AI-assisted) | Fast coverage, real Playwright output, free and open-source | Generated tests need review |
| QA Wolf | Managed service, fast setup | $7,500/mo, proprietary elements in output |
| Momentic | No-code creation | Chrome-only; proprietary YAML format, not portable |
| Cypress | Good DX, real-time debugging | Chromium-only in most setups, slower CI |
The portability rule
For an audit engagement, portability matters above almost everything else. You are building a test suite that will be handed off to the client or the next dev team. If your tests only run inside a third-party platform, the handoff is incomplete. Tools that generate standard Playwright or Selenium code ensure the tests are fully portable: they can be run locally, in GitHub Actions, in any CI system, and without any ongoing paid subscription.
For most audit engagements, the right combination is Playwright for manual test authoring where you need full control, and an AI-assisted tool like Assrt for generating initial coverage across a large number of flows quickly. The AI-generated tests should always be reviewed before being included in the handoff suite. They are a starting point, not a finished product.
6. Documenting Findings and Handing Off the Test Suite
An audit produces two categories of output: a findings report and a test suite. Both matter, but the test suite is the more durable artifact. A report gets read once and filed. A test suite runs on every subsequent change and protects the codebase for as long as it is maintained.
Structuring the findings report
Organize findings by severity and type. Critical findings are flows that are broken or missing entirely. High findings are behaviors that work but are fragile or likely to break under realistic conditions. Medium findings are code quality issues, inconsistencies, or missing non-critical features. Low findings are style issues, minor inefficiencies, or documentation gaps.
For each finding, link to the specific E2E test that demonstrates it. This makes the report concrete and actionable: the client can run the test themselves to see the failure, and the next developer can run it after fixing the issue to confirm it is resolved.
The test suite as living documentation
The test suite you build during the audit is, in effect, the most accurate documentation of what the app does. It is more accurate than any written spec because it runs against the actual code and reflects what actually happens, not what someone thinks should happen. Treat the test file names and test descriptions as user-readable documentation: name tests after the user behavior they verify, not after the implementation details they test.
Handoff checklist for the test suite
- All tests run in headless mode without manual intervention
- Tests use dedicated test accounts and clean up after themselves
- A README explains how to run the suite locally and in CI
- A GitHub Actions workflow file is included and tested
- Tests that cover known bugs are clearly marked with the bug description
- No proprietary formats or external platform dependencies
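The cleanup item above can be enforced globally rather than per test. A minimal sketch using an `afterEach` hook, assuming a dedicated audit account and a hypothetical cleanup endpoint (`/api/test-data` is invented for this example):

```typescript
// Global cleanup so handed-off tests never leave audit data behind.
// The /api/test-data endpoint is hypothetical; substitute whatever
// cleanup path the app actually provides.
import { test, request } from '@playwright/test';

test.afterEach(async ({ baseURL }) => {
  const api = await request.newContext({ baseURL });
  // Delete everything the dedicated audit account created during the test.
  await api.delete('/api/test-data?owner=audit-user');
  await api.dispose();
});
```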
7. Complete AI-Generated Code Audit Checklist
Use this checklist as a starting point for any AI-generated codebase audit. Not every item applies to every project, but the majority do.
Functional verification (E2E tests)
- All advertised features work as described in a real browser
- Error states are handled gracefully and shown to the user
- Form validation exists on client and server sides
- Multi-step workflows complete without data loss
- Data created in one session persists correctly in subsequent sessions
- The app handles network failures without silent data loss
Security review
- No API keys, secrets, or passwords in frontend code or version control
- Authentication and authorization checks on all sensitive endpoints
- Input is validated and sanitized server-side
- Dependency vulnerabilities scanned with `npm audit` or equivalent
- Sensitive data not stored in local storage or exposed in URLs
- CORS configured appropriately, not wildcard
Code quality
- No obvious dead code or completely unused dependencies
- Consistent patterns within each major feature area
- Environment variables used for all configuration values
- Logging exists for critical operations and all error paths
- No duplicate implementations of the same functionality across modules
Infrastructure and deployment
- Build process is documented and reproducible by a new developer
- CI/CD pipeline exists or can be set up with the handoff
- Database migrations are versioned and tested
- Backup and recovery processes exist for user data
Test suite handoff
- Smoke tests cover all critical user flows
- Negative tests exist for all known failure modes
- Data persistence tests cover all create and update operations
- Tests run in CI on every push to main
- Test output is readable and actionable without debugging tools
- The suite uses standard formats (Playwright, not proprietary) for full portability