Auditing an AI-Generated App: Write Regression Tests as You Explore

There is a growing category of software that nobody fully understands: applications built primarily by AI, sometimes called vibe-coded apps. The developer directed the AI, reviewed the output, and shipped it. But no single person has a complete mental model of how it works. When these apps go to real users or get handed off to new developers, the first thing that needs to happen is an audit. The question is not just what is broken, but what is there in the first place. And the most valuable thing an auditor can do is leave behind a regression test suite, not just a findings document.


1. Why Vibe-Coded Apps Need Audits

A traditional application built by a team of developers has a specific kind of understandability: the people who built it can explain how it works, what edge cases they handled, and where the tricky parts are. The knowledge may not be documented, but it exists in someone's head and can be extracted with time and conversation.

AI-generated applications have a different knowledge structure. The developer who directed the AI made high-level decisions: what features to build, which approaches to take, when to accept versus reject generated code. But the detailed implementation, including error handling, edge case behavior, and security assumptions, was written by a model following patterns from its training data. The developer reviewed it, but review is not the same as designing. The result is software that works for the cases it was pointed at but may have unexpected behavior in cases nobody explicitly considered.

This is not a failure of AI coding tools. It is a structural characteristic that audit processes need to account for. Auditing a vibe-coded app requires exploring the behavior of the system with fresh eyes, not just reviewing the code with the original developer.

The scenarios where audits become important include: hiring an external developer to take over maintenance of an AI-built app, preparing an app for acquisition due diligence, scaling an app from personal use to production with real user data, or debugging a category of failures where the root cause is unclear because nobody fully understands the system.

2. Common Failure Patterns in AI-Generated Code

Knowing what tends to go wrong in AI-generated applications helps focus the audit on the highest-risk areas rather than covering everything equally.

Incomplete error handling

AI models generate happy-path code extremely well, but the error handling they produce tends to look plausible without covering real failure modes. Forms that submit successfully but silently fail when the API is down. Async operations that resolve successfully when they should reject. Error boundaries that catch exceptions but swallow the error message instead of surfacing it. These are all common, and all invisible during normal testing.
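
A minimal sketch of the pattern a fix usually takes: wrap the request so failures produce a typed result the caller must handle, instead of disappearing silently. The `submitForm` helper, the `/api/form` endpoint, and the injected stub are all hypothetical, used here only to illustrate the shape of the fix.

```typescript
// Hypothetical sketch: a result type forces callers to handle failure
// explicitly instead of letting a rejected fetch disappear silently.
type SubmitResult =
  | { ok: true; data: unknown }
  | { ok: false; error: string };

// fetchFn is injected so the failure path can be exercised in tests.
async function submitForm(
  payload: Record<string, unknown>,
  fetchFn: typeof fetch = fetch,
): Promise<SubmitResult> {
  try {
    const res = await fetchFn("/api/form", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    });
    if (!res.ok) {
      // Surface the status instead of treating any response as success.
      return { ok: false, error: `Request failed with status ${res.status}` };
    }
    return { ok: true, data: await res.json() };
  } catch (err) {
    // Network-level failure: report it rather than swallowing it.
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}

// Simulate the API being down with a stub that always rejects.
const downFetch: typeof fetch = async () => {
  throw new Error("API is down");
};
const result = await submitForm({ name: "test" }, downFetch);
```

A regression test for this failure mode is then one line: submit with the failing stub and assert the error is surfaced, not swallowed.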

Authentication and authorization gaps

AI-generated auth code frequently implements authentication correctly while missing authorization checks. The login flow works perfectly. But unauthenticated requests to protected endpoints, or requests from User A trying to access User B's data, may succeed when they should fail. This is especially common in API routes and server actions where the auth check was generated as a template and the specific resource ownership check was omitted.
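
The missing piece is usually a resource-ownership check layered on top of authentication. A hypothetical sketch (the `User` and `Document` shapes are illustrative, not from any particular codebase):

```typescript
// Hypothetical sketch of the check that is often omitted: authentication
// proves who the caller is; authorization checks what they may touch.
type User = { id: string; role: "admin" | "member" };
type Doc = { id: string; ownerId: string };

function canAccessDocument(user: User | null, doc: Doc): boolean {
  if (user === null) return false;        // unauthenticated: reject
  if (user.role === "admin") return true; // admins may access anything
  return doc.ownerId === user.id;         // members: only their own resources
}

const alice: User = { id: "u1", role: "member" };
const bob: User = { id: "u2", role: "member" };
const doc: Doc = { id: "d1", ownerId: "u1" };

const ownAccess = canAccessDocument(alice, doc);   // true
const crossAccess = canAccessDocument(bob, doc);   // false: the commonly missed case
const anonAccess = canAccessDocument(null, doc);   // false
```

During an audit, the `crossAccess` and `anonAccess` cases are the ones worth probing against every protected endpoint, because the template-generated auth check typically covers only the first.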

Race conditions in async flows

AI-generated code tends to handle single-threaded happy paths cleanly but miss race conditions in concurrent operations. A form that can be submitted twice if the user clicks quickly. An optimistic UI update that conflicts with a server response arriving out of order. State that gets overwritten when two async operations complete in the wrong sequence.
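
One common fix for the double-submit case is to reuse the in-flight promise so concurrent calls collapse into a single request. A self-contained sketch (the `dedupeInFlight` helper is hypothetical, not a library function):

```typescript
// Hypothetical sketch: dedupe concurrent calls by reusing the in-flight
// promise, so a double-click triggers one submission, not two.
function dedupeInFlight<T>(fn: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | null = null;
  return () => {
    if (inFlight === null) {
      inFlight = fn().finally(() => {
        inFlight = null; // allow a fresh call once this one settles
      });
    }
    return inFlight;
  };
}

// Count how many times the underlying submit actually runs.
let submissions = 0;
const submit = dedupeInFlight(async () => {
  submissions += 1;
  return "saved";
});

// Two rapid clicks share one request...
const [first, second] = await Promise.all([submit(), submit()]);
// ...while a later, separate click runs again.
const third = await submit();
```

The corresponding regression test is the behavioral claim itself: two concurrent submissions produce exactly one server call.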

Inconsistent data validation

Validation logic is often present on the frontend and absent on the backend, or present in one API route and missing in another that handles the same data. AI models implement validation where they see it in context but do not always propagate it consistently.
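
The usual remedy is a single validation function imported by both the form and the API route, so the rules cannot drift apart. A hypothetical sketch (`validateSignup` and its rules are illustrative):

```typescript
// Hypothetical sketch: one validation function shared by the form
// component and the API route, so the rules cannot drift apart.
type ValidationResult = { valid: true } | { valid: false; errors: string[] };

function validateSignup(input: { email?: string; password?: string }): ValidationResult {
  const errors: string[] = [];
  if (!input.email || !/^\S+@\S+\.\S+$/.test(input.email)) {
    errors.push("email must be a valid address");
  }
  if (!input.password || input.password.length < 8) {
    errors.push("password must be at least 8 characters");
  }
  return errors.length === 0 ? { valid: true } : { valid: false, errors };
}

// The same function guards the backend route...
const serverCheck = validateSignup({ email: "not-an-email", password: "short" });
// ...and the frontend form.
const clientCheck = validateSignup({ email: "a@b.co", password: "longenough" });
```

During an audit, submitting invalid data directly to each API endpoint (bypassing the frontend) is the quickest way to find the routes where only the client-side half exists.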

Unexpected behavior at scale

AI-generated code is tested during development with small data sets. Queries without indexes, pagination that loads all records, and client-side filtering over large arrays are all patterns that work fine in development and fail visibly in production. These are usually not bugs in a strict sense but are design choices made implicitly when nobody was thinking about scale.
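
The pagination case is easy to illustrate. A minimal sketch of the server-side pattern that replaces load-everything-and-filter (names and page size are illustrative; a real implementation would push the limit into the database query):

```typescript
// Hypothetical sketch: cursor-based pagination over a large collection,
// instead of loading every record and filtering client-side.
type Page<T> = { items: T[]; nextCursor: number | null };

function getPage<T>(records: T[], cursor: number, limit: number): Page<T> {
  const items = records.slice(cursor, cursor + limit);
  const next = cursor + limit;
  return { items, nextCursor: next < records.length ? next : null };
}

// 10,000 records, fetched 50 at a time rather than all at once.
const records = Array.from({ length: 10_000 }, (_, i) => ({ id: i }));
const firstPage = getPage(records, 0, 50);
const secondPage = getPage(records, firstPage.nextCursor ?? 0, 50);
```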

3. The Audit-as-You-Test Approach

Traditional software audits produce findings documents: lists of issues, risk ratings, and recommendations. These are valuable, but they leave the burden of fixing and verifying entirely on the original developer. A better approach for AI-generated applications is audit-as-you-test: writing automated regression tests as you explore the application, so that the output of the audit is both a findings document and a running test suite.

The logic is straightforward. When you explore a new application and discover that a particular flow works correctly, you have two choices: write it down in a document, or encode it as a test. A test is better because it continues to verify the behavior after every future change. If a later developer modifies the code and accidentally breaks the flow you audited, the test fails and they know immediately. A findings document does not do that.

When you discover a bug, the same principle applies. Fix the bug and write a test that verifies the fix. The test becomes permanent documentation of the specific failure mode and permanent protection against regression.

This approach requires more skill than a traditional audit because the auditor needs to write test code, not just explore and document. But the output is dramatically more valuable. A codebase that emerges from an audit with a suite of 30 regression tests covering the discovered behavior is in a much better position than one with a 20-page findings document that nobody reads after the first week.

Turn your audit into a regression safety net

Assrt generates real Playwright tests from plain English descriptions of user flows. Write tests as fast as you can describe behavior. Free and open-source.

Get Started

4. Writing Regression Tests During Exploration

The practical challenge of audit-as-you-test is pace. If writing a test takes 30 minutes, and you discover 40 behaviors worth testing during an audit, you have spent 20 hours just on test authoring. That is not realistic for most audit engagements.

Two things change this math. First, test authoring gets faster with practice and with good tooling. A well-structured Playwright test for a basic user flow takes 10-15 minutes once you are familiar with the patterns. Second, AI-assisted test generation tools can turn a plain English description of a flow into a working test in under a minute, shifting the authoring time to review and refinement.

What to test first

During an audit, prioritize test coverage in this order: authentication and authorization flows (highest security risk), core value delivery flows (what the app actually does), any flow involving money or sensitive data, and then supporting flows like settings, notifications, and account management. This ordering ensures that if you run out of time, the highest-risk areas are covered.

Testing negative cases

AI-generated apps often have strong happy-path behavior and weak negative-case behavior. Your regression tests should explicitly cover what happens when things go wrong: invalid inputs, missing required fields, expired sessions, unauthorized access attempts, and network failures. These are the cases most likely to reveal the incomplete error handling that is common in AI-generated code.
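
A table-driven structure keeps negative cases cheap to add: each failure mode the audit uncovers becomes one named entry. A hypothetical sketch (the `isValidQuantity` rule and its limits are illustrative):

```typescript
// Hypothetical sketch: drive negative cases from a table so each failure
// mode discovered during the audit becomes one explicit, named entry.
function isValidQuantity(raw: string): boolean {
  const n = Number(raw);
  return raw.trim() !== "" && Number.isInteger(n) && n > 0 && n <= 1000;
}

const negativeCases: Array<{ name: string; input: string }> = [
  { name: "empty input", input: "" },
  { name: "non-numeric input", input: "abc" },
  { name: "negative number", input: "-5" },
  { name: "zero", input: "0" },
  { name: "above maximum", input: "1001" },
  { name: "decimal quantity", input: "2.5" },
];

// Every entry in the table should be rejected by the validator.
const failures = negativeCases.filter((c) => isValidQuantity(c.input));
```

The same table shape works inside a Playwright suite: iterate the cases, fill the form with each input, and assert the rejection message appears.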

Documenting in tests, not docs

Tests serve as living documentation when they have clear names and descriptions. Name your tests in plain language that describes the scenario: "user with expired subscription sees upgrade prompt instead of dashboard," "admin can delete any user account but regular users cannot delete other accounts." A test suite with descriptive names is more useful documentation than a prose findings document because it remains accurate after the code changes.

5. Choosing Audit Tools

The tools you choose for an audit affect both the quality of the audit and the portability of the regression tests you leave behind.

Browser automation framework

Playwright is the current standard for E2E test automation in web applications. It has strong TypeScript support, built-in parallelization, a trace viewer for debugging failures, and good documentation. Cypress is a viable alternative with a different architecture and a larger existing community around component testing. For a new audit, Playwright is the safer default because of its broader browser support and more active development.

AI-assisted test generation

For audit work where you need to write many tests quickly, AI generation tools are worth evaluating. Assrt generates real Playwright code from plain English descriptions of flows. You describe the behavior you observed, it generates the test, and you review and refine. This significantly accelerates the pace of audit-as-you-test. Other tools like QA Wolf and Mabl offer similar generation capabilities with different tradeoffs around pricing, output format, and customization.

The critical thing to check when choosing an AI test generation tool for audit work is whether it generates portable code you can hand off. If the tests only run inside the vendor's platform, you are not leaving the client with an independent regression suite. You are leaving them with a subscription dependency. Tools that generate standard Playwright or Cypress files that can be committed to the repo and run anywhere are preferable for audit deliverables.

Security scanning

For audits that include security review, automated scanners like OWASP ZAP can supplement manual testing by checking for common vulnerabilities across all endpoints systematically. These are not a substitute for manual authorization testing but provide useful coverage for common patterns like SQL injection, XSS, and misconfigured headers.

6. Building a Regression Safety Net

The regression tests written during an audit are only valuable if they continue to run. A test suite that lives on a contractor's laptop and never gets integrated into the project is not a safety net. Here is how to make the audit output durable.

Structure the test suite for maintainability

Organize tests by feature area or user role, not by audit date. Use page object models or fixture functions to encapsulate selectors so that UI changes only require updates in one place. Write tests that are independent of each other and can run in any order. A test suite that is well-structured will outlast the audit engagement; one that is written as a one-off script often does not.

Integrate with CI before handing off

Part of the audit deliverable should be a working CI configuration that runs the regression tests on every pull request. If the client does not have CI set up, setting it up is a valuable part of the engagement. A GitHub Actions workflow that runs Playwright tests on every PR takes less than an hour to configure and has enormous ongoing value.
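
A minimal workflow along these lines, following the standard Playwright CI setup (the file path and Node version are illustrative; adjust the install steps to the project):

```yaml
# .github/workflows/e2e.yml -- run the regression suite on every PR
name: E2E regression tests
on:
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test
```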

Document which tests cover which risks

Include a brief mapping of tests to the risks they cover in the audit report. This helps future developers understand why specific tests exist and evaluate the risk of removing them. It also helps the client understand what is and is not covered by the test suite so they can prioritize expanding coverage over time.

7. What to Include in an Audit Report

The audit report is the primary artifact the client takes away from a traditional audit. In the audit-as-you-test model, it supplements the test suite rather than replacing it.

Architecture overview

Document what you discovered about how the application is structured: what framework it uses, how authentication is implemented, how data is stored, and how the major components interact. This is often the most valuable section for a client who has an AI-generated app they do not fully understand. Make this a working document the team can reference, not a formal deliverable that goes into a drawer.

Findings with severity ratings

List each finding with a brief description, the severity (critical, high, medium, low), and a recommended fix. Critical and high findings should include enough detail for a developer to reproduce the issue without the auditor present. For each finding that has a corresponding regression test, include a reference to the test.

Test suite summary

Include a summary of the regression test suite: how many tests, what flows they cover, how to run them, and how they are integrated into CI. Note any significant gaps in coverage so the client knows where to invest in additional test authoring.

Maintenance recommendations

Provide specific, actionable recommendations for maintaining quality going forward. This should include a policy for writing tests when bugs are found, a recommended schedule for expanding test coverage, and guidance on which areas of the codebase require the most careful review when making changes.

The goal of a good audit is not to produce a document that gets filed and forgotten. It is to leave the application in a better state: better understood, better tested, and with a foundation that makes future development safer. For AI-generated apps, the regression test suite is the most tangible part of that foundation.

Build Your Audit Test Suite

Assrt generates real Playwright tests from plain English descriptions of user flows. Write tests as you explore, hand off a working regression suite with the audit report.

$ npx assrt plan && npx assrt test