Vibe Coding & Testing
Why Vibe-Coded Apps Break: The Case for Automated E2E Test Coverage
AI coding assistants let teams ship features in minutes. But speed without validation creates a growing pile of invisible regressions. Automated end-to-end testing is the guardrail most teams are still missing.
“An AI-powered security scan of a single vibe-coded quiz app uncovered 22 vulnerabilities, from exposed API keys to broken auth flows. Security is one gap, but regression testing is the other gap nobody talks about.”
r/AgentsOfAI
1. The Vibe Coding Speed Trap: Ship Fast, Break Silently
Vibe coding has changed how software gets built. Point an AI assistant at a problem, describe what you want in natural language, and watch working code materialize in minutes. Tools like Cursor, Claude Code, Bolt, and Lovable have made it possible for solo developers to build full-stack applications over a weekend. Teams that used to ship one feature per sprint now ship several per week.
The problem is not the speed itself. The problem is what gets skipped when everything moves this fast. When a developer hand-writes code over several days, they build a mental model of how each piece connects. They think about edge cases as they write conditional logic. They notice when a new feature might conflict with an existing one because they recently touched that code.
Vibe coding compresses this process so dramatically that the mental model never forms. The developer prompts, reviews the output, sees that it works for the immediate use case, and moves on to the next feature. Each individual change looks fine in isolation. But nobody is tracking how change number seven affects the behavior introduced in change number two. The result is a codebase where features silently break other features, and nobody notices until a user reports something weeks later.
A recent thread on r/AgentsOfAI illustrated this perfectly. Someone pointed an AI pentesting agent at a vibe-coded quiz application and found 22 distinct vulnerabilities. Exposed API keys, broken authentication flows, unvalidated inputs. The app worked fine on the surface. It passed every manual check the developer had performed. But underneath, it was full of problems that only systematic automated scanning could find.
2. Security vs. Regression: Two Different Failure Modes
That Reddit thread focused on security, and rightly so. Exposed API keys and broken auth are urgent problems. But security vulnerabilities represent only half of the risk profile for vibe-coded applications. The other half is regression: features that used to work and no longer do.
Security and regression testing address fundamentally different failure modes. Security testing asks: “Can someone exploit this?” Regression testing asks: “Does this still work the way it did yesterday?” Both questions matter, but they require different tools and different strategies.
Security scans are well understood. Tools like OWASP ZAP, Snyk, and AI-powered pentesters can audit a codebase for known vulnerability patterns. These scans are valuable, and every team shipping vibe-coded apps should run them. But they will not tell you that the checkout flow broke after the last prompt session, or that the search results page now shows duplicates because an AI refactored a deduplication function without understanding why it existed.
Regression failures are sneakier than security holes. A security vulnerability can be detected by scanning code patterns. A regression can only be detected by knowing what the application is supposed to do and verifying that it still does it. This is precisely why end-to-end tests exist: they encode the expected behavior of real user workflows and check that behavior continuously. For vibe-coded apps, where changes happen fast and without deep understanding of the full system, E2E regression testing is not optional. It is the only reliable way to catch the breakage that speed creates.
3. Why Manual Testing Cannot Keep Up with AI-Speed Shipping
The traditional response to quality concerns is manual QA. Build the feature, hand it to a tester, wait for them to click through the flows, file bugs, iterate. This worked when the pace of development was measured in days or weeks per feature.
Vibe coding has broken this model. When a developer can generate and ship three features before lunch, no QA team can keep up by clicking through screens manually. The math simply does not work. A thorough manual test of a single feature might take an hour. A regression pass through the entire application might take a full day. If the application changes meaningfully several times per day, manual testing becomes a bottleneck that either slows down shipping or gets skipped entirely.
Most teams choose to skip it. They rely on a quick manual check of the specific feature they just built, confirm it works, and deploy. The problem is that regressions almost never appear in the feature you just changed. They appear in some other part of the application that you did not think to check because you did not realize the connection existed.
This is why automation is not just a nice-to-have for vibe-coded projects. It is a structural necessity. Only automated tests can run a full regression suite in minutes, every time code changes, without human bottlenecks. The investment in setting up automated E2E tests pays for itself the first time it catches a regression that would have reached production.
4. What Good E2E Coverage Looks Like for AI-Generated Apps
E2E test coverage for a vibe-coded application needs to focus on a few key areas. Not every page needs exhaustive testing. But the critical paths that users depend on need to be covered thoroughly enough that you will know immediately when something breaks.
Core user journeys. Every application has a handful of flows that represent the primary value to users: signing up, completing a purchase, submitting a form, viewing a dashboard. These flows should have end-to-end tests that run on every code change. If the signup flow breaks, you need to know within minutes, not when a user tweets about it.
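A core-journey test can be short. Here is a minimal Playwright sketch of a signup flow; the `/signup` path, the field labels, and the "Welcome" success text are assumptions about a hypothetical app (and relative URLs assume a `baseURL` in your Playwright config), so adapt them to your own flow.

```typescript
import { test, expect } from "@playwright/test";

// Sketch of a signup-journey test. All selectors and copy are hypothetical.
test("new user can sign up", async ({ page }) => {
  await page.goto("/signup");

  // Unique email per run so repeated CI executions do not collide.
  await page.getByLabel("Email").fill(`user+${Date.now()}@example.com`);
  await page.getByLabel("Password").fill("a-strong-password-123");
  await page.getByRole("button", { name: "Create account" }).click();

  // The assertion encodes the expected behavior: signup lands the
  // user on a welcome screen. If a later change breaks this, CI fails.
  await expect(page.getByText("Welcome")).toBeVisible();
});
```

Role- and label-based locators (`getByRole`, `getByLabel`) are worth preferring here: they survive markup refactors far better than CSS selectors, which matters when an AI assistant is free to restructure your components.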
State transitions. Vibe-coded apps are especially prone to state management bugs because AI assistants often generate state logic that works in isolation but conflicts with other state in the application. Tests should verify that moving between states (logged out to logged in, empty cart to checkout, free tier to paid tier) works correctly and that the UI reflects the correct state at each step.
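A state-transition test asserts not just that an action succeeds, but that the UI reflects the new state everywhere it should. This sketch checks the logged-out to logged-in transition; the selectors and credentials are placeholders for a hypothetical app.

```typescript
import { test, expect } from "@playwright/test";

// Sketch: verify the logged-out -> logged-in transition is reflected
// in the UI. Selectors and copy are assumptions.
test("logging in moves the UI into the logged-in state", async ({ page }) => {
  await page.goto("/");

  // Precondition: the logged-out state is visible.
  await expect(page.getByRole("link", { name: "Log in" })).toBeVisible();

  await page.getByRole("link", { name: "Log in" }).click();
  await page.getByLabel("Email").fill("user@example.com");
  await page.getByLabel("Password").fill("hunter2");
  await page.getByRole("button", { name: "Sign in" }).click();

  // Postcondition: the logged-in state is visible AND the
  // logged-out affordance is gone. Checking both catches the classic
  // bug where two pieces of state disagree about who is signed in.
  await expect(page.getByRole("button", { name: "Log out" })).toBeVisible();
  await expect(page.getByRole("link", { name: "Log in" })).toHaveCount(0);
});
```

The negative assertion at the end is the important part: AI-generated state logic often updates one part of the UI and forgets another, and a test that only checks the happy signal will miss it.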
Integration boundaries. Anywhere your application talks to an external service (payment processors, authentication providers, APIs) is a common failure point. AI-generated code often handles the happy path of these integrations but misses error handling, timeout behavior, or changes in the external API response format. E2E tests that exercise these boundaries catch issues that unit tests cannot.
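Playwright's network interception makes it practical to test the unhappy path of an integration without touching the real service. This sketch simulates a payment-provider outage with `page.route`; the `/api/payments` endpoint and the error copy are assumptions.

```typescript
import { test, expect } from "@playwright/test";

// Sketch: force the payment API to fail and verify the UI degrades
// gracefully instead of hanging or crashing. Paths and copy are
// hypothetical.
test("checkout shows a friendly error when the payment API is down", async ({ page }) => {
  // Intercept calls to the (assumed) payment endpoint and return a 503.
  await page.route("**/api/payments", (route) =>
    route.fulfill({
      status: 503,
      contentType: "application/json",
      body: JSON.stringify({ error: "unavailable" }),
    })
  );

  await page.goto("/checkout");
  await page.getByRole("button", { name: "Pay now" }).click();

  // The app should surface a recoverable error, not a blank screen.
  await expect(
    page.getByText("Payment is temporarily unavailable")
  ).toBeVisible();
});
```

This is exactly the class of behavior AI-generated code tends to omit: the happy path works, but nobody ever exercised the 503 branch until a user hit it in production.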
Visual regression. When AI refactors a component, it can inadvertently change the layout, break responsive behavior, or introduce visual glitches that functional tests would not catch. Visual regression testing captures screenshots of key pages and compares them against baselines, alerting you to unintended visual changes. For consumer-facing applications, visual integrity is part of the product quality that automated testing should cover.
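Playwright ships visual comparison out of the box via `toHaveScreenshot`. A minimal sketch (the page path and threshold are illustrative; the first run generates the baseline image):

```typescript
import { test, expect } from "@playwright/test";

// Sketch: compare the landing page against a stored baseline screenshot.
// On first run, Playwright writes the baseline; later runs diff against it.
test("landing page matches the visual baseline", async ({ page }) => {
  await page.goto("/");

  // Allow up to 1% of pixels to differ, to tolerate minor
  // anti-aliasing noise while still catching real layout breakage.
  await expect(page).toHaveScreenshot("landing.png", {
    maxDiffPixelRatio: 0.01,
  });
});
```

When a diff fails, Playwright saves actual, expected, and diff images to the test output directory, which makes triaging an AI-introduced layout change a matter of glancing at three pictures.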
5. Tools and Approaches for Automated Test Generation
The good news is that the tooling for automated E2E testing has matured significantly. Teams building vibe-coded apps have several options depending on their stack, budget, and comfort level with test infrastructure.
Playwright (manual authoring). Playwright is the current standard for browser automation and E2E testing. It supports Chromium, Firefox, and WebKit, handles modern web features like shadow DOM and iframes, and has excellent auto-wait capabilities that reduce flakiness. Writing Playwright tests by hand gives you the most control, but it requires significant time investment, especially for large applications. For teams that already have Playwright expertise, manually authoring tests for critical paths is a solid approach.
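The cross-browser coverage mentioned above is a one-time config change rather than extra test code. A minimal `playwright.config.ts` sketch (the `baseURL` and test directory are assumptions about your project layout):

```typescript
// playwright.config.ts — minimal sketch running the same suite across
// all three browser engines. baseURL and testDir are assumptions.
import { defineConfig, devices } from "@playwright/test";

export default defineConfig({
  testDir: "./tests",
  retries: 1, // one retry absorbs transient flakiness in CI
  use: {
    baseURL: "http://localhost:3000",
    trace: "on-first-retry", // capture a debug trace only when a test fails
  },
  projects: [
    { name: "chromium", use: { ...devices["Desktop Chrome"] } },
    { name: "firefox", use: { ...devices["Desktop Firefox"] } },
    { name: "webkit", use: { ...devices["Desktop Safari"] } },
  ],
});
```

Every test in the suite now runs three times, once per engine, which catches the WebKit-only rendering bugs that a Chromium-only setup never sees.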
Cypress. Cypress remains popular for its developer-friendly experience and real-time browser preview during test development. It is particularly well-suited for teams that want a gentler learning curve. The trade-off is that Cypress runs tests in a single browser engine and has some limitations with multi-tab and cross-origin scenarios that Playwright handles more naturally.
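For comparison, the same kind of login check in Cypress reads like this; the selectors and copy are again placeholders for a hypothetical app.

```typescript
// Sketch of a Cypress login test. Selectors and copy are hypothetical.
describe("login", () => {
  it("shows the dashboard after signing in", () => {
    cy.visit("/login");
    cy.get("input[name=email]").type("user@example.com");
    cy.get("input[name=password]").type("hunter2");
    cy.contains("button", "Sign in").click();

    // Cypress retries assertions automatically until they pass or
    // time out, which covers most async rendering without explicit waits.
    cy.url().should("include", "/dashboard");
    cy.contains("Welcome back").should("be.visible");
  });
});
```

The chained `cy.*` command style is what makes the learning curve gentle; the flip side is that the single-engine, single-tab model shows up as soon as your flows span origins or windows.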
AI-powered test generation. A newer category of tools uses AI to generate E2E tests automatically. Assrt, for example, crawls your application, discovers testable scenarios, and generates standard Playwright test files that you can inspect, modify, and run in CI. The advantage of this approach is speed: instead of spending days writing tests manually, you can get baseline coverage in minutes. The generated tests are regular Playwright code, so there is no vendor lock-in. Other tools in this space include Octomind, Momentic, and QA Wolf (though some of these carry significant price tags compared to open-source alternatives).
Record-and-replay tools. Tools like Playwright Codegen and Chrome DevTools Recorder let you record browser interactions and export them as test scripts. This is a fast way to generate initial test coverage, though the resulting scripts often need cleanup and tend to rely on brittle selectors. They work well as a starting point that you refine over time.
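The cleanup step usually amounts to replacing positional selectors with semantic locators. A sketch of the before-and-after inside a Playwright test (the button name is illustrative):

```typescript
import { test } from "@playwright/test";

// Sketch: hardening a recorded interaction. The "Subscribe" button
// is a hypothetical example.
test("subscribe button works after cleanup", async ({ page }) => {
  await page.goto("/");

  // A recorder typically emits a brittle positional selector like:
  //   await page.click("#root > div:nth-child(2) > form > button");
  // Any layout refactor (which AI assistants do constantly) breaks it.

  // The hardened version targets the element by its accessible role
  // and name, which survives restructuring:
  await page.getByRole("button", { name: "Subscribe" }).click();
});
```

A useful rule of thumb: treat recorded scripts as a scaffold, and consider a test "done" only once every selector in it would survive a redesign of the surrounding markup.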
The best approach for most teams is a combination. Use AI-powered generation (whether through Assrt, Copilot, or manual prompting) to get broad baseline coverage quickly. Then invest manual effort in hardening the tests for your most critical flows. Run everything in CI on every push. The goal is not perfect coverage from day one. The goal is catching the regressions that vibe coding creates before your users do. Even a small suite of automated E2E tests, covering your core signup, authentication, and primary feature flows, will catch more issues than any amount of manual spot-checking.