Quality Engineering

Testing Vibe-Coded Apps: From "Does It Render" to Production Ready

AI-generated code ships fast but breaks in ways that traditional development rarely does. The gap between "it works on my screen" and "it handles failure" is wider than ever. Here is how to close it.


40% of AI-generated code contains at least one correctness issue that only surfaces under non-happy-path conditions.

GitClear Code Quality Report, 2025

1. What "Vibe Coding" Actually Produces

The term "vibe coding" was popularized in early 2025 to describe the practice of generating entire applications through AI prompts. You describe what you want in natural language, an LLM generates the code, and you iterate by describing changes rather than writing code yourself. Tools like Cursor, Bolt, Lovable, and v0 have made this workflow accessible to millions of people.

The code that comes out of these sessions has a recognizable pattern. It tends to be verbose (AI models err on the side of explicit over concise). It handles the happy path well. It often uses popular libraries correctly for the main use case. But it routinely misses edge cases, error handling, accessibility, and performance considerations.

This is not a criticism of the tools. LLMs generate code that matches the most common patterns in their training data, and the most common patterns in training data are happy-path tutorials and documentation examples. If you prompt an LLM to "build a login form," you will get a form that works when you type a valid email and password. You probably will not get proper handling for network timeouts, rate limiting, concurrent sessions, or password manager autofill edge cases.

2. The Render vs. Resilience Gap

The fundamental quality gap in vibe-coded applications is the distance between "does it render" and "does it handle failure." A component that renders correctly in Storybook or in a dev server can still break catastrophically in production when an API returns an unexpected shape, when the user navigates away mid-request, or when a third-party script loads slowly.

Traditional development workflows catch some of these issues through experience. A senior developer writing a fetch call will instinctively add error handling, loading states, and cancellation logic because they have been burned before. AI-generated code does not have this scar tissue. It produces what was asked for, not what was needed.
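As a concrete illustration, here is a minimal sketch of the defensive wiring that experience adds by reflex: a timeout, caller-driven cancellation, and a check on the response status. It assumes Node 18+ (global `fetch` and `AbortController`); the `fetchJson` name and the injectable `fetchImpl` parameter are illustrative, not from any particular library.

```typescript
// A fetch wrapper with the defensive pieces AI generation tends to omit.
async function fetchJson<T>(
  url: string,
  {
    timeoutMs = 10_000,
    signal,               // lets the caller cancel, e.g. when the user navigates away
    fetchImpl = fetch,    // injectable for testing
  }: { timeoutMs?: number; signal?: AbortSignal; fetchImpl?: typeof fetch } = {}
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  // Propagate an external abort into our own controller.
  if (signal?.aborted) controller.abort();
  signal?.addEventListener("abort", () => controller.abort(), { once: true });
  try {
    const res = await fetchImpl(url, { signal: controller.signal });
    if (!res.ok) throw new Error(`HTTP ${res.status} from ${url}`);
    return (await res.json()) as T;
  } finally {
    clearTimeout(timer); // avoid a stray abort after the request settles
  }
}
```

Generated code typically has the `fetchImpl(url)` line and nothing else; everything around it is the scar tissue.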

The practical consequence is that vibe-coded applications need more testing, not less. And the testing needs to focus specifically on the areas that AI generation tends to miss: error states, loading states, concurrent operations, browser compatibility, and accessibility. This is the opposite of what most teams assume. They think AI-generated code is "good enough" and spend less time testing it.

End-to-end testing is particularly important here because it catches integration issues that unit tests miss. A component might handle errors correctly in isolation but fail when the error propagates through three layers of parent components. Only a real browser test that exercises the full stack will catch this.


3. Error State Testing for AI-Generated Code

The most effective testing strategy for vibe-coded apps is systematic error state injection. For every user flow in your application, you should have tests that verify what happens when things go wrong, not just when they go right.

Start with network failures. Every API call in your application should be tested with at least three failure modes: complete network failure (the request never completes), server error (500 response), and malformed response (200 status but unexpected JSON shape). Playwright makes this straightforward with its route interception API. You can intercept any request and return a custom response or abort the connection entirely.
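The three failure modes above can be sketched as a small Playwright helper. To keep the snippet self-contained, it declares minimal structural types for the two pieces of Playwright's API it uses (`page.route`, `route.abort`, `route.fulfill`); in a real suite you would import `Page` and `Route` from `@playwright/test`. The `injectFailure` name is illustrative.

```typescript
// Minimal structural types for the Playwright APIs used below.
type Route = {
  abort: (errorCode?: string) => Promise<void>;
  fulfill: (response: { status?: number; contentType?: string; body?: string }) => Promise<void>;
};
type Page = {
  route: (url: string | RegExp, handler: (route: Route) => Promise<void>) => Promise<void>;
};

type FailureMode = "offline" | "server-error" | "malformed";

// Force every request matching urlPattern into one of three failure modes.
async function injectFailure(page: Page, urlPattern: RegExp, mode: FailureMode): Promise<void> {
  await page.route(urlPattern, async (route) => {
    switch (mode) {
      case "offline":
        await route.abort("failed"); // the request never completes
        break;
      case "server-error":
        await route.fulfill({ status: 500, body: "Internal Server Error" });
        break;
      case "malformed":
        // 200 status, but not the JSON shape the UI expects
        await route.fulfill({
          status: 200,
          contentType: "application/json",
          body: '{"unexpected": true}',
        });
        break;
    }
  });
}
```

A test would call `injectFailure(page, /\/api\/login/, "server-error")` before exercising the flow, then assert that the UI shows a recoverable error rather than a blank screen.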

Next, test timing issues. What happens when a user double-clicks a submit button? What happens when they navigate away during a form submission? What happens when a request takes 30 seconds? These are the scenarios that vibe coding almost never handles, and they are the scenarios that cause the most user frustration in production.

Tools like Assrt can help here by automatically discovering user flows through crawling and then generating Playwright tests that cover the happy path. From there, you can extend the generated tests with error injection to cover the failure modes. This hybrid approach (AI discovers the flows, humans define the failure scenarios) works better than either fully manual or fully automated testing.

4. Requirements-First Development, Even with AI

One of the most counterintuitive lessons from vibe coding is that requirements matter more when AI writes the code, not less. When a human developer writes code, they unconsciously fill in gaps in the requirements based on experience and domain knowledge. When an AI writes code, it takes the prompt literally. If you did not specify error handling, you will not get error handling.

This means that the best way to improve vibe-coded application quality is to improve the prompts, and the best way to improve the prompts is to write explicit requirements before generating code. Requirements do not need to be formal documents. Even a bullet list of "this feature must handle these failure modes" dramatically improves the output.

Behavior-driven development (BDD) fits naturally here. If you write Gherkin-style scenarios before generating code, you have both a specification for the AI and a test plan for verification. Several tools support generating Playwright tests from Gherkin scenarios, including Cucumber.js and Playwright's own test runner with custom fixtures.
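A hypothetical scenario of this kind, deliberately targeting a failure mode rather than the happy path, might read:

```gherkin
Feature: Login

  Scenario: Login fails when the auth service is unreachable
    Given the auth API is returning a 500 error
    When the user submits valid credentials
    Then an error message is shown
    And the submit button is re-enabled
```

The same five lines serve as the prompt for code generation and as the skeleton of the verifying Playwright test.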

The workflow becomes: write requirements as testable scenarios, generate code from those requirements, run the scenarios as tests, fix failures, repeat. This is essentially test-driven development with AI as the code generator instead of a human developer.

5. The Irony: AI Can Also Generate the Tests

There is a satisfying irony in the vibe coding quality problem: the same AI models that generate code with missing error handling can also generate the tests that catch those missing error handlers. The key is to use different prompting strategies for code generation and test generation.

When generating application code, LLMs optimize for functionality (make it work). When generating tests, you can prompt LLMs to optimize for adversarial coverage (find ways it breaks). These are different modes that produce different outputs. An LLM prompted to "write tests that try to break this login form" will reliably suggest testing empty fields, SQL injection, extremely long inputs, special characters, and concurrent submissions.
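That adversarial output tends to take the shape of a parameterized input list. The sketch below shows the idea; `isValidEmail` is a hypothetical stand-in for the form's real validation, and the point is the shape of the inputs, not the validator itself.

```typescript
// Adversarial inputs of the kind an LLM reliably suggests for a login form.
const adversarialEmails: string[] = [
  "",                                   // empty field
  " ",                                  // whitespace only
  "' OR 1=1--",                         // SQL-injection-shaped input
  "a".repeat(10_000) + "@example.com",  // extremely long input
  "a@b@c.com",                          // special characters in the wrong place
  "missing-at-sign.example.com",        // structurally invalid
];

// Hypothetical validator standing in for the application's real one.
function isValidEmail(input: string): boolean {
  return input.length <= 254 && /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(input);
}
```

In a Playwright suite, each entry becomes one parameterized test case that submits the form and asserts a graceful rejection.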

Automated test discovery tools take this further. Instead of manually prompting for test ideas, tools like Assrt crawl your running application, identify interactive elements and user flows, and generate Playwright test code automatically. This catches the class of bugs where a feature exists but is not tested at all, which is the most common failure mode in vibe-coded applications.

The broader pattern here is using AI to check AI. Code review tools powered by LLMs (CodeRabbit, Ellipsis, Graphite's reviewer) can catch issues that the original code generation missed. Static analysis tools flag patterns that are technically valid but practically risky. End-to-end test generators verify behavior in a real browser. No single tool catches everything, but the combination creates a safety net that makes vibe-coded applications viable for production use.

The bottom line: vibe coding is not going away. The developers and teams that succeed with it will be those who invest in testing infrastructure proportional to the speed of their code generation. Faster code generation means more tests needed, not fewer.

Ready to automate your testing?

Assrt discovers test scenarios, writes Playwright tests from plain English, and self-heals when your UI changes.

$ npm install @assrt/sdk