Verification Engineering
Still Building Faster Horses: Why AI Needs to Verify, Not Just Generate
AI coding tools let us ship features in minutes. But generating code faster without generating the verification layer is just building faster horses. The real paradigm shift is automated proof that the code works.
“72% of developers using AI coding assistants report shipping code without adequate test coverage, citing speed pressure and the assumption that AI-generated code is correct by default.”
State of AI Development Survey, 2025
1. The Faster Horses Problem in AI Development Tools
Henry Ford supposedly said that if he had asked people what they wanted, they would have said “a faster horse.” The quote is almost certainly apocryphal, but the insight endures. Real innovation does not come from doing the same thing faster. It comes from changing what you do entirely.
AI coding tools, as they exist in 2026, are faster horses. Cursor, Claude Code, GitHub Copilot, Windsurf, and dozens of other assistants have made it possible to generate features at extraordinary speed. A developer can describe a feature in natural language and have working code in minutes. Sprint velocity has gone through the roof. The horse is faster than ever.
But the fundamental process has not changed. A human describes what they want. A system (now AI, previously the developer's own hands) writes the code. The code gets deployed. Users encounter bugs. The team scrambles to fix them. The cycle repeats, just at a higher tempo. The problems have not been solved; they have been accelerated.
The question from the r/vibecoding thread captures this perfectly: what if we are still just building faster horses? The answer, for most teams, is yes. And the reason is that the entire industry has focused AI on the generation side of software development while largely ignoring the verification side. We have turbocharged the ability to write code. We have barely touched the ability to prove that code is correct.
2. Why AI-Generated Code Ships Without Proper Validation
A predictable psychological effect sets in when you watch an AI produce clean, well-structured code in seconds: you trust it. The code looks right. It uses sensible variable names, follows common patterns, and compiles without errors. The natural human impulse is to ship it and move on to the next feature.
This implicit trust is reinforced by the speed of the process. When writing code manually took hours, there was a natural pause for reflection, review, and testing. That pause was born partly of exhaustion and partly of the sense that significant effort deserved significant validation. When generating code takes thirty seconds, that psychological pause disappears. The effort feels trivial, so the validation feels optional.
Team dynamics amplify this. Product managers see faster delivery and push for more features. Engineering managers see higher velocity metrics and reward output volume. Nobody is explicitly saying “skip the tests,” but the incentive structure makes testing the first thing that gets squeezed when deadlines tighten. And with AI making code generation so fast, deadlines tighten constantly because stakeholders recalibrate their expectations upward.
The result is a codebase that grows rapidly with thin or nonexistent test coverage. Each untested feature is a liability. Each deployment is a gamble. The team is shipping faster than ever, but confidence in what they ship has actually declined. They have optimized for the wrong metric.
3. The Verification Gap: Code Generation vs. Test Generation
Here is the core asymmetry. AI is remarkably good at generating code from descriptions because code is the training data it has seen the most of. Stack Overflow answers, GitHub repositories, documentation examples, and tutorials are overwhelmingly about writing implementation code. Tests are underrepresented in training data, and the tests that do exist tend to be simple unit tests that verify happy paths.
When you ask an AI to write tests, it does something subtly wrong. It reads the implementation and generates tests that confirm the implementation does what it does. This is circular. If the implementation has a bug, the test will verify the buggy behavior as correct. The tests become a mirror of the code rather than an independent specification of correct behavior.
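The circularity is easiest to see in a toy example. In the sketch below, `applyDiscount` and its pricing rule are invented for illustration: the (hypothetical) spec says orders of $100 or more get 10% off, but the implementation uses a strict comparison. A test derived by reading the code encodes the bug as expected behavior; a test derived from the spec catches it.

```typescript
// Hypothetical spec: orders of $100 OR MORE get a 10% discount.
// The implementation has an off-by-one bug: it uses > instead of >=.
function applyDiscount(total: number): number {
  return total > 100 ? total * 0.9 : total; // bug: should be >= 100
}

// Circular test, generated by reading the implementation. It asserts
// what the code does, so it passes and the bug survives.
const circularTestPasses = applyDiscount(100) === 100;

// Specification-based test, derived from the requirement instead of
// the code. It fails against the buggy implementation, exposing the bug.
const specTestPasses = applyDiscount(100) === 90;

console.log(`circular test: ${circularTestPasses ? "PASS" : "FAIL"}`);
console.log(`spec test:     ${specTestPasses ? "PASS" : "FAIL"}`);
```

Both tests exercise the same line of code, but only the one written against an independent definition of "correct" has any power to find the fault.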
There is also a structural problem with how tests are generated. AI tools work at the file or function level. They see one component at a time. But the bugs that cause production incidents are rarely contained within a single component. They emerge from interactions between components, from race conditions between services, from state that leaks across boundaries. Generating meaningful end-to-end tests requires understanding the whole application, not just the function you are currently editing.
This is the verification gap. We have AI that can generate any piece of code you describe. We do not have AI that can independently determine what “correct” means for a given application and then prove that the code meets that standard. Closing this gap is the actual paradigm shift, not generating code 10x faster.
4. How Automated Test Discovery Changes the Equation
The breakthrough happens when AI stops waiting for a developer to ask for tests and starts discovering what needs to be tested on its own. This is the difference between a code completion tool and a verification system. Code completion responds to prompts. A verification system proactively identifies what could go wrong.
Automated test discovery works by crawling a live application, mapping its routes, interactions, and state transitions, and then generating tests that cover the real user flows. This approach sidesteps the circular problem of AI reading its own code to generate tests. Instead, it treats the application as a black box and tests what users actually experience. Tools like Assrt take this approach: they auto-discover your application's pages and flows, then generate real Playwright tests in standard JavaScript or TypeScript. No proprietary test format, no vendor lock-in, just tests that run in your existing CI pipeline.
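The traversal at the heart of discovery-based testing can be sketched in a few lines. A real tool drives a headless browser against a live app; in this simplified sketch the app's link structure is modeled as an in-memory map (the routes are invented), so only the crawling logic is shown. Each discovered route comes with the path a user would actually take to reach it, which is exactly the shape of an end-to-end test.

```typescript
// Hypothetical app structure: each route maps to the routes it links to.
const appRoutes: Record<string, string[]> = {
  "/": ["/login", "/pricing"],
  "/login": ["/dashboard"],
  "/pricing": ["/checkout"],
  "/dashboard": ["/settings"],
  "/checkout": [],
  "/settings": [],
};

// Breadth-first crawl from the root. For every route discovered,
// record the sequence of pages (the user flow) that reaches it.
function discoverFlows(
  routes: Record<string, string[]>,
  start: string
): Map<string, string[]> {
  const flows = new Map<string, string[]>([[start, [start]]]);
  const queue = [start];
  while (queue.length > 0) {
    const page = queue.shift()!;
    for (const next of routes[page] ?? []) {
      if (!flows.has(next)) {
        flows.set(next, [...flows.get(page)!, next]);
        queue.push(next);
      }
    }
  }
  return flows;
}

// Each entry is a candidate end-to-end test.
for (const [page, path] of discoverFlows(appRoutes, "/")) {
  console.log(`${page}: ${path.join(" -> ")}`);
}
```

Production tools do far more than follow links (they fill forms, click buttons, and track client-side state), but the principle is the same: the test suite is derived from the application's observable surface, not from its source code.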
The key distinction is that discovery-based testing generates verification from the outside in, starting from user behavior and working back toward the code. Prompt-based test generation works from the inside out, starting from the code and generating tests that confirm it runs. Outside-in verification catches an entirely different class of bugs: broken navigation, missing form validations, incorrect redirects, visual regressions, and integration failures between components that each work correctly in isolation.
Property-based testing frameworks like fast-check and Hypothesis complement this approach at the unit level by generating hundreds of randomized inputs to verify invariants that should always hold. Mutation testing tools like Stryker verify that your tests actually detect real faults. Together, these approaches form a verification layer that is genuinely different from “AI writes the tests too.”
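The core idea of property-based testing fits in a small, hand-rolled sketch. Frameworks like fast-check add shrinking and rich input generators, but underneath they do what this loop does: generate many randomized inputs and check an invariant on every one. The `clamp` function and its invariant here are invented for illustration.

```typescript
// Function under test: clamp a value into the range [min, max].
function clamp(value: number, min: number, max: number): number {
  return Math.min(Math.max(value, min), max);
}

// Properties that should hold for ALL inputs:
//   1. the result is always within [min, max];
//   2. a value already in range is returned unchanged.
// Returns a counterexample description, or null if none was found.
function checkClampProperty(runs: number): string | null {
  for (let i = 0; i < runs; i++) {
    const value = (Math.random() - 0.5) * 2000;
    const a = (Math.random() - 0.5) * 2000;
    const b = (Math.random() - 0.5) * 2000;
    const [min, max] = a <= b ? [a, b] : [b, a];
    const result = clamp(value, min, max);
    if (result < min || result > max) {
      return `out of range: clamp(${value}, ${min}, ${max}) = ${result}`;
    }
    if (value >= min && value <= max && result !== value) {
      return `in-range value changed: clamp(${value}, ${min}, ${max}) = ${result}`;
    }
  }
  return null;
}

console.log(checkClampProperty(1000) ?? "property held for 1000 random inputs");
```

Note what this verifies: not that the code does what it does, but that an invariant stated independently of the implementation holds across inputs no developer would have enumerated by hand.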
5. Practical Steps to Add AI Verification to Your Workflow
Moving from faster horses to a genuine paradigm shift does not require a complete overhaul of your workflow. It requires adding a verification layer alongside your existing generation tools. Here is how to do it incrementally.
Separate generation from verification. The single most important principle is that the system generating code should not be the only system validating it. Use your AI coding assistant to write features. Then use a separate tool or process to verify those features. This can be a dedicated test discovery tool like Assrt that crawls your app and generates Playwright tests independently. It can be a manual QA pass by a team member who did not write the feature. It can be a property-based testing framework that generates inputs the developer never considered. The point is independence.
Make verification automatic, not optional. If testing depends on a developer remembering to run it, it will not happen consistently. Add verification to your CI/CD pipeline as a required gate. Run end-to-end test suites on every pull request. Run discovery-based crawls on staging after every deployment. Make it impossible to ship code that has not been independently verified.
Start with your critical paths. You do not need to verify everything at once. Identify the five to ten user flows that would cause the most damage if they broke: authentication, payment, onboarding, core feature usage. Get automated verification running on those flows first. Expand from there as you build confidence in the approach.
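A critical-path check written in standard Playwright might look like the sketch below. The URL, labels, and credentials are placeholders, not a real application; the point is that the assertion encodes the user-facing outcome (a successful login lands on the dashboard), not any detail of the implementation.

```typescript
// Sketch of a critical-path end-to-end test in standard Playwright.
// staging.example.com, the form labels, and the credentials are
// hypothetical; substitute your own app's flow.
import { test, expect } from "@playwright/test";

test("login flow reaches the dashboard", async ({ page }) => {
  await page.goto("https://staging.example.com/login");
  await page.getByLabel("Email").fill("qa@example.com");
  await page.getByLabel("Password").fill("test-password");
  await page.getByRole("button", { name: "Sign in" }).click();

  // Verify the outcome users care about, not internal state.
  await expect(page).toHaveURL(/\/dashboard/);
  await expect(page.getByRole("heading", { name: "Dashboard" })).toBeVisible();
});
```

Because this is plain Playwright, it runs in any CI system, can be reviewed and modified like hand-written code, and works whether it was written by a person or generated by a discovery tool.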
Use standard test formats. Any verification tool that locks you into a proprietary format is creating a new problem while solving an old one. Insist on tests that run in Playwright, Cypress, or your existing test framework. Standard formats mean you can review the tests, modify them, and run them in any CI system. They also mean your team can learn from the generated tests and improve their own testing practices.
Measure verification coverage, not code coverage. Code coverage tells you which lines were executed during testing. Verification coverage tells you which user-facing behaviors have been independently confirmed to work. Track how many of your critical flows have automated end-to-end tests. Track how many of your API endpoints have property-based tests. Track how many of your deployments pass verification gates without human intervention. These metrics tell you whether your verification layer is real or just decorative.
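The metric itself is simple to compute once you maintain an explicit list of critical flows. A minimal sketch, with invented flow names and statuses, assuming you record for each flow whether an independent end-to-end test exists and passes:

```typescript
// Verification coverage: the fraction of critical user flows that
// have a passing, independently generated end-to-end test.
interface FlowStatus {
  flow: string;
  hasPassingE2eTest: boolean;
}

function verificationCoverage(flows: FlowStatus[]): number {
  if (flows.length === 0) return 0;
  const verified = flows.filter((f) => f.hasPassingE2eTest).length;
  return verified / flows.length;
}

// Hypothetical status board for the critical paths named above.
const criticalFlows: FlowStatus[] = [
  { flow: "authentication", hasPassingE2eTest: true },
  { flow: "payment", hasPassingE2eTest: true },
  { flow: "onboarding", hasPassingE2eTest: false },
  { flow: "core feature usage", hasPassingE2eTest: true },
];

console.log(
  `verification coverage: ${verificationCoverage(criticalFlows) * 100}%`
);
```

Unlike line coverage, this number cannot be inflated by tests that merely execute code; it only moves when a user-facing behavior gains an independent check.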