AI Test Generation

From User Story to Playwright Test in Minutes

The gap between writing a user story and having a test that verifies it has traditionally been days or weeks. AI-powered test generation is compressing that to minutes. Here is how it works, where it excels, and where it still needs human oversight.



1. The User Story to Test Gap

In most development teams, the path from a user story to a passing test involves multiple handoffs and significant elapsed time. A product manager writes the story. A developer implements it. A QA engineer (or the same developer) writes tests for it. Each handoff introduces delay and information loss. By the time the test is written, the implementation may have already changed.

This gap creates a practical problem: tests lag behind features. In fast-moving teams, features ship to production before comprehensive tests exist. The team accumulates test debt, promising to "add tests later," which often means never. When a regression is eventually caught, it is caught by users, not by tests.

AI test generation aims to close this gap by converting natural language specifications (user stories, acceptance criteria, feature descriptions) directly into executable test code. The generated tests are not perfect, but they provide immediate baseline coverage that can be refined over time. Having an imperfect test immediately is almost always better than having a perfect test never.

2. How AI Translates Specs into Test Code

AI test generation from user stories works through several mechanisms depending on the tool. LLM-based approaches (using models like GPT-4 or Claude) take a natural language description and generate test code that exercises the described behavior. For example, given "As a user, I can reset my password via email," the model generates a Playwright test that navigates to the login page, clicks "forgot password," enters an email, and verifies the confirmation message.

The quality of the generated test depends heavily on the specificity of the input. A vague story like "users can manage their profile" produces vague tests. A detailed acceptance criteria list ("user can change display name, user sees validation error for names under 2 characters, user can upload a profile picture under 5MB") produces specific, useful tests.

Some tools enhance LLM generation with application context. They feed the model your existing test files, component structure, or page screenshots alongside the user story. This helps the model generate tests that use your actual selectors, page URLs, and test patterns rather than generic placeholders. The result is test code that often runs on the first attempt with minimal modification.
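
One common way to supply that context is to assemble it into the model prompt alongside the story. A minimal sketch (the prompt shape and all helper names here are assumptions, not any specific tool's API):

```typescript
// Sketch: combine a user story with application context into an LLM prompt.
interface AppContext {
  pageUrls: string[];   // routes the test may navigate to
  exampleTest?: string; // an existing test file whose style the model should imitate
  selectors?: string[]; // known stable selectors, e.g. data-testid values
}

function buildGenerationPrompt(userStory: string, ctx: AppContext): string {
  const parts = [
    'Write a Playwright test in TypeScript for the following user story.',
    `User story: ${userStory}`,
    `Known routes: ${ctx.pageUrls.join(', ')}`,
  ];
  if (ctx.selectors?.length) {
    parts.push(`Prefer these selectors: ${ctx.selectors.join(', ')}`);
  }
  if (ctx.exampleTest) {
    parts.push('Match the style of this existing test:\n' + ctx.exampleTest);
  }
  return parts.join('\n\n');
}

const prompt = buildGenerationPrompt(
  'As a user, I can reset my password via email',
  { pageUrls: ['/login', '/forgot-password'], selectors: ['[data-testid="email-input"]'] }
);
console.log(prompt);
```

Grounding the prompt in real routes and selectors is what moves the output from generic placeholders to code that runs against your app on the first attempt.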


3. LLM-Based vs. Crawl-Based Generation

There are two fundamentally different approaches to AI test generation, and they complement each other. LLM-based generation starts from a specification (user story, feature doc, or plain English description) and produces test code that should exercise the described behavior. It works well when you know what you want to test and need the automation code written for you.

Crawl-based generation takes the opposite approach. Instead of starting from a spec, it starts from the running application. A tool like Assrt navigates your app, discovers all interactive elements and user paths, and generates tests for what it finds. This catches features that exist but are not documented in any user story, which is common in applications with accumulated tech debt or features added directly by developers without formal specs.

The ideal workflow uses both. Start with LLM-based generation for new features where you have clear acceptance criteria. Supplement with crawl-based discovery to catch gaps and verify that the full application works end-to-end. The LLM-generated tests verify intent (does it do what we specified?); the crawl-generated tests verify reality (does the application actually work when a user interacts with it?).

4. Feedback Loops: When Generated Tests Fail

Generated tests fail for two distinct reasons: the test is wrong, or the application is wrong. Distinguishing between these is the critical human judgment step that AI cannot reliably perform. When an LLM generates a test that clicks a "Submit" button but the actual button says "Save Changes," the test is wrong. When a test verifies that a success message appears after form submission and no message appears, the application might be wrong.

Effective feedback loops handle both cases. When a generated test fails because it uses wrong selectors or incorrect page structure, the generation tool should be able to self-correct. Some tools run the generated test immediately, observe the failure, and regenerate with the error context. This try-run-fix cycle often resolves selector issues within two or three iterations.
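
That try-run-fix cycle can be sketched as a bounded retry loop. The `generate` and `run` callbacks below are hypothetical stand-ins for an LLM call and a Playwright run; real tools differ in the details:

```typescript
// Sketch of a try-run-fix loop. Both callbacks are placeholders:
// `generate` would call an LLM, `run` would execute the test with Playwright.
interface RunResult {
  passed: boolean;
  errorMessage?: string; // e.g. 'button "Submit" not found'
}

async function generateWithRetries(
  story: string,
  generate: (story: string, priorError?: string) => Promise<string>,
  run: (testCode: string) => Promise<RunResult>,
  maxAttempts = 3
): Promise<string> {
  let lastError: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Regenerate, feeding back the previous failure so the model can self-correct
    const code = await generate(story, lastError);
    const result = await run(code);
    if (result.passed) return code;
    lastError = result.errorMessage;
  }
  // Out of attempts: surface the failure for human triage instead of looping forever
  throw new Error(`Test still failing after ${maxAttempts} attempts: ${lastError}`);
}
```

The bound matters: selector mismatches usually resolve in two or three iterations, while a genuine application bug never will, so the loop should stop and hand the failure to a human rather than retry indefinitely.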

For genuine application bugs caught by generated tests, the feedback loop should produce actionable bug reports. The best tools capture screenshots at the point of failure, record the network requests, and include the full Playwright trace. This turns a test failure into a complete reproduction case that a developer can investigate immediately, rather than a vague "test failed" notification that requires manual investigation.
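
In Playwright itself, much of this capture is configuration rather than code. A minimal config that records a trace, screenshot, and video on failure (the option values shown are standard Playwright settings; the retry count is a common choice, not a requirement):

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: 1, // retry once so 'on-first-retry' artifacts get captured
  use: {
    trace: 'on-first-retry',       // full Playwright trace for failed runs
    screenshot: 'only-on-failure', // screenshot at the point of failure
    video: 'retain-on-failure',    // video kept only when the test fails
  },
});
```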

5. Building a Sustainable Generation Workflow

The temptation with AI test generation is to generate everything at once and end up with hundreds of tests that nobody maintains. A more sustainable approach is to integrate test generation into your existing development workflow rather than treating it as a separate activity.

In practice, this means generating tests as part of the feature development process. When a developer picks up a user story, they generate tests from the acceptance criteria before (or alongside) writing the implementation. The generated tests run in CI alongside manually written tests. When they fail, they are triaged like any other test failure. When they pass, they become part of the regression suite.

Maintenance is the hidden cost that makes or breaks any test generation strategy. Generated tests need the same care as manually written tests: they should be reviewed, refactored when they become brittle, and deleted when they test deprecated features. Self-healing selectors (available in tools like Assrt) reduce maintenance overhead significantly, but they do not eliminate the need for periodic test suite curation.

The goal is not to replace the test writing process entirely. It is to eliminate the blank page problem. Starting from a generated test and refining it is faster than starting from nothing, and it ensures that every feature has at least basic coverage from day one.

Ready to automate your testing?

Assrt discovers test scenarios, writes Playwright tests from plain English, and self-heals when your UI changes.

$ npm install @assrt/sdk