Testing Guide
AI-Driven BDD Framework Generator: From User Stories to Executable Gherkin
Behavior-Driven Development bridges the gap between business requirements and test automation. AI is now capable of generating Gherkin scenarios and their underlying step definitions directly from user stories, but the results require careful human oversight.
“AI-generated BDD scenarios reduce the time from user story to executable test specification by 5x, but product manager review remains essential for validating business intent accuracy.”
BDD automation studies
1. The Promise and Pain of BDD
Behavior-Driven Development was supposed to solve one of the oldest problems in software: the gap between what the business wants and what engineers build. By writing requirements as executable Gherkin scenarios (Given/When/Then), teams could create a shared language that both product managers and developers understand. The scenarios serve double duty as documentation and automated tests.
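A minimal feature file shows the shape. This example is hypothetical, but it is the kind of scenario discussed throughout this guide:

```gherkin
Feature: Discount codes
  Scenario: Applying a valid discount code
    Given a user has items in their cart
    When they apply a 20% discount code
    Then the total should reflect the discounted price
```

Each line maps to a step definition in code, so the same text serves as both readable documentation and an executable test.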
In practice, the adoption story has been mixed. Writing good Gherkin is harder than it looks. Scenarios need to be specific enough to be executable but abstract enough to remain readable by non-technical stakeholders. Step definitions (the code that implements each Given/When/Then line) accumulate complexity over time, especially when scenarios involve conditional logic, data setup, or multi-step workflows. Many teams that adopted BDD found themselves spending more time maintaining the Gherkin layer than they saved in communication clarity.
The bottleneck was always the translation step. Someone with both domain knowledge and technical skill had to sit down and convert user stories into well-structured Gherkin, then implement the step definitions in code. This person was often the QA engineer, and the translation work consumed a significant portion of their time. AI is now automating this translation, and the results are changing how teams approach BDD.
2. How AI Generates Gherkin from User Stories
Modern AI models can take a user story (or a Jira ticket, or a product requirements document) and generate structured Gherkin scenarios that cover the happy path, edge cases, and error conditions. The generation process works by analyzing the intent of the requirement, identifying the actors and actions involved, and structuring the output in the standard Given/When/Then format with appropriate scenario outlines for data-driven variations.
The quality of the generated scenarios depends heavily on the quality of the input. A vague user story like "users should be able to check out" produces generic scenarios. A well-structured story with acceptance criteria, edge cases, and business rules produces scenarios that are much closer to what a skilled BDD practitioner would write. This is an important insight: AI amplifies the quality of your requirements rather than compensating for poor ones.
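When a story includes concrete acceptance criteria, the generated output can use scenario outlines to cover data-driven variations. A hypothetical example of what that looks like:

```gherkin
Feature: Checkout
  Scenario Outline: Rejecting invalid discount codes
    Given a user has items in their cart
    When they apply the discount code "<code>"
    Then they should see the error "<message>"

    Examples:
      | code     | message               |
      | EXPIRED1 | This code has expired |
      | XXXX     | Code not recognized   |
```

Each row in the Examples table becomes a separate test run, so one outline covers a family of edge cases the story's acceptance criteria spelled out.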
Beyond the Gherkin itself, AI can also generate the step definition code that implements each scenario. Given access to your application's page objects, API endpoints, or component structure, the AI can produce step definitions that interact with the actual application. Tools like Assrt take this further by combining BDD generation with application discovery, so the generated steps reference real selectors and navigation paths rather than placeholder code.
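The binding between Gherkin text and code can be sketched as a pattern registry. This is a simplified, self-contained stand-in for what frameworks like Cucumber do internally; the step text, `defineStep` helper, and cart example are illustrative assumptions, not any tool's actual API:

```typescript
// Minimal sketch of how step definitions bind Gherkin lines to code.
type StepFn = (...args: string[]) => void;
const registry: Array<{ pattern: RegExp; fn: StepFn }> = [];

function defineStep(pattern: RegExp, fn: StepFn): void {
  registry.push({ pattern, fn });
}

// Match a plain-text Gherkin step against the registry and run it.
function runStep(text: string): boolean {
  for (const { pattern, fn } of registry) {
    const match = text.match(pattern);
    if (match) {
      fn(...match.slice(1)); // captured groups become step parameters
      return true;
    }
  }
  return false; // undefined step: generation would need to create one
}

// A generated step definition; a real one would drive the app,
// e.g. clicking a discovered selector instead of mutating an array.
const cart: string[] = [];
defineStep(/^the user adds "(.+)" to their cart$/, (item) => {
  cart.push(item);
});

runStep('the user adds "Blue T-Shirt" to their cart');
```

The value of application discovery is precisely in that last layer: the body of each step can reference selectors that actually exist rather than placeholders.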
The output typically needs refinement. AI-generated Gherkin tends to be more verbose than what an experienced practitioner would write, sometimes including redundant steps or overly specific details that should be abstracted. The step definitions may use suboptimal locator strategies or miss important assertions. But as a first draft that captures 80% of the intent correctly, the time savings are substantial.
3. Handling Complex Interactions in Generated Steps
Simple form submissions and page navigations are straightforward for AI to generate. The challenge emerges with complex interactions: drag and drop operations, multi-step form wizards with conditional branching, real-time collaborative editing, file uploads with progress indicators, and interactions that depend on precise timing or animation completion.
Drag and drop is a particularly good example. A Gherkin scenario might say "When the user drags the task card to the Done column," but the step definition needs to handle mouse down events, element coordinates, smooth movement simulation, drop target detection, and verification that the DOM updated correctly. AI models can generate this code, but they often produce implementations that work in simple cases and break in real applications with custom drag libraries, nested scroll containers, or virtual lists.
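The gap between the one-line Gherkin step and the event sequence underneath can be sketched as follows. The `FakePage` below is a simplified stand-in for a browser automation API such as Playwright's mouse interface (an assumption for illustration); the point is the interpolation: many drag libraries ignore a single jump from source to target, so the step must emit incremental moves:

```typescript
interface Point { x: number; y: number }

// Simplified stand-in for a browser page that records mouse events.
interface FakePage {
  moves: Point[];
  down: boolean;
  mouseDown(p: Point): void;
  mouseMove(p: Point): void;
  mouseUp(p: Point): void;
}

function makePage(): FakePage {
  return {
    moves: [],
    down: false,
    mouseDown(p) { this.down = true; this.moves.push(p); },
    mouseMove(p) { if (this.down) this.moves.push(p); },
    mouseUp(p) { this.down = false; this.moves.push(p); },
  };
}

// What "drags the task card to the Done column" must actually do:
// press, move smoothly in small increments, then release on the target.
function dragTo(page: FakePage, from: Point, to: Point, steps = 5): void {
  page.mouseDown(from);
  for (let i = 1; i <= steps; i++) {
    page.mouseMove({
      x: from.x + ((to.x - from.x) * i) / steps,
      y: from.y + ((to.y - from.y) * i) / steps,
    });
  }
  page.mouseUp(to);
}

const page = makePage();
dragTo(page, { x: 100, y: 200 }, { x: 400, y: 200 });
```

A real implementation also needs drop-target detection and a post-drop DOM assertion, which is exactly where AI-generated versions tend to fall short.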
Multi-step forms present a different challenge. A checkout wizard with shipping, payment, and confirmation steps requires the step definitions to maintain state across steps, handle validation errors at each stage, and manage the navigation between steps correctly. AI-generated step definitions sometimes treat each step as independent, losing the context that a real user would carry through the flow.
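Carrying state through the wizard can be sketched with a shared context object. In Cucumber this role is played by the "World" object; here it is a plain class so the example stays self-contained, and the step names, fields, and validation rules are assumptions:

```typescript
interface CheckoutState {
  step: "shipping" | "payment" | "confirmation";
  shippingAddress?: string;
  cardLast4?: string;
  errors: string[];
}

// One context object shared by all step definitions in a scenario,
// so each step builds on what the previous steps established.
class CheckoutWorld {
  state: CheckoutState = { step: "shipping", errors: [] };

  submitShipping(address: string): void {
    if (!address) { this.state.errors.push("address required"); return; }
    this.state.shippingAddress = address;
    this.state.step = "payment"; // advance, carrying context forward
  }

  submitPayment(cardNumber: string): void {
    if (this.state.step !== "payment") {
      // Guard against steps being run as if they were independent.
      this.state.errors.push("payment submitted out of order");
      return;
    }
    this.state.cardLast4 = cardNumber.slice(-4);
    this.state.step = "confirmation";
  }
}

const world = new CheckoutWorld();
world.submitShipping("1 Main St");
world.submitPayment("4242424242424242");
```

Generated step definitions that skip this shared context are the ones that "treat each step as independent" and break on any flow where order matters.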
The practical approach is to use AI for generating the scenario structure and basic step definitions, then have engineers refine the complex interaction code. This division of labor plays to each party's strengths: AI handles the repetitive scaffolding work quickly, while engineers focus their expertise on the technically challenging interactions that require deep understanding of the application's behavior.
4. Product Manager Review Without Reading Code
One of the original goals of BDD was enabling non-technical stakeholders to read and validate test scenarios. AI-generated Gherkin actually excels at this because it tends to produce scenarios in natural, readable language without technical jargon. A product manager can read "Given a user has items in their cart / When they apply a 20% discount code / Then the total should reflect the discounted price" and immediately verify whether this matches the intended business behavior.
The key is separating the Gherkin layer (which product managers review) from the step definition layer (which engineers maintain). AI generation makes this separation more natural because the scenarios are generated from business requirements in business language. Product managers can review the generated scenarios in a pull request or a dedicated review tool, approve the business logic coverage, and leave the technical implementation details to the engineering team.
Some teams take this further by generating scenario summaries or coverage reports that map each user story to its generated test scenarios. This gives product managers a high-level view of what is being tested without requiring them to read individual feature files. The report shows which acceptance criteria have corresponding scenarios, which edge cases are covered, and where gaps exist.
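The core of such a report is a simple mapping from stories to their generated scenarios, flagging stories with no coverage. A minimal sketch, with hypothetical data shapes (a real tool would parse feature files and ticket metadata):

```typescript
interface Scenario { name: string; storyId: string }

interface StoryCoverage {
  storyId: string;
  scenarios: string[];
  gap: boolean; // true when no scenario references this story
}

function coverageReport(storyIds: string[], scenarios: Scenario[]): StoryCoverage[] {
  const byStory = new Map<string, string[]>();
  for (const id of storyIds) byStory.set(id, []);
  for (const s of scenarios) byStory.get(s.storyId)?.push(s.name);
  return storyIds.map((id) => ({
    storyId: id,
    scenarios: byStory.get(id)!,
    gap: byStory.get(id)!.length === 0,
  }));
}

const report = coverageReport(
  ["US-101", "US-102"],
  [{ name: "apply valid discount code", storyId: "US-101" }],
);
```

A product manager reading this report sees immediately that US-102 has no scenarios at all, without opening a single feature file.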
The risk is that product managers approve scenarios that read well but test the wrong thing. An AI might generate a scenario that verifies the discount code input field exists and accepts text, but miss the actual business rule about how discounts compound with other promotions. This is where human review remains essential: the product manager must verify that the scenarios capture the intent of the requirement, not just the surface-level interaction.
5. Step Definition Quality and Maintenance
AI-generated step definitions face the same maintenance challenges as any generated code. They tend to be more verbose than hand-written definitions, with less reuse across scenarios. A human BDD practitioner would write a generic "the user logs in" step that accepts parameters and works across all scenarios. AI often generates slightly different login step definitions for each feature file, creating duplication that accumulates over time.
The solution is to treat AI-generated step definitions as a first draft that gets refactored into a shared step library. After generating scenarios for several features, an engineer reviews the step definitions, identifies common patterns, and consolidates them into reusable steps. This refactoring pass typically reduces the step definition codebase by 30% to 50% while making it more maintainable.
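The refactoring move is usually to replace several near-identical generated steps with one parameterized pattern. A sketch, where the `login` helper and role names are illustrative assumptions:

```typescript
const sessions: string[] = [];
function login(role: string): void { sessions.push(role); }

// Before consolidation, generation tends toward one step per feature file:
//   "the admin logs in"    -> a step that calls login("admin")
//   "the customer logs in" -> a nearly identical step for "customer"

// After: a single parameterized pattern covers every role.
const LOGIN_STEP = /^the (\w+) logs in$/;

function runLoginStep(text: string): boolean {
  const m = text.match(LOGIN_STEP);
  if (!m) return false;
  login(m[1]); // the captured role becomes the step parameter
  return true;
}

runLoginStep("the admin logs in");
runLoginStep("the customer logs in");
```

One pattern now serves every scenario that logs a user in, which is exactly the duplication-reducing refactor described above.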
Tools like Assrt help with this by analyzing your existing step library before generating new scenarios. Instead of creating duplicate steps, the AI maps new scenario requirements to existing step definitions where possible and only generates new steps for genuinely new interactions. This context-aware generation produces output that integrates cleanly with your existing BDD infrastructure rather than creating a parallel set of definitions.
6. A Practical AI-Assisted BDD Workflow
The most effective workflow combines AI speed with human judgment. Start by feeding your user stories or requirements into an AI tool that generates Gherkin scenarios. Review the generated scenarios with your product manager to validate business intent. Have an engineer refine the step definitions, consolidate shared steps, and handle complex interactions. Run the scenarios against your application to verify they pass, then integrate them into your CI pipeline.
For teams using Assrt, the workflow begins with application discovery. Run `npx @m13v/assrt discover https://your-app.com` to let the AI crawl your application and identify testable flows. The discovered flows can be output as Gherkin scenarios with corresponding Playwright step definitions. Because Assrt has seen the actual application (not just the requirements), the generated scenarios reference real UI elements and navigation paths.
The key principle is that AI handles the volume and humans handle the judgment. AI generates the 80% of scenarios that are straightforward, freeing your BDD practitioners to focus on the 20% that require domain expertise, edge case knowledge, or nuanced understanding of business rules. This division of labor makes BDD practical for teams that previously found the overhead too high, while maintaining the quality that makes BDD valuable in the first place.
Ready to automate your testing?
Assrt discovers test scenarios, writes Playwright tests from plain English, and self-heals when your UI changes.