QA Strategy

AI Test Generation for QA Teams: Cutting the Boilerplate

QA engineers spend too much time writing and maintaining E2E test scripts. AI can handle the boilerplate so your team can focus on what actually matters: test strategy and coverage.


1. The dev/QA testing split and why E2E is different

In most organizations, testing responsibility splits along a clear line. Developers write unit tests and integration tests for the code they own. These tests run fast, live alongside the source code, and verify that individual functions and modules behave correctly in isolation. QA teams own end-to-end tests that verify the application works correctly from the user's perspective, across multiple services, through the actual UI.

This split makes sense in theory. Developers understand their code best and can write targeted unit tests efficiently. QA engineers understand user workflows best and can design comprehensive E2E scenarios. The problem is that the tooling and maintenance burden is radically different on each side.

Unit tests are relatively cheap to maintain. When a function signature changes, the compiler catches the broken tests immediately. The feedback loop is seconds. E2E tests, by contrast, interact with a running application through the browser. They depend on DOM structure, CSS selectors, network timing, and application state that can change without any explicit API contract. When a developer renames a CSS class or restructures a component, E2E tests break silently. The feedback loop is minutes to hours, often in a CI pipeline that the QA engineer did not trigger.

This asymmetry creates a persistent frustration for QA teams. They spend more time fixing broken selectors and flaky waits than they spend designing new test scenarios. The test suite becomes a maintenance liability rather than a quality asset.

2. Why E2E maintenance is so painful

The root cause of E2E maintenance pain is selector fragility. A typical E2E test locates elements using CSS selectors, XPath expressions, or data-testid attributes. Each approach has tradeoffs, but all share a common problem: they encode assumptions about the DOM structure that the application does not guarantee.

Consider a test that clicks a "Submit" button using the selector .form-container > .actions > button.primary. If a developer wraps the button group in a new div for layout purposes, the selector breaks. If they rename the CSS class from primary to btn-primary, the selector breaks. None of these changes affect the application's functionality, but they all break the tests.

Data-testid attributes are more resilient because they create an explicit contract between the application and the tests. But adding testids requires developer cooperation, and most teams cannot retroactively add them to every interactive element in a large application. There is also a philosophical debate about whether test infrastructure should pollute production markup.
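To make the tradeoffs concrete, here is the same click written three ways as a Playwright sketch. The page URL, the DOM structure, and the data-testid value are all illustrative, not taken from a real application:

```typescript
import { test } from '@playwright/test';

test('submit the form', async ({ page }) => {
  await page.goto('https://your-app.example/form'); // hypothetical URL

  // 1. Structural CSS selector: breaks if a wrapper div is added
  //    or the class is renamed from "primary" to "btn-primary".
  await page.locator('.form-container > .actions > button.primary').click();

  // 2. Role-based locator: survives DOM restructuring; breaks only
  //    if the visible label changes.
  // await page.getByRole('button', { name: 'Submit' }).click();

  // 3. Test id: an explicit contract with the application, but it
  //    requires developers to add data-testid="submit" to the markup.
  // await page.getByTestId('submit').click();
});
```

In practice a test uses only one of these strategies per element; the commented alternatives show what the more resilient options look like.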

Flaky tests compound the maintenance problem. An E2E test that passes 95% of the time and fails 5% of the time due to timing issues is worse than a test that consistently fails, because it erodes trust in the test suite. Teams start ignoring failures, re-running pipelines until they pass, and eventually disabling flaky tests entirely. Engineers at Google and Microsoft have both published research on flaky tests, identifying them as a leading reason teams scale back their E2E investment.

The maintenance burden scales non-linearly. A suite of 50 E2E tests is manageable. A suite of 500 E2E tests requires a dedicated team just to keep them green. This is the pain point where AI test generation offers the most value.


3. How AI generates E2E tests today

AI test generation tools fall into three categories based on their approach. The first category uses large language models to convert natural language descriptions into test code. You write "test that a user can log in with valid credentials" and the tool generates a Playwright or Cypress test. This is essentially code generation with testing-specific prompts and context.
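For the login prompt above, the generated output from a tool in this first category might look something like the following Playwright sketch. The URL, field labels, and post-login assertion are hypothetical, standing in for whatever the tool infers from your application:

```typescript
import { test, expect } from '@playwright/test';

test('user can log in with valid credentials', async ({ page }) => {
  await page.goto('https://your-app.example/login');        // hypothetical URL
  await page.getByLabel('Email').fill('user@example.com');  // hypothetical labels
  await page.getByLabel('Password').fill('correct-horse');
  await page.getByRole('button', { name: 'Log in' }).click();

  // A meaningful post-condition: not just "the page loaded",
  // but "the logged-in view actually rendered".
  await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
});
```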

The second category uses crawling and exploration to automatically discover test scenarios. The tool visits your application, maps out the navigation structure, identifies interactive elements, and generates tests that exercise each workflow. Assrt takes this approach: you run assrt discover https://your-app.com and it crawls your application, identifies key user flows, and generates Playwright tests for each one. The output is standard test files that you can inspect, modify, and commit to your repository.

The third category uses recording and replay with AI-enhanced healing. You perform actions in the browser, the tool records them, and then it uses AI to maintain the selectors when the UI changes. Some commercial tools in this space include Testim and Mabl.

A critical distinction between these tools is the output format. Some tools generate proprietary YAML or JSON test definitions that can only run inside their platform. Momentic, for example, uses a proprietary YAML format that requires their runtime. Others, like Assrt, generate standard Playwright test files that run anywhere Node.js runs. The portability of the output matters enormously because it determines your vendor lock-in risk.

None of these tools replace QA judgment. They accelerate the mechanical parts of test creation (writing selectors, structuring test files, handling setup/teardown) while leaving the strategic decisions to humans. Which scenarios to test, what edge cases matter, and how to structure the test suite for maintainability are still human responsibilities.

4. The right human + AI workflow

The most effective workflow treats AI-generated tests as a first draft, not a finished product. Here is a practical process that teams have found sustainable.

Start by letting the AI tool generate a baseline test suite. For a web application, this might mean running a crawler-based tool to discover all major user flows and generate tests for each one. This gives you broad coverage quickly. For a typical SaaS application with 20 to 30 major workflows, you can go from zero E2E tests to a working suite in an afternoon.

Next, have a QA engineer review each generated test. They should evaluate three things: Does the test cover the right scenario? Are the assertions meaningful (not just checking that the page loads, but verifying the correct data appears)? Is the test structured for maintainability (using page objects or fixtures where appropriate)?
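For the second question, the difference between a weak and a meaningful assertion might look like this sketch. The order-detail page, the test ids, and the expected values are invented for illustration:

```typescript
import { expect, Page } from '@playwright/test';

// Weak: only proves that navigation happened, not that the right data rendered.
async function weakCheck(page: Page) {
  await expect(page).toHaveURL(/\/orders\/\d+/);
}

// Meaningful: verifies the specific business data the user should see.
async function meaningfulCheck(page: Page) {
  await expect(page.getByTestId('order-status')).toHaveText('Shipped');
  await expect(page.getByTestId('order-total')).toHaveText('$149.00');
}
```

Generated tests tend toward the first form; the reviewer's job is to push them toward the second.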

The QA engineer then enhances the generated tests with domain knowledge. They add edge case scenarios that the AI missed, strengthen assertions to check specific business rules, and organize the suite into logical groups. This is the high-value work that only a human with business context can do.

Finally, use AI for ongoing maintenance. When a UI change breaks a test, self-healing tools can automatically update the selector. When a new feature ships, regenerate tests for the affected workflows and merge them into the existing suite. The QA engineer reviews the changes, not the code. This flips the traditional ratio: instead of spending 80% of time on maintenance and 20% on new test design, the team spends 80% on strategy and 20% reviewing AI-generated updates.

5. Evaluating AI testing tools

When evaluating AI test generation tools for your QA team, focus on these criteria rather than feature checklists.

Output portability. Can you run the generated tests outside the vendor's platform? If the tool generates standard Playwright or Cypress files, you own the output. If it generates proprietary formats, you are locked in. Assrt generates standard Playwright files. QA Wolf, which costs approximately $7,500 per month, provides a managed service with tests that run on their infrastructure. Understand what you are buying and what you own.

Self-healing quality. Self-healing sounds great in demos but varies enormously in practice. Test the tool against realistic UI changes: rename a CSS class, restructure a component, change a button label. Does the healer correctly identify the same element, or does it click something else? False healing (updating a selector to target the wrong element) is worse than a test failure because it silently reduces coverage.
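To see why false healing happens, consider the kind of attribute-overlap scoring a naive healer might use. This is an illustrative sketch, not how any particular product works: the healer scores candidate elements by how many attributes they share with the element the broken selector used to match, and the weighting determines whether it picks the right one.

```typescript
// Naive selector-healing sketch: score candidates by attribute overlap
// with a snapshot of the element the broken selector used to match.
type Snapshot = { tag: string; classes: string[]; text: string };

function healScore(original: Snapshot, candidate: Snapshot): number {
  let score = 0;
  if (original.tag === candidate.tag) score += 1;
  for (const c of original.classes) {
    if (candidate.classes.includes(c)) score += 1;
  }
  if (original.text === candidate.text) score += 2; // text weighted higher
  return score;
}

function heal(original: Snapshot, candidates: Snapshot[]): Snapshot {
  // Pick the highest-scoring candidate; ties resolve to the earlier one.
  return candidates.reduce((best, c) =>
    healScore(original, c) > healScore(original, best) ? c : best
  );
}

// The element the test used to click:
const original: Snapshot = { tag: 'button', classes: ['primary'], text: 'Submit' };

// After a refactor renamed "primary" to "btn-primary":
const renamedSubmit: Snapshot = { tag: 'button', classes: ['btn-primary'], text: 'Submit' };
// A different button that kept the old class:
const deleteButton: Snapshot = { tag: 'button', classes: ['primary'], text: 'Delete' };

// The text match (+2) outweighs the surviving class (+1), so healing picks
// the right element here -- but drop the text signal from the score and it
// would silently "heal" to the Delete button instead.
console.log(heal(original, [deleteButton, renamedSubmit]).text); // → Submit
```

Realistic evaluation means constructing exactly these ambiguous cases in your own application and checking which element the healer lands on.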

Cost structure. Some tools charge per test execution, some per seat, some per test case. For a QA team that runs hundreds of tests multiple times per day, per-execution pricing can become expensive quickly. Open-source tools like Assrt (MIT licensed, free) eliminate this concern entirely. The tradeoff is that you manage the infrastructure yourself, but for teams already running Playwright in CI, the incremental setup is minimal.

CI integration. The tool must integrate cleanly with your existing CI pipeline. Tools that require a separate cloud environment or custom runners add complexity. Prefer tools that produce standard test artifacts (JUnit XML, Playwright HTML reports) that your CI system already knows how to display and track.
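As a concrete example of standard artifacts, Playwright can emit both JUnit XML and its own HTML report from configuration alone. This is a minimal sketch of a playwright.config.ts; the output paths are arbitrary choices, not defaults:

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  reporter: [
    ['junit', { outputFile: 'results/junit.xml' }],  // consumed by most CI systems
    ['html', { outputFolder: 'playwright-report' }], // human-readable run report
    ['list'],                                        // console output during the run
  ],
});
```

A tool whose output plugs into a config like this inherits your existing CI reporting for free.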

Learning curve. Your QA team needs to be productive with the tool within a week, not a quarter. Evaluate how quickly a QA engineer with Playwright experience can start generating, customizing, and maintaining tests. A tool that requires extensive training or a specialist to operate defeats the purpose of reducing maintenance burden.
