
AI-Generated Tests: Review, Ownership, and Quality Control

Letting ChatGPT, Copilot, or dedicated tools write your Playwright tests is fast. But generated code without review is a liability. This guide covers how to evaluate AI-generated tests, what to modify before committing, and how to build a sustainable review process that treats generated tests as first-class code.


1. Why Generated Tests Need Review

AI-generated tests have a unique failure mode: they look correct but test the wrong thing. A generated test that navigates to the dashboard, clicks a few buttons, and asserts that the page title contains "Dashboard" appears to provide coverage but actually verifies almost nothing. It would pass even if the dashboard showed completely wrong data, because the assertion only checks the page title.

This happens because AI models generate tests by pattern matching against training data, not by understanding your application's business logic. The model knows how to write a Playwright test that navigates and asserts, but it does not know what your application should do. It generates structurally valid tests with semantically weak assertions. The test structure is correct; the verification is shallow.

Review transforms generated tests from plausible automation into genuine verification. A human reviewer who understands the application can add business-logic assertions, remove redundant steps, combine related scenarios, and catch incorrect assumptions. The goal is not to rewrite every generated test but to elevate each test from "it runs" to "it catches real bugs."

2. What to Check in AI-Generated Playwright Tests

Start with assertions. The most common weakness in generated tests is insufficient or wrong assertions. Check that each test verifies the outcome, not just the navigation. A checkout test should assert that the order confirmation contains the correct product, quantity, and total, not just that a "success" page loaded. Add assertions for data correctness wherever the generated test only checks for element visibility.
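To make the contrast concrete, here is a sketch of a shallow generated assertion next to a strengthened one. The URL, locators, and expected values are hypothetical, and running it requires a Playwright project:

```typescript
import { test, expect } from '@playwright/test';

test('checkout confirmation shows correct order details', async ({ page }) => {
  await page.goto('/checkout/confirmation');

  // Weak (typical AI output): passes even if the order data is wrong.
  await expect(page).toHaveTitle(/Confirmation/);

  // Stronger: verify the business outcome, not just the navigation.
  await expect(page.getByTestId('order-product')).toHaveText('Blue T-Shirt');
  await expect(page.getByTestId('order-quantity')).toHaveText('2');
  await expect(page.getByTestId('order-total')).toHaveText('$39.98');
});
```

The weak assertion only proves a page loaded; the three added assertions prove the order itself is right.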

Next, check selectors. AI-generated selectors range from excellent (getByRole with accessible names) to terrible (deeply nested CSS paths with generated class names). Replace fragile selectors with Playwright's built-in resilient locators. Look for hard-coded values that should be variables: if the test types a specific email address, make it a configurable fixture value so each test run uses unique data.
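One common fix is replacing a hard-coded email with a per-run unique value. A minimal sketch, assuming nothing about any particular tool (the helper name and domain are illustrative):

```typescript
// Generate a unique email per test run so repeated runs don't collide
// on the same account. Plus-addressing keeps the local part readable.
function uniqueEmail(prefix: string): string {
  const runId = Date.now().toString(36) + Math.random().toString(36).slice(2, 6);
  return `${prefix}+${runId}@example.test`;
}

// Instead of: await page.fill('#email', 'test@example.com')
// use:        await page.fill('#email', uniqueEmail('checkout'))
```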

Check for hard-coded waits. Generated tests sometimes include page.waitForTimeout(2000) instead of waiting for specific conditions. Replace these with Playwright's auto-waiting assertions: await expect(locator).toBeVisible() waits for the element to appear without an arbitrary delay. Also check for missing error handling: what happens if a navigation fails or an element is not found? Generated tests often assume the happy path exclusively.
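The principle behind auto-waiting, polling for a condition instead of sleeping for a fixed time, can be sketched in plain TypeScript. This illustrates the idea only; it is not Playwright's implementation:

```typescript
// Poll until cond() is true, failing after timeoutMs. This is what
// replaces a fixed waitForTimeout: it returns as soon as the condition
// holds instead of always paying the full delay.
async function waitForCondition(
  cond: () => boolean,
  timeoutMs = 2000,
  intervalMs = 50,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!cond()) {
    if (Date.now() > deadline) {
      throw new Error(`condition not met within ${timeoutMs}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A fixed wait always costs the full two seconds and still fails on a slow day; a condition wait costs only as long as the condition takes.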


3. When to Accept, Modify, or Reject

Accept a generated test as-is when it uses resilient selectors, has meaningful assertions, and tests a genuine user flow. This is common for straightforward pages: landing pages, static content, simple navigation flows. The test adds coverage without requiring modification because the page itself is simple.

Modify a generated test when the structure is correct but the details need improvement. This is the most common outcome. Keep the test's navigation and interaction steps but strengthen the assertions, improve the selectors, and add data setup. Most generated tests from tools like Assrt fall into this category because the tool discovers real scenarios and generates correct interaction sequences, while the business logic assertions need human refinement.

Reject a generated test when it tests something irrelevant, duplicates an existing test, or requires so much modification that writing from scratch would be faster. Generated tests sometimes cover trivial scenarios (verifying that a footer link exists) at the expense of important ones (verifying that the checkout calculates tax correctly). Reject the trivial test and use the time saved to write or improve the important one.

4. Establishing Ownership of Generated Tests

Generated tests need owners just like hand-written tests. Without ownership, generated tests become abandoned code that nobody maintains, understands, or trusts. The worst outcome is a test suite full of generated tests that nobody has reviewed, running in CI and occasionally failing with nobody responsible for investigating.

Assign ownership at generation time. When you generate tests for the checkout flow, assign them to the team that owns the checkout code. Use CODEOWNERS files or a similar mechanism to ensure that changes to these tests are reviewed by the owning team. The owning team is responsible for reviewing the generated tests before they merge, maintaining them when the feature changes, and investigating failures.
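As an illustration, a CODEOWNERS entry routing generated checkout tests to the owning team might look like this. The directory layout and team name are placeholders for your own structure:

```
# Generated checkout tests are reviewed and maintained by the checkout team.
/tests/generated/checkout/  @your-org/checkout-team
```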

Treat generated tests the same as generated code in other contexts. When you use Copilot to generate application code, you review it before committing. Apply the same standard to generated test code. The AI is a drafting tool, not an author. The person who reviews and commits the test owns it, regardless of whether a human or an AI wrote the first draft.

5. Building a Sustainable Review Workflow

For teams that generate tests regularly (after each sprint, after major UI changes, or on a continuous schedule), a structured review workflow prevents generated tests from accumulating unreviewed. Create a dedicated branch or PR for generated tests. Run them in CI to verify they pass before review. Then have the feature team review the tests in the same way they review any other PR.

Batch reviews by feature area. If Assrt generates 20 tests across your application, group them by feature: 5 for authentication, 8 for dashboard, 7 for settings. Send each group to the owning team for review. This is more efficient than reviewing all 20 tests as a single undifferentiated batch because each reviewer has the context needed to evaluate the tests in their area.
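Grouping a batch of generated tests by feature before review is a simple partition. A sketch, where the test shape and its feature field are assumptions about your own metadata:

```typescript
interface GeneratedTest {
  file: string;
  feature: string; // e.g. "authentication", "dashboard", "settings"
}

// Partition a flat batch of generated tests into per-feature groups
// so each owning team reviews only its own area.
function groupByFeature(tests: GeneratedTest[]): Map<string, GeneratedTest[]> {
  const groups = new Map<string, GeneratedTest[]>();
  for (const t of tests) {
    const bucket = groups.get(t.feature) ?? [];
    bucket.push(t);
    groups.set(t.feature, bucket);
  }
  return groups;
}
```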

Track review metrics. Measure the acceptance rate (what percentage of generated tests are committed without modification), the modification rate (what percentage need changes), and the rejection rate (what percentage are discarded). These metrics tell you whether the generation tool is improving over time and where it needs configuration adjustments. A healthy workflow has a high acceptance rate for simple pages and a moderate modification rate for complex flows, with a low rejection rate overall.
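The three rates are straightforward to compute from per-test review outcomes. A minimal sketch; the outcome labels are assumptions, not any tool's schema:

```typescript
type ReviewOutcome = 'accepted' | 'modified' | 'rejected';

// Compute acceptance, modification, and rejection rates as fractions
// of the whole reviewed batch.
function reviewRates(outcomes: ReviewOutcome[]) {
  const rate = (o: ReviewOutcome) =>
    outcomes.filter((x) => x === o).length / outcomes.length;
  return {
    acceptance: rate('accepted'),
    modification: rate('modified'),
    rejection: rate('rejected'),
  };
}
```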
