How to Verify AI-Generated Tests Actually Catch Bugs

AI test generators can produce tests that pass because they encode the same assumptions as the code, not because the code is correct. This is the oracle problem, and solving it is essential for trustworthy automation.

1. The Oracle Problem: When Tests and Code Share the Same Blind Spots

When an AI generates both your code and your tests, a subtle failure mode emerges. The AI reads the implementation, infers what “correct” means from the code itself, and writes assertions that confirm the code does what it does. The tests pass. Coverage looks great. But you have not actually verified anything meaningful, because the tests and the code share the same understanding of correct behavior.

This is the oracle problem: without an independent source of truth about what the software should do, your tests can only verify that the software does what it currently does. If the AI misunderstood a requirement, or if the implementation has a logic error, the generated tests will happily confirm that the wrong behavior is correct.

Consider a concrete example. You ask an AI to implement a function that calculates shipping costs with a 10% discount for orders over $100. The AI writes the function, but it applies the discount to the shipping cost instead of the order total threshold. Then it generates tests that verify the (incorrect) discount logic. Every test passes. The bug only surfaces when a customer complains that their $150 order did not get the expected discount.
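A minimal sketch of this failure mode (function names and values are hypothetical): the implementation checks the shipping cost against the $100 threshold instead of the order total, and the generated test simply mirrors that mistake.

```javascript
// Hypothetical buggy implementation: the threshold check uses the shipping
// cost rather than the order total, so the discount almost never triggers.
function shippingCost(orderTotal, baseShipping) {
  if (baseShipping > 100) { // BUG: should be orderTotal > 100
    return baseShipping * 0.9;
  }
  return baseShipping;
}

// AI-generated test derived from the code above: it encodes the same wrong
// assumption, so it passes even though the requirement is violated.
function testShippingCost() {
  // Per the spec, a $150 order with $10 shipping should cost $9 to ship,
  // but this assertion expects full price because it mirrors the bug.
  console.assert(shippingCost(150, 10) === 10, "passes; bug undetected");
}
testShippingCost();
```

A test written from the requirement instead, expecting `shippingCost(150, 10)` to return 9, would fail immediately and expose the bug.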

The oracle problem is not new. It has existed in software testing for decades. But AI test generation makes it dramatically worse because it removes the natural separation between “person who wrote the code” and “person who wrote the tests.” When a human writes tests, they bring a different perspective, different assumptions, and different mistakes. When the same AI writes both, the blind spots are correlated.

2. Mutation Testing: The Gold Standard for Test Verification

Mutation testing answers a critical question: if a bug were introduced into your code, would your tests catch it? The technique works by making small, deliberate changes (mutations) to your source code and running your test suite against each mutant. If a test fails, the mutant is “killed,” meaning your tests detected the change. If all tests still pass, the mutant “survived,” meaning your tests have a gap.

Stryker is the most widely used mutation testing framework for JavaScript and TypeScript projects. A basic configuration looks like this:

// stryker.config.mjs
export default {
  mutate: ["src/**/*.ts", "!src/**/*.test.ts"],
  testRunner: "jest",
  reporters: ["html", "clear-text", "progress"],
  thresholds: {
    high: 80,
    low: 60,
    break: 50
  }
};

The thresholds setting is particularly useful. Setting break: 50 means your CI pipeline will fail if fewer than 50% of mutants are killed. This gives you a hard floor for test quality that line coverage alone cannot provide.

Common mutations include replacing > with >=, flipping boolean conditions, removing function calls, and replacing return values with defaults. Each mutation represents a class of bug that your tests should detect. When a mutant survives, it tells you exactly which type of bug can slip through unnoticed.
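To make this concrete, here is an illustration (function names are hypothetical) of the first mutation in that list: `>` replaced with `>=`. Only a test that probes the exact boundary value can tell the original and the mutant apart.

```javascript
// Original: free shipping strictly above $50.
function qualifiesForFreeShipping(total) {
  return total > 50;
}

// The mutant a tool like Stryker would generate: ">" becomes ">=".
function qualifiesForFreeShippingMutant(total) {
  return total >= 50;
}

// A vague test passes against both versions, so the mutant survives:
console.assert(qualifiesForFreeShipping(100) === true);
console.assert(qualifiesForFreeShippingMutant(100) === true); // survived

// A boundary test distinguishes them: the original returns false at exactly
// 50 while the mutant returns true, so this assertion kills the mutant.
console.assert(qualifiesForFreeShipping(50) === false);
console.assert(qualifiesForFreeShippingMutant(50) === true);
```

A surviving `>=` mutant is a precise signal: add a test case at the boundary value.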

For AI-generated tests specifically, mutation testing is indispensable. Run Stryker after generating tests and you will quickly see whether those tests actually validate behavior or merely exercise code paths. A mutation score below 60% on AI-generated tests is a strong signal that the tests need human review and augmentation.

3. Coverage Metrics That Actually Matter

Line coverage is the metric most teams track, and it is also the least informative. A test that calls a function and checks that it does not throw an error “covers” every line in that function without verifying a single output. AI-generated tests routinely produce 80%+ line coverage with minimal actual validation. You need better metrics.

Branch coverage measures whether every conditional path has been exercised. If your code has an if/else, branch coverage requires tests that trigger both the true and false paths. This is a meaningful step up from line coverage because it ensures your tests exercise decision logic, not just sequential code.
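For example (hypothetical function), a single call can achieve 100% line coverage here, because every line executes when the condition is true. Branch coverage additionally demands a test for the implicit false path.

```javascript
// An if without an else: one call with isMember = true touches every line,
// but the "condition false" branch is never exercised.
function fee(amount, isMember) {
  let rate = 5; // percent
  if (isMember) {
    rate = 2;
  }
  return (amount * rate) / 100;
}

// Covers every line, but only the true branch:
console.assert(fee(100, true) === 2);

// Required for full branch coverage: the false path, where the if body is skipped.
console.assert(fee(100, false) === 5);
```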

Assertion density is an underused metric that measures the number of meaningful assertions per test. Istanbul (the coverage tool built into most JavaScript test runners) can report coverage, but you will need custom analysis to track assertion density. A healthy test suite averages 2 to 4 assertions per test case. AI-generated tests often average closer to 1, frequently just checking that a function returns without errors.
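The difference is easy to see side by side (hypothetical function and test names): the first test only proves the call does not throw, while the second makes three meaningful assertions about the result.

```javascript
function parsePrice(input) {
  const value = Number(input.replace(/[$,]/g, ""));
  return { value, currency: "USD", valid: !Number.isNaN(value) };
}

// Assertion density ~0: exercises every line, validates nothing.
function testHollow() {
  parsePrice("$1,234.50"); // passes as long as no exception is thrown
}

// Assertion density 3: each property of the result is checked.
function testMeaningful() {
  const result = parsePrice("$1,234.50");
  console.assert(result.value === 1234.5);
  console.assert(result.currency === "USD");
  console.assert(result.valid === true);
}

testHollow();
testMeaningful();
```

Both tests produce identical line and branch coverage; only assertion density (or a mutation run) tells them apart.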

Mutation score (discussed in section 2) is the most reliable single metric for test quality. Unlike coverage, it directly measures whether your tests can detect faults. Combine mutation score with branch coverage for a two-dimensional view of test effectiveness: branch coverage tells you what code your tests exercise, and mutation score tells you whether the exercises are meaningful.

Track these metrics over time, not just at a point in time. A declining mutation score after a sprint of AI-generated code is a leading indicator that test quality is slipping. Set up dashboard alerts for drops and treat them like you would treat a drop in uptime.

4. Independent Validation Patterns

The core fix for the oracle problem is independence: your tests need to derive their expectations from a source other than the code under test. Several practical patterns achieve this.

Specification-driven testing. Write test expectations from product requirements, API contracts, or design documents before generating implementation code. When the AI writes tests, provide it with the specification rather than the source code. This breaks the circular dependency between implementation and verification.
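A sketch of the pattern (the spec text, function name, and cases are hypothetical): the expected values are computed by hand from the written requirement before any implementation exists, so an implementation that misreads the spec cannot satisfy them.

```javascript
// Spec (written before the code): "Orders over $100 receive a 10% discount
// on the order total. Orders of $100 or less pay full price."

// Expectations derived from the spec by hand, not from the implementation:
const specCases = [
  { orderTotal: 150, expected: 135 }, // 150 minus 10% = 135
  { orderTotal: 100, expected: 100 }, // boundary: $100 is not "over $100"
  { orderTotal: 50, expected: 50 },
];

// A reference implementation written afterward, to be judged by the cases above:
function totalDue(orderTotal) {
  return orderTotal > 100 ? orderTotal * 0.9 : orderTotal;
}

for (const { orderTotal, expected } of specCases) {
  console.assert(
    totalDue(orderTotal) === expected,
    `totalDue(${orderTotal}) should be ${expected}`
  );
}
```

If the AI had produced the buggy variant from section 1, these spec-derived cases would fail, which is exactly the independence the oracle problem demands.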

Cross-tool validation. Use a different AI model (or a different tool entirely) to generate tests than the one that generated the code. If Claude wrote your implementation, have a different system generate test cases. The errors will be uncorrelated, increasing the chance that one catches what the other misses.

E2E tests from user behavior. Tools like Assrt that generate real Playwright code you can inspect take a different approach entirely. Instead of deriving tests from source code, they model tests on actual user workflows and visible application behavior. Because the test generation starts from the UI rather than the implementation, the resulting tests provide genuinely independent validation. You can read every line of the generated Playwright test, modify it, and run it in your own CI pipeline with no vendor lock-in.

Property-based testing. Define invariants that should always hold regardless of input: “sorting a list twice produces the same result as sorting once,” “serializing then deserializing returns the original value.” Frameworks like fast-check generate thousands of random inputs to test these properties. Property-based tests are inherently independent because they test universal truths about your system rather than specific input/output pairs.
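The sort-idempotence invariant can be sketched by hand with plain random inputs (fast-check automates this, and adds input shrinking to minimize failing cases; the helper names below are hypothetical):

```javascript
// Property under test: sorting is idempotent, i.e. sorting a sorted list
// changes nothing. This must hold for every input, not just chosen examples.
function sortNumbers(xs) {
  return [...xs].sort((a, b) => a - b);
}

// Naive random input generator (a property-based framework does this better).
function randomArray() {
  const len = Math.floor(Math.random() * 20);
  return Array.from({ length: len }, () => Math.floor(Math.random() * 200) - 100);
}

for (let i = 0; i < 1000; i++) {
  const input = randomArray();
  const once = sortNumbers(input);
  const twice = sortNumbers(once);
  console.assert(
    JSON.stringify(once) === JSON.stringify(twice),
    `idempotence violated for ${JSON.stringify(input)}`
  );
}
```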

5. Building a Trustworthy Test Pipeline

Combining these techniques into a practical CI/CD pipeline requires deliberate layering. Here is a pattern that works well for teams using AI test generation in production.

Layer 1: Generated tests with guardrails. Let AI generate your initial test suite, but enforce minimum quality thresholds. Require branch coverage above 80%, assertion density above 2 per test, and a Stryker mutation score above 60%. Tests that do not meet these thresholds get flagged for human review before merging.

Layer 2: Mutation testing gate. Add Stryker as a CI step that runs on every pull request. Configure it to focus on changed files only (using the --incremental flag) to keep run times manageable. Block merges when the mutation score for new code drops below your threshold. This single gate catches the majority of hollow AI-generated tests.

Layer 3: Independent E2E validation. Maintain a separate suite of end-to-end Playwright tests that exercise critical user flows. These tests should be authored (or at minimum reviewed) independently from the unit tests. Run them on every deploy to staging. Because they interact with the real application through the browser, they catch integration bugs, rendering issues, and state management problems that unit tests cannot reach.

Layer 4: Production feedback loop. When a bug reaches production, trace it back through your test layers. Why did the unit tests miss it? What mutation would have caught it? Could an E2E test have detected it? Use every escaped bug as an opportunity to strengthen your test quality thresholds. Over time, this feedback loop calibrates your pipeline to catch the specific classes of bugs that AI-generated tests tend to miss.
