AI Test Migration Failure Patterns: Why Plausible Code Breaks at Scale
A developer recently shared the results of running 764 Claude sessions to migrate 98 Rails models from RSpec to Minitest. Of those 98, 21 required human intervention. The failure mode was not "AI writes bad code." It was "AI writes plausible code that passes locally but breaks assumptions elsewhere." This distinction matters enormously when you are working at batch scale.
“At batch scale the failure mode isn't 'AI writes bad code,' it's 'AI writes plausible code that passes locally but breaks assumptions elsewhere.'”
Developer running 764 AI migration sessions
1. The "Plausible but Wrong" Problem
When an AI model generates test code, it produces syntactically correct, idiomatically reasonable output that looks like something a competent developer would write. This is precisely what makes it dangerous at scale. A test that is obviously broken gets caught immediately. A test that looks correct, runs correctly in isolation, and only fails under specific conditions can survive code review, pass local CI, and sit in your codebase for weeks before detonating.
The 764-session migration experiment revealed a taxonomy of failures. Some were straightforward syntax issues that the AI self-corrected on retry. But the 21 cases requiring human intervention shared a common trait: the AI made reasonable assumptions about the codebase that happened to be wrong. It assumed a factory would produce a valid record. It assumed a callback would fire synchronously. It assumed a database column had a default value. Each assumption was individually plausible. Each was also incorrect in the specific context of that codebase.
This is fundamentally different from the typical "AI hallucination" problem. The AI is not inventing nonexistent APIs or fabricating method signatures. It is making the same kind of mistakes a new developer would make on their first week, writing code that follows conventions but misses the unwritten rules that veteran team members carry in their heads.
2. Why AI-Generated Tests Pass Locally but Fail in CI
The most common failure pattern in AI test migration is the test that passes on the developer's machine but fails in CI. This happens for several interconnected reasons, and understanding them is essential to building a reliable migration pipeline.
First, local environments accumulate state. Your development database has seed data, cached records, and residual state from previous test runs. When an AI generates a test that creates a user and then queries for "the first user," it works locally because the local data happens to line up with that assumption. In CI, where tests run in parallel or where seed data differs, that query returns a different record entirely.
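The brittleness of a "first record" query can be reproduced without a database. In this minimal sketch, a plain Ruby array stands in for the users table, and the seed record and email addresses are illustrative:

```ruby
# Sketch: why "fetch the first record" assertions are environment-dependent.
# The array stands in for the users table; the data is illustrative.

def first_user(users)
  users.first # what the AI-generated test effectively asserts against
end

def user_by_email(users, email)
  users.find { |u| u[:email] == email } # explicit lookup, order-independent
end

# Local run: the test's own record is the only one, so both lookups agree.
local_db = [{ email: "test@example.com" }]

# CI run: seed data (or a parallel test's record) comes first,
# and first_user now returns the wrong record.
ci_db = [{ email: "seed@example.com" }, { email: "test@example.com" }]
```

The fix is the same in real ActiveRecord code: hold a reference to the record the test created (or look it up by a unique attribute) instead of reaching for `first`.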
Second, timing dependencies hide in local execution. Your local machine runs tests sequentially on fast hardware. CI runners are slower, often running on shared infrastructure with variable performance. An AI-generated test that asserts a background job completes within 100 milliseconds works locally but times out in CI because the runner is handling three other builds simultaneously.
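The remedy is to make the assertion deterministic rather than time-bound. This sketch uses a hypothetical in-memory queue; in a real Rails suite, ActiveJob's `perform_enqueued_jobs` test helper plays the same role:

```ruby
# Sketch: replace "completes within N ms" with "drain the queue, then assert".
# FakeQueue is illustrative; in Rails, ActiveJob::TestHelper's
# perform_enqueued_jobs gives you the same synchronous guarantee.
class FakeQueue
  def initialize
    @jobs = []
  end

  def enqueue(&job)
    @jobs << job
  end

  # Runs every pending job synchronously: deterministic on any hardware,
  # no matter how loaded the CI runner is.
  def drain
    @jobs.shift.call until @jobs.empty?
  end
end
```

A test then enqueues the work, drains the queue, and asserts on the result, with no wall-clock deadline anywhere.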
Third, environment configuration varies. AI models generate tests based on the code they can see, but they cannot see environment variables, CI-specific configurations, or infrastructure differences. A test that relies on a local Redis instance, a specific timezone setting, or a file path that exists on macOS but not on the Ubuntu CI runner will pass everywhere it was checked and break only when it reaches CI.
3. Shared Fixture Conflicts and State Pollution
Fixtures are the most fertile ground for AI migration failures. In a typical Rails application, fixtures are shared across the entire test suite. They define baseline data that every test can reference. When an AI migrates tests from RSpec (which uses factories and database transactions) to Minitest (which traditionally uses fixtures), it has to translate between two fundamentally different approaches to test data management.
The AI typically handles this by generating factory-style setup blocks that create records at the start of each test. This works until two AI-generated tests create records with the same unique constraint. Test A creates a user with email "test@example.com." Test B, generated in a completely separate session, also creates a user with email "test@example.com." Run them individually and both pass. Run them together and one fails with a uniqueness violation.
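One defense is to never hard-code values covered by unique constraints. This sketch uses a process-wide counter, the same idea as FactoryBot's `sequence`; the module and method names are illustrative:

```ruby
# Sketch: collision-free test data via a process-wide sequence,
# the same idea as FactoryBot's `sequence`. Names are illustrative.
module UniqueData
  @counter = 0

  # Each call returns a distinct email, so two independently generated
  # tests can never collide on a uniqueness constraint.
  def self.email(prefix = "user")
    @counter += 1
    "#{prefix}-#{@counter}@example.com"
  end
end
```

Within a single process this guarantees uniqueness; parallel CI workers additionally need per-worker databases or a worker-id component in the prefix.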
State pollution is the subtler variant. Test A modifies a global configuration value and does not clean it up. Test B assumes the default configuration. When they run in the order A then B, test B fails. When they run in the order B then A, both pass. The AI has no way to know about this interaction because it generated each test independently, with no visibility into the shared mutable state they both touch.
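The standard cure is to restore shared state unconditionally, even when the test raises. A minimal sketch, where `AppConfig` is a stand-in for whatever global setting a test touches:

```ruby
# Sketch: restore shared mutable state after every test.
# AppConfig is a stand-in for any global configuration a test might touch.
class AppConfig
  class << self
    attr_accessor :currency
  end
  self.currency = "USD"
end

# Wrap the mutation so cleanup runs even if the test body raises.
def with_config(value)
  previous = AppConfig.currency
  AppConfig.currency = value
  yield
ensure
  AppConfig.currency = previous
end
```

In Minitest the same pattern lives in `setup`/`teardown`: record the old value before mutating, restore it in `teardown`, and the A-then-B ordering trap disappears.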
The fix requires understanding the full test suite as a system, not as a collection of individual files. Each test must be isolated, must clean up after itself, and must not depend on any state created by other tests. Achieving this automatically requires tooling that runs the entire suite in randomized order and identifies tests that fail only in certain sequences.
4. Implicit Ordering Dependencies
Ordering dependencies are the hardest migration failures to diagnose because they are invisible by design. In many test suites, tests pass only because they run in a specific order. The original developer did not intend this. It happened organically as the suite grew, with later tests accidentally depending on side effects from earlier ones.
When an AI migrates these tests, it faithfully reproduces the test logic but not the ordering. The original RSpec suite might have run tests alphabetically by file name. The new Minitest suite might run them in definition order or random order. A test that always ran after the "create admin user" test now runs first, and there is no admin user in the database.
The solution is aggressive test isolation. Every test should set up its own preconditions from scratch and tear them down afterward. But enforcing this retroactively on a migrated suite is expensive. The practical approach is to run the suite with randomized ordering (using seed-based randomization so failures are reproducible), identify the tests that fail intermittently, and fix those specific tests. Tools like minitest-bisect can automate the process of finding which test combinations cause failures.
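The seed-based randomization described above is simple to picture. This sketch mirrors the mechanism behind Minitest's `--seed` flag: the same seed always produces the same order, so an ordering failure can be replayed exactly:

```ruby
# Sketch: seed-based random ordering, the mechanism behind Minitest's --seed.
# Same seed => same shuffle, so an order-dependent failure is reproducible.
def shuffled_order(test_names, seed)
  test_names.shuffle(random: Random.new(seed))
end
```

When CI reports a failure under seed 12345, rerunning locally with that seed reproduces the exact sequence, which is what makes tools like minitest-bisect able to narrow down the offending pair of tests.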
5. Validating AI-Generated Test Suites at Scale
Validation is where most AI test migration efforts fall short. Teams generate the tests, see them pass, and move on. But "passing" is a necessary condition for a good test, not a sufficient one. A test that always passes regardless of whether the code works correctly is worse than no test at all because it provides false confidence.
Mutation testing is the most effective validation technique. Tools like mutant (for Ruby) or Stryker (for JavaScript) introduce small changes to your production code and verify that your tests catch them. If you change a greater-than to a less-than and your test suite still passes, that test is not actually verifying the behavior it claims to verify. Running mutation testing against AI-generated tests consistently reveals that 10% to 30% of generated assertions are either tautological or testing the wrong thing entirely.
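The tautology problem is easy to demonstrate in miniature. In this sketch, `mutated_full_name` simulates a mutant (arguments swapped); the method names are illustrative:

```ruby
# Sketch: why mutation testing flags tautological assertions.
# mutated_full_name simulates a mutant of full_name: arguments swapped.
def full_name(first, last)
  "#{first} #{last}"
end

def mutated_full_name(first, last)
  "#{last} #{first}"
end

# Tautological check: compares the code's output to itself,
# so it passes for the mutant too and provides no protection.
def tautological_check(fn)
  send(fn, "Ada", "Lovelace") == send(fn, "Ada", "Lovelace")
end

# Real check: pins the expected value, so the mutant is killed.
def real_check(fn)
  send(fn, "Ada", "Lovelace") == "Ada Lovelace"
end
```

A mutation tool does exactly this at scale: it generates the mutant, reruns the suite, and reports every assertion that let the mutant live.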
For browser-level testing, tools like Assrt take a different approach to validation. Instead of starting from source code and generating unit tests, they start from the running application and generate end-to-end tests based on actual user flows. This sidesteps many of the fixture and ordering problems because the tests interact with the application the same way a real user would, through the browser, with real HTTP requests and real database operations.
The most robust validation pipeline combines both approaches: AI-generated unit tests validated by mutation testing, plus AI-generated browser tests that verify the application works end-to-end. The unit tests catch logic errors quickly. The browser tests catch integration errors that unit tests miss by design.
6. Building a Reliable AI Test Migration Pipeline
The 21 out of 98 failure rate from the Rails migration experiment is actually a strong result. A 78% fully automated success rate for a complex migration task is remarkable. The question is not whether AI can handle 100% of test migration automatically. It cannot. The question is how to build a pipeline that maximizes the automated percentage and makes the remaining failures easy to identify and fix.
Start with a staging environment that mirrors CI exactly. Run every AI-generated test in this environment before merging. Do not trust local results. Set up randomized test ordering from day one and run the suite at least three times with different seeds. If a test passes with all three seeds, it is likely (though not certainly) order-independent.
Implement a quarantine system for flaky tests. When a test fails intermittently, move it to a quarantine suite that runs separately from the main CI pipeline. This prevents flaky tests from blocking deployments while still tracking them for eventual repair. Most CI systems support this through test tags or separate test commands.
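Minitest has no built-in tagging, so quarantine is usually a convention. This is one minimal sketch, assuming a team-maintained `QUARANTINED` list and a `RUN_QUARANTINE` environment variable (both names are assumptions, not a standard):

```ruby
# Sketch: env-gated quarantine. QUARANTINED and the RUN_QUARANTINE
# variable are assumed conventions, not Minitest features.
QUARANTINED = %w[test_flaky_checkout test_intermittent_search].freeze

# True when a test should be skipped in the main pipeline:
# it is on the list and the quarantine suite was not requested.
def skip_quarantined?(test_name, run_quarantine: ENV.key?("RUN_QUARANTINE"))
  QUARANTINED.include?(test_name) && !run_quarantine
end
```

A `setup` hook can then call `skip` for any quarantined test, while a separate nightly CI job runs with `RUN_QUARANTINE=1` to keep tracking the flaky set.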
Finally, layer your testing strategy. AI-generated unit test migrations handle the bulk of coverage. Browser-level test generation (using tools like Assrt or Playwright Codegen) covers the integration layer that unit tests miss. Manual testing focuses exclusively on the edge cases that both automated approaches struggle with. This layered approach accepts that no single tool handles everything and builds reliability from the combination.