Testing Guide

AI-Powered Test Migration at Scale: Lessons and Failure Modes

A developer on r/rails recently shared the results of running 764 Claude sessions to migrate 98 Rails models from RSpec to Minitest. Of those 98 migrations, 21 required human intervention to complete. The lessons from that experiment reveal something important about how to think about AI test migration at scale: the failure mode is not “AI writes bad code.” It is “AI writes plausible code that passes locally but breaks assumptions elsewhere.”

21/98

In a 764-session AI migration of Rails models from RSpec to Minitest, 21 of 98 migrations required human intervention. Not because AI wrote obviously wrong code, but because it wrote plausible code that violated hidden assumptions.

1. The Plausible-but-Wrong Problem at Batch Scale

When you run a single AI session to generate a test, reviewing the output for correctness is manageable. When you run 764 sessions in parallel to migrate an entire test suite, review becomes a bottleneck. The appeal of batch migration is that you can replace hundreds of tests quickly. The danger is that plausible-but-wrong tests can slip past review at a rate that would be impossible with single-session generation.

Plausible-but-wrong tests share a specific signature. They are syntactically correct. They follow the conventions of the target framework (Minitest, in this case). They pass when run in isolation. They even pass the first few times the full suite runs. But they contain an assumption that is wrong for this specific codebase, an assumption that only surfaces under specific conditions: when run in parallel with other tests, when the CI environment differs from local, or when a particular piece of shared state is in an unexpected condition.

The 764-session experiment achieved a 78% fully automated success rate, which is genuinely impressive for a complex migration task. But the 22% that required intervention were not randomly distributed. They clustered around specific patterns: models with complex associations, tests that depended on ordering, fixtures with uniqueness constraints, and tests that made timing assumptions. Understanding these patterns in advance is the key to running batch migrations more efficiently.

The broader lesson is that batch AI migration should be designed around the expectation that a predictable percentage of output will require human review. The goal is not 100% automatic success, which current AI cannot achieve for complex codebases; it is to maximize the automatic success rate, make failures fast to identify, and ensure that failures are isolated so they do not block the rest of the migration.

2. Categories of Hidden Assumptions AI Cannot See

The hidden assumptions that cause batch migration failures fall into predictable categories. Knowing these categories in advance lets you set up targeted validation and focus human review where it matters most.

Database state assumptions. AI generates tests that create records assuming a clean database. In practice, test databases accumulate state from seed data, prior migrations, and the output of other tests running in parallel. A test that assumes it will find zero records matching a query, or that it will find exactly the record it just created, can fail when other tests have created records that match the same query. AI cannot see the full database state during generation because it has access only to the code, not to runtime behavior.
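The difference between a fragile and a robust assertion can be sketched in plain Ruby, with an array standing in for the test database. Everything here (the seed data, `create_user`, the check helpers) is hypothetical, but the shape of the fix is the point: scope the assertion to the record the test created rather than to global counts.

```ruby
# Plain-Ruby sketch (no real database): SEED_USERS stands in for state
# left behind by seed data, prior tests, or parallel workers.
SEED_USERS = [
  { id: 1, email: "seed@example.com", role: "admin" },
]

# Hypothetical factory: appends a record and assigns the next id.
def create_user(store, attrs)
  record = attrs.merge(id: store.size + 1)
  store << record
  record
end

# Fragile: assumes the store was empty before this test ran.
def only_one_user?(store)
  store.size == 1
end

# Robust: scoped to the specific record this test created.
def contains_record?(store, record)
  store.any? { |u| u[:id] == record[:id] && u[:email] == record[:email] }
end
```

Run against a store that already contains seed data, the count-based check fails while the scoped check still passes, which is exactly the property a batch-migrated test needs.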

Callback and lifecycle assumptions. ActiveRecord callbacks, lifecycle hooks, and before/after filters are common in Rails models and frequently implicit in the code. AI may not generate test setup that triggers the necessary callbacks, or it may not anticipate the side effects that callbacks produce. A test that creates a user record without triggering the after_create callback that sets default preferences leaves the model in an unexpected state for any assertion that checks those preferences.
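The failure mode can be illustrated without ActiveRecord. In this plain-Ruby sketch (the `User` class and its hook are invented stand-ins), `create` fires an after-create-style hook while building the object directly skips it, mirroring how `User.create` runs callbacks that `User.new` plus direct attribute writes do not:

```ruby
# Sketch of callback-dependent state; not ActiveRecord, just the shape.
class User
  attr_accessor :email, :preferences

  def initialize(email)
    @email = email
    @preferences = nil
  end

  # Stand-in for an ActiveRecord after_create callback.
  def apply_default_preferences
    @preferences ||= { newsletter: false }
  end

  # "create" runs the callback, like ActiveRecord's create path.
  def self.create(email)
    user = new(email)
    user.apply_default_preferences
    user
  end
end
```

A migrated test that instantiates the model with `new` and then asserts on `preferences` will see `nil`, even though the production code path (which always goes through `create`) never does.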

Uniqueness constraint assumptions. When 764 sessions generate test data independently, they frequently choose the same values for unique fields. Email addresses like “test@example.com” appear in dozens of independently generated tests. When those tests run concurrently or sequentially without proper transaction isolation, uniqueness violations occur.
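A simple mitigation is to make every independently generated session use values that cannot collide. This sketch (the helper name is an assumption, not from the original experiment) uses Ruby's SecureRandom to build per-test unique emails:

```ruby
require "securerandom"

# Generates an email that is unique across independently generated tests,
# so concurrent runs cannot trip a uniqueness constraint on the column.
def unique_email(prefix = "test")
  "#{prefix}+#{SecureRandom.hex(4)}@example.com"
end
```

Baking a helper like this into the prompt or shared test support file means 764 sessions can each "pick an email" without ever picking the same one.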

Framework-specific convention assumptions. Minitest and RSpec handle fixtures, factories, and database transactions differently. AI models trained on both frameworks know the conventions of each, but when migrating from one to the other, they sometimes blend conventions in ways that are valid in neither framework. A test that uses an RSpec-style let block inside a Minitest class definition fails in ways that are not immediately obvious from the error message.
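The idiomatic Minitest translation of an RSpec `let` is a `setup` method (or a memoized helper method), not a `let` block pasted into the class body. A minimal sketch, with a hash standing in for a real model:

```ruby
require "minitest/test"

# RSpec's `let(:user) { ... }` has no direct Minitest equivalent.
# The conventional translation is `setup` plus an accessor.
class UserTest < Minitest::Test
  def setup
    @user = { name: "Ada", active: true } # hypothetical model stand-in
  end

  def user
    @user
  end

  def test_user_is_active
    assert user[:active]
  end
end
```

A blended version that calls `let(:user) { ... }` inside the class raises `NoMethodError: undefined method 'let'` at load time, an error message that points at the DSL call rather than at the real problem, which is a framework convention mismatch.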

Validate migrated unit tests with browser-level coverage

Assrt generates Playwright tests from your running application, catching the integration failures that migrated unit tests can miss. Generates standard Playwright files you can inspect, modify, and run in any CI pipeline.


3. Why Migrated Tests Pass Locally but Fail in CI

The local-to-CI gap is one of the most frustrating aspects of large-scale AI migration. A migration session completes, the developer runs the tests locally, everything passes, and the migration is marked complete. Then CI runs the tests, three of them fail, and the developer has to debug why tests that passed locally are failing in the automated environment.

The most common cause is environment configuration differences. Local development machines have accumulated configuration that is not captured in the codebase: specific Ruby gem versions, database configurations, environment variables from dotenv files, and file system state from previous runs. AI-generated tests may rely on any of these without knowing it, because the code they are migrating also relies on them.

Timing is another common cause. Local machines run tests sequentially on fast hardware. CI runners handle multiple concurrent builds on shared infrastructure with variable performance. A test that asserts a background job completes within 50 milliseconds passes locally on a fast Mac and fails intermittently on a loaded CI runner. AI generates timing assumptions based on the code structure rather than real performance characteristics.
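The standard fix is to replace fixed-delay assertions with a polling wait that tolerates slow CI runners. A minimal sketch (the `wait_until` helper is an assumption, not part of Minitest):

```ruby
# Polls a condition until it becomes true or the timeout expires.
# Unlike `sleep 0.05; assert done`, this passes on slow CI runners
# as long as the work completes within the (generous) timeout.
def wait_until(timeout: 2.0, interval: 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
    sleep interval
  end
end
```

In a migrated test this replaces `assert job_done?` after a fixed sleep with `assert wait_until { job_done? }`, turning a timing assumption into a bounded wait.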

Parallel execution order is particularly relevant for batch migrations. When 764 sessions each generate tests that create database records, and those tests run in parallel in CI, the ordering of database operations becomes unpredictable. Tests that rely on counting records or on being the first to create a record with a particular attribute value fail when another test gets there first.

The practical solution is to run all migrated tests in a CI environment identical to production CI before marking any batch migration complete. Never trust local green to mean CI green for AI-generated tests. The validation environment must be the environment where failures are detected, not the developer's local machine.

4. Patterns That Require Human Intervention

The 21 migrations that required human intervention in the Rails experiment were not evenly distributed across failure types. They clustered around patterns that are consistently difficult for AI to handle autonomously. Identifying these patterns before running a migration lets you route them to human review from the start rather than discovering the need for intervention after failed attempts.

Complex polymorphic associations. Models with polymorphic has_many or belongs_to relationships require test data setup that correctly specifies the type column as well as the foreign key. AI frequently generates test factories for these models but sets up the association incorrectly, creating test data that the model will not find through its association methods.
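The core mistake is setting only the foreign key. A polymorphic lookup matches on the type column and the id together, as this plain-Ruby sketch of the query shape shows (the comment data and column names mirror Rails conventions but are invented for illustration):

```ruby
# A polymorphic association stores both a type and an id; a record with
# the right id but the wrong type is invisible to the association.
COMMENTS = [
  { body: "nice post",  commentable_type: "Post",  commentable_id: 1 },
  { body: "wrong type", commentable_type: "Photo", commentable_id: 1 },
]

def comments_for(type, id)
  COMMENTS.select do |c|
    c[:commentable_type] == type && c[:commentable_id] == id
  end
end
```

An AI-generated factory that sets `commentable_id: 1` but leaves `commentable_type` blank (or copies the wrong class name) produces data the model's association methods will never return, so assertions against the association silently test an empty set.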

Tests with business rule validation logic. Validations that encode business rules (a subscription cannot be active if the billing date is in the past without a payment on file) require understanding the business rule to generate meaningful test assertions. AI can generate test cases that verify the validation runs, but it often cannot generate test cases that verify the business rule is correctly implemented without that domain understanding.

Tests that interact with external services. Models that call external APIs, send emails, or trigger webhooks require careful mocking to avoid side effects during testing. AI generates mocks that are structurally correct but sometimes fails to mock all the code paths that a given test exercises, leaving some external calls live during test execution.

Routing these high-risk migration categories to human review from the beginning, while letting AI handle the straightforward models autonomously, is a more efficient approach than running all models through automated migration and then triaging failures afterward.

5. Building a Validation Pipeline for Batch-Migrated Tests

A systematic validation pipeline is the difference between a batch migration that delivers value and one that introduces a large number of plausible-but-wrong tests into your codebase. The pipeline should catch failures before they merge, not after.

Stage one is isolated CI validation. Each migrated test file runs in isolation in a clean CI environment before being considered for merge. This catches tests that fail due to environment differences or missing dependencies without the noise of tests failing due to interactions with other tests.

Stage two is full suite validation with randomized ordering. The complete set of migrated tests runs together with a randomized seed, repeated at least three times with different seeds. Tests that fail in some orderings but not others are automatically flagged as potentially order-dependent and routed to human review.
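The flagging logic for stage two is simple to sketch. In this hypothetical helper, `run_suite` is a hook that would, in a real pipeline, shell out to the test runner with a given seed and return the list of failing test names; a test is flagged only if it fails under some seeds but not all (failing under every seed means it is genuinely broken, not order-dependent):

```ruby
# Flags tests as order-dependent when they fail under some random seeds
# but not others. `run_suite` is a hypothetical hook returning the list
# of failing test names for a given seed.
def order_dependent_failures(seeds, &run_suite)
  failures_by_seed = seeds.to_h { |s| [s, run_suite.call(s)] }
  all_failures = failures_by_seed.values.flatten.uniq
  all_failures.select do |test|
    outcomes = failures_by_seed.values.map { |f| f.include?(test) }
    outcomes.any? && !outcomes.all?
  end
end
```

Everything this helper flags gets routed to human review; everything that fails under all seeds goes to an ordinary bug queue.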

Stage three is mutation testing on a sample. Run mutation testing tools (mutant for Ruby, Stryker for JavaScript) against a 10% random sample of the migrated tests. If the mutation score is below a threshold (typically 70% for migrated tests), review the full batch for assertion quality before merging.

Stage four is parallel execution load testing. Run the full migrated suite in parallel with maximum concurrency to surface race conditions and shared state issues that only appear under load. Most CI systems support configuring the number of parallel workers; use the maximum your system supports for this validation run.

6. A Layered Strategy Combining Unit and Browser-Level Testing

AI-migrated unit tests are good at verifying model logic in isolation. They are poor at verifying that the application works correctly when all the pieces work together. Browser-level testing fills the gap that unit tests leave by design, and it does so in a way that is largely immune to the framework-specific migration problems described in this guide.

A browser test for a checkout flow does not care whether the underlying model tests are written in RSpec or Minitest. It cares whether a user can click “Add to Cart,” proceed to checkout, enter payment details, and receive a confirmation. This end-to-end verification catches integration failures that unit tests miss precisely because unit tests are designed to test in isolation.

Tools like Assrt generate browser tests by crawling your running application and identifying the critical user flows it finds. The output is standard Playwright code (Playwright itself is open-source and free) that runs in any CI pipeline and does not require knowledge of your framework internals to maintain. Layering these browser tests over a batch-migrated unit test suite gives you two complementary safety nets: unit tests that catch logic errors quickly, browser tests that catch integration errors that unit tests miss.

The most reliable post-migration validation strategy is to run the full browser test suite against the application after unit test migration is complete. If the application behaves correctly at the browser level, the unit tests are likely correct at the model level. If browser tests catch regressions, those regressions narrow the scope of unit test review to the models that affect the failing flows. This combination makes large-scale migrations faster to validate and faster to fix when things go wrong.

Layer browser tests over your migrated unit suite

Assrt generates Playwright tests from your running app. Catches the integration failures that AI-migrated unit tests miss by design.

Generates standard Playwright files you can inspect, modify, and run in any CI pipeline.