
Testing in a Multi-Agent World: Verification When 12 AI Agents Write Your Code

When a dozen AI coding agents work on your codebase simultaneously, each producing valid code in isolation, the real challenge is not individual correctness. It is what happens when their changes merge.


1. The Semantic Conflict Problem

Traditional merge conflicts are syntactic: two changes modify the same line, and Git cannot resolve the difference. These are easy to detect and straightforward to fix. Semantic conflicts are far more dangerous. Two agents modify different files, the merge succeeds cleanly, and the resulting code compiles without errors. But the application is broken because the changes are logically incompatible.

Consider a concrete example. Agent A refactors the user authentication module to return a session token as a string. Agent B, working simultaneously, updates the dashboard to expect the session token as an object with token and expiry fields. Both changes are valid in isolation. Both pass their respective unit tests. But when merged, the dashboard crashes because it receives a string where it expects an object.
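The scenario above can be sketched in a few lines of TypeScript (module and function names are hypothetical). Each half compiles and passes its own tests; only the merged combination misbehaves, and only at runtime, because the boundary between the two modules is dynamically typed in practice (JSON over HTTP, `any`-typed glue code):

```typescript
// Old contract: the session token is an object.
type SessionV1 = { token: string; expiry: number };

// Agent A's refactor: the auth module now returns a bare string.
function createSession(userId: string): string {
  return `tok_${userId}`;
}

// Agent B's dashboard, still written against the old object shape.
function tokenHeader(session: SessionV1): string {
  return `Bearer ${session.token}`;
}

// After the merge, the dashboard receives a string. At a dynamically
// checked boundary nothing complains until runtime: `session.token` is
// undefined on a string, so the header silently becomes "Bearer undefined".
const merged = createSession("u42") as unknown as SessionV1;
const header = tokenHeader(merged);
```

The `as unknown as` cast stands in for any untyped seam between the two modules; in a fully statically typed monolith, `tsc` would catch this particular conflict, which is exactly why the interface-detection techniques discussed later are worth automating.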

This problem scales quadratically with the number of agents. With 2 agents there is 1 potential conflict pair; with 12 agents there are 66, since n agents produce n(n−1)/2 pairs. In any non-trivial codebase with concurrent development across multiple modules, the probability that at least one pair conflicts semantically rises sharply with each agent you add.
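The pair count is just the binomial coefficient C(n, 2), which is worth making explicit if you are budgeting CI capacity per agent:

```typescript
// Number of unordered agent pairs that could semantically conflict:
// C(n, 2) = n(n - 1) / 2.
function conflictPairs(agentCount: number): number {
  return (agentCount * (agentCount - 1)) / 2;
}
```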

2. Per-Agent Checks: Necessary but Insufficient

Every AI coding agent should run tests against its own changes before submitting a pull request. This is the baseline: unit tests, type checking, linting, and ideally a subset of integration tests relevant to the changed code. These checks catch the obvious issues, such as syntax errors, type mismatches, and regressions in the code the agent directly modified.

The limitation is scope. Per-agent checks verify that the change works in the context of the codebase as it existed when the agent started working. They do not account for concurrent changes by other agents. By the time Agent A's PR is ready to merge, Agents B through L may have already merged changes that alter the assumptions Agent A's code relies on.

Some teams try to address this by rebasing each agent's branch onto the latest main before merging. This catches syntactic conflicts but still misses semantic ones. The rebase succeeds, the per-agent tests pass (because they only test the agent's changes), and the semantic conflict slips through into production.


3. Integration Tests on Merge Commits

The only reliable way to catch semantic conflicts is to run integration tests on the merged result, not on individual branches. This means your CI pipeline needs a stage that takes the merge commit (the result of combining the PR branch with the current main), builds the entire application, and runs a comprehensive integration test suite against it.

GitHub's merge queue feature supports this pattern natively. Instead of merging PRs directly, they enter a queue. Each PR is speculatively merged with the current main and all PRs ahead of it in the queue. The full test suite runs against this speculative merge. If tests fail, the offending PR is ejected from the queue while the others proceed. This ensures that every commit on main has passed integration tests in the context of all concurrent changes.
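Wiring the integration suite to the merge queue is mostly a trigger change in CI. A minimal GitHub Actions sketch (job and script names are placeholders) using the `merge_group` event, which fires for each speculative merge in the queue rather than for the PR branch alone:

```yaml
name: merge-queue-integration
on:
  merge_group:        # fires once per speculative merge in the queue
jobs:
  integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4       # checks out the speculative merge commit
      - run: npm ci
      - run: npm run test:integration   # hypothetical fast suite (under 10 minutes)
```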

The cost is CI compute time. Running full integration tests on every merge commit requires significant resources, especially with E2E browser tests. Teams can mitigate this by maintaining a fast integration suite (under 10 minutes) that covers critical paths, while reserving the comprehensive suite for nightly runs. The fast suite catches most semantic conflicts; the comprehensive suite catches the rest before the next business day.

4. Detecting Semantic Conflicts Automatically

Beyond integration testing, static analysis can flag potential semantic conflicts before tests even run. The simplest technique is interface change detection: if Agent A modifies a function signature, type definition, or API contract, the system checks whether any other open PR depends on the old version. For typed codebases, tools like TypeScript project references and API Extractor can automate this.
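The core of interface change detection is a diff over exported signatures. A minimal sketch, modeling each module's public API as a name-to-signature map (real tools diff full type information rather than strings):

```typescript
// Public API of a module: export name -> signature, as an opaque string.
type Api = Map<string, string>;

// Exports whose signature changed between two revisions of a module.
function changedExports(before: Api, after: Api): string[] {
  const changed: string[] = [];
  for (const [name, sig] of before) {
    if (after.get(name) !== sig) changed.push(name);
  }
  return changed;
}

// Example: Agent A changed createSession's return type. Any open PR that
// calls createSession should be flagged for combined testing.
const beforeApi: Api = new Map([["createSession", "(userId: string) => SessionToken"]]);
const afterApi: Api = new Map([["createSession", "(userId: string) => string"]]);
const drift = changedExports(beforeApi, afterApi);
```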

More advanced approaches use dependency graphs. When Agent A changes module X, the system identifies all modules that import from X and checks whether any other agent has open changes in those dependent modules. If so, the system flags a potential conflict and requires both agents' changes to be tested together before either can merge. This approach catches conflicts early without running the full test suite.
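The dependency-graph check reduces to a reverse-import lookup plus an intersection with each other agent's changed files. A minimal sketch, assuming the import graph is already extracted (module names are illustrative):

```typescript
// Import graph: module -> list of modules it imports.
type Graph = Map<string, string[]>;

// All modules that directly import `changed`.
function dependentsOf(imports: Graph, changed: string): Set<string> {
  const result = new Set<string>();
  for (const [mod, deps] of imports) {
    if (deps.includes(changed)) result.add(mod);
  }
  return result;
}

// Two PRs potentially conflict if one changes module X and the other
// changes a module that imports X, in either direction.
function potentialConflict(imports: Graph, prA: string[], prB: string[]): boolean {
  for (const changed of prA) {
    if (prB.some((m) => dependentsOf(imports, changed).has(m))) return true;
  }
  for (const changed of prB) {
    if (prA.some((m) => dependentsOf(imports, changed).has(m))) return true;
  }
  return false;
}

// Example graph: the dashboard imports auth; billing imports payments.
const importGraph: Graph = new Map([
  ["dashboard", ["auth"]],
  ["billing", ["payments"]],
]);
```

A production version would walk transitive dependents as well; the direct-import check shown here is the cheapest first pass.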

Some teams also implement "contract tests" at module boundaries. Each module publishes a contract (input types, output types, expected behaviors), and any change that modifies a contract triggers tests in all dependent modules. This is particularly effective in microservice architectures where agents might be working on different services that communicate via APIs.
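A contract test at a module boundary can be as small as a shape-and-behavior check that lives with the consumer and runs in the producer's CI. A sketch under assumed names (the session contract here is hypothetical):

```typescript
// The contract the auth module publishes to its consumers.
interface SessionContract {
  token: string;
  expiry: number; // epoch milliseconds, must be in the future
}

// Hypothetical producer under test.
function issueSession(userId: string): SessionContract {
  return { token: `tok_${userId}`, expiry: Date.now() + 3_600_000 };
}

// Consumer-side contract check: verifies shape and the behavioral
// guarantee (unexpired token), not just that the code compiles.
function checkSessionContract(session: SessionContract): boolean {
  return (
    typeof session.token === "string" &&
    typeof session.expiry === "number" &&
    session.expiry > Date.now()
  );
}
```

If Agent A's refactor changed `issueSession` to return a bare string, this check would fail in Agent A's pipeline, before the merge, rather than in the dashboard at runtime.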

5. Building a Multi-Agent Testing Pipeline

A practical multi-agent testing pipeline has three layers. The first layer runs per-agent: each agent executes unit tests, type checks, and focused integration tests before submitting its PR. This is fast (under 5 minutes) and catches the majority of issues. The second layer runs on merge: when a PR enters the merge queue, the full integration suite runs against the speculative merge commit. This catches semantic conflicts and cross-module regressions.

The third layer is continuous E2E verification. Tools like Assrt can maintain a living test suite that covers critical user flows and runs against every deployment. When multiple agent changes ship in the same deployment, these E2E tests verify that the user experience remains intact. Because these tests operate at the browser level, they catch integration issues that lower-level tests might miss, such as CSS conflicts, JavaScript loading order issues, or state management bugs.

The key architectural principle is that test scope should increase as code moves closer to production. Agents run narrow, fast tests. The merge queue runs broader integration tests. Production deployments trigger comprehensive E2E verification. Each layer catches a different class of issues, and together they provide confidence that multi-agent development does not compromise application quality.
