Test Generation

Multimodal Test Generation: Combining Source Code, Runtime Traces, and Historical Defects

Generating tests from source code alone misses the failure patterns that actually matter in production. When you combine code analysis with runtime traces, bug reports, and visual snapshots, the resulting tests catch the classes of bugs that users encounter, not just the classes of bugs that developers imagine.

2.7x

Tests generated from historical defect data catch 2.7x more real regressions than tests generated from source code analysis alone.

IEEE International Conference on Software Testing, 2025

1. Why Code-Only Test Generation Falls Short

Most AI test generation tools work by analyzing source code. They parse function signatures, trace call graphs, identify branching logic, and generate tests that exercise the paths they find. This approach produces tests that achieve high code coverage, but code coverage is a poor proxy for defect detection.

The fundamental problem is that source code describes what the application does, not what can go wrong. A function that parses user input might have three branches in the code (valid input, empty input, invalid format), and a code-only generator will dutifully create tests for all three. But the real-world failures might involve Unicode characters that the parser mishandles, concurrent requests that cause race conditions, or browser autofill that submits the form before validation runs. None of these failure modes are visible in the source code.

Research consistently shows a disconnect between code coverage and real-world reliability. Projects with 90%+ line coverage still experience production incidents at rates comparable to projects with 60% coverage. The difference lies not in how many lines are covered but in which scenarios are tested. Code-only generators excel at covering lines but struggle with covering scenarios, because scenarios emerge from the interaction between code, environment, user behavior, and external systems.

This gap has driven interest in multimodal test generation: combining multiple signals beyond source code to produce tests that map more closely to actual failure patterns. The additional signals (runtime traces, defect histories, visual snapshots) each contribute a different dimension of coverage that code analysis alone cannot provide.

2. Runtime Traces as a Test Generation Signal

Runtime traces capture what actually happens when users interact with your application: the HTTP requests made, the database queries executed, the event handlers fired, the DOM mutations that occur. This data reveals patterns that are invisible in static code analysis, particularly around timing, sequencing, and integration behavior.

Consider a search feature. The source code shows a function that takes a query string and returns results. Runtime traces reveal that users type three characters, pause, type two more, delete one, then submit. Each keystroke triggers a debounced API call. The response for the first query sometimes arrives after the response for the second query, causing stale results to flash on screen. No amount of source code analysis would generate a test for this race condition, but a trace-informed generator would recognize the pattern and create a test that simulates overlapping requests with intentionally delayed responses.
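The fix such a test would exercise is a latest-wins guard: each request is tagged, and a response only renders if its tag still matches the newest request. A minimal sketch follows; `fakeSearch` and its latencies are hypothetical stand-ins for a real debounced API call.

```typescript
// Sketch of a latest-wins guard against out-of-order search responses.
// `fakeSearch` simulates a backend with configurable latency (an assumption
// for this demo, not a real API).
type Results = { query: string; items: string[] };

function fakeSearch(query: string, delayMs: number): Promise<Results> {
  return new Promise((resolve) =>
    setTimeout(() => resolve({ query, items: [`result for ${query}`] }), delayMs)
  );
}

let latestRequestId = 0;
let rendered: Results | null = null;

async function search(query: string, delayMs: number): Promise<void> {
  const requestId = ++latestRequestId;       // tag this request
  const results = await fakeSearch(query, delayMs);
  if (requestId !== latestRequestId) return; // stale response: drop it
  rendered = results;                        // only the newest request renders
}

// Simulate the race: the first query's response arrives after the second's.
async function demo(): Promise<string> {
  const first = search("abc", 50);    // slow response for the older query
  const second = search("abcde", 10); // fast response for the newer query
  await Promise.all([first, second]);
  return rendered!.query;             // the newer query wins despite arriving first
}
```

A trace-informed generator can produce exactly this kind of test by replaying the observed keystroke sequence with intentionally inverted response latencies.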

Several tools enable trace collection for test generation. OpenTelemetry provides standardized instrumentation for both backend and frontend. Session replay tools like FullStory, LogRocket, and PostHog capture user interaction sequences that can be converted into test scenarios. Browser DevTools protocol exports network and performance traces that Playwright can replay and assert against.

The challenge with trace-based generation is volume. A busy application produces millions of traces per day. The test generation system needs to cluster similar traces, identify anomalous patterns, and select representative scenarios for test creation. This is where AI models add value: they can classify traces by user intent, detect outlier sequences that suggest bugs, and generate focused tests for the most impactful scenarios rather than exhaustively testing every observed pattern.
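The clustering step can be surprisingly simple at its core: normalize each trace into a signature (for example, masking numeric IDs in paths) and bucket traces by it, so the generator emits one test per behavior pattern rather than one per trace. The trace shape and masking rule below are illustrative assumptions, not any particular tool's API.

```typescript
// Sketch: clustering raw traces by a normalized signature.
type Trace = { steps: string[] };

// Mask numeric path segments so /users/42 and /users/99 land in one cluster.
function signature(trace: Trace): string {
  return trace.steps.map((s) => s.replace(/\/\d+/g, "/:id")).join(" -> ");
}

function clusterTraces(traces: Trace[]): Map<string, Trace[]> {
  const clusters = new Map<string, Trace[]>();
  for (const t of traces) {
    const key = signature(t);
    const bucket = clusters.get(key) ?? [];
    bucket.push(t);
    clusters.set(key, bucket);
  }
  return clusters;
}
```

Cluster size then becomes a cheap proxy for impact: large clusters are common flows worth an end-to-end test, while one-off signatures are candidates for anomaly review.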

3. Mining Bug Reports for Test Scenarios

Your bug tracker is a goldmine of test scenarios that nobody is mining. Every bug report describes a failure mode that your existing tests missed. Feeding this information back into the test generation pipeline closes the loop between production incidents and test coverage.

The process starts with categorizing historical defects. Most bug trackers contain enough structured data (severity, component, reproduction steps, root cause) to build a taxonomy of failure types. Common categories include: data validation failures where unexpected input causes errors, state management bugs where the UI falls out of sync with the backend, integration failures where third-party services return unexpected responses, and race conditions where concurrent operations produce incorrect results.
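A first pass at this categorization can be as plain as keyword rules over issue titles; a production pipeline would lean on the tracker's structured fields and a trained classifier instead. The taxonomy and patterns below are illustrative assumptions.

```typescript
// Sketch: bucketing bug-tracker issues into a failure taxonomy with
// keyword rules (illustrative only; real pipelines use structured fields).
const taxonomy: Record<string, RegExp> = {
  "data-validation": /invalid|accepts|rejects|input|format/i,
  "state-management": /out of sync|stale|refresh|state/i,
  "integration": /third[- ]party|timeout|5\d\d|api error/i,
  "race-condition": /race|concurrent|simultaneous|double[- ]click/i,
};

function categorize(title: string): string {
  for (const [category, pattern] of Object.entries(taxonomy)) {
    if (pattern.test(title)) return category; // first matching rule wins
  }
  return "uncategorized";
}
```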

Once categorized, each defect class becomes a test generation template. If your application has had five bugs related to timezone handling in the last year, the generator should produce tests that exercise date-sensitive features with different timezone settings, daylight saving transitions, and edge dates (December 31, February 29). The specific test cases come from the historical bugs, but the generator extrapolates to cover variations that have not yet been reported.
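Concretely, a defect-class template for timezone handling boils down to a parameter matrix that generated tests iterate over. The timezone list and edge dates below are illustrative picks seeded by the kinds of bugs described above, not an exhaustive set.

```typescript
// Sketch: a "timezone handling" defect class rendered as a test parameter
// matrix. Values are illustrative; a real template would be seeded from
// the historical bug reports themselves.
const timezones = ["UTC", "America/New_York", "Asia/Kolkata", "Pacific/Chatham"];
const edgeDates = [
  "2025-12-31", // year boundary
  "2024-02-29", // leap day
  "2025-03-09", // US DST spring-forward
  "2025-11-02", // US DST fall-back
];

type DateCase = { tz: string; date: string };

// Cross-product: every edge date under every timezone.
function dateCases(): DateCase[] {
  const cases: DateCase[] = [];
  for (const tz of timezones)
    for (const date of edgeDates) cases.push({ tz, date });
  return cases;
}
```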

AI models are particularly good at this extrapolation step. Given a bug report that says "the discount code field accepts negative values," an LLM can generate a suite of related tests: zero values, extremely large values, decimal values, special characters, SQL injection strings, and JavaScript injection payloads. The original bug provides the seed, and the model expands it into a comprehensive boundary testing suite.
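The shape of that expansion looks something like the following. In practice an LLM proposes the variants; this hand-written list and the `isValidDiscountCode` validator are hypothetical, included only to show how the seeded suite asserts rejection of every variant.

```typescript
// Sketch: expanding one reported bug ("discount code field accepts negative
// values") into a boundary suite. The variants and validator are hypothetical.
const boundaryInputs: string[] = [
  "-1",                        // the reported failure
  "0",                         // boundary between invalid and valid
  "999999999999999999999",     // overflow-scale magnitude
  "1.5",                       // decimal where an integer is expected
  "1e3",                       // scientific notation
  "' OR '1'='1",               // SQL injection probe
  "<script>alert(1)</script>", // JS injection probe
];

// Hypothetical validator under test: positive integers up to 100 only.
function isValidDiscountCode(input: string): boolean {
  return /^[0-9]+$/.test(input) && Number(input) > 0 && Number(input) <= 100;
}
```

Each generated test asserts that the field rejects its variant, turning a single incident into durable boundary coverage.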

4. Visual Regression: Catching What Unit Tests Cannot

Unit tests and integration tests verify functional behavior: the function returns the right value, the API responds with the correct status code, the database contains the expected records. They are blind to visual regressions: a button that shifts 200 pixels to the right, a modal that renders behind the page overlay, a font that fails to load and falls back to a generic serif, or a responsive layout that breaks on specific viewport widths.

Visual regression testing compares screenshots of the application across builds. When a pixel-level diff exceeds a threshold, the test flags it for review. This approach catches an entire category of bugs that functional tests miss, particularly CSS regressions, z-index conflicts, layout shifts caused by content changes, and rendering differences across browsers.
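The thresholding logic at the heart of this comparison fits in a few lines. Real tools compare full RGBA screenshots with perceptual tolerances and region masking; this grayscale sketch only shows the diff-ratio idea, with the default tolerance and threshold values chosen arbitrarily for illustration.

```typescript
// Sketch: pixel-diff thresholding over two same-sized grayscale buffers.
// A real tool operates on RGBA screenshots with perceptual color distance.
function diffRatio(baseline: number[], candidate: number[], tolerance = 0): number {
  let changed = 0;
  for (let i = 0; i < baseline.length; i++) {
    if (Math.abs(baseline[i] - candidate[i]) > tolerance) changed++;
  }
  return changed / baseline.length; // fraction of pixels that differ
}

// Flag for human review when more than `threshold` of pixels changed.
function flagsRegression(baseline: number[], candidate: number[], threshold = 0.01): boolean {
  return diffRatio(baseline, candidate) > threshold;
}
```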

Traditional visual regression tools (Percy, Chromatic, Applitools) use pixel comparison or DOM structure comparison. AI-powered visual regression takes this further by understanding semantic intent. Instead of flagging every pixel difference, an AI model can distinguish between intentional design changes and unintentional regressions. A button that changes color because the design system was updated is intentional. The same button losing its padding because a CSS specificity conflict was introduced is a regression.

Assrt incorporates visual regression into its AI-powered test discovery workflow. When crawling an application, it captures visual snapshots alongside functional test generation, creating a baseline that combines behavioral assertions with visual expectations. This dual-layer approach catches regressions that either technique alone would miss: a form that still submits correctly but renders its error messages in white text on a white background, or a navigation menu that functions properly but overlaps with the page content on mobile viewports.

5. Putting It All Together: Multimodal Test Pipelines

A multimodal test generation pipeline combines all three signals (source code, runtime traces, and defect history) into a unified system that produces tests covering functional behavior, real-world usage patterns, known failure modes, and visual consistency. Building this pipeline does not require implementing everything at once; each signal layer adds value independently.

The first layer is code-aware test generation. Tools like Codium (now Qodo), Diffblue, and various LLM-based generators analyze your source code and produce unit and integration tests. These tests cover the functional contract of each module and provide the foundation for the test suite. They are fast to generate, fast to run, and catch straightforward logic errors.

The second layer adds runtime awareness. Instrument your staging environment with OpenTelemetry or a session replay tool. Feed the collected traces into an AI model that identifies patterns worth testing: common user flows, unexpected navigation sequences, slow page loads, error spikes after deployments. The model generates end-to-end Playwright tests for the most significant patterns. Assrt's discovery approach operates in this layer, crawling the live application to find test-worthy flows that static analysis would miss.
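The last step of this layer, turning a selected trace into a Playwright test, can be sketched as a small renderer over trace steps. The `Step` shape and mapping rules are assumptions about what a trace pipeline might emit; a real generator would also derive assertions from observed responses rather than producing navigation alone.

```typescript
// Sketch: rendering a clustered trace into Playwright test source.
// The step schema here is a hypothetical trace format.
type Step =
  | { kind: "goto"; url: string }
  | { kind: "click"; selector: string }
  | { kind: "fill"; selector: string; value: string };

function renderPlaywrightTest(name: string, steps: Step[]): string {
  const body = steps
    .map((s) => {
      switch (s.kind) {
        case "goto":
          return `  await page.goto(${JSON.stringify(s.url)});`;
        case "click":
          return `  await page.click(${JSON.stringify(s.selector)});`;
        case "fill":
          return `  await page.fill(${JSON.stringify(s.selector)}, ${JSON.stringify(s.value)});`;
      }
    })
    .join("\n");
  return `test(${JSON.stringify(name)}, async ({ page }) => {\n${body}\n});`;
}
```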

The third layer incorporates defect intelligence. Connect your bug tracker (Jira, Linear, GitHub Issues) to the test generation pipeline. When a bug is filed, the system automatically generates a regression test that reproduces the reported behavior. When the bug is fixed, the regression test becomes part of the permanent suite. Over time, this layer builds a test suite that is shaped by the actual failure patterns of your application, not by an abstract coverage metric.

The teams getting the most value from AI test generation are those that treat it as a data problem, not just a code analysis problem. The more signals you feed into the generation process (code structure, user behavior, historical failures, visual baselines), the more closely the generated tests match the real-world scenarios where bugs actually occur. No single signal is sufficient on its own, but their combination produces test suites that catch regressions traditional approaches miss and do so with tests that remain relevant as the application evolves.

Ready to automate your testing?

Assrt discovers test scenarios, writes Playwright tests from plain English, and self-heals when your UI changes.

$ npm install @assrt/sdk