Regression Testing Guide
47 Releases, 23 Regression Tickets: Why Green Dashboards Lie About Quality
Your CI is green. Your coverage numbers look healthy. And yet regression tickets keep coming in. The problem is not your tests. It is the growing gap between what you test and what your codebase actually does.
“47% of engineering teams report that regression bugs are their top source of unplanned work, even with automated test suites in place.”
LinearB Engineering Benchmarks, 2025
1. The Coverage Gap Problem
Every codebase has two growth rates: the rate at which new code ships and the rate at which new tests are written. In almost every organization, the first rate exceeds the second. The result is a steadily widening gap between what the code does and what the tests verify.
This gap is the root cause of the “47 releases, 23 regression tickets” problem that engineering leaders see in their retrospectives. Each release ships with tests covering the new feature, but the interactions between new features and existing functionality go untested. A new settings page changes how user preferences are stored. The original feature that reads those preferences was tested when it launched, but nobody updated those tests to account for the new storage format. The regression appears three sprints later when a user reports that their notifications stopped working.
Coverage metrics make this problem worse by creating false confidence. A team might report 80% line coverage, which sounds healthy. But line coverage measures which code paths execute during testing, not whether the assertions are meaningful. A test that renders a component and asserts it does not throw achieves coverage of every line in that component without verifying any actual behavior. The regression hiding in the interaction between two components is invisible to the coverage report.
The coverage gap compounds over time. A team shipping 50 pull requests per sprint with an average of 0.7 new test files per PR accumulates untested surface area at a predictable rate. After six months, the untested surface is large enough that regressions become a weekly occurrence rather than a monthly surprise.
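The compounding math above can be sketched in a few lines. All numbers here are the illustrative figures from the paragraph (50 PRs per sprint, 0.7 test files per PR), plus an assumed two-week sprint cadence; they are not measurements.

```typescript
// Illustrative model of coverage-gap accumulation. The ratios are the
// example numbers from the text, not benchmarks.
const prsPerSprint = 50;
const testFilesPerPr = 0.7;      // average new test files per PR
const sprintsPerSixMonths = 13;  // assuming two-week sprints

// Treating "one test file per PR" as the break-even ratio, every PR below
// that ratio adds untested surface area.
const untestedPrsPerSprint = prsPerSprint * (1 - testFilesPerPr);
const untestedAfterSixMonths = untestedPrsPerSprint * sprintsPerSixMonths;

console.log(Math.round(untestedPrsPerSprint));    // PRs per sprint shipping without a new test file
console.log(Math.round(untestedAfterSixMonths));  // accumulated untested changes after six months
```

Plug in your own team's PR rate and test-file ratio and the six-month number tends to explain why regressions shift from "monthly surprise" to "weekly occurrence."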
2. Why Green CI Dashboards Do Not Mean Quality
A green CI dashboard tells you one thing: the tests you have written pass. It tells you nothing about the tests you have not written. This distinction is crucial and widely misunderstood.
The green check mark creates a psychological effect called “automation bias,” the tendency to trust automated systems more than the evidence warrants. When CI is green, engineers feel confident merging. Product managers feel confident shipping. The dashboard becomes a proxy for quality, even though it is only a proxy for test suite health.
Flaky tests exacerbate this illusion. Teams that have dealt with flaky tests for months develop a tolerance for re-running failed builds. The mental model shifts from “a test failed, there might be a bug” to “a test failed, let me re-run and see if it passes.” This normalization of failure means that real regressions caught by genuinely failing tests get dismissed as flakiness and merged anyway.
The solution is not more tests for the sake of coverage numbers. It is better tests that cover the right things. This means focusing test effort on user-facing flows, integration boundaries, and recently changed code rather than chasing percentage targets. It also means treating the CI dashboard as one quality signal among many, not as the single source of truth about release readiness.
3. Root Cause Analysis Through Better Observability
When a regression does ship (and it will, no matter how good your test suite is), the speed of root cause analysis determines the severity of the impact. The difference between detecting a regression in 10 minutes versus 10 hours is often the difference between a quick hotfix and a P1 incident affecting thousands of users.
Modern observability tools (Sentry, Datadog, Honeycomb) provide the raw signals, but most teams under-invest in connecting those signals to their testing and deployment pipelines. A regression detected by an error rate spike is harder to diagnose than one detected by a failing test because the error spike does not tell you which change caused it. Connecting deployment markers to error tracking, so that you can immediately see which deploy introduced the new error pattern, reduces mean time to resolution dramatically.
Session replay tools add another dimension. When a user reports that checkout is broken, a session replay showing the exact sequence of clicks, network requests, and console errors eliminates the guesswork from reproduction. Some teams have integrated session replay data with their test generation pipeline, using real failure sessions as the basis for new regression tests.
The observability investment also feeds back into test prioritization. By tracking which code paths are most frequently involved in production errors, teams can focus their testing effort on the areas that actually break, rather than spreading test coverage evenly across the codebase. This data-driven approach to test strategy produces better outcomes than intuition-based prioritization.
4. Generating Tests Automatically as Code Changes
If the coverage gap problem is fundamentally about test creation speed lagging behind code creation speed, the most direct solution is to accelerate test creation. This is where AI-driven test generation has moved from a novelty to a practical necessity.
The auto-discovery approach, where a tool crawls your application and identifies testable scenarios, addresses the coverage gap at its source. Instead of relying on engineers to manually identify which flows need tests (and hoping they remember the edge cases), the tool discovers flows by analyzing the actual application. Assrt takes this approach by crawling your web application, identifying user scenarios, and generating real Playwright tests with self-healing selectors. Because the discovery is automated, new features get test coverage as soon as they ship, not weeks later when someone adds them to the test backlog.
Self-healing selectors deserve specific attention because they address one of the main reasons teams fall behind on test maintenance. When a developer changes a button's class name or restructures a form, traditional Playwright tests break even though the functionality is unchanged. Self-healing selectors adapt to these structural changes automatically, reducing the maintenance burden that causes test suites to rot over time.
The key to making generated tests valuable (rather than just adding noise) is curation. Not every generated test is worth keeping. The most effective workflow involves generating a batch of tests, reviewing them for relevance and correctness, promoting the valuable ones into the main suite, and discarding the rest. Over time, this process builds a comprehensive test suite faster than manual authoring could, while maintaining the intentionality that keeps test suites maintainable. Being open source and free of vendor lock-in matters here, too, because your test suite should remain portable regardless of which generation tool created it.
5. The LLM Output Testing Challenge
As more applications integrate LLM-powered features (chatbots, content generation, smart search, summarization), a new category of regression testing has emerged: verifying that non-deterministic AI outputs remain acceptable as models, prompts, and context windows change.
Traditional testing relies on deterministic assertions. Given input X, expect output Y. LLM outputs are inherently variable. The same prompt with the same model can produce different responses across runs. When you update the model version, swap providers, or modify the system prompt, the outputs shift in ways that are difficult to characterize with simple equality checks.
Evaluation frameworks (sometimes called “evals”) have emerged to address this challenge. Tools like Braintrust, Promptfoo, and custom eval harnesses define quality criteria for LLM outputs: relevance, factual accuracy, tone consistency, format compliance, and safety. Each prompt or chain is evaluated against a set of test cases with grading rubrics, producing a quality score rather than a pass/fail result.
The regression detection strategy for LLM features is fundamentally different from traditional software. Instead of asserting exact outputs, you track quality scores over time and alert when scores drop below a threshold or deviate significantly from the baseline. A model upgrade that improves average response quality by 5% but degrades quality on a specific category of queries by 20% might look like a net improvement in aggregate metrics while creating a real regression for affected users.
For teams building products with both traditional UI/API functionality and LLM-powered features, the testing strategy needs two distinct layers. The traditional layer (unit tests, E2E tests, contract tests) handles the deterministic parts of the application, where exact assertions are possible and appropriate. The eval layer handles the LLM-powered parts, where statistical quality measurement replaces binary pass/fail. Both layers run in CI, but they require different tooling, different failure thresholds, and different on-call response procedures. The teams that recognize this early avoid the trap of trying to force deterministic testing patterns onto inherently non-deterministic components.