CI/CD Test Reliability: Fixing Flaky Tests, Spec-Driven Development, and Production Monitoring

A flaky test suite is worse than no tests at all. It trains developers to ignore failures, erodes trust in CI, and slows down deployments as teams manually verify "known flakes." This guide covers the root causes of CI test unreliability, practical strategies for fixing them, and how to extend your testing into production with synthetic monitoring.

35% of CI test failures are caused by environment differences between local development and CI runners, not by actual application bugs.

CI/CD Pipeline Reliability Survey, 2025

1. CI Environment Differences: The Hidden Flake Factory

The most common source of flaky tests is not bad test code; it is environmental differences between where tests are written and where they run. A developer writes a test on a MacBook with 16 GB of RAM, a fast SSD, and a local database. That same test runs in CI on a shared Linux container with 2 GB of RAM, network-attached storage, and a database that might be shared with other jobs. Timing-sensitive assertions that pass locally fail intermittently in CI because the CI runner is slower, busier, or configured differently.

Browser-based tests are especially vulnerable. Playwright defaults to headless mode in CI, which renders pages slightly differently than headed mode. Font rendering, animation timing, and viewport dimensions can all differ. A visual regression test that passes locally may fail in CI because the headless browser renders a font two pixels wider, causing a layout shift that triggers the pixel comparison threshold.

The fix starts with making CI the source of truth, not local development. Configure your CI environment as precisely as possible: pin browser versions, use container images with fixed system dependencies, and set explicit viewport sizes in your Playwright config. Discourage developers from running the full E2E suite locally for validation; instead, provide a fast subset of smoke tests for local use and treat CI results as authoritative.
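The pinning described above lives in the Playwright config. A minimal sketch (the specific values here are illustrative, not recommendations):

```typescript
// playwright.config.ts -- illustrative values; adjust to your project
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  // Fail the CI build if a stray test.only is committed
  forbidOnly: !!process.env.CI,
  // Retries in CI surface flakes in reports instead of hiding them
  retries: process.env.CI ? 2 : 0,
  use: {
    // Explicit viewport so headless CI and headed local runs match
    viewport: { width: 1280, height: 720 },
    // Capture a trace on first retry to debug CI-only failures
    trace: 'on-first-retry',
  },
  // Pin the browser project; the container image should pin the
  // browser binary version to match
  projects: [{ name: 'chromium', use: { ...devices['Desktop Chrome'] } }],
});
```

Combined with a fixed container image, this removes most of the "works on my machine" variance between local runs and CI.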

Network is another major difference. Local tests hitting a local API server have sub-millisecond latency. CI tests hitting a staging API server have real network latency that varies by load. Every hard-coded timeout in your tests becomes a potential flake when latency spikes. Replace fixed timeouts with Playwright's auto-waiting and assertions that poll until a condition is met (expect(locator).toBeVisible() retries automatically) rather than waiting a fixed duration and then checking.
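Playwright's web-first assertions retry until the condition holds or a timeout elapses. The same poll-until-true pattern applies to any async condition; a minimal sketch in plain TypeScript (the `pollUntil` helper is ours, not a Playwright API):

```typescript
// Generic poll-until helper: retries a condition instead of sleeping
// for a fixed duration and checking once -- the same retry behavior
// behind expect(locator).toBeVisible().
async function pollUntil(
  condition: () => Promise<boolean> | boolean,
  { timeoutMs = 5000, intervalMs = 100 } = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return; // condition met: stop immediately
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs}ms`);
}
```

Because the helper returns as soon as the condition holds, slow CI runners simply take a few more polling iterations instead of failing a hard-coded sleep.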

2. Self-Contained Fixtures and Test Isolation

The second most common source of flakiness is test interdependence. Test A creates a user. Test B assumes that user exists. When tests run in parallel or in a different order, Test B fails because the user does not exist yet. This pattern is pervasive in E2E suites because creating test data is expensive, so teams share it across tests to save time.

Self-contained fixtures solve this at the cost of test setup time. Each test creates its own data, runs its assertions, and cleans up after itself. In Playwright, fixtures are first-class: test.extend() lets you define custom fixtures that provision test data, create browser contexts with specific state, and tear everything down when the test completes. A login fixture can create a unique user via API, authenticate, and save the session state so the test starts logged in without going through the UI login flow every time.
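A sketch of such a fixture. The /api/test-users endpoint and the User shape are hypothetical stand-ins for your application's test-data API, and the snippet assumes a baseURL is set in the Playwright config:

```typescript
// fixtures.ts -- sketch of a self-contained user fixture.
// The /api/test-users endpoint and User shape are illustrative,
// not part of Playwright.
import { test as base, expect } from '@playwright/test';

type User = { id: string; email: string };

export const test = base.extend<{ user: User }>({
  user: async ({ request }, use) => {
    // Provision a unique user for this test via the API (fast,
    // no UI involved); relative URL resolves against config baseURL
    const res = await request.post('/api/test-users', {
      data: { email: `user-${Date.now()}@example.test` },
    });
    const user = (await res.json()) as User;
    await use(user); // run the test body with this user
    // Teardown: delete the user so no state leaks to other tests
    await request.delete(`/api/test-users/${user.id}`);
  },
});

export { expect };
```

Tests import `test` from this file instead of from @playwright/test, and each gets a fresh, isolated user regardless of execution order or parallelism.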

The performance concern is real but manageable. API-based test data creation is fast (milliseconds per call), and Playwright's storageState feature lets you authenticate once and reuse the session across tests in the same worker. The combination of API fixtures for data creation and storage state for authentication keeps each test independent without adding significant setup time.

Generate isolated, CI-ready tests

Assrt creates self-contained Playwright tests with proper fixtures. No shared state, no interdependencies.

Get Started

3. Flaky Test Quarantining That Actually Works

When a test is flaky, the immediate instinct is to fix it or delete it. Both options have downsides: fixing takes time away from feature work, and deleting removes coverage for a scenario that presumably matters. Quarantining offers a middle path: move the flaky test to a separate suite that runs but does not block the CI pipeline. The test still executes and reports results, but its failures do not prevent merges or deployments.

Effective quarantining requires three components. First, a mechanism to tag tests as quarantined (Playwright supports test annotations and tags like @flaky that can be filtered in CI). Second, a separate CI job that runs quarantined tests and reports results to a dashboard without blocking the pipeline. Third, and most importantly, a process for reviewing quarantined tests on a regular cadence. Without this review process, quarantine becomes a graveyard where flaky tests go to be forgotten.
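One way to wire this up with Playwright's built-in tags (the test itself is a placeholder; the tag syntax and CLI flags are real Playwright features):

```typescript
// Tag a known-flaky test so CI can filter it out of the blocking run.
import { test, expect } from '@playwright/test';

test('checkout applies discount code', { tag: '@flaky' }, async ({ page }) => {
  // ... test body unchanged; only the tag moves it to quarantine ...
});

// In CI, split the suite into two jobs:
//   Blocking job:    npx playwright test --grep-invert @flaky
//   Quarantine job:  npx playwright test --grep @flaky
// The quarantine job reports results but is marked non-blocking.
```

Because the tag lives next to the test, quarantining is a one-line diff that shows up in code review, which keeps the quarantine list visible rather than buried in CI configuration.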

Set a policy: any test quarantined for more than two sprints gets a decision: fix it or remove it. Track quarantine duration as a team metric. If the quarantine list is growing, it indicates a systemic problem with test infrastructure or test design practices, not just individual test issues.

4. Spec-First Workflow: Testable Code by Design

Many flaky tests are symptoms of untestable code. A component that relies on global state, performs side effects during rendering, or couples tightly to network timing is inherently hard to test reliably. Writing tests after the fact means working around these design decisions with hacks and timing workarounds. Writing the spec first inverts this: the test defines the expected behavior, and the implementation is built to satisfy it.

In a spec-first workflow, the team writes the E2E test scenario before implementing the feature. The test describes what the user should be able to do: "User fills in the shipping form, selects express shipping, sees the updated total, and completes checkout." The developer then implements the feature knowing that these specific interactions need to work reliably in an automated browser. This naturally leads to testable architecture: proper ARIA labels (because the test uses role-based selectors), deterministic state management (because the test asserts on specific values), and clean loading states (because the test waits for specific conditions).
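The checkout scenario above might be written, before any implementation exists, roughly like this (the routes, labels, and test IDs are assumptions the implementation would then need to satisfy):

```typescript
// Spec written before the feature: it pins down the behaviors the
// implementation must support, including accessible names for selectors.
import { test, expect } from '@playwright/test';

test('express shipping updates total and completes checkout', async ({ page }) => {
  await page.goto('/checkout');
  await page.getByLabel('Shipping address').fill('1 Main St');
  await page.getByRole('radio', { name: 'Express shipping' }).check();
  // Web-first assertion: polls until the total reflects the surcharge
  await expect(page.getByTestId('order-total')).toContainText('24.99');
  await page.getByRole('button', { name: 'Complete checkout' }).click();
  await expect(
    page.getByRole('heading', { name: 'Order confirmed' })
  ).toBeVisible();
});
```

Every selector here is a design constraint: getByLabel requires a labeled form field, getByRole requires correct ARIA semantics, and the polling assertion requires a deterministic loading state.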

AI tools can accelerate spec-first workflows. Assrt can discover test scenarios from an existing application, which provides a baseline of expected behaviors. When building new features, you can use the discovered patterns as templates for new test specs. Teams also use AI to convert user-story acceptance criteria into Playwright test stubs that developers flesh out as they implement the feature.

5. From CI to Production: Synthetic Monitoring

CI tests verify that your application works before deployment. Synthetic monitoring verifies that it continues working after deployment. The gap between these two is where many production incidents hide: the code is correct, the tests pass, but production behaves differently because of configuration changes, third-party service outages, or infrastructure drift.

Synthetic monitoring reuses your Playwright tests as production health checks. A subset of your E2E suite (typically the critical user paths: login, search, checkout, core feature usage) runs on a schedule against the production environment. When a test fails, it triggers an alert rather than blocking a pipeline. This catches issues that CI cannot: expired SSL certificates, CDN misconfigurations, third-party script loading failures, and database connection pool exhaustion under real load.

The key to effective synthetic monitoring is choosing the right tests. Not every CI test belongs in production monitoring. Select tests that verify complete user journeys end to end, that exercise real third-party integrations, and that complete within a reasonable timeout (under 30 seconds per test). Avoid tests that modify state in production (creating real orders, sending real emails) unless you have a dedicated test account and cleanup mechanism.
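The selection criteria above can be expressed as a simple filter over test metadata. A sketch (the TestMeta shape and its fields are hypothetical; real metadata would come from your CI history and test annotations):

```typescript
// Decide which CI tests qualify as production synthetic checks,
// per the criteria above. The TestMeta shape is illustrative.
interface TestMeta {
  name: string;
  avgDurationMs: number;    // observed duration from CI history
  mutatesProdState: boolean; // creates orders, sends emails, etc.
  coversFullJourney: boolean;
}

function selectSyntheticChecks(tests: TestMeta[]): TestMeta[] {
  const MAX_DURATION_MS = 30_000; // keep each check under 30 seconds
  return tests.filter(
    (t) =>
      t.coversFullJourney &&   // complete user journeys only
      !t.mutatesProdState &&   // no side effects in production
      t.avgDurationMs <= MAX_DURATION_MS
  );
}
```

Running this filter against CI metadata periodically keeps the synthetic suite in sync with the CI suite instead of letting the two drift apart.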

Tools like Checkly and Grafana Synthetic Monitoring run Playwright scripts on a schedule from multiple geographic locations. You can also build your own using a cron job and Playwright's CLI. The important thing is that the same test code that validates your CI pipeline also validates your production environment, closing the gap between "it passed in CI" and "it works for users."

Build a reliable test pipeline

Assrt generates CI-ready Playwright tests with proper isolation, auto-waiting, and self-healing selectors.

$ npm install @assrt/sdk