Infrastructure Guide

CI/CD QA Automation: Getting Your Test Infrastructure Right

Most teams get CI/CD testing wrong by trying to do too much too soon. Here is how to build a testing pipeline that actually works, starting small and scaling deliberately.

Teams with deterministic test infrastructure deploy three times (3x) more frequently than those fighting flaky CI pipelines, according to the DORA State of DevOps Report.

1. Start with Smoke Tests

The single most effective thing you can do for your CI/CD pipeline is to establish a reliable smoke test suite that runs on every commit. Not a comprehensive regression suite. Not a full E2E test battery. A small, focused set of tests that answer one question: “Is the application fundamentally broken?”

A good smoke suite typically covers five to fifteen critical paths: the application loads, a user can log in, the main navigation works, the primary feature (whatever your app does) functions at a basic level, and critical integrations (payment processing, data submission) are operational. These tests should complete in under three minutes. If they take longer, you have too many tests in your smoke suite.
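As one possible wiring, the smoke tier can run on every push and pull request via a CI workflow like the following sketch. The `test:smoke` script name and the Node setup steps are assumptions, not part of this guide; the `timeout-minutes` setting enforces the three-minute budget at the CI level.

```yaml
# Hypothetical GitHub Actions workflow: run the smoke tier on every commit and PR.
name: smoke
on: [push, pull_request]
jobs:
  smoke:
    runs-on: ubuntu-latest
    timeout-minutes: 3   # fail the job if the smoke tier exceeds its budget
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run test:smoke   # assumed npm script that runs only the smoke suite
```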

The reason to start here is psychological as much as technical. Engineers need to trust the CI pipeline before they will respect it. If the first thing they experience is a 30-minute test run that fails intermittently due to flaky tests or infrastructure issues, they will learn to ignore CI results. A fast, reliable smoke suite builds that trust. Once engineers rely on the smoke suite to catch obvious breakage, you have the political capital to expand testing coverage.

Tools like Assrt can help bootstrap your smoke suite by auto-discovering the critical user flows in your application and generating Playwright tests for them. You can also build this manually by listing the five things that would cause an immediate customer escalation if they broke, then writing one test for each.

2. Make Infrastructure Deterministic Before Scaling

Before you add more tests to your pipeline, make sure the infrastructure running those tests is completely deterministic. This means the same test, run twice against the same code, produces the same result every time. If your infrastructure is not deterministic, adding more tests only adds more noise.

The most common sources of non-determinism in CI test infrastructure are: shared test databases that accumulate state across runs, external service dependencies that respond inconsistently, time-dependent tests that fail around midnight or at month boundaries, resource contention when multiple test jobs run on the same machine, and stale Docker images or browser binaries that drift from development environments.

Address each of these systematically. Use disposable databases (spin up a fresh instance for each test run, or at minimum truncate all tables before each suite). Mock or stub external services at the network level using tools like WireMock or Playwright’s route interception. Fix time-dependent logic by injecting a clock that you control. Isolate test runners so they do not compete for CPU, memory, or network ports. Pin your Docker images and browser versions to specific digests.
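To make the clock-injection idea concrete, here is a minimal sketch in TypeScript. The `Clock` interface and the `isExpired` example are illustrative, not a specific library's API:

```typescript
// Minimal controllable clock: production code depends on the interface,
// and tests inject a fixed implementation so time-dependent logic is deterministic.
interface Clock {
  now(): Date;
}

// Default implementation reads the real system time.
const systemClock: Clock = { now: () => new Date() };

// Example of time-dependent logic: is a subscription past its expiry?
function isExpired(expiresAt: Date, clock: Clock = systemClock): boolean {
  return clock.now().getTime() > expiresAt.getTime();
}

// In tests, pin the clock instead of relying on the machine's wall time,
// so the test passes identically at midnight, at month end, and in any timezone.
const fixedClock: Clock = { now: () => new Date("2024-06-15T12:00:00Z") };
```

With the fixed clock injected, `isExpired(new Date("2024-01-01T00:00:00Z"), fixedClock)` evaluates the same way on every run, regardless of when CI executes it.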

Measure determinism explicitly. Run your entire test suite ten times in a row against the same commit. If any test fails intermittently, quarantine it until you identify and fix the root cause. A suite of 500 tests that are each 99% reliable still produces an average of five spurious failures per run, which is enough to erode trust in the entire pipeline.
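The ten-run check is easy to automate: record pass/fail per test across the repeated runs and flag anything that is neither always green nor always red. A minimal sketch, where the data shape is an assumption about how you store run results:

```typescript
// results[testName] = outcome of each repeated run against the same commit
// (true = passed, false = failed).
type RunResults = Record<string, boolean[]>;

// A test is flaky if it both passed and failed across identical runs;
// those are the candidates to quarantine until the root cause is fixed.
function findFlakyTests(results: RunResults): string[] {
  return Object.entries(results)
    .filter(([, runs]) => runs.includes(true) && runs.includes(false))
    .map(([name]) => name);
}
```

Feeding this the aggregated results of ten identical runs gives you the quarantine list directly, rather than relying on engineers to notice intermittent failures.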


3. Test Cadence Strategy: Smoke, Regression, Full

Not every test needs to run on every commit. A well-designed test cadence matches the cost and speed of each test tier to the appropriate trigger point in your development workflow.

The first tier is smoke tests, running on every commit and every pull request. These are your fast, high-confidence checks that catch catastrophic failures. Target under three minutes for the entire tier. If a smoke test fails, the commit is blocked immediately.

The second tier is targeted regression tests, running on pull requests based on changeset analysis. If a PR modifies the authentication module, run the auth regression tests. If it touches the checkout flow, run the commerce regression tests. This tier can take 10 to 15 minutes because it only runs the relevant subset. The goal is to catch feature-specific regressions without the cost of running everything.

The third tier is the full regression suite, running nightly or on merges to main. This is where you run everything: all E2E tests, cross-browser checks, performance benchmarks, visual regression comparisons, and accessibility audits. This tier can take 30 minutes to an hour because it runs less frequently. Failures here trigger investigation the next morning, not merge blocking.

Some teams add a fourth tier for weekly or pre-release runs that include exploratory AI testing, load testing, and chaos engineering experiments. These expensive, comprehensive checks catch the issues that structured tests miss but are too costly to run daily.
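The changeset analysis behind the second tier can start as a simple path-to-suite mapping. A sketch, where the directory prefixes and suite labels are hypothetical examples rather than a prescribed layout:

```typescript
// Map source directories to the regression suites that cover them.
const suiteByPath: Array<[string, string]> = [
  ["src/auth/", "regression:auth"],
  ["src/checkout/", "regression:commerce"],
  ["src/search/", "regression:search"],
];

// Given the files changed in a PR, return the regression suites to run.
// Changes outside any mapped area fall back to the full regression suite.
function selectSuites(changedFiles: string[]): string[] {
  const suites = new Set<string>();
  for (const file of changedFiles) {
    const match = suiteByPath.find(([prefix]) => file.startsWith(prefix));
    suites.add(match ? match[1] : "regression:full");
  }
  return [...suites];
}
```

A PR touching only `src/auth/` then triggers just `regression:auth`, while a change to an unmapped file conservatively triggers the full suite. Teams usually refine the mapping over time as it becomes clear which suites actually catch regressions for which directories.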

4. Managing Test Data and Environments

Test data management is where most CI/CD testing strategies break down in practice. You can have perfectly written tests and a beautifully configured pipeline, but if your test data is stale, inconsistent, or shared across test runs, your results will be unreliable.

The gold standard is test isolation: each test creates exactly the data it needs, uses it, and cleans it up afterward. For API and unit tests, this is usually achievable with database transactions that roll back after each test. For E2E tests, it requires more infrastructure. Common approaches include seeding a test database from a known snapshot before each run, using factory functions that create users and records with unique identifiers, and running each test against an isolated tenant in a multi-tenant application.
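A factory that bakes uniqueness into every record keeps parallel tests and parallel workers from colliding. One possible shape, with illustrative field names:

```typescript
import { randomUUID } from "node:crypto";

interface TestUser {
  id: string;
  email: string;
  name: string;
}

// Each call produces a user that cannot collide with one created by another
// test or another parallel worker, so no cross-test cleanup ordering is needed.
function makeTestUser(overrides: Partial<TestUser> = {}): TestUser {
  const id = randomUUID();
  return {
    id,
    email: `user-${id}@test.example`,
    name: `Test User ${id.slice(0, 8)}`,
    ...overrides,
  };
}
```

Tests that need a specific attribute pass it as an override, e.g. `makeTestUser({ name: "Alice" })`, while everything they do not care about stays unique by construction.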

Environment management follows similar principles. Each test run should get its own environment, or at minimum its own namespace within a shared environment. Containerized approaches (Docker Compose for the application stack, disposable databases, ephemeral preview deployments) make this feasible even for complex applications. Platforms like Vercel, Render, and Railway provide preview environments that automatically spin up for each pull request, giving your E2E tests a dedicated target.

For teams using Playwright, the globalSetup and globalTeardown configuration options provide natural hooks for environment preparation and cleanup. Use globalSetup to seed your database, start supporting services, and authenticate test users. Use globalTeardown to clean up resources and upload test artifacts.
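A globalSetup file might look like the following sketch. Playwright runs the default export once before the suite; the `seedDatabase` and `createAuthState` helpers are hypothetical stand-ins for your own project code, not Playwright APIs:

```typescript
// global-setup.ts: Playwright invokes the default export once before any tests run.

async function seedDatabase(): Promise<void> {
  // Hypothetical helper: restore a known snapshot, or truncate and re-seed tables.
}

async function createAuthState(): Promise<string> {
  // Hypothetical helper: log in once via the API and persist storage state
  // so individual tests can reuse it instead of logging in through the UI.
  return "playwright/.auth/user.json";
}

export default async function globalSetup(): Promise<void> {
  await seedDatabase();
  // Expose the saved auth state to tests via an environment variable.
  process.env.AUTH_STATE_PATH = await createAuthState();
}
```

Point your `playwright.config.ts` at this file via the `globalSetup` option, and mirror the cleanup logic in a `globalTeardown` file referenced the same way.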

5. Common Mistakes Teams Make

After working with dozens of engineering teams on their CI/CD testing setups, we see the same mistakes appear repeatedly. Being aware of them can save you months of frustration.

Running the full suite on every commit. This feels safe but backfires quickly. Long CI times train engineers to push code and context-switch instead of waiting for results, which defeats the purpose of continuous integration. Start with smoke tests and add targeted regression tests as your suite matures.

Tolerating flaky tests. Every flaky test is a trust tax on your entire pipeline. When engineers see spurious failures regularly, they stop taking real failures seriously. Quarantine flaky tests immediately. Fix them or delete them. Do not add retry logic as a workaround; retries mask the underlying problem and multiply your CI costs.

Ignoring test infrastructure as real engineering work. Test infrastructure deserves the same attention as production infrastructure. It needs monitoring, maintenance, capacity planning, and dedicated ownership. Teams that treat CI as an afterthought end up with brittle pipelines that nobody understands and everyone avoids.

Not measuring what matters. Track four metrics for your test pipeline: mean time to feedback (how long from commit to result), flakiness rate (what percentage of failures are not real bugs), defect escape rate (how many production bugs were not caught by tests), and infrastructure cost per test run. These metrics tell you whether your pipeline is healthy and where to invest improvement effort.
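These metrics are simple enough to compute from CI history. As a sketch of the flakiness-rate calculation, assuming each failed run gets labeled during triage (the record shape here is an assumption, not a standard format):

```typescript
// One record per failed CI run, labeled during triage.
interface FailureRecord {
  testName: string;
  wasRealBug: boolean; // false = spurious failure (flake, infra issue, etc.)
}

// Flakiness rate: the fraction of failures that were not real bugs.
// A rising rate is an early warning that trust in the pipeline is eroding.
function flakinessRate(failures: FailureRecord[]): number {
  if (failures.length === 0) return 0;
  const spurious = failures.filter((f) => !f.wasRealBug).length;
  return spurious / failures.length;
}
```

The other three metrics follow the same pattern: mean time to feedback is an average over (result timestamp − commit timestamp), defect escape rate divides production bugs by total bugs found, and cost per run comes straight from your CI provider's billing data.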

Building everything from scratch. You do not need to write your own test runner, reporter, or parallelization framework. Playwright, GitHub Actions, and tools like Assrt provide battle-tested infrastructure that handles the hard problems. Focus your engineering effort on the parts that are specific to your application: test data management, environment configuration, and the tests themselves.

Ready to automate your testing?

Assrt discovers test scenarios, writes Playwright tests from plain English, and self-heals when your UI changes.

$ npm install @assrt/sdk