Scaling Playwright Tests: Docker, Parallel Execution, and Sharding
A developer on r/QualityAssurance asked: "How to scale my tests? More than 150 running daily on Docker + Playwright." Their suite took 45 minutes. Management wanted it under 15. This is the scaling playbook that gets you there.
“Generates real Playwright code, not proprietary YAML. Open-source and free vs $7.5K/mo competitors. Self-hosted, no cloud dependency. Tests are yours to keep, zero vendor lock-in.”
Assrt project philosophy
1. Understanding Where Time Goes in Large Test Suites
Before optimizing, measure. Playwright provides detailed timing information in its JSON and HTML reporters. Run your suite with the JSON reporter enabled and analyze where time actually goes. In most suites with 150+ tests, the breakdown follows a predictable pattern.
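As a starting point, a small script can rank specs by duration once you have the JSON reporter output (`npx playwright test --reporter=json > report.json`). This is a sketch that works on a simplified, pre-flattened shape; real Playwright JSON reports nest suites recursively, so you would walk that tree first.

```typescript
// Sketch: rank specs by duration from (flattened) Playwright JSON reporter data.
// The SpecResult shape here is a simplification of the real report format.
interface SpecResult {
  title: string;
  durationMs: number; // summed result durations for this spec
}

function slowestSpecs(specs: SpecResult[], top: number): SpecResult[] {
  // Copy before sorting so the caller's array is untouched.
  return [...specs]
    .sort((a, b) => b.durationMs - a.durationMs)
    .slice(0, top);
}
```

Feeding your real report through something like this usually surfaces a handful of outliers worth fixing before any infrastructure work.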
Browser startup and context creation typically account for 10% to 15% of total time. Each test that creates a new browser context pays a 200ms to 500ms penalty. For 150 tests, that is 30 to 75 seconds of pure overhead. This cost is largely fixed per test regardless of what the test does.
Network waiting dominates most test suites, consuming 40% to 60% of total time. Every page navigation, every API call, every asset load adds latency. Tests that navigate to a new page for each step pay this cost repeatedly. A single test might spend 3 seconds on actual interactions and 12 seconds waiting for network responses.
Test setup and teardown is the third major category, typically 15% to 25%. Tests that create users through the UI, seed data through complex API chains, or wait for asynchronous processes to complete spend significant time on preconditions. This is the category with the most optimization potential because much of it can be parallelized or eliminated.
The remaining 10% to 20% is actual test execution: clicking buttons, filling forms, reading assertions. This is the part you cannot and should not optimize, because it is the part that provides value.
2. Playwright Sharding Across Docker Containers
Playwright supports native sharding with the --shard flag. The command "npx playwright test --shard=1/4" runs the first quarter of your tests, "--shard=2/4" runs the second quarter, and so on. Each shard can run in a separate Docker container, on a separate CI runner, or on a separate machine.
For a 150-test suite that takes 45 minutes on a single container, sharding across 4 containers typically reduces wall-clock time to 12 to 15 minutes. The scaling is not perfectly linear because shards are not perfectly balanced (some tests take longer than others), but 3x to 3.5x speedup from 4 shards is typical.
The Docker setup is straightforward. Use the official mcr.microsoft.com/playwright image, which includes all browser binaries and system dependencies. Your CI pipeline spawns N containers, each running one shard. After all shards complete, merge the results using Playwright's merge-reports command: "npx playwright merge-reports ./all-blob-reports --reporter html". This gives you a single HTML report covering the entire suite.
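For merge-reports to work, each shard must emit a blob report. A minimal config sketch (the environment-variable check is an assumption; adapt it to however your CI signals itself):

```typescript
// playwright.config.ts — sketch: blob reports in CI so shards can be merged,
// HTML locally for direct viewing.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  reporter: process.env.CI ? 'blob' : 'html',
});
```

Each shard then writes its blob into a directory you collect (e.g. `./all-blob-reports`) before running the merge command above.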
In GitHub Actions, the matrix strategy makes this simple. Define a matrix with shardIndex [1, 2, 3, 4] and shardTotal 4. Each matrix job runs its shard in parallel. Upload blob reports as artifacts from each shard, then download and merge them in a final reporting job. The Playwright documentation includes a complete GitHub Actions configuration for this pattern.
Scaling starts with good test code
Assrt generates Playwright tests designed for parallel execution from the start. Each test is isolated, self-contained, and ready for sharding.
Get Started →
3. Smoke Suite vs. Full Suite Strategy
Not every test needs to run on every commit. The most effective strategy is a tiered approach: a fast smoke suite that runs on every PR, and a comprehensive full suite that runs on merges to main or on a nightly schedule.
The smoke suite contains your 15 to 25 most critical tests, covering signup, login, the primary user action, and payment. It should run in under 3 minutes. Tag these tests with @smoke in Playwright using test.describe or grep annotations, then run them with "npx playwright test --grep @smoke". This gives developers fast feedback without waiting for the full suite.
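Title-based tags are the simplest way to make `--grep @smoke` work. A sketch, where the URL, labels, and selectors are placeholders for your own app:

```typescript
// login.spec.ts — sketch: "@smoke" in the title is what --grep matches.
import { test, expect } from '@playwright/test';

test('user can log in @smoke', async ({ page }) => {
  await page.goto('/login');
  await page.getByLabel('Email').fill('user@example.com');
  await page.getByLabel('Password').fill('secret');
  await page.getByRole('button', { name: 'Log in' }).click();
  await expect(page).toHaveURL(/dashboard/);
});

// Run only the smoke tier:    npx playwright test --grep @smoke
// Run everything except it:   npx playwright test --grep-invert @smoke
```

Tags can also live on a `test.describe` block title, which tags every test inside it at once.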
The full suite runs all 150+ tests and takes 12 to 15 minutes with sharding. Run this on merges to the main branch and nightly. Nightly runs catch flaky tests and environment-specific issues that PR-level smoke tests miss. If the nightly suite fails, create a ticket immediately. Do not let nightly failures accumulate; that path leads to a permanently red suite that everyone ignores.
Some teams add a third tier: a focused regression suite that runs tests related to the changed code. Playwright's project dependencies and file-based test organization make this possible. If a PR modifies files in /src/checkout/, run all tests in /tests/checkout/ plus the smoke suite. This provides targeted coverage without the full suite cost. Tools like Launchable and Buildkite Test Analytics can automate this test selection based on historical failure data.
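A homegrown version of this selection can be as simple as a path mapping. The sketch below assumes a `src/<area>/` to `tests/<area>/` layout, which is an assumption about your repository, and always includes the smoke tier as a safety net:

```typescript
// Sketch: map changed source paths to test directories for a targeted
// regression tier. Assumes src/<area>/ mirrors tests/<area>/.
function testDirsForChanges(changedFiles: string[]): string[] {
  const dirs = new Set<string>(['tests/smoke']); // always run the smoke tier
  for (const file of changedFiles) {
    const match = file.match(/^src\/([^/]+)\//);
    if (match) dirs.add(`tests/${match[1]}`);
  }
  return [...dirs].sort();
}
```

The resulting directory list is passed straight to the CLI, e.g. `npx playwright test tests/checkout tests/smoke`.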
4. Execution Time Optimization Techniques
Beyond sharding, several techniques reduce individual test execution time. The highest-impact optimization is API-based test setup. Instead of navigating through the UI to create test data, call your API directly using Playwright's request context. Creating a user via API takes 50ms; creating one through the signup form takes 5 to 10 seconds.
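In practice this uses the `request` fixture that Playwright provides alongside `page`. A sketch, where the endpoint and payload are assumptions about your backend:

```typescript
// Sketch: seed test data over the API instead of clicking through the UI.
// The /api/test/orders endpoint and its payload are hypothetical.
import { test, expect } from '@playwright/test';

test('order history shows a seeded order', async ({ page, request }) => {
  // One fast API call replaces a multi-second UI flow.
  const res = await request.post('/api/test/orders', {
    data: { sku: 'WIDGET-1', quantity: 2 },
  });
  expect(res.ok()).toBeTruthy();

  await page.goto('/account/orders');
  await expect(page.getByText('WIDGET-1')).toBeVisible();
});
```

The UI steps that remain are exactly the behavior under test; everything else moves behind the API.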
Authentication state reuse is another significant win. Playwright's storageState feature lets you authenticate once and reuse the session across multiple tests. Run a global setup script that logs in, saves the storage state to a JSON file, and configures your tests to load that state. This eliminates the login step from every test, saving 2 to 5 seconds per test. For 150 tests, that is 5 to 12 minutes saved.
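A global setup sketch for this pattern (the URL, selectors, and environment variables are placeholders for your own environment):

```typescript
// global-setup.ts — sketch: log in once, persist the session, reuse it everywhere.
import { chromium, type FullConfig } from '@playwright/test';

async function globalSetup(config: FullConfig) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://staging.example.com/login');
  await page.getByLabel('Email').fill(process.env.TEST_USER ?? '');
  await page.getByLabel('Password').fill(process.env.TEST_PASS ?? '');
  await page.getByRole('button', { name: 'Log in' }).click();
  await page.waitForURL('**/dashboard');
  await page.context().storageState({ path: 'auth.json' }); // save cookies + localStorage
  await browser.close();
}

export default globalSetup;

// playwright.config.ts then wires up both halves:
//   globalSetup: './global-setup',
//   use: { storageState: 'auth.json' },
```

Tests that specifically exercise the login flow itself should opt out with `test.use({ storageState: { cookies: [], origins: [] } })` so they start unauthenticated.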
Parallel workers within a single machine provide additional speedup. Playwright defaults to running with half your CPU cores as workers. On a 4-core CI machine, that is 2 workers. Increasing to 4 workers (one per core) or even 6 workers (oversubscription works when tests are I/O-bound) can halve your execution time. Monitor CPU and memory to find the sweet spot; too many workers cause thrashing.
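Worker count is a one-line config change; the value below is an example for a 4-core runner, not a universal recommendation:

```typescript
// playwright.config.ts — sketch: tune worker count per environment.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // One worker per core on a 4-core CI runner; Playwright's default is half
  // the cores. Watch CPU and memory before oversubscribing beyond this.
  workers: process.env.CI ? 4 : undefined,
});
```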
Network optimization matters more than most teams realize. Run your test environment geographically close to your CI runners. If your CI runs on AWS us-east-1 and your staging server is on AWS eu-west-1, every network request adds 80ms of latency. For a test that makes 50 network requests, that is 4 seconds of unnecessary waiting. Co-locating CI runners and staging environments in the same region can cut total suite time by 15% to 25%.
5. Infrastructure Patterns for 500+ Test Suites
When your suite grows past 300 to 500 tests, sharding across CI containers is not enough. You need infrastructure designed for large-scale test execution. Several patterns emerge at this scale.
Dedicated test runners solve the cold-start problem. Instead of spinning up Docker containers for each CI run (which adds 30 to 60 seconds of pull and startup time per container), keep warm runner pools that are always ready. GitHub Actions larger runners, self-hosted runners with pre-cached Docker images, and Kubernetes-based runner pools (like actions-runner-controller) all reduce this overhead.
Test result databases become essential at scale. Store every test run's results in a database (InfluxDB, PostgreSQL, or a purpose-built tool like Currents or Playwright Test Analytics). This lets you identify slow tests, flaky tests, and failure patterns over time. A test that passes 98% of the time but fails every Monday morning points to a time-dependent data issue. You cannot spot these patterns without historical data.
Dynamic sharding based on test duration produces better load balancing than Playwright's default file-based sharding. Tools like Currents.dev and Sorry Cypress distribute tests across shards based on historical execution time, ensuring each shard finishes at approximately the same time. This can improve sharding efficiency from 75% to 90%+, shaving minutes off your total execution time.
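The core idea behind these tools is classic longest-processing-time scheduling: sort tests by historical duration, then greedily assign each to the currently lightest shard. A minimal sketch (the `TimedSpec` shape and timings are illustrative, not any vendor's API):

```typescript
// Sketch of duration-based sharding: assign the longest specs first, each to
// the shard with the least accumulated time (LPT greedy scheduling).
interface TimedSpec {
  file: string;
  avgMs: number; // historical average duration
}

function balanceShards(specs: TimedSpec[], shardCount: number): TimedSpec[][] {
  const shards: TimedSpec[][] = Array.from({ length: shardCount }, () => []);
  const loads: number[] = new Array(shardCount).fill(0);
  // Longest first: big specs anchor shards, small ones fill the gaps.
  for (const spec of [...specs].sort((a, b) => b.avgMs - a.avgMs)) {
    const lightest = loads.indexOf(Math.min(...loads));
    shards[lightest].push(spec);
    loads[lightest] += spec.avgMs;
  }
  return shards;
}
```

Even this naive greedy pass typically lands shards within a few percent of each other, which is where the efficiency gain over file-count sharding comes from.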
At this scale, generating tests automatically becomes practical and necessary. An open-source tool like Assrt that crawls your application and generates Playwright test files can keep pace with a rapidly evolving product. Instead of manually writing tests for every new feature, generate a baseline suite and curate it. The combination of generated tests, intelligent sharding, and warm runner infrastructure keeps a 500-test suite running in under 10 minutes, which is fast enough to gate every deployment.
Generate tests that are built for scale
Assrt produces isolated, parallelizable Playwright tests from your running application. No shared state, no serial dependencies, ready for sharding from day one.