Playwright load testing, honestly
Every guide on this topic funnels you toward a cloud runner. Three of the four things people actually mean by this phrase are better served by a protocol-level tool you already know (k6, Artillery HTTP, Lighthouse). The one thing Playwright really is the right shape for can be run on a laptop with seven lines of bash and a single environment variable. This is the map nobody draws and the pattern nobody publishes.
The phrase hides four different problems
If you stare long enough at the pages that currently rank for this topic, a pattern shows up. They start with the word Playwright, pivot into a paragraph about virtual users, and land on a pricing tier. The trick is that the original phrase covers four different problems, and the answer depends on which one you actually have.
Three of these problems are not Playwright problems at all. Running a real browser for case 1 or case 2 is like opening a pizza restaurant when what you needed was a toaster. The fourth is the one case where a browser is load-bearing, and even there you only need N of them, not a cloud of them.
1. Protocol volume
Can my API return 50,000 req/min at p95 under 500 ms? This is pure HTTP-level load. A real browser is wasted CPU here because no one is looking at the page. Right tool: k6, Artillery HTTP engine, wrk, vegeta. Playwright is the wrong shape.
2. Frontend bundle load
Does my JS parse fast enough on a mid-tier Android? This is a single-user CPU story. Right tool: Lighthouse CI in throttled mode, WebPageTest. Orchestrated Playwright is the wrong shape here too, because you only need one throttled browser, not N of them.
3. UX under backend load
While k6 hammers the API at 50k req/min, does the checkout button still render within 3 seconds to a real user? This is where Playwright earns its weight. Right tool: k6 for volume, Playwright for observation. Run them side by side.
4. Concurrency race reproduction
Does racing 50 users through 'Buy' at the same 100 ms window expose a double-charge? You need N real browsers firing simultaneously, with the full JS event loop, real cookies, real CSRF tokens. This is the second case where Playwright is the right tool.
The core category mistake
A virtual user in k6 is about 5 MB of resident memory and a few percent of one CPU core. A virtual user in a real Chromium is about 200 to 400 MB of RAM plus a whole core during navigation. The ratio is roughly 60 to 1. That means any vendor selling you Playwright-based load testing at protocol-level volumes is selling you 60x the compute you actually needed, and charging you for the privilege. If your question is whether the backend holds up at 50,000 req/min, the correct tool already existed and it is not Playwright. If your question is whether real users can still click the checkout button while the backend is being hammered, that is a different question, and it is the one worth a real browser.
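The 60-to-1 claim is just division over the midpoint figures above; a one-line sanity check:

```shell
# Midpoint figures from the paragraph above: ~5 MB per k6 virtual user,
# ~300 MB (midpoint of 200 to 400) per real headless Chromium.
k6_vu_mb=5
chromium_mb=300
echo "$(( chromium_mb / k6_vu_mb ))x"   # prints "60x"
```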
The pattern nobody publishes: xargs -P plus ASSRT_ISOLATED=1
Here is the part the cloud pages leave out. You do not need a vendor to run 20 or 40 concurrent real-browser scenarios. The OS is already a perfectly good fleet manager. The only problem to solve is that Chromium refuses to share a user-data-dir between processes (it uses a SingletonLock symlink to enforce it), and the default assrt mode puts every browser into the same persistent profile at ~/.assrt/browser-profile. Without isolation, every worker after the first would fail with a lock-file error.
The fix is one environment variable. Setting ASSRT_ISOLATED=1 makes every worker spawn its Playwright MCP with the --isolated flag, which gives each Chromium an in-memory user-data-dir that lives for the lifetime of that single process. No shared state, no singleton contention, no cleanup. The lookup happens at cli.ts:114 and the flag propagates into browser.ts:307-309. From there, xargs handles everything else.
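A minimal sketch of the fan-out. The real assrt invocation is shown in a comment (exact flags per this article, so treat the CLI syntax as an assumption); a stand-in worker keeps the skeleton runnable anywhere:

```shell
#!/usr/bin/env bash
# Fan-out sketch: ASSRT_ISOLATED=1 gives every worker its own in-memory
# Chromium profile, so xargs -P can run N of them without lock contention.
export ASSRT_ISOLATED=1
WORKERS=20
mkdir -p /tmp/assrt
# Real worker command (flags assumed from the article):
#   npx @assrt-ai/assrt run --url "$TARGET" --plan-file scenarios/checkout.md \
#     --video --json > /tmp/assrt/worker-$N.json 2> /tmp/assrt/worker-$N.log
# Stand-in worker so this skeleton runs as-is:
seq 1 "$WORKERS" | xargs -P "$WORKERS" -I{} sh -c \
  'echo "{\"worker\":{},\"passed\":true}" > /tmp/assrt/worker-{}.json'
ls /tmp/assrt/worker-*.json | wc -l   # one JSON report per worker
```

`xargs -I{}` substitutes the worker number everywhere `{}` appears, so each worker gets its own report path; `-P` sets the parallelism.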
What each worker actually does, step by step
The entire fan-out depends on a handful of lines across two files in assrt-mcp. Walking the chain top to bottom clarifies why the pattern is safe and where the resource ceiling comes from.
Set ASSRT_ISOLATED=1 in the environment
cli.ts:114 checks whether ASSRT_ISOLATED is set to '1' or 'true' and flips args.isolated to true for the current invocation. The flag propagates all the way down through the TestAgent constructor into McpBrowserManager.launchLocal.
launchLocal skips the SingletonLock dance
browser.ts:307 branches on isolated and appends '--isolated' to the @playwright/mcp argv. The cleanup block at 319-342 (killOrphanChromeProcesses, unlink SingletonLock, SingletonSocket, SingletonCookie) is entirely skipped, because there is no shared profile directory to contend for.
@playwright/mcp spawns a Chromium with an in-memory user-data-dir
Playwright MCP's --isolated flag means the profile lives in tmpfs for the lifetime of the process and vanishes when it exits. No lock files, no persistent cookies, no cross-worker state. Every worker is a clean browser.
Wrap the invocation in xargs -P N
xargs -P N runs N copies of the command in parallel, feeding each one a unique input slot. The OS is already a perfectly good fleet manager for N independent processes. Each worker writes its own WebM and JSON, so aggregation is a jq one-liner over /tmp/assrt/worker-*.json.
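A sketch of that aggregation, assuming each worker report exposes a top-level `passed` boolean (the real assrt report shape may differ). The jq one-liner the article alludes to is shown as a comment, with a portable grep fallback underneath:

```shell
# Fake three worker reports so the aggregation runs as-is:
mkdir -p /tmp/assrt-demo
printf '{"worker":1,"passed":true}'  > /tmp/assrt-demo/worker-1.json
printf '{"worker":2,"passed":false}' > /tmp/assrt-demo/worker-2.json
printf '{"worker":3,"passed":true}'  > /tmp/assrt-demo/worker-3.json
# jq version (field name "passed" is an assumption):
#   jq -s '[.[] | select(.passed)] | length' /tmp/assrt-demo/worker-*.json
# Portable fallback: count the reports that contain a passing flag.
grep -l '"passed":true' /tmp/assrt-demo/worker-*.json | wc -l   # -> 2
```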
Video + JSON per worker, no proprietary YAML
Each worker produces the same artifacts a single-scenario run produces: a WebM video (1600x900, with visible cursor overlay), a JSON report with per-step timing and assertions, and a screenshot directory. Nothing is vendor-shaped. You own the files.
What the run looks like end to end
Here is the shape of a real invocation: a 20-worker run against a local dev server on a 16 GB laptop, with case 1 being a successful checkout under pressure and case 2 intentionally trying to expose a concurrent double-charge.
“The OS is already a perfectly good fleet manager for N independent browser processes. kill %1 kills worker 1. ps aux | grep chrome shows you exactly what is running.”
Why a cloud runner is overkill below ~150 concurrent workers
The scenario file every worker reads
One markdown file. Under 30 lines. Every worker reads it and the agent loop picks elements from a fresh accessibility-tree snapshot on every action, so there are no selectors to maintain and no retries to hand-code. Session isolation is automatic because each worker uses create_temp_email to mint its own disposable signup address.
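A hypothetical plan file of that shape. The `#Case` header and bullet-step format are inferred from this article, so treat the exact syntax as a sketch rather than verified format documentation:

```markdown
#Case checkout succeeds under backend load
- open the landing page
- use create_temp_email to sign up with a fresh disposable address
- add the first product to the cart
- complete checkout and confirm the confirmation page renders within 3 seconds

#Case concurrent buy does not double-charge
- open the product page while signed in
- click Buy twice in quick succession
- confirm exactly one order appears in order history
```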
Numbers that bound the design
Four numbers worth knowing before you commit to a pattern. Each is a ceiling that, once you pass it, demands a different answer.
- Resident RAM per headless Chromium worker, mid-range: roughly 300 MB (200 to 400 MB in practice)
- Practical concurrent workers on a 16 GB developer laptop: 20 to 40
- Practical concurrent workers on a 64 GB dedicated box: 100 to 150
- RAM ratio, real Chromium versus a k6 virtual user: roughly 60 to 1
The data flow, end to end
Inputs on the left, the xargs-driven fleet in the middle, artifacts on the right. No proprietary runner between the scenario file and the WebM. The only external endpoint any worker hits is whichever LLM you configured; DOM snapshots and screenshots stay local.
Inputs into the fan-out, artifacts out the other side
What the cloud runners charge you instead
A rough sample of current pricing around this category. None of these are hostile companies; most of them built real infrastructure that would be hard to replicate from scratch. The argument is not that they are bad; the argument is that for the sub-150-worker case they are optional.
A cloud runner versus the local xargs -P pattern
| Feature | Typical cloud runner tier | Assrt local fan-out |
|---|---|---|
| Cost for 40 concurrent real-browser scenarios, daily 5-minute run | $500 to $2,500 per month on a cloud runner, depending on vendor. Minimum seat fees plus per-minute vCPU charges. Some vendors charge for video retention separately. | $0 in incremental cost on a laptop you already own. If you spin up a dedicated 16 vCPU box just for this, something like $80/month on Hetzner bare metal, or $200 to $400 on a cloud VM, with no per-run overage. |
| What you hand over to the vendor | Every scenario, every DOM snapshot, every screenshot goes to their backend by design. Vendor-specific YAML or TypeScript configs that only run on their platform. Lock-in on the artifact format. | Nothing leaves the machine except LLM calls to your own configured endpoint (Anthropic or any compatible base URL). The scenarios are plain markdown under your repo. The reports are plain JSON. MIT license. |
| How the fleet is managed | Proprietary scheduler. You describe a VU ramp in a YAML or TypeScript file; the vendor decides when to spawn workers, on which region, with which rate limit. You cannot poke into the running state. | xargs -P N. The OS process table is the fleet view. kill %1 kills worker 1. ps aux | grep chrome tells you exactly what is running. tail -f /tmp/assrt/worker-7.log tells you what worker 7 is thinking, in real time. |
| What a failure artifact looks like | A dashboard link that expires in 30 or 90 days. A rendered timeline in their UI. Maybe a downloadable HAR or a partial log. Often no video at all, or video behind a paywall. | A local WebM (1600x900, visible cursor, click ripple, keystroke toast), a local JSON with per-step timing and agent reasoning, a local screenshot per step. All on disk, all yours, all gitignorable into an artifacts directory. |
| Where the ceiling actually is | Roughly linear in spend. 500 concurrent real browsers is technically available at price points between $2K and $12K per month, depending on the vendor and region. | 20 to 40 workers on a 16 GB laptop, 100 to 150 on a 64 GB dedicated box. Past that, per-worker resource contention dominates and timings become noise. If you genuinely need 500 concurrent real browsers, a cloud runner is the right choice. Most teams do not. |
If you genuinely need more than 150 concurrent real browsers or cross-region latency, a cloud runner is the right choice. Most teams do not.
How to run your first one this afternoon
Three steps. First, write one plan file at scenarios/checkout.md with a couple of #Case blocks covering the flow you want to put under pressure. Second, save the fanout script above and chmod it. Third, pair it with k6 or Artillery HTTP against the same target so you actually have backend load happening while the browsers watch. The combined readout is the thing worth keeping: protocol-level timings from k6 on the one side, real-user-observed timings from N parallel real Chromiums on the other.
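The orchestration of that third step is ordinary shell job control. A runnable skeleton with sleep stand-ins where the real `k6 run` and the browser fan-out would go (the commented commands are illustrative, not verified syntax for your k6 script):

```shell
#!/usr/bin/env bash
# Protocol load in the background, browser observers in the foreground.
( sleep 2 ) &     # stand-in for: k6 run --vus 200 --duration 5m load.js
K6_PID=$!
# Stand-in for the browser fleet:
#   ASSRT_ISOLATED=1 seq 1 20 | xargs -P 20 ...
seq 1 4 | xargs -P 4 -I{} sh -c 'sleep 1'
wait "$K6_PID"    # keep backend pressure up until the observers finish
echo "backend load and browser observers both finished"
```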
20 WebMs under /tmp/assrt/, 20 JSON reports, one jq-aggregate line. If a worker fails, its WebM is the debuggable artifact. If the aggregate shows a concurrency bug, case 2 of the scenario file is usually the one that caught it.
Want a second opinion on whether Playwright is actually the right tool for your case?
Thirty minutes with the founder. Bring the shape of the load you want to put on the app. I will either sketch the fan-out pattern for your flow, or tell you honestly when a protocol tool (k6, Artillery HTTP) or a cloud runner is the better spend.
Frequently asked questions
Is Playwright actually a load testing tool?
No, and most of the guides that say otherwise are cloud runner sales pages. Playwright is a protocol driver for real Chromium, Firefox and WebKit. It launches a full browser process per session and does JavaScript execution, CSS layout, compositing, the works. That is exactly the wrong shape for load testing in the classical sense, because the resource cost per virtual user is one to two orders of magnitude higher than a protocol-level tool like k6 or Artillery HTTP. What Playwright is genuinely good at is reproducing user experience under concurrent load, which is a real and useful problem, just not the problem the phrase load testing usually names.
So what does 'Playwright load testing' actually mean in practice?
The phrase conflates four distinct problems. (1) Protocol volume: can my API serve 50,000 requests per minute? Wrong tool, use k6 or Artillery HTTP. (2) Frontend load: does my React bundle parse fast on a mid-tier phone? Wrong tool, use Lighthouse CI or WebPageTest. (3) UX under backend load: while the backend is being hammered by a protocol tool, do real users still see the checkout button within 3 seconds? Playwright is the right tool here, but only for the browser half, running alongside k6 or similar. (4) Concurrency bug reproduction: does racing 50 users to click 'Buy' at the same moment expose a double-charge? Playwright is also right here, and this is the case where you actually want N real browsers firing at once.
What is the minimal local pattern that does case 3 or case 4?
Seven lines of bash. Set ASSRT_ISOLATED=1 in the environment so every Playwright MCP process spawns with the --isolated flag. That gives each Chromium an in-memory user-data-dir instead of sharing ~/.assrt/browser-profile. Then pipe N scenario IDs into xargs -P N running npx @assrt-ai/assrt run --url ... --plan-file scenario.md --video --json. Each parallel worker writes its own JSON report and its own WebM recording. 20 to 40 workers is the practical ceiling on a 16 GB developer laptop. Past that, rent a single dedicated box rather than a fleet of cloud runners.
Why specifically ASSRT_ISOLATED=1 and not just --isolated on each call?
Both work. The env var is more convenient for xargs because xargs repeats its command template N times and inherits the parent environment, so you set the flag once and every child uses it. cli.ts:114 resolves ASSRT_ISOLATED as 1, true, or unset, and the resulting boolean is passed all the way down to launchLocal in browser.ts:258-351, which appends --isolated to the Playwright MCP args at line 308. In --isolated mode, the whole singleton-lock cleanup path at lines 319-342 is skipped (kill orphan Chromes, delete SingletonLock, delete SingletonSocket, delete SingletonCookie), because there is no shared disk profile to contend for. That cleanup path is the reason default-mode assrt processes cannot run concurrently.
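The inheritance claim is easy to verify in isolation: children spawned by xargs see the exported variable without it appearing anywhere in the command template:

```shell
export ASSRT_ISOLATED=1
# Each xargs child inherits the parent environment, so the flag is set
# once and every worker picks it up:
seq 1 3 | xargs -P 3 -I{} sh -c 'echo "worker {} sees ASSRT_ISOLATED=$ASSRT_ISOLATED"'
```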
How many concurrent real browsers can one machine actually run?
Order of magnitude: each headless Chromium on a mostly-static page costs about 200 to 400 MB of resident RAM and roughly one CPU core for short bursts during navigation and rendering. On a 16 GB developer laptop with 8 performance cores, the practical ceiling is 20 to 40 workers before you start swapping or CPU-throttling. On a dedicated box with 64 GB of RAM, you can reach 100 to 150. Past that, per-worker resource contention dominates and your timing measurements become unreliable noise. This is also why cloud runners charge so much: Artillery's rule of thumb is one vCPU per virtual Playwright user, and that vCPU is the bulk of the bill.
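The laptop ceiling falls out of simple division. Assuming (my assumption, not a figure from the article) roughly 4 GB reserved for the OS and everything else, and the worst-case 400 MB worker footprint:

```shell
ram_gb=16; os_reserved_gb=4; per_worker_mb=400   # worst-case worker footprint
echo $(( (ram_gb - os_reserved_gb) * 1024 / per_worker_mb ))   # -> 30 workers
```

Thirty sits inside the 20-to-40 range quoted above; the lighter 200 MB footprint gives the top of the range, and CPU contention usually bites first anyway.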
What do I use Playwright load testing for that a protocol tool cannot do?
Two things that only a real browser can observe. First, frontend timing under backend stress. k6 can tell you that a POST /api/checkout returns in 1.8 seconds at p95 under load; only Playwright can tell you that the checkout button actually becomes clickable 4.1 seconds after the page starts to load, because the render-blocking analytics bundle does not care that the API is slow. Second, race conditions that only fire through the real event loop, like double-submits from two tabs, stale CSRF tokens from a pre-fetched form, or optimistic UI that rolls back. Both of these require the full DOM, the full JavaScript VM, and real layout. Neither is captured by hitting the HTTP endpoint directly.
Does Assrt do any of the load-test orchestration itself, or is it just a single-scenario runner?
Single-scenario by design. Each assrt invocation spawns one @playwright/mcp process that drives one browser through one plan file of #Case blocks, and exits. There is no fleet mode, no VU ramp-up, no Artillery-style scripting. The whole point of the isolated mode + xargs pattern is that you do not need any of that. The OS is already a perfectly good fleet manager for N independent browser processes. What you get per worker is a real Playwright run with video, a JSON report, and per-step screenshots on disk. What you do not get is consolidated reporting across workers; aggregate the N JSON files yourself with jq or a tiny script if you care.
What changes if I am testing a production site versus local dev?
Two things. First, rate limiting: most production APIs will start returning 429 around the 30 to 50 concurrent-user mark, and your load test becomes a test of your rate limiter instead of your checkout flow. Either coordinate with whoever owns the rate limits or point at a staging environment that has them disabled. Second, session isolation: if your auth uses cookies and every worker logs in as the same user, you are testing one user clicking 50 times, not 50 users. Either fan out across N test accounts (pass emails through the scenario file as {{EMAIL}} variables and interpolate at runtime), or use create_temp_email in the scenario itself to mint a new disposable email per worker via mail.tm.
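Fanning out across N test accounts is one template substitution on the shell side. The `{{EMAIL}}` interpolation is an assumption about the CLI, so the worker command below is a stand-in that only shows the per-slot email minting:

```shell
# One unique address per worker slot; plus-addressing keeps them all
# routable to a single real inbox.
seq 1 20 | xargs -P 20 -I{} sh -c \
  'EMAIL="loadtest+{}@example.com"; echo "worker {} -> $EMAIL"'
# A real run would pass $EMAIL into the scenario, e.g. via an env var the
# plan file reads as {{EMAIL}} (interpolation mechanism assumed).
```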
Where does the assrt scenario format fit in this?
Every worker runs the same markdown file. A plan is a sequence of #Case blocks with one-line headers and bullet steps underneath. There are no selector strings, no waits, no retries; the agent loop reads a fresh accessibility-tree snapshot before every action and picks elements by ref. That matters for load tests because a selector-based .spec.ts will flake under concurrency in ways that make the results unreadable (was that a real bug or just a timing race with the selector?), while a snapshot-based agent loop fails on actual state and passes on actual state. You get the classical Playwright primitives underneath via @playwright/mcp, and the scenario file stays under 30 lines.
When should I stop trying to do this locally and actually buy a cloud runner?
Three honest triggers. (1) You need more than 150 concurrent real browsers for a single test window, and cannot split the scenarios across multiple machines easily. (2) You need geographic distribution (users from eu-west hitting a us-east origin with real TCP round-trip latency), which a single laptop obviously cannot simulate. (3) You need to run this on every pull request as blocking CI, and the wall time on a single box is not acceptable. Below those thresholds, the local pattern is strictly better than a cloud runner, because you own the environment, the tests stay debuggable, and there is no vendor-specific YAML to maintain.
Is assrt itself open source, and does any data leave my machine?
The CLI and MCP server are MIT licensed on npm as @assrt-ai/assrt. When you run locally, the target URL, DOM snapshots, and screenshots go to whichever LLM endpoint you configured (Anthropic by default, or any Claude-compatible base URL). Video and JSON reports stay on disk under /tmp/assrt/. Nothing is uploaded anywhere unless you explicitly opt in to cloud sync. Compare against the $500 to $7500 per month cloud runners, which route every DOM through their backend by design and keep your scenarios on their servers.