Headless Chrome test parallelism flakiness is a symlink problem
Every post on this topic blames CPU, GPU, or the new headless mode. Those cause slow tests. They do not cause the specific failure most teams actually hit: a run that works fine locally, works fine in serial, and then throws "Opening in existing browser session" once every twenty CI jobs. That is a filesystem bug, not a performance bug. Three symlinks inside the Chromium profile decide whether your parallel run recovers from a crashed worker. This guide shows you those three files, why naive cleanup code misses two of them, and how Assrt's local MCP runner fixes it at launch time.
What the top search results miss
Search this keyword and you get three flavors of article. The first is "headless is slower than headed because of offscreen composition." True, mostly irrelevant to your flakes. The second is "scale down your worker count, your CPU cannot handle it." Sometimes true, does not explain why a 4-worker run flakes only after a prior 4-worker run crashed. The third is generic retry advice. Also fine. None of them name the actual filesystem failure mode, and none of them show code that evicts the three Chromium singleton files that cause it. The gap is the whole reason this page exists.
Chromium thinks a prior instance still owns the profile. It is wrong. The prior instance was SIGKILLed and left a dangling symlink behind.
SingletonLock, SingletonSocket, SingletonCookie. Naive cleanup scripts only handle the first and use existsSync, which lies on dangling symlinks.
The cleanup, verbatim, from Assrt's MCP runner
No abstraction. This is the code that runs every time Assrt boots Chromium for a scenario. If your parallel CI is flaking on the same failure class, you can port this block into your own launcher in ten minutes. Assrt is open source and self-hosted; the file is assrt-mcp/src/core/browser.ts.
“SingletonLock, SingletonSocket, SingletonCookie. All three have to go before every launch, or your parallel run inherits the crash state of the run before.”
browser.ts:326-342
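The embedded snippet does not survive extraction here, so what follows is a minimal sketch of the eviction that `browser.ts:326-342` is described as performing. The function name, return shape, and log text are my reconstruction, not the verbatim file:

```typescript
import { lstatSync, unlinkSync } from "fs";
import { join } from "path";

// The three coordination files Chromium leaves behind in a user-data-dir.
const SINGLETON_FILES = ["SingletonLock", "SingletonSocket", "SingletonCookie"];

// Evict crash leftovers before every launch. Returns what was removed, and
// whether the caller should fall back to an in-memory profile (--isolated)
// because an unlink failed (permissions, read-only filesystem, ...).
export function evictSingletons(userDataDir: string): {
  removed: string[];
  fallbackToIsolated: boolean;
} {
  const removed: string[] = [];
  let fallbackToIsolated = false;
  for (const name of SINGLETON_FILES) {
    const path = join(userDataDir, name);
    try {
      // lstatSync stats the link itself, so it still sees a dangling
      // SingletonLock; existsSync would follow it and report "absent".
      lstatSync(path);
    } catch {
      continue; // genuinely absent, nothing to clean
    }
    try {
      unlinkSync(path);
      removed.push(name);
      console.error(`removed stale ${name}`); // deliberate stderr tripwire
    } catch {
      fallbackToIsolated = true; // degrade to ephemeral, never fail the run
    }
  }
  return { removed, fallbackToIsolated };
}
```

Port the shape, not the names: the load-bearing parts are the three-file list, the `lstatSync` probe, and the degrade-instead-of-fail flag.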
Reproduce the failure in 30 seconds
Do not take my word for it. The failure is trivial to reproduce on any laptop with Chromium installed: launch against a throwaway --user-data-dir, SIGKILL the process mid-session, then launch again with the same profile.
The second launch stalls with "Opening in existing browser session" even though the killed PID is long gone. Removing the three singleton files fixes it. That is the entire bug.
Naive cleanup vs the Assrt cleanup
Most Stack Overflow answers on this problem get two things wrong. They use existsSync instead of lstatSync, which silently skips dangling symlinks. And they only touch SingletonLock, leaving Socket and Cookie to cause the exact same failure on the next launch.
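You can watch the two-character difference bite without Chromium at all. This standalone Node sketch fabricates the crash artifact (the hostname and PID in the link target are made up):

```typescript
import { existsSync, lstatSync, mkdtempSync, symlinkSync } from "fs";
import { tmpdir } from "os";
import { join } from "path";

// Fabricate the crash artifact: a SingletonLock symlink whose target
// ("hostname-PID") names a Chrome process that no longer exists.
const dir = mkdtempSync(join(tmpdir(), "profile-"));
const lock = join(dir, "SingletonLock");
symlinkSync("myhost-52341", lock); // dangling: nothing named "myhost-52341" exists

console.log(existsSync(lock));                 // false — follows the dead link
console.log(lstatSync(lock).isSymbolicLink()); // true  — stats the link itself
```

`existsSync` reports the lock as absent, so a naive cleanup skips it and the next launch still hangs.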
The two-character difference
```typescript
// What most CI scripts do.
// existsSync follows the symlink. SingletonLock's target is a dead PID.
// Result: existsSync returns false, the leftover link stays, next launch hangs.
import { existsSync, unlinkSync } from "fs";
const lock = `${userDataDir}/SingletonLock`;
if (existsSync(lock)) {
  unlinkSync(lock);
}
// Also: only cleans one of three files. Socket and Cookie still leak.
```

What happens when worker 2 starts after worker 1 crashed
Here is the exact sequence in two scenarios. First without the cleanup. Then with Assrt's cleanup in place.
Parallel worker restart after a SIGKILL
Now with the Assrt launcher in place:
Same scenario, with the three-file eviction
Six decisions the launcher makes in under a second
Each card maps to a specific failure mode and a specific line of real source. Nothing here is aspirational; it is all shipping in assrt-mcp today.
SingletonLock
Dangling symlink whose target is hostname-PID. Blocks every subsequent launch if the PID is dead. existsSync lies about it; lstatSync sees it correctly.
SingletonSocket
Abstract Unix socket the Chrome launcher uses to say 'focus the existing instance'. If the real instance is gone but the socket file survives, you get the classic 'Opening in existing browser session' exit with no stack trace.
SingletonCookie
Random per-profile cookie the launcher hands the existing instance to prove profile identity. Orphaned copies cause silent reuse bugs.
120s MCP tool timeout
Per-call budget raised from the SDK's 60s. Slow parallel navigations complete instead of timing out, so logs distinguish CPU contention from real bugs.
Orphan process kill
Before unlinking locks, Assrt finds and SIGKILLs Chrome processes still bound to the user-data-dir. Prevents the classic race where two Chromes fight over one profile.
--isolated fallback
If any unlink fails, the launcher pushes --isolated to Chromium and continues with an in-memory profile. The run degrades, it never fails.
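The orphan-kill card above is the one teams most often skip. A sketch of what that scan might look like — `killOrphanChromeProcesses` here is my reconstruction from the card's description, and the `ps` invocation assumes a POSIX-ish runner, not the shipping assrt-mcp source:

```typescript
import { execSync } from "child_process";

// Find Chrome/Chromium PIDs whose command line references this exact profile
// directory, and SIGKILL them BEFORE unlinking the singleton files. Unlinking
// first would let a live Chrome and a new Chrome share one profile.
export function killOrphanChromeProcesses(userDataDir: string): number[] {
  let out = "";
  try {
    out = execSync("ps axo pid=,command=", { encoding: "utf8" });
  } catch {
    return []; // no ps available: nothing we can do, proceed to cleanup
  }
  const killed: number[] = [];
  for (const line of out.split("\n")) {
    if (!line.includes(`--user-data-dir=${userDataDir}`)) continue;
    const pid = parseInt(line.trim().split(/\s+/)[0], 10);
    if (!Number.isFinite(pid) || pid === process.pid) continue;
    try {
      process.kill(pid, "SIGKILL");
      killed.push(pid);
    } catch {
      // process exited between the scan and the kill: fine
    }
  }
  return killed;
}
```

Matching on the full `--user-data-dir=` path is what prevents one worker from killing another worker's healthy Chrome.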
Inputs, launcher, outputs
The launcher is the junction. It takes whatever garbage the previous run left on disk, whatever orphaned Chrome PIDs are still hanging around, and whatever CLI args you passed, and produces one clean Chromium process the rest of your test run can trust.
How parallel-safe headless launches work in Assrt
What this looks like when you run it
The cleanup lines are visible in stderr. That is deliberate. If you see `removed stale SingletonLock` on every boot, you have a bigger problem (something is crashing Chrome every run). If you see it occasionally, the launcher just saved a parallel job from flaking. Either way, the signal is in your logs.
Assrt vs a typical headless CI setup
Most CI matrices treat the launcher as a black box. The problem is that the black box is where your flakes live. Here is what actually differs line by line.
| Feature | Typical CI + Playwright setup | Assrt MCP runner |
|---|---|---|
| Cleans SingletonLock symlink between parallel launches | Naive scripts use existsSync and skip dangling links | Unlinked by name on every launch (browser.ts:326-342) |
| Cleans SingletonSocket and SingletonCookie | Typical fix touches only SingletonLock | All three files are eviction targets |
| Kills orphan Chrome processes holding the profile first | Races with live PID, two Chromes fight the profile | killOrphanChromeProcesses runs before unlink |
| Degrades to --isolated if unlink fails | Permission error kills the worker | Pushes --isolated; run proceeds on an in-memory profile |
| Per-call MCP timeout | 60s SDK default; contended nav looks like a flake | 120s (TOOL_TIMEOUT_MS, browser.ts:381) |
| Test artifact format | Proprietary YAML; evaporates on vendor switch | Real Playwright MCP calls, Markdown #Case blocks |
| Price to start | Closed vendors up to $7,500/month | $0, open source, self-hosted (pay LLM tokens only) |
Port this to your own launcher, eight items
If you are not going to run Assrt, copy the list. Every item is drawn from the failure modes the three-file eviction prevents, plus the timeout and process-cleanup decisions that hold the rest of the launcher together.
Eight rules for parallel-safe headless Chromium
- Evict all three singleton files, not just SingletonLock. Any one of Lock, Socket, or Cookie can block a launch.
- Use lstatSync, not existsSync. SingletonLock is a dangling symlink; existsSync lies about it.
- Kill orphan Chromes before unlinking. If a Chrome still owns the profile, removing its lock lets two Chromes corrupt state together.
- Use per-worker user-data-dir, or pass --isolated. Sharing one profile across parallel workers guarantees this failure.
- Raise per-action timeouts past the SDK's 60s default. Contended CI runners legitimately need 90s+ on real navigations.
- Log the cleanup in stderr. 'removed stale SingletonLock' is a tripwire for deeper instability you would otherwise miss.
- Degrade, do not fail. If unlink throws, switch to an in-memory profile so the run proceeds.
- Keep the tests portable: real Playwright code, not a vendor's YAML dialect.
Bring your parallel-headless flakes
Thirty minutes. You share a CI log with 'Opening in existing browser session' in it. We point at the three singleton files in your Chromium profile, show the lstatSync fix, and hand you the cleanup block from browser.ts to drop into your own launcher.
Book a call →

FAQ on headless Chrome test parallelism flakiness
Why do headless Chrome tests get flakier under parallelism specifically?
Two reasons. The boring one is resource contention: 20 headless Chromes contending for CPU and GPU time will blow through assertion timeouts long before they blow through memory, and that looks like 'random' flakes in logs. The interesting one, the one almost nobody writes about, is the Chromium singleton lock. A user-data-dir can only be opened by one Chrome process at a time; the lock is a symlink named `SingletonLock` whose target is `hostname-PID`. When a worker process dies without cleaning up (SIGKILL, OOM, CI runner timeout), the symlink survives. The next parallel worker that tries to open that profile gets 'Opening in existing browser session' and hangs. In CI this looks like intermittent startup flakiness because the failure depends on whether the previous run finished cleanly. Assrt's MCP layer evicts the lock before every launch, which is why its parallel runs do not inherit the crash state of the run before.
What are SingletonLock, SingletonSocket, and SingletonCookie, and why three files?
They are three different coordination points inside one user-data-dir. `SingletonLock` is a symlink whose presence means 'some Chrome owns this profile'; it is the one most people know about. `SingletonSocket` is the abstract Unix socket Chrome's launcher uses to send a 'focus your window / open a new tab' request to a running instance. `SingletonCookie` is a short random cookie the launcher checks to confirm it is talking to the same profile. If any of the three survive a crash, a subsequent launch can either hang, silently reuse an invisible process, or print 'Opening in existing browser session' and exit without a useful stack trace. The code in `assrt-mcp/src/core/browser.ts` at lines 326-342 iterates all three by name, uses `lstatSync` rather than `existsSync` (because the broken symlink target makes `existsSync` lie), and unlinks each. If an unlink fails, it falls back to Chrome's `--isolated` in-memory profile so the run still proceeds.
Why lstatSync instead of existsSync?
`existsSync` follows symlinks. `SingletonLock` is a dangling symlink whose target is the PID of the dead Chrome process. When the target PID no longer resolves, `existsSync` returns false and you silently skip the cleanup, so the next launch still sees the leftover link and fails. `lstatSync` stats the symlink itself, not what it points to, so it correctly reports that the link exists and needs to be removed. This is a load-bearing two-character difference. If you copy-paste naive code from Stack Overflow answers on this problem, it almost always uses `existsSync` and it almost always fails on exactly the Chromes that crashed hardest.
How is this different from Playwright's built-in test isolation?
Playwright's test runner gives each test a fresh `BrowserContext` inside a shared browser process; it does not give each test a fresh user-data-dir unless you launch persistent mode with different paths per worker. If you are running in persistent mode and a worker crashes, the next run on that worker inherits the exact singleton-lock problem described above, and Playwright will not clean it for you. The fix is either: stop using persistent mode (isolated contexts are cleaner for most parallel suites), or actively evict the three singleton files between launches. Assrt's MCP runner does both, depending on how it is invoked. When `--isolated` is passed it runs with an in-memory profile (no disk state, no lock to leak); otherwise it uses `~/.assrt/browser-profile` and cleans the singletons before every spawn.
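If you do stay in persistent mode, the per-worker path is the part worth getting right. A small helper, assuming Playwright's documented `TEST_WORKER_INDEX` environment variable (the path scheme is mine):

```typescript
import { tmpdir } from "os";
import { join } from "path";

// Give each parallel worker its own user-data-dir so no two workers ever
// contend for the same SingletonLock. Playwright Test sets TEST_WORKER_INDEX
// per worker process; "0" is a fallback for standalone runs.
export function perWorkerProfileDir(base: string = tmpdir()): string {
  const workerId = process.env.TEST_WORKER_INDEX ?? "0";
  return join(base, `pw-profile-${workerId}`);
}
```

Feed the result to `chromium.launchPersistentContext(perWorkerProfileDir(), { headless: true })` and the cross-worker variant of the lock collision disappears; only the cross-run variant (a crashed previous job) remains, which is what the singleton eviction handles.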
Assrt is a local MCP, not a parallel test runner. How is any of this relevant?
Because the MCP server is the thing that boots Chromium for every scenario an agent runs against a target URL. If you drive Assrt from a harness that spawns multiple scenarios at once (e.g. one MCP client per CI job, fanned out across a matrix), you are implicitly doing parallel headless Chrome, and you hit the same singleton-lock failure mode as a traditional Playwright test matrix. The difference is that the code that launches Chrome is one TypeScript file you can read in ten minutes — `assrt-mcp/src/core/browser.ts` — so you can see exactly why your parallel runs stabilize. Closed cloud vendors hide this layer. You pay for it when CI flakes and there is nothing to grep.
What is the 120-second timeout about, and does it mask real flakiness?
It is the per-call MCP tool timeout defined at `browser.ts:381` as `TOOL_TIMEOUT_MS = 120_000`. The MCP SDK defaults to 60 seconds, which is not enough for slow navigations on real sites, especially under parallel load. Raising it to 120 seconds does not mask flakiness, it exposes it. A navigation that takes 90 seconds is information: either the site is actually that slow under load, or you have a retry loop disguised as a wait. Either way, the log line `[mcp] browser_navigate url=... (92000ms)` is worth more than a raw 'timed out after 60s' error, because it tells you the page eventually responded and by how much. If you then see the same 90-second navigation drop to 800ms when you run the scenario in isolation, you have diagnosed a CPU-contention flake rather than a real bug.
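A per-call budget of this shape is just a promise race. The sketch below is my own wrapper, not the assrt-mcp API; only the 120-second constant mirrors the value the article cites:

```typescript
// Per-call budget mirroring the TOOL_TIMEOUT_MS value cited from browser.ts:381.
const TOOL_TIMEOUT_MS = 120_000;

// Reject with a labeled error if the underlying call outlives its budget,
// so logs say WHICH tool call stalled and for how long.
export function withToolTimeout<T>(
  call: Promise<T>,
  ms: number = TOOL_TIMEOUT_MS,
  label: string = "tool",
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
    call.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

The labeled rejection is the point: "browser_navigate timed out after 120000ms" is diagnosable, a bare SDK timeout is not.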
Does running `--headless=new` actually help?
For rendering fidelity, yes. For parallelism and the singleton-lock problem specifically, it changes nothing. The lock is held by the Chromium process on its user-data-dir, and the user-data-dir is shared between old and new headless modes. If you switched to `--headless=new` and your parallel runs got more stable, the likely reason is that `--headless=new` exposes bugs in your own code — requestAnimationFrame callbacks, Intersection Observer, reduced-motion detection — that the legacy `--headless` mode silently worked around. Fixing those makes you faster, but it does not address the specific CI failure where 'Chrome is already running' kills 1 in 20 jobs after a prior crash. That one is a filesystem problem.
What does the fallback to --isolated actually do?
If the code at `browser.ts:336-338` catches an error unlinking a singleton file (permissions, read-only filesystem, weirdness on Windows subsystems, etc), it pushes `--isolated` onto the Playwright MCP args and sets `isolated = true`. That tells Chromium to use an in-memory user-data-dir that lives only for the process lifetime. Nothing persists, nothing to lock. The trade-off: logged-in state does not carry across scenarios, so you have to re-authenticate in every run. That is a real cost for integration testing, which is why the default is persistent mode with singleton cleanup. The fallback exists so that an unluckily permissioned filesystem never brings the whole run down; it degrades to ephemeral instead of failing.
How would I verify this problem exists in my own setup?
Before your next CI run, `ls -la ~/.cache/google-chrome` (or whatever user-data-dir Playwright is using on that runner) and look for `SingletonLock`, `SingletonSocket`, `SingletonCookie`. If the run before yours crashed, at least one of them will be there. Then launch Chrome against that profile and watch it either hang or print 'Opening in existing browser session'. To reproduce deliberately: launch chromium with `--user-data-dir=/tmp/demo`, send it SIGKILL mid-session, then launch again with the same `--user-data-dir`. The second launch will stall. Remove `/tmp/demo/SingletonLock` with `unlink` and try again; it succeeds. This is the loop Assrt automates.
I run my tests on GitHub Actions on ephemeral runners. Does any of this matter?
Less, but still yes. Ephemeral runners start clean, so the 'leftover from the previous run' path does not apply. What does still apply: within a single job, if you run scenarios in parallel (several `npx playwright test` shards in one container, or several worker processes against a shared persistent profile), the same singleton-lock symptoms emerge between workers rather than between runs. If your CI uses a shared cache to persist browser state between runs for speed — some teams do this to avoid re-logging-in — you inherit the cross-run failure mode as well. The cleanest fix on GHA is to keep the profile in-memory per job via `--isolated` or equivalent; the second-cleanest is to evict the three singletons on job start.