Root cause · Playwright · Chromium · CI flakiness

Headless Chrome test parallelism flakiness is a symlink problem

Every post on this topic blames CPU, GPU, or the new headless mode. Those cause slow tests. They do not cause the specific failure most teams actually hit: a run that works fine locally, works fine in serial, and then throws "Opening in existing browser session" once every twenty CI jobs. That is a filesystem bug, not a performance bug. Three symlinks inside the Chromium profile decide whether your parallel run recovers from a crashed worker. This guide shows you those three files, why naive cleanup code misses two of them, and how Assrt's local MCP runner fixes it at launch time.

Assrt Engineering
10 min read
4.9 from Assrt MCP users
  • Cleans SingletonLock, Socket, Cookie per launch (browser.ts:326-342)
  • Kills orphan Chromes before unlink (browser.ts:316-319)
  • 120s MCP tool timeout vs the SDK's 60s default (browser.ts:381)
  • Degrades to --isolated if unlink fails, never aborts the run
Failure signatures this page covers:
  • Opening in existing browser session
  • Failed to launch Chrome
  • Target page, context or browser has been closed
  • browserType.launchPersistentContext: Timeout 30000ms exceeded
  • Chrome is already running
  • EADDRINUSE :9222
  • SingletonLock already exists
  • Error: Chromium browser binary not found in cache

What the top search results miss

Search this keyword and you get three flavors of article. The first is "headless is slower than headed because of offscreen composition." True, mostly irrelevant to your flakes. The second is "scale down your worker count, your CPU cannot handle it." Sometimes true, does not explain why a 4-worker run flakes only after a prior 4-worker run crashed. The third is generic retry advice. Also fine. None of them name the actual filesystem failure mode, and none of them show code that evicts the three Chromium singleton files that cause it. The gap is the whole reason this page exists.

Symptom
"Opening in existing browser session"

Chromium thinks a prior instance still owns the profile. It is wrong. The prior instance was SIGKILLed and left a dangling symlink behind.

Root cause
Three singleton files survived the crash

SingletonLock, SingletonSocket, SingletonCookie. Naive cleanup scripts only handle the first and use existsSync, which lies on dangling symlinks.

The cleanup, verbatim, from Assrt's MCP runner

No abstraction. This is the code that runs every time Assrt boots Chromium for a scenario. If your parallel CI is flaking on the same failure class, you can port this block into your own launcher in ten minutes. Assrt is open source and self-hosted; the file is assrt-mcp/src/core/browser.ts.

assrt-mcp/src/core/browser.ts:326-342

SingletonLock, SingletonSocket, SingletonCookie. All three have to go before every launch, or your parallel run inherits the crash state of the run before.

Reproduce the failure in 30 seconds

Do not take my word for it. The failure is trivial to reproduce on any laptop with Chromium installed. Launch, SIGKILL, launch again.

repro — parallel flake from a dead worker

The second launch stalls with "Opening in existing browser session" even though PID 52341 is gone. Removing the three singleton files fixes it. That is the entire bug.
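The launch/SIGKILL loop needs a Chromium binary, but the on-disk state a dead worker leaves behind can be simulated without one. A sketch, with an illustrative dead PID:

```typescript
import { existsSync, lstatSync, mkdtempSync, symlinkSync } from "fs";
import { hostname, tmpdir } from "os";
import { join } from "path";

// Fake the state a SIGKILLed Chrome leaves behind: a SingletonLock symlink
// whose target names a host-PID pair that no longer resolves to anything.
const profile = mkdtempSync(join(tmpdir(), "demo-profile-"));
const lock = join(profile, "SingletonLock");
symlinkSync(`${hostname()}-52341`, lock); // 52341 is long dead; target does not exist

// existsSync follows the link to its nonexistent target and reports false.
console.log("existsSync:", existsSync(lock)); // false, the lie
// lstatSync stats the link itself: it is there, and it is a symlink.
console.log("isSymlink:", lstatSync(lock).isSymbolicLink()); // true, the truth
```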

Naive cleanup vs the Assrt cleanup

Most Stack Overflow answers on this problem get two things wrong. They use existsSync instead of lstatSync, which silently skips dangling symlinks. And they only touch SingletonLock, leaving Socket and Cookie to cause the exact same failure on the next launch.

The two-character difference

```typescript
// What most CI scripts do.
// existsSync follows the symlink. SingletonLock's target is a dead PID.
// Result: existsSync returns false, the leftover link stays, next launch hangs.
import { existsSync, unlinkSync } from "fs";

const lock = `${userDataDir}/SingletonLock`;
if (existsSync(lock)) {
  unlinkSync(lock);
}
// Also: only cleans one of three files. Socket and Cookie still leak.
```
The fix is two characters, lstatSync in place of existsSync, plus extending the cleanup to all three files.

What happens when worker 2 starts after worker 1 crashed

Here is the exact sequence in two scenarios. First without the cleanup. Then with Assrt's cleanup in place.

Parallel worker restart after a SIGKILL

CI runner → Worker 2: spawn
Worker 2 → user-data-dir: check SingletonLock (existsSync) → false (follows dead symlink)
Worker 2 → Chromium: launch --user-data-dir=...
Chromium → user-data-dir: open SingletonLock → already exists
Chromium: "Opening in existing browser session" → flake, job fails

Now with the Assrt launcher in place:

Same scenario, with the three-file eviction

CI runner → Worker 2: spawn
Worker 2: killOrphanChromeProcesses
Worker 2 → user-data-dir: lstatSync(SingletonLock) → symlink exists
Worker 2 → user-data-dir: unlink Lock / Socket / Cookie → removed
Worker 2 → Chromium: launch --user-data-dir=...
Chromium: DevTools listening on ws://... → scenario passes

Six decisions the launcher makes in under a second

Each card maps to a specific failure mode and a specific line of real source. Nothing here is aspirational; it is all shipping in assrt-mcp today.

SingletonLock

Dangling symlink whose target is hostname-PID. Blocks every subsequent launch if the PID is dead. existsSync lies about it; lstatSync sees it correctly.

SingletonSocket

Abstract Unix socket the Chrome launcher uses to say 'focus the existing instance'. If the real instance is gone but the socket file survives, you get the classic 'Opening in existing browser session' exit with no stack trace.

SingletonCookie

Random per-profile cookie the launcher hands the existing instance to prove profile identity. Orphaned copies cause silent reuse bugs.

120s MCP tool timeout

Per-call budget raised from the SDK's 60s. Slow parallel navigations complete instead of timing out, so logs distinguish CPU contention from real bugs.

Orphan process kill

Before unlinking locks, Assrt finds and SIGKILLs Chrome processes still bound to the user-data-dir. Prevents the classic race where two Chromes fight over one profile.

--isolated fallback

If any unlink fails, the launcher pushes --isolated to Chromium and continues with an in-memory profile. The run degrades, it never fails.
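The orphan-kill card can be sketched as a Linux-only scan of /proc. The helper in browser.ts is named killOrphanChromeProcesses; this is my approximation of the idea, not the shipped code:

```typescript
import { readdirSync, readFileSync } from "fs";

// Linux-only sketch: find PIDs whose command line references the profile dir.
// A launcher would SIGKILL each of these before unlinking the singleton files,
// so a live Chrome never loses its lock out from under it.
function findOrphanChromePids(userDataDir: string): number[] {
  const needle = `--user-data-dir=${userDataDir}`;
  let entries: string[];
  try {
    entries = readdirSync("/proc"); // absent on non-Linux: report nothing
  } catch {
    return [];
  }
  const pids: number[] = [];
  for (const entry of entries) {
    if (!/^\d+$/.test(entry) || Number(entry) === process.pid) continue;
    try {
      // /proc/<pid>/cmdline is NUL-separated; a substring check suffices here.
      if (readFileSync(`/proc/${entry}/cmdline`, "utf8").includes(needle)) {
        pids.push(Number(entry));
      }
    } catch {
      // process exited between readdir and read; ignore
    }
  }
  return pids;
}

// The launcher would then do:
//   for (const pid of findOrphanChromePids(dir)) process.kill(pid, "SIGKILL");
```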

Inputs, launcher, outputs

The launcher is the junction. It takes whatever garbage the previous run left on disk, whatever orphaned Chrome PIDs are still hanging around, and whatever CLI args you passed, and produces one clean Chromium process the rest of your test run can trust.

How parallel-safe headless launches work in Assrt

Inputs:
  • Leftover SingletonLock
  • Leftover SingletonSocket
  • Leftover SingletonCookie
  • Orphan Chrome PIDs
  • CLI args

All of it flows through McpBrowserManager, which emits:
  • Clean profile dir
  • No live orphan PIDs
  • Healthy Chromium process
  • --isolated fallback if needed
3 · Singleton files evicted per launch
120s · Per-call MCP timeout (browser.ts:381)
900px · Fixed headless viewport height (browser.ts:296)
$0 · Signup cost; self-hosted, open source

What this looks like when you run it

The cleanup lines are visible in stderr. That is deliberate. If you see removed stale SingletonLock on every boot you have a bigger problem (something is crashing Chrome every run). If you see it occasionally, the launcher just saved a parallel job from flaking. Either way, the signal is in your logs.

npx assrt-mcp, then assrt_test

Assrt vs a typical headless CI setup

Most CI matrices treat the launcher as a black box. The problem is that the black box is where your flakes live. Here is what actually differs line by line.

| Feature | Typical CI + Playwright setup | Assrt MCP runner |
| --- | --- | --- |
| Cleans SingletonLock symlink between parallel launches | Naive scripts use existsSync and skip dangling links | Unlinked by name on every launch (browser.ts:326-342) |
| Cleans SingletonSocket and SingletonCookie | Typical fix touches only SingletonLock | All three files are eviction targets |
| Kills orphan Chrome processes holding the profile first | Races with a live PID; two Chromes fight over the profile | killOrphanChromeProcesses runs before unlink |
| Degrades to --isolated if unlink fails | Permission error kills the worker | Pushes --isolated; run proceeds on an in-memory profile |
| Per-call MCP timeout | 60s SDK default; contended nav looks like a flake | 120s (TOOL_TIMEOUT_MS, browser.ts:381) |
| Test artifact format | Proprietary YAML; evaporates on vendor switch | Real Playwright MCP calls, Markdown #Case blocks |
| Price to start | Closed vendors up to $7,500/month | $0, open source, self-hosted (pay LLM tokens only) |

Port this to your own launcher, eight items

If you are not going to run Assrt, copy the list. Every item is drawn from the failure modes the three-file eviction prevents, plus the timeout and process-cleanup decisions that hold the rest of the launcher together.

Eight rules for parallel-safe headless Chromium

  • Evict all three singleton files, not just SingletonLock. Any one of Lock, Socket, or Cookie can block a launch.
  • Use lstatSync, not existsSync. SingletonLock is a dangling symlink; existsSync lies about it.
  • Kill orphan Chromes before unlinking. If a Chrome still owns the profile, removing its lock lets two Chromes corrupt state together.
  • Use per-worker user-data-dir, or pass --isolated. Sharing one profile across parallel workers guarantees this failure.
  • Raise per-action timeouts past the SDK's 60s default. Contended CI runners legitimately need 90s+ on real navigations.
  • Log the cleanup in stderr. 'removed stale SingletonLock' is a tripwire for deeper instability you would otherwise miss.
  • Degrade, do not fail. If unlink throws, switch to an in-memory profile so the run proceeds.
  • Keep the tests portable: real Playwright code, not a vendor's YAML dialect.
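Rule four, per-worker user-data-dir, fits in a few lines. TEST_WORKER_INDEX is the environment variable Playwright sets in each worker process; the base path here is illustrative:

```typescript
import { mkdirSync } from "fs";
import { tmpdir } from "os";
import { join } from "path";

// One profile directory per parallel worker: two workers can never contend
// for the same SingletonLock because they never share a user-data-dir.
function workerProfileDir(workerIndex: string | undefined): string {
  const dir = join(tmpdir(), `pw-profile-${workerIndex ?? "0"}`);
  mkdirSync(dir, { recursive: true });
  return dir;
}

// Inside a Playwright worker fixture you would call:
//   workerProfileDir(process.env.TEST_WORKER_INDEX)
```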

Bring your parallel-headless flakes

Thirty minutes. You share a CI log with 'Opening in existing browser session' in it. We point at the three singleton files in your Chromium profile, show the lstatSync fix, and hand you the cleanup block from browser.ts to drop into your own launcher.

Book a call

FAQ on headless chrome test parallelism flakiness

Why do headless Chrome tests get flakier under parallelism specifically?

Two reasons. The boring one is resource contention: 20 headless Chromes contending for CPU and GPU time will blow through assertion timeouts long before they blow through memory, and that looks like 'random' flakes in logs. The interesting one, the one almost nobody writes about, is the Chromium singleton lock. A user-data-dir can only be opened by one Chrome process at a time; the lock is a symlink named `SingletonLock` whose target is `hostname-PID`. When a worker process dies without cleaning up (SIGKILL, OOM, CI runner timeout), the symlink survives. The next parallel worker that tries to open that profile gets 'Opening in existing browser session' and hangs. In CI this looks like intermittent startup flakiness because the failure depends on whether the previous run finished cleanly. Assrt's MCP layer evicts the lock before every launch, which is why its parallel runs do not inherit the crash state of the run before.

What are SingletonLock, SingletonSocket, and SingletonCookie, and why three files?

They are three different coordination points inside one user-data-dir. `SingletonLock` is a symlink whose presence means 'some Chrome owns this profile'; it is the one most people know about. `SingletonSocket` is the abstract Unix socket Chrome's launcher uses to send a 'focus your window / open a new tab' request to a running instance. `SingletonCookie` is a short random cookie the launcher checks to confirm it is talking to the same profile. If any of the three survive a crash, a subsequent launch can either hang, silently reuse an invisible process, or print 'Opening in existing browser session' and exit without a useful stack trace. The code in `assrt-mcp/src/core/browser.ts` at lines 326-342 iterates all three by name, uses `lstatSync` rather than `existsSync` (because the broken symlink target makes `existsSync` lie), and unlinks each. If an unlink fails, it falls back to Chrome's `--isolated` in-memory profile so the run still proceeds.

Why lstatSync instead of existsSync?

`existsSync` follows symlinks. `SingletonLock` is a dangling symlink whose target is the PID of the dead Chrome process. When the target PID no longer resolves, `existsSync` returns false and you silently skip the cleanup, so the next launch still sees the leftover link and fails. `lstatSync` stats the symlink itself, not what it points to, so it correctly reports that the link exists and needs to be removed. This is a load-bearing two-character difference. If you copy-paste naive code from Stack Overflow answers on this problem, it almost always uses `existsSync` and it almost always fails on exactly the Chromes that crashed hardest.

How is this different from Playwright's built-in test isolation?

Playwright's test runner gives each test a fresh `BrowserContext` inside a shared browser process; it does not give each test a fresh user-data-dir unless you launch persistent mode with different paths per worker. If you are running in persistent mode and a worker crashes, the next run on that worker inherits the exact singleton-lock problem described above, and Playwright will not clean it for you. The fix is either: stop using persistent mode (isolated contexts are cleaner for most parallel suites), or actively evict the three singleton files between launches. Assrt's MCP runner does both, depending on how it is invoked. When `--isolated` is passed it runs with an in-memory profile (no disk state, no lock to leak); otherwise it uses `~/.assrt/browser-profile` and cleans the singletons before every spawn.

Assrt is a local MCP, not a parallel test runner. How is any of this relevant?

Because the MCP server is the thing that boots Chromium for every scenario an agent runs against a target URL. If you drive Assrt from a harness that spawns multiple scenarios at once (e.g. one MCP client per CI job, fanned out across a matrix), you are implicitly doing parallel headless Chrome, and you hit the same singleton-lock failure mode as a traditional Playwright test matrix. The difference is that the code that launches Chrome is one TypeScript file you can read in ten minutes — `assrt-mcp/src/core/browser.ts` — so you can see exactly why your parallel runs stabilize. Closed cloud vendors hide this layer. You pay for it when CI flakes and there is nothing to grep.

What is the 120-second timeout about, and does it mask real flakiness?

It is the per-call MCP tool timeout defined at `browser.ts:381` as `TOOL_TIMEOUT_MS = 120_000`. The MCP SDK defaults to 60 seconds, which is not enough for slow navigations on real sites, especially under parallel load. Raising it to 120 seconds does not mask flakiness, it exposes it. A navigation that takes 90 seconds is information: either the site is actually that slow under load, or you have a retry loop disguised as a wait. Either way, the log line `[mcp] browser_navigate url=... (92000ms)` is worth more than a raw 'timed out after 60s' error, because it tells you the page eventually responded and by how much. If you then see the same 90-second navigation drop to 800ms when you run the scenario in isolation, you have diagnosed a CPU-contention flake rather than a real bug.
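The mechanics of a per-call budget are just a race against a timer. A generic sketch, not the SDK's actual implementation; only the 120s value matches what the article cites:

```typescript
const TOOL_TIMEOUT_MS = 120_000; // raised from the SDK's 60s default

// Race a tool call against a deadline. On timeout, fail with an error that
// names the call, so a slow-but-successful navigation stays distinguishable
// from one that never responded at all.
async function withToolTimeout<T>(
  name: string,
  call: Promise<T>,
  ms: number = TOOL_TIMEOUT_MS
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${name} timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([call, deadline]);
  } finally {
    if (timer !== undefined) clearTimeout(timer); // never leak the timer
  }
}
```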

Does running `--headless=new` actually help?

For rendering fidelity, yes. For parallelism and the singleton-lock problem specifically, it changes nothing. The lock is held by the Chromium process on its user-data-dir, and the user-data-dir is shared between old and new headless modes. If you switched to `--headless=new` and your parallel runs got more stable, the likely reason is that `--headless=new` exposes bugs in your own code — requestAnimationFrame callbacks, Intersection Observer, reduced-motion detection — that the legacy `--headless` mode silently worked around. Fixing those makes you faster, but it does not address the specific CI failure where 'Chrome is already running' kills 1 in 20 jobs after a prior crash. That one is a filesystem problem.

What does the fallback to --isolated actually do?

If the code at `browser.ts:336-338` catches an error unlinking a singleton file (permissions, read-only filesystem, weirdness on Windows subsystems, etc), it pushes `--isolated` onto the Playwright MCP args and sets `isolated = true`. That tells Chromium to use an in-memory user-data-dir that lives only for the process lifetime. Nothing persists, nothing to lock. The trade-off: logged-in state does not carry across scenarios, so you have to re-authenticate in every run. That is a real cost for integration testing, which is why the default is persistent mode with singleton cleanup. The fallback exists so that an unluckily permissioned filesystem never brings the whole run down; it degrades to ephemeral instead of failing.

How would I verify this problem exists in my own setup?

Before your next CI run, `ls -la ~/.cache/google-chrome` (or whatever user-data-dir Playwright is using on that runner) and look for `SingletonLock`, `SingletonSocket`, `SingletonCookie`. If the run before yours crashed, at least one of them will be there. Then launch Chrome against that profile and watch it either hang or print 'Opening in existing browser session'. To reproduce deliberately: launch chromium with `--user-data-dir=/tmp/demo`, send it SIGKILL mid-session, then launch again with the same `--user-data-dir`. The second launch will stall. Remove `/tmp/demo/SingletonLock` with `unlink` and try again; it succeeds. This is the loop Assrt automates.

I run my tests on GitHub Actions on ephemeral runners. Does any of this matter?

Less, but still yes. Ephemeral runners start clean, so the 'leftover from the previous run' path does not apply. What does still apply: within a single job, if you run scenarios in parallel (several `npx playwright test` shards in one container, or several worker processes against a shared persistent profile), the same singleton-lock symptoms emerge between workers rather than between runs. If your CI uses a shared cache to persist browser state between runs for speed — some teams do this to avoid re-logging-in — you inherit the cross-run failure mode as well. The cleanest fix on GHA is to keep the profile in-memory per job via `--isolated` or equivalent; the second-cleanest is to evict the three singletons on job start.
