The Next.js mismatch: most Playwright test generators record what they see, then break the next time RSC streams in a different order
Direct answer: for a Next.js App Router app, you want a generator that emits test intent (text the runner re-resolves per step) rather than a .spec.ts file with hardcoded locators and waitForLoadState("networkidle") calls. Those locators were unique at record time, not at run time, and networkidle is the wrong signal once your page streams React Server Component chunks. The fix is a stability wait that watches the DOM (a MutationObserver) instead of the network. Assrt is the open option that does both; the implementation is at agent.ts:956-1009.
Every line number and file path on this page is real. The agent source is open at github.com/assrt-ai/assrt-mcp. If a claim looks specific, it is because the line is checkable.
Four Next.js patterns that flake recorded tests
Codegen on a static site (or a fully client-rendered SPA) is well-behaved. Codegen on Next.js App Router is unreliable not because Playwright is broken, but because the rendering model violates two assumptions baked into the generator's output: that the DOM you recorded against is the DOM that will exist on the next run, and that there is a single moment after navigation when the page is "done loading". Both assumptions are wrong on Next.js.
The four mechanisms
1. Streaming RSC: HTML arrives in chunks, not all at once.
2. Suspense flush: loading.tsx swaps when data resolves.
3. Hydration: the DOM mutates after first paint.
4. Server actions: re-renders after form submission.
1. Streaming RSC chunks the response
An App Router page using server components flushes the shell first, then streams in each Suspense boundary as its data resolves. Two consecutive runs against the same endpoint can ship the same final HTML in different orders depending on which backing fetch returns first. A locator that was unique at record time can match a different element at run time, and a bare getByRole("button", { name: "..." }) with no parent scope is the most common offender. Generated tests start asserting on the wrong card, the wrong list item, the wrong modal.
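For contrast, the hand-written mitigation is to scope the locator to a content-anchored parent so stream order cannot change which element matches first. The route, names, and roles below are placeholders:

```ts
import { test, expect } from "@playwright/test";

// Route, names, and roles are placeholders.
test("opens the right card regardless of stream order", async ({ page }) => {
  await page.goto("http://localhost:3000/dashboard");
  // Anchor on content, not document position: whichever order the cards
  // stream in, only one list item contains "alpha".
  const card = page.getByRole("listitem").filter({ hasText: "alpha" });
  await card.getByRole("button", { name: "Open" }).click();
  await expect(page.getByRole("heading", { name: "alpha" })).toBeVisible();
});
```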
2. loading.tsx swaps mid-render
When you put a loading.tsx next to a route segment, Next.js renders that fallback while the real segment streams. Once the segment's data resolves, React swaps the fallback subtree out. Codegen recording captures the clicks against whichever state happened to be visible when you clicked. On replay, the test can fire its click before the swap completes (against the fallback) or after (against the real content). Either way, a flaky pass.
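One way to make the swap explicit in a hand-written test is to wait for the fallback to detach before interacting. The test id below is a placeholder for whatever your loading.tsx renders:

```ts
import { test } from "@playwright/test";

// The test id is a placeholder for whatever your loading.tsx renders.
test("clicks after the Suspense swap, not during it", async ({ page }) => {
  await page.goto("http://localhost:3000/projects");
  // Wait for the fallback subtree to be swapped out before interacting.
  await page.getByTestId("projects-skeleton").waitFor({ state: "detached" });
  await page.getByRole("link", { name: "alpha" }).click();
});
```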
3. Hydration mutates the DOM after first paint
Even a server-rendered Next.js page hydrates client components after the initial HTML lands. Hydration attaches event listeners, installs portals for modals and toasts, and runs any useEffect that mounts new DOM. A generated test that fires a click during that window targets an element that exists in the DOM but is not yet wired up. The click registers as a pixel event, the handler never fires, the next assertion times out. Generators that emit page.waitForLoadState("domcontentloaded") are early. Generators that emit networkidle can be late, or never resolve under streaming.
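One app-side mitigation, apart from the stability wait described later, is a hydration marker: set an attribute from a top-level useEffect in your own code, then gate clicks on it. Neither Next.js nor Playwright provides this attribute; it is a pattern you add yourself. A sketch of the test-side half:

```ts
import type { Page } from "playwright";

// Assumes your app sets data-hydrated="true" on <html> from a top-level
// useEffect (a marker you add yourself; nothing sets this for you).
async function clickAfterHydration(page: Page, name: string): Promise<void> {
  await page.locator('html[data-hydrated="true"]').waitFor();
  await page.getByRole("button", { name }).click();
}
```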
4. Server actions trigger a re-render the test does not see
A form submitted via a server action sends the action result back and React re-renders the affected segments. The transition is quick, but it produces a burst of DOM mutations that arrive after the network response has come back. A generated test that asserts on the post-submit UI right after waiting for the network can catch the page mid-update.
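Playwright's web-first assertions are the narrow fix here: they retry against the live DOM until the assertion holds, so the post-response mutation burst is absorbed without any load-state call. A sketch with placeholder route and strings:

```ts
import { test, expect } from "@playwright/test";

// The route, button name, and toast text are placeholders.
test("settings save survives the server-action re-render", async ({ page }) => {
  await page.goto("http://localhost:3000/settings");
  await page.getByRole("button", { name: "Save" }).click();
  // No waitForLoadState: toBeVisible retries against the live DOM until the
  // post-response mutation burst produces the toast (or the timeout hits).
  await expect(page.getByText("Settings saved")).toBeVisible();
});
```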
“The agent injects new MutationObserver(...).observe(document.body, { childList: true, subtree: true, characterData: true }) and polls window state every 500ms. It returns when zero new mutations have arrived for stableSec * 1000 ms (default 2s). Streamed RSC, Suspense flushes, and hydration churn all eventually stop mutating; that moment is when everything is actually settled.”
agent.ts:962-994 in the open Assrt source
The artifact: stop emitting locators, emit intent
A standard Playwright generator records your clicks and writes a .spec.ts with the locators it inferred. Those locators are frozen the moment the file is written. An agent-style generator writes intent and lets the runner derive locators per step from a fresh snapshot of whatever the page actually is at that moment. Same Chromium, same network stack, different artifact. Compare the two below.
Codegen .spec.ts vs. agent intent for the same Next.js flow
```ts
// npx playwright codegen localhost:3000 emits this.
// Hardcoded locators. Hardcoded order. networkidle.
import { test, expect } from "@playwright/test";

test("dashboard shows the new project card", async ({ page }) => {
  await page.goto("http://localhost:3000/dashboard");
  await page.waitForLoadState("networkidle");

  // Picked up at record time. May be ambiguous on the next
  // run if a streamed card pushes it past first match.
  await page
    .getByRole("button", { name: "New project" })
    .click();
  await page.locator("input[name='name']").fill("alpha");
  await page.getByRole("button", { name: "Create" }).click();

  // networkidle on a streaming response can fire halfway,
  // or never. The generator does not know.
  await page.waitForLoadState("networkidle");
  await expect(page.getByText("alpha")).toBeVisible();
});
```
- Locators frozen at record time
- networkidle is a poor signal under RSC streaming
- Order assumption baked into the test
- Two of these calls can flake independently
The intent file is what assrt_plan emits. The implementation is at src/mcp/server.ts:766-845: it launches a local Chromium, navigates to the URL you give it, captures three screenshots at scroll positions 0, 800, and 1600 pixels (lines 794-805), slices the concatenated accessibility tree to 8000 characters (line 809), and asks claude-haiku-4-5-20251001 for 5 to 8 #Case blocks (lines 829-834). The system prompt that constrains the output is at server.ts:219-236 and is 18 lines long. You can read it in the repo before you trust it.
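For the other side of the comparison, here is the rough shape of one such block. The wording is illustrative, not the exact generator output; the actual text is generated fresh per page, and every step is re-resolved by the runner at execution time:

```markdown
#Case 1: Create a project from the dashboard
Open the dashboard. Click the "New project" button, type "alpha" into the
project name field, and click "Create". Wait for the page to stabilize,
then confirm a card named "alpha" is visible.
```

No selectors, no load-state calls: the runner derives both per step from a live snapshot.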
The wait primitive: watch the DOM, not the network
Below is one iteration of the runner waiting for a streamed RSC route to settle. The agent does not know anything about Next.js or RSC; it just waits until the DOM has been quiet for the configured window. That works for streaming, for Suspense flushes, for hydration, for AI chat responses, for anything else that mutates the page after the network has come back.
One agent step on a streamed Next.js route
The defaults are 2 seconds of zero-mutation quiet and a 30-second timeout. Both are configurable per step (capped at 10 and 60 respectively at agent.ts:957-958). The poll interval is 500ms, set explicitly at agent.ts:979. The cleanup at the end disconnects the observer and deletes the window globals so a long scenario does not leak across cases (agent.ts:990-994). This is what an open generator looks like when its waits do not assume your stack.
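A minimal sketch of that wait, following the description above and the quoted snippet. The real implementation is at agent.ts:956-1009; treat this as an illustration, not the verbatim source:

```ts
import type { Page } from "playwright";

// Sketch of the stability wait: inject a MutationObserver, poll the count,
// return once the DOM has been quiet for stableSec seconds.
async function waitForStable(
  page: Page,
  stableSec = 2,   // zero-mutation quiet window (capped at 10 in the source)
  timeoutSec = 30, // overall deadline (capped at 60 in the source)
): Promise<string> {
  // Count every mutation anywhere under document.body.
  await page.evaluate(() => {
    (window as any).__assrt_mutations = 0;
    const observer = new MutationObserver((mutations) => {
      (window as any).__assrt_mutations += mutations.length;
    });
    observer.observe(document.body, {
      childList: true,
      subtree: true,
      characterData: true,
    });
    (window as any).__assrt_observer = observer;
  });

  const start = Date.now();
  let lastCount = 0;
  let stableSince = Date.now();

  try {
    // Poll every 500ms; any new mutation resets the quiet-window clock.
    while (Date.now() - start < timeoutSec * 1000) {
      await page.waitForTimeout(500);
      const count: number = await page.evaluate(
        () => (window as any).__assrt_mutations,
      );
      if (count !== lastCount) {
        lastCount = count;
        stableSince = Date.now();
      } else if (Date.now() - stableSince >= stableSec * 1000) {
        const secs = ((Date.now() - start) / 1000).toFixed(1);
        return `page stabilized after ${secs}s (${count} total mutations)`;
      }
    }
    return `timed out after ${timeoutSec}s waiting for the DOM to go quiet`;
  } finally {
    // Disconnect and delete the globals so a long scenario does not leak
    // observers across cases.
    await page.evaluate(() => {
      (window as any).__assrt_observer?.disconnect();
      delete (window as any).__assrt_observer;
      delete (window as any).__assrt_mutations;
    });
  }
}
```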
Try it on your Next.js dev server
Start your app and run the generator against it. The first command generates a plan; the second runs it and produces a video and a JSON report. Both work against localhost, Vercel previews, and production.
```sh
# 1. Generate the plan from a URL.
npx @m13v/assrt discover http://localhost:3000

# 2. Run the plan. Real Playwright under the hood.
npx @m13v/assrt run http://localhost:3000

# 3. Re-run the same scenario by id later.
npx @m13v/assrt run http://localhost:3000 \
  --scenario <uuid-from-the-previous-run>
```
Outputs land in /tmp/assrt/: scenario.md for the plan, scenario.json for the metadata, and results/latest.json for the run output. Move them into your repo, version them, and run them from CI. There is no remote dashboard you depend on.
What this approach does not solve for you
A generator that emits intent is not magic. If you came here to ship a test today, here is what stays on you rather than on the tool.
- Idempotence is your problem. If a generated case clicks "Create project", the second run sees a different staging state than the first. Either write cases that are read-only, or have them clean up. The runner records pass and fail; it does not roll back.
- Auth-gated flows need a seeded session. Middleware-driven redirects fire before the page renders. If your app gates a route behind auth, set the auth cookies in the persistent profile before generating, or seed them in a #Case 1 that signs in. The agent reuses the browser session across cases.
- Parallel routes need their slot loaded. The generator works from what is on the page when it visits. If a @modal slot is empty when you hit the URL, the generator will not know it exists. Trigger the route into the state you care about first, then run the generator.
- Visual diffs are still on you. The runner records video and per-step screenshots, but the generator produces functional cases, not pixel-comparison ones. If you need a pixel snapshot test, write that case explicitly.
Want a walkthrough on your own Next.js app?
Bring a URL (localhost, preview, or production). I will run the generator on it live, walk through what came out, and answer the awkward question about CI integration.
Frequently asked questions
What is the actual problem with npx playwright codegen on a Next.js App Router app?
Codegen records the clicks you make and emits a TypeScript .spec.ts with hardcoded locators (page.getByRole, page.locator) and an implicit assumption that the DOM you saw at recording time is the DOM the test will see at run time. On a Next.js App Router app that streams React Server Component output, that assumption is the one that bites. The page you recorded against can ship in a different order on the next run because Suspense boundaries flush as their data arrives, not in a deterministic order. A locator that was unique at record time can be ambiguous at run time because a streamed-in card now matches first, and 'click the first match' returns a different element. networkidle as a wait condition compounds the problem because RSC keeps the response open while it streams, so networkidle can fire halfway through a render or never fire at all under heavy streaming. The pattern of a test passing locally and failing on the next CI run is almost always one of these.
Concretely, where does the generator design have to change for Next.js?
Two places. First, the artifact: stop emitting hardcoded locators in a .spec.ts and start emitting intent (text the runner will re-resolve from a fresh accessibility snapshot per step). Second, the wait primitive: stop relying on Playwright's built-in waitForLoadState('networkidle') and start watching the DOM directly. Assrt does both. The generator (assrt_plan, in src/mcp/server.ts:766-845) emits Markdown #Case blocks. The runner uses a wait_for_stable tool (agent.ts:956-1009) that injects a MutationObserver on document.body and returns only when the DOM has been quiet for the configured stable window. That second piece is what most generators get wrong on Next.js: the network is the wrong signal. Mutations are the right signal.
What does the MutationObserver wait actually do, in code?
It is roughly thirty lines at agent.ts:962-994. The agent calls browser.evaluate to inject this on the page: window.__assrt_mutations = 0; window.__assrt_observer = new MutationObserver((mutations) => { window.__assrt_mutations += mutations.length; }); window.__assrt_observer.observe(document.body, { childList: true, subtree: true, characterData: true }). Then it polls window.__assrt_mutations every 500ms. Whenever the count changes, it resets a stableSince timestamp. When stableSince has been older than stableSec * 1000 (default 2 seconds, capped at 10), it returns 'page stabilized after Xs (N total mutations)'. If timeoutSec * 1000 (default 30, capped at 60) elapses first, it returns the timeout string. Then it disconnects the observer and deletes the window globals. That is the entire mechanism. There is nothing Next.js specific about it, and that is the point: streamed RSC, Suspense flushes, hydration churn, AI streaming responses, all of them eventually stop mutating the DOM, and stability is the moment everything has actually settled, not when the network thinks it is done.
If the generator emits Markdown intent, what does the runner actually run?
Real Playwright, through @playwright/mcp. The runner is an agent loop in agent.ts:692-747 that calls Anthropic with tools, executes the returned tool_use blocks (navigate, click, type_text, scroll, snapshot, evaluate, http_request, wait, wait_for_stable, assert, complete_scenario, and a handful of email primitives), and feeds the new accessibility snapshot back as a tool_result on the next turn. snapshot returns the live page as a YAML-style accessibility tree where every interactable node has a [ref=eN] id, and click/type_text route those refs through Playwright's standard locator engine. Nothing about this is mocked. It is the same Chromium, the same network stack, and the same locator engine that any other Playwright test uses. The difference is that the locators are derived per step from what is currently on the page, not from what was on the page when the generator ran.
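The loop's shape, reduced to a sketch. The real one is at agent.ts:692-747; here, TOOLS and executeTool stand in for the Playwright-backed tool schemas and dispatcher, which are assumptions of this illustration:

```ts
import Anthropic from "@anthropic-ai/sdk";

// Placeholder: the real tool schemas (navigate, click, type_text, snapshot,
// wait_for_stable, assert, complete_scenario, ...) live in agent.ts.
const TOOLS: Anthropic.Tool[] = [];

// executeTool is assumed to dispatch each call to the Playwright-backed
// browser and return its output (usually a fresh accessibility snapshot).
async function runScenario(
  intent: string,
  executeTool: (name: string, input: unknown) => Promise<string>,
): Promise<void> {
  const anthropic = new Anthropic();
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: intent }];

  while (true) {
    const response = await anthropic.messages.create({
      model: "claude-haiku-4-5-20251001",
      max_tokens: 4096,
      tools: TOOLS,
      messages,
    });
    messages.push({ role: "assistant", content: response.content });

    // stop_reason: end_turn means the agent considers the scenario done.
    if (response.stop_reason !== "tool_use") break;

    // Execute each requested tool and feed the result back on the next turn.
    const results: Anthropic.ToolResultBlockParam[] = [];
    for (const block of response.content) {
      if (block.type === "tool_use") {
        results.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: await executeTool(block.name, block.input),
        });
      }
    }
    messages.push({ role: "user", content: results });
  }
}
```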
Does the generator need a running Next.js dev server, or does it work on a deployed URL?
Either. assrt_plan takes any URL: localhost:3000, a Vercel preview deployment, or production. It launches a local Chromium via @playwright/mcp, navigates, captures three screenshots at scroll positions 0, 800, and 1600 pixels (server.ts:794-805), slices the concatenated accessibility text to 8000 characters (server.ts:809), and sends both to claude-haiku-4-5-20251001 with max_tokens 4096 (server.ts:830-831). Output is 5 to 8 #Case blocks describing the most important user flows visible on the page, written in plain English the runner will later interpret. There is no setup beyond having Node 18+, npx, and a URL to point at.
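The capture step, reduced to a sketch. The real code is at server.ts:794-805; the settle delay between scrolls here is an assumption:

```ts
import { chromium } from "playwright";

// Scroll to 0, 800, and 1600 pixels and screenshot each position, as the
// article describes. The screenshots are sent to the model alongside the
// accessibility text sliced to 8000 characters.
async function capture(url: string): Promise<Buffer[]> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  const screenshots: Buffer[] = [];
  for (const y of [0, 800, 1600]) {
    await page.evaluate((top) => window.scrollTo(0, top), y);
    await page.waitForTimeout(500); // assumption: brief settle between scrolls
    screenshots.push(await page.screenshot());
  }
  await browser.close();
  return screenshots;
}
```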
What Next.js specific gotchas should I know about when generating tests this way?
Three. First, server actions can mutate state in ways that affect the next test run: if your generated case 'submits the contact form', a second run against the same staging environment will see a different state than the first. The runner's complete_scenario records pass/fail, not idempotence; the discipline is to write cases that are either read-only or that clean up after themselves. Second, route groups (the (parens) folders) and parallel routes mean a single URL can render different layouts based on segment. The generator works from what it sees on the page, so if a parallel slot has not loaded, it will not be in the plan. Re-run the generator after navigating into the section you care about. Third, middleware-driven redirects fire before the page loads; if you are testing a flow gated by auth, set the auth cookies in the persistent profile before generating. The agent shares the browser session across cases (agent.ts mentions 'cookies, auth state carry over' explicitly in its system prompt) so a #Case 1 that signs in seeds every later case.
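A sketch of the cookie-seeding step for the auth case. The profile path, cookie name, and env var are placeholders; point the profile directory at whatever your agent is configured to reuse:

```ts
import { chromium } from "playwright";

// Seed an auth cookie into a persistent profile before generating, so
// middleware redirects don't bounce the agent to /login.
async function seedSession(): Promise<void> {
  const context = await chromium.launchPersistentContext("/tmp/assrt-profile", {
    headless: true,
  });
  await context.addCookies([
    {
      name: "session",                        // your app's auth cookie
      value: process.env.SEED_SESSION_TOKEN!, // minted out of band
      domain: "localhost",
      path: "/",
      httpOnly: true,
    },
  ]);
  await context.close(); // the persistent profile keeps the cookie on disk
}
```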
Does this work with the Pages Router, or only the App Router?
Both, for the simple reason that the generator does not parse your routing config or your code at all. It reads the rendered page through Playwright. A Pages Router app is easier in some ways, since it does not stream RSC, so there is less DOM churn for the stability wait to absorb. An App Router app is where the wait_for_stable primitive earns its keep, but the generator is identical in both cases: navigate, snapshot, screenshot, generate. If your app uses both routers (a common transitional state), each individual page is whatever it is when the generator visits it.
How is this different from QA Wolf, Momentic, or BrowserStack's AI test generator?
Three differences that matter. (1) Output format: QA Wolf and Momentic store generated tests in their backend in a proprietary format; you read them through their dashboard. Assrt writes a Markdown file at /tmp/assrt/scenario.md you can grep, diff, and check into git. (2) Hosting: QA Wolf and Momentic run the tests on their cloud and bill per run (QA Wolf publicly quoted around $7,500 per month for managed coverage). Assrt runs locally or in your CI on your boxes. (3) The wait primitive: most closed AI generators emit a fixed waitForLoadState('networkidle') or a small set of hardcoded sleeps in their generated tests, which is exactly the pattern that breaks on Next.js streaming. Assrt's stability wait is dynamic. The trade-off is that closed SaaS tools have polished dashboards and scheduled runs out of the box; with Assrt you wire the cron job yourself or run from CI.
Where is the source so I can verify all of this?
One repo, for now. The MCP server, CLI, and agent loop are at https://github.com/assrt-ai/assrt-mcp (the file paths in this article are accurate against the main branch), and the web dashboard and recorder live there too (mono-ish layout). The npm package is @m13v/assrt; npx @m13v/assrt discover https://your-app.com triggers the generator path described here. Every claim about line numbers and prompt text is checkable from that URL, which is the whole point of running an open generator instead of a closed one.
Keep reading
Playwright AI test agents, explained: the agent is a while loop that calls a browser
If you want to know what actually happens after the generator emits the plan, this traces the runner loop line by line: snapshot, tool_use, tool_result, stop_reason: end_turn.
AI Playwright test generator with an open prompt: the 18 lines that write your tests
The exact 18-line prompt that turns three screenshots into a #Case Markdown plan. No proprietary YAML, no SaaS dashboard, no hidden behavior.
Playwright e2e test best practices
The patterns that hold up under churn (and the ones that do not), independent of how the test was authored. Useful as a reference even if you write tests by hand.