Claude Opus 4.7 for Playwright tests: which stage actually needs the frontier model
The most common assumption about AI-driven Playwright work is that the bigger the model, the better the test. Read the Assrt source and the assumption falls apart fast: the agent run has three stages with very different shapes, and only two of them reward Opus 4.7.
Direct answer, verified 2026-05-08
To run Playwright tests on Claude Opus 4.7 with Assrt, pass --model claude-opus-4-7 to the CLI, or set ANTHROPIC_MODEL=claude-opus-4-7 in your shell. Both routes flow through the same line of code at src/core/agent.ts:366. The default is claude-haiku-4-5-20251001 (agent.ts:9), chosen for cost, not capability. The cost-aware setup is to override to Opus only for assrt_plan and assrt_diagnose, and leave Haiku on assrt_test execution. Reference: Anthropic's model docs at docs.anthropic.com/en/docs/about-claude/models.
An Assrt run is three calls, not one
Anyone benchmarking models on Playwright tests as if it were one workload is collapsing three things that should be measured separately. The Assrt MCP server exposes three tools: assrt_plan, assrt_test, assrt_diagnose. They are wired in src/mcp/server.ts, and each calls into the agent at src/core/agent.ts with a different prompt and a different token budget. The shape of the work in each one is what decides whether Opus 4.7 is worth the price.
The three model calls inside one Assrt run
Walk the three stages, decide each one separately
The honest decomposition: for each stage, ask the same two questions. How constrained is the model's output, and how much does the surrounding system already do for it? The more constrained the output and the stronger the system support, the less Opus pays back the price difference.
Stage 1, planning. Reads one page, writes a test plan.
The plan call sends one accessibility-tree snapshot of the landing URL plus a short system prompt that asks for one or two #Case blocks per page. The model never sees the rest of the app. This is the only stage where reasoning quality directly drives test quality, because the model has to imagine flows the tree only hints at. Defined at agent.ts:586 with `max_tokens: 1024`. Opus catches the implied flow, Haiku catches the obvious one.
Stage 2, execution. Snapshot, act, snapshot, until complete_scenario.
Each turn sends the latest snapshot, the prior tool results, and the SYSTEM_PROMPT defined at agent.ts:198. The model picks one tool call (click, type_text, scroll, assert, etc.). The tool round-trips through Playwright MCP and the next snapshot lands. The bottleneck is browser latency, not model latency. Token budget per turn is `max_tokens: 4096` at agent.ts:715. On this loop Haiku 4.5 matches Opus 4.7 on completion rate in our internal runs because the structured tools constrain the answer space.
Stage 3, diagnosis. Reads the entire failed transcript, returns a fix.
When a scenario fails, assrt_diagnose feeds the model the assertion that failed, the agent reasoning trail, the screenshots, and the DOM state at the moment of failure. There is no narrow tool surface here, the model is just asked to reason. This is where Opus 4.7 produces a meaningfully different output than Haiku 4.5: it is more likely to identify the upstream cause (a race in a previous scenario, a missing wait, an auth-state regression) instead of pattern matching the immediate error.
Why the execution loop is a poor fit for Opus
The execution loop runs through the same handful of tools on every turn. Eighteen tools total, defined as the constant TOOLS starting at agent.ts:18: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request, wait_for_stable. A turn looks like this: the model receives the most recent snapshot and the prior tool result, picks one tool, sends a small JSON arguments object, and exits. The accessibility tree under snapshot carries the structure (every interactable element gets a stable ref like ref="e5"), and the system prompt at agent.ts:198 forces the model to use those refs over text matching. There is very little space left in the turn for reasoning that Opus would do better than Haiku.
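Here is what a single turn reduces to, written out. This is a minimal sketch assuming the Anthropic Messages API and the per-turn budget cited above; the trimmed-down TOOLS constant and the helper wiring are illustrative stand-ins, not copies of Assrt's actual code.

```ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// One illustrative tool schema; the real TOOLS constant defines all eighteen.
const TOOLS: Anthropic.Tool[] = [
  {
    name: "click",
    description: "Click an element identified by its accessibility-tree ref",
    input_schema: {
      type: "object",
      properties: { ref: { type: "string" } },
      required: ["ref"],
    },
  },
];

async function oneTurn(
  model: string,
  systemPrompt: string,
  transcript: Anthropic.MessageParam[],
) {
  const response = await anthropic.messages.create({
    model,                // "claude-haiku-4-5-20251001" is enough here
    max_tokens: 4096,     // the per-turn budget the article cites
    system: systemPrompt,
    tools: TOOLS,
    messages: transcript, // prior snapshots and tool results
  });
  // The whole decision: one tool name plus a small JSON arguments object.
  const call = response.content.find((block) => block.type === "tool_use");
  return call?.type === "tool_use"
    ? { name: call.name, input: call.input } // e.g. { name: "click", input: { ref: "e5" } }
    : null;
}
```

The point of the sketch is the shape: everything around the messages.create call is deterministic plumbing, which is why a small model holds its own here.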
Add browser latency. A snapshot round trip through Playwright MCP, a click that triggers a page transition, a wait for network idle: those dominate wall-clock time. Even if Opus shaved a few hundred milliseconds off the model decision, which it generally does not, the saving would disappear under the browser's own latency. Picking Opus for execute is paying frontier-model prices to wait for Chromium.
Why Opus 4.7 pays for itself on plan and diagnose
Plan and diagnose are the two stages where the model has almost no scaffolding. The plan call sends a single accessibility-tree snapshot of the landing URL plus a small system prompt (agent.ts:256) and asks for one or two test cases per page. The model has to imagine flows that the tree implies but does not state: disclosures behind modals, async-validating forms, dashboard tiles that become interactive after data loads. Opus is meaningfully more likely to write a case that exercises the implied flow. Haiku writes the obvious one and stops.
The diagnose call is the other end of the same shape. There is no tight tool surface, no structured output schema. The model reads the failed assertion, the agent's full reasoning trail, every screenshot the run produced, and the DOM at the moment of failure. The question is open: what actually broke, and how do you fix it. Free-form reasoning over a long context is the canonical case where a frontier model produces a different and better answer than a small model. Treat that as the rule, not the exception.
Per-stage verdict
- Plan: implied flows from one snapshot; reasoning-bound. Opus wins.
- Execute: tight tool loop, structured outputs; tool latency dominates. Haiku ties.
- Diagnose: free-form reasoning over a long transcript. Opus wins by a clear margin.
The verdict, stage by stage
The same comparison in a row layout. Read each row as a verdict on whether the price difference between Haiku 4.5 and Opus 4.7 is worth paying for that stage of the run.
| Stage | Haiku 4.5 | Opus 4.7 |
|---|---|---|
| Plan generation (assrt_plan) | Adequate. Misses non-obvious flows on dense pages. Fast. | Catches edge cases the page implies but does not show. Worth the cost on first run for a new app. |
| Scenario execution (assrt_test) | Same completion rate as Opus on the snapshot-act-snapshot loop. Faster turn time. Cheaper per token. | Marginally better at recovering from a stale ref or an unexpected modal. Not 15x better, which is what you would need to justify the price. |
| Failure diagnosis (assrt_diagnose) | Often blames the wrong line. Treats symptoms not causes. | Reasons across the full run transcript and screenshot stream to identify the actual broken assumption. Worth the cost every time a real test fails. |
The override is one flag or one env var
The full path through the source: the CLI parses --model at cli.ts:31, passes it to new TestAgent(...) at cli.ts:524, and the constructor resolves the final model id at agent.ts:366 with the expression this.model = model || process.env.ANTHROPIC_MODEL || DEFAULT_ANTHROPIC_MODEL. Three precedence levels, no surprises.
```
npx @assrt-ai/assrt run --url https://your-app.com --model claude-opus-4-7 --plan '#Case 1: open the home page and click Sign in'
```
agent.ts:366 picks --model first, then ANTHROPIC_MODEL, then the Haiku 4.5 default.
The matching MCP tool surface accepts a per-call model argument on assrt_plan, assrt_test, and assrt_diagnose (server.ts:351 and surrounding). That is how an upstream agent like Claude Code mixes models cleanly: Opus on plan, Haiku on test, Opus on diagnose if the test fails. No env var juggling, no shell session bleed, no manual copy-paste.
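A sketch of that mixed setup from the client side, assuming @modelcontextprotocol/sdk and the per-call model argument described above. The argument names beyond `model` and the command that launches the server are assumptions for illustration, not documented Assrt API.

```ts
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const client = new Client({ name: "model-mixer", version: "0.0.1" });
await client.connect(
  // Assumed launch command; point this at however you run the Assrt MCP server.
  new StdioClientTransport({ command: "npx", args: ["@assrt-ai/assrt", "mcp"] }),
);

// Opus where reasoning quality drives test quality.
const plan = await client.callTool({
  name: "assrt_plan",
  arguments: { url: "https://your-app.com", model: "claude-opus-4-7" },
});

// Haiku for the constrained snapshot-act-snapshot loop.
const run = await client.callTool({
  name: "assrt_test",
  arguments: { url: "https://your-app.com", model: "claude-haiku-4-5-20251001" },
});

// Opus again only when a failure needs free-form diagnosis.
if (run.isError) {
  await client.callTool({
    name: "assrt_diagnose",
    arguments: { model: "claude-opus-4-7" },
  });
}
```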
What the web dashboard exposes
The Assrt web app at app.assrt.ai exposes a model picker on the test runner page (src/app/app/test/page.tsx:1394..1396). The hardcoded options today are Haiku 4.5, Sonnet 4, and Opus 4. The CLI and MCP surface are not bound by that dropdown: you can pass claude-opus-4-7 (or any other Anthropic model id from Anthropic's model list) and the agent will route to it. Use the dashboard for casual runs, use the CLI or MCP override when you want frontier reasoning on a specific stage.
The model ids behind those options, plus the override this article recommends:

- `claude-haiku-4-5-20251001` — Claude Haiku 4.5
- `claude-sonnet-4-20250514` — Claude Sonnet 4
- `claude-opus-4-20250514` — Claude Opus 4
- `claude-opus-4-7` — Claude Opus 4.7 (override)
Want help wiring the right model to the right stage?
Book 20 minutes. We will look at your app, your test plan, and which stage in your CI is actually slow, and tell you exactly where Opus pays for itself.
Common questions
How do I actually switch Assrt to Claude Opus 4.7?
Two paths, both resolved at src/core/agent.ts:366 where the model string is picked as `model || process.env.ANTHROPIC_MODEL || DEFAULT_ANTHROPIC_MODEL`. The first is the CLI flag: `npx @assrt-ai/assrt run --url https://your-app.com --model claude-opus-4-7 --plan "#Case 1: ..."`. The second is the environment variable: `export ANTHROPIC_MODEL=claude-opus-4-7` once, and every Assrt invocation in that shell session picks Opus. The MCP server tool surface also exposes `model` as a per-call argument (server.ts:351), so an upstream agent like Claude Code can switch models per scenario without changing your shell.
Why does Assrt default to Haiku 4.5 instead of the latest Opus?
The default is hardcoded as `claude-haiku-4-5-20251001` at agent.ts line 9. The reason is the shape of the inner loop, not loyalty to Anthropic's small model. The execution loop is snapshot, pick one tool call from a list of eighteen, send it, wait for the browser, repeat. The model is not generating prose, it is selecting a single tool name and a small JSON arguments object. That decision is heavily constrained by the structured tool schema (agent.ts:18..186), which is exactly the kind of work small models do well. Picking Opus here is paying frontier prices to answer what amounts to multiple-choice questions.
When does Opus 4.7 measurably beat Haiku 4.5 on Playwright work?
Two specific moments. First, plan generation, the assrt_plan tool. The model gets one snapshot of the landing page and a tiny system prompt asking for one or two test cases. Opus is more likely to surface flows the tree only implies (the modal that opens after a hidden CTA, the form that validates async, the dashboard tile that becomes interactive only after data loads). Second, failure diagnosis, the assrt_diagnose tool. Here the model reads the full transcript of the failed run and is asked to reason about why. There is no narrow tool surface, no structured output schema, just a long context and a question. That is the shape of work where Opus 4.7's reasoning shows up cleanly.
Is the answer ever "use Opus for everything"?
Yes, in two narrow cases. The first is the very first run on a brand-new app where you do not yet have a test plan and you also do not yet trust the agent to drive your UI without supervision. Putting Opus on plan + execute + diagnose maximizes the chance that the first hour of usage produces something you want to keep. The second is when you are debugging a flaky scenario that intermittently fails on Haiku and you want to rule out model error before you blame the test. Once you have a stable plan and a stable run, swap back to Haiku for execute and keep Opus only on diagnose.
Does Assrt write real Playwright code, or only pass tool calls to Playwright at runtime?
It does the second. The agent loop drives @playwright/mcp at runtime, calling navigate, snapshot, click, type_text, and the rest as MCP tools. The artifact you keep is the plan: plain Markdown with #Case blocks, stored at /tmp/assrt/scenario.md and optionally checked into your repo. There is no auto-generated *.spec.ts that Assrt writes for you. That is by design: the structure of an AI agent run is not a one-to-one mapping to a Playwright test file, and pretending otherwise produces tests that look readable but do not capture the agent's recovery logic. If you want a starting point for hand-written Playwright that mirrors a passing scenario, run with `--video --json` and use the recorded video plus the Markdown plan as the spec.
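For concreteness, here is the kind of hand-written spec that starting point might turn into. Illustrative only: the selector, URL, and assertion are placeholders to be read off your own plan and video, not anything Assrt generates.

```ts
import { test, expect } from "@playwright/test";

// Mirrors "#Case 1: open the home page and click Sign in" from the Markdown plan.
test("Case 1: open the home page and click Sign in", async ({ page }) => {
  await page.goto("https://your-app.com");
  await page.getByRole("link", { name: "Sign in" }).click();
  await expect(page).toHaveURL(/sign-?in/i);
});
```

Note what the spec cannot carry over: the agent's mid-run recovery logic, which is exactly why the Markdown plan, not a generated *.spec.ts, is the artifact Assrt keeps.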
What does an Opus 4.7 plan look like that a Haiku 4.5 plan misses?
A representative example from a checkout page. The accessibility tree shows an email input, a CTA labelled "Continue", and a small disclosure link. Haiku writes one case: enter email, click Continue, assert next page. Opus writes two: the same case, plus a second case that clicks the disclosure first, asserts that the modal opens, closes it, then runs the happy path. The disclosure was implied by the tree, not stated; Haiku treated it as decorative, Opus treated it as a path. Multiply that across a dozen pages on first run and you get meaningfully more coverage out of the planning step alone.
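Written out in the #Case plan format, the difference is two cases against one. The wording here is illustrative, not actual model output:

```
#Case 1: enter an email on the checkout page, click Continue, assert the next page loads
#Case 2: click the disclosure link, assert the modal opens, close it, then enter an email, click Continue, assert the next page loads
```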
How do I know which model the run actually used?
Two signals. The CLI prints `[assrt] Model: <id>` at startup (cli.ts:501), and the agent emits a structured log line `agent.run.start` with the model field at agent.ts:383. If you run with `--json`, the resulting JSON includes a `model` field at the top level (cli.ts:580). The MCP tool response also includes the model in the metadata block (server.ts:726, server.ts:751) so an upstream agent can verify that its model override actually took effect rather than being silently dropped.
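A hypothetical guard that makes the check mechanical, assuming `--json` writes a single JSON document with the top-level `model` field described above to stdout:

```ts
import { execFileSync } from "node:child_process";

const raw = execFileSync(
  "npx",
  [
    "@assrt-ai/assrt", "run",
    "--url", "https://your-app.com",
    "--model", "claude-opus-4-7",
    "--json",
  ],
  { encoding: "utf8" },
);

const result = JSON.parse(raw); // assumes stdout carries the JSON document alone
if (result.model !== "claude-opus-4-7") {
  throw new Error(`model override was dropped: run used ${result.model}`);
}
```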
What about cost, in concrete terms?
The honest answer is to read Anthropic's pricing page rather than trust a number on this site that may have moved. The shape worth knowing: Opus tokens cost an order of magnitude more than Haiku tokens, both for input and output. The execution loop sends a fresh snapshot every turn, snapshots are not small, and a typical scenario takes ten to thirty turns. That is the stage where the cost ratio compounds the most. The plan stage sends one snapshot. The diagnose stage sends a transcript that fits comfortably in Opus's context. So the cost-aware default is exactly what Assrt ships with: Haiku for execute, Opus on demand for plan and diagnose.
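The shape of that compounding, as arithmetic. Every number below is an illustrative assumption, not a measurement; swap in your own snapshot sizes and Anthropic's current prices before drawing conclusions.

```ts
// Assumed token volumes per stage for one scenario.
const snapshotTokens = 6_000;  // one accessibility-tree snapshot (assumption)
const turns = 20;              // the article's range: ten to thirty turns per scenario

const planInput = snapshotTokens;            // plan sends one snapshot
const executeInput = turns * snapshotTokens; // execute resends a fresh snapshot every turn
const diagnoseInput = 40_000;                // failed-run transcript (assumption)

// Execute carries ~20x the input volume of plan, so a 10x-plus per-token price
// gap between Opus and Haiku hurts most exactly there.
console.log({ planInput, executeInput, diagnoseInput });
```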
Can I mix models within a single Assrt run?
Not in a single CLI invocation today, but the MCP surface lets you do it across calls. assrt_plan, assrt_test, and assrt_diagnose each accept their own `model` argument (server.ts:351, server.ts:773, see also the diagnose tool registration). An upstream agent like Claude Code can call `assrt_plan(model: "claude-opus-4-7")` then `assrt_test(model: "claude-haiku-4-5-20251001")` then, on failure, `assrt_diagnose(model: "claude-opus-4-7")`. That gives you the cost-aware mixed setup automatically without you having to think about it per shell session.
Does Opus 4.7 reduce flaky Playwright tests?
It reduces the appearance of flakiness, not the underlying cause. A flaky test almost always means an unwaited async boundary, a race between a navigation and an assertion, or an auth-state assumption that breaks under parallel execution. Opus is more likely to recover gracefully when one of those conditions hits mid-scenario, which makes the run pass instead of fail. That is useful, but it also masks the bug. If your goal is a green CI, run Opus for execution. If your goal is to find and fix the actual flake source, run Haiku for execution and let the failures surface, then swap to Opus for assrt_diagnose to read the transcript and tell you what to fix.
More on AI-driven Playwright work
Related guides
Claude skills for Playwright test automation: the three-piece anatomy
MCP tools, a PostToolUse hook, and a CLAUDE.md preamble. The full anatomy with the actual five-line bash hook.
AI Playwright test generator with an open prompt
Read the literal eighteen-line system prompt that drafts your test plan. No black box, no proprietary DSL.
Reading the generated Playwright code, would you actually ship it
The honest take on AI-written Playwright. What lands as commit-ready, what still needs a human pass.