Natural language test case descriptions automation, as a live loop instead of a compiler
Search this keyword and you land on two clusters: academic NLP pipelines that read requirements once and emit test code, and vendor tools that wrap their own DSL and call it natural language. Both treat automation as a pre-processing step. This guide documents the opposite shape, the one that actually powers Assrt: the English plan stays English through execution, and a tiny agent loop at /Users/matthewdi/assrt/src/core/agent.ts lines 609-641 drives the browser turn by turn with a two-line rule and 18 tools.
What the top results for this keyword actually describe
Ten out of ten top-ranked pages for "natural language test case descriptions automation" describe the same mental model: a pipeline that takes plain-English requirements, runs NLP on them (tokenization, POS tagging, sometimes SCR specifications), and emits test cases as code. The oldest is a 2014 journal paper (NAT2TESTSCR, ScienceDirect). The most recent vendor pitches are testRigor, AccelQ, and Testsigma, each of which wraps a proprietary English-flavored DSL and calls it "natural language." None of them describe what happens at runtime when the test actually runs; they all talk about getting from requirements to a test artifact. That is pre-processing, and the artifact they emit is still code.
The gap in the SERP is the runtime view. If "English plan stays English" is your north star, the automation cannot live in a compile step; it has to live in the loop that drives the browser. That is the shape this guide documents.
The two-line rule that is the entire automation contract
In Assrt, the "automation" part of natural language test case automation is enforced by two sentences inside the system prompt at /Users/matthewdi/assrt/src/core/agent.ts, lines 207-208. Those sentences tell the model to call snapshot first and to act on ref IDs from the accessibility tree instead of CSS selectors. Everything downstream (resilience, recovery, selector stability) is a consequence of those two sentences.
“Lines of prompt that define the entire NL-to-browser execution contract in Assrt.”
agent.ts:207-208
The loop, in its own words
Below is a faithful reproduction of the agent loop. The point of reading it is to see what is not there: no parser for the English, no selector resolver, no codegen pass. The plan stays in messages[] unchanged for every turn, and the model picks one of 18 tool_use blocks per step.
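If the embedded source does not render in your reader, the loop shape it shows reduces to roughly the sketch below. The model and browser are stubbed out so the control flow is visible; `fakeModel`, `runScenario`, and everything inside them are illustrative assumptions, not the verbatim agent.ts code (the real loop calls anthropic.messages.create):

```typescript
// Minimal sketch of the loop shape: the English plan enters messages[]
// once and is never transformed; each turn the (stubbed) model reads the
// freshest snapshot and picks exactly one tool call.
type ToolCall = { name: string; input: Record<string, string> };
type Msg = { role: "user" | "assistant"; content: string };

// Stubbed model: a real run calls anthropic.messages.create here.
// This stub clicks once, then completes, so the loop shape is visible.
function fakeModel(messages: Msg[]): ToolCall {
  const acted = messages.some((m) => m.content.includes("clicked"));
  return acted
    ? { name: "complete_scenario", input: { summary: "done", passed: "true" } }
    : { name: "click", input: { element: "Get started", ref: "e5" } };
}

// Stubbed browser: a real run dispatches over Playwright MCP.
const browser = {
  snapshot: () => 'button "Get started" [ref=e5]',
  click: (ref: string) => `clicked ${ref}`,
};

function runScenario(planText: string): string[] {
  const trace: string[] = [];
  // The plan stays English: it goes into messages[] verbatim.
  const messages: Msg[] = [
    { role: "user", content: `${planText}\n\nSnapshot:\n${browser.snapshot()}` },
  ];
  while (true) {
    const call = fakeModel(messages); // one tool_use block per turn
    trace.push(call.name);
    if (call.name === "complete_scenario") break;
    const result = browser.click(call.input.ref);
    // Tool result plus a fresh snapshot feed the next turn.
    messages.push({ role: "assistant", content: call.name });
    messages.push({ role: "user", content: `${result}\nSnapshot:\n${browser.snapshot()}` });
  }
  return trace;
}
```

Note what the sketch has no room for: a parser, a selector resolver, a codegen pass. The only state is the message history.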
Static compile vs live loop, side by side
Toggle between the two automation shapes. The first is the one implied by every top SERP result. The second is the one we ship. They produce superficially similar artifacts (a passing test run) but differ in where the automation lives, what drifts, and what has to be regenerated.
```
# The "automation" most tools mean when they say NL-to-test:
# a static compile step that reads requirements once and emits
# test code with hard-coded selectors.

1. Parse requirements.md with an NLP pipeline (POS tagging, NER).
2. Extract actors, actions, expected outcomes into a template.
3. Map each action to a selector string: "Click Sign In" becomes
   page.click('button.btn-signin').
4. Emit a .spec.ts file that runs later, with no memory of the
   English and no ability to re-select if the UI shifts.
5. When the UI changes, the pipeline has to re-run and re-emit.
   The English file and the generated code drift apart.

# Where the automation lives:
# - before the run (compile time), in a one-shot script that
#   produces a frozen artifact of selectors
# - not during the run (execution time)
```
- Pre-run pipeline emits test code from English
- Stored locators drift when the UI changes
- Re-runs require re-emitting the artifact
- The English file and generated code can disagree
What the loop eats and produces, per turn
Every turn the agent pulls three inputs together and fans out three outputs. The middle column is a single anthropic.messages.create call at agent.ts:630. That call is the whole automation.
Per-turn data flow
The 18-tool surface, exhaustive
Every English sentence in your plan resolves to a tool call from this set. Defined in TOOLS at agent.ts:16-196.
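The full surface fits in one array. This constant is illustrative, collecting the tool names as listed elsewhere in this guide; the real definitions carry JSON Schema and live in the TOOLS array:

```typescript
// The 18 tool names an English plan sentence must resolve to;
// a sentence with no match in this list stalls the run.
const TOOL_NAMES = [
  "navigate", "snapshot", "click", "type_text", "select_option",
  "scroll", "press_key", "wait", "wait_for_stable", "screenshot",
  "evaluate", "create_temp_email", "wait_for_verification_code",
  "check_email_inbox", "assert", "complete_scenario",
  "suggest_improvement", "http_request",
] as const;

// A plan sentence is satisfiable only if it maps onto this surface.
function isKnownTool(name: string): boolean {
  return (TOOL_NAMES as readonly string[]).includes(name);
}
```

"Verify the heading Welcome is visible" maps onto `assert`; "Verify the brand color is teal" maps onto nothing, and stalls.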
One turn of the loop, expanded
The same sentence, "Click Get started," becomes a dozen exchanges between the plan file, the loop, the model, Playwright MCP, and the app under test. This is what the word "automation" actually means in natural-language testing when the plan never compiles.
Plan sentence -> click event
Six properties you get when the plan never compiles
The English never compiles
There is no transformation pass that turns your prose into a .spec.ts file or a stored locator. The plan text lives in messages[] for the duration of the run and leaves no derivative. If you edit the plan, you edit the automation. If you delete the plan, there is nothing left behind to drift.
Selectors are resolved per turn
Every turn gets a fresh snapshot, so "Click the Sign in link" is re-matched against whatever the DOM is right now. There is no locator stored with the test because the test is sentences.
The surface is 18 tools
TOOLS on agent.ts:16-196 is exhaustive. If a sentence cannot be satisfied by one of them, the run stalls. Knowing the surface is the main prerequisite to writing a plan that runs.
Two lines of prompt do most of the work
The SYSTEM_PROMPT at agent.ts:207-208 contains the entire execution contract: snapshot first, use ref=eN. Removing those two lines collapses reliability; adding more detailed rules hurts more than it helps at 4k max_tokens.
Recovery is also a snapshot
When a click fails because a modal opened or the DOM reshuffled, the prescribed recovery is to call snapshot again and re-pick. No retry decorator, no brittle-selector healing; the English plan is agnostic to UI churn.
Browser state carries between cases
The loop reuses the same MCP session across all scenarios in a run, so a case that logs in leaves the next case logged in. The English still belongs to one case; only the browser state is shared.
From plan file to pass/fail, step by step
The only seven things that happen
Parse the scenario file
parseScenarios at agent.ts:543 splits the plan on a single regex: /(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi. Each match becomes a { name, steps } object; every sentence inside a case stays as raw text. This is the only static step, and it is seven lines of code.
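A sketch of what that split does, using the regex quoted above. The helper name `parsePlan` and the sample plan are illustrative, not the verbatim agent.ts source:

```typescript
// Split a plan on the quoted regex; each chunk's first line is the
// case name, the remaining lines stay as raw English steps.
const CASE_SPLIT = /(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi;

function parsePlan(plan: string): { name: string; steps: string[] }[] {
  return plan
    .split(CASE_SPLIT)
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0)
    .map((chunk) => {
      const [name, ...rest] = chunk.split("\n");
      return {
        name: name.trim(),
        // Steps are never parsed further: they stay as sentences.
        steps: rest.map((s) => s.trim()).filter((s) => s.length > 0),
      };
    });
}

const plan = `# Case 1: Signup
Go to /signup
Click Get started

# Case 2: Login
Enter a valid email`;
```

Here `parsePlan(plan)` yields two `{ name, steps }` objects ("Signup" with two steps, "Login" with one), and that is the last static transformation the English ever sees.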
Snapshot before anything
At agent.ts:569 the loop calls this.browser.snapshot() once to get the initial ARIA tree, then hands it into the first user message alongside the scenario name and steps. The SYSTEM_PROMPT at agent.ts:198-254 enforces the rule: always snapshot first. No action ever runs against stale DOM state.
Send the same prose to the model
agent.ts:630-632 calls anthropic.messages.create with model = claude-haiku-4-5-20251001, system = SYSTEM_PROMPT, tools = the 18-tool TOOLS array, and messages = the growing history. The scenarioSteps string is in that history unchanged. The model is not compiling your English; it is reading it per turn alongside the freshest snapshot.
Model picks a ref, not a selector
The tool definition for click (agent.ts:32-42) asks for element (a human description) and ref (e.g. "e5"). Refs come from the accessibility tree emitted by the Playwright MCP snapshot, not from CSS. The model is picking among roles and accessible names visible right now. A CSS selector in your prose is a weaker hint than a role plus accessible name, because roles and names are what the model actually sees.
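Anthropic tool definitions are JSON Schema objects; a paraphrased sketch of the click tool's shape (not copied from agent.ts:32-42, so the description strings are assumptions):

```typescript
// Sketch of a tool definition in the Anthropic tool-use format:
// click asks for a human-readable description plus a snapshot ref.
const clickTool = {
  name: "click",
  description: "Click an element. Use the ref from the most recent snapshot.",
  input_schema: {
    type: "object",
    properties: {
      element: {
        type: "string",
        description: "Human description, e.g. 'the Sign in link'",
      },
      ref: {
        type: "string",
        description: "Ref ID from the accessibility tree, e.g. 'e5'",
      },
    },
    required: ["element", "ref"],
  },
};
```

Making `ref` required is the design choice that matters: the model cannot emit a click without first having read a snapshot that handed it a ref.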
Playwright MCP dispatches the action
The switch statement at agent.ts:682-780 maps each tool_use to a browser method: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, wait_for_stable, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, http_request. The call goes over MCP to a real Playwright session.
Re-snapshot and loop
After every visual action, the agent re-calls snapshot so the next turn sees the new DOM. Refs can invalidate mid-run and that is fine: the SYSTEM_PROMPT at agent.ts:220-226 tells the model to re-snapshot when an action fails. This is the entire recovery story. No retry library, no flaky-test annotation, no vendor magic.
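That recovery move can be sketched as a plain try/catch around the dispatch, with a fresh snapshot as the only fallback. The function shape here is illustrative, not the agent.ts dispatcher:

```typescript
// Recovery sketch: when a dispatched action throws (stale ref, modal
// intercept), feed a fresh snapshot back instead of retrying the same
// ref. The model re-picks on the next turn; nothing is "healed".
type TurnResult = { ok: boolean; feedback: string };

function dispatch(
  action: (ref: string) => string,
  ref: string,
  snapshot: () => string,
): TurnResult {
  try {
    return { ok: true, feedback: action(ref) };
  } catch (err) {
    // No retry decorator, no selector fallback chain: the only move
    // is to hand the model the new ARIA tree and let it pick again.
    return {
      ok: false,
      feedback: `Action failed (${(err as Error).message}). Fresh snapshot:\n${snapshot()}`,
    };
  }
}
```

A click intercepted by a cookie banner thus produces feedback containing the banner's own ref, which is exactly what the model needs to dismiss it on the next turn.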
complete_scenario closes the book
When the model emits complete_scenario with { summary, passed }, the loop records the result and moves to the next #Case. Cookies and auth state carry over (agent.ts:239-241), but each case gets its own fresh prompt. The whole run emits a TestReport JSON at /tmp/assrt/results/latest.json.
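The guide does not spell out the TestReport schema; the shape below is an assumption sketched for orientation only, and every field name in it is hypothetical:

```typescript
// Hypothetical minimal shape for /tmp/assrt/results/latest.json;
// field names are assumptions, not the actual Assrt schema.
type TestReport = {
  scenarios: {
    name: string;     // the #Case name from the plan
    passed: boolean;  // from the complete_scenario tool call
    summary: string;  // from the complete_scenario tool call
    turns: number;    // loop iterations consumed
  }[];
  startedAt: string;  // ISO timestamp
};

const report: TestReport = {
  scenarios: [{ name: "Signup", passed: true, summary: "done", turns: 9 }],
  startedAt: new Date().toISOString(),
};
```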
Live loop vs static compile, at the runtime level
Same English plan file. Very different answers to operational questions.
| Feature | Static NLP-to-code pipeline | Assrt live loop |
|---|---|---|
| When prose meets selectors | Once at compile time, selector is then frozen | On every turn, after a fresh ARIA snapshot |
| Recovery model | Regex-based self-healing or flakiness annotations | Re-snapshot, pick a different ref (agent.ts:220-226) |
| Artifact on disk | .spec.ts, .feature, or vendor-locked database | A .md plan and a TestReport JSON. No generated code. |
| What drives the browser | Generated Playwright/Cypress script or vendor runner | claude-haiku-4-5 tool-use loop over Playwright MCP |
| Reacting to UI changes | Re-run the compile pipeline, re-emit, diff against git | The next run just re-snapshots; no code regen |
| Tool surface | Unbounded DSL vocabulary, silently unimplemented at runtime | 18 exact tools, defined in TOOLS array (agent.ts:16-196) |
| License & hosting | Up to $7,500 per month with seat limits and hosted cloud | Open-source, self-hosted, bring your own model token |
An actual trace: signup in eight loop turns
This is the trace a real Assrt run would emit for the plan above, drawn from the event shape at /tmp/assrt/results/latest.json. No line here is a compile output; every line is either a snapshot, a tool call, or a browser action.
Recovery without healing
The SYSTEM_PROMPT at agent.ts:220-226 teaches the model one recovery move: call snapshot again. There is no retry library, no XPath fallback chain, and no stored locator to heal. The trace below shows what happens when a cookie banner intercepts a click mid-run.
The consequence
When the plan is the automation, UI churn is free
A static NLP pipeline pays the UI-churn cost twice: once to regenerate the test code, once to debug why the regenerated code behaves differently. A live agent loop pays nothing, because the plan file never mentions selectors in the first place. The ARIA tree on the next run is whatever it is; the model re-picks. If you are tracking time-to-green on flaky test suites, this is where the hours go.
Want to see the loop run against your app?
Bring one signup flow. We will write the plan live, run it, and inspect the turn-by-turn trace so you can see exactly how your English maps onto the 18-tool surface.
Book a call →
Frequently asked questions
Is natural-language test case descriptions automation the same as NLP test generation?
No. Most tools under this keyword treat automation as a static NLP pipeline that reads requirements.md, tokenizes, extracts entities, and emits a .spec.ts file or Gherkin bundle. That is pre-processing, not automation. Assrt's model is different: the English plan stays English, and an LLM tool-use loop at /Users/matthewdi/assrt/src/core/agent.ts lines 609-641 drives the browser turn by turn. There is no compile step, no generated code, no derivative artifact. When people say 'automation' for this keyword they usually mean the former; we think the latter is a strictly better shape because there is nothing to drift.
What is the two-line rule that makes the whole thing work?
Lines 207-208 of /Users/matthewdi/assrt/src/core/agent.ts, inside SYSTEM_PROMPT: 'ALWAYS call snapshot FIRST to get the accessibility tree with element refs. Use the ref IDs from snapshots (e.g. ref="e5") when clicking or typing.' Those two sentences tell the model it must re-read the live ARIA tree every turn and act on roles + accessible names, not on a stored locator. Removing them collapses reliability because the model falls back to guessing CSS. Those two lines plus the 18-tool TOOLS array (lines 16-196) are the execution contract.
What actually happens between 'Click Continue' in my plan and a real browser click?
Seven things, none of them code generation. (1) parseScenarios at agent.ts:543-554 splits the plan into { name, steps } objects. (2) agent.ts:569 calls browser.snapshot() and gets back an ARIA tree string with [ref=eN] tags per element. (3) agent.ts:595 builds a user message combining scenario text + snapshot. (4) agent.ts:630 calls anthropic.messages.create with claude-haiku-4-5-20251001, the SYSTEM_PROMPT, the 18 TOOLS, and the conversation so far. (5) The model returns a tool_use block, e.g. click(element="Continue", ref="e44"). (6) agent.ts:682-780 runs a switch that dispatches to browser.click. (7) The loop re-snapshots and repeats. That's it. Your English was never transformed.
What are the 18 tools, and why does it matter that it is exactly that set?
The TOOLS array at agent.ts:16-196 defines navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, wait_for_stable, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, and http_request. It matters because the English you write has to map onto that surface. A sentence like 'Verify the brand color is teal' has no tool and will stall; 'Verify the heading Welcome is visible' maps cleanly to assert. You are not writing to a DSL, you are writing to a tool surface, which is strictly narrower than 'anything English can say'. Knowing the 18 is the discipline.
How does self-healing work if there is no stored locator?
There is nothing to heal, because nothing was ever stored. 'Self-healing' as a feature exists because stored locators drift from the live DOM. Assrt resolves 'Click the Sign in link' fresh every run against the current ARIA tree, so what other tools call self-healing is our default path. If an action fails mid-run because a modal opened or the DOM reshuffled, the SYSTEM_PROMPT at agent.ts:220-226 tells the model to call snapshot again and pick a different ref. The English plan does not change. See the recovery trace in the guide for a concrete example with a cookie banner.
Where does the plan live, and can I edit it mid-conversation with an AI agent?
Plans live at /tmp/assrt/scenario.md on disk (see /Users/matthewdi/assrt-mcp/src/core/scenario-files.ts). After any assrt_test call, the test plan is written there and scenario metadata goes to /tmp/assrt/scenario.json. You can open scenario.md in any editor, change sentences, and re-run with the same scenarioId UUID. Because there is no intermediate artifact, the edit is the automation change. Claude, Cursor, or any MCP-speaking agent can also edit the file in the same conversation and re-run assrt_test, which is how most of our own testing happens.
What model is driving the loop by default, and can I swap it?
DEFAULT_ANTHROPIC_MODEL at agent.ts:9 is claude-haiku-4-5-20251001. The alternative path is claude-opus-4-7 for harder flows or gemini-3.1-pro-preview for cost tests (DEFAULT_GEMINI_MODEL, agent.ts:10). You pass model as a string to assrt_test and it overrides at runtime. The GEMINI_FUNCTION_DECLARATIONS block at agent.ts:277-301 is just a one-to-one translation of the same 18 tools into Gemini's schema; nothing about the loop shape changes between providers. Haiku is the default because the per-turn work is narrow: read the snapshot, pick one tool. Larger models do not get proportionally better at that task.
Is there any codegen at all? I keep seeing 'generates real Playwright code' on your site.
The runtime is Playwright MCP, not a generated .spec.ts file. Each tool call (click, type_text, etc.) invokes a real Playwright browser method over MCP. What 'real Playwright' means in our context is: we drive a real playwright.chromium.Page with the real Playwright API, not a vendor runner that merely resembles Playwright. No proprietary YAML, no DSL. If you want a portable .spec.ts for CI outside Assrt, the model can emit one on request, but the day-to-day run lives in the MCP loop. Tests are yours to keep, as English files.
Can I verify the automation model myself?
Read three files. First, /Users/matthewdi/assrt/src/core/agent.ts lines 198-254 for the SYSTEM_PROMPT that contains the two-line rule. Second, lines 16-196 of the same file for the TOOLS array. Third, lines 608-780 for the loop itself. That is under 200 lines of source and it is the whole automation. Everything else in the repo is plumbing: MCP session management, video recording, disposable email, Freestyle VM orchestration. The core NL-to-browser mechanism is a while loop that re-snapshots, re-prompts, and dispatches tool calls.
What is the cost profile of a loop turn versus static codegen?
A loop turn is one Anthropic API call at Haiku pricing with a growing messages array (ARIA snapshots take the most tokens; a typical snapshot is 8-30k tokens). A 5-step case usually runs 8-12 turns including snapshots and the final complete_scenario, so 100-300k tokens per case. Static codegen amortizes the cost (run the pipeline once, run the tests a thousand times for free), but rebuilds required on UI changes bring the cost back. In practice, if your UI is stable enough that codegen is cheap, our loop is also cheap; if it churns, our loop stays cheap while codegen pipelines rebuild.
Related guides
How to write natural-language test case descriptions
The grammar side of this same system: one regex defines what an LLM browser agent can parse into scenarios.
AI-powered agentic test execution with automation
Where the agent loop fits in the broader shift from scripted tests to live LLM-driven execution.
Automated self-healing tests
Why there is no healing step when the selector is re-resolved from the ARIA tree on every turn.