How to write natural-language test case descriptions an LLM browser agent can actually run
The top search results for this keyword describe academic NLP pipelines or proprietary English-flavored DSLs like testRigor. This guide is the opposite: a single open-source parser, a one-regex grammar, and a fixed tool surface that an LLM agent maps your prose onto. You are not writing to a vendor; you are writing to a runtime. Once you see the runtime, writing becomes mechanical.
What every top result actually answers
Search this keyword and you get two clusters. The first is academic: NAT2TESTSCR, ICSE 2022 papers, and thesis PDFs on generating tests from formal requirements with tokenization, POS tagging, and SCR specifications. The second is vendor marketing: testRigor, AccelQ, Testsigma pitching their English-flavored DSL as "natural language" while enforcing a proprietary keyword set that fails silently when you step outside it. Neither answers the operational question: if you have an LLM browser agent today, what shape of sentence does it actually execute well?
Helpful if you are writing a research paper. Not helpful if you have a dev server running on localhost:3000 and you need a test by lunch.
You write what looks like English but is actually a proprietary keyword chain. Step outside it and nothing runs.
This guide answers the third question. The Assrt runtime is an LLM browser agent with a fixed tool surface and a one-regex parser. Once you have seen both, writing a natural-language test case stops being a vibes exercise. You are writing sentences that compile to tool calls, and you know exactly which sentences compile.
Anchor fact: the whole grammar is one regex
When people say "natural language test case" they usually mean "not code, but actually a DSL we made up." In Assrt the parser for natural-language #Case blocks is literally a single regular expression in one file. Read it once and you know the entire shape of acceptable input.
“Everything after the header is passed verbatim to the LLM. There is no keyword vocabulary to memorize because there is no keyword vocabulary.”
— parseScenarios, agent.ts:620-631
Headers like '#Case 1: Login', 'Scenario 2. Reset password', and 'test 3: search works' all match the parser regex. Case-insensitive, optional #, optional number, colon or period. Pick one shape per project and stick with it.
Anchor fact: you are writing to an 18-tool surface
The English after the header is not free-form. The agent has exactly 18 tools and every sentence you write has to compile to one of them. When you see the list you can predict which sentences will work and which will stall before you even run the test.
How a sentence actually executes
Three inputs feed into the agent loop: your prose, the live accessibility tree, and the 18-tool catalog. The agent binds a sentence to a tool per turn and re-reads the tree between every action. Nothing is pre-compiled.
scenario.md → agent loop → browser
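The loop described above can be sketched in code. Everything below is an illustrative stand-in, not the real Assrt internals; in the actual runtime an LLM, not a heuristic, does the sentence-to-tool binding.

```typescript
// Illustrative sketch of the per-turn loop: prose in, tool calls out.
// All names here are hypothetical stand-ins for the real agent internals.
type ToolCall = { tool: string; args: Record<string, string> };

// Toy binder: in the real loop an LLM maps one sentence plus the live
// accessibility tree to a single tool call; here a regex heuristic stands in.
function bindSentence(sentence: string, tree: string): ToolCall {
  if (/^verify/i.test(sentence)) return { tool: "assert", args: { claim: sentence } };
  if (/^click/i.test(sentence)) {
    const name = sentence.replace(/^click\s+/i, "").replace(/\.$/, "");
    return { tool: "click", args: { name } };
  }
  return { tool: "navigate", args: { url: sentence } };
}

const steps = [
  "Click the Continue button.",
  "Verify the dashboard heading is visible.",
];

let tree = "(accessibility tree from snapshot)";
const trace: ToolCall[] = [];
for (const step of steps) {
  trace.push(bindSentence(step, tree)); // one sentence → one tool call per turn
  tree = "(re-snapshotted tree)";       // the tree is re-read between actions
}
console.log(trace.map((t) => t.tool).join(" → ")); // click → assert
```

The point of the sketch is the shape, not the binder: nothing is pre-compiled, and the same English sentence is re-interpreted against a fresh tree every turn.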
Writing one #Case, step by step
Four short moves. Do them in this order every time and you will stop producing scenarios that look correct but stall at runtime.
1. Write the header
Start with #Case N: short action-oriented name. The leading # is optional; the colon can be a period. These rules come from parseScenarios, not from style.
2. Name the elements
Three to five steps. Each one names a visible element by its accessible name and a single action: click, type, select, scroll, press, wait, verify.
3. Add an assertion
Close with 'Verify X is visible' or 'Verify the URL contains Y.' Without an explicit verify, the agent can mark the case passed on navigation alone.
4. Scope the data
For signup use 'Get a disposable email.' For external APIs use 'Use http_request to call <url>.' The agent has first-class tools for both; you just have to invoke them in prose.
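Putting the four moves together, here is a sketch of a complete case. The product, element names, and flow are invented; the disposable-email sentences follow the pattern the FAQ below documents.

```markdown
#Case 1: Sign up with a disposable email
Click the Get started button.
Get a disposable email and type it into the Email field.
Click Continue.
Wait for the verification code and paste it into the 6-digit field.
Verify the dashboard heading is visible.
```

Header that matches the parser, three to five steps naming visible elements, an explicit Verify at the end, and the data scoped through the agent's own email tools.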
Good prose vs DSL-flavored "natural language"
The first version below is what vendor docs call natural language. It is YAML, CSS selectors, and assertions that the LLM cannot satisfy with its 18 tools. What actually runs is the opposite: plain English sentences naming visible elements and observable outcomes.
Same intent, two different targets
```yaml
test_signup_flow:
  given: user is on homepage
  when:
    - user clicks "#root > div.hero > a.cta-primary"
    - user enters valid email matching /^\S+@\S+$/
    - user submits form via keyboard ENTER
  then:
    - backend responds with 201
    - response.body.token is a valid JWT
    - redirect to /dashboard within 500ms
    - session cookie has SameSite=Strict
  expected_rendering_width: 1440px
  verify_css:
    - .cta-primary.background === rgb(5, 150, 105)
  verify_network_performance:
    - time_to_first_byte < 200ms
```

Six rules the runtime enforces for you
Each card names a rule, the file that implements it, and what happens if you ignore it. None of these are style preferences; they are properties of the parser and the agent loop.
Header shape is the whole grammar
parseScenarios on agent.ts:621 splits on /(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi. Any of three keywords, optional #, optional number, colon or period. Everything after is raw text passed to claude-haiku-4-5. No Given/When/Then, no YAML, no Gherkin. If you use a foreign keyword like 'Feature' or 'Scenario Outline', the parser collapses the file into one unnamed scenario.
Intent, not implementation
The agent resolves each step against the live ARIA tree, so 'Click the Continue button' is the selector. 'Click .btn-continue' is a failure mode: the LLM is still looking at role+accessible name, and a CSS string is the least-useful hint you can give it.
Every sentence must map to a tool
18 tools. If you ask for CSS colors, viewport widths, Lighthouse scores, or request timings, there is no tool to call and the scenario stalls or the agent fakes an assertion. Keep sentences inside navigate / snapshot / click / type_text / select_option / scroll / press_key / wait / assert / http_request.
Three to five steps per case
The discovery prompt on agent.ts:256-267 says 3 to 4 actions max; the plan prompt on server.ts:219-236 says 3 to 5. Long scenarios fragment the agent's reasoning and exhaust its context on the first failure. Two well-scoped #Case blocks beat one twelve-step saga.
Each case is self-contained
Scenarios share browser state but the prompt tells the agent to treat each #Case independently. If your test needs login, write the login steps inside the case. Do not chain 'assume the user from Case 1 is still signed in.'
Observable, not internal
'Verify the dashboard heading is visible' is observable. 'Verify the JWT in localStorage has role=admin' is not. The agent uses an accessibility snapshot, not devtools. Evaluate exists as an escape hatch, but pushing assertions there loses the audit trail.
Rewriting a sentence that will not compile
The most common failure mode is writing assertions the 18-tool surface cannot satisfy. Here is a real before/after: three sentences that stall, rewritten as five that map one-to-one onto tools the agent has.
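The page's interactive example is not reproduced here, so the following is a reconstruction built from the exact traps the FAQ below names (brand color, mobile layout, raw status codes); treat the element names and the API endpoint as placeholders.

Before, three sentences no tool can satisfy:

```markdown
#Case 1: Signup looks right
Verify the brand color is teal.
Check the mobile layout.
Confirm the API returned 201.
```

After, five sentences that each bind to a tool:

```markdown
#Case 1: Signup succeeds
Click the Get started button.
Type "test@example.com" into the Email field.
Click the Sign up button.
Verify the dashboard heading is visible.
Then use http_request to call the signup status endpoint and verify it reports the new account.
```

The rewrite trades internal claims (colors, layouts, status codes) for observable outcomes plus one explicit http_request sentence.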
What a well-shaped case looks like at runtime
The parser splits once. The agent navigates, snapshots, binds a sentence to a tool, acts, re-snapshots. Each English line becomes one or two tool calls.
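As a sketch, one plausible trace — the tool names on the right are the real ones from the TOOLS array, while the scenario itself is invented:

```
#Case 1: Login works
Go to http://localhost:3000.                   → navigate, then snapshot
Click the Sign in link.                         → click (bound via role + accessible name)
Type "test@example.com" into the Email field.   → type_text
Verify the Dashboard heading is visible.        → assert, then complete_scenario
```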
What a bad case looks like at runtime
Three sentences that sound reasonable in a ticket but do not map to any of the 18 tools. The agent navigates, snapshots, finds nothing to do, and the case fails with a suggestion to rewrite.
Eight rules for prose that actually executes
These are the rules that fall out of the regex and the tool surface. Follow them and your scenario.md will run on any snapshot-based LLM browser agent, not just Assrt.
Natural-language test case, eight rules
- Use #Case N: name, Scenario N: name, or Test N: name as the header. These are the three keywords the parser recognizes; anything else is treated as plain text and your file becomes one scenario.
- Write each step as an English sentence that names a visible element and an action. 'Click the Continue button.' 'Type "test@example.com" into the Email field.' 'Verify the heading Welcome is visible.'
- Prefer accessible names to CSS or IDs. The agent reads role plus accessible name from the ARIA tree, so 'the Sign in link' beats '.nav-signin' because it survives the next refactor.
- Keep each case between 3 and 5 actions. Split multi-feature flows into separate #Case blocks; scenarios share browser state but have independent prompts.
- Always include at least one Verify sentence per case. Without it the agent may navigate and report 'scenario complete' without actually checking anything.
- For signup flows, write the steps that use disposable email: 'Get a disposable email. Type it into the Email field. Wait for the verification code. Paste it into the code field.' The agent owns create_temp_email and wait_for_verification_code.
- For integrations, write 'then use http_request to check <external API>' as its own sentence. The agent treats it as a tool call rather than guessing.
- Do not ask for things the tool surface cannot do: visual regression, CSS color assertions, responsive-layout checks, Lighthouse scores, or network timing. Those are different test classes and the agent will either stall or confabulate.
Natural-language-to-an-LLM-agent vs natural-language-as-DSL
Both approaches call themselves "natural language test cases." Their on-disk artifacts and runtime behavior are not the same thing.
| Feature | English-flavored DSL vendor | Assrt (prose + LLM agent) |
|---|---|---|
| Grammar | Proprietary DSL or Gherkin Given/When/Then, enforced by a parser you do not own | One regex: /(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi (agent.ts:621) |
| Selector surface | CSS, XPath, or a vendor locator string stored with the test | Accessible name + role from the live ARIA tree, resolved per run |
| Assertion surface | Often unbounded in the DSL, silently unimplemented at runtime | Observable DOM + external API via http_request (18 tools total) |
| What the agent actually executes | A recorded script or a vendor proprietary runner you cannot self-host | A Playwright MCP session driven by claude-haiku-4-5-20251001 |
| Max steps per case | Unbounded; long cases tend to hide the failure point | 3 to 5, enforced by system prompt (server.ts:219-236) |
| Artifact on disk | Vendor database, YAML bundles, or .spec.ts with embedded locators | One markdown file at /tmp/assrt/scenario.md (scenario-files.ts:17) |
| License to run | Up to $7,500 per month with seat limits | $0, open-source, self-hosted, bring your own LLM token |
Bring a ticket, walk away with a runnable #Case
Thirty minutes. You share a user story or a QA ticket; we rewrite it live into a #Case block, run it against your dev server through the Assrt MCP, and hand you the scenario.md plus the log so you can see every English sentence map onto a real tool call.
Book a call →
FAQ on writing natural-language test case descriptions
What is the exact grammar for a natural-language test case in Assrt?
The whole grammar is one regex in /Users/matthewdi/assrt-mcp/src/core/agent.ts on line 621: /(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi. It matches any of three case-insensitive keywords (Scenario, Test, Case), an optional leading #, an optional number, and a trailing colon or period. Everything after the header is passed to the LLM verbatim. There is no Given/When/Then, no Feature:, no Scenario Outline: — those are foreign keywords to this parser, and using them collapses the whole file into a single unnamed scenario. If you want multiple scenarios, use multiple of the three recognized headers.
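The regex quoted above can be exercised directly. A minimal sketch — the regex is verbatim from the guide; the sample file is invented:

```typescript
// The parseScenarios header regex, as quoted from agent.ts:621.
const HEADER = /(?:#?\s*(?:Scenario|Test|Case))\s*\d*[:.]\s*/gi;

// A two-scenario plan file using two different recognized header shapes.
const file = [
  "#Case 1: Login",
  "Click the Sign in link.",
  "Scenario 2. Reset password",
  "Click the Forgot password link.",
].join("\n");

// Splitting on the regex yields one chunk of prose per recognized header.
const scenarios = file.split(HEADER).filter((s) => s.trim().length > 0);
console.log(scenarios.length); // 2
```

Swap a header for 'Feature: Login' and the split no longer fires there, which is exactly how a Gherkin file collapses into a single scenario.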
What is the LLM actually doing with the sentences I wrote?
The plan text is concatenated into a user message under the SYSTEM_PROMPT at agent.ts:198-254, which tells the model it is an automated web-testing agent with 18 tools. The model is claude-haiku-4-5-20251001 by default (agent.ts:9). For each sentence, the model calls 'snapshot' to get the live accessibility tree, maps the sentence to one tool plus its arguments, calls it, and then re-snapshots. Your sentence is not parsed into intermediate structure: the same English is read on every turn, the selectors are generated from the live ARIA tree, and the 'execution plan' exists only in the model's context window for the duration of the run.
What is the 18-tool surface I should be writing to?
The TOOLS array on agent.ts:16-196 defines everything the agent can do: navigate, snapshot, click, type_text, select_option, scroll, press_key, wait, wait_for_stable, screenshot, evaluate, create_temp_email, wait_for_verification_code, check_email_inbox, assert, complete_scenario, suggest_improvement, and http_request. If a sentence cannot be satisfied by that set, no amount of clever wording will save it. Common traps: 'verify the brand color is teal' (no color tool), 'check mobile layout' (no viewport tool beyond a fixed resize), 'confirm the API returned 201' (no direct network assertion — use http_request explicitly, or assert on observable DOM).
How specific should my selectors be in prose?
As specific as the accessibility tree is. An ARIA node exposes a role (button, link, textbox, combobox, heading, …), an accessible name (usually the visible label), and sometimes a description. 'Click the Sign in link in the header' is better than 'Click Sign in' when there are two 'Sign in' links, because it narrows the match to a link inside a nav landmark. Prose specificity becomes ambiguity-breaking in exactly the cases where a CSS selector would too — except you never have to rename the selector when the markup changes, because the sentence describes intent. Avoid CSS strings in prose; the LLM treats them as a weak hint compared to names and roles.
How many steps should a single #Case contain?
Three to five. The PLAN_SYSTEM_PROMPT in /Users/matthewdi/assrt-mcp/src/mcp/server.ts:219-236 says 3 to 5 actions max; the DISCOVERY_SYSTEM_PROMPT in agent.ts:256-267 says 3 to 4. This is not style, it is operational: longer cases exhaust the model's context on the first failure, hide which step actually broke, and reduce the usefulness of the per-case pass/fail signal. If a flow is longer, split it into sequential #Case blocks. Browser state (cookies, auth) carries between cases in the same run (agent.ts:239-241), so splitting does not force you to re-authenticate.
Do I have to use the #Case prefix, or will Scenario or Test work?
All three work and they are equivalent. The parser on agent.ts:621 accepts Scenario, Test, or Case, case-insensitive, with or without a leading #, with or without a number, with either a colon or a period as the separator. '#Case 1: Login', 'Scenario 2. Reset password', 'test 3: search works' all parse into the same scenario shape. Pick one convention per project so your plan files diff cleanly. The rest of the ecosystem (assrt_plan generation, assrt_diagnose corrections) emits #Case by default (see PLAN_SYSTEM_PROMPT), so unless you have a reason, follow that.
How do I write a test that needs a real email verification code?
Name the tools in prose and the agent will invoke them. Example: '#Case 1: Sign up with email verification. Click the Get started button. Get a disposable email and type it into the Email field. Click Continue. Wait for the verification code email. Paste the code into the 6-digit field. Click Verify. Verify the dashboard heading is visible.' Under the hood that maps to create_temp_email, type_text, click, wait_for_verification_code, an evaluate call that uses the specific 6-digit paste snippet on agent.ts:235, another click, and an assert. You do not have to know the snippet — the SYSTEM_PROMPT already teaches the model to use it when it encounters split-character OTP inputs.
What happens if I write Gherkin (Given / When / Then) instead?
The parseScenarios regex does not recognize Feature:, Scenario Outline:, Examples:, Given, When, or Then as structural keywords, so your whole file becomes one unnamed scenario and the Gherkin keywords end up as plain text inside it. The model will still try to execute it, because LLMs are forgiving, but you lose the per-case pass/fail signal and the scenarios do not show up individually in the results JSON at /tmp/assrt/results/latest.json. If you already have Gherkin, do a one-time rewrite into #Case blocks. The sentences inside mostly translate one-to-one.
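A sketch of that one-time rewrite — the feature, element names, and flow are invented:

```markdown
Feature: Password reset
Scenario Outline: Reset via email
  Given the user is on the login page
  When the user clicks "Forgot password" and enters "<email>"
  Then a confirmation message is shown
```

becomes

```markdown
#Case 1: Reset password via email
Go to the login page.
Click the Forgot password link.
Type "test@example.com" into the Email field and click Send.
Verify the confirmation message is visible.
```

Note that 'Scenario Outline:' does not match the header regex (the character after 'Scenario' must be a digit, colon, or period), so the Gherkin version parses as one unnamed blob, while the #Case version gets its own pass/fail entry.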
Can I assert on things that are not in the DOM, like API calls or emails?
Yes, with the http_request and check_email_inbox tools. For a webhook test, write the sentence 'Then use http_request to GET https://api.telegram.org/bot<token>/getUpdates and verify the latest message contains the test code.' For email, write 'Check the disposable email inbox and verify the most recent subject contains "Confirm your account".' These are two of the 18 tools (agent.ts:172-184 for http_request, and 128-131 for check_email_inbox), so invoking them in prose routes the step to the right call. What you cannot do from prose alone is anything outside that set — CSS colors, viewport emulation beyond a resize, performance budgets, Lighthouse audits, or screenshot-diff visual regressions.
How do I verify the grammar rules in this guide myself?
Read two files. First, /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 620-631 for the parseScenarios function (seven lines of code define the entire header grammar). Second, lines 16-196 of the same file for the TOOLS array — every sentence you write ultimately compiles to one of those 18 tool calls, and that is the exhaustive list. If your use case does not appear there, it cannot be expressed in a natural-language #Case today. Open an issue on the repo if you think the tool surface should expand; do not work around it by over-specifying the prose.
Adjacent guides on the no-DSL testing model
Keep reading
Self-healing tests guide
Why there is nothing to heal when the test is prose, resolved per run against the ARIA tree.
E2E testing for beginners
A plain-English entry point to end-to-end testing with AI browser agents.
Automation in QA
How the automation layer changes when the test artifact is a markdown file.