Service test automation where the test file is a prompt, not a spec folder.
Every other article on this keyword reads like the same 2018 framework decision tree: pick data-driven or BDD or keyword-driven, commit to Selenium or Cypress or Playwright, spin up a Page Object Model, wire it into CI, budget for a maintenance tax. That story assumes the test is a codebase. Assrt assumes it is a paragraph. One markdown file of English #Case blocks. An LLM agent reads it at runtime and drives real Chromium through exactly ten tools registered in a single TOOLS array. This page is about those ten tools and the file that replaces your test suite.
The claim in one paragraph
The spec file is the bottleneck of service test automation. Assrt deletes it. The tests are two-line English paragraphs; the runner is an LLM agent driving real Playwright under the hood.
Because the agent re-reads the plan and re-picks refs from a live accessibility tree on every run, a button rename does not break anything. Because the runner is the official @playwright/mcp package, the browser behavior matches what a Playwright script would do. Because the entire input is English, the coding agent that wrote your service can write the #Cases, too.
What every other guide on this keyword tells you to pick
The SERP for "service test automation" is a decision tree. Framework type, runner, assertion library, CI orchestrator, reporting layer. Each of the boxes below is a real choice a real team has debated. Assrt replaces the whole tree.
The incumbents all assume the test is a codebase you own and maintain. The framework picks are about how that codebase should be structured. Assrt starts from a different axiom: the test is the prompt, and the prompt is versioned as a single markdown file.
Four numbers that describe the whole stack
Everything downstream of this page is derivable from these four constants.
The anchor: exactly ten tools in one TOOLS array
This is the hinge of the entire design. The agent has a finite, inspectable vocabulary. Nine tools drive the browser; one fires an HTTP request. Every English sentence you write has to express itself as a sequence of these ten. If you can imagine the call stack, you can predict the behavior.
MAX_STEPS_PER_SCENARIO is Infinity because a real end-to-end flow can take dozens of actions: open a modal, upload a file, wait for OCR, pay with Stripe, refresh, assert that the confirmation email arrived. Unlike many AI test tools, there is no artificial step budget. The agent keeps going until a scenario completes or fails.
The file that replaces your test suite
This is a real plan.md. Three #Cases, a UI flow, an external webhook check, a rate-limit assertion. No imports. No page objects. No base URLs hardcoded. The only file you version.
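If the embedded file does not render in your context, the sketch below illustrates the shape such a plan.md could take. Every product detail, route, and the `${VAR}` placeholder syntax here are assumptions for illustration, not the contents of the actual file:

```markdown
#Case 1: Sign in with ${TEST_EMAIL} and ${TEST_PASSWORD}. Assert the dashboard greets the user by name.

#Case 2: Add the first product to the cart and check out with the saved test card. Then GET /api/webhooks/recent and assert the newest event type is "order.created".

#Case 3: Submit the password-reset form six times in a row. Assert the sixth attempt shows a rate-limit message.
```

Each paragraph is both the test and its documentation; nothing else needs to exist for the agent to run it.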
The trace of one run
This is what npx assrt run actually prints. Note that each browser action is preceded by a fresh snapshot, and http lines sit next to click lines in the same case.
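If the embedded trace does not render here, the mock below is a hand-drawn sketch of the shape described in the surrounding text (fresh snapshot before every action, http beside click), not verbatim CLI output:

```text
scenario_start   #Case 2
  snapshot       → a11y tree, refs e1–e38
  click          ref=e17 ("Add to cart")
  snapshot       → refs refreshed
  http_request   GET /api/webhooks/recent → 200
  assertion      newest event is "order.created" → pass
scenario_complete  #Case 2 → passed
```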
From English paragraph to real browser action
The runtime shape of one #Case. Three actors, one coordinator. The LLM never talks to the browser directly; every browser action is mediated by @playwright/mcp over stdio. Every HTTP assertion is a direct fetch from the same process.
One #Case: paragraph -> tools -> verdict
Where the service and the validator meet
Below: the data flow that makes "service test automation" a single command. Your agent-authored service, your production APIs, and your CI trigger all feed into the same Assrt MCP process, which drives a real browser and returns structured verdicts.
Your service, your agent, your real browser
Six mechanical properties that follow from the design
None of these are features bolted on top. They are consequences of the shape: one TOOLS array, one plan.md, one LLM tool loop over @playwright/mcp.
The scenario is the spec. No second artifact.
A #Case paragraph is the input and the documentation at once. There is no parallel test spec file to drift out of sync. When the flow changes, edit the paragraph.
Snapshot-then-act, every single action
The system prompt at agent.ts:207-209 forces the agent to call snapshot before each interaction, then use a fresh [ref=eN] handle. That is why a color or layout change never breaks the test.
Browser is persistent by default
Scenarios run in the same Chromium tab. Cookies, localStorage, and auth survive across #Cases. browser.ts implements this; the persistent profile lives under ~/.assrt/browser-profile.
http_request sits beside click
The agent can GET an API endpoint mid-scenario to verify a side effect, in the same tool loop as the browser actions. One UI/API test. No second runner.
Recording auto-opens at 5x
Every run writes a .webm. The generated HTML player (assrt-mcp/src/mcp/server.ts) opens in your browser at 5x, with Space/arrow-key seeking and 1x/2x/3x/5x/10x speed pills.
Self-hosted, open source, MIT
The runner is npx @assrt-ai/assrt. The core repos are assrt (web/recorder) and assrt-mcp (CLI + MCP server). No account required. Every byte of the stack is inspectable.
Six steps from zero to a passing service test
There is no scaffold. There is no generator. You install one package, write one file, call one tool. The ordering below is the exact flow end-to-end, and the last step is the recovery path when a scenario fails.
Install the MCP server
npx @assrt-ai/assrt setup registers the assrt MCP server with Claude Code, Cursor, or any MCP-speaking agent. Writes a reminder hook and updates your CLAUDE.md.
Write a plan.md in English
#Case 1: ... two sentences describing what to click, what to type, what to assert. No YAML, no Gherkin, no .spec.ts. You can keep it in the repo root next to README.md.
Call assrt_test or the CLI
The coding agent fires the assrt_test MCP tool with url and plan. Or from a terminal: npx assrt run --url ... --plan-file plan.md. Either way it spawns @playwright/mcp over stdio.
The agent reads the plan at runtime
Claude Haiku by default. Each #Case becomes a conversation turn. On every action the agent calls snapshot first, picks a [ref=eN] from the live a11y tree, then issues click/type_text/http_request.
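The loop in that step can be sketched as follows. This is a hedged stand-in, not the code in agent.ts: the LLM turn and the tool executor are injected as callbacks so only the shape of the loop is shown.

```typescript
// Hedged sketch of the agent tool loop described above; not the real
// assrt-mcp/src/core/agent.ts. The LLM turn and the executor are injected
// stand-ins so the loop's structure is visible in isolation.
type ToolUse = { name: string; input: Record<string, unknown> } | null;

async function runCase(
  nextToolUse: (history: string[]) => Promise<ToolUse>, // stand-in for one LLM turn
  execute: (name: string, input: Record<string, unknown>) => Promise<string>,
): Promise<string[]> {
  const history: string[] = [];
  // MAX_STEPS_PER_SCENARIO is Infinity: loop until the model stops calling tools
  while (true) {
    const call = await nextToolUse(history);
    if (call === null) return history; // no tool call => scenario complete
    history.push(`${call.name} -> ${await execute(call.name, call.input)}`);
  }
}
```

The real executor dispatches browser tools to @playwright/mcp over stdio and http_request to a direct fetch; the loop itself stays this small.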
Assertions stream back as events
scenario_start, step, assertion, improvement_suggestion, scenario_complete. A structured JSON report lands in /tmp/assrt/results/latest.json. The video auto-opens in a browser tab at 5x.
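If you consume that JSON in a script, the pass/fail gate can be as small as the sketch below. The field names are guesses inferred from the event list above, not the documented schema:

```typescript
// Hypothetical report shape inferred from the event names above; the real
// /tmp/assrt/results/latest.json schema may differ.
interface ScenarioResult {
  name: string;
  status: "passed" | "failed";
}
interface Report {
  scenarios: ScenarioResult[];
}

// Exit 0 only when every scenario passed, mirroring a CI gate.
function exitCode(report: Report): number {
  return report.scenarios.every((s) => s.status === "passed") ? 0 : 1;
}
```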
Failed run? Use assrt_diagnose
The diagnose MCP tool reads the run's events, screenshots, and DOM snapshots and returns a root-cause analysis with a suggested fix. You never guess what went wrong from a flat log.
Classic service test automation vs. Assrt
Every line below is the honest shape of the tradeoff. The incumbents are not wrong for their use case; they just assume the test is a codebase.
| Feature | Selenium / Cypress / Playwright spec folders + DSLs | Assrt |
|---|---|---|
| Where tests live | A repo folder of .spec.ts, .feature, or .java files that must be kept in sync with the UI | One plan.md of English #Case paragraphs. No folder. No framework. |
| What the runner reads | A compiled test script with hard-coded selectors and assertions | A prompt. An LLM agent interprets each #Case fresh, picking refs from a live accessibility tree on every run. |
| When selectors change | CI breaks. Someone reads the diff, updates locators, reruns. | Agent gets the new snapshot and re-picks the ref. No code change needed for cosmetic UI drift. |
| Mixing UI and API assertions | Two runners, two artifacts, a CI step to glue them | http_request is in the same TOOLS array as click. Both halves live in one #Case. |
| Cross-scenario state | Per-suite beforeEach / afterEach hooks | Shared browser session by default. Cookies, auth, localStorage carry across #Cases. |
| Artifacts from a run | JUnit XML, HTML report, maybe a CI artifact | Video recording (5x playback auto-opens), screenshot per step, structured JSON report, shareable run URL. |
| Who can author a test | A QA engineer or SDET who knows the DSL | Anyone who can describe the flow in English. Including the coding agent that wrote the service. |
| Cost to start | Framework setup + CI wiring + locator maintenance cadence | npx @assrt-ai/assrt setup. Free. MIT. Zero vendor lock-in. |
The honest tradeoff
If you need hand-tuned deterministic test scripts with line-level coverage reports for a regulated industry audit, a classic Playwright or Cypress suite is still the right pick. Assrt is not competing on that axis.
Assrt is the right pick when the cost of maintaining the suite is larger than the cost of running it. When a UI rename breaks a dozen tests no one wants to fix. When the flow to test is a thing the product manager can describe in three sentences. When the coding agent that wrote the service is still in the loop. For that job, a markdown file plus a tool loop beats a codebase plus a framework.
Why a traditional framework cannot collapse into a prompt-driven runner
The instinct is always "add an AI layer on top of my existing Playwright suite." The problem is that the suite is what you are trying to delete. Playwright scripts hard-code selectors; the LLM layer cannot rewrite them mid-run without becoming a second, fragile codegen step. A framework like ServiceNow ATF encodes the flow in its own XML schema; an AI layer that interprets English still has to translate to that schema before executing. Karate is a DSL; Gherkin is a DSL; every DSL is a second grammar for the agent to guess against.
Assrt starts from the other end. The TOOLS array is the grammar. The plan.md is the sentence. The agent is the parser. There is no second language, no second runner, no second artifact. Adding a new capability is a new entry in TOOLS and a new case in the executor switch. That is the whole product surface.
Turn your next feature spec into a passing #Case in 20 minutes
Give us a URL and one flow that should work end to end. We write the plan.md live, run npx assrt-mcp, and watch the ten-tool loop validate your service in a real browser. You keep the plan.
Book a call →

Service test automation questions the framework guides skip
What do you mean by 'no test suite'? There is no .spec.ts file at all?
Correct. The artifact you maintain is a plan.md file (or inline string) of #Case paragraphs. There is no test runner to install (beyond @playwright/mcp, which Assrt spawns for you), no test helper library to import, no selector constants file to keep in sync. The entire input is English. At runtime, an LLM agent reads the plan and drives real Chromium through the ten tools in assrt-mcp/src/core/agent.ts. Everything else, including the pass/fail verdict, is emitted as events from the tool loop. If you delete plan.md, nothing remains that looks like a traditional test.
How does the agent not hallucinate a click on a button that does not exist?
Because it does not guess. The system prompt at agent.ts:207-209 explicitly requires calling snapshot before every interaction. Snapshot returns the live accessibility tree with [ref=eN] handles. The agent must pick a ref that is present in that tree and include it with the click. If the element is not on the page, there is no ref to reference, and the agent records a failed assertion with evidence instead of a fake click. The refs come from @playwright/mcp, not from the LLM.
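The guard is easy to picture. The sketch below illustrates the idea, not the actual validation code: refs are parsed out of the snapshot text, and a click is only legal when its ref is in that set.

```typescript
// Illustrative only: parse [ref=eN] handles out of a snapshot string and
// reject a click whose ref is not present. Not the actual assrt-mcp code.
function refsIn(snapshot: string): Set<string> {
  return new Set([...snapshot.matchAll(/\[ref=(e\d+)\]/g)].map((m) => m[1]));
}

function canClick(snapshot: string, ref: string): boolean {
  return refsIn(snapshot).has(ref);
}
```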
Is this really different from 'AI-powered test automation' I have read about for years?
The differences are mechanical. First, the tests are the prompt: there is no code-gen step that turns English into Playwright scripts you then maintain. Second, the browser driver is the official @playwright/mcp (Microsoft's own Playwright MCP server, the same package you would use directly), not a proprietary replay engine. Third, MAX_STEPS_PER_SCENARIO is Infinity (agent.ts:7), so a single #Case can run for hundreds of steps without an artificial cutoff. Fourth, the runner is distributed as an MCP server, so the same coding agent that writes your service is the one that calls assrt_test. That feedback loop did not exist two years ago.
How do tests stay deterministic if an LLM interprets them every run?
Three mechanisms. Assertions you write as English are graded against the agent's evidence (screenshots, DOM snapshots, tool results), not against its opinion. Concrete phrases like "wait for text 'Order confirmed'" or "GET /api/orders and assert response.orders.length == 1" are pass/fail with a clear signal. Variables are substituted before the prompt ever reaches the model, so secrets and IDs are stable. And the default model is Claude Haiku with temperature 0 where supported; bigger models are available behind --model when you want stronger reasoning on flaky flows. Determinism is a function of how specific your #Case is, which is the same tradeoff a human QA engineer already makes.
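The variable-substitution step can be pictured like this. The `${NAME}` syntax below is an assumption for illustration; the actual plan syntax and implementation are not shown in this article:

```typescript
// Illustration of pre-prompt variable substitution; the real syntax and
// implementation in assrt-mcp may differ.
function substituteVars(plan: string, vars: Record<string, string>): string {
  // unknown variables are left intact rather than silently blanked
  return plan.replace(/\$\{(\w+)\}/g, (whole, key) => vars[key] ?? whole);
}
```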
Can I run this as part of CI, or is it only for local development?
It runs anywhere Node runs. In CI, call npx @assrt-ai/assrt run --url $PREVIEW_URL --plan-file plan.md --json and parse the JSON output. The --headless default is fine for CI; --extension is for attaching to a developer's already-open Chrome when you need a real login session. Videos are written to /tmp/assrt/<run>/video and can be uploaded as CI artifacts. If you sync to app.assrt.ai (optional), every run gets a shareable URL; otherwise nothing leaves your runner.
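As a concrete shape, a CI step could look like the fragment below (GitHub Actions syntax; the step name and the PREVIEW_URL variable are assumptions):

```yaml
- name: Service tests
  run: npx @assrt-ai/assrt run --url "$PREVIEW_URL" --plan-file plan.md --json > result.json
```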
Where is the code for the ten-tool TOOLS array I keep seeing referenced?
assrt-mcp/src/core/agent.ts, lines 14 to 196. The const is literally called TOOLS. Each entry is a plain object with name, description, and input_schema in the shape the Anthropic SDK expects. The executor switch lower in the same file dispatches each tool_use to either a @playwright/mcp call (for browser tools) or a direct fetch (for http_request). The SYSTEM_PROMPT immediately below the array (lines 198-254) is what tells the agent to snapshot first and use refs. Clone assrt-mcp and grep for 'TOOLS:' to read it yourself. MIT license, no account required.
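For orientation before you clone, an entry in that shape looks roughly like this. The description text and schema details below are illustrative, not copied from agent.ts:

```typescript
// Illustrative tool entry in the Anthropic tool-use shape the text describes
// (name, description, input_schema). Details are assumptions, not agent.ts.
const clickTool = {
  name: "click",
  description:
    "Click the element identified by a [ref=eN] handle from the latest snapshot",
  input_schema: {
    type: "object",
    properties: { ref: { type: "string", description: "handle such as e12" } },
    required: ["ref"],
  },
};
```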