Long-running agents need test signal as a pulse, not a blocking call
Direct answer (verified 2026-05-21)
Run the test as a detached child process. Have it write structured JSON to a known file path. Let the agent read that file whenever it wants the verdict. In Assrt the command is `npx assrt run --url <url> --plan "..." --json`, launched as a background Bash command, and the verdict lands at `/tmp/assrt/results/latest.json`. The conversation thread is never held open for the duration of the run. The source layout is defined in github.com/assrt-ai/assrt-mcp.
When a coding agent runs for an hour, the part of the testing stack that matters is not the runner. It is the interface. A blocking SDK call stalls the model for the full test duration; a polling pulse lets the model keep editing while the browser run happens out of band. Most posts about agents and tests talk about whether the agent ran them. The question that actually decides whether a long-running agent stays useful is how the signal gets back.
The thesis in one line
Test signal for a long-running agent is a pulse the agent polls when it wants the answer, not a synchronous call that hangs the agent until the test finishes. Any test interface that holds the agent's conversation thread open for the full run is wrong for this shape of work, no matter how good the runner under it is.
Where blocking calls actually hurt
A two-minute test inside a one-shot agent task is barely noticeable. The same two-minute test inside an hour-long task, fired ten times as the agent iterates, is twenty minutes of dead air. The model has nothing to do during a blocking call. It cannot read the next file, edit the next handler, or even update its plan. The tool call returns when the test returns and not before.
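The arithmetic generalises to any run count and duration. A tiny sketch with illustrative numbers only; the function name is hypothetical.

```python
def blocked_minutes(runs: int, seconds_per_run: float) -> float:
    """Minutes the agent spends waiting if every test run is a blocking call."""
    return runs * seconds_per_run / 60


# The example from the text: a two-minute test fired ten times.
print(blocked_minutes(runs=10, seconds_per_run=120))  # 20.0
```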
The deeper cost is rhythm. Long-running agents work by emitting many small tool calls in tight sequence, each producing a short result the model reads in a few tokens and acts on. A blocking test call breaks that rhythm twice: once with a long silence, and once with a giant report that arrives all at once. The model has to re-orient to its own task after the dump, which often means rereading files it already read or recapping decisions it already made. On real tasks, a long-context agent can lose something like ten or twenty percent of its useful attention to that re-orientation tax.
The third cost is failure handling. If the runner hangs or the browser crashes mid-scenario, a blocking call gives the agent no clean way to time out, because the agent's next turn is gated on the call returning. A pulse pattern fails open by default: if the file never appears, the agent moves on and treats the run as a no-op.
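One way to picture "fails open" is a check that never waits. It looks once and reports one of three states. This is a minimal sketch, not anything shipped in Assrt: the function name, the state names, and the give-up budget are all hypothetical. It also assumes the path is specific to the current run, so an older file cannot be mistaken for a fresh verdict.

```python
import time
from pathlib import Path
from typing import Optional


def check_pulse(result_path: Path, launched_at: float, give_up_after_s: float,
                now: Optional[float] = None) -> str:
    """Return 'ready', 'pending', or 'abandoned' without ever blocking."""
    now = time.time() if now is None else now
    if result_path.exists():
        return "ready"
    if now - launched_at > give_up_after_s:
        # Runner hung or crashed: stop waiting and let the agent move on.
        return "abandoned"
    return "pending"
```

The agent calls this between other tool calls. The timeout lives in the agent's own loop, not inside a call it cannot interrupt.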
The pulse pattern, concretely
Three pieces. A fire-and-forget command. A known file path. A read.
In Assrt the command is the open-source CLI invoked as a detached process. The runner takes ownership of a real Chromium browser, drives the scenario the agent passed in, and at the end writes a structured report to /tmp/assrt/results/latest.json plus a per-run history file at /tmp/assrt/results/<runId>.json. The path constants are defined in assrt-mcp/src/core/scenario-files.ts at lines 16 to 20. The writer is writeResultsFile at lines 77 to 84 of the same file. The same writer runs whether the agent invoked the MCP tool or the CLI, so the file shape is identical across paths.
Note what does not happen in that sequence. The agent never holds a connection open. The agent never blocks on a tool call. The test process runs to completion under its own steam, writes its verdict to a file the agent can read on demand, and exits. If the agent checks back too early, the file is missing or contains a status marker; the agent moves on and checks again later.
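For a harness that is not using a built-in background Bash tool, the three pieces can be sketched in a few lines of Python. This is an illustration under assumptions, not Assrt's own code: `fire` and `read_verdict` are hypothetical helper names, and only the CLI flags and the results path come from the text above.

```python
import json
import subprocess
from pathlib import Path
from typing import Optional

LATEST = Path("/tmp/assrt/results/latest.json")  # path given in the text


def fire(argv: list[str]) -> int:
    """Start argv as a detached child and return its PID immediately."""
    child = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX: run in its own session, not ours
    )
    return child.pid


def read_verdict(path: Path = LATEST) -> Optional[dict]:
    """Return the parsed report, or None if it is missing or not yet valid JSON."""
    try:
        return json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None


# Usage (not executed here):
#   fire(["npx", "assrt", "run", "--url", url, "--plan", plan, "--json"])
#   ... keep editing ...
#   report = read_verdict()
```

`fire` returns as soon as the child is spawned, and `read_verdict` returns `None` instead of raising, so neither call can hold the agent's turn open.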
The verbatim guidance the MCP server gives on this pattern is shipped inside the source itself. Lines 314 to 327 of assrt-mcp/src/mcp/server.ts read, in the server's own instructions block: "The MCP tools (assrt_test, assrt_plan) block the conversation until they finish. To run tests without blocking, use the assrt CLI via the Bash tool with run_in_background: true". The blocking path exists on purpose for short scenarios and quick checks. The pulse path exists on purpose for long-running agents.
The same test, two different conversation shapes
Imagine the agent is two hours into a refactor that touches the signup flow. It needs to verify the end-to-end happy path still works. Here is what the interface change looks like from the agent's point of view.
Blocking SDK call vs polling pulse
Agent calls test_runner.run(...). The tool call is now open. The model literally cannot do anything else for the next 90 seconds. When the call returns, a 24,000-token report lands in context all at once. The model spends the next two turns re-reading what it was doing and reconciling the report with files it has not looked at since the run started. Total dead time: 90 seconds of the run plus another minute of re-orientation. If the runner crashes at second 70, the model still has to wait the full call timeout before getting control back.
- Conversation thread held open for the full test duration
- Big report dumped at once, breaks the agent's working rhythm
- No clean timeout, agent depends on the runner exiting
What this means if you are designing tools for agents
Anything that takes more than a few seconds and is going to be called many times inside a single agent task should ship with two interfaces: a blocking one for short scenarios and ad-hoc invocation (the friendly default), and a non-blocking one for the long-running case. The non-blocking one fires the work as a detached process, writes structured output to a known address, and exposes that address to the agent so a future tool call can read it. Test runners are the obvious case. Build pipelines, batch jobs, dataset extraction, slow LLM evaluations, and anything else with latency measured in tens of seconds or more are in the same bucket.
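From the tool author's side, one writer can serve both interfaces. Below is a minimal sketch of a generic wrapper for any slow command. Everything in it is hypothetical: the record fields (`status`, `exit_code`, `duration_s`, `stdout_tail`), the function names, and the example script name. None of it is drawn from Assrt. Calling `run_and_record` directly is the blocking interface. Launching the same script as a detached process and reading the status file later is the non-blocking one.

```python
import json
import os
import subprocess
import sys
import time
from pathlib import Path


def write_atomic(path: Path, record: dict) -> None:
    """Write JSON via a temp file plus rename so readers never see a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(record))
    os.replace(tmp, path)


def run_and_record(argv: list[str], status_path: Path) -> dict:
    """Run argv to completion, record the outcome at status_path, and return it."""
    started = time.time()
    write_atomic(status_path, {"status": "running", "started_at": started})
    proc = subprocess.run(argv, capture_output=True, text=True)
    record = {
        "status": "done",
        "exit_code": proc.returncode,
        "duration_s": round(time.time() - started, 3),
        "stdout_tail": proc.stdout[-2000:],
    }
    write_atomic(status_path, record)
    return record


if __name__ == "__main__" and len(sys.argv) > 2:
    # Non-blocking use: launch this script detached, for example
    #   python slow_job.py /tmp/jobs/build.json make build
    run_and_record(sys.argv[2:], Path(sys.argv[1]))
```

The rename step matters for polling readers: the status file always holds either the complete "running" marker or the complete final record.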
A useful sanity check: would the same interface make sense if a human used it inside a CI pipeline? CI is the original long-running agent of software engineering. It runs detached. It writes structured output to a known place. The developer reads the result later. Agents converge on the same shape because the underlying constraint is the same: the consumer of the signal has better things to do during the work.
The pitfalls of polling, and how to avoid them
The pulse pattern has two real failure modes. The first is the agent firing a test and then forgetting to read the result before declaring victory. Mitigate this with a habit and a checkpoint: tell the agent (in its system prompt or harness instructions) to read /tmp/assrt/results/latest.json before finishing any task that ran a test, and have the harness sweep that directory once before emitting the final response. A senior engineer doing the same task by hand would check CI before opening the PR; the agent should too.
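The harness-side checkpoint can be a single gate that runs before the final response. This is a sketch under assumptions: the function name is hypothetical, the harness is assumed to track whether any test was launched, and the report is assumed to carry a top-level boolean `passed` field. Adjust that field to the real schema.

```python
import json
from pathlib import Path


def final_gate(ran_tests: bool, latest: Path) -> tuple[bool, str]:
    """Decide whether the agent may finish; returns (ok, reason)."""
    if not ran_tests:
        return True, "no tests were launched in this task"
    try:
        report = json.loads(latest.read_text())
    except FileNotFoundError:
        return False, "test launched but no verdict on disk yet"
    except json.JSONDecodeError:
        return False, "verdict file is not valid JSON (possibly mid-write)"
    # ASSUMPTION: a top-level boolean "passed" field in the report.
    if isinstance(report, dict) and report.get("passed") is True:
        return True, "latest run passed"
    return False, "latest run did not pass"
```

A harness would call `final_gate(ran_tests, Path("/tmp/assrt/results/latest.json"))` and refuse to emit the final answer while `ok` is false.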
The second is reading a stale file. If the agent reads latest.json before the current run finished, it sees the previous run's verdict and acts on it. Two defenses. Use the per-run history file at /tmp/assrt/results/<runId>.json: the runId is generated when the run starts, the agent records it (one line in working memory), and the read is keyed to that exact run. Or, simpler, write a small status sentinel at the start of the run that the writer clears at the end; the agent reads the sentinel first.
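Both defenses fit in one small read helper. This is a hedged sketch. How the agent learns the `runId` is not specified here, so it is simply a parameter. The sentinel file is a harness convention you would add yourself, not something the text says Assrt writes. The helper name is hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def read_fresh(results_dir: Path, run_id: str,
               sentinel: Optional[Path] = None) -> Optional[dict]:
    """Return the report for run_id, or None while that run has no verdict yet."""
    if sentinel is not None and sentinel.exists():
        return None  # sentinel still present: the run has not finished
    try:
        # Keyed to the exact run, so latest.json from an older run is ignored.
        return json.loads((results_dir / f"{run_id}.json").read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```

A previous run's `latest.json` can sit in the same directory without affecting the answer, because the read never touches it.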
Wiring this into your agent harness?
If you are building a long-running agent and want help mapping the test pulse pattern onto your stack, I am happy to look at the harness with you and make it concrete.
FAQ
What does a long-running agent actually do when a test starts?
Two things at once, if the interface lets it. It launches the test (the test runner now owns a real browser and is clicking through a scenario), and it goes back to coding. The interface decides whether those two things can really happen in parallel. A blocking SDK call ties the agent's conversation thread to the test process; the model literally cannot think about anything else until the test returns. A non-blocking interface detaches the test process from the conversation thread, so the model can ship another patch, read another file, or call another tool while the browser run is happening in the background.
Why is a blocking test call so bad for a long-running agent?
Three reasons compound. First, latency. A non-trivial end-to-end suite takes 60 to 600 seconds. If the agent's interface to test signal is a synchronous call, the agent sits idle for that entire interval. Multiply by 10 test runs over a multi-hour task and you have lost anywhere from ten minutes to well over an hour to literal nothing. Second, retries. If the test runner crashes or hangs, a blocking call can hang the whole agent. The model has no way to time out cleanly because it cannot interleave a timeout check with the call. Third, narrative compression. Long-running agents work by issuing many small tool calls in a row, each producing a short result the model can read in a few tokens. A blocking test call breaks that rhythm and dumps a 30,000-token report into the context after a long silence. The model has to re-orient to its own task from scratch.
What does a non-blocking test pulse look like in practice?
Three pieces. A fire-and-forget command (the agent kicks off the run as a detached child process and gets control back in milliseconds). A known file path the test writes results to when it finishes. A read of that file from a later tool call, however many steps on. In Assrt the command is `npx assrt run --url <url> --plan "..." --json --video`, the file is `/tmp/assrt/results/latest.json`, and the read is just whatever read-file tool the agent already has. The agent never holds a connection open. It remembers one fact (a run is in flight), comes back later, and reads the file. If the file exists and the JSON parses, the verdict is in.
Where in the Assrt source is this pattern actually wired up?
Two files. In `~/assrt-mcp/src/mcp/server.ts` around lines 314 to 327, the MCP server documents the pattern itself: "The MCP tools (`assrt_test`, `assrt_plan`) block the conversation until they finish. To run tests without blocking, use the `assrt` CLI via the Bash tool with `run_in_background: true`". The file path layout is defined in `~/assrt-mcp/src/core/scenario-files.ts`: ASSRT_DIR is `/tmp/assrt` (line 16), LATEST_RESULTS is `/tmp/assrt/results/latest.json` (line 20), and `writeResultsFile` (lines 77 to 84) writes the structured run report there at the end of every test run, plus a per-run UUID file at `/tmp/assrt/results/<runId>.json` for history. The MCP tool and the CLI share the same writer, so the file layout is identical regardless of how the agent kicked the run off.
Doesn't the agent need to know when the test finished?
Not in the strict sense. The agent needs to know what the test concluded by the time it next looks. The two are different. A blocking interface fuses them: the call returns when the test returns. A polling interface decouples them: the test writes its conclusion to disk whenever it finishes, the agent reads the conclusion whenever it next has spare attention. If the test takes longer than the agent's next checkpoint, the agent reads a 'still running' state (file is missing or has a status: running marker) and moves on. If it finished earlier, the agent reads a passed or failed verdict. The decoupling is the whole point.
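The decoupling can be made explicit by mapping whatever is on disk to one of three states. This is a sketch with assumed field names: the `status: running` marker is mentioned above, while a top-level boolean `passed` field is an assumption about the report shape. The function name is hypothetical.

```python
import json
from pathlib import Path


def pulse_state(path: Path) -> str:
    """Map the result file to 'running', 'passed', or 'failed'."""
    try:
        report = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return "running"  # nothing readable yet: treat as still in flight
    if not isinstance(report, dict) or report.get("status") == "running":
        return "running"
    # ASSUMPTION: a finished report carries a top-level boolean "passed".
    return "passed" if report.get("passed") is True else "failed"
```

The agent never asks "is it done?" as a separate question. It reads the state and branches on it.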
What does the structured JSON contain, concretely?
The same shape that the blocking MCP tool returns: a list of scenarios, each with a passed flag, the actions the agent took (snapshot, click, type_text, assert), per-step status, assertion-level pass and fail data, screenshots referenced by path, a video URL if --video was passed, and a top-level totals block (passed, failed, duration). Because the writer is the same in both paths (see `writeResultsFile` in scenario-files.ts), the long-running agent gets identical data to what a blocking caller would get. Nothing is downgraded for the non-blocking path. The history file at `/tmp/assrt/results/<runId>.json` keeps the same shape for every run, so an agent that wants to compare two runs (this attempt vs the previous one) does two file reads and a diff, with no SDK and no API call.
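"Two file reads and a diff" might look like the sketch below. The key names are assumptions, not the documented schema: it presumes a top-level `scenarios` list whose entries carry `name` and `passed`. Check a real report and rename accordingly. The helper names are hypothetical.

```python
import json
from pathlib import Path


def load_outcomes(path: Path) -> dict[str, bool]:
    """Read one report and map scenario name to pass/fail."""
    report = json.loads(path.read_text())
    # ASSUMPTION: report["scenarios"] is a list of {"name": ..., "passed": ...}.
    return {s["name"]: bool(s["passed"]) for s in report["scenarios"]}


def diff_runs(previous: Path, current: Path) -> dict[str, list[str]]:
    """Compare two run reports scenario by scenario."""
    before, after = load_outcomes(previous), load_outcomes(current)
    shared = before.keys() & after.keys()
    return {
        "fixed": sorted(n for n in shared if not before[n] and after[n]),
        "regressed": sorted(n for n in shared if before[n] and not after[n]),
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
    }
```

An agent would pass the two per-run history files and act first on whatever lands in `regressed`.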
How is this different from CI?
CI is also non-blocking from the developer's point of view; you push, you wait, you read the result later. The relevant difference is the loop. CI is one run per push, results come back in minutes, the developer is doing other work meanwhile. A long-running agent fires test runs as a tool inside a single task, expects to issue several of them as it edits code, and needs the results back fast enough to inform the next edit. That is closer to a development pulse than a deploy gate. A 90-second test run inside an hour-long agent task is fine. A 90-second test run inside a CI pipeline that took 8 minutes to set up the environment is the same wallclock time but a totally different shape. The pulse model serves the agent rhythm; the CI model serves the deploy rhythm.
What happens if the agent forgets to read the result file?
The agent finishes its task without checking the verdict, which is bad. There are two reasonable defenses. The first is to make the read a habit: any system prompt that introduces the agent to Assrt should describe the pulse pattern (run the test, do other work, then before declaring victory, read `/tmp/assrt/results/latest.json` and confirm passed:true). The second is a checkpoint sweep. Before the agent emits its final answer to the user, it does one last batch of read-file calls against `/tmp/assrt/results/` and any other status files. If the latest run is unread, read it now. If it failed, do not declare victory. This is the same discipline a senior engineer would apply at the end of a long task: check the CI you fired half an hour ago before opening the PR.
Is this pattern open or proprietary?
Open. The MCP server source is at github.com/assrt-ai/assrt-mcp, MIT-style permissive licence. The output of any run is standard Playwright test files plus the JSON report at `/tmp/assrt/results/`. If you want to consume Assrt's pulse from your own agent harness (Claude Code, Cursor agent, OpenCode, a homegrown background daemon, anything), the contract is: kick `npx assrt run --json` as a detached process, then read `/tmp/assrt/results/latest.json` later. No SDK to install, no API token, no service to call. The whole pattern fits in two file paths and one command. The lock-in cost is zero, because there is nothing to lock in to.
What about agents that run on someone else's machine, like Devin?
The pattern generalises with one substitution. Replace `/tmp/assrt/results/latest.json` with whatever artefact storage the sandbox exposes (a workspace file, an S3 key, a key in a state store). The contract is still: detached process writes structured JSON to a known address, agent reads that address later. The Assrt CLI's `--json` output goes to stdout, so an agent running in a sandboxed VM can pipe it to a file inside the workspace and treat that file as the pulse. The blocking-vs-polling distinction is independent of whether the agent runs on a laptop or a cloud VM; what matters is whether the conversation thread is held open during the test.
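In a sandbox, the substitution amounts to redirecting the child's stdout into a workspace file. This is a sketch under assumptions: the helper names and the workspace path are hypothetical, and it presumes the `--json` stdout stream contains only the JSON report. Because the file exists (empty) from the moment of launch, the reader treats "parses as JSON" as the completion signal, not "file exists".

```python
import json
import subprocess
from pathlib import Path
from typing import Optional


def fire_to_file(argv: list[str], out_path: Path) -> int:
    """Start argv detached with stdout redirected to out_path; return its PID."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "wb") as sink:
        child = subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=sink,  # the child keeps its own handle after we close ours
            stderr=subprocess.DEVNULL,
            start_new_session=True,  # POSIX: run in its own session
        )
    return child.pid


def read_when_complete(out_path: Path) -> Optional[dict]:
    """Return the report once the file holds complete JSON, else None."""
    try:
        return json.loads(out_path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None


# Usage (not executed here; the workspace path is hypothetical):
#   fire_to_file(["npx", "assrt", "run", "--url", url, "--plan", plan, "--json"],
#                Path("workspace/assrt-pulse.json"))
```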
Related guides
Adjacent pieces on the same problem from a different angle.
AI coding test execution feedback loops: why tests make AI output reliable
Adjacent piece. Why the write-test-fix loop is the single biggest predictor of reliable AI output, with the loop itself as ground truth.
Playwright AI test agents, explained: the agent is a while loop
How the test agent itself works under the hood, traced line by line from the same Assrt source.
AI test maintenance cost: the inversion most teams miss
The other half of the long-running agent story. Why selectors are the maintenance bill and why re-deriving them at runtime changes the equation.