Deterministic, reproducible agent testing infrastructure: the seven knobs that turn an LLM into a test runner
The exact phrase “deterministic, reproducible agent testing infrastructure” does the topic a small disservice. An LLM-driven agent cannot be bit-for-bit deterministic. What you can build is one level up: a contract that, given the same inputs, produces the same pass / fail outcome and the same set of addressable artifacts. Reproducibility, not determinism. The contract has seven knobs, and all seven live in the assrt-mcp source.
Direct answer, verified 2026-05-05
You cannot make an LLM-driven test bit-for-bit deterministic. You can make the run reproducible by: (1) pinning the model to a dated snapshot, (2) minting the run UUID before the agent acts, (3) parameterizing inputs through {{KEY}} variables, (4) gating success on explicit pass criteria, (5) isolating the browser profile per run, (6) scoping every artifact to a per-run UUID, and (7) emitting standard Playwright code as the canonical artifact. Each of these is one or two functions in assrt-mcp/src.
“The default model is a dated snapshot, not an alias. Pinning is the floor of reproducibility, and Assrt sets it on line 9 of agent.ts.”
assrt-mcp/src/core/agent.ts:9
Determinism is the wrong target
When an engineer types this phrase into a search bar, they are usually asking one of two questions: “how do I keep my agent tests from being flaky” or “how do I prove the same test ran the same way twice.” The honest framing is that those are different problems. Determinism in the strict sense (every byte matches across runs) is unachievable as long as a transformer is in the loop. Providers re-tokenize, batch, and quantize on a schedule that is not part of your contract. Even at temperature zero, the same prompt can yield slightly different tool-call sequences across deploys.
What is achievable, and what testing actually needs, is reproducibility. A reproducible run is one where the inputs are pinned, the assertions are explicit, the artifacts have stable URLs, and the same set of inputs converges on the same pass / fail verdict. The model can wiggle on what it clicks first; the contract on what counts as “passed” does not.
Two different goals, two different toolchains
Strict determinism means every byte of every run is identical. Every tool call in the same order. Every screenshot pixel-identical. The agent runs the exact same path through the DOM each time.
- Requires the model to be a pure function of its inputs
- Requires the network and the DOM to be a pure function too
- Provider tokenization changes break this without warning
- Even temperature zero does not give you this
The seven knobs
These are the seven that have to be pinned for the contract to hold. Most other reproducibility advice on this topic reduces to one or two of them; missing the rest is what turns a good run today into a flaky one a week later.
The reproducibility contract
- Pinned model version. A dated snapshot, not an alias. Default is set at /Users/matthewdi/assrt-mcp/src/core/agent.ts:9 to claude-haiku-4-5-20251001.
- Run UUID minted upfront. crypto.randomUUID() runs at server.ts:429 BEFORE the agent acts, so cloud artifact URLs exist immediately.
- Variables substituted into the plan text. Plan templates use {{KEY}}; substitution happens at agent.ts:376-381 before the model sees the plan.
- Explicit pass criteria. passCriteria is rendered to the model as a MANDATORY block at agent.ts:670 and converted into named assertions in the final report.
- Profile isolation per run. isolated, default, and extension modes are defined in the tool schema at server.ts:354-357; persistent runs scrub SingletonLock and fall back to isolated on conflict.
- Per-run artifact UUIDs. Every screenshot, video, and event log is scoped to tmpdir()/assrt/<runId>; runs cannot overwrite each other.
- Real Playwright as the canonical artifact. The output is a .spec.ts file you commit; the LLM is no longer in the loop on the deterministic execution path.
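The list above can be condensed into a single record. The sketch below is illustrative TypeScript, not the actual assrt-mcp types; every field name is an assumption, but each field maps one-to-one onto a knob:

```typescript
import { randomUUID } from "node:crypto";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Illustrative only: these names are NOT the real assrt-mcp types.
// The point is that every knob is a concrete, loggable value.
interface RunContract {
  model: string;                       // knob 1: dated snapshot, not an alias
  runId: string;                       // knob 2: minted before the agent acts
  variables: Record<string, string>;   // knob 3: substituted into {{KEY}}
  passCriteria: string;                // knob 4: rendered as a MANDATORY block
  profileMode: "isolated" | "default" | "extension"; // knob 5
  artifactDir: string;                 // knob 6: tmpdir()/assrt/<runId>
  specFile: string;                    // knob 7: the committed Playwright spec
}

const runId = randomUUID();
const contract: RunContract = {
  model: "claude-haiku-4-5-20251001",
  runId,
  variables: { EMAIL: "qa+ci-001@example.com" },
  passCriteria: "The dashboard heading is visible.",
  profileMode: "isolated",
  artifactDir: join(tmpdir(), "assrt", runId),
  specFile: "signup-smoke.spec.ts",
};
console.log(Object.keys(contract).length); // 7
```

If you can serialize this record for a run, you can reproduce the run; if any field is implicit, you cannot.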
Knobs 1 and 2: pin the model, mint the run ID before the run
The first knob is the cheapest and the most often skipped. The default model in /Users/matthewdi/assrt-mcp/src/core/agent.ts:9 is a dated string, not an alias:
// /Users/matthewdi/assrt-mcp/src/core/agent.ts (line 9)
const DEFAULT_ANTHROPIC_MODEL = "claude-haiku-4-5-20251001";
const DEFAULT_GEMINI_MODEL = "gemini-3.1-pro-preview";

That string is the floor. Pass model to assrt_test and you override it for one call. The chosen value is then logged on every run inside the structured agent.run.start event at agent.ts:383, alongside the URL, the mode, and a flag for whether you passed pass criteria. If three months from now you cannot reproduce a CI failure, the model snapshot used by the failing run is one grep away.
The second knob fixes a subtler problem. Most testing tools allocate the run identifier on the server, after the run starts, often after the run finishes. That means the artifact URLs (the video, the screenshots, the page) do not exist until the run is over, which makes it hard to drop a deterministic permalink into a PR comment ahead of time. Assrt does the opposite. server.ts:407-432 mints the scenario UUID first (so it is the same across the assert and the upload), then mints a fresh run UUID via crypto.randomUUID() at line 429, then immediately builds the per-run artifact directory under tmpdir()/assrt/<runId> at line 430. By the time the agent fires its first action, every cloud URL the run will eventually expose has been computed by buildCloudUrls in scenario-store.ts:243-263.
One run, three identifiers, one set of stable URLs
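The mint-first ordering reduces to a few lines of Node built-ins. This is a sketch of the pattern, not the actual server.ts code, which additionally reuses the scenario UUID across the assert and the upload:

```typescript
import { randomUUID } from "node:crypto";
import { mkdirSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Mint the run identifier FIRST, so every artifact path is known
// before the agent takes a single action.
const runId = randomUUID();
const artifactDir = join(tmpdir(), "assrt", runId);
mkdirSync(artifactDir, { recursive: true });

// Any URL or path derived from runId is now stable for the lifetime
// of the run, even though no artifact has been written yet.
console.log(artifactDir);
```

Everything downstream (PR comments, dashboards, log lines) can reference the artifact paths while the run is still in flight.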
Knobs 3 and 4: parameterize inputs, force assertions to be explicit
A flaky test almost always has one of two root causes: an input that drifts (hardcoded email that collides on the second run, a hardcoded date that goes stale) or an assertion that is implicitly the agent’s opinion. Assrt fixes the first with variable interpolation and the second with mandatory pass criteria.
Variables are not a hint to the model. They are a substitution into the plan text BEFORE the model sees it. agent.ts:376-381 is a five-line regex replace:
// /Users/matthewdi/assrt-mcp/src/core/agent.ts (lines 376-381)
if (variables && Object.keys(variables).length > 0) {
  for (const [key, value] of Object.entries(variables)) {
    scenariosText = scenariosText.replace(
      new RegExp(`\\{\\{${key}\\}\\}`, "g"),
      value,
    );
  }
}

Pass criteria do the opposite. Instead of removing variance from the inputs, they remove ambiguity from the outputs. The MCP tool accepts a free-text passCriteria at server.ts:343, then agent.ts:670 wraps it in a labelled section the agent cannot ignore: “## Pass Criteria (MANDATORY). The test MUST verify ALL of the following conditions. Mark the scenario as FAILED if any condition is not met.” The agent has to call complete_scenario with one assertion per criterion, and the resulting TestReport.assertions[] is what your CI gates on.
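Rendering free-text criteria into that block is only a few lines. The function below is a simplified sketch of what agent.ts:670 does, not the actual code; only the quoted wording comes from the source:

```typescript
// Simplified sketch: wrap the caller's free-text criteria in a
// labelled section the model is instructed to honor.
function renderPassCriteria(passCriteria: string): string {
  return [
    "## Pass Criteria (MANDATORY)",
    "The test MUST verify ALL of the following conditions.",
    "Mark the scenario as FAILED if any condition is not met.",
    "",
    passCriteria,
  ].join("\n");
}

const block = renderPassCriteria(
  "Cart total displays $42.99 exactly. Confirmation email arrives within 30s.",
);
console.log(block.startsWith("## Pass Criteria (MANDATORY)")); // true
```

The value of the wrapper is not the prompt engineering; it is that each criterion becomes a named assertion your CI can gate on.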
The same flow, with and without the contract
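First, a sketch of the flow with the contract applied: pinned model, parameterized inputs, explicit pass criteria, isolated profile. The option names follow the tool schema described in this guide; treat the exact shape as illustrative rather than canonical.

```typescript
// Contract version: pinned model, parameterized inputs, explicit
// pass criteria, isolated profile. Re-running this in a CI matrix,
// each job with its own variables, converges on the same verdict.
await assrt_test({
  url: "https://example.com",
  model: "claude-haiku-4-5-20251001", // knob 1: dated snapshot
  isolated: true,                     // knob 5: profile lives in memory
  variables: { EMAIL: "qa+ci-001@example.com", PASSWORD: "Hunter22!" },
  passCriteria: "The dashboard loads and shows the signed-in user's email.",
  plan: `
#Case: Signup smoke
1. Click "Sign up"
2. Enter email "{{EMAIL}}"
3. Enter password "{{PASSWORD}}"
4. Submit
5. Verify the dashboard loads
`,
});
```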
// Naive: hardcoded inputs, vague success, no model pin.
// Re-running this on a different day with a fresh signup
// will fail because the email collides. The model is whatever
// the SaaS feels like running this week. The "did it work"
// answer is whatever the agent reported.
await assrt_test({
  url: "https://example.com",
  plan: `
#Case: Signup smoke
1. Click "Sign up"
2. Enter email "qa@example.com"
3. Enter password "Hunter22!"
4. Submit
5. Verify the dashboard loads
`,
});

Knobs 5 and 6: isolate the profile, scope the artifacts
The state-leak failure mode for agent tests is subtle. Two runs share a Chromium profile, run A signs in as qa@example.com, run B inherits that login, and a test that should have started clean instead hits the dashboard immediately and reports an unrelated success. The MCP tool exposes three modes at server.ts:354-357. The default keeps a persistent profile at ~/.assrt/browser-profile. isolated: true keeps the profile in memory, so every run starts logged out. extension: true attaches to your real running Chrome via the Playwright extension. The rule of thumb for reproducibility is “isolated for CI, persistent for local dev, extension for production-incident repro.”
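The conflict handling for the persistent mode can be sketched as a pure decision function. The names below are hypothetical; the real logic in browser.ts also scrubs the SingletonLock file and walks orphan Chrome PIDs:

```typescript
type ProfileMode = "isolated" | "persistent" | "extension";

// Sketch of the conflict rule: a persistent run that finds the
// profile's SingletonLock held by a live Chrome degrades to isolated
// for that one call, instead of corrupting the shared profile.
function resolveMode(
  requested: ProfileMode,
  lockPresent: boolean,
  lockOwnerAlive: boolean,
): ProfileMode {
  if (requested !== "persistent") return requested;
  if (lockPresent && lockOwnerAlive) return "isolated"; // degrade, don't corrupt
  return "persistent"; // a stale lock is scrubbed; safe to reuse the profile
}

console.log(resolveMode("persistent", true, true));  // "isolated"
console.log(resolveMode("persistent", true, false)); // "persistent"
```

The important property is that the degraded mode is still reproducible: the run gets a clean profile rather than a half-shared one.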
Artifact isolation is one layer down and applies in all three modes. server.ts:430 calls mkdirSync(screenshotDir, { recursive: true }) against a path that begins with the freshly minted run UUID. Screenshots are written with filenames like 00_step1_navigate.png at server.ts:467-486, so the order in which the agent acted is on disk, not just in memory. Two parallel runs cannot overwrite each other’s screenshots, even on the same machine, because crypto.randomUUID gives each run 122 bits of entropy. (For the deeper version of this argument see the browser isolation guide.)
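The naming scheme is worth making concrete. The helper below is hypothetical; only the filename shape (00_step1_navigate.png) and the tmpdir()/assrt/<runId> root come from the source:

```typescript
import { tmpdir } from "node:os";
import { join } from "node:path";

// Sketch of the naming scheme: a zero-padded sequence prefix makes
// the on-disk sort order equal the order the agent acted in.
function screenshotPath(
  runId: string,
  seq: number,
  step: number,
  action: string,
): string {
  const file = `${String(seq).padStart(2, "0")}_step${step}_${action}.png`;
  return join(tmpdir(), "assrt", runId, file);
}

console.log(screenshotPath("demo-run", 0, 1, "navigate"));
```

Because the run UUID is in the path and the sequence number is in the filename, `ls` on the artifact directory is itself an audit log of the run.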
Knob 7: real Playwright is the artifact, not the LLM run
The previous six knobs all describe the LLM-driven run. The seventh is what you ship out the back. The reason every other tool in this category emits a proprietary YAML or visual flow is convenience: it is easier to interpret a structured DSL than a Playwright spec. The cost is reproducibility. If their cloud goes down, deprecates the format, or gets acquired, your test infrastructure stops working. The format itself is the lock-in.
Assrt’s output is a Playwright .spec.ts file. You commit it. From that point on, the LLM is no longer in the deterministic execution path. The file runs in your CI today, your CI in five years, and on Chromium, Firefox, and WebKit without further translation. The agent’s role narrows to discover scenarios, draft the spec, and propose updates when selectors break. The shipping artifact is plain code, which means the reproducibility contract you built around the LLM run survives the LLM going away.
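Schematically, the committed artifact looks like any other Playwright spec. The example below is hand-written for illustration, not actual Assrt output:

```typescript
import { test, expect } from "@playwright/test";

// Plain Playwright: no LLM anywhere in the execution path.
test("signup smoke", async ({ page }) => {
  await page.goto("https://example.com");
  await page.getByRole("link", { name: "Sign up" }).click();
  await page.getByLabel("Email").fill("qa+ci-001@example.com");
  await page.getByLabel("Password").fill("Hunter22!");
  await page.getByRole("button", { name: "Submit" }).click();
  await expect(page.getByRole("heading", { name: "Dashboard" })).toBeVisible();
});
```

A file in this shape runs under `npx playwright test` with no Assrt dependency at all, which is the whole point of knob seven.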
The same seven knobs, on a closed-source cloud vs. on Assrt
Every cell on the left is a knob the cloud vendor has chosen for you. Every cell on the right is a knob the source exposes.
| Feature | Closed-source cloud testers | Assrt |
|---|---|---|
| Model version | Hidden behind a SaaS abstraction; updates change behavior silently | Default pinned to `claude-haiku-4-5-20251001` at agent.ts:9; per-run override exposed at server.ts:351; logged in the agent.run.start event |
| Run identifier | Allocated server-side after the run starts | Minted via `crypto.randomUUID()` at server.ts:429 BEFORE the agent acts; cloud URLs are addressable immediately |
| Test inputs | Scripted into the recorded flow; reusing across data sets means re-recording | Plan text contains `{{KEY}}` placeholders; variables substituted at agent.ts:376-381 before the model sees the plan |
| Pass / fail decision | LLM judgment, sometimes hidden | passCriteria is rendered as a MANDATORY block at agent.ts:670; each criterion becomes a named assertion in the report |
| Browser state isolation | One mode (container per session), all-or-nothing | Three modes via `isolated`, default, `extension` (server.ts:354-357); per-run artifact UUIDs survive every mode |
| Test artifact you commit | Proprietary recording; lock-in to the vendor | Standard Playwright spec files; LLM exits the loop after generation |
| Verification of the run contract | Vendor dashboard; opaque | agent.run.start log line names model, mode, passCriteria flag, and variable count; artifacts at predictable per-run paths |
Cloud testers do real work and are sometimes the right call; the table is about reproducibility specifically, not about whether the product is good.
What you give up to get this
Reproducibility is not free. Pinning a dated model snapshot means you do not get newer model improvements until you bump the pin. Minting the run UUID upfront means a slightly noisier MCP tool result. Variables in the plan text mean your scenarios cannot be copy-pasted as-is into a chat without filling in the placeholders first. Pass criteria mean writing them down, which is the part that quietly trips up most adoptions: a team that cannot articulate the success conditions of its own flow is not yet ready for a test runner that demands them. The point is that all of these costs are visible. Cloud testers hide their version of these decisions behind a UI; Assrt asks you to make them on purpose.
The other thing you give up, sometimes, is recency. A model snapshot from October is not the model that shipped last week. The contract is “same model that was running when this test passed last time,” not “the smartest model on the planet today.” If you want the latter, override the model per call. Assrt will log the override in agent.run.start and you keep your audit trail.
Pin the seven knobs against your real CI
If your team is debating where the variance in agent tests is actually coming from, a 20-minute call is enough to map it to one of these knobs and decide what to pin first.
Frequently asked questions
Is an LLM-driven agent test ever truly deterministic?
No, and any tool that claims otherwise is hiding the variance somewhere. The same model, on the same prompt, with temperature zero, can still produce different tool-call sequences across runs because providers occasionally re-tokenize, batch, or quantize behind the scenes. Even the same byte-for-byte output can land at the browser at slightly different wall-clock times and resolve to different DOM states. What you can build is one level up: reproducibility. Same plan, same variables, same pinned model, same explicit pass criteria, and the run produces the same pass/fail verdict and the same set of addressable artifact URLs. Determinism asks 'will every byte match.' Reproducibility asks 'will the contract hold.' The seven knobs in this guide are the contract.
What does Assrt pin by default?
The Anthropic model default is set in /Users/matthewdi/assrt-mcp/src/core/agent.ts on line 9: `const DEFAULT_ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"`. The Gemini default sits next to it. These are exact dated model snapshots, not aliases like `claude-latest`. If you do not pass a `model` argument to assrt_test, those are what runs. The MCP tool also exposes the override at server.ts:351, so you can pin to a different snapshot per scenario, and the value gets logged on every run start at agent.ts:383 inside the `agent.run.start` event. The point is: the model is a load-bearing input to the test, so the source treats it like one. It is logged with every run, defaulted to a dated snapshot, and overridable per call.
Why mint the run UUID before the test runs?
Because every artifact URL has to exist before the artifact does. Look at /Users/matthewdi/assrt-mcp/src/mcp/server.ts lines 407-432: assrt_test creates the scenario UUID first (lines 408-418), then mints a fresh run UUID via `crypto.randomUUID()` on line 429, then builds the per-run artifact directory under `tmpdir()/assrt/<runId>` on line 430. By the time the agent fires its first browser action, every cloud URL the test will eventually expose (video, screenshots, log, page summary) has already been computed by `buildCloudUrls` in scenario-store.ts:243-263. CI can record those URLs in a comment on the PR while the test is still running. If the test fails halfway, the URLs still resolve to whatever artifacts did get uploaded.
How do variables make a test reproducible if the LLM is still picking what to type?
Variables are not a hint to the model, they are a substitution into the plan text before the model sees it. agent.ts lines 376-381 do a simple regex replace: every `{{KEY}}` in the scenario plan becomes the value you passed in `variables`. If your plan says 'sign up with {{EMAIL}} and product {{SKU}},' and you pass `{ EMAIL: 'qa+canary@yourcompany.com', SKU: 'ENT-001' }`, the agent reads 'sign up with qa+canary@yourcompany.com and product ENT-001.' The model still picks DOM targets, but the input it is steering toward is fixed. That is what makes a parameterized run reproducible across CI matrices: every job sees the same plan with its own variables substituted in, no shared mutable state.
What stops an agent from declaring success on a flaky pass?
passCriteria. The MCP tool schema at server.ts:343 takes a free-text string of explicit success conditions. Inside the agent, agent.ts:670 wraps that string in a labelled section: '## Pass Criteria (MANDATORY)\nThe test MUST verify ALL of the following conditions. Mark the scenario as FAILED if any condition is not met.' The agent is then forced to call its `complete_scenario` tool with one assertion per criterion. If you pass `passCriteria: 'Cart total displays $42.99 exactly. Confirmation email arrives within 30s.'`, you do not get a pass on a green page that did not actually verify those two facts. This converts the test from 'did the agent feel like it worked' to 'did each named criterion get a passing assertion,' which is what reproducibility actually needs.
How do I keep two CI runs from contaminating each other on the same machine?
Three layers. First, every run gets its own UUID and its own artifact directory under `tmpdir()/assrt/<runId>` (server.ts:429-432), so screenshots and videos from run A cannot land on run B's paths. Second, the `isolated: true` option in the tool schema (server.ts:354) keeps the browser profile in memory only, so cookies and localStorage do not leak across runs. Third, if you stay on the persistent profile and two runs race, the second one runs `killOrphanChromeProcesses` and tries to scrub `SingletonLock`; if the lock cannot be removed because the first Chrome is still alive, it silently falls back to `--isolated` for that one call (browser.ts:336-340). In practice that means two parallel persistent runs degrade to one persistent and one isolated, never to a corrupt shared profile.
Why is generating real Playwright code part of the reproducibility story?
Because it is the only artifact that survives the LLM. Every other tool in this category emits a proprietary YAML, JSON DSL, or 'visual flow' that only their cloud can run. If their cloud goes down, deprecates the format, or hikes prices, your test infrastructure stops being reproducible because nobody can run it. Assrt emits standard Playwright .spec.ts files. You commit them. They run in your CI today, your CI in five years, on Chromium, Firefox, and WebKit, and the LLM is no longer in the loop for the deterministic execution path. The agent's role collapses to: discover scenarios, draft the spec, propose updates when selectors break. The shipping artifact is plain code.
Where is the contract logged so I can prove a run held it?
agent.ts:383 emits a structured `agent.run.start` line with the URL, mode, model, whether passCriteria was set, and the variable count. The screenshot files are written under the per-run UUID dir at server.ts:467-486, with filenames like `00_step1_navigate.png`, so the order in which the agent acted is on disk, not just in memory. The cloud URLs from buildCloudUrls (scenario-store.ts:243-263) are deterministic given (scenarioId, runId), so two runs with the same IDs would point at the same artifacts; in practice a fresh runId is minted per call, so each contract is uniquely addressable. If you want to prove a CI job ran with the inputs you think it ran with, the inputs are in `agent.run.start` and the outputs are at predictable paths.
Adjacent guides
AI agent browser isolation: four layers, not one toggle
The isolation layer in detail: profile, session, process, and artifact, with the file paths in assrt-mcp.
Playwright agent isolation: three pieces of code that keep two agents off each other
The SingletonLock scrub and the orphan-PID walk that protect the persistent profile from concurrent runs.
AI agent browser automation reliability: where it actually fails
The non-isolation reliability story: stale selectors, timing, MutationObserver stability.