AI Verification

Automated Testing for AI Agent Verification: Why Deterministic Checks Beat LLM Memory

AI agents forget instructions, ignore their own plans, and skip critical guardrails. The solution is not better prompts or longer context windows. It is automated test suites that run deterministically and block the next step when something breaks.

80% less maintenance

Self-healing selectors mean tests don't break when UI changes, reducing maintenance by up to 80%.

Assrt open-source testing framework

1. Why AI Agents Forget: The Memory and Guardrail Problem

A recent thread on r/Anthropic documented 22 separate instances where Claude ignored its own plans, forgot instructions it had acknowledged just minutes earlier, and bypassed guardrails that were explicitly stated in its system prompt. The failures were not obscure edge cases. They were straightforward situations where the model was given a clear plan, agreed to follow it, and then did something completely different.

This is not a bug that will be fixed in the next model release. It is a fundamental characteristic of how large language models work. LLMs do not have persistent memory in the way humans do. Every token they generate is a probabilistic prediction based on the current context window. When that context window fills up, earlier instructions get compressed or dropped entirely. Even within a manageable context length, the model's attention to specific instructions degrades as more text accumulates between the instruction and the point where it needs to act on it.

The practical consequence is that any workflow relying on an AI agent to “remember” what it should do will eventually fail. It might work 90% of the time, or even 99% of the time, but the failures are unpredictable and often silent. The agent does not announce that it has forgotten your guardrails. It simply proceeds as if they never existed.

This creates a particularly insidious problem for teams using AI agents in production workflows. If an agent is responsible for code generation, deployment steps, data transformations, or any other consequential task, a single forgotten guardrail can cause significant damage. And because the failure mode is silent (the agent confidently does the wrong thing), it often goes undetected until a user reports a problem or a monitoring alert fires.

2. Checklists vs Automated Verification: Deterministic Wins

The instinctive response to AI agents forgetting things is to add more instructions. Write longer system prompts. Include checklists. Add “IMPORTANT” and “CRITICAL” prefixes to key rules. Repeat instructions at multiple points in the prompt. These strategies help marginally, but they do not solve the underlying problem, because they are asking a probabilistic system to behave deterministically.

A checklist that says “always run tests before deploying” depends on the agent reading, understanding, and acting on that instruction every single time. A CI pipeline that literally blocks deployment when tests fail does not depend on anyone reading anything. It is a hard gate. The code either passes or it does not ship.

This distinction between advisory checks and enforced gates is the key insight. Advisory checks (checklists, instructions, guidelines) are useful for humans who have persistent memory and intrinsic motivation to follow them. Enforced gates (automated tests, CI checks, deployment blockers) work for any system, because they do not require memory or motivation. They are structural constraints that make the wrong outcome impossible rather than merely discouraged.

Consider a concrete example. You want an AI agent to never delete production data during a migration. You could add this to the system prompt, and the agent would probably comply most of the time. Or you could write a test that verifies row counts before and after every migration, runs automatically in your CI pipeline, and blocks the deployment if any table has fewer rows than it started with. The second approach is not smarter or more sophisticated. It is simply deterministic. It will catch the problem whether the agent remembers the rule or not.
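
The row-count gate above can be sketched in a few lines. This is a minimal illustration, not a real migration tool: `get_row_counts` and `run_migration` are hypothetical stand-ins for your own database access and migration code.

```python
# Deterministic migration guard: compare per-table row counts before and
# after a migration, and block if any table ended up with fewer rows.

def check_no_data_loss(before: dict[str, int], after: dict[str, int]) -> list[str]:
    """Return the names of tables that lost rows; an empty list means safe."""
    return [table for table, count in before.items() if after.get(table, 0) < count]

def guarded_migration(get_row_counts, run_migration):
    """Run a migration, then hard-fail if any table shrank."""
    before = get_row_counts()
    run_migration()
    after = get_row_counts()
    shrunk = check_no_data_loss(before, after)
    if shrunk:
        raise RuntimeError(f"Migration blocked: tables lost rows: {shrunk}")
```

The check runs whether or not the agent remembers the rule; the migration either preserves every row or the pipeline stops.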

Stop Relying on AI Memory for Critical Checks

Assrt generates deterministic Playwright tests that run automatically and block deployments when something breaks. No checklists, no hoping the LLM remembers.

Get Started

3. Building a Verification Pipeline That Blocks Bad Deployments

A verification pipeline for AI-driven workflows needs three layers: pre-execution checks, runtime assertions, and post-execution validation. Each layer catches a different category of failure, and together they create a safety net that does not depend on the AI agent's memory.

Pre-execution checks validate inputs before the AI agent acts on them. These include schema validation (is the input in the expected format?), boundary checks (are values within acceptable ranges?), and permission verification (does the agent have the right to perform this action?). Pre-execution checks are fast and cheap. They catch the most obvious failures, like an agent trying to write to a production database when it should be using a staging environment.
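
A pre-execution gate can be as simple as a function that inspects the agent's proposed action before anything runs. The schema, ranges, and permission model below are illustrative placeholders, not a real API:

```python
# Pre-execution gate: validate an agent's proposed action before executing it.
# The field names, limits, and allowed targets are example assumptions.

ALLOWED_TARGETS = {"staging"}  # the agent may never touch production

def validate_action(action: dict) -> list[str]:
    """Return a list of violations; an empty list means the action may proceed."""
    errors = []
    # Schema validation: required fields must be present.
    for field in ("operation", "target", "batch_size"):
        if field not in action:
            errors.append(f"missing field: {field}")
    # Boundary check: keep batch sizes within a sane range.
    if not 1 <= action.get("batch_size", 0) <= 1000:
        errors.append("batch_size out of range (1-1000)")
    # Permission check: hard-block anything aimed outside allowed targets.
    if action.get("target") not in ALLOWED_TARGETS:
        errors.append(f"target not permitted: {action.get('target')}")
    return errors
```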

Runtime assertions monitor the agent's behavior during execution. These are automated tests that run alongside the agent, checking invariants at each step. For example, if an agent is generating code, a runtime assertion might verify that every generated file passes linting, that no new dependencies are introduced without being declared, or that the test suite still passes after each change. Runtime assertions turn a single long operation into a series of checkpointed steps, where each step must pass before the next begins.
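
The checkpointing pattern can be sketched generically. Steps and invariants here are plain callables for illustration; in practice they might be lint runs, dependency audits, or test-suite invocations:

```python
# Checkpointed execution: each step commits only if every invariant still
# holds, turning one long operation into a series of hard-gated steps.

def run_checkpointed(steps, invariants, state):
    """Apply each step to `state`; stop at the first invariant violation."""
    for i, step in enumerate(steps):
        candidate = step(state)
        for invariant in invariants:
            if not invariant(candidate):
                # Hard gate: reject the step, keep the last good state.
                raise RuntimeError(f"step {i} violated invariant {invariant.__name__}")
        state = candidate  # commit the checkpoint
    return state
```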

Post-execution validation confirms that the final output meets expectations. This is the most traditional form of testing: end-to-end tests that verify the system behaves correctly from a user's perspective. Did the checkout flow still work after the agent refactored the payment module? Does the login page still render after the agent updated the authentication library? Post-execution validation catches the failures that slip through the other layers, usually subtle regressions that only manifest in the interaction between components.

The critical design principle is that each layer must be a hard blocker, not an advisory warning. If a pre-execution check fails, the agent does not proceed. If a runtime assertion fails, the current step is rolled back. If post-execution validation fails, the deployment does not go out. This removes the agent's ability to “decide” whether a failure is important enough to stop for, because that decision is exactly the kind of judgment call that LLMs get wrong when they forget their guardrails.
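
Wiring the three layers together as hard gates can be sketched as follows. The named checks are placeholders for real schema validations, test runs, and e2e suites; the point is structural, that a failure at any layer raises and nothing downstream ever runs:

```python
# Three-layer verification as hard gates: any failing layer raises, so the
# layers after it never run and `deploy` is never called.

def verify_and_deploy(pre_checks, runtime_checks, post_checks, deploy):
    for name, check in pre_checks:
        if not check():
            raise RuntimeError(f"blocked at pre-execution check: {name}")
    for name, check in runtime_checks:
        if not check():
            raise RuntimeError(f"blocked at runtime assertion: {name}")
    for name, check in post_checks:
        if not check():
            raise RuntimeError(f"blocked at post-execution validation: {name}")
    deploy()  # reached only when every gate has passed
```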

4. Tools for Automated Testing in AI-Driven Workflows

Several tools and frameworks can help build the verification pipeline described above. The right choice depends on your stack, team size, and the specific AI workflows you need to verify.

Playwright remains the gold standard for end-to-end browser testing. Its auto-wait mechanism, trace viewer, and cross-browser support make it well suited for verifying that AI-generated UI changes have not broken user flows. Playwright tests are deterministic by design: they either pass or fail, with no probabilistic middle ground.

Assrt extends this approach by auto-discovering test scenarios from your live application, generating Playwright tests automatically, and providing self-healing selectors that adapt when your UI changes. For teams using AI agents to generate code rapidly, Assrt can continuously regenerate and run tests against the latest version of the application, catching regressions that the AI agent introduced without knowing it. Its open-source nature means you can inspect and modify every generated test.

GitHub Actions and GitLab CI provide the pipeline infrastructure for running tests as hard gates. Both support required status checks that block merges when tests fail, turning your test suite into an enforced gate rather than an advisory report. The configuration is straightforward: add a workflow that runs your test suite on every pull request, and mark it as required in your branch protection rules.
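
A minimal GitHub Actions workflow of roughly this shape (file path, Node version, and test commands are illustrative) runs the suite on every pull request; marking the job as a required status check in branch protection turns it into a hard gate:

```yaml
# .github/workflows/verify.yml -- illustrative names and commands
name: verify
on: [pull_request]
jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test  # non-zero exit fails the check and blocks the merge
```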

Guardrails AI and NeMo Guardrails take a complementary approach by validating LLM outputs before they reach the user or trigger downstream actions. These tools let you define rules (no PII in responses, no SQL in generated code, response must match a specific schema) and automatically reject outputs that violate them. They are particularly useful for the pre-execution and runtime assertion layers of your verification pipeline.
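
The rule-based rejection idea can be illustrated without any particular library. This is a concept sketch, not the actual Guardrails AI or NeMo Guardrails APIs, and the regexes are deliberately simplistic:

```python
# Concept sketch of output validation: each rule inspects an LLM output,
# and the gate rejects the output if any rule matches.
import re

RULES = {
    "no_email_pii": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "no_destructive_sql": re.compile(r"\b(DROP|DELETE|TRUNCATE)\s+TABLE\b", re.IGNORECASE),
}

def validate_output(text: str) -> list[str]:
    """Return the names of violated rules; an empty list means the output may pass."""
    return [name for name, pattern in RULES.items() if pattern.search(text)]
```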

Eval frameworks like Braintrust, Promptfoo, and DeepEval provide systematic ways to evaluate LLM behavior across large sets of test cases. Rather than testing a single output, they run hundreds or thousands of inputs through your AI workflow and measure aggregate quality metrics. This is valuable for catching the kind of probabilistic drift that causes an agent to slowly deviate from its intended behavior over many interactions.
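
The aggregate measurement these frameworks provide reduces, at its core, to running many cases and gating on a pass rate. A minimal sketch, with the workflow and threshold as stand-ins for whatever your eval tool measures:

```python
# Aggregate eval gate: run a batch of cases through a workflow, compute the
# pass rate, and hard-fail if quality drifts below a threshold.

def eval_pass_rate(workflow, cases, threshold=0.95):
    """cases is a list of (input, expected) pairs; raises if below threshold."""
    passed = sum(1 for inp, expected in cases if workflow(inp) == expected)
    rate = passed / len(cases)
    if rate < threshold:
        raise RuntimeError(f"pass rate {rate:.2%} below threshold {threshold:.2%}")
    return rate
```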

5. From Reactive Debugging to Proactive Verification

Most teams working with AI agents today operate in a reactive mode. The agent does something wrong, someone notices (usually a user), the team investigates, and they add a new instruction to the prompt to prevent it from happening again. This cycle is exhausting and scales poorly. Every fix is a new line in an ever-growing system prompt that the model is increasingly likely to forget.

Proactive verification inverts this pattern. Instead of waiting for failures and patching instructions, you define the expected behavior upfront as automated tests, and you run those tests continuously. When the AI agent's behavior drifts (and it will drift), the tests catch it before it reaches production. The feedback loop is minutes, not days.

The shift requires a change in how teams think about AI agent reliability. The question is not “how do I make the agent more reliable?” but rather “how do I build a system that produces reliable outcomes regardless of whether the agent is reliable?” This is the same mental model that drives good infrastructure engineering. You do not make individual servers more reliable. You build systems that tolerate server failures. Similarly, you do not make AI agents more consistent. You build verification systems that catch their inconsistencies.

The 22 documented failures in that Reddit thread are not evidence that AI agents are useless. They are evidence that AI agents need the same engineering discipline that we apply to every other unreliable component in our systems: automated testing, monitoring, circuit breakers, and hard gates that prevent bad outcomes. The tools exist. The patterns are well understood. The only step left is to apply them.

Start with one critical workflow. Write three tests that verify its most important invariants. Run those tests in CI and make them required. Then expand from there. Within a week, you will have more confidence in your AI-driven workflow than any amount of prompt engineering could provide, because your confidence will be grounded in deterministic evidence rather than probabilistic hope.

Build Verification That Actually Works

Assrt auto-discovers test scenarios, generates real Playwright tests, and runs them in CI. Deterministic, reproducible, and open-source.

$ npx @assrt-ai/assrt discover https://your-app.com