Feedback loops in AI testing

There is not one feedback loop in AI testing. There are three, and they run at wildly different speeds. Almost everything written on this topic describes the slowest one and skips the fast one, which happens to be the loop that decides whether a single test actually passes.

Matthew Diakonov, Written with AI

Published June 19, 2026 · 9 min read

Direct answer (verified 2026-06-19)

A feedback loop in testing is any cycle where output (a result, a page observation, a failure) is fed back to improve the next action. In AI testing, three loops run at once:

a per-action runtime loop (milliseconds): observe the live page, take one action, re-observe, decide again;
a diagnose-and-correct loop (one run): a failure is classified by root cause and a corrected test is produced;
a model-improvement loop (offline): outcomes across many runs inform how the model behaves later.

The runtime loop is the one that determines pass or fail today, and it is the one most guides leave out.

The word “loop” is doing two jobs

Read the common advice on this and you will see the same picture every time: capture outcomes, label them, retrain or fine-tune the model, validate on a held-out set, repeat. That is a real loop. It is also the slow one. It runs over days. Nothing about it tells you whether the checkout test you just kicked off is going to pass.

The loop that decides pass or fail lives inside a single test run and closes hundreds of times per minute. Every time the agent does something to the page, it looks again at what the page became before it chooses the next move. That re-looking is the loop. It is cheap, it is constant, and it is invisible in most writing on the topic because it leaves no dashboard behind.

Keeping the two straight matters because they fail differently. A stale model degrades slowly and you can measure it. A runtime loop that acts on assumptions instead of observations fails instantly and looks like a flaky test. The rest of this page is mostly about that fast loop, because it is the one you can actually do something about.

Three loops, three timescales

Here is the whole thing on one screen. Notice that the row that matters for a given test result is the first one, not the last.

Loop	Closes every	Fed back in	Decides
Runtime	action (ms)	the page state your last action produced	whether this test passes
Diagnose	failed run	the failure, classified by root cause	bug vs. flawed test
Model	days / offline	outcomes across many runs	how the agent behaves later

Loop 1: the runtime loop, in real code

This is the loop nobody shows, so here is the actual one. Assrt drives the browser with an agent. The agent does not write a script up front and run it blind. It reads the page, picks one action, then reads the page again to see what that action did, and only then picks the next action. The exchange looks like this.

[Diagram: one turn of the runtime feedback loop]

The piece that makes it a feedback loop, rather than a script, is the last leg: after every action the browser hands back a fresh result and screenshot, and that observation is pushed into the next decision before the agent moves again. In src/core/agent.ts it reads, lightly condensed, like this:

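// src/core/agent.ts
// (condensed: completed, stepCounter, messages, toolCalls and toolResults
//  are set up in lines omitted from this excerpt)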
const MAX_STEPS_PER_SCENARIO = Infinity;

while (!completed && stepCounter < MAX_STEPS_PER_SCENARIO) {
  // model chooses the next action from what it can see right now
  const response = await this.anthropic.messages.create({
    model: this.model, system: SYSTEM_PROMPT, tools: TOOLS, messages,
  });

  for (const toolCall of toolCalls) {
    const result = await this.execute(toolCall);   // click / type / navigate / assert
    const screenshot = await this.browser.screenshot();
    toolResults.push({
      type: "tool_result", tool_use_id: toolCall.id,
      content: [
        { type: "text",  text: result },
        { type: "image", source: screenshot },      // <- the observation
      ],
    });
  }

  // feed the *observed* page back in before the next decision
  messages.push({ role: "user", content: toolResults });
}

Two details are worth pausing on. First, MAX_STEPS_PER_SCENARIO = Infinity: the loop is not capped at a guessed number of steps. It runs until the work is actually done and the agent calls complete_scenario. Second, the screenshot and accessibility tree go back in as a tool_result on every iteration. The system prompt is blunt about why: “Call snapshot before each interaction to get fresh refs.” The agent never trusts a stale element reference, because the loop hands it a new one each turn. That is the entire trick. It is also why a small UI change does not derail the run.
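To make the stale-reference point concrete, here is a minimal sketch. It is not Assrt code: FakePage, refFor and both click helpers are hypothetical names, and the rule that every interaction renumbers all element refs is an assumption chosen to make the failure easy to see.

```typescript
// Hypothetical stand-in for a browser page. Assumption: every interaction
// re-renders the page, so refs handed out before it no longer resolve.
interface ElementRef {
  label: string;
  ref: string;
}

class FakePage {
  private generation = 0;
  private readonly labels = ["Email", "Continue"];

  // The "observation": current elements with refs valid for this render only.
  snapshot(): ElementRef[] {
    return this.labels.map((label, i) => ({
      label,
      ref: `g${this.generation}-e${i}`,
    }));
  }

  // Click by ref. Throws if the ref comes from an earlier render.
  click(ref: string): string {
    const hit = this.snapshot().filter((el) => el.ref === ref)[0];
    if (!hit) throw new Error(`stale ref: ${ref}`);
    this.generation++; // the click changes the page and invalidates old refs
    return hit.label;
  }
}

// Look up the ref for a label inside one particular snapshot.
function refFor(snapshot: ElementRef[], label: string): string {
  const hit = snapshot.filter((el) => el.label === label)[0];
  if (!hit) throw new Error(`not on page: ${label}`);
  return hit.ref;
}

// Script-style: observe once up front, then act on those refs throughout.
function clickAllFromOneSnapshot(page: FakePage, labels: string[]): string[] {
  const snap = page.snapshot();
  return labels.map((label) => page.click(refFor(snap, label)));
}

// Loop-style: take a fresh snapshot before each interaction.
function clickAllWithFreshSnapshots(page: FakePage, labels: string[]): string[] {
  return labels.map((label) => page.click(refFor(page.snapshot(), label)));
}
```

Under these assumptions, clickAllFromOneSnapshot throws a stale-ref error on its second click, while clickAllWithFreshSnapshots clicks both elements. The only difference between the two is when the snapshot is taken.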

Loop 2: the diagnose-and-correct loop

A red run is itself a piece of feedback, and the worst thing you can do with it is treat it as a dead end. Assrt’s diagnose step takes a failure as input and runs it through a small loop of its own, described in src/mcp/server.ts: classify the root cause, decide whether the application or the test is at fault, and, when the test was flawed, emit a corrected test in the same case format so it can go straight back into the runtime loop.

What happens to a failed run

Failed run: red assertion + page state;
Classify root cause: app bug, flawed test, or environment;
Emit corrected case: if the test was at fault;
Re-run: back into the runtime loop.

The outcome is that every failure traced to the app or the test resolves into one of two useful things: a confirmed bug worth a ticket, or a better test than the one you started with. Nothing is thrown away. This is the loop that keeps a suite from rotting, because the suite improves itself from its own failures rather than waiting for a human to babysit every red mark.
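The routing half of that loop can be sketched as a small function. This is an illustration under stated assumptions, not the code in src/mcp/server.ts: the type names, field names and the route function are hypothetical, the classification itself is assumed to arrive from the model, and re-running the unchanged case on an environment failure is a guess, since the text above does not say what happens in that branch.

```typescript
// The three root causes named in the text.
type RootCause = "app_bug" | "flawed_test" | "environment";

// Hypothetical shape of a failed run handed to the diagnose step.
interface FailedRun {
  caseText: string;        // the test case as it was run
  failedAssertion: string; // the red assertion
  pageState: string;       // what the page looked like at failure
}

// Hypothetical shape of the classifier's answer.
interface Diagnosis {
  rootCause: RootCause;
  correctedCase?: string;  // expected when rootCause is "flawed_test"
}

type NextStep =
  | { kind: "file_bug"; summary: string }
  | { kind: "rerun"; caseText: string };

// Turn a diagnosed failure into the next thing to do with it.
function route(run: FailedRun, diagnosis: Diagnosis): NextStep {
  switch (diagnosis.rootCause) {
    case "app_bug":
      // The test was right and the application is wrong: worth a ticket.
      return { kind: "file_bug", summary: run.failedAssertion };
    case "flawed_test":
      if (!diagnosis.correctedCase) {
        throw new Error("a flawed_test diagnosis needs a corrected case");
      }
      // The corrected case goes straight back into the runtime loop.
      return { kind: "rerun", caseText: diagnosis.correctedCase };
    case "environment":
      // Assumption: neither app nor test is at fault, so retry the same case.
      return { kind: "rerun", caseText: run.caseText };
  }
}
```

Because NextStep has no "ignore" variant, a red run cannot be silently dropped in this sketch: it becomes either a ticket or another pass through the runtime loop.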

Why the fast loop is the one to care about

The slow model loop gets the attention because it sounds like the sophisticated part. In practice it is the runtime loop that earns or loses your trust in a test suite. A model that is a little out of date still produces reasonable actions. An agent that commits to a plan and then acts on a page that has moved underneath it produces a result that is confidently wrong, which is the worst possible kind of test result.

This is also the honest answer to the “does AI make tests more flaky” worry. Flakiness does not come from speed. It comes from acting on stale assumptions. A loop that re-reads the page after each action is structurally more stable than a recorded script, because it notices the modal that appeared, the row that shifted, the button that relabeled, and adjusts before any of those turn into a false failure. Tighten the observation step and you remove flakiness at the source instead of papering over it with retries.
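The contrast between a recorded script and a re-observing loop can be shown on a toy page. Everything here is hypothetical: the page is a plain state object, the modal that appears after typing is an invented behavior, and chooseNext is a hand-written stand-in for the model. The step cap of 20 is a safety net for the sketch, not something taken from Assrt.

```typescript
type Action = "type_email" | "dismiss_modal" | "click_submit";

interface PageState {
  emailTyped: boolean;
  modalOpen: boolean;
  submitted: boolean;
}

const initial: PageState = { emailTyped: false, modalOpen: false, submitted: false };

// Hypothetical page behavior: typing the email pops up a modal,
// and an open modal swallows clicks on the submit button.
function apply(state: PageState, action: Action): PageState {
  switch (action) {
    case "type_email":
      return { ...state, emailTyped: true, modalOpen: true };
    case "dismiss_modal":
      return { ...state, modalOpen: false };
    case "click_submit":
      return state.modalOpen ? state : { ...state, submitted: state.emailTyped };
  }
}

// Recorded script: a fixed action list replayed without looking at the page.
function runScript(script: Action[]): PageState {
  return script.reduce(apply, initial);
}

// Stand-in for the model: pick one action from the state visible right now.
function chooseNext(state: PageState): Action | null {
  if (state.submitted) return null; // goal reached, stop
  if (state.modalOpen) return "dismiss_modal";
  if (!state.emailTyped) return "type_email";
  return "click_submit";
}

// Re-observing loop: decide, act, then decide again from the new state.
function runLoop(maxSteps = 20): { state: PageState; actions: Action[] } {
  let state = initial;
  const actions: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const action = chooseNext(state);
    if (action === null) break;
    state = apply(state, action);
    actions.push(action);
  }
  return { state, actions };
}
```

Replaying the script ["type_email", "click_submit"] ends with submitted still false, which would surface as a failed assertion even though the app is working. The loop sees the modal, dismisses it, and then submits.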

So if you are evaluating any AI testing tool, the question to ask is not “how often do you retrain.” It is “how do you re-ground each action on the real page,” and “can I read the code that does it.”

A feedback loop you can actually inspect

Most tools that automate this hand you a loop you cannot see. The decisions happen inside a closed service, the tests are stored in a proprietary format, and you take the vendor’s word that the loop is sound. That is a strange thing to accept for the component that decides whether you ship.

Assrt is built the other way around. The agent that runs the runtime loop is open source, so the observe, act, re-observe cycle is code you can read line by line. And the output of the loop is not a black-box recording: it is standard Playwright test files you read, edit, run in your own CI, and keep if you ever walk away. The loop is inspectable at both ends, the engine and the artifact.

You can watch one run end to end with a single command:

npx @m13v/assrt discover https://your-app.com

It crawls the app, proposes scenarios so you do not start from a blank file, then drives each one through the runtime loop above and writes the Playwright tests out where you can see them.

Frequently asked questions

What is a feedback loop in testing?

A feedback loop in testing is any cycle where the output of one step (a test result, an observation of the page, a failure) is fed back as input to improve the next step. The classic loop is run, observe, adjust, run again. AI changes who closes the loop and how fast it closes, not the basic shape.

How do AI feedback loops work in testing specifically?

There are three of them and they run at different speeds. The fast one is per-action: an agent reads the live page, takes one action, re-reads the page, then decides the next action. The medium one is per-failure: a failed run is classified by root cause and a corrected test is emitted. The slow one is per-model: outcomes across many runs eventually inform how the model behaves. Most articles only describe the slow one.

Is an AI feedback loop the same as retraining a model?

No, and conflating the two is the most common mistake. Retraining is the slow offline loop that runs over days or weeks. The loop that determines whether your test passes right now runs in milliseconds inside a single test run, by re-grounding each decision on the page state the last action actually produced. No model weights change during that loop.

Why does grounding each step on the real page matter?

Because a plan written up front goes stale the moment the page does something unexpected: a modal appears, a row shifts, a button relabels. If the agent acts on its original plan instead of what is now on screen, it drifts and produces a flaky or false result. Re-reading the page after every action keeps the loop closed on observed state, which is what makes the result trustworthy.

What does the diagnose loop do when a test fails?

It treats the failure as input, not a dead end. The failure is classified as one of three things: a real application bug, a flawed test, or an environment issue. If the test was flawed, a corrected test case is produced in the same format so you can re-run it. That turns every red run into either a confirmed bug or a better test.

Can I inspect the feedback loop, or is it a black box?

With Assrt you can inspect it. It generates standard Playwright files you read, edit, and commit, and the agent that drives the runtime loop is open source, so the observe-act-re-observe cycle is readable code rather than a proprietary service you have to trust on faith.

Does a faster feedback loop mean more flaky tests?

Speed is not the cause of flakiness. Acting on stale assumptions is. A fast loop that re-observes the page after each action is more stable than a slow one that commits to a fixed script, because it corrects course before a small UI change cascades into a failed assertion.

What are the feedback loops in AI-powered test automation?

In AI-powered test automation the term covers three nested loops, not one. The per-action runtime loop observes the live page, takes a single action, and re-observes the result before deciding the next move. The per-failure diagnose loop turns a red run into either a confirmed bug or a corrected test. The slow model loop runs offline and informs behavior over many runs. The distinction matters for automation because the first two loops run inside a single CI run on your own infrastructure, so the cycle that decides pass or fail is code you can read, not a remote service. With Assrt the runtime agent is open source and the output is standard Playwright you keep, so the loops that run in your pipeline are inspectable.
