Inside an AI E2E agent

Multi selector candidate ranking in E2E tests: the four-tier scoring function inside an AI test agent

When an AI E2E agent is told "click the Sign in button"and the page has six elements that could plausibly be the target, the agent runs a scoring function over a fixed set of nine candidate selector types. This guide walks through that function exactly as it ships in the open-source Assrt repo, and shows where it diverges from Playwright's own first-match convention.

Matthew Diakonov, Written with AI

Published May 21, 202610 min read

Direct answer (verified 2026-05-21)

An AI E2E agent picks one element from many candidates by (1) trying the literal input string as a CSS selector via document.querySelector, then (2) scanning a fixed candidate set (a, button, input, [role="button"], select, textarea, label, [onclick], [href]) and scoring each candidate by four tiers: exact text match (winner, breaks immediately), input-text inside candidate text (score 3), candidate text inside input text with candidate length over 2 (score 2), and partial-word match by fraction (score = matched_words / total_words). The highest-scoring candidate wins; ties resolve to DOM order.

Source of truth: function showClickAt in src/core/browser.ts lines 294 to 344 of the open-source assrt-ai repo. Algorithm runs in the browser via Playwright MCP's browser_evaluate.

The problem this function solves

Playwright's recommended locator priority is well documented and well repeated. getByRole first, then getByLabel, then getByPlaceholder, then getByText, then getByTestId, and CSS only when nothing else works. That priority assumes you, the test author, are writing the locator. It says nothing useful about what an agent should do when the input is a free-text description like "the primary call-to-action button".

That description is not a valid getByRole call. It is not a CSS selector. It is what a human types into a Reddit comment when they describe a test step. If an AI E2E agent wants to act on that string, the agent has to turn it into one DOM element before Playwright is even involved. Multi selector candidate ranking is the function that does that turn.

The interesting thing about the algorithm is that it's small, deterministic, and runs inside the browser via a single browser_evaluate call. No model is consulted. The LLM only gets involved later, when this function fails to find anything and the agent has to look at the accessibility tree.

Resolution flow: from human description to clicked element

The candidate set is exactly nine selector types

The first decision the algorithm makes is what counts as a candidate. Too narrow, and the agent misses real click targets. Too wide, and the scoring function gets noisy because most elements on a page are not interactive. The set in Assrt's resolver is:

a, button, input, [role="button"], select, textarea, label, [onclick], [href]

That's the union of (1) the four native interactive elements every browser ships, (2) anything the page explicitly opted into being a button via ARIA, (3) label because clicking a label fires the associated input, and (4) anything with a click handler or an href so divs-as-buttons are caught.

What's deliberately not in the set: [tabindex], [role="link"], [role="menuitem"], and div generically. Adding any of them broadens recall but the scoring function then has to fight the noise. The trade-off was made in favor of precision; if your app uses one of those patterns heavily, the candidate set is one line in browser.ts and you can fork it.

Tier 0: try the literal string as a CSS selector

document.querySelector(sel) wrapped in a try/catch. If the agent was lucky enough to be handed a real selector ('#login-btn', 'button[type=submit]'), the lookup short-circuits before any scoring runs. The try/catch is mandatory because selectors with spaces (the common case when sel is a human description) raise SyntaxError; the catch swallows it and lets the fallback kick in.

Tier 1: exact text match wins, break out

If any candidate's trimmed, lowercased textContent equals the lowercased input string, the loop ends with that element. No score is recorded; no other candidate can win. First-match-in-DOM-order resolves ties.

Tier 2: substring matches score 2 or 3

Input string contained inside candidate text scores 3 (the candidate is a superset of what was asked for). Candidate text contained inside input string scores 2 (the input is a superset of the candidate label, e.g. 'Sign in button' vs a button labelled 'Sign in'). The candidate-length floor of 2 characters prevents one-letter buttons from winning by accident.

Tier 3: partial-word match by fraction

Split the input on whitespace, drop words two characters or shorter. Count how many remaining words appear anywhere in the candidate text. Score equals matched_words divided by total_words, so the max here is 1.0, which is still lower than the substring tier. This catches rebrand cases ('Sign in' matching a button labelled 'Log in to your account').

The whole algorithm fits on one screen

Below is the actual code that runs in the page, copied from src/core/browser.ts in the Assrt repo. It's injected into the browser context via the browser_evaluatetool of Playwright MCP. There is no library involved; it's pure DOM.

// browser.ts, lines 294-344, function showClickAt
// Injected into the page; runs inside the browser, not in Node.
const result = await this.callTool("browser_evaluate", {
  "function": `() => {
    const sel = ${sel};                    // the human-readable description
    const selLower = sel.toLowerCase();
    let el = null;

    // Tier 0: try as a literal CSS selector first.
    try { el = document.querySelector(sel); } catch {}

    if (!el) {
      // Tier 1: scan a fixed candidate set.
      const candidates = document.querySelectorAll(
        'a, button, input, [role="button"], select, textarea, label, [onclick], [href]'
      );

      const words = selLower.split(/\\s+/).filter(w => w.length > 2);
      let bestScore = 0;

      for (const e of candidates) {
        const txt = (e.textContent || '').trim().toLowerCase();
        if (!txt) continue;

        // Tier 2: exact match wins immediately.
        if (txt === selLower) { el = e; break; }

        let score = 0;
        if (txt.includes(selLower)) score = 3;                           // tier 3a
        else if (selLower.includes(txt) && txt.length > 2) score = 2;    // tier 3b
        else {
          const matched = words.filter(w => txt.includes(w)).length;
          if (matched > 0) score = matched / words.length;               // tier 4
        }

        if (score > bestScore) { bestScore = score; el = e; }
      }
    }

    if (el) {
      const r = el.getBoundingClientRect();
      return JSON.stringify({ x: r.left + r.width / 2, y: r.top + r.height / 2 });
    }
    return null;
  }`,
});

A few details worth pointing at. The try/catch around querySelector is necessary because most human descriptions raise SyntaxError when passed as CSS selectors (the string "Sign in button" has unescaped spaces and isn't valid). The txt.length > 2 floor on the candidate-in-target tier is what prevents a one-character label from winning by accident when the input string happens to contain that character. And the partial-word tier ignores words shorter than three characters, which keeps articles and prepositions from scoring as positive matches.

9 candidate types

“Tests are yours to keep. The scoring function is 50 lines of code in a file you can read, fork, and edit. There is no hosted black box deciding which element your test clicks.”

Assrt, open-source AI test framework

What happens when an exact match exists

Input: 'Sign in'

querySelector: null

Scan 37 candidates

Exact match found

✅

Break, click, done

Walking through four realistic scenarios

The scoring tiers are easier to reason about when you watch them resolve a real page. Pretend the page has three candidates: a header <a>Sign in</a>, a hero <button>Sign in to continue</button>, and a footer <a>Sign up</a>.

Scenario 1: input is "Sign in"

Header anchor: txt === selLower, exact match, loop breaks. Hero button and footer anchor never get scored. The deterministic outcome is the header link. Note that DOM order matters here, not score: if the hero button appeared first in DOM order and was the exact match, it would win instead. Ties resolve to DOM order by virtue of the for-loop's ordering.

Scenario 2: input is "Sign in button"

No exact match. Header anchor has text "sign in", which is contained in the input (tier 2, score 2). Hero button has text "sign in to continue", neither contains nor is contained in the input cleanly; partial-word match is 2/3 = 0.67 (two of the three input words "sign", "in", "button" appear, but "button" is too short to count, so it's actually 2/2 = 1.0 because "button" is filtered out by the > 2 length filter). The header anchor wins at score 2 because 2 > 1.0.

Scenario 3: input is "Continue with Google"

None of the three candidates match. Scores are all 0. The function returns null. The agent doesn't click; it logs "no element matched" and the caller is expected to retry with a snapshot-based ref. This is the right behavior: silent guessing on a 0.0 score would exercise the wrong element and pass green.

Scenario 4: input is "Sign in to continue"

Hero button text matches the input exactly. Tier 1 wins, loop breaks. Even though the header anchor would also have scored 2 on tier 2 (its text is contained in the input), the exact match exits before it's evaluated. This is why the early break matters: it forces a deterministic answer when the page actually has the exact label.

Naive resolver vs Assrt's scored resolver

// What a naive resolver does: one selector, one element, throw if no match.
async function click(page, sel) {
  const el = await page.$(sel);
  if (!el) throw new Error("not found: " + sel);
  await el.click();
}

// Caller has to know the exact selector.
// If the LLM says "click the Sign in button", the caller has to translate.
// If there are two buttons that could match, the caller has to disambiguate.
// Anything ambiguous fails hard, even when a human could see the right one.

-22% fewer lines

What an AI agent receives vs what gets clicked

The agent gets a description, not a locator. Examples include 'click the Sign in button', 'click the primary CTA', 'click the link that says "Forgot password?"'. None of these are valid Playwright locators on their own. A naive resolver would throw or, worse, click nothing while reporting success.

Free-text description
Not a valid CSS selector
Not a getByRole call
May match multiple elements

Where this diverges from Playwright's own first-match rule

Playwright has a strict-by-default policy: when a locator matches more than one element, the locator API throws unless you opt in with .first(), .last(), or .nth(). The official guidance is that you should never opt in; you should make the locator unique instead. That guidance is correct for hand-written test code where the author can see the page and disambiguate.

It doesn't apply to an AI agent that's receiving a description from a model and has to produce somedeterministic answer. The agent doesn't have the luxury of throwing and asking the user to rewrite the locator. It either picks one and acts, or it gives up. The Assrt resolver picks one, using the scoring tiers above, and lets DOM order break ties.

The trade-off is conscious: the resolver is biased toward acting deterministically on ambiguous input. The check on whether it acted correctly is the next snapshot the agent takes after the click. If the click landed on the wrong element, the next snapshot won't contain the expected state, and the agent recovers. The cost of being wrong is one extra LLM round-trip; the cost of refusing to act would be a stuck agent.

What this means for someone choosing an AI E2E tool

Most hosted AI E2E products keep the resolver as a closed black box. You give them a description, you get back a pass or a fail, and you have no way to inspect why a particular element was chosen on a particular run. When tests start clicking the wrong button after a UI rebrand, debugging means filing a ticket with the vendor and waiting.

The honest version of multi selector candidate ranking is one where the algorithm is small enough to read in one sitting, lives in a file you can grep, and uses no proprietary scoring model. The Assrt version is fifty lines of plain DOM JavaScript. It runs in your browser, ships its decision in a snapshot diff you can inspect, and you can change the candidate set or the scoring weights by editing one file. That's the trade-off the project makes: less magic, more legibility.

None of this is a knock on hosted products that get the resolution layer right; it's a statement about what kind of tool you want owning the most important runtime decision in your test suite. If your team values the ability to read, modify, and own the resolver, an open-source pick is the natural fit. If you'd rather pay a vendor to take responsibility, that's a different trade-off, and the difference shows up clearly the first time the resolver picks the wrong button and you need to know why.

Want to walk through the resolver on your own app?

Show me the page your AI E2E tests struggle with most. I'll walk through how Assrt's scoring function would resolve the ambiguous clicks, what the candidate set catches, and where to add a data-test attribute or two to make the resolver deterministic.

Multi selector candidate ranking FAQ

What does 'multi selector candidate ranking' mean in an E2E test?

When an E2E test agent is asked to act on an element described loosely (the typical case for AI-driven tests, where the instruction is 'click the Sign in button' rather than a precise locator), the agent has to scan the page, find every element that could plausibly be the target, and pick one. Multi selector candidate ranking is the scoring step in the middle: given N candidates that all partially match the description, assign each a score, take the winner. The interesting part is what counts as a candidate, what scoring function decides the winner, and what tie-breaking rule applies.

How is this different from Playwright's locator priority?

Playwright's recommended priority (getByRole, then getByLabel, then getByPlaceholder, then getByText, then getByTestId, then CSS) applies when you, the test author, write the locator. It says nothing about what an agent should do when the input string is not a valid locator at all (the literal text 'Sign in button' is not a getByRole call). Multi selector candidate ranking lives one layer below Playwright's strictness: it's the step that turns a fuzzy description into one element, and the algorithm has to be its own thing because Playwright's locators throw on ambiguity by design. The Assrt approach is to try the literal string as a CSS selector first, fall back to a fixed candidate set when querySelector returns null, score, and pick.

What's in the candidate set?

The candidate set in Assrt's resolver is exactly nine selector types: a, button, input, [role="button"], select, textarea, label, [onclick], and [href]. That's the union of (1) the native interactive elements, (2) anything that the page has explicitly opted into being interactive via ARIA, and (3) anything the page wires up with click handlers or hrefs. The set is deliberately narrow. Adding [tabindex] or [data-clickable] would broaden the set but also pollute the scoring, because most of those elements are not really click targets. The list is in browser.ts at the line that calls querySelectorAll inside showClickAt.

What are the four scoring tiers?

Tier 0 is 'try the input string as a CSS selector via document.querySelector'. If that resolves, you're done; no scoring needed. Tier 1 (exact text match) breaks the loop immediately if any candidate's lowercased textContent equals the lowercased input. Tier 2 is substring matches: target inside candidate scores 3, candidate inside target scores 2 (with a candidate-length floor of 2 characters to avoid one-letter labels). Tier 3 is partial-word matching: split the input into words longer than two characters, count how many appear in the candidate text, divide by the total word count. Highest score wins across all candidates.

Why does substring match score higher than partial-word match?

Substring match is a stronger signal because it preserves ordering. If the agent is told 'click the Continue with Google button' and one candidate contains the literal string 'Continue with Google', the order of those four words is evidence the candidate is the intended target. Partial-word match catches cases where the description and the candidate share vocabulary but not phrasing ('Sign in' vs a button labelled 'Log in to your account'). Ranking substring higher keeps the agent biased toward the unambiguous case and only falls back to fuzzy matching when the page has rebranded the target.

Why does the algorithm break on exact match instead of scoring it?

Speed and predictability. An exact match cannot be beaten by any other tier (substring match max is 3, partial-word max is 1.0, exact match is the only path that exits the loop). Breaking out of the for-loop saves work on long candidate lists and, more importantly, gives the agent a deterministic answer when the input is unambiguous. If the page has two candidates with identical text, the first one in DOM order wins, which is consistent with Playwright's own first-match convention.

What happens if no candidate scores above zero?

The function returns null. The agent's cursor doesn't move, the click doesn't happen, and the caller logs a 'no element matched' result. The agent's next step is usually to call snapshot() again, get a fresh accessibility tree, and try a different ref. This is intentional: a noisy fallback (clicking the highest-scoring element even at score 0.0) would silently exercise the wrong thing. Honest failure is cheaper than silent success in E2E.

Why this approach instead of using Playwright's getByRole(name) under the hood?

getByRole(name) is the right primitive when the agent already knows the role. Assrt's resolver runs earlier, when the input is still a free-text description and the role hasn't been inferred yet. The flow is: (1) try the literal string as a CSS selector, (2) try the candidate-set scoring function, and only after both fail does the agent fall back to calling snapshot() and asking the LLM to pick a ref from the accessibility tree, which is where getByRole-equivalent matching lives. Skipping straight to the AT-based pick on every step would be slower and would burn LLM tokens on cases the deterministic scoring already handles.

Does this scoring function ever produce the wrong pick?

Yes, and the dangerous case is when two candidates score equally and the wrong one happens to be earlier in DOM order. Example: a marketing page with a Sign In link in the header and a Sign Up link in the hero CTA. The instruction 'click Sign in' matches the header link exactly (score 3 for substring), but if a hero card contains the literal phrase 'Sign in to continue', the partial-word match also scores high. The fix is not to make the scoring smarter (that just hides the problem); the fix is to have the LLM call snapshot() and use a ref, so the deterministic scorer is only used for the easy cases. Assrt's agent prompt explicitly tells the model to prefer refs when available.

Where does this algorithm live in the Assrt repo?

In the function showClickAt inside src/core/browser.ts, around lines 294 to 344. The function is called before every click() and type() so the visual cursor overlay knows where to land, and the same scored pick is used as a fallback when ref-based targeting fails. The function is fewer than 50 lines and depends on no library; it's pure DOM. You can read it on the open-source repo and edit it if you want a different scoring function for your own E2E setup.

How does this connect to Playwright MCP refs?

Playwright MCP returns an accessibility tree where each element has a ref like e5 or e12. When the LLM picks a ref, the click is unambiguous: there is one element with that ref, the scoring function isn't invoked. The scoring function runs when the agent is told to click by description ('the Sign in button') and no ref was given. In practice, the agent prompt encourages ref-based clicks because they're stable across re-renders, and the scoring function is the safety net for the cases where the model gives a description instead.

Can I change the candidate set or the scoring weights?

Yes. The file is browser.ts in the open-source repo; the candidate set is a literal string in a querySelectorAll call and the scoring tiers are 7 lines of if/else. Fork it. If your app uses div role="button" patterns heavily you might add [role="link"] and [role="menuitem"]; if you want to favor exact-substring over partial-word more aggressively you can multiply tier 3a's score. The point of open-sourcing this is that the heuristics are a starting point, not a contract. Hosted SaaS competitors don't let you do this; their resolver is a closed-source black box you call by API.

Related guides on selector reliability

AI Playwright

Cached selector staleness in AI Playwright frameworks

The cache hits but resolves to the wrong element. How a self-validating cache entry catches it before the click lands.

Read

Self-healing

Self-healing tests: what actually heals and what hides bugs

What 'self-healing' means in practice, what it costs in confidence, and where the line between resilience and noise sits.

Read

Playwright

Playwright locator strategy for beginners

getByRole, getByLabel, getByText, getByTestId, CSS, XPath: a beginner-friendly tour of when each one is the right pick.

Read

Multi selector candidate ranking in E2E tests: the four-tier scoring function inside an AI test agent

The problem this function solves

The candidate set is exactly nine selector types

Tier 0: try the literal string as a CSS selector

Tier 1: exact text match wins, break out

Tier 2: substring matches score 2 or 3

Tier 3: partial-word match by fraction

The whole algorithm fits on one screen

Walking through four realistic scenarios

What an AI agent receives vs what gets clicked

Where this diverges from Playwright's own first-match rule

What this means for someone choosing an AI E2E tool

Want to walk through the resolver on your own app?

Multi selector candidate ranking FAQ

Related guides on selector reliability

Cached selector staleness in AI Playwright frameworks

Self-healing tests: what actually heals and what hides bugs

Playwright locator strategy for beginners

Comments (••)

Comments ()