Day 7 of building an AI Playwright framework

AI Playwright cached selector staleness: self-validating cache entries beat retry-only and TTL-only invalidation.

The LLM ranks well on fresh failures. The danger is the cached selector that still resolves after a page restructure and keeps tests green for weeks while exercising the wrong element. TTL revalidation alone misses semantic drift inside the window. Retry-only invalidation catches nothing until something throws. The honest answer is a self-validating cache entry: every hit re-checks role, accessible name, and content hash before acting, and falls through to a fresh LLM resolve when any of them drifts. Net cost per hit: well under a millisecond.

Matthew Diakonov
11 min read

Design choices that survive a page restructure:
Cache hits validate role, accessible name, and content hash on every access
Cache misses fall through to a fresh LLM resolve and update the fingerprint
Every cache decision lands in .ai-playwright/cache.log with reason
TTL is a 24 hour backstop, not the primary invalidation mechanism

The principles you cannot ship without

self-validating cache · role check on hit · accessible name match · content hash invalidation · no TTL-only revalidation · no retry-only invalidation · right element by accident · silent-pass failure mode · fresh resolve fallback · cache entry as record

The leak: cached selector that still resolves, but to the wrong element

On day 1, the LLM resolves the intent 'click the checkout button' to a selector. It works, the test passes, the selector lands in the cache. On day 5, the team ships a refactor: a sibling component gets recycled, classes get reused, the same selector now resolves to a different element. A naive cache returns the locator without re-checking; the click lands on the wrong button; the assertion may or may not still pass; in the worst case it does, and CI stays green for weeks while the test silently exercises the wrong flow.

The leak is not the LLM; the LLM resolved correctly the first time. The leak is the cache layer treating a still-resolving selector as a still-correct hypothesis. A selector that resolves is a necessary condition, not a sufficient one. The sufficient condition includes the role, the accessible name, and the content hash. Validate all three on every hit, and the silent-pass failure mode disappears.

The promise

Cached LLM resolves stay fresh for free

The first run pays for the LLM call. Every subsequent run hits the cache in microseconds. Tests run as fast as a hand-written Playwright suite without giving up natural-language scenarios.

The leak

The cache trusts what it should validate

A still-resolving selector is necessary, not sufficient. Without role, name, and content checks on every hit, the cache silently keeps tests green while the underlying element changes meaning.

Naive cache: if it still resolves, ship it

cache.ts (naive)

The simplest cache is the most dangerous. A selector with count over 0 is treated as valid. The framework never asks whether the element still has the right role, the right name, or the right content. A page restructure that keeps the selector resolving silently breaks the test contract.
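A minimal sketch of the naive pattern, with a plain lookup table standing in for Playwright's `page.locator(sel).count()` (all names here are illustrative, not Assrt's actual source):

```typescript
type NaiveEntry = { selector: string };
type Dom = Record<string, number>; // selector -> how many elements it matches

// Naive lookup: any entry whose selector still matches at least one element
// is returned as-is. Nothing checks what that element now is.
function naiveLookup(
  cache: Map<string, NaiveEntry>,
  dom: Dom,
  intent: string,
): string | null {
  const entry = cache.get(intent);
  if (entry && (dom[entry.selector] ?? 0) > 0) {
    return entry.selector; // "still resolves, ship it"
  }
  return null; // cache miss: caller falls through to an LLM resolve
}
```

After a refactor that recycles the class onto a different button, the lookup table still reports one match, so the stale selector comes back unchanged.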

TTL revalidation: better, still leaks inside the window

cache.ts (TTL only)

TTL forces a fresh LLM resolve eventually. It does nothing for the drift that lands inside the window. A page that restructured at hour 2 of a 24 hour TTL keeps a stale entry for 22 more hours. A page that restructured at minute 5 of a 1 hour TTL keeps a stale entry for 55 more minutes. Time is the wrong axis to invalidate on.
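The TTL-only variant can be sketched in a few lines (names illustrative); note that inside the window there is still no semantic check of any kind:

```typescript
type TtlEntry = { selector: string; resolvedAt: number };
const TTL_MS = 24 * 60 * 60 * 1000; // the 24 hour window from the text

// TTL-only lookup: entries inside the window are trusted blindly; only
// expiry forces a fresh LLM resolve.
function ttlLookup(entry: TtlEntry | undefined, now: number): string | null {
  if (!entry) return null;
  if (now - entry.resolvedAt > TTL_MS) return null; // expired -> fresh resolve
  return entry.selector; // inside the window: no semantic check at all
}
```

A page that restructures at hour 2 still gets the stale selector back for the remaining 22 hours, which is exactly the leak described above.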

Self-validating cache: re-check role, name, and content hash on every hit

cache.ts (self-validating)

Every hit pays for three attribute reads (role, name, textContent) and one sha1, which is microseconds. Drift on any field invalidates the entry, logs the reason, and falls through to a fresh LLM resolve. The framework cannot silently click the wrong element by accident; it can only click the element whose fingerprint still matches the original pick.
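A self-contained sketch of the per-hit validation, assuming a 0.20 name-distance band; the `Observed` values stand in for what Playwright attribute reads would return, and all names are illustrative rather than Assrt's actual source:

```typescript
import { createHash } from "node:crypto";

type Fingerprint = { role: string; name: string; contentHash: string };
type Observed = { role: string; name: string; text: string };

const sha1 = (s: string) => createHash("sha1").update(s).digest("hex");

// Levenshtein distance normalized by the longer string's length (0 = identical).
function nameDistance(a: string, b: string): number {
  if (a.length === 0 && b.length === 0) return 0;
  const d = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      d[i][j] = Math.min(
        d[i - 1][j] + 1,
        d[i][j - 1] + 1,
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),
      );
  return d[a.length][b.length] / Math.max(a.length, b.length);
}

// Drift on any field invalidates the entry with a reason for the log.
function validateHit(
  cached: Fingerprint,
  seen: Observed,
): { ok: true } | { ok: false; reason: string } {
  if (seen.role !== cached.role) return { ok: false, reason: "role_drift" };
  if (nameDistance(seen.name, cached.name) > 0.2)
    return { ok: false, reason: "name_drift" };
  if (sha1(seen.text) !== cached.contentHash)
    return { ok: false, reason: "content_drift" };
  return { ok: true };
}
```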

The cache decision log

.ai-playwright/cache.log (after a run)

One JSON record per cache decision. Hit validated, miss with reason, fresh resolve fallback, the new selector. Reviewers can diff the log across runs to see when a project shipped a structural change worth a review. Cache miss spikes are a regression signal.
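The record shape can be sketched as a discriminated union; the field names here are assumptions for illustration, not Assrt's documented log schema:

```typescript
// One JSON line per cache decision.
type CacheDecision =
  | { intent: string; decision: "hit_validated"; selector: string }
  | {
      intent: string;
      decision: "miss";
      reason: "role_drift" | "name_drift" | "content_drift" | "no_entry";
      previousSelector: string | null; // preserved so reviewers see what drifted
      freshSelector: string; // the new pick from the fresh LLM resolve
    };

const logLine = (at: string, d: CacheDecision): string =>
  JSON.stringify({ at, ...d });
```

Because each decision is one JSON line, `diff` or `jq` over two runs' logs is enough to spot a miss spike.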

The flow of a self-validating cache

Intent in, cache lookup, validate the fingerprint, click or assert if the fingerprint matches, fall through to fresh LLM resolve when it does not. Every branch logs the decision. No path bypasses the fingerprint check.

[Diagram] Intent → cache.json lookup → Validator (fingerprint checked against the DOM under test) → click or assert on a match, fresh LLM resolve on a mismatch; every branch writes to cache.log.

The seven structural choices that make the cache safe

Generic caching wisdom (TTL, retry, write-through) does not map cleanly to AI selector caching. Each of these is a structural choice about which signals invalidate, which proceed, and how a reviewer reconstructs what happened.

Cache the resolve, not the trust

A cached selector is a hypothesis: 'this string still points at the element I picked last time'. Treat it like a hypothesis, not a fact. Validate role, accessible name, and content hash on every hit. Trust costs one extra Playwright call per element; it is cheaper than a silent pass.

Role is the floor

If the cached selector resolves but the role drifted (button to span, link to button), the cache is wrong. Role check is fast, side-effect free, and the cheapest signal of structural drift you can compute without a model call.

Accessible name catches rebrands

A label trim from 'Checkout' to 'Buy now' or a copy edit from 'Sign up' to 'Get started' tells you the team renamed the surface. Accept inside a small distance band; escalate to fresh resolve outside it.

Content hash catches the right element by accident

The dangerous case is a cache entry that still resolves and still has the right role and name, but the underlying element now contains different text content (a swapped row, a different cell, a recycled component). A hash of the textContent at pick time, compared to the textContent at hit time, catches this.
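Two lines of hashing are enough to show the swapped-row case; the row text here is made up for illustration:

```typescript
import { createHash } from "node:crypto";

const sha1 = (s: string) => createHash("sha1").update(s).digest("hex");

// Same role ("button"), same accessible name ("Pay now") — but a recycled
// component now renders a different row. Only the content hash notices.
const atPickTime = sha1("Order #1042 · $47.89 · Pay now");
const atHitTime = sha1("Order #2210 · $12.00 · Pay now");
const drifted = atPickTime !== atHitTime; // true -> invalidate, fresh resolve
```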

TTL is a backstop, not a primary

Time-based revalidation alone misses semantic drift that lands inside the window. A page that restructured at hour 2 of a 24 hour TTL is still in the cache. TTL is fine as a 'no entry lives forever' guarantee; it cannot replace per-hit validation.

Retry-only is worse than TTL

Retry-only invalidation only fires when something throws. A cached selector that resolves to the wrong element does not throw; the test passes silently. Retry catches nothing in the silent-pass failure mode by definition.

Every cache decision lands in a log

Cache hit validated, cache miss with reason, fresh resolve fallback, all of them. One JSON record per decision. The log is the audit trail for why a test passed or failed at the cache layer, and the source for tuning the distance bands.

7 days into building before staleness was the top concern
3 fingerprint fields validated on every cache hit
0 silent-pass failures the team will tolerate
24 hours max a cache entry stays alive without a fresh validation

Naive cache vs self-validating cache

Same data structure shape, same LLM resolver, same Playwright API. The only difference is what gets validated on every hit. The cost per hit is microseconds; the cost of a silent-pass test is weeks of green CI on a broken flow.

Feature by feature: naive or TTL-only cache vs self-validating cache (Assrt is one option)

Selector still resolves but points at the wrong element
Naive or TTL-only: silent click. Test reports green. Wrong element exercised.
Self-validating: content hash mismatch. Fall through to fresh LLM resolve. New pick logged.

Label rename ('Checkout' to 'Buy now')
Naive or TTL-only: either a silent click (cached selector still resolves) or a hard fail (selector no longer matches).
Self-validating: accessible name distance check. Within 0.20 distance, accept and update the cache. Outside it, fall through to fresh resolve.

Role drift (button became span)
Naive or TTL-only: the click sometimes throws (a button click on a span fails on aria), sometimes passes silently. Behavior depends on the action.
Self-validating: role check fails on the first hit. Cache entry invalidated. Fresh resolve. Log entry tagged role_drift.

Cost per cache hit
Naive or TTL-only: 1 Playwright call (count or click).
Self-validating: 1 Playwright call plus 3 attribute reads (role, name, textContent). Microseconds.

Cost per cache miss
Naive or TTL-only: 1 LLM resolve. About 200 to 600 ms.
Self-validating: same. The resolve is identical; the trigger is just smarter.

Audit trail
Naive or TTL-only: none. The cache is opaque; reviewing why a test passed is impossible.
Self-validating: one JSON record per decision. Cache hit, cache miss with reason, fresh resolve. Log diffs across runs.

Time to detect structural drift
Naive or TTL-only: hours to weeks (until something throws).
Self-validating: microseconds (the next cache hit on the affected intent).

Failure mode when the model makes a bad pick
Naive or TTL-only: the bad pick gets cached, clicked, and silently exercised on every subsequent run.
Self-validating: the bad pick is fingerprinted; the next run that touches a different page state surfaces a content hash mismatch and falls through to a fresh resolve.

The competitor column describes the shape of a typical naive AI Playwright cache layer, not any single implementation. The self-validating column maps to how Assrt is wired today; the cache fingerprint shape and miss-reason taxonomy are documented in the public source.

Six steps to wire a self-validating cache into your AI Playwright framework

The refactor takes a day on an existing cache layer. Most of the work is plumbing the fingerprint through the LLM resolver and the Playwright accessor; the validation logic is short.

1

Define the fingerprint at pick time

When the LLM resolves an intent to a selector for the first time, capture three things along with the selector: the role of the element (button, link, textbox, dialog), the accessible name (aria-label or trimmed textContent), and a sha1 of the relevant textContent window. Store all four under the intent key.
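A sketch of the capture step, where `el` stands in for what you would read through Playwright at pick time (role, aria-label, textContent); the shape and names are assumptions:

```typescript
import { createHash } from "node:crypto";

// The four stored fields: the selector plus the three-part fingerprint.
type CacheEntry = {
  selector: string;
  role: string; // button, link, textbox, dialog...
  name: string; // aria-label, else trimmed textContent
  contentHash: string; // sha1 of the relevant textContent window
};

function fingerprintAtPick(
  selector: string,
  el: { role: string; ariaLabel?: string; textContent: string },
): CacheEntry {
  return {
    selector,
    role: el.role,
    name: (el.ariaLabel ?? el.textContent).trim(),
    contentHash: createHash("sha1").update(el.textContent).digest("hex"),
  };
}
```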

2

Validate the fingerprint on every hit

Before clicking or asserting, query the DOM through the cached selector, compute the same three fields, and compare. A role mismatch invalidates unconditionally. A name distance over 0.20 falls through to a fresh resolve. A content hash mismatch falls through and writes a log entry tagged as a 'right element by accident' candidate.

3

Fall back to fresh LLM resolve, not to error

If validation fails, the next step is a fresh LLM resolve on the current DOM. The new pick replaces the cache entry. The test continues. The previous selector is preserved in the log so reviewers can see what drifted. This is the difference between a self-healing cache and a brittle cache.

4

Tune the distance bands per project

Default to 0.20 on accessible name and 0 on role. Adjust based on the log. If the project trims labels frequently, raise name distance to 0.30. If the project rotates classes on every build, store the selector by data-test attribute instead of class.
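The knobs above fit in a small per-project config object; the type and field names are illustrative, with defaults taken from the text:

```typescript
type CacheConfig = {
  roleDistanceMax: 0; // role must match exactly
  nameDistanceMax: number; // normalized Levenshtein on accessible name
  ttlHours: number; // backstop only, never the primary invalidation
};

const defaults: CacheConfig = { roleDistanceMax: 0, nameDistanceMax: 0.2, ttlHours: 24 };

// A project that trims labels frequently loosens only the name band:
const labelHeavy: CacheConfig = { ...defaults, nameDistanceMax: 0.3 };
```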

5

Add a TTL backstop, but do not rely on it

Set a TTL of 24 hours so no cache entry survives indefinitely. The TTL is a safety net for cases the fingerprint somehow missed (a re-themed page that kept the same role, name, and partial content). Do not raise the TTL above 7 days; the audit value of fresh resolves erodes.

6

Surface cache decisions in CI

A run summary that says '143 cache hits validated, 7 cache misses on name drift, 2 cache misses on content drift' is more useful than a green check. The cache miss count is a regression signal: a sudden spike means the team shipped a structural change worth a quick review.
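Producing that summary is a fold over the run's decision records; a sketch under the assumed record shape from the log section (names illustrative):

```typescript
type Decision =
  | { decision: "hit_validated" }
  | { decision: "miss"; reason: string };

// Collapse a run's cache log into the one-line summary CI prints.
function summarize(decisions: Decision[]): string {
  const hits = decisions.filter((d) => d.decision === "hit_validated").length;
  const missCounts = new Map<string, number>();
  for (const d of decisions)
    if (d.decision === "miss")
      missCounts.set(d.reason, (missCounts.get(d.reason) ?? 0) + 1);
  const tail = [...missCounts]
    .map(([reason, n]) => `${n} cache misses on ${reason}`)
    .join(", ");
  return tail ? `${hits} cache hits validated, ${tail}` : `${hits} cache hits validated`;
}
```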

What a self-validating run looks like in your shell

One CI run, two cache misses caught, two fresh resolves logged. The test that should fail does fail, the test that should heal does heal, and a reviewer can read the log to see why each decision went the way it did.

a normal day with a self-validating cache

The reframing

A cached selector is a hypothesis, not a fact. Validate it on every hit, not just on every error.

The retry-only mental model is borrowed from network programming, where the layer below throws when it fails. The DOM does not throw on semantic drift; it returns the wrong element silently. The cache layer of an AI Playwright framework cannot inherit the retry-only assumption from its lower layers; it has to validate the semantic fingerprint of the cached element on every hit, or it leaks a silent-pass failure mode the rest of the framework cannot detect.

Assrt is one framework built on this principle: cache entries carry role, accessible name, and content hash; every hit validates them; every miss falls through to a fresh LLM resolve and writes a log entry. There are others. The principle outlasts any single implementation.

3 fields

A cached AI selector is a hypothesis. The silent-pass failure mode is treating it as a fact. Self-validation on every hit costs microseconds and saves weeks of green CI on broken flows.

Cache design notes, day 7, 2026

Want a cache design review of your AI Playwright framework?

Bring the resolver code and we will walk through the fingerprint shape and the miss-reason taxonomy in 30 minutes. No pitch.

Frequently asked questions

What is the silent-pass failure mode of a cached selector?

A cached selector silent-passes when it still resolves to a real element on the page, but the element it points at is no longer the one the scenario meant. The page restructured (a row swap, a portal move, a recycled component, a renamed class), and the same selector now hits something different. Playwright does not throw; the click lands; the assertion may even pass on the new content. CI is green for weeks while the test exercises the wrong target. This is the failure mode an AI Playwright framework is most likely to ship if the cache layer is naive, because the LLM was the only thing checking semantic correctness, and the cache skipped it.

Why is TTL revalidation alone insufficient?

TTL fires on time, not on semantics. If the page restructured at hour 2 of a 24 hour TTL window, the cache entry is still alive and still wrong for 22 more hours. You catch the drift on the next refresh, which might be many test runs later. Setting a short TTL (one hour, ten minutes) trades silent passes for a higher LLM bill and slower runs, and it still does not catch drift inside the window. TTL is a useful backstop ('no cache entry should live forever') but cannot be the primary invalidation strategy for an AI framework where every selector is a fuzzy hypothesis.

Why is retry-only invalidation worse?

Retry-only fires when something throws: a click hits no element, an assertion times out, a navigation does not land. A cached selector that resolves to the wrong element does not throw; it produces a successful operation on the wrong target. Retry catches nothing in this case by definition. Retry is the right behavior for transient flakes (network timeouts, animation jitter) and the wrong behavior for semantic drift. The two need different mechanisms.

What is a self-validating cache entry?

An entry that stores enough metadata about the original pick to validate, on every subsequent hit, that the element the selector points to is still semantically the same one. The minimum useful fingerprint is three fields: the role of the element at pick time (button, link, textbox), the accessible name at pick time (aria-label or trimmed textContent), and a content hash (sha1 of the relevant textContent window). On every cache hit, the framework re-queries those three fields and compares. Drift in any of them invalidates the entry and triggers a fresh LLM resolve.

How does this compare to data-test attributes?

Data-test attributes are the gold standard for selector stability; if your team is willing to maintain them, the cache layer becomes much simpler because the selector itself never drifts. The reality is that AI Playwright frameworks are usually adopted on codebases without disciplined data-test coverage. The self-validating cache is the layer that lets the framework be useful on those codebases without paying the silent-pass tax. As the team adds data-test attributes, the cache miss rate drops, the LLM bill drops, and the framework gets faster.

What is 'right element by accident' and how do you catch it?

An LLM resolves 'click the primary call to action' on a marketing page; the page has two buttons with similar shapes; the model picks the wrong one but the click happens to navigate to a page that contains the assertion text the scenario expected. The test passes. Three weeks later, the team A/B-tests the page and the wrong button changes label. Now the test fails for the wrong reason and nobody can reconstruct the original mistake. Content hash invalidation catches this on the next run after the A/B change, because the textContent of the originally-picked element no longer matches the fingerprint, and the cache falls through to a fresh resolve that is more likely to land on the intended target.

Does this make every test slower?

Each cache hit pays for three extra Playwright attribute reads (role, name, textContent), which are microseconds, plus one sha1, which is microseconds. Net cost per cache hit is well under a millisecond. Cache miss cost is unchanged because the fresh LLM resolve was always the fallback. Total runtime impact on a 50 step scenario with a 90 percent hit rate is under 50 ms. The cost is negligible compared to the alternative of a silent-pass test.

Where should the fingerprint be computed for a non-button element?

For form fields, store the role (textbox, combobox), the label (computed via the label-element relationship or aria-labelledby), and a hash of the placeholder plus name attribute. For text spans (price displays, status badges), store role null, the trimmed textContent, and a hash of textContent. For container regions, store role (region, dialog, list), the accessible name (aria-label or heading text), and a hash of the first 200 characters of textContent. The fingerprint shape is per-element-type, not universal.
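The per-element-type taxonomy maps naturally onto a discriminated union; the type and constructor below are an illustrative sketch, not Assrt's actual schema:

```typescript
import { createHash } from "node:crypto";

const sha1 = (s: string) => createHash("sha1").update(s).digest("hex");

type Fingerprint =
  | { kind: "field"; role: "textbox" | "combobox"; label: string; hash: string } // hash of placeholder + name attribute
  | { kind: "text"; role: null; text: string; hash: string } // trimmed textContent
  | { kind: "region"; role: "region" | "dialog" | "list"; name: string; hash: string }; // first 200 chars

// Container regions hash only the first 200 characters of textContent,
// per the taxonomy above.
function regionFingerprint(
  role: "region" | "dialog" | "list",
  name: string,
  textContent: string,
): Fingerprint {
  return { kind: "region", role, name, hash: sha1(textContent.slice(0, 200)) };
}
```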

How does this work with frameworks like Assrt?

Assrt is one option that makes the cache layer auditable. Every selector pick, every cache hit validation, every cache miss with reason lands in a log file in the repo. The fingerprint shape is configurable per project. The fresh-resolve fallback is the default agent behavior, and the model name (claude-haiku-4-5-20251001 in the current build) is greppable in agent.ts. Other AI Playwright frameworks can adopt the same pattern; the principle (self-validate, do not silent-trust the cache) is independent of any vendor. Assrt's contribution is making it the default rather than an opt-in.

What does the cache log buy a debugging session?

When a test fails on a cache miss, the log has the previous fingerprint and the observed fingerprint side by side. The reviewer sees that the role drifted from button to span, or the label trimmed from 'Apply' to 'Apply code', or the content hash changed because the price went from $47.89 to $54.20. The diagnosis is in the log, not in a Playwright stack trace that says 'expected 1 but got 2'. The mean time to root cause drops from minutes to seconds for the most common drift cases.

How do you tune the distance bands?

Start with role match strict (no distance allowed), accessible name distance 0.20 (Levenshtein normalized by length), content hash strict equality on a narrow textContent window. Run a week. Inspect the cache miss log. If the team trims labels frequently, raise name distance to 0.30. If the project uses dynamic IDs that bleed into textContent (timestamps, session tokens), exclude those substrings before hashing. Tuning is a per-project conversation, not a global default.

When should the cache be cleared completely?

On a major release that restructures the surface globally, on a model upgrade that changes resolve semantics, and as a backstop weekly. Most teams find that the self-validation handles structural drift incrementally and a global clear is rarely needed. The cache file is a few hundred entries; clearing it is cheap; the next run repopulates it through fresh resolves.

Assrt — open-source AI testing framework
© 2026 Assrt. MIT License.
