AI testing jobs in 2026: two primitives, one open-source loop
Most pages on this topic give you a list of frameworks (Selenium, Cypress, Playwright), a salary band, and a paragraph about “working alongside AI as a co-tester.” Useful for a recruiter, useless to a working engineer. The interesting AI testing job in 2026 is the one that tests AI applications themselves: chatbots, RAG widgets, agentic browser tools. The daily work in that role collapses to two very specific primitives: an explicit passCriteria field that pins the verdict, and a MutationObserver-based wait_for_stable that adapts to streaming. Both live in one MIT-licensed reference loop you can clone before tomorrow's call.
Every claim below has a line number against the assrt reference loop on GitHub. No vendor key, no procurement form, no per-seat license. Read along.
The frame the job-description templates miss
Two markets share this title. The first is older: a QA engineer who happens to use an AI tool on top of a regular web app. The second is newer and pays better: an engineer who tests AI applications themselves, where the output is generated text or generated actions, not a fixed string in a fixed div. This page is mostly about the second one. The upside of the second: the work is structurally harder, the title commands a higher band, and the field has not had time to fossilise into framework wars yet.
Once you accept that the system under test is nondeterministic, two design problems show up immediately. What does “passing” mean when the output is a generated string? And how do you wait for a streamed response to finish without inserting a fixed sleep that flakes the moment the model is slow? Solve those two and the rest of the role is recognisable: plan files, JSON reports, CI gates. Fail to solve them and your suite is a vibes check.
Primitive one: passCriteria, the boundary that pins fuzzy outputs
The first primitive is a single Zod field on the assrt_test tool registration. Free text, in your words; the agent must verify every clause before the scenario passes. This is what makes a deterministic verdict possible against a nondeterministic system. The agent observes the page (it is, itself, an LLM), but the boundary of pass/fail is pinned to a fact a human chose. Without an explicit passCriteria, an LLM agent looking at LLM-generated output will happily call almost anything “a successful response,” and the suite degrades into noise. With one, the test asks a question with a checkable answer.
The pattern, in practice: name three to five concrete facts the user must see for the feature to count as working. Cart total. URL path. Toast text. A specific phrase from the source doc. The presence or absence of a refusal pattern. Each clause is a small claim about the world that the agent verifies on screen. None of them depend on the assistant generating the same wording twice.
Primitive two: wait_for_stable, a 33-line MutationObserver loop
The second primitive is a tool the agent can call to wait until the page stops mutating. The implementation is 33 lines. It injects a MutationObserver via browser.evaluate, polls the mutation count every 500 milliseconds, and considers the page stable after 2 seconds of no new mutations (default). A 30-second ceiling caps the wait. There is no vendor SDK, no proprietary “auto-wait” rule engine, and no fixed timeout that flakes the moment a model is slow. You can read the entire loop and predict its behaviour without trusting documentation.
Streaming chat responses break a fixed wait. RAG completions that paint tokens incrementally break a fixed wait. Autocomplete dropdowns that update twenty times a second break a fixed wait. The MutationObserver-based version adapts to actual paint speed and is the difference between a flaky AI test and a green one. Once you have written this primitive, you never write it again: every #Case in your portfolio adds a single tool call and gets correct streaming-aware waits for free.
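The shape of that loop is small enough to sketch as a self-contained function. The names and the getMutationCount probe here are assumptions for illustration: in the real tool, the counter is injected via a MutationObserver through browser.evaluate (agent.ts lines 956 to 994), not passed as a callback.

```typescript
// Sketch of the wait_for_stable idea (illustrative names, not the repo's code):
// poll a mutation counter until it stops changing for a quiet window,
// capped by an overall ceiling.
async function waitForStable(
  getMutationCount: () => number, // in the real tool, read from an injected MutationObserver
  pollMs = 500,
  quietMs = 2_000,
  ceilingMs = 30_000,
): Promise<boolean> {
  const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));
  const start = Date.now();
  let last = getMutationCount();
  let quietSince = Date.now();
  while (Date.now() - start < ceilingMs) {
    await sleep(pollMs);
    const count = getMutationCount();
    if (count !== last) {
      last = count; // page still painting: reset the quiet window
      quietSince = Date.now();
    } else if (Date.now() - quietSince >= quietMs) {
      return true; // no mutations for quietMs: stable
    }
  }
  return false; // ceiling hit while the page was still mutating
}
```

With the defaults above, a reply that streams for eight seconds settles about two seconds after the last token paints, and a page that never settles returns false at the 30-second ceiling instead of hanging the suite.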
The two primitives sit on opposite sides of the same loop
What the two primitives look like in a real plan file
Here is a portfolio-grade plan file for a chat surface. Two #Case blocks, both with explicit passCriteria. The second one demonstrates {{VARS}} interpolation across seeds, which is how you exercise nondeterminism without writing the same plan five times.
One bash line runs the whole thing locally and in CI. Same line, same JSON shape, same .webm. A reviewer can clone the repo, paste the line, and watch the agent drive the chat surface in their own browser. No vendor account, no sandbox key, no quota.
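A sketch of what such a plan might contain. The #Case / steps / passCriteria layout below is inferred from the descriptions on this page, so treat the exact grammar as illustrative rather than the repo's canonical syntax:

```markdown
#Case: streamed reply cites the source doc
Open /support/chat, then call wait_for_stable.
Send "What is your cancellation policy?", then call wait_for_stable.
passCriteria: Reply names the cancellation policy doc; reply does NOT
contain a refusal pattern; no error toast appears after the stream ends.

#Case: greeting is personalised for {{USER}}
Log in as {{USER}}, open /support/chat, then call wait_for_stable.
Send "{{PROMPT}}", then call wait_for_stable.
passCriteria: Reply mentions {{USER}}'s first name; the URL path stays
on /support/chat; no error toast appears.
```

The run line follows the shape shown in the comparison table later on the page: `npx assrt run --url <your-app-url> --plan-file plans/chat.md`, with the URL placeholder yours to fill.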
“The entire streaming-aware wait, the piece every AI testing job needs once a chat surface arrives, is 33 lines of MutationObserver in a public repo. You can read it on a flight.”
Two markets, two resumes, two salary bands
The market for “AI testing” jobs is bifurcating. Postings still anchored to the 2022 stack (Selenium plus a vendor flake detector on a regular web app) cluster in the United States in the rough range of $95,000 to $135,000 base. Postings written for the 2026 loop on AI applications (engineer embedded in a feature team, owns the plan-design layer, ships scenarios in the same PR as the feature, reasons about model nondeterminism) pay closer to $150,000 to $200,000. Specialised roles like AI red-teamers and Data QA engineers can land higher; some senior AI/ML test engineering postings exceed $200,000.
The practical implication for a candidate: do not lead your resume with “eight years of Selenium.” That sentence used to read as senior; in 2026 it reads as “measured my career by a tool that is now legacy.” Lead with two artifacts: a portfolio plan file with explicit passCriteria, and a recording showing the agent driving a real AI surface against those criteria. Hiring managers reading those two artifacts can evaluate test design, evidence handling, and model literacy in fifteen minutes. That asymmetry is in your favour.
passCriteria is the boundary for fuzzy outputs
server.ts line 343. A free-text Zod field on the assrt_test tool. The agent must verify every clause before marking the scenario passed. This is what separates a real AI test from a vibes check. If you cannot write a passCriteria for a feature, you cannot test it.
wait_for_stable is the streaming-response wait
agent.ts lines 956 to 994. MutationObserver, 500ms poll, 2s default quiet window, 30s ceiling. No vendor SDK and no fixed sleep. Adapts to actual paint speed.
{{VARS}} are how you parameterise across seeds
server.ts line 344. {{KEY}} interpolation lets you run the same #Case across multiple seeds, prompts, or user identities. The same plan covers ten conversations without ten copies.
JSON report + .webm recording are the audit trail
Per-step screenshots, agent decisions, and a video. A reviewer can play the .webm and watch the agent type, click, and assert. The recording catches modals an assertion misses.
assrt setup wires it into your daily flow
cli.ts lines 201 to 275. Writes a Claude Code hook that fires on every commit. Continuous-test discipline shows up as a checked-in .claude/settings.json in your portfolio repo.
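The {{KEY}} interpolation above is simple enough to sketch. This is a hypothetical stand-in for illustration, not the repo's actual implementation:

```typescript
// Hypothetical sketch of {{KEY}} interpolation: substitute known variables,
// leave unknown placeholders intact so a typo surfaces visibly in the run.
function interpolate(plan: string, vars: Record<string, string>): string {
  return plan.replace(/\{\{(\w+)\}\}/g, (match, key) =>
    key in vars ? vars[key] : match,
  );
}
```

The same #Case then runs once per fixture set: `interpolate(plan, { USER: "alice" })`, `interpolate(plan, { USER: "bob" })`, and so on, without copying the plan.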
The before / after of writing a single chat assertion
The clearest way to see the difference between the two markets is to look at how they assert one thing: that the assistant's reply mentions a specific source document. The 2022 stack writes a string match. The 2026 stack writes a multi-clause passCriteria string and lets the agent verify each clause as a fuzzy assertion against the rendered page.
Asserting on a streamed chat reply
```typescript
await page.waitForTimeout(3000);
expect(await page.textContent('.message-bubble')).toContain('cancellation policy');
```
- Fixed sleep, flakes when the model is slow
- Single string match, blind to refusals or wrong citations
- Test rerun gives a different answer with no signal why
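The 2026 counterpart writes the same check as a multi-clause passCriteria string. The wording below is illustrative, following the clause style described earlier on this page:

```
passCriteria: Reply cites the cancellation policy doc by name; reply does
NOT contain a refusal pattern such as "I can't help with that"; no error
toast appears after the stream finishes painting.
```

The agent verifies each clause as a fuzzy assertion against the rendered page, so a refusal or a wrong citation fails with a named reason instead of a silent string mismatch.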
The daily loop, step by step
Once a candidate lands the role, the day looks like this. None of the steps require a vendor key. None route through a QA queue.
Pull the AI feature plan
/plans/<feature>.md lives next to the feature code. You open it alongside the engineer who is iterating on the prompt. No separate test repository, no QA queue.
Author the passCriteria first, then the steps
Every #Case begins with the question 'what fact must be true on screen for this to pass?'. Write that as plain English. The steps come second; they are how the agent gets to the page where the fact is verifiable.
Insert wait_for_stable after every streamed event
Click submit, then call wait_for_stable. Send the prompt, then call wait_for_stable. Anywhere the model paints tokens or the agent acts, the next assertion sees a frozen DOM.
Run with --video on, watch the recording
Recording auto-opens in a browser player. You scan for unexpected modals, hallucinated logos, double-renders, and other things assertions miss. This is the senior habit.
Tighten passCriteria where the agent was lenient
If the screenshot shows the wrong tenant name, you add a clause that names the right one. The agent verifies every criterion or the run fails. No fuzzy 'looks fine' assertions.
Diagnose the one flake, ship the plan diff
assrt_diagnose reads the failed step trace and the screenshot, returns a root-cause sentence and a corrected #Case. You commit the corrected plan diff in the same PR as the feature.
How the 2022 and 2026 stacks compare, line by line
Most of the gap between the two markets is visible in a six-row table. Read the right column as “what your portfolio should look like” and the left as “what most postings expected three years ago.”
| Feature | The 2022 stack (Selenium + vendor SaaS) | The 2026 loop (open-source reference) |
|---|---|---|
| How a chat reply is asserted | expect(reply).toContain('hello') — string match against an LLM output | passCriteria with three plain-English clauses; agent verifies all three or fails |
| How a streaming response is awaited | page.waitForTimeout(3000) — fixed sleep, flakes when the model is slow | wait_for_stable with a MutationObserver; adapts to actual paint speed |
| How nondeterminism is handled | Pin the prompt and hope; rerun on red, accept the flake budget | Broader passCriteria that names the fact, plus {{VARS}} for seed sweeps |
| Reviewer setup time for a take-home | Sign up for the vendor dashboard, paste a sandbox key, request a license | git clone, npx assrt run --url ... --plan-file plans/take-home.md |
| Authoring artifact in your portfolio | Selenium spec.ts, page-object hierarchy, fixture wrappers, vendor YAML | One Markdown plan with passCriteria blocks; reads top to bottom |
| Cost shape an employer signs off on | Per-seat license at $7.5K/month for an AI testing SaaS | MIT license + a few cents in LLM tokens per scenario |
How big is the cost gap, really?
For an employer making a buy decision, the cost gap between “hire onto an AI testing SaaS” and “run the open-source reference” is most of the conversation. Hosted AI testing platforms that compete on this surface still land in the four to five figures per month per team. The local CLI produces the same JSON report and the same .webm recording on a laptop with no per-seat fee. The math does not require a spreadsheet.
Interview-day prep: a concrete checklist
If you have an interview for an AI testing role on the calendar this week, the prep is short and concrete. None of these steps require a vendor account or a credit card.
Interview-day prep: the things to demonstrate
- Clone github.com/m13v/assrt-mcp before the call.
- Read passCriteria at server.ts:343 (one Zod field).
- Read wait_for_stable at agent.ts:956-994 (33 lines).
- Pick one AI app you can test (chat, RAG, agent).
- Write one #Case with a 3-clause passCriteria block.
- Run with --video so you can play the recording on screen.
- Commit /plans, /results, and the .webm to your portfolio repo.
- Have an answer for 'why two reruns might disagree'.
Why this works
Every artifact above is a real file on disk that a hiring manager can read without trusting your screen-share. The plan is a Markdown diff. The passCriteria are plain English clauses. The report is a JSON file. The recording is a .webm. The two primitives that make the daily work tractable are 33 lines and one Zod field. A candidate who shows up with a runnable portfolio is interviewing for a job at a different leverage from one who shows up with a list of frameworks they have used.
Hiring for an AI testing role, or interviewing for one?
Talk through what passCriteria and wait_for_stable look like at your team and the loop the open-source reference exposes.
Frequently asked questions
What is an AI testing job, exactly, in 2026?
There are two markets that share a name. The first is older: a QA engineer who happens to use an AI tool (a generator, a self-healing locator, a flake detector) on top of a regular web app. Salary band in the United States lands in the rough $95,000 to $135,000 range. The second is newer and pays better: an engineer who tests AI applications themselves. Chatbots, RAG widgets, voice agents, autonomous browser agents. Salary bands here cluster from $130,000 to $200,000 because the work is structurally harder. The output of the system under test is generated text or generated actions, not a fixed string in a fixed div. This page is mostly about the second market.
What does a daily AI testing job actually look like?
A typical day touches a chat surface, an agent loop, and a retrieval layer. The morning starts with a Markdown plan file (one /plans/<feature>.md per feature). Each #Case block declares the steps in plain English plus an explicit passCriteria string that says what must be true on screen for the case to pass. You run the suite locally with the agent driving a real browser. When the assistant streams a response into a chat surface, the test calls wait_for_stable to wait until the DOM stops mutating before it asserts. After the run, you scan the JSON report for failures, watch the .webm recording for things assertions missed (an unexpected modal, a stuck spinner, a hallucinated brand name), and ship a plan diff next to the feature commit.
What is the passCriteria field and why does it matter?
passCriteria is a single free-text Zod field at src/mcp/server.ts line 343 on the assrt_test tool. It tells the agent what conditions MUST be true for the scenario to pass, in your own words, e.g. 'Cart total shows $42.99', 'Reply mentions the customer's first name', 'Error toast does NOT appear'. The agent must verify every criterion or fail the run. This sounds trivial. It is the entire bridge that lets you write deterministic tests against nondeterministic systems. Without an explicit passCriteria, an LLM agent observing an LLM-generated UI will happily call anything 'a successful response' and the run will be junk. With one, the boundary is pinned to a fact a human chose.
What is wait_for_stable and why does an AI testing job need it?
wait_for_stable is a tool the agent can call to wait until the page stops mutating. The implementation lives at src/core/agent.ts lines 956 to 994. It injects a MutationObserver via browser.evaluate, polls every 500ms for the mutation count, and considers the page stable after 2 seconds of no new mutations (default), with a 30-second cap. Streaming chat responses, RAG completions that paint tokens incrementally, autocomplete suggestion lists, agent loops that update the same div twenty times in a second: all of these break a fixed wait('2s'). A MutationObserver-based wait adapts to actual paint speed and is the difference between a flaky AI test and a green one.
How is testing an AI app different from testing a normal web app?
Three differences in practice. First, the output is a generated string, so a 'click submit and assert h1' test no longer works. You write passCriteria in your own words and the agent verifies them as a fuzzy assertion. Second, the output is streamed, so the test must wait for paint to stop, not for a fixed time. Third, the same prompt with the same fixtures can produce a different output on a rerun. The mitigation is small fixed seeds where the model permits, and broader passCriteria that name the fact you cared about ('the answer cites the cancellation policy doc') rather than the exact wording. Engineers who internalise these three habits are the ones the better-paid version of this role hires.
What should I put in my AI testing portfolio for an interview?
A public GitHub repo with three things. First, a /plans directory of Markdown plan files written for one open-source AI app, your choice. Cal.com's AI scheduler, an Excalidraw AI extension, or a public RAG demo are all fine. Each plan should have at least one #Case that exercises a streaming response with wait_for_stable and at least one passCriteria string that names a non-obvious fact. Second, /results/latest.json, the JSON report committed from your last run. Third, a .webm recording from the same run so a reviewer can play the agent driving the browser. The whole repo should be cloneable and runnable in one bash line. The MIT-licensed reference at github.com/m13v/assrt-mcp gives you the runner; the plans, criteria, and recordings are your work.
What languages and frameworks should an AI tester know in 2026?
TypeScript or JavaScript at a working level, because most AI app surfaces are React/Next and the reference runner is TypeScript. Python is acceptable, especially if you also test the model layer (RAG retrieval, embeddings) where the ecosystem is mostly Python. Beyond language, fluency reading an accessibility tree (the agent reads one before every action; see the system prompt at agent.ts lines 206 to 218), comfort with one CI runner (GitHub Actions or GitLab; the JSON-in, exit-code-out shape is identical), and a habit of writing acceptance criteria as plain English propositions rather than CSS selectors. Selenium memory still helps for legacy review; new code rarely needs it.
How does a hiring manager evaluate an AI testing candidate?
Three questions that map to the daily work. First, hand them an open chat widget and ask 'write a #Case for this in five minutes'. Watch whether they reach for passCriteria with concrete strings, or stop at 'verify the assistant replies'. Second, hand them a recording of a streaming response that visually finishes but throws an error console-side, and ask whether they would mark it passing. The senior answer references wait_for_stable plus a passCriteria that names what the user should actually see. Third, ask 'what stops two reruns of the same test from giving you different verdicts?' and listen for a discussion of fixture seeds, broader passCriteria phrasing, and LLM-as-judge boundaries. None of these require a license or a SaaS account.
Are AI testing jobs going away because the AI does the testing?
No. The artifact stack collapses; the judgment layer expands. The agent can drive the browser and read the accessibility tree. It cannot decide which user identity the test should run as, what counts as a passing answer for an ambiguous question, which conversational turns belong on the smoke gate vs the full release gate, or whether a passing screenshot with a wrong tenant logo is actually a regression. Every passCriteria string in your repo is something a human authored. Every variables map (parameterized via {{KEY}} interpolation, see server.ts line 344) is a fixture set the engineer shaped. The role compresses on its repetitive parts and grows on its design parts, which is the same trajectory backend engineering took with ORM frameworks twenty years ago.
What salary should I expect for an AI testing job in 2026?
The market is bifurcating. Postings still anchored to the 2022 stack (Selenium plus Java plus a vendor flake detector) cluster in the United States in the rough range of $95,000 to $135,000 base. Postings written for the 2026 loop on AI applications (engineer embedded in a feature team, owns the plan-design layer, ships scenarios in the same PR as the feature, reasons about model nondeterminism) pay closer to $150,000 to $200,000 because the role is functionally that of a backend engineer with QA judgment plus model literacy. Specialised roles like AI red-teamers and Data QA engineers can land higher. The compression on locator-by-locator authoring is real; the cost line that justified offshore manual contractors is shrinking to a few cents per scenario in tokens.
Can I run the open-source reference at work without my company paying a vendor?
Yes. The reference runner at github.com/m13v/assrt-mcp is MIT licensed. The Playwright MCP package it spawns is Microsoft's, also open source. The only thing your employer pays for is LLM token usage; default is Anthropic Claude Haiku 4.5 (set at agent.ts line 9), with a Gemini path wired at agent.ts lines 354 to 367. There is no per-seat fee, no quota on how often you can run, and no proprietary YAML dialect to learn. A candidate who walks into an interview with a working portfolio against this stack arrives with leverage on tooling cost — competitive pricing for hosted AI testing platforms still lands in the four to five figures per month, while the local CLI produces the same JSON report on a laptop.
What is a sensible take-home for an AI testing role?
Four hours, one PR, against an AI app the candidate has never seen. The brief: pick a chat or agent surface (Cal.com's AI scheduler, a public RAG demo, the candidate's own side project). Author at least three #Case blocks, each with an explicit passCriteria string, including one streamed response that needs wait_for_stable. Run the suite once and commit the JSON report and the .webm. Cause one assertion to fail intentionally (mistype a label or remove a passCriteria token) and run assrt_diagnose against it. Ship the resulting commit. A reviewer can read the diff in fifteen minutes and see test design, evidence handling, and stack-trace literacy in one place. No proprietary YAML, no vendor sandbox, nothing the candidate keeps after rejection.