Argument

A Playwright AI test generator that handles OTP needs four tools, in a strict order. Most generators ship one of them.

When you ask Cursor, Codex, or ChatGPT-with-Playwright-MCP to generate a Playwright test for an email-OTP signup, the file lands in your repo with a placeholder const code = "123456" and a six-step .fill() loop. It does not run. The bug is not the model; it is that test-generator setups skip three pieces of orchestration the OTP flow actually requires. This page names the four pieces, in the order an AI agent has to call them, with the exact tool definitions from the open-source Assrt MCP server.

Matthew Diakonov, Written with AI

Published April 30, 2026 · 9 min read

Direct answer (verified 2026-04-30)

An AI test generator can complete a Playwright OTP scenario only if it has these four tools wired in order:

1. Mint a disposable inbox BEFORE the signup form is filled, so the same address goes into both the form and the polling loop.
2. Drive the form through standard Playwright (navigate, type, click).
3. Poll the inbox with a regex cascade tolerant of 4 to 8 digit codes and varying email copy.
4. Dispatch the code with one ClipboardEvent on the parent of the split-digit inputs, not by typing into each input.

Verified against the open-source agent at github.com/assrt-ai/assrt-mcp, specifically src/core/agent.ts and src/core/email.ts.

The shape of the failure

Generic AI test generators are good at translating intent into syntax. They are not good at orchestrating runtime side effects that have to happen between two lines of test code. An OTP signup is exactly that kind of flow: between click("Sign up") and click("Verify"), a real email has to land somewhere readable, get parsed, and come back as a six-digit string before the test can continue. None of that fits in a single static file.

So the model does the only thing it can do from inside one file: it hardcodes a placeholder, leaves a TODO, and submits.

Two takes on the same scenario

# What a generic AI assistant tries to do
1. Read the form, fill in test@example.com
2. Click "Sign up"
3. ...wait, where does the code come from?
4. Hallucinate "123456" and hope
5. Type 1, 2, 3, 4, 5, 6 into six inputs
6. Form rejects: only the first digit got distributed
7. Test fails. Or worse: passes for the wrong reason.

No inbox mint at all (uses test@example.com fixed)
Hardcoded placeholder code (123456)
Six-step .fill loop that sets each input's value one at a time instead of dispatching a paste
Run result: form rejects, test red. Or worse, green for the wrong reason.

The four tools, in the order the agent calls them

You can read this list in the source as create_temp_email, browser_*, wait_for_verification_code, and evaluate. The order is enforced two ways: the system prompt says “FIRST call create_temp_email”, and the runtime handlers refuse to run later steps if no inbox exists. Soft and hard guards on the same rule.

The orchestration contract

create_temp_email

POSTs to a disposable-inbox provider's /email/new endpoint and returns a fresh mailbox. No API key, no account. The address is stored on the agent for the rest of the scenario. The system prompt names this as the FIRST call before any signup form is touched.

Defined at agent.ts:114-118. Implementation in core/email.ts wraps a 15-second-timeout fetch around the public temp-mail.io v3 API.
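A minimal sketch of the mint step, under stated assumptions. Only the /email/new path and the 15-second timeout come from the description above; the base URL parameter, the response field names (email, token), and the createTempEmail helper itself are hypothetical and are not the code in core/email.ts.

```typescript
// Hypothetical shape of a freshly minted inbox; the field names are assumed.
interface TempInbox {
  address: string;
  token: string;
}

// Narrow fetch signature so a stub can be injected; the global fetch also fits it.
type FetchLike = (
  url: string,
  init: { method: string; signal: AbortSignal },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

async function createTempEmail(
  baseUrl: string,      // assumption: provider root, supplied by the caller
  fetchImpl: FetchLike, // injected so the sketch can run against a stub
  timeoutMs = 15_000,   // 15-second budget, as described above
): Promise<TempInbox> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(`${baseUrl}/email/new`, {
      method: "POST",
      signal: controller.signal,
    });
    if (!res.ok) throw new Error(`inbox mint failed: HTTP ${res.status}`);
    const body = (await res.json()) as { email?: string; token?: string };
    if (!body.email || !body.token) throw new Error("inbox mint returned no address");
    return { address: body.email, token: body.token };
  } finally {
    // Always clear the timer so a fast response does not leave it pending.
    clearTimeout(timer);
  }
}
```

The returned address is the value that has to be typed into the signup form and polled later, which is why the call comes first.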

browser_navigate / browser_type / browser_click

Standard Playwright MCP tools driving the form. Each MCP call has its own timeout budget; the orchestration matters here only for the email field, which must receive the EXACT address minted in step one. The agent's running context carries the active address so it cannot fall out of sync.

The agent renders the active address into every system message via agent.ts:667: “Active disposable email: ${this.tempEmail.address}”. The model cannot forget which address it minted because it is in the prompt every turn.

wait_for_verification_code

Polls the inbox every 3 seconds for up to 60 seconds (capped at 120 by the runtime). When an email lands, it runs the seven-pattern regex cascade and returns the matched digits. If no inbox has been created, a guard refuses with 'Error: Call create_temp_email first.' That guard is the load-bearing piece of the ordering contract.

Tool def at agent.ts:120-126; runtime handler at agent.ts:858-879. Polling logic and regex cascade live in core/email.ts:67-129.
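The guard and the polling budget can be sketched roughly as follows. The 3-second interval, the 60-second default, the 120-second cap, and the guard's error string come from the text; the names handleWaitForCode and InboxClient, the injected sleep function, and the timeout message are assumptions made for illustration.

```typescript
// Hypothetical inbox client: resolves to a code once a matching email lands, else null.
interface InboxClient {
  fetchLatestCode(): Promise<string | null>;
}

const POLL_INTERVAL_MS = 3_000; // poll every 3 seconds
const DEFAULT_TIMEOUT_S = 60;   // default wait
const MAX_TIMEOUT_S = 120;      // runtime cap

async function handleWaitForCode(
  inbox: InboxClient | null,
  timeoutSeconds: number = DEFAULT_TIMEOUT_S,
  // Injected so the loop can be exercised without real waiting.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<string> {
  // Hard guard: polling is refused until an inbox has been minted.
  if (!inbox) return "Error: Call create_temp_email first.";

  const budgetMs = Math.min(timeoutSeconds, MAX_TIMEOUT_S) * 1000;
  let waitedMs = 0; // counts sleep time only, which keeps the sketch deterministic
  while (true) {
    const code = await inbox.fetchLatestCode();
    if (code) return code;
    // Assumed wording for the timeout result; the source does not quote it.
    if (waitedMs >= budgetMs) return `Error: no verification code within ${budgetMs / 1000}s`;
    await sleep(POLL_INTERVAL_MS);
    waitedMs += POLL_INTERVAL_MS;
  }
}
```

Returning the guard failure as a string rather than throwing mirrors the quoted error message: the model reads it as a tool result and can correct its own ordering on the next turn.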

evaluate (DataTransfer + ClipboardEvent)

Dispatch the code into the form. The system prompt hands the agent the EXACT expression to use, only allowing CODE_HERE to be replaced. This stops the model from improvising a six-step .fill loop, which is the failure mode every generic AI test generator falls into.

Tool def at agent.ts:107-112; the canonical expression is in the system prompt at agent.ts:235.

one paste, not six fills

“If the code input is split across multiple single-character fields, you MUST use evaluate to paste all digits at once. Do NOT type into each field one by one.”

Assrt agent system prompt, src/core/agent.ts:234

What the system prompt actually pins down

A reasonable question at this point: why bake the OTP recipe into a system prompt? Could a smart model not figure it out? In practice, no. The failure modes (calling .fill() per input, hardcoding the code, treating ref attributes as DOM selectors) are stable across Sonnet, GPT-5, and Gemini. The system prompt fixes them by handing the agent the exact evaluate expression and forbidding deviation:

From src/core/agent.ts in @m13v/assrt-mcp

## Email Verification Strategy

When you encounter a login/signup form that requires an email:

1. FIRST call create_temp_email to get a disposable email

2. Use THAT email in the signup form

3. After submitting, call wait_for_verification_code for the OTP

4. Enter the verification code into the form

IMPORTANT: If the code input is split across multiple single-character fields, you MUST use evaluate to paste all digits at once. Do NOT type into each field one by one. Call evaluate with EXACTLY this expression (only replace CODE_HERE with the actual code):

() => { const inp = document.querySelector('input[maxlength="1"]'); if (!inp) return 'no otp input found'; const c = inp.parentElement; const dt = new DataTransfer(); dt.setData('text/plain', 'CODE_HERE'); c.dispatchEvent(new ClipboardEvent('paste', {clipboardData: dt, bubbles: true, cancelable: true})); return 'pasted ' + document.querySelectorAll('input[maxlength="1"]').length + ' fields'; }

Do NOT modify this expression except to replace CODE_HERE.

That last sentence is the part most generic generators do not have. A model left to its own devices treats the “dispatch a clipboard event into the parent” trick as a hint and improvises a different version every time. Pinning the expression as a literal string in the prompt is what stops the variance.
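One way to keep the substitution mechanical is to treat the pinned expression as a template and splice the code in with a single replace. This is a sketch: buildPasteExpression is a hypothetical helper, and the digits-only check is an added precaution that the source does not describe. The template text is the expression quoted above.

```typescript
// Template text taken from the system prompt quoted above.
const PASTE_TEMPLATE =
  "() => { const inp = document.querySelector('input[maxlength=\"1\"]'); " +
  "if (!inp) return 'no otp input found'; const c = inp.parentElement; " +
  "const dt = new DataTransfer(); dt.setData('text/plain', 'CODE_HERE'); " +
  "c.dispatchEvent(new ClipboardEvent('paste', {clipboardData: dt, bubbles: true, cancelable: true})); " +
  "return 'pasted ' + document.querySelectorAll('input[maxlength=\"1\"]').length + ' fields'; }";

function buildPasteExpression(code: string): string {
  // Assumed precaution: only 4 to 8 digits may be spliced into a string
  // that will later be evaluated inside the page.
  if (!/^\d{4,8}$/.test(code)) {
    throw new Error(`refusing to paste non-numeric code: ${code}`);
  }
  return PASTE_TEMPLATE.replace("CODE_HERE", code);
}
```

The result is the string handed to the evaluate tool. Nothing else in the expression changes between runs, which is the property the prompt is trying to guarantee.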

The seven-pattern regex cascade

Every email OTP is “some digits, somewhere in the body, surrounded by varying amounts of marketing copy”. There is no one regex that catches them all without false positives. The cascade tries seven patterns in order, most specific first, and stops at the first match:

// src/core/email.ts:101-109
const patterns = [
  /(?:code|Code|CODE)[:\s]+(\d{4,8})/,
  /(?:verification|Verification)[:\s]+(\d{4,8})/,
  /(?:OTP|otp)[:\s]+(\d{4,8})/,
  /(?:pin|PIN|Pin)[:\s]+(\d{4,8})/,
  /\b(\d{6})\b/, // 6-digit (most common)
  /\b(\d{4})\b/, // 4-digit
  /\b(\d{8})\b/, // 8-digit
];

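The ordering matters. Take an email body like “Order #845710 has shipped. Your verification code: 391847.” A bare 6-digit pattern tried first would match 845710, the order number. With the cascade, (?:code|Code|CODE)[:\s]+(\d{4,8}) fires first and returns 391847, the actual code. The keyword patterns only match when the digits follow the keyword directly, separated by colons or whitespace, so copy such as “Your verification code is 391847” still falls through to the bare patterns.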

A generic LLM asked to extract a code with a regex on the spot will usually pick the bare 6-digit pattern, which is right about 80% of the time and silently wrong the other 20%. The cascade trades a few milliseconds of regex matching for that 20%, which is the difference between a test suite you trust and one you babysit.
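To make the ordering concrete, here is a small sketch that runs the seven patterns listed above against an email body in which the code follows the keyword directly. The extractCode and naive function names are hypothetical; the patterns are the ones shown in the listing.

```typescript
// The seven patterns listed above, tried in order; the first match wins.
const OTP_PATTERNS: RegExp[] = [
  /(?:code|Code|CODE)[:\s]+(\d{4,8})/,
  /(?:verification|Verification)[:\s]+(\d{4,8})/,
  /(?:OTP|otp)[:\s]+(\d{4,8})/,
  /(?:pin|PIN|Pin)[:\s]+(\d{4,8})/,
  /\b(\d{6})\b/,
  /\b(\d{4})\b/,
  /\b(\d{8})\b/,
];

function extractCode(body: string): string | null {
  for (const pattern of OTP_PATTERNS) {
    const match = body.match(pattern);
    if (match && match[1]) return match[1];
  }
  return null;
}

// What an improvised single regex tends to look like.
const naive = (body: string): string | null => body.match(/\b(\d{6})\b/)?.[1] ?? null;

const sample = "Order #845710 has shipped. Your verification code: 391847.";
console.log(naive(sample));       // 845710 (the order number)
console.log(extractCode(sample)); // 391847 (the actual code)

// The keyword patterns need the digits right after the keyword. With "code is",
// every keyword pattern misses and the bare 6-digit pattern picks the order number.
console.log(extractCode("Order #845710 has shipped. Your verification code is 391847.")); // 845710
```

The last line shows the limit of the cascade: it reduces false positives for the common "code: 123456" phrasing, but it is not a guarantee for every sender's copy.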

What flows between the agent and the browser

The four tools become a single sequence at runtime. Two of them (create and poll) talk to a third-party HTTP service; two of them (drive and dispatch) go through Playwright. The diagram makes the dependency clear: nothing happens after step 1 until step 1 succeeds, and nothing happens after step 5 until step 5 returns a code.

The four-tool sequence in time

The thing the diagram does not show, but matters, is that none of these steps share state through the test file. They share state through the agent's tool-call history. That is why the same agent can run twenty OTP scenarios in a row against twenty different inboxes without the file growing or breaking. The static .spec.ts artifact is what lands in your repo at the end, after the dust settles.

What gets committed: the file vs. the orchestration

Once the agent has run the scenario successfully, it serializes the run as a standard Playwright file you can read, edit, and check in. The orchestration contract still applies at runtime, but the file itself is the same shape a senior engineer would produce by hand if they had two free hours and the seven-pattern cascade in their head:

signup-otp.spec.ts (the generic-generator version, shown for contrast)

// What ChatGPT, Cursor, and Codex generate when you say
// "write me a Playwright test for OTP signup"
import { test, expect } from "@playwright/test";

test("signup with OTP", async ({ page }) => {
  await page.goto("/signup");
  await page.getByLabel(/email/i).fill("test@example.com");
  await page.getByRole("button", { name: /sign up/i }).click();

  // TODO: replace with actual OTP code from email
  const code = "123456";

  // Looks reasonable. Fails on every modern OTP component
  // because .fill sets each input's value in turn; it never fires a paste.
  for (let i = 0; i < 6; i++) {
    await page.locator('input[maxlength="1"]').nth(i).fill(code[i]);
  }

  await page.getByRole("button", { name: /verify/i }).click();
  await expect(page.getByText(/welcome/i)).toBeVisible();
});

The agent's version of this file is real Playwright. No proprietary YAML, no cloud format, no opaque scenario IDs. The two helper functions (newTempInbox() and waitForOtp()) live in your repo, your CI runs them on every PR, and if you decide to replace temp-mail.io with Mailosaur or Mailslurp tomorrow, you swap the inbox helper and the regex cascade keeps working unchanged.
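As a rough sketch of the flow such a file contains, based on the description in the FAQ (helpers named newTempInbox() and waitForOtp(), then a page.evaluate paste): in a real spec this body would sit inside test() from @playwright/test and receive Playwright's page. Here PageLike is a hypothetical stand-in for the few Page methods used, so the sketch runs without Playwright installed. The selectors and the OtpHelpers type are likewise assumptions.

```typescript
// Hypothetical stand-in for the handful of Playwright Page methods used below.
interface PageLike {
  goto(url: string): Promise<unknown>;
  fill(selector: string, value: string): Promise<void>;
  click(selector: string): Promise<void>;
  evaluate<T>(fn: (arg: string) => T, arg: string): Promise<T>;
}

interface Inbox {
  address: string;
}

// Helper names follow the FAQ; their signatures are assumed.
interface OtpHelpers {
  newTempInbox(): Promise<Inbox>;
  waitForOtp(inbox: Inbox): Promise<string>;
}

// Runs in the browser: one paste event on the parent of the split-digit inputs.
function pasteOtp(code: string): string {
  const inp = document.querySelector('input[maxlength="1"]');
  if (!inp || !inp.parentElement) return "no otp input found";
  const dt = new DataTransfer();
  dt.setData("text/plain", code);
  inp.parentElement.dispatchEvent(
    new ClipboardEvent("paste", { clipboardData: dt, bubbles: true, cancelable: true }),
  );
  return "pasted";
}

async function signupWithOtp(page: PageLike, helpers: OtpHelpers): Promise<string> {
  const inbox = await helpers.newTempInbox();            // 1. mint before touching the form
  await page.goto("/signup");
  await page.fill('input[type="email"]', inbox.address); // 2. the same address goes into the form
  await page.click('button[type="submit"]');
  const code = await helpers.waitForOtp(inbox);          // 3. poll the same inbox
  return page.evaluate(pasteOtp, code);                  // 4. one paste, not six fills
}
```

The inbox object is created once and passed to both the form fill and the polling helper, so the two cannot drift apart.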

The honest counterargument: where Mailosaur and Mailslurp earn their keep

A page that pretends Mailosaur and Mailslurp do not exist is dishonest. They are excellent receive-side services, and any team shipping OTP tests in production CI should have one of them or an equivalent. The disposable temp-mail.io endpoint is a great default because it is zero-config and free, but it is unauthenticated, rate-limited per IP, and can be blocked by sender deliverability rules. For real CI you want predictable, paid-for inboxes.

The architectural point of this page is not “use temp-mail.io forever.” It is “the agent needs an inbox-receive primitive at all, and it needs to call it before form fill.” Whether the implementation behind that primitive is temp-mail.io, Mailosaur, Mailslurp, or your own SES-backed catch-all is a downstream choice. Assrt's DisposableEmail class is intentionally a thin wrapper so it can be swapped without changing the agent.
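The swap can be pictured as coding against a narrow interface. The sketch below uses the three method names the FAQ attributes to DisposableEmail (create, getMessages, waitForVerificationCode); every parameter type, return type, and the in-memory provider are assumptions for illustration, not Assrt's actual class.

```typescript
interface InboxMessage {
  subject: string;
  body: string;
}

// The three method names come from the text; their signatures are assumed.
interface InboxProvider {
  create(): Promise<string>; // resolves to the new address
  getMessages(): Promise<InboxMessage[]>;
  waitForVerificationCode(timeoutSeconds: number): Promise<string | null>;
}

// Hypothetical in-memory provider, standing in for a temp-mail.io,
// Mailosaur, or Mailslurp client behind the same interface.
class InMemoryInbox implements InboxProvider {
  private messages: InboxMessage[] = [];
  private address = "";

  async create(): Promise<string> {
    this.address = `user-${Date.now()}@inbox.test`;
    return this.address;
  }

  // Test hook: simulates an email arriving.
  deliver(message: InboxMessage): void {
    this.messages.push(message);
  }

  async getMessages(): Promise<InboxMessage[]> {
    return [...this.messages];
  }

  // Simplified: checks what has already arrived instead of polling.
  async waitForVerificationCode(_timeoutSeconds: number): Promise<string | null> {
    for (const message of this.messages) {
      const match = message.body.match(/(?:code|Code|CODE)[:\s]+(\d{4,8})/);
      if (match && match[1]) return match[1];
    }
    return null;
  }
}

// Agent-side code depends only on the interface, so the provider can change underneath it.
async function mintAndRead(
  provider: InboxProvider,
): Promise<{ address: string; code: string | null }> {
  const address = await provider.create();
  const code = await provider.waitForVerificationCode(60);
  return { address, code };
}
```

Replacing InMemoryInbox with a paid provider's client would leave mintAndRead untouched, which is the sense in which the choice of inbox is downstream.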

What none of those services solve on their own is the orchestration itself. They give you a stable inbox; they do not give you the system prompt that says “FIRST call create_temp_email” or the canonical evaluate expression for split-digit dispatch. That layer is the one this page is about.

Try it on a real signup flow

The fastest way to see whether your existing AI test generator has the four-tool contract is to point it at a signup form that uses a split-digit OTP component. Most modern stacks (Clerk, Auth0 passwordless, Supabase) ship one by default. If the generated test stops at “TODO: replace with actual code”, you have your answer.

# Run against your app and read the four tool calls in the live stream
npx @m13v/assrt discover https://your-app.com

The repo is at github.com/assrt-ai/assrt-mcp. The four tool definitions are 18 lines total in src/core/agent.ts; the polling and regex cascade are in src/core/email.ts; the system prompt that pins the order lives a few hundred lines below the tool definitions in the same file. Worth reading before deciding whether your current setup is a few prompt edits away or a rewrite.

Walk through the four-tool contract on your test suite

If your team has an OTP signup that breaks every AI test generator you have tried, we will sit down with one of your flows and trace what the agent should be doing at each step. Thirty minutes, no slides.

Frequently asked questions

Why can ChatGPT or Cursor generate a Playwright login test but not one that actually completes OTP?

Because completing an OTP test requires runtime cooperation, not just code generation. A one-shot LLM that emits a .spec.ts file has no way to wait for an email, read it, extract the code, and dispatch it into split-digit inputs from inside a single static file. The generated test either hardcodes a placeholder code, uses page.fill() against each input (which fails on Shadcn InputOTP, Clerk, MUI PinInput, and Chakra), or stops at the screen where the code is required.

What four tools does an AI test generator actually need to handle OTP end to end?

1) Create a disposable inbox before the signup form is filled, so the same address goes into both the form and the polling loop. 2) Drive the form (navigate, type, click) through Playwright. 3) Poll the inbox with a regex cascade tolerant of 4 to 8 digit codes and varying email copy. 4) Dispatch the code with one ClipboardEvent on the parent of the split-digit inputs. Assrt exposes those four to the agent as create_temp_email, the standard browser tools, wait_for_verification_code, and evaluate. The names and definitions are at src/core/agent.ts lines 114 to 131 in the open-source @m13v/assrt-mcp package.

Why does the inbox have to be created BEFORE the form is filled, not after?

Because the signup form is what tells the auth provider where to send the code. If the agent fills the form with a real-but-stale address, then mints a disposable inbox after the fact, the code goes to the wrong place and the polling loop times out. The Assrt agent system prompt enforces this with the line 'FIRST call create_temp_email to get a disposable email' (agent.ts:230) and the runtime tool handler refuses to call wait_for_verification_code if no inbox exists yet, returning 'Error: Call create_temp_email first.'

What does the regex cascade for code extraction actually look like?

Seven patterns tried in order from most specific to most permissive: code/Code/CODE followed by digits, verification followed by digits, OTP followed by digits, PIN followed by digits, then bare 6-digit, 4-digit, and 8-digit matches. Most specific first, so 'Your code: 123456' beats 'Order #845710 is on the way'. Source: src/core/email.ts lines 101 to 109.

Why does the dispatch step use a single ClipboardEvent instead of typing into each input?

Modern split-digit OTP components (Shadcn InputOTP, Clerk, MUI PinInput, Chakra) listen for a paste event on the parent container, not for keystrokes on each child input. Calling page.fill() on input nth(0..5) sets each value directly and fires an input event, which is neither a keystroke nor a paste, so the component does not redistribute the digits and the form stays incomplete. One DataTransfer-backed ClipboardEvent dispatched on the parent is the call the components were built around.

How is this different from Mailosaur or Mailslurp?

Mailosaur and Mailslurp are excellent receive-side services for production-grade test inboxes. They solve the polling problem but they do not solve the test-generation problem: an AI assistant that decides to use Mailosaur still has to know to mint the inbox before form fill, still has to extract the code with a regex pattern that matches your sender's copy, and still has to dispatch the code into the right component shape. Assrt is the orchestration layer; Mailosaur or Mailslurp can plug into the inbox slot if you need a stable, paid-for address with dependable deliverability.

What about TOTP authenticator apps (Google Authenticator, 1Password)? Same problem?

No, that one is solved by the otplib package. TOTP is deterministic: given a shared secret, otplib.authenticator.generate(secret) returns the current code with no inbox involved. Email and SMS OTP are the cases where you need the four-tool orchestration. The Playwright community guide that ranks for TOTP testing is correct on TOTP and silent on email OTP for that reason.

Does the AI test generator hardcode the disposable-email service?

Today, yes. The DisposableEmail class in src/core/email.ts wraps temp-mail.io's internal v3 API, which is unauthenticated and rate-limited per IP. The interface is intentionally narrow (create, getMessages, waitForVerificationCode) so swapping in a paid Mailosaur or Mailslurp client is a one-class change with no agent-side impact. The agent only sees the four MCP tool names; the implementation behind them is replaceable.

What does the generated Playwright test that goes into my repo actually look like?

A standard .spec.ts file with imports from @playwright/test, helper functions for newTempInbox() and waitForOtp(), and a test() block that does goto, fill, click, waitForOtp, page.evaluate to dispatch the ClipboardEvent, and assertions. No proprietary YAML, no cloud format, no opaque scenario IDs. The same code a senior engineer would write by hand if they had two free hours and the same regex cascade in their head.
