How to Test AI Chat Streaming UI with Playwright: Complete 2026 Guide
A scenario-by-scenario walkthrough of testing AI chat streaming interfaces with Playwright. Server-Sent Events, token-by-token rendering, AbortController cancellation, typing indicators, markdown rendering mid-stream, auto-scroll behavior, retry on error, and message history persistence.
“ChatGPT reached 100 million weekly active users in early 2024, making AI chat interfaces one of the fastest-adopted UI patterns in software history. By 2026, nearly every SaaS product ships some form of streaming AI chat.”
OpenAI
AI Chat Streaming End-to-End Flow
1. Why Testing AI Chat Streaming Is Harder Than It Looks
An AI chat streaming interface looks simple on the surface: the user types a prompt, the assistant responds token by token, and the conversation grows. Under the hood, the complexity is substantial. The frontend opens a persistent connection (Server-Sent Events or a fetch with ReadableStream), receives dozens or hundreds of small JSON chunks over several seconds, and must parse, decode, and render each chunk into the DOM while simultaneously handling markdown formatting, code block syntax highlighting, scroll position management, and user interaction events like cancellation.
Traditional E2E tests assume a request/response cycle: click a button, wait for a network response, assert on the final state. Streaming breaks that model entirely. The response arrives incrementally over seconds, the DOM mutates continuously, and the “final” state only exists once the stream terminates. If your test waits for the complete response before asserting, you miss the entire streaming experience. If your test asserts too early, the content is incomplete and the assertion fails.
There are six structural reasons streaming chat UIs are hard to test. First, the transport layer varies: some apps use Server-Sent Events (SSE), others use fetch with ReadableStream, and some use WebSocket connections. Your mock strategy must match the transport. Second, token arrival timing is non-deterministic in production and must be simulated with controlled delays in tests. Third, the UI must handle partial markdown: a code fence might open in one chunk and close three seconds later, and the renderer must not break during that interval. Fourth, auto-scroll behavior depends on the user’s scroll position, creating a stateful interaction between user input and stream output. Fifth, AbortController cancellation must cleanly terminate both the network connection and the rendering pipeline. Sixth, error recovery (retry buttons, rate limit banners) must work correctly when the stream fails mid-response.
Stream Lifecycle in a Chat UI: User Prompt (text input + send) → API Request (POST with stream flag) → SSE Connection (EventSource opens) → Token Chunks (data: {content} events) → DOM Render (append + markdown parse) → Stream End ([DONE] signal received)
2. Setting Up Your Streaming Test Environment
The key to reliable streaming tests is full control over the mock server response. You never want your Playwright tests hitting a real LLM API: the latency is unpredictable, the cost adds up fast, and rate limits will break your CI. Instead, you intercept the streaming endpoint with page.route() and return a controlled SSE response with deterministic token timing.
Your playwright.config.ts needs a web server configuration that starts your chat application in development mode. The critical setting is webServer.reuseExistingServer set to true for local development, and use.actionTimeout set generously (15 seconds minimum) because streaming responses are intentionally slow.
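A sketch of those settings follows; the dev command, port, and exact timeout values are placeholders to adapt to your application.

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Whole-test budget: each streaming scenario runs for several seconds.
  timeout: 60_000,
  use: {
    baseURL: 'http://localhost:3000',
    // Generous per-action budget; streaming responses are slow on purpose.
    actionTimeout: 15_000,
  },
  webServer: {
    command: 'npm run dev',
    url: 'http://localhost:3000',
    // Reuse a dev server already running locally; always start fresh in CI.
    reuseExistingServer: !process.env.CI,
  },
});
```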
The helper function below is the foundation of every streaming test in this guide. It creates a mock SSE response that sends tokens at a configurable interval, giving you complete control over timing. You will reuse this helper across all scenarios.
Abort/Cancel Flow: Stream Active (tokens arriving) → User Clicks Stop (cancel button) → AbortController (signal.abort()) → Connection Closed (ReadableStream cancelled) → UI Settled (partial message preserved)
3. Mocking an SSE Stream with page.route()
The foundation of every AI chat streaming test is intercepting the chat API endpoint and returning a controlled Server-Sent Events response. Playwright’s page.route() method intercepts any matching network request and lets you fulfill it with a custom response. For streaming, you return a ReadableStream body that emits SSE-formatted chunks over time.
The OpenAI-compatible SSE format sends each chunk as a JSON object prefixed with data: and terminated with two newlines. The final signal is data: [DONE], which tells the client the stream is complete. Your mock must replicate this format exactly, because most chat UI libraries (Vercel AI SDK, LangChain, custom implementations) parse it strictly and will silently drop malformed chunks.
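Here is one possible shape for that helper — a sketch, not a canonical implementation. Playwright's route.fulfill() buffers the whole response body, so to deliver chunks over time this version starts a throwaway local HTTP server that writes one SSE chunk per interval, then redirects the intercepted request to it via route.continue({ url }) (the URL override must keep the original protocol, so this assumes a plain-http dev server). The RouteLike type is a structural stand-in for Playwright's Route, keeping the snippet dependency-free; pass a real Route in your tests.

```typescript
import { createServer } from 'node:http';

// Structural stand-in for Playwright's Route (pass a real Route in tests).
type RouteLike = {
  continue(overrides: { url: string }): Promise<void>;
};

// Build OpenAI-style SSE chunks from a token list, ending with [DONE].
export function sseChunks(tokens: string[]): string[] {
  const chunks = tokens.map(
    (content) =>
      `data: ${JSON.stringify({ choices: [{ delta: { content } }] })}\n\n`
  );
  chunks.push('data: [DONE]\n\n');
  return chunks;
}

export async function mockSSEStream(
  route: RouteLike,
  { tokens, delayMs = 50 }: { tokens: string[]; delayMs?: number }
): Promise<void> {
  const chunks = sseChunks(tokens);
  // Throwaway server that streams one chunk per delayMs interval.
  const server = createServer(async (_req, res) => {
    res.writeHead(200, {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
    });
    for (const chunk of chunks) {
      res.write(chunk);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
    res.end();
    server.close();
  });
  await new Promise<void>((resolve) => server.listen(0, resolve));
  const address = server.address();
  const port =
    typeof address === 'object' && address !== null ? address.port : 0;
  // Redirect the intercepted request to the local streaming server.
  await route.continue({ url: `http://127.0.0.1:${port}/` });
}
```

In tests you then call `await page.route('**/api/chat', (route) => mockSSEStream(route, { tokens, delayMs: 100 }))`, which is the pattern the scenarios in this guide rely on.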
Notice that mockSSEStream takes an array of tokens, not a single string. This lets you control exactly where the “word boundaries” fall in the stream. In production, LLM APIs send tokens that rarely align with word boundaries (you might receive "Hel", then "lo, how"), so testing with irregular splits helps catch rendering bugs that only appear with real API responses. For the initial test, clean word-aligned tokens are fine; later scenarios use realistic splits.
4. Verifying Token-by-Token Rendering
The previous scenario waited for the final message. This scenario verifies that tokens actually render incrementally, which is the whole point of a streaming UI. A broken streaming implementation might buffer the entire response and render it at once, which technically passes a “final text matches” assertion but delivers a poor user experience. To catch this, you need to observe the DOM during the stream, not just after it.
The approach uses slower token timing (200ms per token) and polls the message element’s text content at multiple checkpoints during the stream. If the UI renders tokens incrementally, you will see partial text at each checkpoint. If it buffers, all checkpoints will show either empty or the full response.
The key assertion is the mid-stream check: after 450ms with 200ms per token, approximately two tokens should have rendered. The test confirms the early text includes the first two tokens but not the final token. This proves the UI streams tokens incrementally rather than buffering. In production, you would also verify that the cursor or caret animation is active during streaming, which we cover in the typing indicator scenario.
Incremental Rendering Test
test('tokens render incrementally', async ({ page }) => {
  const tokens = ['The ', 'answer ', 'to ', 'your ', 'question ', 'is ', '42.'];
  await page.route('**/api/chat', (route) =>
    mockSSEStream(route, { tokens, delayMs: 200 })
  );
  await page.goto('/');
  await page.getByPlaceholder('Type a message').fill('What is the answer?');
  await page.getByRole('button', { name: 'Send' }).click();

  const msg = page.locator('[data-testid="assistant-message"]').last();
  await page.waitForTimeout(450);
  const earlyText = await msg.textContent();
  expect(earlyText).toContain('The answer');
  expect(earlyText).not.toContain('42.');

  await expect(msg).toContainText('The answer to your question is 42.');
});
5. Cancel Mid-Stream with AbortController
Every production AI chat interface needs a “Stop generating” button. When the user clicks it, the frontend calls AbortController.abort(), which cancels the fetch request, closes the SSE connection, and stops rendering new tokens. Testing this correctly requires verifying three things: the network connection actually closes, the partial message is preserved in the DOM, and the UI returns to an idle state where the user can send a new prompt.
The challenge is timing. You need the stream to be actively sending tokens when the cancel button is clicked. If the stream finishes before your test clicks Stop, you are testing a no-op. The solution is to use a longer token list with generous delays, ensuring the stream is still active when the click happens.
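A sketch of the full cancel flow, assuming a Stop control whose accessible name matches /stop/i (adjust to your UI) and reusing the mockSSEStream helper:

```typescript
test('stop preserves partial message and returns to idle', async ({ page }) => {
  // 40 slow tokens keep the stream active while the test clicks Stop.
  const tokens = Array.from({ length: 40 }, (_, i) => `token${i} `);
  await page.route('**/api/chat', (route) =>
    mockSSEStream(route, { tokens, delayMs: 150 })
  );
  await page.goto('/');
  await page.getByPlaceholder('Type a message').fill('Tell me a long story');
  await page.getByRole('button', { name: 'Send' }).click();

  const msg = page.locator('[data-testid="assistant-message"]').last();
  // Wait until a few tokens have rendered, then cancel mid-stream.
  await expect(msg).toContainText('token2');
  await page.getByRole('button', { name: /stop/i }).click();

  // Let the abort settle before capturing the frozen text.
  await page.waitForTimeout(300);
  const partial = (await msg.textContent()) ?? '';
  expect(partial).toContain('token0');       // partial text preserved
  expect(partial).not.toContain('token39 '); // stream really was cut short

  // No further tokens render after the abort.
  await page.waitForTimeout(500);
  expect(await msg.textContent()).toBe(partial);
  // UI is idle again: the user can send a new prompt.
  await expect(page.getByRole('button', { name: 'Send' })).toBeEnabled();
});
```

If your Send button morphs into the Stop button during streaming, assert on whichever control your markup actually exposes.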
A subtle bug to watch for: some implementations clear the partial message on abort, showing a blank assistant bubble. Others freeze the typing indicator animation but never remove it, leaving a permanently “thinking” UI. Both are real bugs found in production chat applications. The test above catches the first by asserting the partial text is preserved, and the typing indicator scenario (section 6) catches the second.
6. Typing Indicator and Loading States
The typing indicator (often a pulsing dots animation or a blinking cursor) is the user’s primary feedback that the AI is processing their request. It should appear immediately after the user sends a prompt (before any tokens arrive), remain visible during streaming, and disappear once the stream completes. A missing or stuck typing indicator is one of the most commonly reported UX bugs in chat applications.
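A sketch of that lifecycle test, assuming the indicator carries data-testid="typing-indicator" (a placeholder — use your own hook):

```typescript
test('typing indicator appears, persists, then disappears', async ({ page }) => {
  await page.route('**/api/chat', (route) =>
    mockSSEStream(route, {
      tokens: ['Sure, ', 'here ', 'you ', 'go.'],
      delayMs: 300,
    })
  );
  await page.goto('/');
  await page.getByPlaceholder('Type a message').fill('Hello');

  const indicator = page.locator('[data-testid="typing-indicator"]');
  await page.getByRole('button', { name: 'Send' }).click();
  // Visible immediately after send, before and during token arrival.
  await expect(indicator).toBeVisible();
  // Gone once the [DONE] signal transitions the UI back to idle.
  await expect(indicator).toBeHidden({ timeout: 15_000 });
  await expect(
    page.locator('[data-testid="assistant-message"]').last()
  ).toContainText('Sure, here you go.');
});
```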
The 300ms delay per token is intentional. With faster timing, the stream might complete before Playwright can assert on the typing indicator, making the test flaky. Slower tokens give the test a reliable observation window. In your CI configuration, you can reduce the delay for speed, but during development, keep it generous enough for visual debugging with --headed mode.
7. Markdown Rendering and Auto-Scroll During Streaming
AI assistants frequently respond with markdown: headings, bullet lists, code blocks with syntax highlighting, bold text, and inline code. The renderer must handle partial markdown gracefully. A code fence that opens with ```typescript in one chunk should not crash the parser before the closing fence arrives seconds later. Similarly, the chat container should auto-scroll to keep the latest content visible, but it must stop auto-scrolling if the user manually scrolls up to read earlier messages.
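Two sketches of those checks. The pre code selector assumes your renderer emits standard fenced-code markup, and data-testid="chat-scroll" is a placeholder for your scrollable container:

```typescript
test('partial code fence renders without breaking', async ({ page }) => {
  // The fence opens in one chunk and closes three chunks later.
  const tokens = ['Here is an example:\n\n', '```ts\n', 'const a = 1;\n',
    'const b = 2;\n', '```\n', ' Done.'];
  await page.route('**/api/chat', (route) =>
    mockSSEStream(route, { tokens, delayMs: 200 })
  );
  await page.goto('/');
  await page.getByPlaceholder('Type a message').fill('Show me code');
  await page.getByRole('button', { name: 'Send' }).click();

  const msg = page.locator('[data-testid="assistant-message"]').last();
  // After the stream ends, the fenced block is a real <pre><code>.
  await expect(msg.locator('pre code')).toContainText('const b = 2;');
  // Text after the closing fence was not swallowed by the parser.
  await expect(msg).toContainText('Done.');
});

test('auto-scroll follows, then pauses when the user scrolls up', async ({ page }) => {
  const tokens = Array.from({ length: 80 }, (_, i) => `Line ${i}.\n\n`);
  await page.route('**/api/chat', (route) =>
    mockSSEStream(route, { tokens, delayMs: 100 })
  );
  await page.goto('/');
  await page.getByPlaceholder('Type a message').fill('Write a long answer');
  await page.getByRole('button', { name: 'Send' }).click();

  const container = page.locator('[data-testid="chat-scroll"]');
  await page.waitForTimeout(1500);
  // While following, the container stays pinned near the bottom.
  const pinned = await container.evaluate(
    (el) => el.scrollHeight - el.scrollTop - el.clientHeight < 50
  );
  expect(pinned).toBe(true);

  // The user scrolls up to read earlier content...
  await container.evaluate((el) => el.scrollTo({ top: 0 }));
  await page.waitForTimeout(1000);
  // ...and incoming tokens must not yank the viewport back down.
  expect(await container.evaluate((el) => el.scrollTop)).toBeLessThan(100);
});
```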
The auto-scroll tests verify a subtle but important UX contract. When the user is passively watching the stream, the container scrolls down automatically. When the user scrolls up to review previous content, the auto-scroll must pause so it does not yank the viewport away from what the user is reading. Many chat implementations get this wrong by either never auto-scrolling (bad for long responses) or always auto-scrolling (bad for review).
8. Retry on Error and Rate Limiting UI Feedback
Streaming connections fail. The LLM provider returns a 429 (rate limited), the connection drops mid-stream, or the server returns a 500 error after sending partial tokens. A robust chat UI handles all three cases gracefully: showing an error message, offering a retry button, and preserving any partial content that was already rendered. Testing these error paths requires mocking failures at different points in the stream lifecycle.
Error Recovery Test
test('retry after error resends prompt', async ({ page }) => {
  let callCount = 0;
  await page.route('**/api/chat', async (route) => {
    callCount++;
    if (callCount === 1) {
      await route.fulfill({
        status: 500,
        body: JSON.stringify({ error: { message: 'Internal error' } }),
      });
    } else {
      await mockSSEStream(route, {
        tokens: ['Here ', 'is ', 'your ', 'answer.'],
        delayMs: 30,
      });
    }
  });
  await page.goto('/');
  await page.getByPlaceholder('Type a message').fill('Help me');
  await page.getByRole('button', { name: 'Send' }).click();

  await expect(page.getByRole('button', { name: /retry/i })).toBeVisible();
  await page.getByRole('button', { name: /retry/i }).click();

  const msg = page.locator('[data-testid="assistant-message"]').last();
  await expect(msg).toContainText('Here is your answer.');
});

9. Common Pitfalls That Break Streaming Chat Tests
After building streaming chat test suites for dozens of applications, these are the recurring failures that waste the most debugging time. Every pitfall below comes from real issues observed in production codebases and CI pipelines.
Pitfalls to Avoid
- Using waitForResponse() on streaming endpoints. It resolves on the response headers, not the body, so your assertion runs before any tokens arrive.
- Asserting on textContent() immediately after click. The stream has not started yet. Use expect().toContainText() which auto-retries with Playwright's built-in polling.
- Setting timeouts too low. Streaming responses can take 5 to 15 seconds; Playwright's default 5-second expect timeout will cause intermittent failures. Raise actionTimeout and assertion timeouts accordingly.
- Forgetting to mock the [DONE] signal. Without it, the UI never transitions from streaming to idle state, and your typing indicator stays visible forever.
- Testing only with word-aligned tokens. Real LLM APIs split tokens at arbitrary byte boundaries. Use irregular chunks like ['Hel', 'lo, h', 'ow are'] to catch parser bugs.
- Not testing the empty response case. If the LLM returns zero tokens before [DONE], the UI should show a graceful fallback, not a blank bubble.
- Ignoring race conditions on rapid send. If the user sends two prompts in quick succession, the second request should either queue or cancel the first stream cleanly.
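The irregular-split pitfall from the list above is cheap to cover with one extra test:

```typescript
test('irregular token splits still render correct text', async ({ page }) => {
  // Splits that cross word boundaries, like real LLM output.
  const tokens = ['Hel', 'lo, h', 'ow ', 'are', ' you', '?'];
  await page.route('**/api/chat', (route) =>
    mockSSEStream(route, { tokens, delayMs: 50 })
  );
  await page.goto('/');
  await page.getByPlaceholder('Type a message').fill('Hi');
  await page.getByRole('button', { name: 'Send' }).click();
  await expect(
    page.locator('[data-testid="assistant-message"]').last()
  ).toContainText('Hello, how are you?');
});
```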
Message History Persistence
One frequently overlooked test is message persistence across page reloads. Most chat applications store conversation history in localStorage, IndexedDB, or a server-side database. If the storage layer has a bug, a page refresh after a long streaming conversation can lose all messages. Test this by sending a streamed message, reloading the page, and asserting that the conversation history is intact.
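A minimal version of that check might look like this, assuming your app rehydrates history automatically on load:

```typescript
test('conversation survives a page reload', async ({ page }) => {
  await page.route('**/api/chat', (route) =>
    mockSSEStream(route, { tokens: ['Persisted ', 'reply.'], delayMs: 30 })
  );
  await page.goto('/');
  await page.getByPlaceholder('Type a message').fill('Remember this');
  await page.getByRole('button', { name: 'Send' }).click();

  const msg = page.locator('[data-testid="assistant-message"]').last();
  await expect(msg).toContainText('Persisted reply.');

  await page.reload();
  // Both sides of the exchange should be restored from storage.
  await expect(page.getByText('Remember this')).toBeVisible();
  await expect(
    page.locator('[data-testid="assistant-message"]').last()
  ).toContainText('Persisted reply.');
});
```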
Pre-Flight Checklist for Streaming Tests
- Mock SSE helper returns proper Content-Type: text/event-stream header
- Token arrays include irregular splits to test real-world chunking
- actionTimeout in config is at least 15 seconds
- Every mock stream sends the [DONE] termination signal
- Error scenarios cover 429, 500, and mid-stream connection drop
- Cancel/abort tests verify partial content preservation
- Auto-scroll tests verify both follow and pause behaviors
10. Writing These Scenarios in Plain English with Assrt
Every Playwright test in this guide follows the same pattern: set up a mock SSE stream, navigate to the chat UI, type a prompt, click send, and assert on the rendered output. The pattern is clear, but the boilerplate adds up. Each test requires the route interception, the SSE helper, the token array, the delay configuration, and the locator queries. Across 12 tests, that is hundreds of lines of TypeScript that must be maintained as your chat UI evolves.
Assrt lets you express the same scenarios in plain English. It compiles each scenario into the exact Playwright TypeScript shown above, committed to your repository as real tests you can inspect, modify, and run. When your chat UI changes its markup (renaming data-testid="assistant-message" to data-testid="ai-response", for example), Assrt detects the failure, analyzes the new DOM, and opens a pull request with updated locators. Your scenario files remain untouched.
Each scenario block compiles to the same Playwright test patterns shown throughout this guide. The mock configuration section tells Assrt how to intercept the streaming endpoint, so you do not need to write the SSE helper manually. The expect blocks map to Playwright’s expect() assertions with automatic retry logic.
Start with the basic streaming scenario. Once it passes in your CI, add the incremental rendering test, then the cancel flow, then error recovery, then markdown rendering, and finally the persistence check. In a single afternoon you can have complete AI chat streaming coverage that most production applications never achieve by hand.
Ready to automate your testing?
Assrt discovers test scenarios, writes Playwright tests from plain English, and self-heals when your UI changes.