Specialized Testing Guide
How to Test Video Captions with Playwright: TextTrack API, VTT Parsing, and Multi-Language Switching
A scenario-by-scenario walkthrough of testing video captions and subtitles with Playwright. Covers the TextTrack API, WebVTT parsing, cue timing assertions, track mode toggling between showing, hidden, and disabled, multi-language switching, and the silent failures that break caption testing in production.
“The FCC reports that 98% of surveyed viewers use captions at least some of the time, and caption-related accessibility lawsuits have grown steadily since 2019.”
FCC / 3Play Media 2024 Study
1. Why Testing Video Captions Is Harder Than It Looks
Video captions seem simple on the surface. You add a <track> element to your <video>, point it at a WebVTT file, and the browser renders text overlays at the right moment. In practice, the system has five layers of complexity that make automated testing surprisingly difficult.
First, track loading is asynchronous. When the browser encounters a <track> element, it does not load the VTT file immediately. It waits until the track mode is "showing" or "hidden" (a track with the default attribute is enabled automatically), then fetches and parses the file in the background. Your test cannot simply check for cues right after page load because the cue list may still be empty.
Second, cue timing is continuous. Unlike DOM interactions where an element is either visible or not, captions depend on the video's currentTime property. A cue that starts at 5.200 seconds and ends at 8.100 seconds is only active during that window. Your test must seek the video to a specific timestamp and then query the activeCues property, accounting for the fact that seeking itself is asynchronous.
Third, track modes create a three-state system. A TextTrack can be "showing" (renders visually and fires events), "hidden" (fires events but renders nothing), or "disabled" (the cue list is not even loaded). Toggling between these modes has side effects that vary across browsers.
Fourth, multi-language support means multiple tracks compete for the active slot, and the browser's built-in language preference logic can override your programmatic selection.
Fifth, custom video players (Video.js, Plyr, JW Player) wrap the native elements in their own DOM and often manage captions through JavaScript rather than native track elements, meaning your selectors and API calls differ from one player to the next.
Caption Loading Pipeline
- HTML Parsed: <video> and <track> elements created
- Track Mode Set: mode changes to showing or hidden
- VTT Fetched: browser requests the .vtt file
- Cues Parsed: WebVTT cues added to the TextTrackCueList
- Video Plays: currentTime advances
- Active Cues: activeCues updates in real time
2. Setting Up a Reliable Caption Test Environment
Before writing any caption tests, you need a deterministic video fixture and a known VTT file. Using production video URLs introduces network latency and potential CDN failures. Instead, serve a short, silent test video from your local fixture directory and pair it with a handcrafted VTT file that has precisely timed cues.
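One way to keep the VTT fixture deterministic is to generate it from a cue list rather than typing timestamps by hand. The sketch below is illustrative only: the CueSpec shape and the helper names formatVttTimestamp and buildVtt are assumptions, not part of any library.

```typescript
// Hypothetical cue description used only to generate fixture files.
interface CueSpec {
  start: number; // seconds
  end: number; // seconds
  text: string;
}

// Left-pad a non-negative integer with zeros.
const padNumber = (n: number, width: number): string => {
  let out = String(n);
  while (out.length < width) out = '0' + out;
  return out;
};

// Format seconds as a WebVTT timestamp, hh:mm:ss.ttt (period, not comma).
function formatVttTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${padNumber(h, 2)}:${padNumber(m, 2)}:${padNumber(s, 2)}.${padNumber(ms, 3)}`;
}

// Serialize cues into WebVTT text: header, then one blank-line-separated block per cue.
function buildVtt(cues: CueSpec[]): string {
  const blocks = cues.map(
    (cue) =>
      `${formatVttTimestamp(cue.start)} --> ${formatVttTimestamp(cue.end)}\n${cue.text}`
  );
  return ['WEBVTT', ...blocks].join('\n\n') + '\n';
}
```

Keeping the same cue list in the test file means the expected timings and the fixture cannot drift apart.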
Test HTML Fixture
Create a minimal HTML page that your Playwright tests can load directly. This isolates caption behavior from your application framework and eliminates variables like lazy loading, route transitions, and JavaScript hydration delays.
Playwright Configuration
Configure Playwright to serve your fixtures directory as a static site. The webServer option handles this cleanly, or you can use page.goto with a file:// URL. The static server approach is more reliable because some browsers restrict TextTrack loading from file:// origins.
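A minimal configuration sketch for the static-server approach follows. The ./fixtures and ./tests directories, port 4173, and the use of the http-server npm package are assumptions; substitute whatever static server your project already uses.

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  testDir: './tests',
  use: {
    // Lets tests call page.goto('/video-captions.html').
    baseURL: 'http://localhost:4173',
  },
  webServer: {
    // Assumption: fixtures live in ./fixtures and http-server is installed.
    command: 'npx http-server ./fixtures -p 4173',
    // Playwright waits for this URL to respond before starting the tests.
    url: 'http://localhost:4173/video-captions.html',
    // Reuse an already running server locally, start a fresh one in CI.
    reuseExistingServer: !process.env.CI,
  },
});
```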
3. Scenario: Verifying Captions Load Successfully
The most fundamental caption test confirms that the browser successfully fetches the VTT file, parses it without errors, and populates the TextTrack's cue list. This is your smoke test. A failed VTT fetch (wrong URL, CORS error, malformed file) results in an empty cue list with no visible error in the console, making it a silent failure that only surfaces when a real user tries to enable captions.
Verify Caption Track Loads and Cues Are Parsed
Difficulty: Straightforward
Goal
Load the video fixture, wait for the default caption track to reach the "loaded" readyState, and confirm the cue list contains the expected number of cues.
Preconditions
- Fixture HTML served at /video-captions.html
- English VTT file contains exactly 5 cues
- The English track has the default attribute
Playwright Implementation
What to Assert Beyond the UI
- The <track> element's readyState is 2 (LOADED), not 3 (ERROR)
- The cues.length matches the VTT file exactly
- No console errors related to VTT parsing or CORS
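The first two checks can be folded into one helper that turns a track snapshot into readable failure messages. This is a sketch under assumptions: TrackSnapshot and describeTrackProblems are hypothetical names, and the snapshot is presumed to be collected in the page (for example with page.evaluate) from the track element's readyState and the cue count.

```typescript
// Numeric values of HTMLTrackElement.readyState.
const TrackReadyState = { NONE: 0, LOADING: 1, LOADED: 2, ERROR: 3 } as const;

// Hypothetical snapshot a test could collect from the page.
interface TrackSnapshot {
  readyState: number;
  cueCount: number;
}

// Return a list of problems; an empty list means the track looks healthy.
function describeTrackProblems(snapshot: TrackSnapshot, expectedCues: number): string[] {
  const problems: string[] = [];
  if (snapshot.readyState === TrackReadyState.ERROR) {
    problems.push('track failed to load (readyState 3): check the URL, CORS, and the WEBVTT header');
  } else if (snapshot.readyState !== TrackReadyState.LOADED) {
    problems.push(`track not loaded yet (readyState ${snapshot.readyState})`);
  } else if (snapshot.cueCount !== expectedCues) {
    // Loaded without error but with the wrong cue count: some cues were likely skipped.
    problems.push(`expected ${expectedCues} cues, found ${snapshot.cueCount}`);
  }
  return problems;
}
```

Asserting that the returned list is empty gives a failure message that names the cause instead of a bare "expected true, received false".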
Caption Load Check: Playwright vs Assrt
import { test, expect } from '@playwright/test';

test('caption track loads', async ({ page }) => {
  await page.goto('/video-captions.html');
  const video = page.locator('#test-video');
  await expect(video).toBeVisible();

  // Resolve true once the English track has parsed cues, false on error or timeout.
  const trackReady = await page.evaluate(() => {
    return new Promise<boolean>((resolve) => {
      const video = document.getElementById('test-video') as HTMLVideoElement;
      const track = video.textTracks[0];
      // A disabled track is never fetched, so enable it first.
      if (track.mode === 'disabled') track.mode = 'showing';
      if (track.cues && track.cues.length > 0) {
        resolve(true);
        return;
      }
      const el = document.getElementById('track-en') as HTMLTrackElement;
      // The error event may have fired before this listener is attached.
      if (el.readyState === HTMLTrackElement.ERROR) {
        resolve(false);
        return;
      }
      el.addEventListener('load', () => resolve(true));
      el.addEventListener('error', () => resolve(false));
      setTimeout(() => resolve(false), 10_000);
    });
  });
  expect(trackReady).toBe(true);

  // The fixture's English VTT file contains exactly 5 cues.
  const cueCount = await page.evaluate(() => {
    const v = document.getElementById('test-video') as HTMLVideoElement;
    return v.textTracks[0].cues?.length ?? 0;
  });
  expect(cueCount).toBe(5);
});

4. Scenario: Asserting Cue Timing and Content
Confirming that captions load is only step one. The real value of caption testing is verifying that the right text appears at the right time. A common production bug is a VTT file where cue timestamps are shifted by a few seconds (often caused by re-encoding or editing tools that recalculate offsets incorrectly). This test seeks the video to known timestamps and asserts which cues are active.
Cue Timing and Active Cue Content Verification
Difficulty: Moderate
Goal
Seek the video to multiple known timestamps and verify that the activeCues property returns the expected text at each position.
Playwright Implementation
What to Assert Beyond the UI
- Each cue's startTime and endTime match the VTT file within a 50ms tolerance
- The activeCues list is empty when seeking to gaps between cues
- Cue text content matches exactly, including punctuation
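The tolerance and gap checks above can be expressed as two small pure functions that run on cue data read back from the page. This is a sketch: TimedCue, expectedActiveText, and diffCues are hypothetical names, and the 0.05 second default mirrors the 50ms tolerance mentioned above.

```typescript
// Minimal cue shape; data copied from VTTCue objects in the page satisfies it.
interface TimedCue {
  startTime: number; // seconds
  endTime: number; // seconds
  text: string;
}

// Text of every cue whose window contains t (start inclusive, end exclusive).
function expectedActiveText(cues: TimedCue[], t: number): string[] {
  return cues.filter((c) => c.startTime <= t && t < c.endTime).map((c) => c.text);
}

// Compare parsed cues with the expected list, allowing a timing tolerance.
function diffCues(actual: TimedCue[], expected: TimedCue[], toleranceSec = 0.05): string[] {
  const problems: string[] = [];
  if (actual.length !== expected.length) {
    problems.push(`cue count ${actual.length} != ${expected.length}`);
  }
  const n = Math.min(actual.length, expected.length);
  for (let i = 0; i < n; i++) {
    const a = actual[i];
    const e = expected[i];
    if (Math.abs(a.startTime - e.startTime) > toleranceSec) {
      problems.push(`cue ${i}: start ${a.startTime} vs ${e.startTime}`);
    }
    if (Math.abs(a.endTime - e.endTime) > toleranceSec) {
      problems.push(`cue ${i}: end ${a.endTime} vs ${e.endTime}`);
    }
    if (a.text !== e.text) {
      problems.push(`cue ${i}: text "${a.text}" vs "${e.text}"`);
    }
  }
  return problems;
}
```

A uniform shift of a few seconds, the re-encoding bug described earlier, shows up as one start and one end problem per cue, which makes the offset easy to spot in the failure output.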
Cue Timing: Playwright vs Assrt
import { test, expect } from '@playwright/test';

test('correct captions at each timestamp', async ({ page }) => {
  await page.goto('/video-captions.html');

  // Enable the English track so the browser fetches and parses its VTT file.
  await page.evaluate(() => {
    const v = document.getElementById('test-video') as HTMLVideoElement;
    v.textTracks[0].mode = 'showing';
  });

  // Wait until cues are parsed; reject instead of hanging if the track fails.
  await page.evaluate(() => {
    return new Promise<void>((resolve, reject) => {
      const v = document.getElementById('test-video') as HTMLVideoElement;
      if (v.textTracks[0].cues?.length) { resolve(); return; }
      const el = document.getElementById('track-en')!;
      el.addEventListener('load', () => resolve());
      el.addEventListener('error', () => reject(new Error('track failed to load')));
    });
  });

  const assertions = [
    { seekTo: 2.0, text: 'Welcome to the product demo.' },
    { seekTo: 6.0, text: 'Click the dashboard icon to begin.' },
  ];

  for (const { seekTo, text } of assertions) {
    const active = await page.evaluate((t) => {
      return new Promise<string>((resolve) => {
        const v = document.getElementById('test-video') as HTMLVideoElement;
        // Register the listener before seeking so the event cannot be missed.
        v.addEventListener('seeked', () => {
          const c = v.textTracks[0].activeCues;
          resolve(c?.length ? (c[0] as VTTCue).text : '');
        }, { once: true });
        v.currentTime = t;
      });
    }, seekTo);
    expect(active).toBe(text);
  }
});

5. Scenario: Toggling Track Modes (Showing, Hidden, Disabled)
The TextTrack API has three modes that control caption behavior. The "showing" mode renders captions visually on the video and fires cuechange events. The "hidden" mode fires events but renders nothing, which is useful for programmatic access to caption data without visual display. The "disabled" mode stops everything: no events, no cue loading, and the cue list may be null. Testing mode transitions is critical because many video players implement a "CC" toggle button that cycles through these modes, and a bug in the toggle logic can leave captions in the wrong state.
Track Mode Transitions and Side Effects
Difficulty: Moderate
Goal
Programmatically toggle the track through all three modes and verify that captions render in showing mode, are invisible in hidden mode, and that the cue list behaves correctly in disabled mode.
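The API-visible half of this goal (what the track exposes in each mode) can be sketched as a helper that walks the three modes and records a snapshot after each change. ModeProbeTrack, ModeSnapshot, and probeAllModes are hypothetical names; the interface is a structural subset of TextTrack so the helper also accepts test doubles. Whether captions are actually painted on screen is a separate visual check that this sketch does not cover.

```typescript
type TrackMode = 'disabled' | 'hidden' | 'showing';

// Structural subset of TextTrack.
interface ModeProbeTrack {
  mode: TrackMode;
  readonly cues: { length: number } | null;
  readonly activeCues: { length: number } | null;
}

interface ModeSnapshot {
  mode: TrackMode;
  cuesAvailable: boolean; // false when the cues getter returns null
  activeCount: number;
}

// Walk the modes in the order a CC toggle might cycle through them and
// record what the track exposes immediately after each change.
function probeAllModes(track: ModeProbeTrack): ModeSnapshot[] {
  const order: TrackMode[] = ['showing', 'hidden', 'disabled'];
  return order.map((mode) => {
    track.mode = mode;
    const active = track.activeCues;
    return {
      mode: track.mode,
      // A disabled track returns null for cues; other modes return a list,
      // which may still be empty while the VTT file is loading.
      cuesAvailable: track.cues !== null,
      activeCount: active ? active.length : 0,
    };
  });
}
```

In a Playwright test the body would need to run inside the page, for example inlined in a page.evaluate callback that looks up the video element, because TextTrack objects cannot be passed across the evaluate boundary.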
6. Scenario: Multi-Language Caption Switching
Multi-language support is where caption testing gets genuinely tricky. Users expect only one caption or subtitle track to be visible at a time, and the browser's built-in caption menu generally enforces that. The TextTrack API does not: setting one track to "showing" from script leaves every other track in whatever mode it already had, so two tracks can be "showing" at once and render overlapping captions. Custom video players manage track switching in JavaScript and must disable the previous track themselves, which is where switching bugs tend to creep in.
Switch Between English and Spanish Captions
Difficulty: Complex
Goal
Start with English captions active, switch to Spanish, verify the Spanish cue text appears at the same timestamp, and confirm the English track is no longer showing.
Preconditions
- Both English and Spanish VTT files loaded
- English track has the default attribute
- Spanish VTT cues are time-aligned with English
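Because mutual exclusion between tracks cannot be taken for granted, a test (or a player) can enforce it explicitly and select by language code rather than by track index. The sketch below assumes hypothetical names (SelectableTrack, selectCaptionLanguage); the interface is a structural subset of TextTrack, so a video element's textTracks list can be passed directly when the code runs in the page.

```typescript
type CaptionMode = 'disabled' | 'hidden' | 'showing';

// Structural subset of TextTrack (kind, language, mode).
interface SelectableTrack {
  kind: string;
  language: string;
  mode: CaptionMode;
}

// Show the first caption/subtitle track for `language` and disable the others.
// Returns false, and changes nothing, when no track matches.
function selectCaptionLanguage(
  tracks: ArrayLike<SelectableTrack>,
  language: string
): boolean {
  const candidates: SelectableTrack[] = [];
  for (let i = 0; i < tracks.length; i++) {
    const t = tracks[i];
    // Leave metadata, chapters, and descriptions tracks untouched.
    if (t.kind === 'captions' || t.kind === 'subtitles') candidates.push(t);
  }
  let target: SelectableTrack | undefined;
  for (const t of candidates) {
    if (!target && t.language === language) target = t;
  }
  if (!target) return false;
  for (const t of candidates) {
    t.mode = t === target ? 'showing' : 'disabled';
  }
  return true;
}
```

Selecting by language keeps the test valid if the order of <track> elements changes, which index-based access such as textTracks[1] does not.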
Language Switching: Playwright vs Assrt
import { test, expect } from '@playwright/test';

test('switch English to Spanish', async ({ page }) => {
  await page.goto('/video-captions.html');

  // Start with English showing and Spanish disabled.
  await page.evaluate(() => {
    const v = document.getElementById('test-video') as HTMLVideoElement;
    v.textTracks[0].mode = 'showing';
    v.textTracks[1].mode = 'disabled';
  });
  await page.waitForFunction(() => {
    const v = document.getElementById('test-video') as HTMLVideoElement;
    return (v.textTracks[0].cues?.length ?? 0) > 0;
  });

  // ... seek, check English, switch, check Spanish ...

  // Switch languages: disable English explicitly, then show Spanish.
  await page.evaluate(() => {
    const v = document.getElementById('test-video') as HTMLVideoElement;
    v.textTracks[0].mode = 'disabled';
    v.textTracks[1].mode = 'showing';
  });
  await page.waitForFunction(() => {
    const v = document.getElementById('test-video') as HTMLVideoElement;
    return (v.textTracks[1].cues?.length ?? 0) > 0;
  });

  // Confirm the English track is no longer showing.
  const englishMode = await page.evaluate(() => {
    const v = document.getElementById('test-video') as HTMLVideoElement;
    return v.textTracks[0].mode;
  });
  expect(englishMode).toBe('disabled');

  const spanishCue = await page.evaluate(() => {
    return new Promise<string>((resolve) => {
      const v = document.getElementById('test-video') as HTMLVideoElement;
      // Register the listener before seeking so the event cannot be missed.
      v.addEventListener('seeked', () => {
        const c = v.textTracks[1].activeCues;
        resolve(c?.length ? (c[0] as VTTCue).text : '');
      }, { once: true });
      v.currentTime = 2.0;
    });
  });
  expect(spanishCue).toBe('Bienvenido...');
});

7. Scenario: VTT File Parsing and Validation
WebVTT files have a strict format. The file must start with the string WEBVTT, followed by an optional header, then cue blocks separated by blank lines. Common production errors include a missing WEBVTT header (caused by saving as SRT without converting), overlapping cue timestamps, a stray UTF-8 byte order mark (the WebVTT format permits a single leading BOM, but some editors and hand-rolled parsers mishandle it), and malformed timestamp formats (using commas instead of periods for milliseconds, which is the SRT convention). This scenario validates the VTT file structure programmatically.
VTT File Structure Validation
Difficulty: Moderate
Goal
Fetch the VTT file directly, parse its contents, and validate the structure: correct header, valid timestamps, no overlaps, and proper encoding.
VTT Validation Pipeline
- Fetch VTT: HTTP GET the raw file
- Check Header: must start with WEBVTT
- Parse Cues: extract timestamps and text
- Validate Order: no overlapping timestamps
- Check Encoding: no BOM, valid UTF-8
- Assert Content: no empty cue text
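The structural checks in this pipeline can be sketched as a small validator that runs on the raw file text. This is a simplified illustration, not a full WebVTT parser: VttIssue and validateVtt are hypothetical names, cue settings and NOTE or STYLE blocks are ignored, and the BOM is flagged only because this guide's checklist asks for none.

```typescript
interface VttIssue {
  line: number; // 1-based line number in the file
  message: string;
}

// hh:mm:ss.ttt or mm:ss.ttt on both sides of the arrow.
const VTT_TIMING =
  /^(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})/;

function vttSeconds(h: string | undefined, m: string, s: string, ms: string): number {
  return Number(h || 0) * 3600 + Number(m) * 60 + Number(s) + Number(ms) / 1000;
}

function validateVtt(raw: string): VttIssue[] {
  const issues: VttIssue[] = [];
  let text = raw;
  if (text.charCodeAt(0) === 0xfeff) {
    issues.push({ line: 1, message: 'file starts with a byte order mark' });
    text = text.slice(1);
  }
  const lines = text.split(/\r\n|\r|\n/);
  if (!/^WEBVTT(?:[ \t].*)?$/.test(lines[0])) {
    issues.push({ line: 1, message: 'missing WEBVTT header' });
  }
  let previousEnd = 0;
  for (let i = 1; i < lines.length; i++) {
    // Only timing lines contain the arrow; everything else is skipped.
    if (lines[i].indexOf('-->') === -1) continue;
    const m = VTT_TIMING.exec(lines[i]);
    if (!m) {
      issues.push({ line: i + 1, message: 'malformed timing line (SRT-style comma?)' });
      continue;
    }
    const start = vttSeconds(m[1], m[2], m[3], m[4]);
    const end = vttSeconds(m[5], m[6], m[7], m[8]);
    if (end <= start) issues.push({ line: i + 1, message: 'cue ends before it starts' });
    if (start < previousEnd) issues.push({ line: i + 1, message: 'cue overlaps the previous cue' });
    previousEnd = Math.max(previousEnd, end);
    const next = lines[i + 1];
    if (next === undefined || next.trim() === '') {
      issues.push({ line: i + 1, message: 'cue has no text' });
    }
  }
  return issues;
}
```

In a Playwright test the raw text could come from page.request.get on the VTT URL followed by the response's text() method, with an assertion that validateVtt returns an empty list.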
8. Scenario: Dynamically Loaded and Live Captions
Not all captions come from static VTT files declared in HTML. Modern applications often load captions dynamically via JavaScript, either from an API response, a WebSocket for live captions, or through the addTextTrack() method on the video element. Live streaming platforms generate captions in real time using speech-to-text services, and those cues are appended to the track as the stream progresses. Testing dynamic captions requires intercepting the caption source and verifying cues appear after the JavaScript loading logic completes.
Dynamically Added Caption Track via JavaScript
Difficulty: Complex
Goal
Simulate a JavaScript-driven caption loading flow where tracks are added programmatically after page load, and verify the dynamically added cues are accessible through the TextTrack API.
Route Interception for Caption Testing
Playwright's route interception lets you replace VTT file responses with custom content. This is invaluable for testing error handling (what happens when the VTT file returns a 404?), testing specific cue edge cases, or simulating slow network conditions that delay caption loading.
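A route handler for these cases might look like the sketch below. RouteLike is a structural stand-in for Playwright's Route (only fulfill with status, contentType, and body is used), and vttResponder, VttScenario, and the sample payloads are hypothetical.

```typescript
// Structural subset of Playwright's Route: only fulfill() is needed here.
interface FulfillOptions {
  status?: number;
  contentType?: string;
  body?: string;
}
interface RouteLike {
  fulfill(options: FulfillOptions): Promise<void>;
}

type VttScenario = 'ok' | 'missing' | 'srt-renamed';

// Hypothetical payloads for illustration.
const SAMPLE_VTT = 'WEBVTT\n\n00:00:01.000 --> 00:00:04.000\nWelcome to the product demo.\n';
const SAMPLE_SRT = '1\n00:00:01,000 --> 00:00:04,000\nWelcome to the product demo.\n';

// Build a handler that answers every intercepted VTT request for one scenario.
function vttResponder(scenario: VttScenario): (route: RouteLike) => Promise<void> {
  return (route) => {
    switch (scenario) {
      case 'missing':
        return route.fulfill({ status: 404, contentType: 'text/plain', body: 'Not found' });
      case 'srt-renamed':
        // SRT content served with a VTT content type.
        return route.fulfill({ status: 200, contentType: 'text/vtt', body: SAMPLE_SRT });
      default:
        return route.fulfill({ status: 200, contentType: 'text/vtt', body: SAMPLE_VTT });
    }
  };
}
```

Registering it with page.route('**/*.vtt', vttResponder('missing')) before page.goto lets a test assert that the track element ends in the ERROR state and that the player's fallback UI appears.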
9. Common Pitfalls That Break Caption Test Suites
Caption testing has a unique set of failure modes that do not appear in standard UI testing. These issues are sourced from real bug reports on the Playwright GitHub repository, the WebVTT spec errata, and video.js community forums.
Querying Cues Before the Track Has Loaded
The most frequent mistake in caption tests is reading textTracks[0].cues immediately after setting the mode. The VTT file is fetched asynchronously, and the cue list is null or empty until parsing completes. Always wait for the load event on the <track> element or poll the cue count with waitForFunction.
Forgetting That Seeking Is Asynchronous
Setting video.currentTime = 5.0 does not instantly move the playhead. The browser must decode the video at that position, and the seeked event fires only after the seek completes. Checking activeCues immediately after assignment will return stale data from the previous position. Always wrap seek operations in a Promise that resolves on the seeked event.
CORS Blocking VTT Requests
If your video is served from a CDN and the VTT file is on a different origin, the browser silently blocks the track load unless the VTT response includes the correct Access-Control-Allow-Origin header and the <video> element has the crossorigin attribute (track requests take their CORS mode from the media element; <track> has no crossorigin attribute of its own). The track's readyState transitions to ERROR with no console message in some browsers. Test this explicitly by checking readyState after load.
SRT Files Served as VTT
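SRT (SubRip) files use commas for millisecond separators (00:00:01,000) while VTT uses periods (00:00:01.000). If someone renames a .srt file to .vtt without converting it, the browser ends up with no usable cues. A file that still lacks the WEBVTT header is rejected and the track ends in the ERROR state; a file that has the header but keeps comma timestamps typically reaches LOADED with an empty cue list, because each malformed cue is skipped. Either way nothing is shown to the user and nothing obvious is logged. This is one of the most common caption bugs in production, and the test catches it by asserting readyState and cue count after load.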
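A test suite can detect this condition on the raw file, and a build step can perform the conversion that was skipped. The sketch below uses hypothetical helper names (hasSrtTimestamps, srtToVtt) and handles only the two differences named above, the header and the millisecond separator; SRT-specific markup inside cue text is left untouched.

```typescript
// True when the text contains an SRT-style timing line (comma before milliseconds).
function hasSrtTimestamps(text: string): boolean {
  return /^\d{2}:\d{2}:\d{2},\d{3}\s+-->/m.test(text);
}

// Convert SubRip text to WebVTT: add the header and switch comma to period,
// but only on timing lines so commas in cue text are preserved.
function srtToVtt(srt: string): string {
  const lines = srt.replace(/^\uFEFF/, '').split(/\r\n|\r|\n/);
  const converted = lines.map((line) =>
    line.indexOf('-->') === -1
      ? line
      : line.replace(/(\d{2}:\d{2}:\d{2}),(\d{3})/g, '$1.$2')
  );
  return 'WEBVTT\n\n' + converted.join('\n').trim() + '\n';
}
```

The numeric counters that SRT places before each timing line are kept, since WebVTT accepts them as cue identifiers.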
Browser Differences in Track Mode Behavior
Chromium, Firefox, and WebKit handle the transition from disabled to showing differently. Chromium re-fetches the VTT file when re-enabling a disabled track. Firefox may keep the parsed cues in memory. WebKit sometimes requires a small delay after mode change before the cue list populates. Always use waitForFunction rather than fixed timeouts when waiting for cues.
Caption Testing Pre-Flight Checklist
- VTT file starts with WEBVTT header (no BOM)
- Timestamps use periods, not commas, for milliseconds
- Wait for track load event before querying cues
- Wait for seeked event before checking activeCues
- CORS headers set for cross-origin VTT files
- Test fixture uses a short, deterministic video file
- Only one caption track in showing mode at a time
- Check readyState for ERROR (value 3) after load attempt
Common Anti-Patterns to Avoid
- Reading cues immediately after setting track mode without waiting
- Checking activeCues without waiting for the seeked event
- Using fixed sleeps instead of waitForFunction for cue loading
- Serving SRT files renamed to .vtt without format conversion
- Assuming all browsers handle disabled-to-showing transition identically
- Omitting the crossorigin attribute on the <video> element when VTT files are cross-origin
10. Writing Caption Tests in Plain English with Assrt
Every scenario above requires deep knowledge of the TextTrack API, asynchronous seeking patterns, and browser-specific cue loading behavior. The cue timing test alone is over 40 lines of Playwright TypeScript with nested Promises and event listeners. Assrt lets you describe the test intent in plain English, generates the equivalent Playwright code, and regenerates the selectors and API calls automatically when the underlying video player changes.
The multi-language switching scenario from Section 6 demonstrates this well. In raw Playwright, you need to manually manage track indices, wait for cue loading with polling functions, wrap seek operations in Promise-based event handlers, and verify mode transitions across multiple tracks. In Assrt, you describe what you want to verify and the framework handles the async coordination.
Assrt compiles each scenario into the same Playwright TypeScript you saw in the preceding sections, committed to your repo as real tests you can read, run, and modify. When a video player library updates its DOM structure, when a new browser version changes track mode behavior, or when your VTT file format evolves, Assrt detects the failure, analyzes the new API surface, and opens a pull request with updated test code. Your scenario files remain unchanged.
Start with the caption load smoke test. Once it passes in CI, add the cue timing scenario, then track mode toggling, then multi-language switching, then dynamic caption loading. Within a single afternoon you can have comprehensive video caption coverage that catches the silent VTT failures, CORS blocks, and timing bugs that most applications never detect until users report missing captions.