How to Test PostHog Feature Flags with Playwright: Complete 2026 Guide
A scenario-by-scenario walkthrough for testing PostHog feature flags end to end, covering flag payload evaluation, bootstrapped flags for server-side rendering, local overrides, multivariate flags, rollout percentage verification, and the pitfalls that silently break feature flag test suites.
“PostHog is used by over 100,000 teams, and feature flags are one of the most-used features on the platform, evaluated billions of times per month across production applications.”
PostHog 2025 company blog
PostHog Feature Flag Evaluation Flow
1. Why Testing PostHog Feature Flags Is Harder Than It Looks
PostHog feature flags look simple on the surface. You create a flag in the dashboard, wrap a UI element in a conditional check, and toggle it on or off. But the moment your test suite tries to assert that a flag is evaluated correctly in a real browser, five structural challenges emerge that make reliable testing genuinely difficult.
First, feature flags are evaluated asynchronously. The PostHog JavaScript SDK calls the /decide endpoint after initialization, which means there is always a window where flags are undefined. If your test checks for flag-gated UI immediately after page load, you will get intermittent failures because the SDK has not yet received the flag values from PostHog. Second, multivariate flags return string values (not just booleans), and the variant a user sees depends on a hash of their distinct ID. Your tests need deterministic control over which variant is active, which requires either overriding the distinct ID to match a pre-computed hash bucket or intercepting the API response entirely.
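One way to close that window in a test is to start listening for the flag response before navigating, so assertions only run once the SDK has received its flags. The sketch below assumes the SDK calls a path ending in /decide (as described above) and that @playwright/test is installed; isFlagEvaluationUrl and gotoAndWaitForFlags are hypothetical helper names, not PostHog or Playwright APIs.

```typescript
import type { Page, Response as PlaywrightResponse } from '@playwright/test';

// True when the URL targets PostHog's flag-evaluation endpoint, including a
// reverse-proxied path such as /ingest/decide/ (the query string is ignored).
export function isFlagEvaluationUrl(url: string): boolean {
  return /\/decide\/?$/.test(new URL(url).pathname);
}

// Navigate and resolve only after the /decide response has arrived.
// Registering waitForResponse together with goto avoids missing a fast response.
export async function gotoAndWaitForFlags(
  page: Page,
  path: string,
): Promise<PlaywrightResponse> {
  const [flagResponse] = await Promise.all([
    page.waitForResponse((res) => isFlagEvaluationUrl(res.url())),
    page.goto(path),
  ]);
  return flagResponse;
}
```

Even after the response arrives, the framework still has to re-render, so keep using retrying assertions such as toBeVisible() afterward rather than reading the DOM once.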
Third, flag payloads add a second layer of dynamic data. A multivariate flag might return "variant-b" as the flag value, but also attach a JSON payload like {"cta_text": "Start free trial", "color": "#22c55e"} that drives the actual rendering. Your test must verify both the variant and the payload separately. Fourth, server-side rendering with bootstrapped flags introduces a timing issue: the server evaluates flags via the PostHog Node SDK before HTML is sent to the browser, and then the client SDK re-evaluates on load. If the two evaluations disagree (due to timing or eventual consistency), the UI flickers. Your test must catch that flicker and fail if it happens. Fifth, rollout percentages are probabilistic. A flag set to 30% rollout does not guarantee that exactly 30 out of 100 test users see it, making assertion logic tricky.
PostHog Feature Flag Evaluation Chain
- posthog.init(): the SDK initializes with the project API key
- POST /decide: sends the distinct_id and any groups
- Flag evaluation: PostHog hashes the distinct_id and checks the release conditions
- Return flags: boolean values, multivariate variants, and payloads
- SDK callback: onFeatureFlags() fires
- Render UI: conditional rendering based on the flag
2. Setting Up a Reliable Test Environment
Before writing a single test, you need a PostHog project configured for testability. Create a dedicated project in PostHog (free tier works fine for testing) with its own API key. Never run tests against your production project because flag evaluations generate events that pollute your analytics data and can trigger rate limits under heavy parallel test execution.
PostHog Feature Flag Helper for Tests
Use the PostHog API to programmatically create, update, and clean up feature flags in your test suite. The personal API key (not the project API key) is required for flag management. This helper ensures each test run starts with flags in a known state.
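A minimal sketch of such a helper follows. It assumes Node 18+ for the global fetch, PostHog Cloud US as the default host, and three environment variables whose names (POSTHOG_HOST, POSTHOG_PROJECT_ID, POSTHOG_PERSONAL_API_KEY) are placeholders. The /api/projects/:project_id/feature_flags/ routes are PostHog's REST endpoints for flags, and PostHog's API reference describes flag deletion as a PATCH that sets deleted to true, which the sketch follows; the helper names are illustrative, so check request shapes against the API reference for your PostHog version.

```typescript
export interface Variant {
  key: string;
  rollout_percentage: number;
}

export interface FlagSpec {
  key: string;
  rolloutPercentage: number; // share of all users (no property filters), 0-100
  variants?: Variant[]; // optional multivariate split; percentages should sum to 100
}

export interface ApiConfig {
  host: string;
  projectId: string;
  personalApiKey: string;
}

// Request body for creating a flag or replacing its release conditions.
export function buildFlagBody(spec: FlagSpec) {
  return {
    key: spec.key,
    name: `[e2e] ${spec.key}`,
    active: true,
    filters: {
      groups: [{ properties: [], rollout_percentage: spec.rolloutPercentage }],
      ...(spec.variants ? { multivariate: { variants: spec.variants } } : {}),
    },
  };
}

// Hypothetical env variable names; refuses to run without test-project credentials.
export function configFromEnv(
  env: Record<string, string | undefined> = process.env,
): ApiConfig {
  const { POSTHOG_HOST, POSTHOG_PROJECT_ID, POSTHOG_PERSONAL_API_KEY } = env;
  if (!POSTHOG_PROJECT_ID || !POSTHOG_PERSONAL_API_KEY) {
    throw new Error('Set POSTHOG_PROJECT_ID and POSTHOG_PERSONAL_API_KEY for the test project');
  }
  return {
    host: POSTHOG_HOST ?? 'https://us.posthog.com',
    projectId: POSTHOG_PROJECT_ID,
    personalApiKey: POSTHOG_PERSONAL_API_KEY,
  };
}

async function flagsApi(cfg: ApiConfig, method: string, path: string, body?: unknown): Promise<any> {
  const res = await fetch(`${cfg.host}/api/projects/${cfg.projectId}/feature_flags/${path}`, {
    method,
    headers: {
      Authorization: `Bearer ${cfg.personalApiKey}`, // personal API key, not the project key
      'Content-Type': 'application/json',
    },
    body: body === undefined ? undefined : JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`${method} feature_flags/${path} failed: ${res.status} ${await res.text()}`);
  return res.json();
}

export async function createFlag(cfg: ApiConfig, spec: FlagSpec): Promise<number> {
  const created = await flagsApi(cfg, 'POST', '', buildFlagBody(spec));
  return created.id;
}

export async function updateFlag(cfg: ApiConfig, id: number, spec: FlagSpec): Promise<void> {
  await flagsApi(cfg, 'PATCH', `${id}/`, { filters: buildFlagBody(spec).filters });
}

// Soft delete: mark the flag deleted instead of issuing an HTTP DELETE.
export async function deleteFlag(cfg: ApiConfig, id: number): Promise<void> {
  await flagsApi(cfg, 'PATCH', `${id}/`, { deleted: true });
}
```

A typical pattern is to create flags in a beforeAll hook, keep the returned IDs, and soft-delete them in afterAll so the next run starts clean.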
Playwright Configuration for Feature Flag Tests
Feature flag tests need extra time for the SDK to call the /decide endpoint and render conditional UI. Raise the assertion timeout above Playwright's 5-second default, and pair the configuration with route interception (set up per test or in a fixture) for deterministic flag control.
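A sketch of such a configuration is below. The numbers are illustrative rather than recommended values, the localhost fallback is an assumption, and APP_BASE_URL is the environment variable the scenarios already assume. Blocking service workers matters because Playwright's page.route() does not see requests that a service worker handles.

```typescript
import type { PlaywrightTestConfig } from '@playwright/test';

const config: PlaywrightTestConfig = {
  testDir: './tests',
  // Retrying assertions (toBeVisible, toHaveCount) wait up to this long.
  // Playwright's default is 5 seconds, which can be tight when /decide is slow in CI.
  expect: { timeout: 10_000 },
  use: {
    // Assumed local fallback when APP_BASE_URL is not set.
    baseURL: process.env.APP_BASE_URL ?? 'http://localhost:3000',
    // Cap individual actions (clicks, fills) so a hung page fails fast.
    actionTimeout: 15_000,
    // page.route() cannot intercept requests served by a service worker.
    serviceWorkers: 'block',
    // Keep a trace for failed runs to inspect the /decide response afterwards.
    trace: 'retain-on-failure',
  },
};

export default config;
```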
3. Scenario: Boolean Flag Controls UI Visibility
The simplest feature flag test verifies that a boolean flag toggles a UI element on or off. This is the smoke test every flag-driven feature needs. When the flag is enabled, the new component renders. When it is disabled, the old UI remains. The catch is timing: you cannot check for the element immediately after navigation because the PostHog SDK evaluates flags asynchronously via the /decide endpoint.
Boolean Flag Controls UI Visibility
Difficulty: Straightforward
Goal
Intercept the PostHog /decide response to force a boolean flag to true, then verify the flag-gated component renders. Repeat with the flag forced to false and confirm the component is absent.
Preconditions
- App running at APP_BASE_URL with the PostHog JS SDK initialized
- A page that conditionally renders based on the flag new-pricing-banner
Playwright Implementation
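For the disabled half of this scenario, ordering is the important detail: an absence assertion such as toHaveCount(0) passes instantly if it runs before the SDK has answered, so wait for the intercepted /decide response first. A sketch, where forceFlags and withForcedFlags are hypothetical helper names and the real /decide call is passed through with only the chosen flags rewritten:

```typescript
import type { Page, Route } from '@playwright/test';

export type FlagValue = boolean | string;

// Pure: copy a /decide response body with selected flags forced to fixed values.
export function withForcedFlags(
  body: Record<string, unknown>,
  overrides: Record<string, FlagValue>,
): Record<string, unknown> & { featureFlags: Record<string, FlagValue> } {
  const current = (body.featureFlags ?? {}) as Record<string, FlagValue>;
  return { ...body, featureFlags: { ...current, ...overrides } };
}

// Let the real /decide call through, but rewrite the flags in its response.
export async function forceFlags(page: Page, overrides: Record<string, FlagValue>): Promise<void> {
  await page.route('**/decide/**', async (route: Route) => {
    const response = await route.fetch();
    const body = (await response.json()) as Record<string, unknown>;
    await route.fulfill({ response, json: withForcedFlags(body, overrides) });
  });
}

// Disabled-flag usage inside a Playwright test:
//   await forceFlags(page, { 'new-pricing-banner': false });
//   await Promise.all([
//     page.waitForResponse((res) => res.url().includes('/decide')),
//     page.goto('/pricing'),
//   ]);
//   await expect(page.getByTestId('pricing-banner')).toHaveCount(0);
```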
What to Assert Beyond the UI
Boolean Flag Verification Checklist
- The flag-gated element renders when the flag is true
- The flag-gated element is absent when the flag is false
- PostHog captures a $feature_flag_called event with the correct value
- No layout shift or flicker during flag evaluation
- The page does not flash the wrong variant before settling
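The $feature_flag_called item can be checked by watching PostHog's capture requests. This sketch assumes those requests carry plain JSON bodies; posthog-js can compress or encode them, in which case postDataJSON() fails and the request is skipped, so configure the test build to send uncompressed payloads. The $feature_flag and $feature_flag_response property names are the ones PostHog puts on this event; the helper names are hypothetical.

```typescript
import type { Page, Request } from '@playwright/test';

// Only the fields this check reads.
export interface CapturedEvent {
  event: string;
  properties?: Record<string, unknown>;
}

// Normalize a capture body: one event, an array of events, or { batch: [...] }.
export function extractEvents(payload: unknown): CapturedEvent[] {
  if (Array.isArray(payload)) return payload as CapturedEvent[];
  if (payload !== null && typeof payload === 'object') {
    const obj = payload as { batch?: unknown; event?: unknown };
    if (Array.isArray(obj.batch)) return obj.batch as CapturedEvent[];
    if (typeof obj.event === 'string') return [payload as CapturedEvent];
  }
  return [];
}

// Values reported for one flag across all $feature_flag_called events.
export function flagCalledResponses(events: CapturedEvent[], flagKey: string): unknown[] {
  return events
    .filter((e) => e.event === '$feature_flag_called' && e.properties?.['$feature_flag'] === flagKey)
    .map((e) => e.properties?.['$feature_flag_response']);
}

// Collect events from /e/ and /batch/ requests as the page sends them (the array fills in live).
export function recordCapturedEvents(page: Page): CapturedEvent[] {
  const events: CapturedEvent[] = [];
  page.on('request', (req: Request) => {
    const { pathname } = new URL(req.url());
    if (req.method() !== 'POST' || !/\/(e|batch)\/?$/.test(pathname)) return;
    try {
      events.push(...extractEvents(req.postDataJSON()));
    } catch {
      // Compressed or non-JSON body: skipped in this sketch.
    }
  });
  return events;
}
```

Because the SDK sends events in batches rather than immediately, poll after the UI assertion, for example expect.poll(() => flagCalledResponses(events, 'new-pricing-banner')).toContain(true).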
Boolean Flag Test: Playwright vs Assrt
import { test, expect } from '@playwright/test';
test('shows banner when flag is enabled', async ({ page }) => {
  // Let the real /decide call go through, then overwrite one flag in its response.
  await page.route('**/decide/**', async (route) => {
    const response = await route.fetch();
    const body = await response.json();
    body.featureFlags = {
      ...body.featureFlags,
      'new-pricing-banner': true,
    };
    // Reuse the original status and headers, replacing only the body.
    await route.fulfill({
      response,
      body: JSON.stringify(body),
    });
  });
  await page.goto('/pricing');
  // Flags arrive asynchronously, so allow time for /decide and the re-render.
  await expect(
    page.getByTestId('pricing-banner')
  ).toBeVisible({ timeout: 10_000 });
  await expect(
    page.getByText('Try our new pricing plan')
  ).toBeVisible();
});
4. Scenario: Multivariate Flag Renders Correct Variant
Multivariate flags return a string value instead of a boolean. In PostHog, you define variants like "control", "variant-a", and "variant-b", each with a rollout percentage. The tricky part is that the variant a user receives depends on a hash of their distinct ID combined with the flag key. In production this creates a consistent experience, but in tests it means you need deterministic control over which variant is active, or you will get different results every time a test user ID changes.
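To see why the distinct ID matters, and to pick an ID that deterministically lands in a given variant when testing against a live project, you can reproduce the bucketing locally. This sketch is modeled on the local-evaluation logic in PostHog's open-source server SDKs (SHA-1 of the flag key, distinct ID and the salt "variant", scaled to [0, 1] and matched against cumulative variant percentages). Treat it as an approximation and confirm against a real /decide response before relying on it. It also assumes the user already passes the flag's rollout condition, for example a 100% rollout.

```typescript
import { createHash } from 'node:crypto';

export interface Variant {
  key: string;
  rollout_percentage: number; // variant percentages sum to 100
}

// 0xFFFFFFFFFFFFFFF (2^60 - 1); in double precision this rounds to 2^60.
const LONG_SCALE = 2 ** 60 - 1;

// Deterministic value in [0, 1] for a flag/user pair; the salt keeps the
// variant hash independent of the rollout hash.
export function hashToUnit(flagKey: string, distinctId: string, salt = ''): number {
  const hex = createHash('sha1').update(`${flagKey}.${distinctId}${salt}`).digest('hex');
  return parseInt(hex.slice(0, 15), 16) / LONG_SCALE;
}

// Walk cumulative ranges: [0, p1), [p1, p1 + p2), ...
export function assignVariant(flagKey: string, distinctId: string, variants: Variant[]): string | null {
  const value = hashToUnit(flagKey, distinctId, 'variant');
  let min = 0;
  for (const v of variants) {
    const max = min + v.rollout_percentage / 100;
    if (value >= min && value < max) return v.key;
    min = max;
  }
  return null;
}

// Try candidate IDs until one hashes into the wanted variant.
export function findDistinctIdForVariant(
  flagKey: string,
  variants: Variant[],
  target: string,
  prefix = 'e2e-user',
  maxTries = 10_000,
): string | null {
  for (let i = 0; i < maxTries; i++) {
    const id = `${prefix}-${i}`;
    if (assignVariant(flagKey, id, variants) === target) return id;
  }
  return null;
}
```

A found ID only helps if the app lets the test choose its distinct ID (for example through the SDK's bootstrap option); otherwise route interception, used in the rest of this guide, is the simpler path.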
Multivariate Flag Renders Correct Variant
Difficulty: Moderate
Goal
Force each multivariate variant via API interception and confirm the UI renders the correct component for each variant string.
Preconditions
- Feature flag checkout-flow-experiment with variants: control, single-page, wizard
- Checkout page renders a different layout per variant
Playwright Implementation
Multivariate Flag: Playwright vs Assrt
import { test, expect } from '@playwright/test';
test('checkout-flow-experiment: single-page variant', async ({ page }) => {
  await page.route('**/decide/**', async (route) => {
    const response = await route.fetch();
    const body = await response.json();
    body.featureFlags = {
      ...body.featureFlags,
      'checkout-flow-experiment': 'single-page',
    };
    await route.fulfill({
      response,
      body: JSON.stringify(body),
    });
  });
  await page.goto('/checkout');
  await expect(
    page.getByRole('heading', { name: 'Quick Checkout' })
  ).toBeVisible({ timeout: 10_000 });
  const steps = page.getByTestId('checkout-step');
  await expect(steps).toHaveCount(1);
});
5. Scenario: Flag Payload Drives Dynamic Configuration
PostHog feature flag payloads let you attach arbitrary JSON data to each flag variant. Instead of hardcoding configuration per variant in your frontend code, you read the payload from PostHog and use it to drive rendering. This is powerful for A/B testing copy, colors, pricing values, and other parameters without deploying new code. But it also means your tests need to verify both the flag value and the payload content separately, because a correct variant with a malformed payload will still break the UI.
Flag Payload Drives Dynamic Configuration
Difficulty: Complex
Goal
Verify that a flag payload containing CTA text, button color, and discount percentage is correctly parsed and rendered in the UI.
Preconditions
- Feature flag promo-banner-config with a JSON payload per variant
- The promo banner component reads its content from posthog.getFeatureFlagPayload()
Playwright Implementation
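A sketch of the interception for this scenario: it forces both the variant in featureFlags and the matching entry in featureFlagPayloads, the pairing the pitfalls section warns about. It assumes payload values travel as JSON-encoded strings, as in the fixture example in section 8; passing a raw string instead of an object lets you inject malformed JSON for the last checklist item. withVariantAndPayload and forcePayload are hypothetical names.

```typescript
import type { Page, Route } from '@playwright/test';

// The parts of a /decide response body this sketch touches.
export interface DecideBody {
  featureFlags?: Record<string, boolean | string>;
  featureFlagPayloads?: Record<string, string>;
  [field: string]: unknown;
}

// Pure: force a flag's value and payload together. Objects are JSON-encoded,
// strings pass through verbatim (for malformed-JSON cases), and undefined
// removes the payload entirely.
export function withVariantAndPayload(
  body: DecideBody,
  flagKey: string,
  value: boolean | string,
  payload: unknown,
): DecideBody {
  const featureFlagPayloads = { ...(body.featureFlagPayloads ?? {}) };
  if (payload === undefined) {
    delete featureFlagPayloads[flagKey];
  } else {
    featureFlagPayloads[flagKey] = typeof payload === 'string' ? payload : JSON.stringify(payload);
  }
  return {
    ...body,
    featureFlags: { ...(body.featureFlags ?? {}), [flagKey]: value },
    featureFlagPayloads,
  };
}

export async function forcePayload(
  page: Page,
  flagKey: string,
  value: boolean | string,
  payload: unknown,
): Promise<void> {
  await page.route('**/decide/**', async (route: Route) => {
    const response = await route.fetch();
    const body = (await response.json()) as DecideBody;
    await route.fulfill({ response, json: withVariantAndPayload(body, flagKey, value, payload) });
  });
}
```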
What to Assert Beyond the UI
Payload-driven flags require deeper assertions. The payload is JSON, so your test must verify that the parsing works correctly, that missing fields are handled gracefully, and that type coercion does not silently break the rendering logic.
Payload Flag Verification Checklist
- CTA text from payload renders in the correct element
- Dynamic color from payload applies to button styles
- Numeric payload values (discount_percent) display correctly
- Banner is absent when payload is missing or flag is false
- Malformed JSON in payload does not crash the page
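Those checklist items are easiest to cover when the component parses its payload defensively and the test replays a table of edge cases. The parser below is a hypothetical app-side sketch for promo-banner-config using this section's field names (cta_text, color, discount_percent); it returns null instead of throwing, so the banner can simply stay hidden. Each raw string in the table can be injected through route interception and paired with an assertion that the banner is visible or absent.

```typescript
export interface PromoPayload {
  cta_text: string;
  color?: string;
  discount_percent?: number;
}

// Returns null for anything unusable so the caller can hide the banner instead of crashing.
export function parsePromoPayload(raw: unknown): PromoPayload | null {
  let value: unknown = raw;
  if (typeof raw === 'string') {
    try {
      value = JSON.parse(raw);
    } catch {
      return null; // malformed JSON
    }
  }
  if (value === null || typeof value !== 'object') return null;
  const v = value as Record<string, unknown>;
  if (typeof v.cta_text !== 'string' || v.cta_text.trim() === '') return null;

  const result: PromoPayload = { cta_text: v.cta_text };
  if (v.color !== undefined) {
    if (typeof v.color !== 'string' || !/^#(?:[0-9a-fA-F]{3}){1,2}$/.test(v.color)) return null;
    result.color = v.color;
  }
  if (v.discount_percent !== undefined) {
    // Coerce numeric strings explicitly so "25" never ends up in string concatenation.
    const n = typeof v.discount_percent === 'string' ? Number(v.discount_percent) : v.discount_percent;
    if (typeof n !== 'number' || !Number.isFinite(n) || n < 0 || n > 100) return null;
    result.discount_percent = n;
  }
  return result;
}

// Edge cases to replay through the payload interception; renders = banner expected.
export const PAYLOAD_CASES: { name: string; raw: string; renders: boolean }[] = [
  { name: 'complete payload', raw: '{"cta_text":"Start free trial","color":"#22c55e","discount_percent":20}', renders: true },
  { name: 'numeric string discount', raw: '{"cta_text":"Limited time offer","discount_percent":"25"}', renders: true },
  { name: 'missing cta_text', raw: '{"color":"#22c55e"}', renders: false },
  { name: 'invalid color', raw: '{"cta_text":"Go","color":"#22c55"}', renders: false },
  { name: 'non-numeric discount', raw: '{"cta_text":"Go","discount_percent":"abc"}', renders: false },
  { name: 'malformed JSON', raw: '{"cta_text": "Start', renders: false },
  { name: 'null payload', raw: 'null', renders: false },
];
```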
6. Scenario: Bootstrap Flags for SSR Without Flicker
When you use server-side rendering with PostHog, the server evaluates feature flags via the PostHog Node SDK and embeds the results into the HTML. On the client side, you pass these pre-evaluated flags as the bootstrap option in posthog.init(). This eliminates the flash of default content that happens while the client SDK calls /decide. But bootstrapping introduces a subtle failure mode: if the server and client evaluate the flag differently (because the flag was updated between the server render and the client hydration), the UI will switch from one variant to another after hydration. Your test must detect this flicker.
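One way to surface that mismatch directly is to compare the flags the server bootstrapped with the ones the client's /decide call returns. The comparison itself is plain data. The page side assumes, hypothetically, that your app serializes its bootstrap object onto globalThis.__PH_BOOTSTRAP__; substitute wherever your framework actually puts it.

```typescript
import type { Page } from '@playwright/test';

export type Flags = Record<string, boolean | string>;

export interface FlagMismatch {
  key: string;
  bootstrap: boolean | string;
  fresh: boolean | string | undefined;
}

// Only bootstrapped keys matter for flicker: those are rendered before /decide answers.
export function diffFlags(bootstrap: Flags, fresh: Flags): FlagMismatch[] {
  return Object.keys(bootstrap)
    .filter((key) => bootstrap[key] !== fresh[key])
    .map((key) => ({ key, bootstrap: bootstrap[key], fresh: fresh[key] }));
}

export async function bootstrapMismatches(page: Page, path: string): Promise<FlagMismatch[]> {
  const [decide] = await Promise.all([
    page.waitForResponse((res) => res.url().includes('/decide')),
    page.goto(path),
  ]);
  const fresh = ((await decide.json()) as { featureFlags?: Flags }).featureFlags ?? {};
  // Hypothetical location of the server-rendered bootstrap object.
  const bootstrap = await page.evaluate(
    () => ((globalThis as any).__PH_BOOTSTRAP__?.featureFlags ?? {}) as Record<string, boolean | string>,
  );
  return diffFlags(bootstrap, fresh);
}
```

To exercise the failure path on purpose, combine this with a /decide interception that returns a different value than the server bootstrapped.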
Bootstrap Flags for SSR Without Flicker
Difficulty: Complex
Goal
Verify that a bootstrapped flag renders the correct variant in the initial HTML (no flash of wrong content), and that if the client SDK receives a different value, the transition is either suppressed or handled gracefully.
Playwright Implementation
SSR Bootstrap Flag Flow
- Server request: the user requests a page
- Node SDK: flags are evaluated server-side
- Render HTML: flag values are embedded for bootstrapping
- Client hydration: posthog.init({ bootstrap })
- Client /decide: flags are revalidated against PostHog
- Compare: bootstrap values vs. the fresh evaluation
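The user-visible symptom of a server/client mismatch is the UI switching after hydration. A sketch for detecting that switch: an init script records every distinct text the flag-gated element shows from the first parsed HTML onward, and the test fails if more than one non-empty value appears. The selector and the __flagRenders global are hypothetical; MutationObserver and page.addInitScript are standard browser and Playwright APIs.

```typescript
import type { Page } from '@playwright/test';

// Collapse a render history into the sequence of distinct, non-empty texts shown.
// More than one entry means the element switched content after first paint.
export function distinctRenders(samples: string[]): string[] {
  const shown: string[] = [];
  for (const text of samples) {
    if (text !== '' && text !== shown[shown.length - 1]) shown.push(text);
  }
  return shown;
}

// Install before navigation: records the element's text on every DOM mutation,
// starting while the server-rendered HTML is still being parsed.
export async function recordRenders(page: Page, selector: string): Promise<void> {
  await page.addInitScript((sel: string) => {
    const g = globalThis as any;
    const history: string[] = (g.__flagRenders = []);
    const snapshot = () => {
      const text = document.querySelector(sel)?.textContent?.trim() ?? '';
      if (history[history.length - 1] !== text) history.push(text);
    };
    new MutationObserver(snapshot).observe(document, {
      childList: true,
      subtree: true,
      characterData: true,
    });
  }, selector);
}

export async function readRenders(page: Page): Promise<string[]> {
  return page.evaluate(() => ((globalThis as any).__flagRenders ?? []) as string[]);
}

// Usage sketch inside a Playwright test (selector is hypothetical):
//   await recordRenders(page, '[data-testid="checkout-heading"]');
//   await Promise.all([
//     page.waitForResponse((res) => res.url().includes('/decide')),
//     page.goto('/checkout'),
//   ]);
//   await page.waitForLoadState('networkidle'); // let any post-/decide re-render land
//   expect(distinctRenders(await readRenders(page))).toHaveLength(1);
```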
7. Scenario: Verifying Rollout Percentage Distribution
Rollout percentages are one of the most misunderstood aspects of feature flags. A flag set to 30% rollout does not mean "30% of page loads see it." It means 30% of distinct user IDs hash into the enabled bucket. PostHog uses a deterministic hashing algorithm (based on the distinct ID and flag key) so the same user always gets the same result. This is great for user consistency but makes statistical verification in tests more nuanced. You cannot just run 100 requests and expect exactly 30 to be enabled.
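The bucketing can be reproduced locally to see how a percentage turns into a per-user decision. As with any reimplementation, treat this as a sketch modeled on the local-evaluation logic in PostHog's open-source server SDKs (SHA-1 of "flag key.distinct id", first 15 hex digits, scaled to [0, 1], enabled when the value is at or below the rollout fraction) rather than as a guaranteed description of PostHog's algorithm. Because the sample IDs are fixed, the observed percentage is identical on every run.

```typescript
import { createHash } from 'node:crypto';

// 0xFFFFFFFFFFFFFFF (2^60 - 1); in double precision this rounds to 2^60.
const LONG_SCALE = 2 ** 60 - 1;

// Deterministic position of a user in [0, 1] for one flag.
export function bucket(flagKey: string, distinctId: string): number {
  const hex = createHash('sha1').update(`${flagKey}.${distinctId}`).digest('hex');
  return parseInt(hex.slice(0, 15), 16) / LONG_SCALE;
}

// Enabled when the user's position falls at or below the rollout fraction.
export function inRollout(flagKey: string, distinctId: string, rolloutPercentage: number): boolean {
  return bucket(flagKey, distinctId) <= rolloutPercentage / 100;
}

// Percentage of a fixed set of synthetic IDs that land inside the rollout.
export function observedRolloutPct(flagKey: string, rolloutPercentage: number, sampleSize: number): number {
  let enabled = 0;
  for (let i = 0; i < sampleSize; i++) {
    if (inRollout(flagKey, `test-user-${i}`, rolloutPercentage)) enabled++;
  }
  return (enabled * 100) / sampleSize;
}
```

Comparing observedRolloutPct for 100 IDs against 5,000 IDs is a quick way to see how far small samples wander from the configured percentage.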
Verifying Rollout Percentage Distribution
Difficulty: Complex
Goal
Generate a large number of distinct user IDs, evaluate the flag for each, and verify the distribution falls within an acceptable margin of the configured rollout percentage.
Playwright Implementation
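A sketch that samples real evaluations from your test project: it calls /decide directly with generated distinct IDs (the api_key plus distinct_id JSON body PostHog documents for that endpoint), counts how many come back enabled, and checks the result against a tolerance. Skipping the browser keeps 500+ evaluations fast, but every call is still a real flag request against the test project, so keep concurrency modest. sampleRollout and withinTolerance are hypothetical names, and the host and key are values you supply.

```typescript
export interface RolloutSampleOptions {
  host: string; // ingestion host for the test project, e.g. https://us.i.posthog.com
  projectApiKey: string; // project (public) key, not the personal API key
  flagKey: string;
  sampleSize: number;
  concurrency?: number;
}

// Count how many generated distinct IDs PostHog reports the flag as enabled for.
export async function sampleRollout(opts: RolloutSampleOptions): Promise<number> {
  const concurrency = opts.concurrency ?? 10;
  let enabled = 0;
  for (let start = 0; start < opts.sampleSize; start += concurrency) {
    const size = Math.min(concurrency, opts.sampleSize - start);
    const ids = Array.from({ length: size }, (_, i) => `rollout-check-${start + i}`);
    const results = await Promise.all(
      ids.map(async (distinct_id) => {
        const res = await fetch(`${opts.host}/decide/?v=3`, {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ api_key: opts.projectApiKey, distinct_id }),
        });
        if (!res.ok) throw new Error(`/decide returned ${res.status}`);
        const body = (await res.json()) as { featureFlags?: Record<string, boolean | string> };
        // Multivariate flags return a variant string; any truthy value counts as enabled.
        return Boolean(body.featureFlags?.[opts.flagKey]);
      }),
    );
    enabled += results.filter(Boolean).length;
  }
  return enabled;
}

// True when the observed percentage is within +/- tolerancePts points of the target.
export function withinTolerance(enabled: number, total: number, expectedPct: number, tolerancePts: number): boolean {
  const observedPct = (enabled * 100) / total;
  return Math.abs(observedPct - expectedPct) <= tolerancePts;
}
```

In a test, the check then reduces to asserting withinTolerance(await sampleRollout({ ...opts, sampleSize: 500 }), 500, 30, 5).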
8. Scenario: Local Overrides for Deterministic CI
The biggest problem with testing feature flags in CI is nondeterminism. If your tests rely on the live PostHog API, a flag change in the dashboard during a test run can cause unexpected failures. The solution is local flag overrides: intercept every PostHog API call in your test and return predetermined values. This removes the network dependency entirely and makes your tests deterministic, fast, and runnable offline.
PostHog supports two override mechanisms. First, the JavaScript SDK accepts a bootstrap property at initialization time that provides flag values synchronously. Second, you can call posthog.featureFlags.override() after initialization to force specific flag values for the current session. In Playwright tests, the cleanest approach is route interception because it works regardless of how your app initializes the SDK.
Local Overrides for Deterministic CI
Difficulty: Moderate
Goal
Create a reusable Playwright fixture that intercepts all PostHog API calls and returns a fully controlled set of feature flags, making every test deterministic regardless of the live PostHog state.
Playwright Implementation
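A sketch of that fixture: every /decide request is answered from an in-memory flag set (no network), PostHog capture requests are acknowledged and dropped, and withFlags swaps the flag set before navigation. It uses only type imports from Playwright, so the test.extend wiring appears in the trailing comment. The synthetic response carries just featureFlags and featureFlagPayloads, which assumes your SDK version tolerates the other /decide fields being absent, and the host check assumes PostHog Cloud (*.posthog.com); adjust it if you proxy PostHog through your own domain.

```typescript
import type { Page, Route } from '@playwright/test';

export type FlagValue = boolean | string;

export interface PostHogOverrides {
  withFlags(flags: Record<string, FlagValue>, payloads?: Record<string, string>): Promise<void>;
}

// Pure: a minimal /decide-shaped response containing only the forced values.
export function buildDecideBody(flags: Record<string, FlagValue>, payloads: Record<string, string> = {}) {
  return { featureFlags: flags, featureFlagPayloads: payloads };
}

// PostHog Cloud capture endpoints (/e/, /batch/); assumes no custom proxy domain.
export function isPostHogCaptureUrl(url: URL): boolean {
  return url.hostname.endsWith('posthog.com') && /^\/(e|batch)\/?$/.test(url.pathname);
}

export async function installPostHogOverrides(page: Page): Promise<PostHogOverrides> {
  let current = buildDecideBody({});
  // Answer flag evaluation from memory; the handler reads `current` per request,
  // so withFlags() only needs to run before page.goto().
  await page.route('**/decide/**', (route: Route) => route.fulfill({ json: current }));
  // Acknowledge and drop analytics events so tests stay offline and the project stays clean.
  await page.route(isPostHogCaptureUrl, (route: Route) => route.fulfill({ json: { status: 1 } }));
  return {
    async withFlags(flags, payloads = {}) {
      current = buildDecideBody(flags, payloads);
    },
  };
}

// Wiring for fixtures/posthog-override.ts (runtime import of @playwright/test):
//   import { test as base, expect } from '@playwright/test';
//   export const test = base.extend<{ posthog: PostHogOverrides }>({
//     posthog: async ({ page }, use) => {
//       await use(await installPostHogOverrides(page));
//     },
//   });
//   export { expect };
```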
Using the Fixture in Tests
Local Overrides: Playwright vs Assrt
import { test, expect } from '../fixtures/posthog-override';
test('deterministic flag test', async ({ page, posthog }) => {
  await posthog.withFlags({
    'new-pricing-banner': true,
    'checkout-flow-experiment': 'wizard',
    'promo-banner-config': 'active',
  }, {
    'promo-banner-config': JSON.stringify({
      cta_text: 'Limited time offer',
      discount_percent: 25,
    }),
  });
  await page.goto('/');
  await expect(page.getByTestId('pricing-banner')).toBeVisible();
  await expect(page.getByText('Limited time offer')).toBeVisible();
});
9. Common Pitfalls That Break Feature Flag Test Suites
Feature flag testing has a unique set of failure modes that differ from standard UI testing. These pitfalls come from real PostHog GitHub issues, community discussions, and production incident reports. Understanding them before you build your test suite will save you hours of debugging flaky tests.
Pitfalls to Avoid
- Asserting flag-gated UI immediately after page.goto() without waiting for the /decide response. The PostHog SDK evaluates flags asynchronously, so checking too early gives you the default (unflagged) state.
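- Using the same distinct_id across parallel test workers. PostHog's deterministic hashing means the same user always gets the same variant, but workers that share an ID also share one person record, so one worker's identify calls or person-property updates can change the targeting conditions another worker's flags are evaluated against.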
- Forgetting to intercept featureFlagPayloads alongside featureFlags. If your route interception only overrides featureFlags, any code that reads payloads via getFeatureFlagPayload() will get null or stale data.
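- Testing rollout percentages with too small a sample. At a 30% rollout, 100 evaluations have a standard deviation of about 4.6 percentage points, so results anywhere from roughly 20% to 40% are unremarkable. Use at least 500 samples (standard deviation about 2 points) and a +/- 5 point tolerance.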
- Not blocking PostHog event tracking in tests. The /e/ and /batch/ endpoints fire on every page load and flag evaluation. Without blocking them, your test PostHog project fills with garbage events and API calls slow down your tests.
- Relying on posthog.onFeatureFlags() callback timing. This callback fires after the /decide response, but React re-renders are asynchronous. Use Playwright's expect().toBeVisible() with a timeout instead of waitForTimeout after the callback.
- Updating flags via the PostHog dashboard during a CI run. Flag changes propagate to the /decide endpoint within seconds, but CDN caching can delay the change by up to 60 seconds, causing intermittent test failures that only happen in CI.
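For the shared distinct_id pitfall, one option is to derive each test's ID from Playwright's testInfo.testId, which is unique per test and stable across runs: no two workers share a person, and a given test keeps landing in the same hash bucket. The sketch assumes, hypothetically, that your app exposes the SDK as window.posthog (as the HTML snippet install does) so the test can call posthog.identify(); if it does not, pass the ID through the SDK's bootstrap option instead.

```typescript
import type { Page, TestInfo } from '@playwright/test';

// Stable and unique per test: no cross-worker sharing, same hash bucket on every run.
export function testDistinctId(testId: string, prefix = 'e2e'): string {
  return `${prefix}-${testId}`;
}

export async function identifyForTest(page: Page, testInfo: TestInfo): Promise<string> {
  const id = testDistinctId(testInfo.testId);
  // Hypothetical: assumes the page exposes the SDK as window.posthog.
  await page.evaluate((distinctId) => {
    (globalThis as any).posthog?.identify(distinctId);
  }, id);
  return id;
}
```

Identifying after load makes the SDK fetch flags again, so wait for the next /decide response before asserting on flag-gated UI.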
10. Writing These Scenarios in Plain English with Assrt
Every Playwright test in this guide follows the same pattern: intercept the PostHog /decide endpoint, override the flag values, navigate to the page, and assert the UI matches the expected variant. That pattern is powerful but verbose. Each scenario requires 20 to 40 lines of TypeScript with route interception boilerplate, JSON body manipulation, and explicit timeout management. Assrt lets you express the same intent in plain English and compiles it to the same Playwright code.
The route interception boilerplate, JSON body manipulation, and timeout handling are all generated automatically. When PostHog changes its /decide API response format or your app restructures its flag-gated components, Assrt detects the failure, analyzes the updated DOM and API responses, and opens a pull request with the updated test code. Your scenario files stay untouched.
Start with the boolean flag scenario. Once it passes in CI, add the multivariate test, then the payload verification, then the SSR bootstrap flicker check, then the local overrides fixture. Within an afternoon you will have complete feature flag coverage that most teams never achieve because the interception boilerplate is too tedious to maintain by hand.