Healthcare web app testing
AI test generator for medical software: the self-hosted setup that does not become a Business Associate
The hardest question with an AI test tool for a healthcare web app is not whether it can drive a browser. It is whether the tool itself, once it sees patient data on screen, becomes another HIPAA vendor you have to paper. Here is the architecture that avoids that.
Direct answer (verified 2026-05-02)
Run a self-hosted, open-source AI test generator that writes plain Playwright code into your own repo. Point its central API at your own endpoint or skip the cloud sync. Run the browser in an in-memory isolated profile so cookies and PHI never persist. Test against staging seeded with synthetic data, not production. Done this way, the test generator never accesses PHI on behalf of your covered entity, and you do not sign a Business Associate Agreement with a test vendor. The HHS reference for the BA definition is at hhs.gov/hipaa/for-professionals/faq/business-associates.
The compliance question hiding in your test stack
When a covered entity buys a SaaS test automation product, the contract conversation usually starts with the security questionnaire and ends with a Business Associate Agreement. That paperwork exists because the test tool, by design, ends up with access to whatever the test sees: a patient record on screen, a chart in the corner of a screenshot, a payload in a network log. Under the HHS definition of a business associate, anyone who performs a function on behalf of a covered entity that involves access to protected health information falls inside the rule. A cloud test runner that records video of your live charting workflow is performing exactly such a function.
The standard answer is to sign the BAA, complete the vendor risk review, and proceed. Some tools (like testRigor) advertise that they can act as a Business Associate. That is a legitimate path, and for some teams the simpler one. It is not the only one.
The other path is to choose an architecture where the test tool never touches PHI in the first place. The way to do that is not to add another vendor to your contract pile. It is to remove the vendor from the data path entirely.
Two architectures, one decision
The trade is not capability against capability. Both architectures run AI-driven browser tests. The trade is who gets access to what.
In the closed-source SaaS architecture, your healthcare app's browser session, screenshots, video recordings, and DOM snapshots are streamed to the vendor's cloud, where an LLM proposes scenarios and an orchestrator runs them. The scenarios live in the vendor dashboard. The audit trail lives in the vendor dashboard. Test artifacts live in the vendor's S3.
- Vendor cloud gets a real or near-real view of your app
- PHI in screenshots and video is almost certain if run against prod
- Business Associate Agreement required, plus annual review
- Generated logic lives behind a login, not in your git history
- Removing the vendor means rewriting tests from scratch
In the self-hosted architecture, the generator runs on your own infrastructure, the LLM sees text from a staging environment seeded with synthetic data, and the output is plain Playwright code committed to your repo.
- No vendor cloud in the data path
- No PHI leaves your network when staging holds only synthetic data
- No Business Associate Agreement with a test vendor
- Generated tests live in git history and go through normal code review
- Removing the generator leaves every committed test intact
Where the opt-out lives, in source
The point of an open-source generator is that you can verify these claims yourself. Two specific lines in the Assrt source matter for this guide. They are short.
The two opt-outs that keep PHI off the wire
// src/core/scenario-store.ts, line 14
const CENTRAL_API_URL =
  process.env.ASSRT_API_URL ||
  "https://app.assrt.ai";
// Set ASSRT_API_URL to your own URL, or to an unreachable
// host, and scenario sync goes there instead. Local cache
// at ~/.assrt/scenarios is still authoritative.

// src/core/browser.ts, lines 307-309: the --isolated flag
// spawns Playwright with an in-memory user data directory,
// so no cookies, history, or session data persist on disk.

The combination is what matters. The first opt-out means the centralized scenario sync is opt-out, not opt-in. The second means a tester can run an AI agent against an authenticated healthcare workflow and leave nothing on disk between runs. Together they cover the two ways a generic test tool would normally end up with a copy of PHI: through the cloud sync, or through a stale browser profile.
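The first opt-out is the standard env-var-with-fallback pattern, which you can reason about in isolation. A minimal sketch — resolveSyncTarget is an illustrative helper written for this guide, not part of the Assrt API:

```typescript
// Illustrative sketch of the opt-out pattern: the environment
// variable wins, the vendor default applies only when it is unset.
function resolveSyncTarget(env: Record<string, string | undefined>): string {
  return env.ASSRT_API_URL || "https://app.assrt.ai";
}

// With the variable set, sync stays inside your network:
console.log(resolveSyncTarget({ ASSRT_API_URL: "http://localhost:0" }));
// → "http://localhost:0"

// With it unset, the vendor default applies — the opt-out is yours to take:
console.log(resolveSyncTarget({}));
// → "https://app.assrt.ai"
```

The practical consequence: setting one environment variable in CI is the entire opt-out, and an auditor can confirm it by reading one line of source.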
What a healthcare-friendly test generation flow looks like
The flow below is what we recommend for a covered entity testing a patient-facing or charting web app. Nothing in it is novel. It is mostly about putting boundaries in the right places.
Test generation against a healthcare staging environment
The arrows that matter are the absent ones. There is no arrow from the engineer to a vendor dashboard. No arrow from the staging app to a third-party screenshot store. The LLM provider gets text from the accessibility tree of a synthetic environment, never from a real patient record. If your model provider is itself under a BAA (most enterprise Anthropic and OpenAI plans support this), even running against environments containing PHI is on a documented footing. But the recommendation is still to run against staging.
The setup, end to end
The whole loop is eight steps. None are exotic.
Self-hosted AI test generation for a healthcare web app
- Stand up a staging environment of your healthcare app, isolated from production data stores
- Seed staging from Synthea or a faker-style generator so the database holds synthetic patients with FHIR-shaped records
- Add Assrt as a dev dependency: npm install --save-dev @m13v/assrt
- Set ASSRT_API_URL=http://localhost:0 (or your own internal endpoint) so scenario sync stays inside your network
- Run with --isolated so the user data directory stays in memory and no cookies persist between runs
- Use whatever LLM provider already has a BAA with you, or a local model behind an OpenAI-compatible proxy
- Commit the generated .spec.ts files into your repo so they go through normal code review and live in git history
- Run the .spec.ts files in CI (any Playwright runner, no Assrt needed at runtime once tests are committed)
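The CI side of the steps above is ordinary Playwright configuration. A minimal sketch of what that could look like — the file layout and option choices here are illustrative assumptions, not Assrt's own config; only STAGING_URL comes from the worked example in this guide:

```typescript
// playwright.config.ts — a minimal sketch for running the
// committed .spec.ts files in CI. Once the generated tests are
// in the repo, this plus `npx playwright test` is the whole
// runtime: no Assrt dependency, no vendor cloud.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  testDir: "./tests",
  retries: 2, // tolerate transient staging flakiness in CI
  use: {
    baseURL: process.env.STAGING_URL, // synthetic-data staging, never production
    trace: "retain-on-failure",       // failure artifacts stay on your own CI runner
  },
});
```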
Why generated Playwright code matters for medical software
Healthcare software does not get to live in a vendor dashboard. The design history file for a Software as a Medical Device submission, the IEC 62304 process documentation, and the verification evidence the FDA and notified bodies expect, all assume the artifacts are auditable things you control. A pull request with a reviewer name and a SHA is such an artifact. A test that exists only as a row in a third-party orchestrator is not.
Generated Playwright tests slot into the same review process as the rest of your TypeScript. The reviewer reads the file, sees what is asserted, signs off, merges. The git log is the audit trail. If a regulator asks how the patient consent flow is verified, the answer is a path to a file, not a screen recording of someone clicking through a SaaS UI.
The same fact lets you delete the generator without deleting your tests. Once a test is written, the AI tool that wrote it is no longer load-bearing. A team that adopts this pattern can replace the generator next year without losing a single assertion.
When to still pick a closed-source vendor
The self-hosted route is not always the right answer. It costs engineering time to wire up, requires a staging environment and synthetic data, and assumes someone on your team is comfortable owning Playwright. There are two cases where the SaaS path is more honest.
The first is when your team has no in-house QA capacity at all and needs an outsourced QA function plus AI tooling in one purchase. A managed service like QA Wolf is a different product than what this page is about. You are buying labor, not a generator.
The second is when your compliance program is already structured around BAAs with named SaaS vendors and adding one more is cheaper than introducing a self-hosted tool that needs its own security review. That is a reasonable read, and a tool that signs a BAA is a legitimate purchase. The point of this guide is that there is a third option, not that the other two are wrong.
A worked example: a charting workflow
To make this concrete, here is a small piece of generated test code against a synthetic charting flow. Nothing in this file references a real patient. The selectors come from the accessibility tree, so they survive a UI rewrite. The assertions are whatever your regulatory test plan calls for.
Generated tests/charting.spec.ts vs the equivalent SaaS scenario
import { test, expect } from "@playwright/test";
test("chart save persists vital signs", async ({ page }) => {
await page.goto(process.env.STAGING_URL!);
await page
.getByRole("textbox", { name: "MRN" })
.fill("SYN-000142");
await page
.getByRole("button", { name: "Open chart" })
.click();
await page
.getByRole("spinbutton", { name: "Heart rate" })
.fill("78");
await page
.getByRole("button", { name: "Save chart" })
.click();
await expect(
page.getByRole("status", { name: "Saved" }),
).toBeVisible();
});

The generated version is a file. Your existing review process applies. The SaaS version is content in someone else's database. The functional coverage is identical. The compliance posture is not.
Want to walk through this for your specific healthcare app?
Bring your stack and one workflow you would like covered. Thirty minutes, no slides, we sketch the test generation setup against your staging environment.
Frequently asked questions
Does an AI test generator that runs in our CI need a Business Associate Agreement?
It depends on whether the tool has access to PHI. The HHS definition of a business associate is a person or entity that performs a function on behalf of a covered entity that involves access to protected health information (https://www.hhs.gov/hipaa/for-professionals/faq/business-associates/index.html). A test generator that crawls your live healthcare app, takes screenshots of patient records, and forwards them to a vendor cloud has access to PHI and almost certainly needs a BAA. A test generator that runs against a staging environment seeded with synthetic data and outputs plain Playwright code into your own repo does not access PHI at all and avoids the question.
What makes Assrt different from testRigor or ZeroStep for healthcare apps?
testRigor offers a Business Associate Agreement, which is the cloud-SaaS approach: trust the vendor, sign paperwork, route PHI through their pipeline. ZeroStep extends Playwright with an ai() function that forwards DOM and screenshots to OpenAI, which is a third-party data flow your compliance team has to evaluate. Assrt is open source under the MIT license, runs on your own infrastructure, and outputs standard Playwright .spec files into your repo. The tests live in your code, run in your CI, and the only data they ever see is whatever you put in your test fixtures.
Where in the source can I verify that Assrt does not require a vendor cloud?
Two places. First, src/core/scenario-store.ts line 14 reads `const CENTRAL_API_URL = process.env.ASSRT_API_URL || "https://app.assrt.ai";` so you can point the scenario sync at any URL or omit the sync entirely by skipping the cloud step. Second, src/core/browser.ts around lines 307 to 309 implements the `--isolated` flag, which spawns Playwright with an in-memory user data directory: no cookies, history, or PHI persisted to ~/.assrt/browser-profile between runs. Both are visible in the @m13v/assrt npm package and the GitHub repo at github.com/assrt-ai/assrt-mcp.
Can the AI agent see PHI during a test run?
Yes, in the same sense any browser-driving tool sees what is on screen during a run. That is why this guide recommends running tests against a staging environment with synthetic patient data, never against production with real PHI. If you must run against production, the safer pattern is to skip AI plan generation, hand-write the Playwright assertions, and use Assrt only as a runner with the LLM disabled. The architectural point is that you control which environment, which model, and whether the model is called at all. There is no vendor cloud in the path you cannot disable.
What synthetic data should we use for medical app tests?
Two pragmatic sources. The Synthea synthetic patient generator (https://github.com/synthetichealth/synthea) produces realistic FHIR-formatted patient records with no real PHI, which is the standard for healthcare software demos and test fixtures. For lighter weight needs, faker-style libraries can generate names, addresses, dates of birth, and phone numbers that look like patients but are not. Seed your staging database from one of those, point Assrt at staging, and the tests never touch real PHI.
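For the lighter-weight, faker-style option, a seeded generator is enough that fixtures are reproducible across runs and reviewers can see at a glance that no field is real. A minimal sketch — the field set and the SYN- MRN format (borrowed from the worked example above) are illustrative, not a FHIR implementation:

```typescript
// Minimal sketch of a seeded synthetic-patient generator.
// Every field derives from a deterministic PRNG, so the same
// seed always yields the same patient and nothing is real PHI.
type SyntheticPatient = {
  mrn: string;
  name: string;
  birthDate: string;
  heartRate: number;
};

// mulberry32: a tiny deterministic PRNG, good enough for fixtures.
function makeRng(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (s + 0x6d2b79f5) >>> 0;
    let t = Math.imul(s ^ (s >>> 15), 1 | s);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const FIRST = ["Ada", "Grace", "Alan", "Edsger"];
const LAST = ["Hopper", "Turing", "Lovelace", "Dijkstra"];

function syntheticPatient(seed: number): SyntheticPatient {
  const rnd = makeRng(seed);
  const pick = <T>(xs: T[]): T => xs[Math.floor(rnd() * xs.length)];
  const year = 1940 + Math.floor(rnd() * 70);
  return {
    mrn: `SYN-${String(seed).padStart(6, "0")}`, // same format as the worked example
    name: `${pick(FIRST)} ${pick(LAST)}`,
    birthDate: `${year}-01-15`,
    heartRate: 55 + Math.floor(rnd() * 50), // 55..104 bpm
  };
}

// Seed staging with a reproducible cohort:
const cohort = Array.from({ length: 3 }, (_, i) => syntheticPatient(i + 142));
console.log(cohort[0].mrn); // → "SYN-000142"
```

Because generation is seed-driven, a failing test can name the exact synthetic patient it ran against, and anyone can regenerate that record locally.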
Does outputting Playwright code instead of YAML actually matter for compliance?
It matters because audit and review become possible. A Playwright .spec.ts file is plain TypeScript that goes through your normal pull request review, is committed to git with a SHA, and can be diffed by a security engineer or a regulatory reviewer the same way they review the rest of your code. A proprietary YAML format or a remote scenario stored in a vendor dashboard breaks that chain. When an FDA Software as a Medical Device audit asks how you verify your release, pointing at signed git commits with named reviewers is a much shorter conversation than explaining how a SaaS dashboard works.
What about FDA Software as a Medical Device requirements?
FDA does not require a specific testing tool. IEC 62304 and the FDA general principles of software validation expect documented test cases, traceable execution evidence, and a process for handling defects. A repo full of human-readable Playwright tests with named reviewers in git history maps cleanly to that. The harder problems with SaMD are documenting your software lifecycle and intended use, not generating tests, but having tests that live in your repo means there is no extra vendor in the validation chain to add to your design history file.
Can we run Assrt fully air-gapped?
Mostly. The agent itself needs an LLM endpoint to call. By default it uses Anthropic's API for Claude Haiku, which is a network call. If your environment cannot reach Anthropic, run a local model behind an OpenAI-compatible endpoint and set the model flag accordingly. The Assrt code itself is local, the browser is local, the scenarios live on disk. The only external dependency is whichever model provider you choose, and that choice is yours.
How is this different from just writing Playwright tests by hand?
Hand-written Playwright tests are also a fine answer for healthcare apps and have the same compliance properties as Assrt-generated ones, since the output is the same shape of file. The reason to use an AI test generator is throughput: the LLM proposes scenarios from your app's UI, drafts the assertion logic, and self-heals when an aria role moves. You read what it wrote, edit it, commit it. The tool replaces the blank-file friction, not the human review.