AI pen testing for application-layer flaws
Almost every guide on this topic is about network scanners or LLM prompt injection. The harder, more useful slice of work sits between those two, and it is the part fast-shipping web apps regress on every week: broken access control, OTP and password-reset abuse, session hygiene, and the “closed in the UI but still open in the API” class of bug. This guide walks through how to express those checks as plain-English browser scenarios and run them on a free, MIT-licensed agent loop.
Every file path, line number, and CLI flag below is from the open-source Assrt reference implementation. You can read along on disk; nothing here is a brochure claim.
The slice the existing playbooks skip
Type this topic into any general source and you will get back two clusters of writing. The first is about classical scanners pointed at your network: well-known payload sets, header audits, deprecated-TLS sweeps, and CVE matchers. The second, newer cluster is about LLM application security: prompt injection, system-prompt extraction, retrieval poisoning. Both clusters are useful, both have shipping vendors, and neither covers the work that fast-moving product teams actually need to do every week.
That work is application-layer. It is the bugs that the OWASP Top 10 calls A01 (broken access control), A04 (insecure design), and A07 (identification and authentication failures): the stuff a payload set cannot find because the bug is not a payload, it is a missing check on the server. Every time a team adds a new feature, they add a new resource and a new permission, and somewhere a route forgets to call the authorisation function. A scanner will not notice. A user from a different account will.
The pattern this guide is about is using an LLM agent as that “different user”. The agent reads a Markdown plan in plain English, drives a real browser via the Playwright MCP server, and verifies that user B genuinely cannot see what user A just created. The scenario is human-readable, the report is JSON, the entire stack is open source, and the cost is a few cents per run.
Picture pen-testing work as a strip running from network scanning at one end to LLM prompt-injection testing at the other: eight categories of check sit along it. The agent loop is appropriate for at least six of them, and an honest pen-testing programme covers all eight with a mix of tools. This page is about how the agent slice works.
The anchor: the same eighteen tools, different intent
The single most important thing to understand is that nothing about this approach is a separate “security mode”. The exact same agent, with the exact same eighteen tool schemas defined at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 16-196, drives both a normal feature test and an authorisation test. The difference is which sentences you put in the plan.
Three of those tools matter disproportionately for application-layer security work, and they are the anchor for this whole guide:
- create_temp_email at agent.ts line 115, plus wait_for_verification_code at line 120. Disposable inbox per scenario, polling for the OTP. Lets you write rate-limit and code-replay tests without owning a mail server and without burning a real address per check.
- http_request at agent.ts line 172. Raw HTTP from inside the browser session, with the current cookies attached if you ask evaluate to forward them. The single tool that closes the "UI fix did not actually fix the API" gap.
- --extension mode at /Users/matthewdi/assrt-mcp/src/core/browser.ts lines 299-306. Attaches the agent to your already-running Chrome instead of spawning a clean profile. The only sane way to test against an SSO-protected app without re-implementing your identity provider in fixtures.
“The same eighteen agent tool schemas drive both a feature test and an authorisation test. Only the plan changes.”
agent.ts lines 16-196
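In a plan, those three tools surface as ordinary sentences. A hypothetical fragment (the staging URL and the exact phrasing are invented for illustration; only the tool names come from the schemas above):

```markdown
#Case Signup OTP cannot be replayed
- create_temp_email and sign up at https://staging.example.com/signup with it
- wait_for_verification_code, enter the code, and confirm login succeeds
- Log out, then submit the same code again on a fresh signup form
- assert: the reused code is rejected
```

The agent decides at runtime what "the signup form" looks like in the live accessibility tree; the plan only states intent.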
Where the agent fits in your stack
At the level of the box-and-arrow diagram, the agent is a thin layer between a Markdown file and the Playwright MCP server. The plan goes in, the JSON report comes out, and along the way the agent is allowed to talk to a disposable mailbox and to your own backend.
A security #Case as it flows through the loop
The exact CLI flags that matter for security runs
The launch args at browser.ts line 296 set the shape of every run. Three of them are non-obvious for security work and worth pulling out.
- --viewport-size 1600x900 stabilises the accessibility tree across runs. A tree whose layout shifts because the viewport changed will give the same element a different label, and a cross-user check that asserts on a label will get a flaky result. Pin it.
- --isolated swaps the persistent profile (the default) for an in-memory one. Use this when you want a guaranteed-empty cookie jar on every run, which is the right setting for password-reset and signup-rate-limit checks where you do not want session leakage between runs. Documented at browser.ts lines 307-309.
- --extension attaches to your real Chrome (lines 299-306) and reads the token from ~/.assrt/extension-token after the first approval. This is the mode for any check that needs a real, fully-warm SSO session. Run it from your own laptop against staging so the agent inherits the identity provider state your real users have.
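Composed into a single launch, an invocation might look like the sketch below. The command name and plan path are assumptions pieced together from the rest of this page; the flags are the ones documented above:

```shell
# Hypothetical invocation: clean cookie jar plus pinned viewport for a
# password-reset check. Swap --isolated for --extension to inherit your
# real Chrome's SSO session instead.
assrt --isolated --viewport-size 1600x900 --json plans/security/password-reset.md
```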
A real broken-access-control plan
Here is what an A01 check looks like as a single Markdown file. The parser at agent.ts lines 620-631 splits on #Case, #Scenario, or #Test headers, and runs each block in the same browser session.
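The plan on disk will use your app's own URLs and names; the sketch below is illustrative, with a hypothetical staging host and accounts, but it follows the shape the parser expects:

```markdown
#Case User A creates a private project
- Go to https://staging.example.com and log in as user-a@example.com
- Create a new project named "secret-a" and record its URL and ID

#Case User B cannot read it
- Log out, then log in as user-b@example.com
- Navigate to user A's project URL
- assert: the page shows an access error, not the project
- http_request: GET /api/projects/{id} with the current session
- assert: the status is 403 or 404, and the body contains nothing from "secret-a"
```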
The interesting line is the http_request assertion at the end. The browser-only check is necessary but not sufficient: you need to confirm that the API also says no, because attackers do not navigate, they curl. Doing both in one scenario, with the same session cookie, in one file, is the part most other approaches miss.
One step, four actors
For the cross-user check above, here is what one step looks like at the wire-protocol level. The agent is not talking to the page; the Playwright MCP server is. The agent is reading tool results and emitting tool calls.
A cross-user GET, traced end to end
An OTP rate-limit run, on the terminal
The verbose CLI output for a rate-limit scenario is a useful sanity check on the loop's shape. One signup attempt per step, an assert at each, and a final pass once the limiter actually fires.
Notice the cost line at the bottom. A real OTP rate-limit run takes about twenty seconds and a few cents of model tokens. Running this every CI build is well below noise.
Why this beats a SaaS scanner for application-layer work
| Feature | Closed-source AI scanner SaaS | Open agent loop with #Case files |
|---|---|---|
| What it is good at | Pattern fuzzing: SQLi, XSS, header misconfig, deprecated TLS | Intent checks: 'user A must NOT see user B's project' |
| Authoring surface | HTTP request templates, macros, attack payloads | English sentences in a Markdown file, one #Case per check |
| Per-step targeting | Locator strings or request signatures | Live accessibility tree, [ref=eN] resolved per snapshot |
| Source of auth state | Saved session token or recorded login macro | Real Chrome via --extension, or per-run disposable email |
| Diff in pull request | Binary project file, hard to review | Markdown plan, line-by-line diffable |
| What runs in CI | Scanner CLI + saved session + report | Node + plan.md + JSON report, fail build on .passed === false |
| Cost shape | Per-seat license or per-scan SaaS, often $7.5K/mo+ | MIT license + LLM tokens (a few cents per scenario) |
| Cloud dependency | Hosted dashboard, scans run in vendor cloud | Local Node process; the only network call is the model API |
Six categories of check this approach does well
Below is the working list of check categories that fit the agent loop cleanly. Add them to your repo one at a time, version them next to the feature code that introduced the resource they protect.
Where to start
Broken access control
Two scenarios per check, one logged in as user A, one as user B. Read with the UI, then re-read with http_request to confirm the API agrees. The single most under-tested OWASP category in fast-shipping web apps.
Authentication and session hygiene
OTP rate limits, magic-link replay, password-reset enumeration. create_temp_email gives you fresh inboxes per scenario; assertions on response text and timing catch enumeration bugs that scanners miss.
Authorisation surface drift
Every new feature ships a new resource type. Diff your plans/security/*.md folder against last week's; if you added a route but did not add a #Case for it, that is the regression you want to catch in review.
Sensitive data in client state
Use evaluate to dump localStorage, sessionStorage, and IndexedDB in a #Case after login; assert no JWT secret, refresh token, or PII shows up where the documentation says it should not be.
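A minimal fragment of such a check, with the storage keys named purely as examples of what to hunt for:

```markdown
#Case Nothing sensitive in client storage after login
- Log in as the standard test user
- evaluate: JSON.stringify({ local: { ...localStorage }, session: { ...sessionStorage } })
- assert: the dump contains no refresh token, no raw JWT signing material,
  and no email address other than the test user's
```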
Webhook and integration trust
After the UI says 'webhook configured', use http_request to fire a forged signature against your own endpoint and assert it returns 4xx. The most common third-party integration bug is accepting unsigned payloads.
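As a plan fragment (the endpoint and the signature header name are invented for illustration; use whatever your integration actually checks):

```markdown
#Case Forged webhook signature is rejected
- Configure the webhook in the UI and record the endpoint URL
- http_request: POST that URL with a plausible payload and an invalid X-Signature header
- assert: the response status is 4xx, not 2xx
```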
Logout, tab, and CSRF assumptions
Open two tabs (browser_tabs in the underlying Playwright MCP surface), log out in one, assert the other can no longer perform privileged actions. The check most apps assume passes and almost none verify.
Five ideas worth keeping
create_temp_email turns OTP into a checkable scenario
agent.ts line 115. Spins up a disposable inbox per #Case. wait_for_verification_code (line 120) polls it. That is enough to write rate-limit, replay, and cross-account OTP tests without owning email infra or hard-coding fixed addresses.
--extension uses your real Chrome session
browser.ts lines 299-306. Attaches the agent to an already-running Chrome with SSO, 2FA, and password manager state intact. The only sane way to test 'user A logged in via Okta, must not see user B's resource' without sharing real credentials.
http_request closes the UI-fix-doesn't-fix-API gap
agent.ts line 172. Lets a #Case probe the backend directly from inside the browser session. Hiding the Delete button is meaningless if DELETE /api/projects/{id} still returns 200; this is how you assert both layers.
Scenarios share state, so cross-user checks are one file
Browser state carries between #Case blocks (system prompt lines 238-241). #Case 1 logs in as user A and creates a resource; #Case 2 logs out, logs in as user B, tries to read it, asserts a 403. One plan, no fixtures.
Reports are plain JSON, so CI gates are a one-liner
writeResultsFile at scenario-files.ts lines 77-84 dumps TestReport to /tmp/assrt/results/<runId>.json. Pipe through jq, exit non-zero on any .scenarios[].passed === false. No proprietary format, no rate limit, no vendor lock-in.
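The one-liner, expanded just enough to read. The sample report below is fabricated to show the shape; in a real build you would point jq at /tmp/assrt/results/latest.json instead:

```shell
# Fabricated report with one failing scenario, matching the TestReport shape
# described above (scenarios[], each with a name and a passed boolean).
cat > /tmp/demo-report.json <<'EOF'
{"scenarios":[{"name":"cross-user read","passed":true},
              {"name":"otp rate limit","passed":false}]}
EOF

# Count failed scenarios; a CI step would `exit 1` when the count is non-zero.
fails=$(jq '[.scenarios[] | select(.passed == false)] | length' /tmp/demo-report.json)
echo "failed scenarios: $fails"
```

The same filter works unchanged against the UUID-keyed historical files, so a nightly job can sweep a whole directory of runs.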
The numbers, ballparked from a real run
These are the orders of magnitude for a five-scenario plan with an average of about twenty steps each, run locally on a dev MacBook against a staging deployment.
Twenty steps of snapshot-act-assert on Claude Haiku 4.5, April 2026 rates. Ten such scenarios per build is well under a dollar.
Eight tool calls end to end. Fast enough to run on every PR, slow enough to actually exercise the limiter.
Where this approach is the wrong tool
Three places to be honest about. First, network and infrastructure security: kernel hardening, OS patching, container image scanning, supply-chain auditing. None of that is in scope for a browser-driving agent; use a dedicated scanner and keep doing whatever you already do for those layers. Second, payload fuzzing on a known surface: if you want to test ten thousand SQL injection variants against a single endpoint, ZAP and Burp will do it cheaper and faster than an LLM. Third, deep social-engineering and red-team work: that is a human exercise with a scope document, not a CI job.
What this approach is for is the slice in the middle: the application-layer regressions that ship every week as your feature surface grows, that no scanner can verify because the bug is a missing check rather than a malformed input, and that an external firm only sees once a year. Writing those as plain-English #Case files and running them every CI build is the win.
Stand up the security #Case folder for your repo
30 minutes. We will walk through MCP launch flags, write a first cross-user check against your own staging app, and wire the JSON report into your CI gate.
Frequently asked questions
Does this replace a real pen test from a security firm?
No, and the framing matters. A scoped engagement against your infra (network scanning, kernel and OS hardening, supply-chain auditing, social engineering) is a different exercise; an external firm is still the right answer for the once-a-year deep dive. What this approach replaces is the gap in between: the application-layer regressions that show up week to week as you ship features. OWASP A01 broken access control, A07 identification and authentication failures, and A04 insecure design issues like OTP rate limits and password reset replay are exactly what a per-build agent loop catches and a once-a-year engagement misses by months. Read the OWASP Top 10 2021 list and notice how many entries describe behaviour you can only verify by being a logged-in user inside the app; that is the slice this approach is for.
Why an LLM agent instead of a traditional scanner like ZAP, Burp, or a SaaS like XBOW?
Traditional scanners excel at known-pattern fuzzing: SQL injection probes, well-known XSS payloads, header misconfiguration, deprecated TLS. They do not excel at intent-driven checks like "a user without project membership must NOT be able to GET /api/projects/{id}/runs". To answer that you need to log in as user A, navigate or call as user A, then log out and try again as user B, and assert the second response is 401 or 403. That logic is trivial to write as five English sentences and very hard to write as a payload set. The agent loop reads those English sentences, reads the live accessibility tree of your app, and drives the browser. Look at /Users/matthewdi/assrt-mcp/src/core/agent.ts at lines 16-196: every tool the agent can call is listed there, including http_request (line 172) for raw API probing and evaluate (line 106) for arbitrary in-page JavaScript.
What is the exact pattern for testing broken access control as a #Case?
Two scenarios in one plan, run sequentially in the same browser session. #Case 1 logs in as user A and creates a private resource, recording the URL and ID. #Case 2 logs out, logs in as user B, and tries to access user A's resource by URL and via http_request to the API. The trick is that browser state carries over between scenarios, which is documented as 'Scenario Continuity' in the system prompt at /Users/matthewdi/assrt-mcp/src/core/agent.ts lines 238-241. The plan files live at /tmp/assrt/scenario.md (see scenario-files.ts line 17), so they are easy to keep next to the rest of your repo.
How does this handle authentication that is more complicated than a password?
Three layers of help, all built-in. For OTP and magic-link logins, create_temp_email at /Users/matthewdi/assrt-mcp/src/core/agent.ts line 115 spins up a disposable inbox and wait_for_verification_code at line 120 polls for the message; that is enough to test signup rate limits and code replay without owning email infrastructure. For passwords kept in your real browser, the --extension flag at /Users/matthewdi/assrt-mcp/src/core/browser.ts lines 299-306 attaches the agent to your already-running Chrome, so SSO, 2FA, and password manager autofill all work as the user already configured them. For the split-input OTP pattern (six single-digit fields), the system prompt at lines 233-236 hard-codes a clipboard-paste expression so the agent never has to type one digit per field and lose focus.
Can I probe an endpoint at the API layer from inside a #Case, or only the UI?
Both. The http_request tool at /Users/matthewdi/assrt-mcp/src/core/agent.ts line 172 takes url, method, headers, and body. After the agent has logged in via the UI, the auth cookie is in the Playwright-controlled browser, but evaluate (line 106) can read document.cookie or call fetch() from within the page. That gives you a real session-bound API call without exporting cookies to a separate tool. This is the pattern for verifying that closing a UI access path actually closes the corresponding API path: the front-end fix that hides the Delete button is meaningless if DELETE /api/projects/{id} still returns 200 for an unauthorised user.
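One way that plays out in a plan, with the API path and ID placeholder hypothetical:

```markdown
#Case API path is closed, not just the button
- Log in as a user without membership in the target project
- assert: no Delete button is visible on the project page
- evaluate: fetch('/api/projects/TARGET_ID', { method: 'DELETE' }).then(r => r.status)
- assert: the status is 403 or 404, never 200
```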
What does an OTP rate-limit test actually look like?
One scenario that asks for a code, then asks again, then again, all within a few seconds. Each request goes through the live signup form so the rate limiter sees the same client fingerprint a real attacker would generate. assert lines (agent tool at line 132) record after each request whether the response was "code sent" or "too many requests". A pass is the second or third attempt being rate-limited. A fail is the tenth attempt still going through. The whole scenario is roughly fifteen lines of Markdown; the agent figures out the snapshot-act loop on its own. The same pattern catches password-reset enumeration when the response time differs between known and unknown email addresses.
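Sketched as Markdown (the URL is a placeholder, and the assertion texts should match whatever your app actually returns):

```markdown
#Case Signup OTP is rate-limited
- Go to https://staging.example.com/signup
- create_temp_email and request a verification code for it
- assert: the response says a code was sent
- Request a code again immediately, and then a third time
- assert: by the third attempt the response says too many requests
- wait_for_verification_code
- assert: only the permitted number of codes actually arrived in the inbox
```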
Where do the test results go, and how do I block a CI build on a failure?
Two JSON files written by writeResultsFile at /Users/matthewdi/assrt-mcp/src/core/scenario-files.ts lines 77-84: /tmp/assrt/results/latest.json (overwritten each run) and /tmp/assrt/results/<runId>.json (UUID-keyed historical). The shape is a TestReport (types.ts lines 28-35) wrapping a ScenarioResult[] (lines 19-26). Each scenario carries name, passed (boolean), assertions[], and steps[]. In CI, run assrt with --json and pipe into jq; if any .scenarios[].passed is false, exit non-zero. There is no proprietary report format, no cloud you have to sign into, and no rate limit on how often you can run.
Is this safe to point at production?
Same answer as for any other functional check: it depends on what your scenarios do. The agent only does what the plan tells it to do. A scenario that creates and deletes a test project is fine in production with a dedicated test account. A scenario that fuzzes ten thousand login attempts is not, and the rate-limit scenario above is the better shape for that question anyway (you want the rate limiter to fire). For destructive checks (mass-delete authorisation, billing-edge cases, real-user data), point it at staging with a synthetic dataset. The point of writing these as plain-English #Case files is that the diff is human-readable and you can decide per-environment what runs.
How is this different from running Burp or ZAP with a saved session?
Burp and ZAP both expect you to bring your own request templates and macros. Their authoring surface is HTTP requests, not English sentences. They are excellent at the pattern-fuzzing slice of pen testing and they remain the right tool for that. The agent-loop approach is for the slice that requires per-step reasoning: read the page, decide what 'create a new project' looks like in this UI today, do that, then verify a different user cannot see it. The artifact in your repo is also different: a Burp project file is binary; a #Case plan is Markdown you can review in a pull request. For most application-layer regressions, the second one is the artifact you want.
Which model runs the loop, and what does a security check cost?
The default is Claude Haiku 4.5, pinned as claude-haiku-4-5-20251001 at /Users/matthewdi/assrt-mcp/src/core/agent.ts line 9. A typical security #Case is twenty to thirty steps (login as user A, do thing, log out, login as user B, try thing, assert), where each step is one snapshot plus one action plus one short reasoning turn. At Haiku April 2026 rates that lands on the order of a few cents per scenario; running ten such checks every CI build is well under a dollar per build. Provider is pluggable: the Gemini path lives at lines 342-367 with default gemini-3.1-pro-preview, so you can run the same scenarios on a different model if you want a second opinion.
What about prompt-injection and AI-specific security testing?
That is a different topic and a real one, but it is not what this page is about. If your app embeds an LLM (chatbots, agentic features, retrieval-augmented generation), you also want a separate set of #Case files that send hostile inputs (system-prompt extraction, tool-call hijacking, output-poisoning) and verify the model refuses or the application sanitises. The agent loop here is just a runner; the scenario you write decides whether it is exercising an authorisation gate or a guardrail. Both are valid uses, both produce the same TestReport shape, both can fail a CI build the same way.
Is the whole thing actually open source?
Yes. The MCP server lives at github.com/m13v/assrt-mcp under the MIT license. The Playwright MCP package it spawns (@playwright/mcp) is Microsoft's, also open source. There is an optional hosted runner at app.assrt.ai for sharing results in a browser, but the local CLI produces identical reports without it. There is no proprietary YAML, no closed-source rule engine, no monthly per-seat fee. The npm package's dependencies are listed in plain view: @anthropic-ai/sdk, @google/genai, @modelcontextprotocol/sdk, @playwright/mcp.