AI Code Verification

Learning to Trust Claude Code: E2E Tests Are the Verification Layer

The most common reason engineers bounce off Claude Code and similar tools is not the code quality. It is the exhaustion of reading every line the agent generated, looking for the one that will bite them. Line-by-line review does not scale, and it is also not how humans accept work from a contractor. Treat AI output the same way: trust the structure, verify the user-facing outcome with E2E tests. This guide is the workflow.

By the Assrt team · 9 min read

Treat AI output like a contractor's PR. You do not read every line. You run E2E tests against the user flow and check the outcome.

From a Reddit thread: 'Tried claude code. Hate it.'

1. Why Line-By-Line Review Breaks

Claude Code can produce several hundred lines of TypeScript in under a minute, across multiple files. A careful human review of that diff takes thirty to sixty minutes, during which you are context switching between files, trying to hold the blast radius in your head, and looking for the specific shape of mistake the agent tends to make. It is harder, and more tiring, than writing the code yourself would have been.

The first instinct is to blame the agent for the exhaustion. The honest read is that line-by-line review is the wrong verification primitive for code produced at this speed. It worked when a human wrote ten lines in an hour and you reviewed them in ten minutes. The ratio has inverted. The verification layer has to invert with it.

2. The Contractor PR Model

Imagine a contractor handed you a PR for a checkout flow. You do not read every line. You do three things. You glance at the file list to make sure the change is scoped where you expected. You skim the diff for anything that looks dangerous (a hardcoded secret, a touched migration, a touched auth helper). Then you exercise the feature as a user and confirm it does what the contract said.

That is the model that scales. Claude Code is the contractor. The glance at the file list and the scan for dangerous patterns both take under a minute. The exercise step, running the feature end to end as a user, is the expensive one, and it is the one that is most worth automating.
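The glance-and-scan step is mechanical enough to script. A minimal sketch, assuming a diff already parsed into file paths and added lines; the names (scanDiff, RISKY_PATHS, RISKY_CONTENT) and the patterns themselves are illustrative assumptions, not a vetted danger list:

```typescript
// Hypothetical "glance and scan" helper. The path and content
// patterns below are examples only; tune them to your codebase.
const RISKY_PATHS = [/migrations\//, /auth/, /billing/];
const RISKY_CONTENT = [
  /-----BEGIN (RSA )?PRIVATE KEY-----/, // committed private key
  /(api[_-]?key|secret)\s*[:=]\s*['"][A-Za-z0-9]{16,}/i, // hardcoded credential
];

interface DiffFile {
  path: string;
  addedLines: string[];
}

// Returns human-readable flags; an empty array means
// "nothing dangerous spotted, proceed to the E2E step".
function scanDiff(files: DiffFile[]): string[] {
  const flags: string[] = [];
  for (const f of files) {
    if (RISKY_PATHS.some((p) => p.test(f.path))) {
      flags.push(`risky path touched: ${f.path}`);
    }
    for (const line of f.addedLines) {
      if (RISKY_CONTENT.some((p) => p.test(line))) {
        flags.push(`suspicious content in ${f.path}`);
      }
    }
  }
  return flags;
}
```

Anything this scan flags gets human eyes before merge; everything else moves straight to the exercise step.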

3. What To Verify With E2E, And What Not To

E2E tests are the right tool for user-facing outcomes. They are the wrong tool for a handful of things that tend to bite teams that try to make E2E cover everything.

  • Good fit. User flows (login, CRUD, checkout). API round trips that have a visible effect. Third-party integrations via their sandbox. Regression checks on bugs that users reported.
  • Poor fit. Low-level algorithmic correctness. Performance regressions. Security properties like constant-time comparisons. Things whose failure does not produce an observable UI state.

For the poor-fit cases, keep unit tests, load tests, or specialized scanners. The question is not "how do I replace everything with E2E," it is "how do I use E2E to replace the specific verification task that line-by-line review was failing at."

Run a user-flow verification in one command

Describe the flow in plain English. Assrt drives a real browser, records the spec, and gives you a pass or fail.

Get Started

4. Handling Flaky Tests Without Losing the Signal

The fastest way to kill an E2E suite is to let a flaky spec live in the main pipeline. Teams see three reds in a row that turn out to be unrelated, stop trusting the suite, and start merging on red. Once that norm is established, the suite is functionally gone. Four rules, in order, protect the signal.

  • Isolate test data per run. Never share a record, account, or email across two runs. Flakes caused by data collisions look like logic bugs and burn hours.
  • Never sleep. Always wait for a specific observable state. sleep(2000) is the single largest source of flakes in the wild.
  • Retry at the scenario level, not the step level. Step-level retries hide the bug whose symptom is intermittent UI state. Scenario-level retries confirm reproducibility.
  • Quarantine fast. If a spec goes red twice in a day without a related diff, pull it from the blocking suite within twenty-four hours, file a bug, and repair it before returning it. Do not let it fester.
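Rules one and four lend themselves to small helpers. A sketch, with hypothetical names (uniqueEmail, shouldQuarantine) that are not part of any real framework API:

```typescript
// Rule 1: every run mints its own data; no two calls ever
// produce the same email, even within the same millisecond.
let counter = 0;
function uniqueEmail(runId: string): string {
  counter += 1;
  return `e2e+${runId}-${Date.now()}-${counter}@example.test`;
}

// Rule 4: quarantine a spec that goes red twice in one day
// with no related code change to explain the failure.
interface Failure {
  at: number;           // epoch ms of the red run
  relatedDiff: boolean; // did a code change plausibly cause it?
}

function shouldQuarantine(failures: Failure[], now: number): boolean {
  const DAY = 24 * 60 * 60 * 1000;
  const unexplained = failures.filter(
    (f) => !f.relatedDiff && now - f.at < DAY
  );
  return unexplained.length >= 2;
}
```

The counter plus timestamp makes data collisions across parallel runs effectively impossible, which keeps rule one cheap to follow.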

Self-healing selectors eliminate the largest mechanical source of flakes (DOM drift) by rebinding to semantic anchors automatically. The four rules above address the remaining causes: data, timing, and retry strategy.

5. Tiered Trust

Not every change deserves the same verification budget. Three tiers hold up.

tier 1  UI copy, layout, internal tools
        -> E2E only

tier 2  business logic, new features, CRUD
        -> E2E plus static diff scan

tier 3  auth, billing, infra, migrations
        -> E2E plus human line-by-line review

The point of tiering is that you reclaim time on the ninety percent of changes that are tier one, and you concentrate the deep review budget on the ten percent that actually need it. A tier flag can live in the PR template or be assigned automatically from the paths the diff touches.
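Automatic assignment from touched paths can be sketched in a few lines. The path patterns here are assumptions about a typical repo layout; adjust them to yours:

```typescript
// Hypothetical tier classifier: the highest-risk path wins.
type Tier = 1 | 2 | 3;

const TIER3_PATHS = [/auth/, /billing/, /infra/, /migrations/];
const TIER2_PATHS = [/api\//, /services\//, /models\//];

function tierFor(paths: string[]): Tier {
  if (paths.some((p) => TIER3_PATHS.some((re) => re.test(p)))) return 3;
  if (paths.some((p) => TIER2_PATHS.some((re) => re.test(p)))) return 2;
  return 1; // UI copy, layout, internal tools
}
```

A CI step can run this over the PR's file list and post the tier as a label, so the verification budget is decided before anyone opens the diff.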

6. The Complete Trust-But-Verify Workflow

End to end, for a typical tier one or tier two change:

  • Prompt the agent with the feature intent and any constraints. Let it write the code.
  • Ship to a preview deploy. Vercel, Fly, and Render all do this in under a minute.
  • Run the existing E2E smoke suite against the preview URL. A tool like Assrt can self-heal selectors if the UI shifted.
  • If the feature is new, ask an AI browser agent to exercise it in plain English, then store the resulting spec as a new regression test.
  • Glance at the file list. Flag tier three paths (auth, billing) for human review before merge.
  • Merge. Let a thirty minute synthetic of the production URL catch anything that slipped.
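The merge decision at the end of the workflow can be stated as one function. A sketch under the tiering rules above; the name gate and the verdict strings are hypothetical, not a real CI API:

```typescript
// Hypothetical merge gate combining tier, E2E result, and diff scan.
type Verdict = "merge" | "needs-human-review" | "blocked";

function gate(
  tier: 1 | 2 | 3,
  e2ePassed: boolean,
  diffScanClean: boolean
): Verdict {
  if (!e2ePassed) return "blocked";            // every tier requires green E2E
  if (tier === 3) return "needs-human-review"; // auth, billing, infra always get eyes
  if (tier === 2 && !diffScanClean) return "needs-human-review";
  return "merge";
}
```

Encoding the gate this way makes the trust policy reviewable itself: changing what earns an automatic merge is a one-line diff rather than a norm that drifts.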

The total human time per feature is under five minutes on tier one, under twenty on tier two, and the deep review is reserved for tier three. The fatigue that drives people to say they hate Claude Code is gone because line-by-line review is gone. Trust is earned by the test layer, not by a human eye on every token.

Frequently Asked Questions

Why not just read the AI's code?

At prototype scale you can. At production scale, the agent generates hundreds of lines across many files per session. Reading every line is slower than writing it yourself, which defeats the purpose. The scale that makes AI coding worthwhile is the scale at which line-by-line review stops working. E2E tests that verify the user-facing outcome scale in the way review cannot.

What does 'treat it like a contractor PR' mean in practice?

Three rules. One, you trust the contractor to pick reasonable variable names and structure; you do not police style. Two, you verify the deliverable works by exercising the feature, not by reading the source. Three, you keep a clean interface contract so that if the contractor's code is replaced next month, the tests still pass. Apply the same rules to Claude Code output and the friction drops.

Isn't this dangerous if the AI wrote security bugs?

Security review is a separate concern and should not be skipped. But security review is also a different skill from line-by-line comprehension. Use a specialized static analyzer (Semgrep, CodeQL) or a security-focused agent on the diff, and a separate E2E suite for functional correctness. Each layer catches a different class of issue. Reading every line catches very few of either class.

What if the AI writes code that passes tests but still does the wrong thing?

That is the case E2E coverage catches and unit tests miss. An E2E test describes the user intent: 'a logged in user can create an invoice, email it to a customer, and see it in the sent folder.' The only way to make that pass is to actually deliver that outcome. The AI cannot shortcut this the way it can shortcut a mocked unit test.

How do you handle flaky E2E tests?

Flakes are the reason most teams abandon E2E suites. Four rules hold up: first, isolate test data per run; second, never sleep, always wait for a specific observable state; third, retry at the scenario level, not the step level, because step-level retries hide bugs; fourth, quarantine a flaky spec within a day, do not let it live in the main suite. Self-healing selectors solve the largest class of flakes (DOM drift) automatically.

How much of the trust can E2E actually earn back?

Enough to stop reading diffs on routine features. Not enough to skip review on changes that touch auth, billing, or infrastructure. The mental model is proportional trust: tier one changes (UI copy, internal tools) get E2E only, tier two (business logic) get E2E plus a diff scan, tier three (security, money) get E2E plus human line-by-line. The tiering lets you reclaim time on the ninety percent that is tier one.

Where does Assrt fit and how does it compare to Playwright alone?

Assrt is an AI browser agent that generates Playwright specs from natural language and runs them against a live URL with self-healing. Playwright alone works; the difference is authoring speed and selector drift. For a small suite that you maintain by hand, Playwright alone is fine. For a suite that grows with every AI-authored PR, the generation speed is the thing that keeps the suite aligned with what the code actually ships.

Verify AI-Generated Code With Real Browser Runs

Assrt drives a real browser against your preview URL, records regression specs, and self-heals when the DOM drifts.

$ npx @m13v/assrt

Assrt: open-source AI testing framework
© 2026 Assrt. MIT License.
