AI Config vs CI: Fixing Test Command Drift as a Compilation Problem
We audited Grafana, Django, Vue, and Prisma. Forty-six percent had AI agent configs (AGENTS.md, CLAUDE.md, .cursorrules) that contradicted their own CI workflows. The most common failure was the test command: the config says pytest, CI actually runs pytest with a coverage threshold and strict markers the agent never sees. This guide is about treating that drift as a compiler problem, not a docs problem.
“46% of audited repos had AI configs that contradicted their own CI. The most common drift was on the test command itself.”
— assrt.ai audit of Grafana, Django, Vue, and Prisma
1. What AI Config Drift Actually Looks Like
A real example from the Django audit. The AGENTS.md file, last edited fourteen months ago, tells the agent to run tests with a simple command. The CI workflow runs something different: the same runner, but with three flags that materially change the outcome.
```
# AGENTS.md says:
pytest tests/

# .github/workflows/ci.yml actually runs:
pytest tests/ --cov=src --cov-fail-under=85 \
  --strict-markers --maxfail=1 -p no:cacheprovider
```
An agent reading only AGENTS.md writes a patch that passes locally with no coverage drop, no marker usage, and a cached plugin. CI rejects it on the strict-markers flag because the agent introduced a new test with an unregistered marker. The error message is buried in pytest output, the agent does not know to read it, and the PR stalls.
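For context, `--strict-markers` turns any unregistered marker into a collection error rather than a warning. The fix is a one-line registration in pytest configuration; a minimal sketch, assuming the agent's new test used a hypothetical `slow` marker:

```ini
# pytest.ini -- register custom markers so --strict-markers passes
[pytest]
markers =
    slow: marks tests as slow (deselect with -m "not slow")
```

An agent that only ever saw `pytest tests/` has no reason to know this file, or this flag, exists.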
2. Why Test Commands Are the First to Drift
Three reasons, in order of how often we saw them in the audit.
- Test runners accrete flags faster than any other command. A coverage threshold gets added, then strict markers, then a maxfail, then a custom reporter. Each change lands in CI YAML without touching the docs.
- Test commands often wrap preflight steps (prisma generate, go generate, protoc) that live in a reusable workflow or composite action. The config author linked to the top-level job, not the one that actually runs the tests.
- Matrix builds multiply the command. The config picks one row of the matrix, freezes its flags, and the other rows silently diverge. Node 18 uses npm test, Node 20 uses npm run test:ci, nobody notices until a Node 20 only failure hits production.
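A hedged sketch of that matrix case, with made-up job details:

```yaml
# Hypothetical workflow excerpt: the config froze the Node 18 row,
# and the Node 20 row quietly drifted to a different script.
strategy:
  matrix:
    include:
      - node: 18
        test-cmd: npm test
      - node: 20
        test-cmd: npm run test:ci
steps:
  - run: ${{ matrix.test-cmd }}
```

A config that documents `npm test` is correct for exactly one row of this matrix.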
None of this is malicious or even careless. It is the natural entropy of two documents that nobody has promoted to a compiled output.
3. Treating Configs as Compiled Artifacts
The shift in framing is small but load-bearing. Stop treating AGENTS.md as a document. Treat it as the output of a compiler whose input is the CI workflow. The compiler has three jobs:
- Read the workflow YAML, resolve includes and reusable actions, and extract the canonical invocation for each command class (test, lint, build, typecheck).
- Emit a flat markdown file with those commands verbatim, in the exact form CI runs them.
- Fail the build if the committed markdown does not match the generated output, the same way a stale generated client fails.
The agent now gets one thing to read, that one thing is guaranteed fresh, and any human who tries to paper over a CI change by editing the markdown directly is caught by the next CI run.
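A minimal sketch of such a compiler. The extraction here is deliberately simplified: a real one would resolve reusable workflows and matrices, and the `run:` regex and substring-based command classification are assumptions, not a spec.

```python
import re

COMMAND_CLASSES = ("test", "lint", "build", "typecheck")

def extract_commands(workflow_text):
    """Pull single-line `run:` steps out of workflow YAML and bucket
    them by command class. First match per class wins."""
    commands = {}
    for cmd in re.findall(r"run:\s*(\S[^\n]*)", workflow_text):
        for cls in COMMAND_CLASSES:
            if cls in cmd and cls not in commands:
                commands[cls] = cmd.strip()
    return commands

def emit_markdown(commands):
    """Emit a flat markdown body, commands verbatim, one section per class."""
    lines = ["# Commands (generated from CI -- do not edit by hand)", ""]
    for cls in sorted(commands):
        lines += [f"## {cls}", "", f"    {commands[cls]}", ""]
    return "\n".join(lines)

def check_fresh(workflow_text, committed_text):
    """Compiler invariant: the committed file must match the generated one."""
    return committed_text == emit_markdown(extract_commands(workflow_text))
```

Wired into CI, `check_fresh` returning False fails the build, which is exactly the stale-generated-client behavior the framing asks for.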
Verify the output, not just the command
Once the config is compiled, run Assrt against a merge preview to catch the drift the YAML cannot see.
Get Started →

4. A Thirty-Line Divergence Detector
You do not need a build-system overhaul to start. A small Node or Python script walks the workflow files, collects every line that looks like a test invocation, normalizes whitespace, and diffs it against the invocations grep finds in AGENTS.md and README.md. Anything that does not match prints a diff and exits non-zero.
```python
# Pseudocode, pipeline step.
# extract_run_steps, extract_code_blocks, and normalize are assumed
# helpers: pull `run:` lines from workflow YAML, pull fenced commands
# from the docs, and collapse whitespace.
import sys

ci_cmds = extract_run_steps(".github/workflows/*.yml")
doc_cmds = extract_code_blocks("AGENTS.md", "README.md")

drift = []
for c in ci_cmds:
    if normalize(c) not in map(normalize, doc_cmds):
        drift.append(c)

if drift:
    print("Config drift detected:")
    for d in drift:
        print("  ", d)
    sys.exit(1)
```

Run it in a pre-commit hook and as a CI job. It does not catch every failure mode (environment variables, secret scopes, matrix-only flags), but it catches the common case in under a second, and it catches it at the moment the drift is introduced rather than the moment an agent run fails.
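Wiring it into CI is one job. A hedged GitHub Actions sketch, assuming the script lives at a hypothetical `scripts/check_drift.py`:

```yaml
# Hypothetical CI job; the script path is an assumption.
drift-check:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: python scripts/check_drift.py
```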
5. The E2E Test Harness Wrapper
The detector is a static check. The ground truth is whether the patch the agent wrote actually ships to staging and renders the page the user sees. That is what the E2E harness covers.
A minimal harness runs three things in order, in a container that mirrors the CI image:
1. the canonical test command, verbatim, from the compiled config
2. a merge preview deploy (Vercel, Fly, Render)
3. an E2E suite that hits the preview URL with a real browser
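The three steps above can be sketched as a small wrapper. The step commands are placeholders for your repo's real ones, and the `runner` parameter is injectable so the flow can be exercised without a live deploy:

```python
import subprocess

def run_harness(steps, runner=subprocess.run):
    """Run each harness step in order; stop at the first failure."""
    for name, cmd in steps:
        result = runner(cmd, shell=True)
        if result.returncode != 0:
            print(f"harness failed at step: {name}")
            return False
    return True

# Placeholder commands -- substitute the compiled config's real ones.
STEPS = [
    ("canonical tests", "pytest tests/ --cov=src --cov-fail-under=85 --strict-markers"),
    ("preview deploy", "vercel deploy --prebuilt"),
    ("e2e suite", "npx playwright test"),
]

# In CI: sys.exit(0 if run_harness(STEPS) else 1)
```

Because step one runs the compiled command verbatim, any drift the static detector missed surfaces here, before the preview deploy wastes anyone's time.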
Step three is where tools diverge. You can hand-write Playwright specs, which works well if the surface is stable. You can use Cypress if your team already runs it. Or you can let an AI browser agent like Assrt generate specs from natural-language scenarios, which removes the selector maintenance cost every time the UI shifts. The point is not which tool you pick; it is that step three exists, and that it runs on every agent-authored PR.
6. How to Roll This Out Without a Big Bang
Three stages, two weeks each, no flag days.
- Weeks one and two. Write the divergence detector. Run it on a schedule, not as a blocker. Collect drift reports. Do nothing else.
- Weeks three and four. Promote the detector to a PR blocker. Fix the drift it reports by editing AGENTS.md, not CI. This is the stage where you learn which divergences were intentional.
- Weeks five and six. Invert the flow. AGENTS.md becomes compiled from CI. The detector becomes a compiler invariant. Add one E2E smoke test against the merge preview and wire it to the same blocker.
At the end, the agent reads one file, that file is guaranteed current, and every PR the agent opens is verified against a real browser on a real preview deploy. Forty-six percent drift becomes zero, and stays there.
Frequently Asked Questions
What is AI config drift?
Any divergence between what an AI coding agent thinks the build, lint, type check, or test command is, and what CI actually runs. The most common failure is the test command: AGENTS.md says pytest, CI runs pytest --cov --strict-markers --maxfail=1. The agent writes code that passes locally, then the pipeline rejects it because of strict-markers or coverage thresholds the agent never saw.
Why is this so common?
Two root causes. First, the configs are written once by a human and forgotten, while CI evolves every sprint. Second, AI agent docs are usually the first file a contributor writes and the last file they update. There is no compiler warning when AGENTS.md drifts from .github/workflows/ci.yml, so nobody notices until an agent run fails for a reason the logs barely explain.
How did the Grafana, Django, Vue, Prisma audit get to forty-six percent?
We diffed every test, lint, and build command referenced in each repo's AGENTS.md, CLAUDE.md, and README against the exact invocations in their CI YAML. Any token mismatch counted as drift. The most frequent mismatch was flag sets on test runners (--cov, --strict-markers, --maxfail), followed by Node version pinning, followed by preflight steps like prisma generate or go generate that the CI ran but the config omitted.
What is the compilation framing?
Treat AGENTS.md and CLAUDE.md as compiled artifacts, not hand-written docs. The source of truth is CI. A small script parses the workflow YAML, extracts the canonical test, lint, and build invocations, and emits the markdown. Run it in a pre-commit hook and in CI itself. If the committed file does not match the generated output, the build fails the same way a stale Prisma client fails.
Can you just tell the agent to read the workflow YAML directly?
Sometimes, and when it works it is great. The failure mode is that the YAML uses matrices, reusable workflows, or composite actions where the real command is three files deep. Most agents give up at the first include. Precompiling to a flat markdown file is more reliable because the agent only has to read one thing, and that thing is guaranteed fresh.
How does an E2E test harness fit into this?
The compiled config is a hypothesis. The only way to know it is correct is to run the exact command the agent would run, then verify the resulting build actually ships to staging without surprises. A harness that wraps the canonical test command and runs it end to end on every agent-authored PR catches drift that the diff script cannot see (environment variables missing, secret scopes, concurrency limits).
Where does Assrt sit in this picture?
Assrt is one of several options for the E2E layer on top. It takes a natural language scenario, auto-generates Playwright tests, and runs them against a live staging URL. You can point it at the merge preview the agent opened and confirm the user-facing path still works. It is not the only choice (Playwright alone, Cypress, and Checkly all compose) but it removes the manual step of writing selectors for every new page.
Verify Compiled AI Configs With Real Browser Runs
Assrt runs against the merge preview your compiled config just deployed, and catches the drift the YAML cannot see.