Agentic Testing as an Engineering Discipline: Beyond Vibe Coding to Structured AI Workflows
There is a meaningful difference between casually prompting an AI to write code and building disciplined multi-agent workflows where each agent has a defined role. When the tester is a separate agent with its own system prompt and objectives, it catches the blind spots that single-agent setups consistently miss.
“Teams using separate AI agents for code generation and testing reported 35% fewer production incidents than those using a single agent for both.”
State of AI in Software Engineering, 2025
1. Vibe Coding vs. Agentic Engineering
The term "vibe coding" describes a workflow where a developer prompts an AI model, accepts the generated code with minimal review, and iterates by describing changes in natural language. It is fast, accessible, and often good enough for prototypes. But it is fundamentally a single-loop process: one human, one AI, one conversation, no structured verification.
Agentic engineering is something different. In an agentic architecture, multiple AI agents collaborate on the software development lifecycle, each with a specific role, system prompt, and set of tools. One agent writes code. Another reviews it. A third generates and runs tests. A fourth monitors production behavior. Each agent has access to different context and operates under different constraints.
The distinction matters because it mirrors how high-performing human teams work. A developer who writes code and also writes the tests for that code is less likely to catch their own assumptions than a separate QA engineer who approaches the feature with fresh eyes and different priorities. The same principle applies to AI agents: an agent tasked exclusively with finding bugs will find more bugs than an agent that just finished writing the code and is now asked to verify it.
Multi-agent SDLC architectures are emerging from companies like Cognition (Devin), Factory, and several open-source projects built on frameworks like CrewAI, AutoGen, and LangGraph. The common pattern is a coordinator agent that delegates tasks to specialist agents, each with narrowly scoped permissions and objectives. The testing agent in these systems is not an afterthought; it is a first-class participant with its own tool access and evaluation criteria.
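The coordinator pattern can be sketched in a few lines. This is a minimal illustration, not any specific framework's API: the agent names, objectives, tool lists, and the keyword routing are all hypothetical stand-ins for the delegation logic a CrewAI, AutoGen, or LangGraph setup would implement.

```python
# Minimal sketch of a coordinator delegating SDLC tasks to specialist agents.
# Agent roles, tools, and routing keywords are illustrative only.

SPECIALISTS = {
    "coder":    {"objective": "implement the feature",        "tools": ["editor", "repo_write"]},
    "reviewer": {"objective": "critique the diff statically", "tools": ["repo_read"]},
    "tester":   {"objective": "find bugs and regressions",    "tools": ["browser", "tests_write"]},
    "monitor":  {"objective": "watch production behavior",    "tools": ["metrics_read"]},
}

def delegate(task: str) -> str:
    """Route a task description to the specialist whose role matches it."""
    routing = {
        "implement": "coder",
        "review": "reviewer",
        "test": "tester",
        "observe": "monitor",
    }
    for keyword, agent in routing.items():
        if keyword in task:
            return agent
    return "coder"  # default: treat unrecognized work as implementation

assert delegate("test the checkout flow") == "tester"
```

The point of the sketch is the narrow scoping: each specialist carries only the tools its role needs, so the tester can drive a browser but cannot edit application code.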
2. The Blind Spot Problem: When AI Tests Its Own Code
When a single AI agent writes both the code and the tests, a predictable failure mode emerges: the tests validate the implementation rather than the requirements. The agent "knows" what it wrote, so it writes tests that confirm the code does what the code does, not tests that verify the code does what the user needs.
This is the AI equivalent of confirmation bias. If the agent implemented a login form that does not handle rate limiting, it will not write a test for rate limiting because rate limiting was never part of its mental model. The test suite passes with 100% coverage of the implemented code, while the actual requirements remain untested.
Research from multiple groups has documented this pattern. In benchmarks where a single LLM generates both code and tests, the tests catch fewer than 20% of injected bugs compared to tests written by a separate model instance with a different system prompt. The issue is not capability; it is context contamination. The generating agent carries forward its assumptions about how the code should work, and those assumptions shape the tests.
The practical consequence is that vibe-coded applications often ship with test suites that provide false confidence. The tests pass, the coverage numbers look reasonable, but the tests are not testing the right things. Edge cases, error states, concurrency issues, and integration failures slip through because the test author shared the same blind spots as the code author.
3. Splitting the Tester into a Separate Agent
The fix is architectural: give the testing agent a completely different system prompt, different context, and ideally a different model or temperature setting. The coding agent receives a system prompt that emphasizes correctness, clean architecture, and feature completeness. The testing agent receives a system prompt that emphasizes adversarial thinking, edge case discovery, and failure mode enumeration.
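The separation can be made concrete at configuration time. Here is a minimal sketch assuming a generic chat-completion-style setup; the `AgentConfig` structure, model names, and prompt wording are illustrative, not a particular vendor's API:

```python
from dataclasses import dataclass

@dataclass
class AgentConfig:
    system_prompt: str
    model: str
    temperature: float

# Deliberately different prompts, models, and temperatures, so the tester
# does not inherit the coder's assumptions.
coder = AgentConfig(
    system_prompt=(
        "You are a senior engineer. Implement features with correctness, "
        "clean architecture, and feature completeness as your priorities."
    ),
    model="model-a",   # placeholder model name
    temperature=0.2,   # low temperature: consistent implementations
)

tester = AgentConfig(
    system_prompt=(
        "You are an adversarial QA engineer. You have NOT seen the "
        "implementation. Enumerate edge cases, invalid inputs, and "
        "failure modes from the requirements alone."
    ),
    model="model-b",   # ideally a different model family entirely
    temperature=0.8,   # higher temperature: more diverse test ideas
)
```

The asymmetry is the feature: the coder optimizes for building, the tester for breaking, and neither prompt leaks into the other's context.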
In practice, this means the testing agent does not see the implementation source code (or sees it only after generating its test plan from the requirements). It starts from the user-facing specification: what should this feature do? What inputs does it accept? What happens when those inputs are invalid, missing, or malicious? The testing agent derives its test cases from requirements and expected behavior, not from the implementation.
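Requirements-driven test derivation can be sketched as a simple enumeration over the specification. The spec shape and case categories below are hypothetical, but they capture the idea: every case comes from what the feature should do, with no reference to how it was implemented.

```python
# Sketch: derive test cases from a requirements spec rather than from source.
# The spec format and case categories are illustrative, not a real tool's API.

def derive_test_cases(spec: dict) -> list[str]:
    """Enumerate cases per input field from expected behavior alone."""
    cases = []
    for field, rules in spec["inputs"].items():
        cases.append(f"{field}: accepts a valid value")
        cases.append(f"{field}: rejects a missing value")       # omission
        cases.append(f"{field}: rejects an invalid value")      # bad type/format
        if rules.get("user_supplied"):
            cases.append(f"{field}: survives malicious input")  # injection, overflow
    return cases

login_spec = {
    "feature": "login form",
    "inputs": {
        "email":    {"user_supplied": True},
        "password": {"user_supplied": True},
    },
}

cases = derive_test_cases(login_spec)
assert "email: survives malicious input" in cases
```

Because the enumeration starts from the spec, a missing behavior like rate limiting shows up as an obvious gap in the plan rather than silently passing with full coverage.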
Several tools facilitate this separation. Assrt takes an agentic approach to test generation by crawling a running application and discovering test scenarios from the user's perspective, independent of how the code is structured. It generates real Playwright tests based on what it observes in the browser, not based on reading source files. This architectural separation between the process that builds the application and the agent that tests it is what makes the generated tests genuinely useful for catching regressions.
Other approaches include using Meticulous for replay-based testing, QA Wolf for human-in-the-loop test creation, and custom setups where CrewAI or AutoGen orchestrate a dedicated testing agent with browser access via Playwright MCP. The specific tool matters less than the principle: the agent that tests should not be the agent that built.
4. Agentic Testing Loops and Regression Prevention
The real power of agentic testing emerges when it runs continuously, not as a one-time generation step. In an agentic testing loop, the testing agent runs after every code change, compares current behavior to expected behavior, and flags regressions before they reach production. This is similar to traditional CI/CD testing, but with one critical addition: the agent can adapt its tests when the application intentionally changes.
Consider a scenario where a developer (or a coding agent) modifies a checkout flow. In a traditional setup, any test that referenced the old flow structure would break, requiring manual updates. In an agentic setup, the testing agent re-crawls the application, detects that the flow has changed, determines whether the change is intentional (by checking the commit message or PR description), and updates its test expectations accordingly. Self-healing selectors are part of this: when a button moves from one location to another but retains its accessible name, the agent adapts without intervention.
This loop catches a specific category of bugs that traditional testing misses: unintended side effects. A code change to the payment module should not affect the user profile page. An agentic testing loop that exercises the entire application after every change will notice when a profile page element disappears or behaves differently, even though no profile-related code was modified. Humans writing targeted unit tests for the payment module are unlikely to ever catch this.
The economics work better than you might expect. While running a full agentic exploration on every commit is expensive, running targeted agentic tests on affected flows (determined by static analysis of the diff) brings the cost down to a few dollars per day for most applications. Combined with a traditional Playwright suite for critical paths, this creates a two-tier testing strategy that balances thoroughness with cost.
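The targeted tier can be sketched as a mapping from changed files to the flows they can affect. The path-to-flow table and the shared-code prefixes below are hypothetical; a real setup would derive them from the repository's own structure.

```python
# Sketch of two-tier test selection: map files changed in a diff to the user
# flows they can affect; fall back to every flow when shared code changes.
# The path-to-flow mapping is illustrative, not derived from a real repo.

FLOW_MAP = {
    "src/payments/": ["checkout"],
    "src/profile/":  ["user-profile"],
    "src/auth/":     ["login", "signup"],
}
SHARED_PREFIXES = ("src/lib/", "src/components/")

def flows_to_test(changed_files: list[str]) -> set[str]:
    flows: set[str] = set()
    for path in changed_files:
        if path.startswith(SHARED_PREFIXES):
            # Shared code can affect anything: run every flow.
            return {f for fs in FLOW_MAP.values() for f in fs}
        for prefix, mapped in FLOW_MAP.items():
            if path.startswith(prefix):
                flows.update(mapped)
    return flows

assert flows_to_test(["src/payments/stripe.py"]) == {"checkout"}
assert "login" in flows_to_test(["src/lib/http.py"])  # shared change: full run
```

A change scoped to `src/payments/` then triggers only the checkout exploration, while a shared-library change escalates to the full suite.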
5. Building an Agentic SDLC That Scales
Moving from vibe coding to agentic engineering is not a binary switch. Most teams adopt it incrementally, starting with a dedicated testing agent that runs alongside their existing development workflow. The first step is usually automated test discovery: point a tool like Assrt at your staging environment and let it generate a baseline test suite. From there, integrate the generated tests into CI so they run on every pull request.
The second step is adding a code review agent. Tools like CodeRabbit, Ellipsis, and Graphite's AI reviewer analyze diffs and flag potential issues. When combined with a testing agent, you get two independent AI perspectives on every change: one examining the code statically and one verifying behavior dynamically. Disagreements between the two agents are often the most valuable signals because they indicate areas where the code's intent and its behavior diverge.
The third step is closing the loop. When the testing agent finds a failure, it should be able to create a bug report (or even a fix) that routes back to the coding agent. Several teams have implemented this using GitHub Actions that trigger on test failures: the action captures the failure context, sends it to an AI agent, and the agent opens a PR with a proposed fix. The human developer reviews the fix rather than debugging the failure from scratch.
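The capture step in that loop can be sketched as packaging the failure into a structured payload a coding agent can act on. The field names and the downstream instructions are illustrative, not a real CI integration:

```python
# Sketch of the failure-capture step in a close-the-loop pipeline: package a
# test failure into a payload a coding agent can act on. Field names and the
# downstream handoff are illustrative, not a real API.

def build_fix_request(test_name: str, error: str, trace_path: str, commit: str) -> dict:
    return {
        "task": "propose_fix",
        "test": test_name,
        "error": error,
        "artifacts": [trace_path],  # e.g. a trace or screenshot for debugging
        "commit": commit,           # the change that triggered the failure
        "instructions": (
            "Open a pull request with a proposed fix. A human reviews the "
            "PR; the agent must not merge it."
        ),
    }

payload = build_fix_request(
    "checkout_applies_discount",
    "AssertionError: total was 104.00, expected 94.00",
    "traces/run-123.zip",
    "abc1234",
)
```

The embedded instruction encodes the human-review boundary: the agent proposes, the developer disposes.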
Scaling this architecture requires clear boundaries between agents. Each agent should have the minimum permissions necessary for its role. The testing agent needs read access to the running application and write access to the test directory; it should not have access to modify application code. The coding agent should not have access to approve its own pull requests. These constraints prevent the agents from undermining each other and maintain the separation of concerns that makes the architecture valuable.
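The permission boundaries above can be made checkable rather than aspirational. Here is a minimal sketch; the permission names are illustrative, and a real deployment would enforce the same rules at the token, repository, or CI level rather than in application code:

```python
# Sketch of least-privilege boundaries between agents. Permission names are
# illustrative; real enforcement belongs at the token/repo/CI level.

PERMISSIONS = {
    "coder":  {"repo:write", "pr:open"},               # cannot approve its own PRs
    "tester": {"app:read", "tests:write", "pr:open"},  # cannot modify app code
}

def allowed(agent: str, action: str) -> bool:
    """Deny by default: an action is permitted only if explicitly granted."""
    return action in PERMISSIONS.get(agent, set())

assert allowed("tester", "tests:write")
assert not allowed("tester", "repo:write")  # tester cannot touch app code
assert not allowed("coder", "pr:approve")   # no self-approval
```

Deny-by-default is the design choice that matters: any action not on an agent's list fails, including actions nobody anticipated.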
The teams that succeed with agentic engineering treat it as a discipline, not a shortcut. They define clear roles for each agent, maintain strict separation between code generation and verification, and invest in the orchestration layer that coordinates the agents. The payoff is a development process that scales with AI capabilities while maintaining the quality guarantees that production software demands.
Ready to automate your testing?
Assrt discovers test scenarios, writes Playwright tests from plain English, and self-heals when your UI changes.