AI Agents in Test Automation: MCP, Accessibility Trees, and Practical Limits
AI agents can now browse the web, interact with applications, and report what they find. But how well do they actually work for test automation? The answer depends on what you are testing and how you connect the agent to the browser.
“Accessibility tree-based interactions are 4x more resilient to UI changes than CSS selector-based approaches.”
Microsoft Research, Agentic Web Navigation Study, 2025
1. What MCP Is and Why It Matters for Testing
The Model Context Protocol (MCP), introduced by Anthropic in late 2024, is a standardized way for AI agents to interact with external tools and data sources. Think of it as a USB-C port for AI: instead of building custom integrations for every tool, you expose your tool as an MCP server and any MCP-compatible agent can use it.
For test automation, MCP matters because it standardizes how an AI agent connects to a browser. An MCP server for Playwright (like the official Playwright MCP server) exposes browser actions as tools: navigate to a URL, click an element, fill a form field, take a screenshot, read the accessibility tree. The AI agent calls these tools through MCP, and the server executes them against a real browser.
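Concretely, MCP tool calls are JSON-RPC 2.0 messages. A minimal sketch of what a navigation call from the agent's side might look like (the tool name browser_navigate and its arguments are illustrative assumptions, not any specific server's schema):

```python
import json

# A hypothetical MCP "tools/call" request, as an agent's client would send it.
# MCP messages follow JSON-RPC 2.0; the tool name and argument shape here are
# illustrative, not taken from a real server's tool listing.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "browser_navigate",  # assumed tool name
        "arguments": {"url": "https://example.com/login"},
    },
}

print(json.dumps(request, indent=2))
```

The server would execute the action against a real browser and return a result message, typically including an updated snapshot the agent uses to decide its next step.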
The significance is composability. An AI agent with MCP access to both a browser and your codebase can navigate your application, find a bug, read the relevant source code, and suggest a fix. This is fundamentally different from traditional test automation, which only reports pass or fail. The agent has context across the entire development workflow.
Several MCP browser servers already exist: Playwright MCP (from Microsoft), Browserbase MCP (cloud browsers), and various community implementations. The ecosystem is young but growing quickly, and the standardization means you are not locked into any single vendor.
2. Accessibility Trees vs. CSS Selectors
The most important technical decision in AI-driven browser interaction is how the agent identifies elements on the page. The two main approaches are CSS/XPath selectors (the traditional method) and accessibility tree navigation (the approach favored by AI agents).
CSS selectors identify elements by their position in the DOM, their class names, IDs, or data attributes. They are precise and fast, but they are brittle. When a developer refactors the HTML structure, renames a CSS class, or moves a component, selectors break. This is the primary cause of test maintenance burden in traditional automation.
Accessibility trees represent the page as the browser exposes it to screen readers and assistive technology. Elements are identified by their role (button, textbox, link), their accessible name (usually the visible label text), and their state (disabled, expanded, checked). These properties are semantically meaningful and change far less often than CSS classes.
When an AI agent reads the accessibility tree, it sees something like "button named Submit, enabled" rather than "div.form__submit-btn.primary." This is how the Playwright MCP server works by default: it provides the accessibility tree snapshot to the AI agent, and the agent decides which element to interact with based on semantic meaning, not DOM structure.
The practical benefit is resilience. An accessibility tree-based test that clicks "the Submit button" will continue working even if the button is moved to a different part of the page, wrapped in a new container, or restyled completely. It only breaks if the button's accessible name or role changes, which is a much rarer event. Tools like Assrt and Playwright's own getByRole locators lean into this approach for exactly this reason.
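The difference can be shown with a toy model. Below are simplified stand-ins for the same page before and after a refactor (not Playwright's actual tree format): a lookup keyed on a CSS class breaks when classes are renamed, while a lookup keyed on role and accessible name survives.

```python
# Simplified stand-ins for a page before and after a refactor. Real
# accessibility trees are much richer; this only models the two lookup
# strategies discussed above.
before = [
    {"role": "button", "name": "Submit", "css_class": "form__submit-btn primary"},
    {"role": "textbox", "name": "Email", "css_class": "form__input"},
]
# After the refactor: classes renamed, order changed, roles and names intact.
after = [
    {"role": "textbox", "name": "Email", "css_class": "input-v2"},
    {"role": "button", "name": "Submit", "css_class": "btn btn-primary"},
]

def find_by_css_class(elements, css_class):
    return next((e for e in elements if e["css_class"] == css_class), None)

def find_by_role_and_name(elements, role, name):
    return next((e for e in elements if e["role"] == role and e["name"] == name), None)

# The selector-based lookup works before the refactor but breaks after it...
assert find_by_css_class(before, "form__submit-btn primary") is not None
assert find_by_css_class(after, "form__submit-btn primary") is None
# ...while the semantic lookup survives the refactor unchanged.
assert find_by_role_and_name(after, "button", "Submit") is not None
```

Playwright's getByRole('button', { name: 'Submit' }) locator applies exactly this matching strategy against the browser's real accessibility tree.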
3. AI Suggesting Fixes, Not Just Describing Failures
Traditional test automation tells you that something broke. A test failed: here is the error message, here is a screenshot. Figuring out why it broke and how to fix it is entirely your problem. AI agents can do better.
When an AI agent has access to both the browser (via MCP) and your source code (via file system or GitHub MCP), it can correlate a test failure with the relevant code. If a button click does not trigger the expected navigation, the agent can read the onClick handler, trace the routing logic, and identify whether the issue is a broken link, a missing route definition, or a conditional that prevents navigation.
This is already possible with tools like Claude Code, Cursor, and GitHub Copilot Workspace. You can paste a test failure, and the AI will suggest code changes. The MCP-based approach takes this further by letting the agent discover the failure itself, without a human running the test and copying the error.
The workflow looks like this: the agent navigates your application, encounters an error (a broken link, a form that does not submit, a page that crashes), reads the relevant source code, generates a fix, and opens a pull request. Anthropic, Google, and several startups have demonstrated this workflow in prototype form. It is not production-ready for complex applications, but it works surprisingly well for common issues like missing null checks, broken imports, and incorrect API endpoint URLs.
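In outline, that workflow is a short loop. The sketch below uses stub functions throughout; in a real system each stub would be an MCP browser call, a file-system or GitHub MCP lookup, an LLM call, or a Git host API call. All names, URLs, and the "renamed endpoint" bug are invented for illustration.

```python
# Skeleton of the explore -> diagnose -> fix -> PR workflow described above.
# Every function body is a placeholder standing in for MCP/LLM/GitHub calls.
def crawl_pages():
    # Stand-in for agent navigation via an MCP browser server.
    return [{"url": "/pricing", "error": None},
            {"url": "/signup", "error": "404 on /api/v2/signup"}]

def read_relevant_source(error):
    # Stand-in for a file-system or GitHub MCP lookup of the failing code.
    return 'fetch("/api/v2/signup")  // endpoint was renamed'

def propose_fix(source, error):
    # Stand-in for an LLM call that returns a suggested patch.
    return source.replace("/api/v2/signup", "/api/signup")

pull_requests = []
for page in crawl_pages():
    if page["error"]:
        source = read_relevant_source(page["error"])
        pull_requests.append({"page": page["url"],
                              "patch": propose_fix(source, page["error"])})
```

The interesting part is not any single step but the fact that one agent holds context across all of them, which is what MCP's composability enables.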
4. Practical Limits of Agentic Testing
It is tempting to imagine AI agents replacing your entire QA process. The reality is more nuanced. Current AI agents have specific strengths and specific weaknesses for testing, and understanding both is essential for using them effectively.
The biggest limitation is cost and speed. Every interaction an AI agent has with the browser requires an LLM call to decide the next action. A simple five-step user flow might require 10 to 15 LLM calls (including reading the accessibility tree, deciding which element to interact with, verifying the result). At current API pricing, running 100 agentic test scenarios costs $5 to $20 and takes 10 to 30 minutes. A traditional Playwright test suite covering the same scenarios runs in seconds for essentially zero marginal cost.
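The arithmetic behind those ranges is easy to reproduce. The per-call cost, latency, and parallelism figures below are assumed averages for illustration, not any vendor's actual pricing:

```python
# Back-of-envelope cost and wall-clock time for an agentic test run.
# All three per-unit figures are assumptions; token counts and API pricing
# vary widely by model and by how verbose the accessibility snapshots are.
scenarios = 100
llm_calls_per_scenario = 12   # roughly 2-3 calls per UI step in a 5-step flow
cost_per_call_usd = 0.01      # assumed blended input+output cost per call
seconds_per_call = 8          # assumed LLM latency plus browser action time
parallel_agents = 10          # scenarios can run concurrently

total_cost = scenarios * llm_calls_per_scenario * cost_per_call_usd
total_minutes = (scenarios * llm_calls_per_scenario * seconds_per_call
                 / 60 / parallel_agents)

print(f"~${total_cost:.0f} and ~{total_minutes:.0f} minutes per full run")
```

Halve or double any of the assumptions and the result still lands orders of magnitude above a scripted suite, which runs the same scenarios in seconds at near-zero marginal cost.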
Another limitation is determinism. AI agents make probabilistic decisions. The same agent with the same accessibility tree might choose different elements to interact with on different runs. This introduces a new category of flakiness that is harder to debug than traditional timing-based flakiness because the agent's decision process is opaque.
Complex multi-step workflows also challenge AI agents. Flows that require maintaining state across many pages (like a checkout process with address validation, payment, and confirmation) can confuse agents that lose context or make incorrect assumptions about page state. Traditional scripted tests handle these flows reliably because every step is explicitly defined.
Authentication is another pain point. Most AI agents cannot handle MFA, CAPTCHA, or OAuth flows without custom workarounds. If your application requires authentication (and most do), you need to set up pre-authenticated sessions or bypass mechanisms for the agent, which adds complexity.
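The usual workaround is to log in once with a scripted step and hand the agent a saved session. Playwright persists sessions as a storage-state file; a minimal example of its shape, with made-up cookie and token values:

```python
import json

# Shape of a Playwright storage-state file, with invented values. A scripted
# login step saves one of these (context.storage_state(path=...)), and both
# agents and scripted tests can then start pre-authenticated browser contexts
# from it, sidestepping MFA and CAPTCHA at test time.
storage_state = {
    "cookies": [
        {"name": "session_id", "value": "abc123", "domain": "app.example.com",
         "path": "/", "expires": -1, "httpOnly": True, "secure": True,
         "sameSite": "Lax"},
    ],
    "origins": [
        {"origin": "https://app.example.com",
         "localStorage": [{"name": "auth_token", "value": "eyJ..."}]},
    ],
}

with open("auth_state.json", "w") as f:
    json.dump(storage_state, f, indent=2)
```

New browser contexts can then be created with storage_state="auth_state.json", so neither the agent nor the scripted test ever sees the login form. Note that saved sessions expire, so the login step still needs to run periodically.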
5. When AI Helps and When It Does Not
AI agents excel at exploratory testing and test discovery. Turning an agent loose on your application to find broken links, console errors, accessibility violations, and unexpected behavior is genuinely valuable. The agent will try interaction patterns that no human tester would think of, and it will do it tirelessly.
AI agents are also good at generating initial test scripts. Instead of writing Playwright tests from scratch, you can have an agent navigate your application and record its actions as reproducible test code. Assrt does this through automated crawling and Playwright test generation. The Playwright MCP server can record agent sessions as test scripts. The generated tests need human review, but they save significant time compared to manual test authoring.
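The conversion from a recorded session to a script is mostly mechanical. A toy generator sketches the idea; the action-log format is invented here, and real recorders emit much richer data:

```python
# Toy converter from a recorded agent session to Playwright test source.
# The action-log format below is invented for illustration; note the output
# targets elements by role and accessible name, preserving the resilience
# of the agent's own interaction style.
actions = [
    {"type": "goto", "url": "https://example.com/login"},
    {"type": "fill", "role": "textbox", "name": "Email", "value": "qa@example.com"},
    {"type": "click", "role": "button", "name": "Sign in"},
]

def to_playwright_test(actions, title="recorded login flow"):
    lines = [f"test('{title}', async ({{ page }}) => {{"]
    for a in actions:
        if a["type"] == "goto":
            lines.append(f"  await page.goto('{a['url']}');")
        elif a["type"] == "fill":
            lines.append(f"  await page.getByRole('{a['role']}', "
                         f"{{ name: '{a['name']}' }}).fill('{a['value']}');")
        elif a["type"] == "click":
            lines.append(f"  await page.getByRole('{a['role']}', "
                         f"{{ name: '{a['name']}' }}).click();")
    lines.append("});")
    return "\n".join(lines)

print(to_playwright_test(actions))
```

The generated script then runs deterministically in CI, with no LLM calls, which is exactly the hand-off the next section recommends.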
Where AI agents struggle is in replacing deterministic regression suites. If you have a critical checkout flow that must be verified on every deployment, you want a scripted Playwright test that runs the exact same steps every time. An AI agent might navigate the checkout flow slightly differently each run, making it unreliable as a regression gate. The scripted test is faster, cheaper, and more predictable.
The practical recommendation is to use AI agents for discovery and generation, then convert their output into deterministic test scripts for ongoing regression. Use agents periodically (weekly or after major releases) to explore for new issues, but rely on scripted tests for the CI pipeline. This hybrid approach gives you the creativity of AI exploration with the reliability of traditional automation.
MCP makes this hybrid approach easier by standardizing the interface between AI agents and browsers. As the protocol matures and more tools adopt it, switching between agent-based exploration and scripted regression will become seamless. The teams that benefit most will be those that understand both modes and deploy each where it is strongest.
Ready to automate your testing?
Assrt discovers test scenarios, writes Playwright tests from plain English, and self-heals when your UI changes.