Agent-Agnostic Compliance
How three AI coding agents interpret identical regulatory data via MCP — and why the data layer, not the model, is what makes compliance reproducible.
Published March 2026
Three AI coding agents evaluated the same codebase against the same GDPR requirements, delivered through a single Model Context Protocol server. The results show that structured regulatory data, not the choice of model, is the primary driver of consistent compliance coverage.
01 Executive summary
When AI coding agents are asked to evaluate regulatory compliance, does the quality of their analysis depend on the model, or on the data they receive?
RuleMesh ran a controlled experiment. Three leading AI coding agents — Claude Opus 4.6 (Anthropic), Gemini 3 Flash Preview (Google, via JetBrains Junie), and GPT-5.4 (OpenAI, via Codex CLI) — were each given the same task: evaluate a Next.js application against 118 GDPR IT requirements, sourced from seven high-priority compliance bundles. All three consumed the same structured data from RuleMesh’s MCP server (v5.1). A fourth model, Gemini 2.5 Pro, participated partially but could not complete the MCP workflow due to tooling constraints.
Key finding: all three agents identified the same critical GDPR gaps — no cookie consent banner, no published DPO contact, no data portability mechanism, and no age verification. Where the agents diverged was in how they reported findings: their precision, coverage strategy, confidence calibration, and adherence to the MCP’s tool-calling workflow.
The paper argues that compliance infrastructure should be model-agnostic by design. The MCP layer — not any individual AI model — is what ensures consistent, auditable, and reproducible regulatory analysis.
02 Background
Software teams face a growing tension. Regulations like the GDPR impose specific, verifiable requirements on digital products: lawful basis for processing, data subject rights, security safeguards, and accountability measures. AI coding agents are becoming embedded in development workflows, capable of reading codebases and flagging issues at scale.
The risk is that compliance analysis becomes coupled to a specific model. If an organisation uses Claude for code review today but switches to a different agent tomorrow, will the compliance findings be consistent? Will the same gaps be surfaced? Will the audit trail remain coherent?
In regulated industries — financial services, healthcare, government technology — auditors require evidence that compliance controls are systematic, not ad hoc. A compliance process that produces different results depending on which AI model happens to be in use is not a process; it is a lottery.
The Model Context Protocol approach
The Model Context Protocol (MCP) provides a standardised interface through which AI agents can discover and invoke structured tools. RuleMesh uses MCP to expose regulatory data — GDPR requirements engineered into IT-actionable checklist items, grouped into thematic bundles, each with defined risk levels, evidence types, and verification criteria.
If the data layer is structured and consistent, the compliance output should be consistent too, regardless of which model consumes it. The experiment described in this paper tests that hypothesis.
03 Experiment design
Test parameters
| Parameter | Value |
|---|---|
| Date | 23–24 March 2026 |
| Target codebase | RuleMesh landing page (Next.js 15, React 18) |
| Commit | 756a5a7 |
| MCP version | v5.1 (with layer filtering) |
| Regulation | GDPR |
| Bundles evaluated | 7 (all HIGH priority) |
| Total requirements | 118 |
Agents under test
Claude Opus 4.6
Model: claude-opus-4-6 (Anthropic)
Environment: Claude Code CLI (worktree isolation)
High precision. One descriptive label per finding; reports only what it can substantiate from code.
Junie / Gemini 3 Flash
Model: gemini-3-flash-preview (Google, via JetBrains Junie)
Environment: PyCharm Junie agent
Systematic coverage. One signal per requirement, iterating through every bundle and checklist item.
Codex / GPT-5.4
Model: gpt-5.4 (OpenAI)
Environment: OpenAI Codex CLI
High-volume, low-specificity category labels reused across many requirements.
Gemini 2.5 Pro (partial)
Model: gemini-2.5-pro
Environment: Gemini CLI / PyCharm plugin
Could not complete the MCP workflow; produced a local markdown/JSON compliance report instead.
MCP workflow
The expected workflow follows three steps. It was not explained to the agents in advance — it was discoverable through the MCP tool descriptions alone.
- 01 get_compliance_plan: retrieve the list of requirement bundles applicable to the target
- 02 pull_rules: for each bundle, retrieve the specific checklist items
- 03 submit_signals: for each finding, submit a structured evidence signal (signal name, evidence type, confidence score, source file, description)
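The three steps above can be sketched as a simple driver loop. This is a minimal illustration, not RuleMesh's client or server: `FakeComplianceServer`, its in-memory bundles, the requirement IDs, the argument names, and the field types are all assumptions made for the example. Only the three tool names and the five signal fields come from the workflow description.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    # The five fields named in step 03; the types are assumptions.
    signal_name: str
    evidence_type: str
    confidence: float
    source_file: str
    description: str


@dataclass
class FakeComplianceServer:
    """In-memory stand-in for an MCP server exposing the three tools (hypothetical)."""
    bundles: dict[str, list[str]]
    received: list[tuple[str, Signal]] = field(default_factory=list)

    def get_compliance_plan(self) -> list[str]:
        return list(self.bundles)

    def pull_rules(self, bundle_id: str) -> list[str]:
        return list(self.bundles[bundle_id])

    def submit_signals(self, requirement_id: str, signal: Signal) -> None:
        # Stores exactly one signal per call.
        self.received.append((requirement_id, signal))


def run_evaluation(server, evaluate) -> int:
    """Walk plan -> rules -> signals; `evaluate` returns a Signal or None per requirement."""
    submitted = 0
    for bundle_id in server.get_compliance_plan():
        for requirement_id in server.pull_rules(bundle_id):
            signal = evaluate(requirement_id)
            if signal is not None:
                server.submit_signals(requirement_id, signal)
                submitted += 1
    return submitted


def evaluate(requirement_id: str):
    # Toy evaluator: reports a gap for REQ-1 only and skips everything else.
    if requirement_id == "REQ-1":
        return Signal("cookie-consent-banner-missing", "manual", 0.5, "", "No consent banner found")
    return None


server = FakeComplianceServer(bundles={"consent": ["REQ-1", "REQ-2"], "rights": ["REQ-3"]})
count = run_evaluation(server, evaluate)
```

An evaluator that returns `None` for requirements it cannot substantiate gives a selective record; one that always returns a signal gives full nominal coverage. Both styles fit the same loop.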
04 Quantitative results
| Metric | Claude Opus 4.6 | Junie / Gemini 3 Flash | Codex / GPT-5.4 |
|---|---|---|---|
| Signals reported | 39 | 119 | 236 |
| Unique signal names | 39 | 91 | 13 |
| Requirements covered | 38 / 118 | 118 / 118 | 118 / 118 |
| Code-based findings | 22 | 19 | 12 |
| Manual / gap findings | 17 | 100 | 224 |
| Average confidence | 0.50 | 0.95 | 0.50 |
| Source files cited | 8 | 8 | 3 |
| Followed MCP workflow autonomously | Yes | Yes | No (required prompting) |
| Called submit_signals autonomously | Yes | Yes | No |
| Edited source files during evaluation | No | Yes | Yes |
Common findings — identified by all three agents
Despite their differences in reporting strategy, all three agents independently identified the same critical compliance gaps in the target codebase:
- No cookie consent banner or consent management platform. Claude and Junie flagged it explicitly by name; Codex captured it under the generic label "lawful-basis-consent-gap."
- No Data Protection Officer (DPO) contact published.
- No data portability or export mechanism for data subjects.
- No age verification despite the Terms of Service requiring users to be 18 or older.
Two files were cited by every agent: pages/terms.js (Terms of Service with data protection clause) and pages/settings/profile.js (account deletion / right to erasure). These represent the only implemented GDPR-relevant features in the target codebase.
Shared findings between Claude and Junie (Codex missed these)
- JWT authentication (lib/auth/jwtAuth.js): both identified the authentication mechanism and its relevance to access controls under GDPR Article 32.
- HTTPS transport encryption (lib/constants/api.js): both confirmed that API calls enforce TLS.
- PostHog analytics without consent (instrumentation-client.js): both flagged the analytics integration as processing personal data without first obtaining consent.
Unique discoveries per agent
- Claude: lib/api/client.js, pages/settings/notifications.js
- Junie: components/LoginModal.jsx, e2e/auth-complete.spec.js
- Codex: pages/settings/security.js
Collectively, the three agents cited 11 distinct source files. No single agent found all 11. This suggests that running multiple agents against the same MCP data yields broader codebase coverage than any single agent alone.
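The coverage argument is a set union. The sketch below is illustrative and uses only the file paths named in this section, so the sets it builds may be smaller than the per-agent counts reported in the results table. The variable names are chosen for the example.

```python
# Source files named in this section, per agent. Only files the text lists are
# included, so these sets may be smaller than the counts in the results table.
shared_by_all = {"pages/terms.js", "pages/settings/profile.js"}
shared_claude_junie = {
    "lib/auth/jwtAuth.js",
    "lib/constants/api.js",
    "instrumentation-client.js",
}

cited = {
    "Claude": shared_by_all | shared_claude_junie
    | {"lib/api/client.js", "pages/settings/notifications.js"},
    "Junie": shared_by_all | shared_claude_junie
    | {"components/LoginModal.jsx", "e2e/auth-complete.spec.js"},
    "Codex": shared_by_all | {"pages/settings/security.js"},
}

# Everything any agent cited, and the subset every agent cited.
combined = set().union(*cited.values())
cited_by_every_agent = set.intersection(*cited.values())

# Files each agent contributes that no other agent cited.
unique_per_agent = {
    agent: files - set().union(*(f for a, f in cited.items() if a != agent))
    for agent, files in cited.items()
}

# Size of the largest single-agent set, for comparison with the union.
best_single = max(len(files) for files in cited.values())
```

Comparing `len(combined)` with `best_single` shows how much the union adds over the strongest individual agent for the files listed here.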
05 Analysis
4.1 Signal quality: precision vs. coverage
The most revealing metric is the ratio of unique signal names to total signals reported. This captures how specifically each agent names and categorises its findings.
| Agent | Total signals | Unique names | Ratio | Strategy |
|---|---|---|---|---|
| Claude | 39 | 39 | 1.00 | One descriptive label per finding |
| Junie | 119 | 91 | 0.76 | One signal per requirement, some grouped |
| Codex | 236 | 13 | 0.06 | Broad category labels reused |
Claude was the most precise. Every signal had a unique, descriptive name. It reported only findings it could substantiate from the codebase, choosing to skip requirements where it could not determine evidence. This produced a concise evidence record — 39 signals covering 38 requirements.
Junie prioritised coverage. It evaluated every one of the 118 requirements and produced at least one signal for each, using 91 unique names. Its systematic approach is best suited to audit-readiness, where demonstrating that every requirement has been considered is as important as the findings themselves.
Codex reported the highest volume (236 signals) but with the lowest specificity (13 unique names). The label "governance-gap" was applied 64 times across different requirements; "security-control-gap" was used 34 times. Every code-type finding was duplicated. This bulk approach achieves nominal full coverage but makes it difficult to distinguish one finding from another.
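The ratio in the table is unique signal names divided by total signals. A minimal sketch of that calculation follows. The function names are illustrative, and the `reported` counts are copied from the table rather than recomputed from raw signals.

```python
from collections import Counter


def naming_specificity(signal_names):
    """Return (total, unique, unique/total) for a list of signal names."""
    total = len(signal_names)
    unique = len(set(signal_names))
    return total, unique, (unique / total if total else 0.0)


def most_reused(signal_names, top=3):
    """Return the `top` most frequently reused labels with their counts."""
    return Counter(signal_names).most_common(top)


# (total signals, unique names) per agent, as reported in the table.
reported = {"Claude": (39, 39), "Junie": (119, 91), "Codex": (236, 13)}
ratios = {agent: round(unique / total, 2) for agent, (total, unique) in reported.items()}
```

Run against a raw signal list, `most_reused` would surface labels such as "governance-gap" that carry a large share of an agent's output.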
4.2 Confidence calibration
All agents were asked to provide a confidence score (0.0 to 1.0) with each evidence signal. The results reveal an unsolved problem in AI-assisted compliance:
- Claude and Codex both used the default value of 0.50 for every signal, never adjusting it.
- Junie assigned 0.95 uniformly across all 119 signals.
None of the three agents varied confidence on a per-signal basis. A finding backed by a specific code reference received the same confidence score as a finding noting the absence of an entire feature. Meaningful confidence differentiation requires explicit schema-level guidance — the MCP tool definition should either constrain the confidence scale to meaningful bands or require agents to justify their score.
This is a design opportunity, not a failure. The MCP schema can be updated to require structured justification for confidence scores, turning confidence from a perfunctory field into an auditable data point.
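Uniform confidence is easy to detect mechanically. The following sketch, with an illustrative function name and output shape, flags a run in which every signal carries the same score. It says nothing about whether a varied set of scores is accurate.

```python
from statistics import pstdev


def confidence_profile(confidences):
    """Summarise reported confidences; `is_uniform` means every signal got the same score."""
    if not confidences:
        raise ValueError("no confidence values supplied")
    distinct = sorted(set(confidences))
    return {
        "count": len(confidences),
        "distinct_values": distinct,
        "spread": pstdev(confidences),
        "is_uniform": len(distinct) == 1,
    }


# Values as reported in the results: one constant score per agent.
claude = confidence_profile([0.50] * 39)
junie = confidence_profile([0.95] * 119)

# Illustrative varied input, not taken from the experiment.
mixed = confidence_profile([0.2, 0.5, 0.9])
```

A server or pipeline could use a check like this to warn when an agent appears to be passing a default value through unchanged.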
4.3 Workflow adherence
The MCP workflow (get_compliance_plan → pull_rules → submit_signals) was designed to be self-documenting through tool descriptions. Two of the three agents discovered and followed this workflow autonomously:
- Claude called the tools in the expected sequence from its first message, treating evidence reporting as a natural part of the review process.
- Junie followed the workflow systematically, iterating through all bundles and requirements.
- Codex did not call submit_signals initially. It interpreted the task as a review-only exercise and produced a summary instead. A second explicit instruction was required to trigger evidence reporting. It also attempted to batch-submit evidence before learning that submit_signals accepts one signal at a time.
This divergence in workflow adherence is significant for production use. In an automated pipeline, an agent that does not call submit_signals produces no audit trail — the analysis may have been performed, but the findings are not captured in the compliance record. MCP server design should account for this by making the expected workflow explicit.
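In an automated pipeline, a missing audit trail can be caught by inspecting the sequence of tool calls an agent made. The sketch below assumes the pipeline can obtain an ordered list of tool names for a run. How that trace is captured is outside the scope of this example, and `check_trace` is a hypothetical helper.

```python
EXPECTED_ORDER = ["get_compliance_plan", "pull_rules", "submit_signals"]


def check_trace(tool_calls):
    """Inspect an ordered list of tool names called during one agent run.

    Returns a list of problems; an empty list means every expected tool was
    called and each was first called after its predecessor.
    """
    problems = []
    first_seen = {}
    for position, name in enumerate(tool_calls):
        first_seen.setdefault(name, position)

    # Missing tools: the run left a gap in the workflow.
    for name in EXPECTED_ORDER:
        if name not in first_seen:
            problems.append(f"{name} was never called")

    # Ordering: compare first occurrences of the tools that were called.
    called = [name for name in EXPECTED_ORDER if name in first_seen]
    for earlier, later in zip(called, called[1:]):
        if first_seen[earlier] > first_seen[later]:
            problems.append(f"{later} was called before {earlier}")
    return problems
```

A run that stops after `pull_rules`, as Codex's first attempt did, would be reported as "submit_signals was never called" rather than silently producing an empty compliance record.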
4.4 Side note: Gemini 2.5 Pro and general knowledge
Gemini 2.5 Pro could not complete the MCP workflow due to tooling constraints (submit_signals was unavailable in its environment) and API capacity issues. It instead produced a local compliance report as Markdown and JSON files.
One observation is notable: Gemini 2.5 Pro cited the LG München ruling on Google Fonts, a German regional court decision holding that dynamically embedding Google Fonts, which discloses the visitor's IP address to Google, is unlawful without the visitor's consent. No other agent cited specific case law. Claude flagged Google Fonts as a transfer risk but without the legal reference.
This highlights a complementarity between MCP-guided analysis and model general knowledge. The MCP provides structured, requirement-level coverage; a model with strong legal training can add contextual depth — case law, regulatory guidance, enforcement precedent — that structured data alone does not contain.
06 Implications
5.1 For compliance teams
MCP-structured regulatory data provides a consistent compliance baseline across different AI models. The core GDPR gaps were identified by every agent. Organisations can adopt a compliance-as-infrastructure approach: invest in the quality of the regulatory data layer (requirements engineering, bundle structure, evidence type definitions) and treat the choice of AI model as a secondary decision.
- Multi-agent evaluation. Running two or more agents against the same MCP data yields broader source file discovery. No single agent found all 11 relevant files.
- Agent rotation. Switching models between evaluation cycles does not break the compliance record, because the evidence schema is defined by the MCP, not by the model.
- Audit trail independence. Evidence signals stored via submit_signals are model-attributed but schema-consistent. An auditor can review findings from different agents in the same format.
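The points above depend on every agent writing to the same record shape. The sketch below shows one way such a record could look. `StoredSignal`, the `requirement_id` and `agent` fields, the requirement IDs, and every sample value except Codex's "lawful-basis-consent-gap" label are assumptions for illustration, not RuleMesh's storage schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredSignal:
    requirement_id: str  # hypothetical identifier format
    agent: str           # model attribution
    signal_name: str
    evidence_type: str
    confidence: float
    source_file: str
    description: str


def by_requirement(signals):
    """Group signals from any number of agents under the requirement they address."""
    grouped = defaultdict(list)
    for signal in signals:
        grouped[signal.requirement_id].append(signal)
    return dict(grouped)


def agents_reporting(signals, requirement_id):
    """Sorted names of the agents that reported at least one signal for a requirement."""
    return sorted({s.agent for s in signals if s.requirement_id == requirement_id})


# Illustrative records: same schema, different agents and naming styles.
records = [
    StoredSignal("REQ-1", "claude", "cookie-consent-banner-missing", "manual", 0.5, "", "No banner"),
    StoredSignal("REQ-1", "codex", "lawful-basis-consent-gap", "manual", 0.5, "", "Consent gap"),
    StoredSignal("REQ-2", "junie", "dpo-contact-missing", "manual", 0.95, "", "No DPO contact"),
]
```

Because the grouping key is the requirement rather than the agent, an auditor can read findings from several models side by side, and a later evaluation cycle with a different model appends to the same structure.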
5.2 For MCP server design
The experiment reveals three design improvements for compliance-oriented MCP servers:
- Confidence guidance. The schema should define meaningful bands (e.g. 0.0–0.3 = gap identified, no code evidence; 0.4–0.7 = partial evidence found; 0.8–1.0 = requirement fully satisfied with code reference) and require a brief justification string.
- Workflow enforcement. The MCP server could require a pull_rules call before accepting submit_signals for that bundle, or return workflow guidance in tool responses.
- Signal naming guidance. The 7× difference in unique signal names (91 vs. 13 for the same 118 requirements) shows that agents need clearer guidance on signal granularity. The schema could suggest a naming convention or reject duplicate signal names within a bundle.
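The three proposals can be combined into one server-side guard. The sketch below is hypothetical: `GuardedSession`, `SignalRejected`, and the method signatures are invented for illustration. The band boundaries are the example values from the confidence-guidance bullet. Rejecting scores that fall between bands (such as 0.35) is a design choice made here, not something the proposal specifies.

```python
class SignalRejected(Exception):
    """Raised when a submitted signal fails one of the proposed checks."""


# Bands from the confidence-guidance bullet; scores between bands are rejected.
CONFIDENCE_BANDS = [
    (0.0, 0.3, "gap identified, no code evidence"),
    (0.4, 0.7, "partial evidence found"),
    (0.8, 1.0, "requirement fully satisfied with code reference"),
]


def band_for(confidence):
    """Return the label of the band containing `confidence`, or reject it."""
    for low, high, label in CONFIDENCE_BANDS:
        if low <= confidence <= high:
            return label
    raise SignalRejected(f"confidence {confidence} falls outside the defined bands")


class GuardedSession:
    """Hypothetical per-run state enforcing the three proposed checks."""

    def __init__(self):
        self.pulled_bundles = set()
        self.names_by_bundle = {}

    def pull_rules(self, bundle_id):
        self.pulled_bundles.add(bundle_id)
        self.names_by_bundle.setdefault(bundle_id, set())

    def submit_signal(self, bundle_id, signal_name, confidence, justification):
        # Workflow enforcement: rules must be pulled before signals are accepted.
        if bundle_id not in self.pulled_bundles:
            raise SignalRejected(f"call pull_rules for {bundle_id!r} first")
        # Confidence guidance: a justification string and an in-band score.
        if not justification.strip():
            raise SignalRejected("a confidence justification is required")
        label = band_for(confidence)
        # Signal naming guidance: no reused names within a bundle.
        if signal_name in self.names_by_bundle[bundle_id]:
            raise SignalRejected(f"duplicate signal name {signal_name!r} in {bundle_id!r}")
        self.names_by_bundle[bundle_id].add(signal_name)
        return label
```

Returning a descriptive rejection message, rather than failing silently, gives the agent the workflow guidance it needs to correct course within the same run.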
5.3 For the broader AI compliance ecosystem
To our knowledge, this is the first published comparison of multiple AI coding agents performing structured compliance evaluation against the same regulatory dataset via MCP. The finding that the data layer, not the model, is the primary determinant of consistent compliance output has implications beyond RuleMesh:
- Regulatory technology vendors should invest in structured, machine-readable regulatory data rather than optimising for a single AI model.
- Standards bodies and framework maintainers (for example, those behind ISO/IEC 27001 Annex A and the NIST CSF) should consider how their frameworks can be engineered into MCP-compatible requirement bundles.
- AI model providers should recognise that compliance evaluation benefits from clear tool-calling conventions, per-signal confidence calibration, and read-only evaluation modes.
07 Limitations
This whitepaper presents findings from a single controlled comparison. Several limitations should be noted:
- Single codebase. The target was one Next.js application. Results may differ for larger codebases or different technology stacks.
- Single regulation. Only GDPR requirements were tested. Other regulations (NIS2, DORA, AI Act) may surface different agent behaviours.
- No repeated runs. Each agent was run once. AI model outputs are non-deterministic; repeated runs might yield different signal counts or file discoveries.
- Gemini 2.5 Pro incomplete. The fourth agent could not complete the MCP workflow, limiting the comparison to three full participants.
- Confidence scores were not validated. The experiment measured what confidence values agents reported, not whether those values were accurate reflections of finding quality.
- Small sample of source files. The target codebase contained a limited number of compliance-relevant files, which may have compressed the differences between agents.
08 Conclusion
The choice of AI model matters — but less than the choice of regulatory data infrastructure. When three different AI coding agents were given identical GDPR requirements through a standardised MCP interface, they all found the same critical compliance gaps. They differed in precision, coverage strategy, and workflow adherence, but the compliance baseline was consistent.
For organisations building AI-assisted compliance workflows, the implication is clear: invest in the data layer. Engineer regulatory requirements into structured, machine-readable bundles. Expose them through standardised protocols. Define evidence schemas that are model-agnostic. The AI agent is the lens; the MCP data is the light.
RuleMesh’s MCP server provides this structured regulatory data layer for GDPR today, with plans to expand to additional regulations. Within the limitations noted above, the experiment supports the architecture: the same critical compliance gaps surfaced regardless of which model consumed the data.
Methodology notes
- All agents evaluated the same Git commit (756a5a7) of the target codebase.
- No agent received pre-loaded context or hints beyond the standard MCP tool descriptions. The one exception is the second explicit instruction Codex needed before it called submit_signals.
- Evidence signals were collected via the submit_signals MCP tool and stored in a structured format for comparison.
- Claude Opus 4.6 was run in a worktree-isolated environment to prevent any file modifications to the source codebase.
- The raw comparison data is available in structured JSON format for reproducibility (mcp-cross-agent-comparison-2026-03-24.json).
RuleMesh provides structured GDPR compliance data for AI-assisted development workflows. This whitepaper is published for informational purposes and does not constitute legal advice. Organisations should consult qualified legal counsel for compliance decisions.
Run the same evaluation
RuleMesh publishes the MCP server and the structured GDPR bundles used in this experiment. Connect your own agent and evaluate your own codebase.