White Paper — Technical Series 04

Agent-agnostic compliance.

How three AI coding agents interpret identical regulatory data via MCP — and why the data layer, not the model, is what makes compliance reproducible.

March 2026 · 15 min read

Three AI coding agents evaluated the same codebase against the same GDPR requirements, delivered through a single Model Context Protocol server. The results show that structured regulatory data, not the choice of model, is the primary driver of consistent compliance coverage.

Section 01

Executive summary

When AI coding agents are asked to evaluate regulatory compliance, does the quality of their analysis depend on the model, or on the data they receive?

RuleMesh ran a controlled experiment. Three leading AI coding agents — Claude Opus 4.6 (Anthropic), Gemini 3 Flash Preview (Google, via JetBrains Junie), and GPT-5.4 (OpenAI, via Codex CLI) — were each given the same task: evaluate a Next.js application against 118 GDPR IT requirements, sourced from seven high-priority compliance bundles. All three consumed the same structured data from RuleMesh’s MCP server (v5.1). A fourth model, Gemini 2.5 Pro, participated partially but could not complete the MCP workflow due to tooling constraints.

Key finding: all three agents identified the same critical GDPR gaps — no cookie consent banner, no published DPO contact, no data portability mechanism, and no age verification. Where the agents diverged was in how they reported findings: their precision, coverage strategy, confidence calibration, and adherence to the MCP’s tool-calling workflow.

The paper argues that compliance infrastructure should be model-agnostic by design. The MCP layer — not any individual AI model — is what ensures consistent, auditable, and reproducible regulatory analysis.

Section 02

Background

Software teams face a growing tension. Regulations like the GDPR impose specific, verifiable requirements on digital products: lawful basis for processing, data subject rights, security safeguards, and accountability measures. AI coding agents are becoming embedded in development workflows, capable of reading codebases and flagging issues at scale.

The risk is that compliance analysis becomes coupled to a specific model. If an organisation uses Claude for code review today but switches to a different agent tomorrow, will the compliance findings be consistent? Will the same gaps be surfaced? Will the audit trail remain coherent?

In regulated industries — financial services, healthcare, government technology — auditors require evidence that compliance controls are systematic, not ad hoc. A compliance process that produces different results depending on which AI model happens to be in use is not a process; it is a lottery.

The Model Context Protocol approach

The Model Context Protocol (MCP) provides a standardised interface through which AI agents can discover and invoke structured tools. RuleMesh uses MCP to expose regulatory data — GDPR requirements engineered into IT-actionable checklist items, grouped into thematic bundles, each with defined risk levels, evidence types, and verification criteria.

If the data layer is structured and consistent, the compliance output should be consistent too, regardless of which model consumes it. The experiment described in this paper tests that hypothesis.
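To make the shape of that data layer concrete, the following is a minimal sketch of how a bundle and its checklist items might be modelled. The class names, field names, and the example requirement are illustrative assumptions, not RuleMesh's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class RiskLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Requirement:
    """One IT-actionable checklist item (field names are assumptions)."""
    requirement_id: str
    text: str
    risk_level: RiskLevel
    evidence_types: tuple[str, ...]  # assumed values such as "code" or "manual"
    verification: str                # how a reviewer confirms the item


@dataclass(frozen=True)
class Bundle:
    """A thematic group of requirements."""
    bundle_id: str
    title: str
    priority: str
    requirements: tuple[Requirement, ...] = ()


def total_requirements(bundles: list[Bundle]) -> int:
    """Count checklist items across all bundles."""
    return sum(len(bundle.requirements) for bundle in bundles)


# Hypothetical example item, written for illustration only.
example = Requirement(
    requirement_id="EXAMPLE-001",
    text="Obtain consent before loading non-essential analytics.",
    risk_level=RiskLevel.HIGH,
    evidence_types=("code", "manual"),
    verification="Analytics scripts load only after the user opts in.",
)
example_bundle = Bundle("example-consent", "Consent", "HIGH", (example,))
```

Because every agent receives the same typed records, differences in output can be attributed to agent behaviour rather than to differences in the input.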

Section 03

Experiment design

Test parameters

Date: 23–24 March 2026
Target codebase: RuleMesh landing page (Next.js 15, React 18)
Commit: 756a5a7
MCP version: v5.1 (with layer filtering)
Regulation: GDPR
Bundles evaluated: 7 (all HIGH priority)
Total requirements: 118

Agents under test

Claude Opus 4.6

  • Model ID: claude-opus-4-6
  • Provider: Anthropic
  • Interface: Claude Code CLI (worktree isolation)
  • Observed behaviour: High precision. One descriptive label per finding; reports only what it can substantiate from code.

Junie / Gemini 3 Flash

  • Model ID: gemini-3-flash-preview
  • Provider: Google (via JetBrains Junie)
  • Interface: PyCharm Junie agent
  • Observed behaviour: Systematic coverage. One signal per requirement, iterating through every bundle and checklist item.

Codex / GPT-5.4

  • Model ID: gpt-5.4
  • Provider: OpenAI
  • Interface: OpenAI Codex CLI
  • Observed behaviour: High-volume, low-specificity category labels reused across many requirements.

Gemini 2.5 Pro (partial)

  • Model ID: gemini-2.5-pro
  • Provider: Google
  • Interface: Gemini CLI / PyCharm plugin
  • Observed behaviour: Could not complete the MCP workflow; produced a local markdown/JSON compliance report instead.

MCP workflow

The expected workflow follows three steps. It was not explained to the agents in advance — it was discoverable through the MCP tool descriptions alone.

  1. get_compliance_plan: retrieve the list of requirement bundles applicable to the target.
  2. pull_rules: for each bundle, retrieve the specific checklist items.
  3. submit_signals: for each finding, submit a structured evidence signal (signal name, evidence type, confidence score, source file, description).
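The three steps can be sketched as a simple loop. This is an illustration under stated assumptions: StubServer is a hypothetical in-memory stand-in rather than a real MCP client, the Signal attribute names are guesses based on the fields listed in step 3, and flag_as_gap is a toy evaluator.

```python
from dataclasses import asdict, dataclass
from typing import Callable, Optional


@dataclass
class Signal:
    # Fields named in step 3; the attribute names themselves are assumptions.
    signal_name: str
    evidence_type: str
    confidence: float
    source_file: str
    description: str


class StubServer:
    """In-memory stand-in for the MCP server (hypothetical)."""

    def __init__(self, bundles: dict[str, list[str]]) -> None:
        self._bundles = bundles
        self.received: list[dict] = []

    def get_compliance_plan(self) -> list[str]:
        return list(self._bundles)

    def pull_rules(self, bundle_id: str) -> list[str]:
        return list(self._bundles[bundle_id])

    def submit_signals(self, signal: Signal) -> None:
        # One signal per call.
        self.received.append(asdict(signal))


Evaluator = Callable[[str, str], Optional[Signal]]


def run_evaluation(server: StubServer, evaluate: Evaluator) -> int:
    """Walk plan -> rules -> signals; return the number of signals sent."""
    sent = 0
    for bundle_id in server.get_compliance_plan():      # step 1
        for rule in server.pull_rules(bundle_id):        # step 2
            signal = evaluate(bundle_id, rule)
            if signal is not None:                       # skip items with no finding
                server.submit_signals(signal)            # step 3
                sent += 1
    return sent


def flag_as_gap(bundle_id: str, rule: str) -> Signal:
    """Toy evaluator that reports every rule as a manual gap."""
    return Signal(f"{bundle_id}:{rule}", "manual", 0.5, "",
                  f"No evidence found for {rule}")


demo = StubServer({"bundle-a": ["rule-1", "rule-2"], "bundle-b": ["rule-3"]})
sent = run_evaluation(demo, flag_as_gap)
print(sent, "signals recorded")
```

An evaluator that returns None for items it cannot substantiate would produce a shorter record, while one that always returns a signal would produce full nominal coverage.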

Section 04

Quantitative results

Metric | Claude | Junie | Codex
Signals reported | 39 | 119 | 236
Unique signal names | 39 | 91 | 13
Requirements covered | 38 / 118 | 118 / 118 | 118 / 118
Code-based findings | 22 | 19 | 12
Manual / gap findings | 17 | 100 | 224
Average confidence | 0.50 | 0.95 | 0.50
Source files cited | 8 | 8 | 3
Followed MCP workflow autonomously | Yes | Yes | No (required prompting)
Called submit_signals autonomously | Yes | Yes | No
Edited source files during evaluation | No | Yes | Yes

Common findings — identified by all three agents

Despite their differences in reporting strategy, all three agents independently identified the same critical compliance gaps in the target codebase:

  • No cookie consent banner or consent management platform. Claude and Junie flagged it explicitly by name; Codex captured it under the generic label "lawful-basis-consent-gap."
  • No Data Protection Officer (DPO) contact published.
  • No data portability or export mechanism for data subjects.
  • No age verification despite the Terms of Service requiring users to be 18 or older.

Two files were cited by every agent: pages/terms.js (Terms of Service with data protection clause) and pages/settings/profile.js (account deletion / right to erasure). These were the only implemented GDPR-relevant features that all three agents recognised.

Shared findings between Claude and Junie (Codex missed these)

  • JWT authentication (lib/auth/jwtAuth.js) — both identified the authentication mechanism and its relevance to access controls under GDPR Article 32.
  • HTTPS transport encryption (lib/constants/api.js) — both confirmed that API calls enforce TLS.
  • PostHog analytics without consent (instrumentation-client.js): both flagged the analytics integration as processing personal data without consent or any other lawful basis.

Unique discoveries per agent

  • Claude: lib/api/client.js, pages/settings/notifications.js
  • Junie: components/LoginModal.jsx, e2e/auth-complete.spec.js
  • Codex: pages/settings/security.js

Collectively, the three agents cited 11 distinct source files. No single agent found all 11. This suggests that running multiple agents against the same MCP data yields broader codebase coverage than any single agent alone.
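The combined-coverage argument is a set union. The sketch below uses only the file paths named in this section; that list is shorter than the per-agent counts in the results table, so the totals it computes are lower than the reported 11. The function name is an assumption.

```python
COMMON = {"pages/terms.js", "pages/settings/profile.js"}
CLAUDE_AND_JUNIE = {
    "lib/auth/jwtAuth.js",
    "lib/constants/api.js",
    "instrumentation-client.js",
}

# Paths taken from this section only; not the full raw dataset.
cited = {
    "Claude": COMMON | CLAUDE_AND_JUNIE
    | {"lib/api/client.js", "pages/settings/notifications.js"},
    "Junie": COMMON | CLAUDE_AND_JUNIE
    | {"components/LoginModal.jsx", "e2e/auth-complete.spec.js"},
    "Codex": COMMON | {"pages/settings/security.js"},
}


def coverage(cited_files: dict[str, set[str]]):
    """Return (all files, files cited by every agent, files unique to each agent)."""
    all_files = set().union(*cited_files.values())
    shared_by_all = set.intersection(*cited_files.values())
    only = {}
    for agent, files in cited_files.items():
        others = set().union(
            *(f for name, f in cited_files.items() if name != agent)
        )
        only[agent] = files - others
    return all_files, shared_by_all, only


all_files, shared_by_all, only = coverage(cited)
print(len(all_files), "distinct files named in this section")
for agent, files in only.items():
    print(agent, "only:", sorted(files))
```

The same function applies unchanged to the full raw comparison data.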

Section 05

Analysis

5.1 Signal quality: precision vs. coverage

The most revealing metric is the ratio of unique signal names to total signals reported. This captures how specifically each agent names and categorises its findings.

Agent | Signals | Unique names | Ratio | Naming pattern
Claude | 39 | 39 | 1.00 | One descriptive label per finding
Junie | 119 | 91 | 0.76 | One signal per requirement, some grouped
Codex | 236 | 13 | 0.06 | Broad category labels reused
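The ratio is the number of unique signal names divided by the total number of signals. A minimal sketch follows, using the counts reported above. The function name and the short example list are assumptions made for illustration.

```python
from collections import Counter


def naming_ratio(signal_names: list[str]) -> float:
    """Unique names divided by total signals; 1.0 means no label was reused."""
    if not signal_names:
        raise ValueError("at least one signal is required")
    return len(set(signal_names)) / len(signal_names)


# (total signals, unique names) as reported in the results.
reported = {"Claude": (39, 39), "Junie": (119, 91), "Codex": (236, 13)}
ratios = {
    agent: round(unique / total, 2)
    for agent, (total, unique) in reported.items()
}
print(ratios)

# Hypothetical short run showing how label reuse lowers the ratio.
example_names = ["governance-gap"] * 3 + ["security-control-gap"]
example_ratio = naming_ratio(example_names)
most_reused = Counter(example_names).most_common(1)[0]
```

In the short example, four signals share two names, so the ratio is 0.5 and the most reused label appears three times.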

Claude was the most precise. Every signal had a unique, descriptive name. It reported only findings it could substantiate from the codebase, choosing to skip requirements where it could not determine evidence. This produced a concise evidence record — 39 signals covering 38 requirements.

Junie prioritised coverage. It evaluated every one of the 118 requirements and produced at least one signal for each, using 91 unique names. Its systematic approach is best suited to audit-readiness, where demonstrating that every requirement has been considered is as important as the findings themselves.

Codex reported the highest volume (236 signals) but with the lowest specificity (13 unique names). The label "governance-gap" was applied 64 times across different requirements; "security-control-gap" was used 34 times. Every code-type finding was duplicated. This bulk approach achieves nominal full coverage but makes it difficult to distinguish one finding from another.

5.2 Confidence calibration

All agents were asked to provide a confidence score (0.0 to 1.0) with each evidence signal. The results reveal an unsolved problem in AI-assisted compliance:

  • Claude and Codex both used the default value of 0.50 for every signal, never adjusting it.
  • Junie assigned 0.95 uniformly across all 119 signals.

None of the three agents varied confidence on a per-signal basis. A finding backed by a specific code reference received the same confidence score as a finding noting the absence of an entire feature. Meaningful confidence differentiation requires explicit schema-level guidance — the MCP tool definition should either constrain the confidence scale to meaningful bands or require agents to justify their score.

This is a design opportunity, not a failure. The MCP schema can be updated to require structured justification for confidence scores, turning confidence from a perfunctory field into an auditable data point.
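A pipeline can detect uncalibrated confidence before an auditor does. The sketch below summarises a run's scores and flags the case where every signal carries the same value. The function name and the mixed example scores are assumptions.

```python
from statistics import fmean, pstdev


def confidence_profile(scores: list[float]) -> dict:
    """Summarise confidence scores and flag runs with no per-signal variation."""
    if not scores:
        raise ValueError("at least one score is required")
    return {
        "mean": fmean(scores),
        "spread": pstdev(scores),           # population standard deviation
        "uniform": len(set(scores)) == 1,   # True when every score is identical
    }


# Junie's reported pattern: 0.95 on all 119 signals.
junie = confidence_profile([0.95] * 119)

# Hypothetical run with differentiated scores, for contrast.
mixed = confidence_profile([0.2, 0.5, 0.9])

print(junie["uniform"], mixed["uniform"])
```

A uniform profile does not prove the scores are wrong, only that they carry no information for ranking findings.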

5.3 Workflow adherence

The MCP workflow (get_compliance_plan → pull_rules → submit_signals) was designed to be self-documenting through tool descriptions. Two of the three agents discovered and followed this workflow autonomously:

  • Claude called the tools in the expected sequence from its first message, treating evidence reporting as a natural part of the review process.
  • Junie followed the workflow systematically, iterating through all bundles and requirements.
  • Codex did not call submit_signals initially. It interpreted the task as a review-only exercise and produced a summary instead. A second explicit instruction was required to trigger evidence reporting. It also attempted to batch-submit evidence before learning that submit_signals accepts one signal at a time.

This divergence in workflow adherence is significant for production use. In an automated pipeline, an agent that does not call submit_signals produces no audit trail — the analysis may have been performed, but the findings are not captured in the compliance record. MCP server design should account for this by making the expected workflow explicit.
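One way an automated pipeline could catch a missing audit trail is to check the agent's tool-call log after each run. The following is a sketch under the assumption that the pipeline can obtain an ordered list of tool names; the log format and function name are hypothetical.

```python
EXPECTED_STEPS = ("get_compliance_plan", "pull_rules", "submit_signals")


def missing_steps(tool_calls: list[str]) -> list[str]:
    """Return the expected steps that do not occur, in workflow order, in the log."""
    missing = []
    position = 0
    for step in EXPECTED_STEPS:
        try:
            # Search only after the previous step that was found.
            position = tool_calls.index(step, position) + 1
        except ValueError:
            missing.append(step)
    return missing


full_run = ["get_compliance_plan", "pull_rules", "submit_signals",
            "pull_rules", "submit_signals"]
review_only = ["get_compliance_plan", "pull_rules", "pull_rules"]

print(missing_steps(full_run))
print(missing_steps(review_only))
```

A non-empty result for a review-only run gives the pipeline a reason to fail the job or re-prompt the agent instead of silently recording nothing.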

5.4 Side note: Gemini 2.5 Pro and general knowledge

Gemini 2.5 Pro could not complete the MCP workflow due to tooling constraints (submit_signals was unavailable in its environment) and API capacity issues. It instead produced a local compliance report as Markdown and JSON files.

One observation is notable: Gemini 2.5 Pro cited the LG München ruling on Google Fonts, a German court decision holding that dynamically embedding Google Fonts without the visitor’s consent unlawfully discloses the visitor’s IP address to Google. No other agent cited specific case law. Claude flagged Google Fonts as a transfer risk but without the legal reference.

This highlights a complementarity between MCP-guided analysis and model general knowledge. The MCP provides structured, requirement-level coverage; a model with strong legal training can add contextual depth — case law, regulatory guidance, enforcement precedent — that structured data alone does not contain.

Section 06

Implications

6.1 For compliance teams

MCP-structured regulatory data provides a consistent compliance baseline across different AI models. The core GDPR gaps were identified by every agent. Organisations can adopt a compliance-as-infrastructure approach: invest in the quality of the regulatory data layer (requirements engineering, bundle structure, evidence type definitions) and treat the choice of AI model as a secondary decision.

  • Multi-agent evaluation. Running two or more agents against the same MCP data yields broader source file discovery. No single agent found all 11 relevant files.
  • Agent rotation. Switching models between evaluation cycles does not break the compliance record, because the evidence schema is defined by the MCP, not by the model.
  • Audit trail independence. Evidence signals stored via submit_signals are model-attributed but schema-consistent. An auditor can review findings from different agents in the same format.

6.2 For MCP server design

The experiment reveals three design improvements for compliance-oriented MCP servers:

  1. Confidence guidance. The schema should define meaningful bands (e.g. 0.0–0.3 = gap identified, no code evidence; 0.4–0.7 = partial evidence found; 0.8–1.0 = requirement fully satisfied with code reference) and require a brief justification string.
  2. Workflow enforcement. The MCP server could require a pull_rules call before accepting submit_signals for that bundle, or return workflow guidance in tool responses.
  3. Signal naming guidance. The 7× difference in unique signal names (91 vs. 13 for roughly the same data) shows that agents need clearer guidance on signal granularity. The schema could suggest a naming convention or reject duplicate signal names within a bundle.
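The three improvements above can be sketched as a single server-side check. This is an illustration, not RuleMesh's implementation: SignalGate, its method names, and the error messages are assumptions. The bands are the example values from item 1, and scores that fall between bands (such as 0.35) are rejected here as a further assumption.

```python
from dataclasses import dataclass, field
from typing import Optional

# Example bands from item 1: (low, high, meaning).
BANDS = (
    (0.0, 0.3, "gap identified, no code evidence"),
    (0.4, 0.7, "partial evidence found"),
    (0.8, 1.0, "requirement fully satisfied with code reference"),
)


def band_for(confidence: float) -> Optional[str]:
    """Return the band meaning for a score, or None if it fits no band."""
    for low, high, meaning in BANDS:
        if low <= confidence <= high:
            return meaning
    return None


@dataclass
class SignalGate:
    """Tracks per-bundle state and validates each incoming signal."""
    pulled: set = field(default_factory=set)
    names: dict = field(default_factory=dict)  # bundle_id -> set of signal names

    def pull_rules(self, bundle_id: str) -> None:
        self.pulled.add(bundle_id)

    def check(self, bundle_id: str, signal_name: str,
              confidence: float, justification: str) -> list[str]:
        errors = []
        if bundle_id not in self.pulled:                 # workflow enforcement
            errors.append("pull_rules must be called for this bundle first")
        if band_for(confidence) is None:                 # confidence guidance
            errors.append("confidence falls outside the defined bands")
        if not justification.strip():
            errors.append("justification is required")
        seen = self.names.setdefault(bundle_id, set())
        if signal_name in seen:                          # naming guidance
            errors.append("duplicate signal name within bundle")
        if not errors:
            seen.add(signal_name)                        # record accepted names only
        return errors
```

Returning a list of messages rather than raising on the first problem lets the server hand the agent all corrective guidance in one tool response.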

6.3 For the broader AI compliance ecosystem

To our knowledge, this is the first published comparison of multiple AI coding agents performing structured compliance evaluation against the same regulatory dataset via MCP. The finding that the data layer, not the model, is the primary determinant of consistent compliance output has implications beyond RuleMesh:

  • Regulatory technology vendors should invest in structured, machine-readable regulatory data rather than optimising for a single AI model.
  • Standards bodies and framework maintainers (for example, those responsible for ISO/IEC 27001 Annex A and the NIST CSF) should consider how their frameworks can be engineered into MCP-compatible requirement bundles for AI-assisted compliance.
  • AI model providers should recognise that compliance evaluation benefits from clear tool-calling conventions, per-signal confidence calibration, and read-only evaluation modes.

Section 07

Limitations

  • Single codebase. The target was one Next.js application. Results may differ for larger codebases or different technology stacks.
  • Single regulation. Only GDPR requirements were tested. Other regulations (NIS2, DORA, AI Act) may surface different agent behaviours.
  • No repeated runs. Each agent was run once. AI model outputs are non-deterministic; repeated runs might yield different signal counts or file discoveries.
  • Gemini 2.5 Pro incomplete. The fourth agent could not complete the MCP workflow, limiting the comparison to three full participants.
  • Confidence scores were not validated. The experiment measured what confidence values agents reported, not whether those values were accurate reflections of finding quality.
  • Small sample of source files. The target codebase contained a limited number of compliance-relevant files, which may have compressed the differences between agents.

Section 08

Conclusion

The choice of AI model matters — but less than the choice of regulatory data infrastructure. When three different AI coding agents were given identical GDPR requirements through a standardised MCP interface, they all found the same critical compliance gaps. They differed in precision, coverage strategy, and workflow adherence, but the compliance baseline was consistent.

For organisations building AI-assisted compliance workflows, the implication is clear: invest in the data layer. Engineer regulatory requirements into structured, machine-readable bundles. Expose them through standardised protocols. Define evidence schemas that are model-agnostic. The AI agent is the lens; the MCP data is the light.

RuleMesh’s MCP server provides this structured regulatory data layer for GDPR today, with plans to expand to additional regulations. Within the limits noted above, the experiment supports the architecture: a consistent compliance baseline, independent of the model consuming it.

Methodology notes

  • All agents evaluated the same Git commit (756a5a7) of the target codebase.
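  • No agent received pre-loaded context, hints, or special instructions beyond the standard MCP tool descriptions at the start of its run. Codex later needed a second explicit instruction before it called submit_signals.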
  • Evidence signals were collected via the submit_signals MCP tool and stored in a structured format for comparison.
  • Claude Opus 4.6 was run in a worktree-isolated environment to prevent any file modifications to the source codebase.
  • The raw comparison data is available in structured JSON format for reproducibility (mcp-cross-agent-comparison-2026-03-24.json).

RuleMesh provides structured GDPR compliance data for AI-assisted development workflows. This whitepaper is published for informational purposes and does not constitute legal advice. Organisations should consult qualified legal counsel for compliance decisions.

Run the same evaluation.

RuleMesh publishes the MCP server and structured GDPR bundles used in this experiment. Connect your own agent and evaluate your own codebase.
