White Paper — Technical Series 04

Agent-Agnostic Compliance

How three AI coding agents interpret identical regulatory data via MCP — and why the data layer, not the model, is what makes compliance reproducible.

Published: March 2026


Three AI coding agents evaluated the same codebase against the same GDPR requirements, delivered through a single Model Context Protocol server. The results show that structured regulatory data, not the choice of model, is the primary driver of consistent compliance coverage.

01 Executive summary

When AI coding agents are asked to evaluate regulatory compliance, does the quality of their analysis depend on the model, or on the data they receive?

RuleMesh ran a controlled experiment. Three leading AI coding agents — Claude Opus 4.6 (Anthropic), Gemini 3 Flash Preview (Google, via JetBrains Junie), and GPT-5.4 (OpenAI, via Codex CLI) — were each given the same task: evaluate a Next.js application against 118 GDPR IT requirements, sourced from seven high-priority compliance modules. All three consumed the same structured data from RuleMesh’s MCP server (v5.1). A fourth model, Gemini 2.5 Pro, participated partially but could not complete the MCP workflow due to tooling constraints.

Key finding: all three agents identified the same critical GDPR gaps — no cookie consent banner, no published DPO contact, no data portability mechanism, and no age verification. Where the agents diverged was in how they reported findings: their precision, coverage strategy, confidence calibration, and adherence to the MCP’s tool-calling workflow.

The paper argues that compliance infrastructure should be model-agnostic by design. The MCP layer — not any individual AI model — is what ensures consistent, auditable, and reproducible regulatory analysis.

02 Background

Software teams face a growing tension. Regulations like the GDPR impose specific, verifiable requirements on digital products: lawful basis for processing, data subject rights, security safeguards, and accountability measures. AI coding agents are becoming embedded in development workflows, capable of reading codebases and flagging issues at scale.

The risk is that compliance analysis becomes coupled to a specific model. If an organisation uses Claude for code review today but switches to a different agent tomorrow, will the compliance findings be consistent? Will the same gaps be surfaced? Will the audit trail remain coherent?

In regulated industries — financial services, healthcare, government technology — auditors require evidence that compliance controls are systematic, not ad hoc. A compliance process that produces different results depending on which AI model happens to be in use is not a process; it is a lottery.

The Model Context Protocol approach

The Model Context Protocol (MCP) provides a standardised interface through which AI agents can discover and invoke structured tools. RuleMesh uses MCP to expose regulatory data — GDPR requirements engineered into IT-actionable checklist items, grouped into thematic modules, each with defined risk levels, evidence types, and verification criteria.

If the data layer is structured and consistent, the compliance output should be consistent too, regardless of which model consumes it. The experiment described in this paper tests that hypothesis.

03 Experiment design

Test parameters

Date	23–24 March 2026
Target codebase	RuleMesh landing page (Next.js 15, React 18)
Commit	756a5a7
MCP version	v5.1 (with layer filtering)
Regulation	GDPR
Modules evaluated	7 (all HIGH priority)
Total requirements	118

Agents under test

Claude Opus 4.6

claude-opus-4-6

Anthropic

Environment: Claude Code CLI (worktree isolation)

High precision. One descriptive label per finding; reports only what it can substantiate from code.

Junie / Gemini 3 Flash

gemini-3-flash-preview

Google (via JetBrains Junie)

Environment: PyCharm Junie agent

Systematic coverage. One signal per requirement, iterating through every module and checklist item.

Codex / GPT-5.4

gpt-5.4

OpenAI

Environment: OpenAI Codex CLI

High-volume, low-specificity category labels reused across many requirements.

Gemini 2.5 Pro (partial)

gemini-2.5-pro

Google

Environment: Gemini CLI / PyCharm plugin

Could not complete the MCP workflow — produced a local markdown/JSON compliance report instead.

MCP workflow

The expected workflow follows three steps. It was not explained to the agents in advance — it was discoverable through the MCP tool descriptions alone.

01 get_compliance_plan: retrieve the list of requirement modules applicable to the target
02 pull_rules: for each module, retrieve the specific checklist items
03 submit_signals: for each finding, submit a structured evidence signal (signal name, evidence type, confidence score, source file, description)
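The three steps above can be sketched as a simple loop. This is a minimal illustration against an in-memory stand-in, not the RuleMesh server or its real API: the class FakeComplianceServer, the module and requirement names, the evidence type values, and the method signatures are all assumptions made for the example. Only the tool names and the five signal fields come from the workflow description.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Signal:
    """One evidence signal with the five fields listed in step 03."""
    signal_name: str
    evidence_type: str   # "code" or "manual" are assumed values
    confidence: float    # 0.0 to 1.0
    source_file: str
    description: str


@dataclass
class FakeComplianceServer:
    """Hypothetical in-memory stand-in for an MCP compliance server."""
    rules: Dict[str, List[str]]
    received: List[Signal] = field(default_factory=list)

    def get_compliance_plan(self) -> List[str]:
        return list(self.rules)

    def pull_rules(self, module: str) -> List[str]:
        return self.rules[module]

    def submit_signals(self, signal: Signal) -> None:
        # One signal per call, as the tool in the experiment required.
        self.received.append(signal)


def run_evaluation(
    server: FakeComplianceServer,
    evaluate: Callable[[str, str], Optional[Signal]],
) -> int:
    """Walk plan -> rules -> signals and return how many signals were submitted."""
    submitted = 0
    for module in server.get_compliance_plan():
        for requirement in server.pull_rules(module):
            signal = evaluate(module, requirement)
            if signal is not None:  # an agent may skip what it cannot substantiate
                server.submit_signals(signal)
                submitted += 1
    return submitted


# Toy data: module and requirement names are invented for illustration.
server = FakeComplianceServer(
    rules={
        "consent": ["cookie-banner"],
        "data-subject-rights": ["portability", "erasure"],
    }
)


def evaluate(module: str, requirement: str) -> Optional[Signal]:
    if requirement == "erasure":
        return Signal("account-deletion-implemented", "code", 0.5,
                      "pages/settings/profile.js", "Account deletion flow found")
    return Signal(f"{requirement}-missing", "manual", 0.5, "",
                  f"No implementation found for {requirement}")


submitted = run_evaluation(server, evaluate)
```

An agent that stops after the second step has still performed an analysis, but nothing reaches `server.received`, which is the audit-trail concern discussed later in the paper.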

04 Quantitative results

Metric	Claude Opus 4.6	Junie / Gemini 3 Flash	Codex / GPT-5.4
Signals reported	39	119	236
Unique signal names	39	91	13
Requirements covered	38 / 118	118 / 118	118 / 118
Code-based findings	22	19	12
Manual / gap findings	17	100	224
Average confidence	0.50	0.95	0.50
Source files cited	8	8	3
Followed MCP workflow autonomously	Yes	Yes	No (required prompting)
Called submit_signals autonomously	Yes	Yes	No
Edited source files during evaluation	No	Yes	Yes

Common findings — identified by all three agents

Despite their differences in reporting strategy, all three agents independently identified the same critical compliance gaps in the target codebase:

No cookie consent banner or consent management platform. Claude and Junie flagged it explicitly by name; Codex captured it under the generic label "lawful-basis-consent-gap."
No Data Protection Officer (DPO) contact published.
No data portability or export mechanism for data subjects.
No age verification despite the Terms of Service requiring users to be 18 or older.

Two files were cited by every agent: pages/terms.js (Terms of Service with data protection clause) and pages/settings/profile.js (account deletion / right to erasure). These represent the only implemented GDPR-relevant features in the target codebase.

Shared findings between Claude and Junie (Codex missed these)

JWT authentication (lib/auth/jwtAuth.js): both identified the authentication mechanism and its relevance to access controls under GDPR Article 32.
HTTPS transport encryption (lib/constants/api.js): both confirmed that API calls enforce TLS.
PostHog analytics without consent (instrumentation-client.js): both flagged the analytics integration as processing personal data without first obtaining consent as its lawful basis.

Unique discoveries per agent

Agent	Files cited only by this agent
Claude	lib/api/client.js, pages/settings/notifications.js
Junie	components/LoginModal.jsx, e2e/auth-complete.spec.js
Codex	pages/settings/security.js

Collectively, the three agents cited 11 distinct source files. No single agent found all 11. This suggests that running multiple agents against the same MCP data yields broader codebase coverage than any single agent alone.
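The coverage argument reduces to set operations over the files each agent cited. The sketch below uses only the file paths named in this section, so its counts are smaller than the reported totals (8, 8, and 3 files per agent, 11 overall), which include at least one file that is not itemised here. The variable names are illustrative.

```python
# Files named in this section, grouped by the agent that cited them.
files_by_agent = {
    "Claude": {
        "pages/terms.js",
        "pages/settings/profile.js",
        "lib/auth/jwtAuth.js",
        "lib/constants/api.js",
        "instrumentation-client.js",
        "lib/api/client.js",
        "pages/settings/notifications.js",
    },
    "Junie": {
        "pages/terms.js",
        "pages/settings/profile.js",
        "lib/auth/jwtAuth.js",
        "lib/constants/api.js",
        "instrumentation-client.js",
        "components/LoginModal.jsx",
        "e2e/auth-complete.spec.js",
    },
    "Codex": {
        "pages/terms.js",
        "pages/settings/profile.js",
        "pages/settings/security.js",
    },
}

# Union: everything any agent found. Intersection: what every agent found.
all_files = set().union(*files_by_agent.values())
cited_by_all = set.intersection(*files_by_agent.values())

# Files that only one agent cited.
only_by = {
    agent: files - set().union(
        *(other for name, other in files_by_agent.items() if name != agent)
    )
    for agent, files in files_by_agent.items()
}

# True when no single agent covers the combined set.
no_agent_is_complete = all(files < all_files for files in files_by_agent.values())
```

Each agent's set is a strict subset of the union, which is the property the multi-agent recommendation rests on.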

05 Analysis

4.1 Signal quality: precision vs. coverage

The most revealing metric is the ratio of unique signal names to total signals reported. This captures how specifically each agent names and categorises its findings.

Agent	Total signals	Unique names	Ratio	Strategy
Claude	39	39	1.00	One descriptive label per finding
Junie	119	91	0.76	One signal per requirement, some grouped
Codex	236	13	0.06	Broad category labels reused

Claude was the most precise. Every signal had a unique, descriptive name. It reported only findings it could substantiate from the codebase, choosing to skip requirements where it could not determine evidence. This produced a concise evidence record — 39 signals covering 38 requirements.

Junie prioritised coverage. It evaluated every one of the 118 requirements and produced at least one signal for each, using 91 unique names. Its systematic approach is best suited to audit-readiness, where demonstrating that every requirement has been considered is as important as the findings themselves.

Codex reported the highest volume (236 signals) but with the lowest specificity (13 unique names). The label "governance-gap" was applied 64 times across different requirements; "security-control-gap" was used 34 times. Every code-type finding was duplicated. This bulk approach achieves nominal full coverage but makes it difficult to distinguish one finding from another.
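The ratio in the table above is the number of unique signal names divided by the total number of signals. A short sketch of that calculation follows, first from the reported counts and then from a raw list of names. The helper names and the sample list are illustrative; "governance-gap" and "security-control-gap" are labels reported in the experiment, while "dpo-contact-missing" is invented for the example.

```python
from collections import Counter
from typing import List, Tuple


def naming_ratio(signal_names: List[str]) -> float:
    """Unique signal names divided by total signals (1.0 means no label reuse)."""
    if not signal_names:
        raise ValueError("at least one signal is required")
    return len(set(signal_names)) / len(signal_names)


def most_reused(signal_names: List[str], top: int = 2) -> List[Tuple[str, int]]:
    """The labels applied most often, with their counts."""
    return Counter(signal_names).most_common(top)


# (total signals, unique names) as reported in the results table.
reported = {"Claude": (39, 39), "Junie": (119, 91), "Codex": (236, 13)}
ratios = {
    agent: round(unique / total, 2)
    for agent, (total, unique) in reported.items()
}

# A small invented sample showing the same calculation on raw names.
sample = ["governance-gap"] * 3 + ["security-control-gap"] * 2 + ["dpo-contact-missing"]
sample_ratio = naming_ratio(sample)
sample_top = most_reused(sample)
```

A low ratio on its own does not show that findings are wrong, only that many requirements share a label, so a reviewer would need the description field to tell them apart.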

4.2 Confidence calibration

All agents were asked to provide a confidence score (0.0 to 1.0) with each evidence signal. The results reveal an unsolved problem in AI-assisted compliance:

Claude and Codex both used the default value of 0.50 for every signal, never adjusting it.
Junie assigned 0.95 uniformly across all 119 signals.

None of the three agents varied confidence on a per-signal basis. A finding backed by a specific code reference received the same confidence score as a finding noting the absence of an entire feature. Meaningful confidence differentiation requires explicit schema-level guidance — the MCP tool definition should either constrain the confidence scale to meaningful bands or require agents to justify their score.

This is a design opportunity, not a failure. The MCP schema can be updated to require structured justification for confidence scores, turning confidence from a perfunctory field into an auditable data point.
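The pattern described above, where every signal carries the same score, is easy to detect mechanically. The following is a minimal sketch of such a check, assuming the confidence values for one agent run are available as a list. The function name and the tolerance are assumptions, and the check says nothing about whether varied scores are accurate.

```python
from typing import Sequence


def is_uniform_confidence(confidences: Sequence[float], tolerance: float = 1e-9) -> bool:
    """Return True when all confidence values are effectively identical.

    A run with fewer than two signals cannot show variation either way,
    so it is not flagged.
    """
    if len(confidences) < 2:
        return False
    return max(confidences) - min(confidences) <= tolerance


# Confidence patterns as reported in the experiment.
claude_run = [0.50] * 39
junie_run = [0.95] * 119

# An invented run with per-signal variation, for contrast.
varied_run = [0.2, 0.5, 0.9]
```

A server or a post-processing step could use a check like this to flag runs whose confidence field carries no information.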

4.3 Workflow adherence

The MCP workflow (get_compliance_plan → pull_rules → submit_signals) was designed to be self-documenting through tool descriptions. Two of the three agents discovered and followed this workflow autonomously:

Claude called the tools in the expected sequence from its first message, treating evidence reporting as a natural part of the review process.
Junie followed the workflow systematically, iterating through all modules and requirements.
Codex did not call submit_signals initially. It interpreted the task as a review-only exercise and produced a summary instead. A second explicit instruction was required to trigger evidence reporting. It also attempted to batch-submit evidence before learning that submit_signals accepts one signal at a time.

This divergence in workflow adherence is significant for production use. In an automated pipeline, an agent that does not call submit_signals produces no audit trail — the analysis may have been performed, but the findings are not captured in the compliance record. MCP server design should account for this by making the expected workflow explicit.
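One way a server could make the expected order explicit is to track which modules a session has pulled and to answer an out-of-order submission with guidance instead of silently accepting it. The sketch below is an assumption about how such a guard might look. WorkflowGuard and its methods are hypothetical and are not part of the RuleMesh server or of MCP itself.

```python
from typing import Optional, Set


class WorkflowGuard:
    """Hypothetical per-session tracker for the pull_rules -> submit_signals order."""

    def __init__(self) -> None:
        self._pulled: Set[str] = set()

    def record_pull(self, module: str) -> None:
        """Call this when the agent invokes pull_rules for a module."""
        self._pulled.add(module)

    def check_submit(self, module: str) -> Optional[str]:
        """Return None if a submission is allowed, else a guidance message."""
        if module in self._pulled:
            return None
        return (
            f"Rejected: call pull_rules for module '{module}' "
            "before submit_signals."
        )


guard = WorkflowGuard()
before = guard.check_submit("consent")   # guidance text, nothing pulled yet
guard.record_pull("consent")
after = guard.check_submit("consent")    # None, submission allowed
```

Returning the guidance as part of the tool response keeps the correction inside the conversation the agent is already reading. Note that a guard like this only catches out-of-order calls; it cannot detect an agent that never calls submit_signals at all.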

4.4 Side note: Gemini 2.5 Pro and general knowledge

Gemini 2.5 Pro could not complete the MCP workflow due to tooling constraints (submit_signals was unavailable in its environment) and API capacity issues. It instead produced a local compliance report as Markdown and JSON files.

One observation is notable: Gemini 2.5 Pro cited the LG München ruling on Google Fonts, a German court decision (Landgericht München I, January 2022) holding that dynamically embedding Google Fonts, which passes the visitor's IP address to Google without consent, is unlawful. No other agent cited specific case law. Claude flagged Google Fonts as a transfer risk but without the legal reference.

This highlights a complementarity between MCP-guided analysis and model general knowledge. The MCP provides structured, requirement-level coverage; a model with strong legal training can add contextual depth — case law, regulatory guidance, enforcement precedent — that structured data alone does not contain.

06 Implications

5.1 For compliance teams

MCP-structured regulatory data provides a consistent compliance baseline across different AI models. The core GDPR gaps were identified by every agent. Organisations can adopt a compliance-as-infrastructure approach: invest in the quality of the regulatory data layer (requirements engineering, module structure, evidence type definitions) and treat the choice of AI model as a secondary decision.

Multi-agent evaluation. Running two or more agents against the same MCP data yields broader source file discovery. No single agent found all 11 relevant files.
Agent rotation. Switching models between evaluation cycles does not break the compliance record, because the evidence schema is defined by the MCP, not by the model.
Audit trail independence. Evidence signals stored via submit_signals are model-attributed but schema-consistent. An auditor can review findings from different agents in the same format.

5.2 For MCP server design

The experiment reveals three design improvements for compliance-oriented MCP servers:

Confidence guidance. The schema should define meaningful bands (e.g. 0.0–0.3 = gap identified, no code evidence; 0.4–0.7 = partial evidence found; 0.8–1.0 = requirement fully satisfied with code reference) and require a brief justification string.
Workflow enforcement. The MCP server could require a pull_rules call before accepting submit_signals for that module, or return workflow guidance in tool responses.
Signal naming guidance. The sevenfold difference in unique signal names (91 for Junie vs. 13 for Codex, over the same 118 requirements) shows that agents need clearer guidance on signal granularity. The schema could suggest a naming convention or reject duplicate signal names within a module.
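The confidence and naming suggestions above can be sketched as a single validation step. The band boundaries and labels come from the example bands in this section. Everything else is an assumption for illustration: the function names, the decision to reject values that fall between bands, and the use of a per-module set of names already seen.

```python
from typing import List, Optional, Set

# (low, high, meaning) taken from the example bands suggested above.
BANDS = [
    (0.0, 0.3, "gap identified, no code evidence"),
    (0.4, 0.7, "partial evidence found"),
    (0.8, 1.0, "requirement fully satisfied with code reference"),
]


def band_for(confidence: float) -> Optional[str]:
    """Return the meaning of the band containing the score, or None."""
    for low, high, meaning in BANDS:
        if low <= confidence <= high:
            return meaning
    return None


def validate_signal(
    name: str,
    confidence: float,
    justification: str,
    seen_names: Set[str],
) -> List[str]:
    """Return a list of problems; an empty list means the signal is accepted."""
    problems = []
    if band_for(confidence) is None:
        # Assumption: scores between bands (e.g. 0.35) are rejected.
        problems.append(f"confidence {confidence} is outside the defined bands")
    if not justification.strip():
        problems.append("a justification string is required")
    if name in seen_names:
        problems.append(f"signal name '{name}' is already used in this module")
    return problems


# Invented signals for illustration.
ok = validate_signal("cookie-consent-missing", 0.2, "No consent component found", set())
bad = validate_signal("governance-gap", 0.35, "  ", {"governance-gap"})
```

Whether duplicates should be rejected outright or only flagged is a design choice. Rejection forces more specific names but may also push an agent to invent distinctions that the evidence does not support.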

5.3 For the broader AI compliance ecosystem

To our knowledge, this is the first published comparison of multiple AI coding agents performing structured compliance evaluation against the same regulatory dataset via MCP. The finding that the data layer, not the model, is the primary determinant of consistent compliance output has implications beyond RuleMesh:

Regulatory technology vendors should invest in structured, machine-readable regulatory data rather than optimising for a single AI model.
Standards bodies and framework maintainers (for example those behind ISO/IEC 27001 Annex A and the NIST CSF) should consider how their frameworks can be engineered into MCP-compatible requirement modules.
AI model providers should recognise that compliance evaluation benefits from clear tool-calling conventions, per-signal confidence calibration, and read-only evaluation modes.

07 Limitations

This whitepaper presents findings from a single controlled comparison. Several limitations should be noted:

Single codebase. The target was one Next.js application. Results may differ for larger codebases or different technology stacks.
Single regulation. Only GDPR requirements were tested. Other regulations (NIS2, DORA, AI Act) may surface different agent behaviours.
No repeated runs. Each agent was run once. AI model outputs are non-deterministic; repeated runs might yield different signal counts or file discoveries.
Gemini 2.5 Pro incomplete. The fourth agent could not complete the MCP workflow, limiting the comparison to three full participants.
Confidence scores were not validated. The experiment measured what confidence values agents reported, not whether those values were accurate reflections of finding quality.
Small sample of source files. The target codebase contained a limited number of compliance-relevant files, which may have compressed the differences between agents.

08 Conclusion

The choice of AI model matters — but less than the choice of regulatory data infrastructure. When three different AI coding agents were given identical GDPR requirements through a standardised MCP interface, they all found the same critical compliance gaps. They differed in precision, coverage strategy, and workflow adherence, but the compliance baseline was consistent.

For organisations building AI-assisted compliance workflows, the implication is clear: invest in the data layer. Engineer regulatory requirements into structured, machine-readable modules. Expose them through standardised protocols. Define evidence schemas that are model-agnostic. The AI agent is the lens; the MCP data is the light.

RuleMesh’s MCP server provides this structured regulatory data layer for GDPR today, with plans to expand to additional regulations. Within the limits noted above, the experiment supports the architecture: the three agents reached a consistent compliance baseline, independent of the model consuming the data.

Methodology notes

All agents evaluated the same Git commit (756a5a7) of the target codebase.
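No agent received pre-loaded context or hints beyond the task prompt and the standard MCP tool descriptions. The one exception is the second explicit instruction that Codex needed before it called submit_signals.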
Evidence signals were collected via the submit_signals MCP tool and stored in a structured format for comparison.
Claude Opus 4.6 was run in a worktree-isolated environment to prevent any file modifications to the source codebase.
The raw comparison data is available in structured JSON format for reproducibility (mcp-cross-agent-comparison-2026-03-24.json).

RuleMesh provides structured GDPR compliance data for AI-assisted development workflows. This whitepaper is published for informational purposes and does not constitute legal advice. Organisations should consult qualified legal counsel for compliance decisions.

Run the same evaluation

RuleMesh publishes the MCP server and the structured GDPR modules used in this experiment. Connect your own agent and evaluate your own codebase.
