
Finance-Grade Assurance for Agentic AI: Monoculture Risk and the Heterogeneity Score

By Marc Molas·March 30, 2026·12 min read

Most AI governance discussion treats safety as a single property of an individual system. Banks and insurers don't have that luxury. When agentic AI ships into financial workflows (credit decisions, trade execution, claims handling, AML review), the risk surface includes not just the per-agent failure mode but the systemic one: many agents across many institutions, sharing the same model family, reacting to the same prompt distribution, and making correlated bad decisions at the same time.

That's not a hypothetical. It's the same kind of correlated-failure risk that drove regulators to care about model monoculture in quantitative finance two decades ago. The paper Finance-Grade Assurance for Agentic AI (Fradelos, January 2026) takes the verifiable-governance pattern and extends it explicitly to high-stakes financial workflows. It makes two headline contributions: a layered control system the paper calls FG-VGA, and an operational metric called the Heterogeneity Score (HS) that treats model monoculture as a first-class auditable risk.

This is the paper to read if you're a CTO at a financial institution shipping agents into anything regulators care about. It's also useful well outside finance, because the architectural pattern generalizes.

What Makes Governance "Finance-Grade"

Finance-grade assurance is not just "more rigorous" governance. It's a specific shape that supervisory regimes (model risk management, operational resilience, ESRB/FSB systemic-risk concerns) actually require. The paper identifies four properties that current AI governance approaches typically lack:

  1. Machine-verifiable policy gating for agentic actions — not "the model is supposed to follow this policy," but "the runtime cannot execute the action unless policy verification passes."
  2. Evidence packets that bind intent, tool calls, and outcomes — every action produces a signed record that ties the agent's stated intent, the actual tool call, and the observed outcome together. Reconstructable. Tamper-evident.
  3. Attestation-linked deployment controls — agents only run on attested execution environments. The evidence packet links to the attestation, so an auditor can verify that a given action was taken by the expected code on the expected hardware.
  4. An operational metric that treats correlated agent behavior as a first-class risk — not just per-agent risk, but the systemic risk of many agents converging on the same answer because they share the same underlying model.

The first three are extensions of the verifiable-governance architecture pattern. The fourth is the genuinely new contribution.
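
To make the first property concrete, here is a minimal sketch of a runtime that cannot execute an action unless policy verification passes. It is an illustration under stated assumptions, not the paper's implementation: `Action`, `max_amount`, `allowed_tools`, and `gated_execute` are hypothetical names, and real policies would be far richer than two rules.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Action:
    """Hypothetical shape of an agent action awaiting authorization."""
    tool: str
    amount: float
    actor_role: str


class PolicyDenied(Exception):
    """Raised when an action fails policy verification."""


# A rule returns None when satisfied, or a violation message otherwise.
Rule = Callable[[Action], Optional[str]]


def max_amount(limit: float) -> Rule:
    def rule(action: Action) -> Optional[str]:
        if action.amount > limit:
            return f"amount {action.amount} exceeds limit {limit}"
        return None
    return rule


def allowed_tools(tools: Iterable[str]) -> Rule:
    permitted = frozenset(tools)

    def rule(action: Action) -> Optional[str]:
        if action.tool not in permitted:
            return f"tool {action.tool!r} is not permitted"
        return None
    return rule


def gated_execute(action: Action, rules, execute):
    """Run `execute` only if every rule passes; otherwise raise PolicyDenied."""
    violations = [msg for msg in (rule(action) for rule in rules) if msg is not None]
    if violations:
        raise PolicyDenied("; ".join(violations))
    return execute(action)
```

The design point is that the tool call lives behind `gated_execute`, so a model that ignores its instructions still cannot reach the tool without passing the rules.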

The Heterogeneity Score

The Heterogeneity Score (HS) is an enforceable, auditable metric of how much model and vendor diversification exists in a given agentic deployment. The intent is to operationalize what's been a hand-wavy concern in AI risk discussion: the fact that if every bank's agentic AI for credit decisions is built on the same two foundation models, the failure mode of those models becomes systemic.

The HS is computed for the agentic deployment in scope and enforced as an authorization condition. At or above the threshold, deployment is permitted. Below it, the deployment is blocked unless a senior accountable individual explicitly accepts the risk.

Three things make HS practical:

It's measurable

The HS is constructed from concrete inputs: the set of model families in use, the set of vendors, the correlation of agent behavior on a benchmark distribution. These are auditable quantities. They're not perfect — model behavior correlation is a hard thing to measure rigorously — but they're concrete enough to gate on.
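
As a rough illustration of how those three inputs could be folded into one number, the sketch below combines family diversity, vendor diversity, and behavioral disagreement on a shared benchmark. The formula is an assumption made for this example, not the one defined in the paper: it uses one minus the Herfindahl index for concentration, a simple pairwise agreement rate as a crude stand-in for behavioral correlation, and takes the weakest of the three components as the score.

```python
from collections import Counter
from itertools import combinations


def share_diversity(labels):
    """1 minus the Herfindahl index: 0.0 when every label is identical."""
    counts = Counter(labels)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def mean_pairwise_agreement(decisions):
    """decisions maps agent name -> list of decisions on the same benchmark cases."""
    pairs = list(combinations(decisions.values(), 2))
    if not pairs:
        return 1.0  # a single agent trivially agrees with itself
    rates = [sum(x == y for x, y in zip(a, b)) / len(a) for a, b in pairs]
    return sum(rates) / len(rates)


def heterogeneity_score(families, vendors, decisions):
    """Hypothetical HS: the weakest of the three diversity components."""
    return min(
        share_diversity(families),
        share_diversity(vendors),
        1.0 - mean_pairwise_agreement(decisions),
    )
```

Taking the minimum means a deployment cannot compensate for a single-vendor setup with a good benchmark result. Whether that is the right aggregation is a judgment call, and the paper may aggregate differently.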

It's a deployment gate, not a reporting metric

This is the operational difference. Most "diversity" requirements in AI risk frameworks are reporting requirements: you describe what you're doing, the regulator reviews it, deployment proceeds. The HS is a gate: the deployment runtime checks the score and refuses to proceed if it's below threshold. The refusal is a property of the system, not a property of human judgment.
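
A minimal sketch of such a gate, assuming the score has already been computed elsewhere. The names `RiskAcceptance`, `DeploymentBlocked`, and `authorize_deployment` are hypothetical, and the threshold value is whatever the institution's policy sets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RiskAcceptance:
    """Hypothetical record of an explicit sign-off by a senior accountable individual."""
    accepted_by: str
    rationale: str


class DeploymentBlocked(Exception):
    """Raised when the score is below threshold and no valid acceptance exists."""


def authorize_deployment(hs: float, threshold: float,
                         acceptance: Optional[RiskAcceptance] = None) -> str:
    # At or above threshold: the deployment proceeds without human input.
    if hs >= threshold:
        return "authorized"
    # Below threshold: proceed only with a complete, explicit risk acceptance.
    if acceptance is not None and acceptance.accepted_by and acceptance.rationale:
        return f"authorized-with-risk-acceptance:{acceptance.accepted_by}"
    raise DeploymentBlocked(f"HS {hs:.2f} is below threshold {threshold:.2f}")
```

The refusal is an exception raised by the deployment path itself, which is what distinguishes a gate from a report that someone reviews after the fact.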

It maps to systemic risk concerns regulators are already raising

ESRB, FSB, FINMA, and others have been signaling concern about model monoculture in financial AI. The HS is designed to be the concrete metric supervisors can examine, not just a vague claim that "we use multiple vendors."

The Four Auditable Currencies

The deeper architectural move in the paper is decomposing safety into four auditable "currencies":

  • Probabilistic safety: how likely the system is to violate safety bounds, with quantitative evidence.
  • Energy and compute safety: the resource cost of operating the system, including peak load and correlated demand.
  • Epistemic safety: the knowledge integrity of the system — does it know what it knows, does it flag uncertainty, does it cross-check.
  • Social and environmental safety: the externalities of operating the system — fairness, environmental footprint, social impact.

Each currency has its own measurement methodology, evidence format, and audit cadence. The governance pipeline reassembles them into a deployment authorization decision.

The reason this decomposition matters is that the four currencies don't trade off cleanly. A system can be probabilistically safe and energetically wasteful. It can be epistemically rigorous and socially harmful. Treating "AI safety" as a single scalar metric obscures these trade-offs. Treating it as four separately accounted currencies makes the trade-offs explicit and auditable.
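
One way to keep the currencies separately accounted is to refuse to average them. The sketch below assumes a simple rule that is not spelled out in the source: authorization requires an assessment for each of the four currencies, and every one of them must pass on its own. `Currency`, `Assessment`, and `authorize` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Currency(Enum):
    PROBABILISTIC = "probabilistic"
    ENERGY_COMPUTE = "energy_compute"
    EPISTEMIC = "epistemic"
    SOCIAL_ENVIRONMENTAL = "social_environmental"


@dataclass(frozen=True)
class Assessment:
    currency: Currency
    passed: bool
    evidence_ref: str  # pointer to the audit evidence for this currency


def authorize(assessments):
    """Return (authorized, missing, failed) without collapsing to one scalar."""
    by_currency = {a.currency: a for a in assessments}
    missing = [c for c in Currency if c not in by_currency]
    failed = [c for c in Currency if c in by_currency and not by_currency[c].passed]
    return (not missing and not failed), missing, failed
```

Returning the missing and failed currencies alongside the decision keeps the trade-off visible: a system that is probabilistically safe but fails on energy and compute is reported as exactly that.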

What an Evidence Packet Actually Contains

The evidence packet is the unit of auditable record. For each agent action with regulatory significance, the packet should bind:

  • Intent: the agent's stated objective for the action, as derived from its reasoning trace.
  • Authorization context: the policy decisions evaluated, the seniority of the agent, the multi-party signoffs (if any).
  • Tool call: the precise tool invocation, parameters, target system.
  • Pre-action state: what was true before the action.
  • Outcome: what the tool returned and what state changed.
  • Post-action state: what's true after.
  • Attestation pointer: a cryptographic reference to the runtime attestation (the agent ran on this code on this hardware in this configuration).

These packets are signed by a component the paper calls the Watchdog, stored in an immutable evidence store, and made available to internal and external auditors on demand. They become the substrate of compliance: not "we trust the agent to behave," but "here's the cryptographically signed record of what the agent actually did."
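
The fields above map naturally onto a small record type. The sketch below is an assumption-laden illustration rather than the paper's format: the field names are invented for this example, and HMAC-SHA256 with a shared key stands in for whatever signature scheme a real deployment would use (most likely an asymmetric signature tied to the attested environment).

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvidencePacket:
    intent: str
    authorization_context: dict
    tool_call: dict
    pre_state: dict
    outcome: dict
    post_state: dict
    attestation_ref: str  # pointer to the runtime attestation, not the attestation itself


def canonical_bytes(packet: EvidencePacket) -> bytes:
    """Serialize deterministically so the same packet always signs the same bytes."""
    return json.dumps(asdict(packet), sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_packet(packet: EvidencePacket, key: bytes) -> str:
    """HMAC-SHA256 as a stand-in for the signer's real signature scheme."""
    return hmac.new(key, canonical_bytes(packet), hashlib.sha256).hexdigest()


def verify_packet(packet: EvidencePacket, signature: str, key: bytes) -> bool:
    """Constant-time comparison; any change to any field invalidates the signature."""
    return hmac.compare_digest(sign_packet(packet, key), signature)
```

Because intent, tool call, and outcome are signed as one unit, changing any single field after the fact makes verification fail, which is the tamper-evidence property the packet is meant to provide.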

Why Model Risk Management Needs Updating

Existing model risk management (MRM) frameworks were designed for predictive models. The model is a fixed artifact; you validate it, you monitor it for drift, you re-validate periodically. Agentic AI breaks this pattern in two ways:

  1. The agent's behavior changes with context. The same model can take different actions depending on the prompt, the conversation history, the available tools, the user's role. MRM that validates "the model" doesn't tell you what the agent will do.

  2. The risk surface is action-shaped, not prediction-shaped. Predictive models produce outputs that humans then act on. Agents produce actions directly. The risk from agents is action risk, not prediction risk. MRM frameworks designed for prediction risk are missing the relevant unit.

The FG-VGA pattern addresses both: validation is at the policy and authorization level, not the model level; monitoring is on action distributions, not output distributions; the immutable evidence store provides the per-action record that action-level risk management requires.
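
Monitoring on action distributions can be sketched as a comparison between a validated baseline window and a live window. The choice of total variation distance here is an assumption for illustration (the source does not name a metric), and the action labels are hypothetical.

```python
from collections import Counter


def action_distribution(actions):
    """Turn a list of action labels into relative frequencies."""
    if not actions:
        raise ValueError("need at least one action to form a distribution")
    counts = Counter(actions)
    total = sum(counts.values())
    return {action: count / total for action, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two distributions: 0 is identical, 1 is disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


baseline = action_distribution(["approve"] * 8 + ["escalate"] * 2)
current = action_distribution(["approve"] * 5 + ["escalate"] * 2 + ["block"] * 3)
drift = total_variation(baseline, current)
```

A drift value above an agreed limit would trigger re-validation of the policy and authorization layer, even when the underlying model has not changed.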

What CTOs at Financial Institutions Should Be Doing

Three concrete actions for any financial institution that's actively deploying agentic AI:

1. Adopt action-level evidence packets now

Whether or not your regulator currently requires it, build the evidence packet generation into the agent runtime. The cost of retrofitting it later is dramatically higher than building it in. The internal value alone — debugging, incident analysis, capability evaluation — usually justifies the cost.

2. Measure your Heterogeneity Score even informally

Even if you don't formalize the HS computation, audit your model diversification. If your fraud-detection agent, your AML agent, your KYC agent, and your customer-service agent are all on the same foundation model from the same vendor, you have an unmeasured monoculture risk. Diversification across model families is the practical mitigation.
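
An informal audit can start from nothing more than an inventory of which agent runs on which model. The sketch below assumes a hypothetical inventory mapping each agent name to a `(vendor, model_family)` pair and reports how concentrated the deployment is.

```python
from collections import defaultdict


def concentration_report(inventory):
    """inventory maps agent name -> (vendor, model_family).

    Returns rows of (key, agents, share), most concentrated first.
    """
    groups = defaultdict(list)
    for agent, key in inventory.items():
        groups[key].append(agent)
    total = len(inventory)
    rows = [(key, sorted(agents), len(agents) / total) for key, agents in groups.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


inventory = {
    "fraud-detection": ("VendorA", "family-1"),
    "aml": ("VendorA", "family-1"),
    "kyc": ("VendorA", "family-1"),
    "customer-service": ("VendorA", "family-1"),
}
report = concentration_report(inventory)
```

In this example every agent lands in one group with a share of 1.0, which is the unmeasured monoculture described above.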

3. Plan for attestation

Confidential computing and remote attestation are not yet mainstream in production AI deployments, but the regulatory direction is clear. Agentic AI in regulated workflows will need attestable execution within the next few years. Building toward attestation-ready deployment now is much cheaper than retrofitting.

What This Means Outside Finance

The pattern generalizes well beyond finance. Any sector with:

  • High-stakes irreversible actions (healthcare, legal, infrastructure)
  • Regulatory accountability requirements (utilities, insurance, public services)
  • Systemic correlated-failure concerns (anywhere an AI mistake at scale creates cascading harm)

…benefits from the same architecture. The Heterogeneity Score concept applies to any deployment where many independent operators might converge on the same model. The evidence packet pattern applies to any deployment where post-incident reconstruction matters. The four-currency decomposition applies anywhere safety isn't a scalar.

Finance-grade assurance is, in effect, the high-bar version of agentic AI governance. The mid-bar versions look very similar, only with relaxed audit cadences and lighter attestation requirements. CTOs who build for the high bar end up with infrastructure that also covers the mid bar. Building only for the mid bar typically requires a rebuild when the bar moves.

The bar is moving. Finance is just one of the early movers.


Source: Fradelos, G. Finance-Grade Assurance for Agentic AI: Verifiable Governance, Systemic Risk Mitigation, and Sustainability/Compute Accounting Architecture for banks, insurers, and major financial services providers (Geneva, January 11, 2026). SSRN 6306980.
