
Certifiable AI Safety: Proof-Carrying Deployment Gates for Tool-Using LLMs

By Marc Molas · April 13, 2026 · 11 min read

There's a specific failure mode in modern tool-using LLMs that doesn't get enough attention: the data the system generates is not statistically independent of the data the monitor sees later. The LLM emits a token distribution. The system samples from it. Some samples invoke tools. Tool outputs feed back into context. The next token distribution is conditioned on those outputs. Everything is adaptive, dependent, recursive.

Standard statistical monitoring assumes independent observations. Standard drift detection assumes a fixed reference distribution. Standard A/B testing assumes the test doesn't influence the data. None of these assumptions hold for a tool-using LLM in production. The agent's own behavior — and the tools' responses — change the distribution the monitor is observing.

The recent paper Certifiable AI Safety Theory (CAST): Convex-Analytic, Measure-Theoretic, and Proof-Carrying Deployment Gates for Tool-Using LLM Systems (Fradelos, February 2026) tackles this head-on. The contribution is a mathematical framework for certifying the safety of exactly this class of systems, using anytime-valid statistical monitoring based on e-processes, which are explicitly designed for adaptive, dependent data.

The math is dense. The engineering pattern is more accessible. Worth understanding because the alternative — pretending the dependency problem doesn't exist — is the silent failure mode of most current production AI monitoring.

Why "Proof-Carrying" Matters

The core idea: at each decision point — every time the agent wants to take an action, invoke a tool, emit a response — a certificate is constructed showing that the action belongs to a safe set. The runtime verifies the certificate in polynomial time. If the certificate is valid, the action proceeds. If it isn't, the system either falls back to a safe default ("dummy output") or projects the action to a nearby safe alternative ("convex repair").

This is the same idea as proof-carrying code in systems software: the producer of an action provides a certificate that a verifier can check cheaply. Trust shifts from "the producer is well-behaved" to "the certificate is valid."

For AI systems, this is operationally important because:

Verification is cheaper than retraining. Once the certificate format is fixed, checking is fast. Retraining a model to be safer is expensive and uncertain.
The verifier can be trusted independently of the model. A small, simple, formally specified verifier can be trusted at a higher confidence level than a large opaque model.
The fallback is principled. When the certificate fails, the system knows to fail safely, either by emitting a dummy output or by projecting to a safe alternative.
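
As a minimal sketch of that gate, assume a hypothetical Proposal type, a caller-supplied verify function, and a toy safe set (transfer amounts up to a made-up LIMIT of 100). None of these names come from the paper; the point is only that the runtime trusts the check, not the producer.

```python
from dataclasses import dataclass
from typing import Any, Callable

LIMIT = 100  # hypothetical policy bound defining the toy safe set


@dataclass(frozen=True)
class Proposal:
    action: Any       # what the agent wants to do
    certificate: Any  # evidence the agent attaches to the action


def gate(proposal: Proposal, verify: Callable[[Any, Any], bool], fallback: Any) -> Any:
    """Return the proposed action only if its certificate verifies."""
    try:
        ok = bool(verify(proposal.action, proposal.certificate))
    except Exception:
        ok = False  # a verifier error counts as a failed certificate
    return proposal.action if ok else fallback


def verify_bound(amount: Any, claimed_bound: Any) -> bool:
    """Toy verifier: the certificate is a claimed upper bound on the amount."""
    return 0 <= amount <= claimed_bound <= LIMIT
```

Calling gate(Proposal(40, 50), verify_bound, fallback=0) lets the action through, while a proposal whose certificate is missing, malformed, or over the limit yields the fallback. Treating verifier exceptions as failures is a design choice in this sketch: the gate fails closed.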

The Adaptive Data Problem

The paper's central technical move is using e-processes for monitoring. An e-process is a nonnegative evidence accumulator built on supermartingales: under the null hypothesis, its expected value is at most 1 at any stopping time. That gives e-processes a property regular p-values don't have: they remain valid under optional stopping and adaptive data collection.

In plain terms: with regular statistical tests, the validity of the test depends on you committing in advance to how much data you'll collect and what you'll test for. If you peek at the data and decide whether to keep collecting based on what you saw, you've broken the test's guarantees. This is the optional stopping problem.

For an AI system in production, you can't avoid optional stopping. You're monitoring continuously, you're going to act on the monitoring (alerting, throttling, blocking), and your actions affect the data the monitor is observing. Regular statistical tests give you guarantees that don't hold in this regime.

E-process guarantees do hold. They give time-uniform validity: you can monitor continuously and stop whenever you like, and the bound on the probability of ever raising a false alarm still applies. This is the mathematical foundation that makes continuous AI safety monitoring work.

The practical implication: if you're using p-value-based drift detection on an AI system in production, you're operating outside the valid regime of the statistical method. This is something almost nobody is paying attention to in 2026, and it's the kind of mistake that produces both false confidence (when it doesn't bite) and unexplained false alarms (when it does).
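
To make the time-uniform guarantee concrete, here is a minimal sketch of one standard construction, a betting-style test supermartingale for a violation rate. It is not the paper's monitor. The class name, the fixed bet size lam, and the null hypothesis (the conditional probability of a violation given the past is at most p0) are assumptions chosen for illustration. Under that null the running product has expectation at most 1, so by Ville's inequality the chance it ever reaches 1/alpha is at most alpha, no matter when you look or stop.

```python
import math


class ViolationEProcess:
    """Test supermartingale for H0: P(violation | past) <= p0 at every step."""

    def __init__(self, p0: float, alpha: float, lam: float):
        if not 0.0 < p0 < 1.0:
            raise ValueError("p0 must be in (0, 1)")
        if not 0.0 < alpha < 1.0:
            raise ValueError("alpha must be in (0, 1)")
        if not 0.0 < lam < 1.0 / p0:
            raise ValueError("lam must be in (0, 1/p0) so every factor stays positive")
        self.p0 = p0
        self.lam = lam
        self.log_threshold = math.log(1.0 / alpha)  # Ville threshold, in log space
        self.log_e = 0.0       # log of the running e-value, starts at log(1)
        self.alarmed = False   # once raised, the alarm stays raised

    def update(self, violated: bool) -> bool:
        """Fold in one observation; return True once the alarm has fired."""
        x = 1.0 if violated else 0.0
        # Each factor 1 + lam * (x - p0) has conditional mean <= 1 under H0.
        self.log_e += math.log1p(self.lam * (x - self.p0))
        self.alarmed = self.alarmed or self.log_e >= self.log_threshold
        return self.alarmed
```

With p0 = 0.01, alpha = 0.05, and lam = 10, a clean observation multiplies the evidence by 0.9 and a violation multiplies it by 10.9, so two violations in a row from a fresh monitor cross the threshold of 20. A fixed lam is the simplest choice; the guarantee also holds if lam is chosen from past data only.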

The Monitor Construction Layer

The paper specifies a canonical monitor construction with four components:

1. Trusted logging

Every input, every model output, every tool call, every tool response is logged in a way the agent cannot forge. The log is the substrate for everything else.
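
One common way to get a log the agent cannot forge is a hash chain authenticated with a key the agent never sees. The sketch below assumes that approach; the paper does not prescribe it, and the function names and entry layout are made up for illustration.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry


def _mac(key: bytes, prev: str, payload: str) -> str:
    return hmac.new(key, (prev + payload).encode(), hashlib.sha256).hexdigest()


def append_entry(log: list, key: bytes, event: dict) -> None:
    """Append an event whose MAC covers both the event and the previous MAC."""
    prev = log[-1]["mac"] if log else GENESIS
    payload = json.dumps(event, sort_keys=True)
    log.append({"prev": prev, "payload": payload, "mac": _mac(key, prev, payload)})


def verify_log(log: list, key: bytes) -> bool:
    """Recompute the chain; any edited, reordered, or inserted entry breaks it."""
    prev = GENESIS
    for entry in log:
        expected = _mac(key, prev, entry["payload"])
        if entry["prev"] != prev or not hmac.compare_digest(expected, entry["mac"]):
            return False
        prev = entry["mac"]
    return True
```

This sketch detects edits inside the chain but not truncation of the tail. Catching that would need something extra, such as periodically publishing the latest MAC to a separate store.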

2. Monitor-private randomness

The monitor uses random numbers the agent doesn't have access to. This is the same idea as keeping the seed of a randomness-based audit secret from the auditee. It prevents the agent from gaming the monitor by anticipating what the monitor will check.
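
A minimal sketch of what this can look like, assuming the monitor holds a secret key and decides which events to audit with a keyed hash of the event ID. The helper names and the ID scheme are hypothetical. The decision is reproducible for the monitor and unpredictable for anyone without the key.

```python
import hashlib
import hmac
import secrets


def new_audit_key() -> bytes:
    """Generate a key from the OS CSPRNG; it must never enter the agent's context."""
    return secrets.token_bytes(32)


def should_audit(key: bytes, event_id: str, rate: float) -> bool:
    """Audit roughly a `rate` fraction of events, selected by a keyed hash."""
    digest = hmac.new(key, event_id.encode(), hashlib.sha256).digest()
    # Use 53 bits so the fraction is exactly representable and always below 1.0.
    u = (int.from_bytes(digest[:7], "big") >> 3) / 2**53
    return u < rate
```

Using a keyed hash instead of a live random draw is a design choice: an auditor with the key can later confirm that the events checked were the ones the scheme selected.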

3. Shadow sampling

For some checks, the monitor samples additional outputs from the model with the same prompt to estimate the output distribution. These shadow samples don't reach the user; they exist only to characterize the distribution of behavior the model is producing.
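
As a rough sketch, shadow sampling amounts to calling the model again and tabulating what comes back. Here sample_fn is a hypothetical stand-in for whatever client draws one completion for a prompt, and is_unsafe is a hypothetical policy predicate; neither is taken from the paper.

```python
from collections import Counter
from typing import Callable, Dict


def shadow_distribution(sample_fn: Callable[[str], str], prompt: str, n: int) -> Dict[str, float]:
    """Draw n extra samples for the same prompt and return empirical frequencies."""
    if n <= 0:
        raise ValueError("n must be positive")
    counts = Counter(sample_fn(prompt) for _ in range(n))
    return {output: count / n for output, count in counts.items()}


def unsafe_mass(dist: Dict[str, float], is_unsafe: Callable[[str], bool]) -> float:
    """Estimated probability that a sampled output violates the policy."""
    return sum(p for output, p in dist.items() if is_unsafe(output))
```

The estimate is only as good as n allows, and each shadow sample costs a model call, so the sample size is a budget decision as much as a statistical one.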

4. Tool re-execution

For checks involving tool results, the monitor can re-execute tool calls deterministically (or in a sandboxed environment) to verify the response was what it should have been.
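
A minimal sketch of the re-execution check, assuming the tool is a deterministic, side-effect-free Python callable with JSON-serializable arguments and results. Those assumptions are mine, and the function names are illustrative; tools with side effects would need the sandboxed variant instead.

```python
import hashlib
import json
from typing import Any, Callable


def fingerprint(value: Any) -> str:
    """Stable hash of a JSON-serializable value (keys sorted for determinism)."""
    return hashlib.sha256(json.dumps(value, sort_keys=True).encode()).hexdigest()


def reexecute_and_check(tool: Callable[..., Any], args: dict, logged_result: Any) -> bool:
    """Re-run the tool with the logged arguments and compare against the logged result."""
    fresh = tool(**args)
    return fingerprint(fresh) == fingerprint(logged_result)
```

Comparing fingerprints rather than raw values lets the log store only the hash of large results, at the cost of not being able to say how a mismatched result differs.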

These four together give the monitor enough independent evidence to construct the certificates the runtime needs. None of them is exotic; all four are within reach of any team that takes monitoring seriously. The contribution is specifying exactly what's needed and why.

Modular Composition: Why It Matters

A practical headache in AI safety is composition. You can certify two components are safe individually and have an unsafe composite system. The classic case: a tool that's safe to call individually, called many times in rapid succession, can produce unsafe aggregate effects (denial-of-service against a downstream system, escalation through repeated calls, etc.).

The paper provides a modular composition theorem: under specific conditions, the safety certificates of components compose into a safety certificate for the system. The conditions are real and non-trivial, but they're explicit. A team building composite agentic systems can check the conditions and know whether their composition is in the regime where component safety implies system safety.

The honest framing matters. The paper labels this "a first compositional regime" — meaning it's a useful starting point, not a complete theory of compositional safety. Engineering teams should treat it the same way: useful for the cases it covers, not a guarantee for cases it doesn't.

The Global Risk Budget

A subtle but important contribution: a corollary that allocates a global risk budget across run-time decision points. This addresses the "safe per step, unsafe in aggregate" failure mode.

Each decision point has a small probability of producing an unsafe action. Summed over thousands or millions of decisions, the aggregate probability of at least one unsafe action becomes large. The paper provides a way to allocate the global risk budget across the decision schedule so the aggregate stays bounded.

In practice, this means setting per-action safety thresholds based on how many actions you expect the system to take, with explicit budget tracking. A system that takes a million actions per day and wants ≤1% aggregate unsafe-action probability needs a much tighter per-action threshold than a system that takes a thousand actions.
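
The arithmetic can be sketched with a plain union bound, which holds even when decisions are dependent: the probability of at least one unsafe action is at most the sum of the per-action probabilities. This is the textbook allocation, not necessarily the schedule the paper's corollary uses, and both function names are illustrative.

```python
import math


def uniform_budget(alpha_total: float, n_actions: int) -> float:
    """Equal per-action threshold when the number of actions is known in advance."""
    if n_actions <= 0:
        raise ValueError("n_actions must be positive")
    return alpha_total / n_actions


def open_ended_budget(alpha_total: float, t: int) -> float:
    """Threshold for the t-th action (t = 1, 2, ...) when the horizon is unknown.

    Uses alpha_t = alpha * 6 / (pi^2 * t^2); these sum to alpha over all t,
    because the sum of 1/t^2 over t >= 1 equals pi^2 / 6.
    """
    if t <= 0:
        raise ValueError("t must be a positive integer")
    return alpha_total * 6.0 / (math.pi ** 2 * t ** 2)
```

For a 1% aggregate budget, uniform_budget gives about 1e-8 per action at a million actions and about 1e-5 at a thousand. The open-ended schedule never overspends, but it tightens quickly, so later actions face much stricter thresholds than early ones.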

This is the kind of accounting most production AI systems don't do explicitly. They set a per-action threshold based on intuition and don't verify the aggregate. CAST gives the framework for doing it properly.

Convex Repair: When the Certificate Fails

When an action fails certification, two options exist:

Dummy fallback: emit a safe default output. Conservative, simple, often degrades user experience.
Convex repair: project the unsafe output to the nearest point in the safe set. Requires the safe set to be convex and the projection to be tractable.

The paper provides projection-to-safety theorems in KL/Bregman and optimal transport geometries, with an explicit tractability pathway for transport-based repair. The math says: under specific conditions, you can compute an action that is "as close as possible" to the original unsafe action while being inside the safe set.

The engineering takeaway: when designing your safe set, design it convex. Projecting onto a convex safe set is a convex optimization problem, which is typically tractable and often solvable in polynomial time. Projecting onto a non-convex safe set generally is not. This is a design choice you make when specifying the safety policy, and it determines whether convex repair is even available.
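
For one simple convex safe set, the KL projection has a closed form. The sketch below assumes the output is a discrete distribution q, and the safe set is "total probability on a flagged subset is at most eps". That choice of safe set is mine, for illustration, and is far simpler than the paper's general setting. Minimizing KL(p || q) under this constraint rescales the flagged entries down to total mass eps and the remaining entries up to total mass 1 - eps.

```python
from typing import List, Set


def kl_repair(q: List[float], unsafe: Set[int], eps: float) -> List[float]:
    """Return the p minimizing KL(p || q) subject to sum of p[i] over unsafe <= eps."""
    if not 0.0 <= eps < 1.0:
        raise ValueError("eps must be in [0, 1)")
    mass = sum(q[i] for i in unsafe)
    if mass <= eps:
        return list(q)  # already inside the safe set, nothing to repair
    if 1.0 - mass <= 0.0:
        raise ValueError("q puts no mass on safe outputs; use a dummy fallback instead")
    scale_unsafe = eps / mass                # shrink flagged outputs to total mass eps
    scale_safe = (1.0 - eps) / (1.0 - mass)  # grow the rest to total mass 1 - eps
    return [q[i] * (scale_unsafe if i in unsafe else scale_safe) for i in range(len(q))]
```

For example, with q = [0.5, 0.3, 0.2], index 0 flagged, and eps = 0.1, the repaired distribution is approximately [0.1, 0.54, 0.36]: relative preferences among the safe outputs are preserved. When q has no safe mass at all, the projection is undefined, which is exactly the case where the dummy fallback is the only option.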

What This Means for Production AI Teams

Three practical actions for any team running tool-using LLM systems in production:

1. Verify your monitoring methodology

If your drift detection or safety monitoring uses standard p-value-based methods on adaptively collected data, the statistical guarantees don't hold. Either move to e-process-based monitoring (the right answer) or be explicit that your monitoring is heuristic rather than statistically rigorous (a defensible interim answer).

2. Specify the safe set explicitly

The phrase "the model is safe" doesn't have a precise meaning. "Outputs are members of set S" does. Specify S in code or formal description. Then you can verify membership, project failures into S, and reason about composition.

3. Account for global risk

If your system takes many actions, set per-action thresholds based on aggregate risk budget, not on per-action intuition. The math isn't hard once you've decided on the budget; the discipline is in deciding to do it.

Where CAST Doesn't Apply

The paper is explicit about scope: it's not a robotics theory, doesn't assume superintelligence, and is positioned as "finance-grade / scientific-computing-grade containment and certification." It's for explicitly specified safety laws on tool-using LLM systems.

This is the regime where most production AI deployment lives in 2026. CAST is well-aimed at the practical problem. Teams shipping LLM-based agents that touch real tools — the bulk of agentic AI in production — should be reading this paper and adapting the patterns.

The alternative is monitoring that's mathematically wrong on the data it's processing. That's the kind of mistake that produces incidents nobody can explain after the fact.

Source: Fradelos, G. Certifiable AI Safety Theory (CAST): Convex-Analytic, Measure-Theoretic, and Proof-Carrying Deployment Gates for Tool-Using LLM Systems (Geneva, February 12, 2026). SSRN 6307158.
