Challenges

Aligning AI by Construction: A Mathematical Framework Built on Constraints, Not Training

By Marc Molas·April 6, 2026·11 min read

The default approach to AI alignment for the last several years has been training-centric: fine-tune the model with the right reward signal, train it to refuse certain actions, train it to produce responses inside an acceptable distribution. This approach has produced real progress, but it's vulnerable in a specific way: alignment becomes a property of the training data and reward function, both of which can be wrong, biased, or strategically misaligned in ways that aren't visible until deployment.

The recent paper A Mathematical Solution to the AI Alignment Problem: Topological Constraints on Action Distributions with Progressive Verification (Fradelos, January 2026) takes a different stance: explicitly decouple alignment from training quality. The base model can be weak, biased, or even strategically misaligned, and the deployed system is still aligned by construction — because alignment is enforced by an external constraint wrapper and monitor, not by the model's training.

The math is non-trivial. The engineering implications are useful even if you don't follow the math, because the design choices map onto practical decisions any team shipping AI systems has to make.

The Core Move: Alignment as a Topological Membership Condition

The core idea, stripped of formalism: treat the deployed AI system as inducing a probability distribution over infinite action–observation trajectories. Alignment is then defined as the deployed system's distribution belonging to a specific, well-behaved set of distributions — call it the safe set.

This is a topological condition. Either the system's trajectory distribution is in the safe set, or it isn't. The safe set is defined by safety, legality, and corrigibility constraints encoded as scalar functions on probability distributions.

This framing has three useful consequences:

1. Alignment is a property of the deployed system, not the model

The same model can produce an aligned deployed system or a misaligned one, depending on the constraint wrapper around it. If the wrapper enforces the membership condition, the deployed system is aligned, regardless of how the model was trained. If the wrapper doesn't, the deployed system isn't aligned, regardless of how good the model is.

This is the same insight behind verifiable governance architectures: don't trust the model, constrain the action surface. The mathematical framing makes the constraint precise.

2. Decoupling from training quality is explicit

The framework starts from the assumption that the base model may be weak, biased, or strategically misaligned. It then asks: under what conditions can we still produce an aligned deployed system?

The answer is: when the constraint wrapper is well-designed and the monitor is sufficient. This is much more robust than alignment-via-training, because it doesn't require trust in the training process. Training-quality issues become a quality concern (the model produces less useful output) rather than a safety concern.

3. Alignment becomes verifiable

If alignment is membership in a set, then verifying alignment is testing membership. The framework provides explicit conditions under which membership can be tested with finite logs (using conformal/PAC bounds), which makes the math operationalizable.

Progressive Outputs: Making Non-Determinism Non-Hidden

The second core move is progressive outputs: filtration-aligned partial outputs that make the system's non-determinism visible to monitoring rather than hidden.

The motivation is operational. Modern AI systems are stochastic — they produce different outputs on the same input depending on sampling. A system that emits a final output only after extensive internal computation hides that stochasticity. Violations of alignment may be transient and not appear in the final output even when they're present in the trajectory.

Progressive outputs change this by emitting the system's state along a filtration — a sequence of partial outputs that grows over time. Each partial output is an observable quantity that can be monitored. Violations show up as measurable distributional drift in the trajectory space.

Translated for engineering teams: don't just monitor the final answer. Monitor the agent's intermediate states — its tool calls, its reasoning trace, its partial outputs — as they're produced. Drift detection works on this trajectory, not just on final outcomes. This is the formal version of what some agentic AI teams have been doing informally for a while: stream the agent's reasoning, monitor every step, alert on patterns that diverge from the safe distribution.

Why Wasserstein Topology Matters Here

The framework uses weak/Wasserstein topologies on the space of probability distributions. The non-mathematical version: this is the right way to measure how "close" two distributions are when you care about action consequences rather than action probabilities.

KL divergence — the more familiar measure — is sensitive to the specific probabilities of specific actions. A system that's almost always safe but has a tiny probability of catastrophic action can have low KL divergence from a fully safe system but very different real-world consequences. Wasserstein distance accounts for the magnitude of the difference between actions, not just their probabilities.

For practical safety monitoring, this matters because you want a metric that captures "this distribution starts taking dangerous actions occasionally," not just "this distribution looks slightly different from the safe one." Wasserstein distance is closer to what you actually want to measure.

This is the kind of detail that doesn't matter until it does. Most production drift detection in 2026 uses simpler metrics that miss the rare-but-catastrophic case.

The Scope Restriction Worth Naming

The framework deliberately restricts scope to information-work systems — analysis, reasoning, decision support, office workflows — without direct physical actuation. Robots, autonomous vehicles, embodied AI are out of scope.

This is a serious-engineering choice, not a hedge. Excluding physical systems makes the framework feasible and auditable: you can capture, log, and verify information-work trajectories in a way that's much harder for embodied systems. The paper acknowledges this can invite criticism (the alignment problem is hardest for embodied systems) and positions the framework as foundational and extensible to physical systems via a "shielded physical interface layer."

For most engineering teams shipping AI in 2026, this is the relevant scope anyway. The agents you're deploying — for customer support, code generation, financial analysis, document processing — are information-work systems. The alignment problem in this scope is the practically pressing one. Embodied AI alignment is still a research-stage concern for almost everyone.

What Engineers Should Take From This

Three practical takeaways for teams not deeply involved in alignment research.

1. Treat alignment as a property of the deployed system, not the model

The most actionable insight is the framing itself. When you evaluate an AI deployment for alignment, don't evaluate "is the model aligned?" Evaluate "is the deployed system, including its constraint wrapper and monitor, producing trajectories in the acceptable region?"

This changes how you architect AI deployments. The constraint wrapper, the monitor, and the action-surface controls are part of the safety system. The model is one component of a larger system, not the unit of safety analysis.

2. Monitor trajectories, not just outputs

Progressive outputs are the formal version of streaming agent state. If your AI deployment only logs final answers, you're missing most of the safety-relevant signal. Log the intermediate states. Monitor for distributional drift on those intermediate states. Build alerts on the trajectory, not just the outcome.

This is the same pattern as observability in distributed systems: log spans, not just request/response. The reason is the same: the failure modes you care about are mid-trajectory, not just at the boundary.

3. Build the constraint wrapper to be inspectable, modifiable, and auditable

The constraint wrapper — whatever form it takes in your system, whether OPA policies, runtime filters, gating functions — is the load-bearing component for alignment. Treat it accordingly:

Inspectable: the rules should be readable by humans, not encoded only in model weights.
Modifiable: the rules should be updatable without retraining.
Auditable: changes to the rules should be versioned, signed, and reviewable.

If your "alignment" lives in the model's training, none of these properties hold. If it lives in the constraint wrapper, all three are achievable.

Multi-Agent Settings

The framework extends to multi-agent settings using equilibrium existence on locally convex spaces. This matters because most production agentic deployments in 2026 are evolving toward multi-agent: multiple specialized agents collaborating on a task. Multi-agent alignment isn't just per-agent alignment summed up — emergent behaviors at the system level can be misaligned even when each individual agent is aligned.

The mathematical framing handles this case naturally. The membership condition is on the joint trajectory distribution, not the per-agent distributions. Practically, this means multi-agent monitoring has to be at the system level, with cross-agent traces correlated and analyzed together.

If you're deploying multi-agent systems and your monitoring is per-agent, you're missing the emergent failure modes.

Why This Approach Is Useful Even If You Skip the Math

You don't need to follow the proofs to take the lesson. The lesson is:

Alignment-by-construction is more robust than alignment-by-training, because it doesn't depend on training going right.

This is consistent with how engineering teams handle other safety-critical systems. We don't trust pilots not to make mistakes; we have constraints (autopilots, terrain warnings, traffic collision avoidance). We don't trust drivers not to crash; we have constraints (lane-keeping, automatic emergency braking). We don't trust databases never to corrupt data; we have constraints (transactions, replicas, backups). We trust the operator inside known constraints; we don't trust the operator without constraints.

The same logic applies to AI. Train the model well. Then constrain its action surface so that even when training is imperfect, the deployed system is still safe. The constraint layer is the safety system; the model is the optimization within it.

This isn't a research-only result. Teams shipping serious agentic AI in 2026 are converging on this pattern from many directions: verifiable governance architectures, finance-grade assurance, runtime watchdogs. The mathematical framework gives the pattern a formal foundation, which makes it harder to misimplement and easier to audit.

Source: Fradelos, G. A Mathematical Solution to the AI Alignment Problem: Topological Constraints on Action Distributions with Progressive Verification (Geneva, January 14, 2026). SSRN 6307060.

Building AI systems where alignment matters in production and you'd rather have it by construction than by hope? Talk to a CTO about deploying nearshore engineering capacity with the discipline to build the constraint layer correctly.