<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Wangari]]></title><description><![CDATA[A particle physicist turned founder, thinking out loud about enterprise AI, agentic systems, and how to responsibly use the technology reshaping how we work.]]></description><link>https://newsletter.wangari.global</link><image><url>https://substackcdn.com/image/fetch/$s_!cVMw!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda913d1d-ccae-463d-bca0-2752f45cdcc4_778x778.png</url><title>Wangari</title><link>https://newsletter.wangari.global</link></image><generator>Substack</generator><lastBuildDate>Mon, 04 May 2026 15:08:26 GMT</lastBuildDate><atom:link href="https://newsletter.wangari.global/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Ari Joury]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[contact@wangari.global]]></webMaster><itunes:owner><itunes:email><![CDATA[contact@wangari.global]]></itunes:email><itunes:name><![CDATA[Ari Joury]]></itunes:name></itunes:owner><itunes:author><![CDATA[Ari Joury]]></itunes:author><googleplay:owner><![CDATA[contact@wangari.global]]></googleplay:owner><googleplay:email><![CDATA[contact@wangari.global]]></googleplay:email><googleplay:author><![CDATA[Ari Joury]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[How Agentic AI Finally Makes Causal Inference Deployable]]></title><description><![CDATA[A technical walkthrough of the five bottlenecks that kept causal models out of production &#8212; and how AI agents are removing them one by one]]></description><link>https://newsletter.wangari.global/p/how-agentic-ai-finally-makes-causal</link><guid isPermaLink="false">https://newsletter.wangari.global/p/how-agentic-ai-finally-makes-causal</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Fri, 01 May 2026 06:00:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Las-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8ed0300-ae22-48c3-bb15-e1d12f07e96b_1344x768.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Las-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8ed0300-ae22-48c3-bb15-e1d12f07e96b_1344x768.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Las-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8ed0300-ae22-48c3-bb15-e1d12f07e96b_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!Las-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8ed0300-ae22-48c3-bb15-e1d12f07e96b_1344x768.heic 848w, 
https://substackcdn.com/image/fetch/$s_!Las-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8ed0300-ae22-48c3-bb15-e1d12f07e96b_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!Las-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8ed0300-ae22-48c3-bb15-e1d12f07e96b_1344x768.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Las-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8ed0300-ae22-48c3-bb15-e1d12f07e96b_1344x768.heic" width="1344" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8ed0300-ae22-48c3-bb15-e1d12f07e96b_1344x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:138828,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://wangari.substack.com/i/195233339?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8ed0300-ae22-48c3-bb15-e1d12f07e96b_1344x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Las-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8ed0300-ae22-48c3-bb15-e1d12f07e96b_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!Las-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8ed0300-ae22-48c3-bb15-e1d12f07e96b_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!Las-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8ed0300-ae22-48c3-bb15-e1d12f07e96b_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!Las-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8ed0300-ae22-48c3-bb15-e1d12f07e96b_1344x768.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div>
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">In a changing world, logic is what we trust in &#8212; not just experience. Image generated with Leonardo AI</figcaption></figure></div><p>Earlier this week I made a claim that I want to back up properly: that agentic AI has made causal inference tractable at enterprise scale for the first time. This is not a marketing statement. It is a specific technical argument, and it deserves a specific technical treatment.</p><p>In this post, I want to walk through the five stages of deploying a causal model in a production environment, explain why each stage was a bottleneck before the current generation of AI agents, and show concretely what changes when you introduce agents into the pipeline. I will also be honest about what agents cannot do &#8212; because the failure modes of this architecture are as important as its strengths.</p><p>This is not a tutorial. There is no step-by-step guide to follow. The depth here comes from the reasoning, not from the implementation. If you want to understand why this architecture is different &#8212; not just that it is different &#8212; this is the post for you.</p><h2>Background: what causal inference actually requires</h2><p>Before we can talk about what agents change, we need to be precise about what causal inference requires. The framework I am working with is Judea Pearl&#8217;s Structural Causal Model (SCM) approach, which is the dominant formal framework for causal reasoning in statistics and machine learning.</p><p>An SCM represents a system as a set of variables and a set of structural equations &#8212; one for each variable &#8212; that specify how that variable is determined by its causes. The causal structure is represented as a Directed Acyclic Graph (DAG): nodes are variables, directed edges are direct causal relationships. The graph is acyclic because we assume no variable can be its own cause (in the relevant time window).</p><p>The key operation in this framework is the do-operator, written do(X = x). This represents an intervention: we set variable X to value x, severing its connection to its usual causes. The do-calculus &#8212; a complete set of inference rules developed by Pearl &#8212; allows us to compute the effect of such interventions from observational data, given a known causal graph. This is what makes the framework powerful: you can answer &#8220;what if we change X?&#8221; without ever having run the experiment.</p><p>Deploying this in a production environment requires five distinct stages. Each one, historically, has been a serious bottleneck.</p><h3>Stage 1: Variable selection and domain scoping</h3><p>The first stage is deciding which variables belong in the model. This sounds straightforward. It is not.</p><p>In a financial system, the space of potentially relevant variables is enormous: macroeconomic indicators, firm-level financials, regulatory metrics, ESG scores, market microstructure variables, sentiment signals, and more. Not all of them belong in the causal model. Including too many variables introduces spurious paths and makes the graph unidentifiable. 
Including too few means missing important confounders and getting biased estimates of causal effects.</p><p>The traditional approach is expert workshops: bring together domain specialists &#8212; actuaries, risk officers, portfolio managers &#8212; and have them argue about which variables matter and why. This process is valuable. It is also slow, expensive, and heavily dependent on who is in the room. A variable that one expert considers obviously relevant may not occur to another.</p><p>What agents change here is the breadth of the initial search. An agent can synthesise a large body of domain literature &#8212; academic papers, regulatory guidance, industry reports &#8212; and propose a candidate variable set with citations. It can identify variables that appear repeatedly in the literature as important drivers of a given outcome, flag variables that are commonly treated as confounders, and surface domain knowledge that might not be in the room. The expert&#8217;s job shifts from generating the list from scratch to evaluating and pruning a well-researched proposal. This is faster and less dependent on any single expert&#8217;s knowledge.</p><p>The caveat is important: agents can propose, but they cannot validate. The final variable set must be approved by domain experts who understand the business context. An agent that has read every paper on ESG and financial performance still does not know which variables are actually available in your data infrastructure, or which ones your compliance team will accept in a regulatory submission.</p><h3>Stage 2: Causal graph construction</h3><p>Once you have a variable set, you need to specify the causal structure: which variables cause which, and in which direction. This is the hardest stage, and historically the most expensive.</p><p>There are two broad approaches: constraint-based methods (like the PC algorithm, from Spirtes, Glymour, and Scheines) that infer causal structure from conditional independence tests in the data, and score-based methods that search over graph structures to maximise a goodness-of-fit criterion. Both have well-known limitations. Constraint-based methods are sensitive to the faithfulness assumption and to sample size. Score-based methods face a combinatorial search problem that becomes intractable for large variable sets. Neither approach produces a unique graph from observational data alone &#8212; the best you can get is a Markov equivalence class of graphs that are statistically indistinguishable.</p><p>In practice, this means that automated causal discovery algorithms can narrow the space of plausible graphs, but they cannot determine the final structure without domain input. The direction of edges &#8212; which is often the most important thing &#8212; frequently cannot be determined from the data alone and must be specified by a domain expert.</p><p>What agents change here is the iteration speed. An agent can run multiple causal discovery algorithms, compare their outputs, flag edges where the algorithms disagree, and present the domain expert with a structured set of decisions: &#8220;these edges are agreed across all methods; these are contested; these are determined by the data but conflict with the following domain knowledge.&#8221; The expert&#8217;s job shifts from running algorithms and interpreting raw output to making a set of well-framed decisions. 
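</p><p>As a sketch of what that triage step can look like mechanically (assuming, purely for illustration, that each algorithm&#8217;s output has already been reduced to a set of directed cause-and-effect edge pairs; the function below is a sketch, not any particular library&#8217;s API):</p><div class="highlighted_code_block"><pre class="shiki"><code class="language-python">def triage_edges(results):
    """Partition directed edges by cross-algorithm agreement.

    results maps an algorithm name (e.g. "pc", "ges") to the set of
    (cause, effect) edges that algorithm proposed.
    """
    all_edges = set().union(*results.values())
    agreed, contested, direction_disputed = set(), set(), set()
    for edge in all_edges:
        flipped = (edge[1], edge[0])
        if any(flipped in edges for edges in results.values()):
            direction_disputed.add(edge)   # same pair, opposite directions
        elif all(edge in edges for edges in results.values()):
            agreed.add(edge)               # every method proposes this edge
        else:
            contested.add(edge)            # some methods omit it
    return agreed, contested, direction_disputed</code></pre></div><p>The expert then reviews three short lists instead of a stack of raw adjacency matrices, and the direction-disputed list is exactly where domain knowledge has to do the work.</p><p>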
The number of decisions is the same; the cost of each decision is lower.</p><p>The failure mode to watch for: agents that present a single &#8220;best&#8221; graph without surfacing the uncertainty. Causal graphs are not uniquely determined by data. Any system that presents a causal structure as if it were a fact &#8212; rather than a hypothesis to be validated &#8212; is misrepresenting the epistemics of the problem.</p><h3>Stage 3: Graph validation and sensitivity testing</h3><p>A causal graph is a set of assumptions. Every edge is a claim: &#8220;this variable directly causes that one.&#8221; Every missing edge is also a claim: &#8220;these two variables are not directly causally connected, conditional on everything else in the graph.&#8221; These claims can be wrong, and the consequences of getting them wrong can be severe.</p><p>The standard approach to validation is a combination of domain review (does the graph make sense to experts?) and statistical testing (do the conditional independence relationships implied by the graph hold in the data?). The latter is formalised through the concept of d-separation: a graph implies that certain pairs of variables are conditionally independent given certain other variables, and these implications can be tested.</p><p>Sensitivity analysis asks a related question: how much do the causal effect estimates change if we modify the graph &#8212; add an edge, reverse a direction, introduce an unmeasured confounder? This is important because the graph is always an approximation of reality, and you want to know which parts of your conclusions are robust to that approximation and which are fragile.</p><p>Historically, this stage required a specialist statistician who could run the tests, interpret the results, and translate them back into graph modifications. It was slow and iterative. What agents change is the automation of the test battery: an agent can systematically run all implied conditional independence tests, flag violations, identify which edges are implicated in each violation, and generate a structured report. It can also run systematic sensitivity analyses &#8212; varying edge weights, introducing hypothetical confounders, testing the stability of effect estimates &#8212; and summarise the results. The statistician&#8217;s job shifts from running tests to interpreting a structured diagnostic report.</p><h3>Stage 4: Interventional query answering</h3><p>This is the stage where the causal model earns its keep. An interventional query asks: &#8220;what is the expected value of outcome Y if we set variable X to value x?&#8221; In Pearl&#8217;s notation: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;E[Y | \\text{do}(X = x)].&quot;,&quot;id&quot;:&quot;KRQPABJMES&quot;}" data-component-name="LatexBlockToDOM">E[Y | do(X = x)]</div><p>This is different from the conditional expectation that a standard regression model estimates:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;E[Y | X = x]&quot;,&quot;id&quot;:&quot;SSAOJFCQOO&quot;}" data-component-name="LatexBlockToDOM">E[Y | X = x]</div><p>Computing interventional queries requires applying the do-calculus: a set of three rules that allow you to transform expressions involving the do-operator into expressions that can be computed from the observational distribution, given the causal graph. For simple graphs, this is straightforward.</p>
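<p>For the simplest identifiable case, a single observed confounder Z, the two quantities can be computed side by side: the backdoor adjustment formula gives E[Y | do(X = x)] as a sum of stratum means weighted by the population distribution of Z. A minimal sketch, with invented column names and assuming a discrete confounder:</p><div class="highlighted_code_block"><pre class="shiki"><code class="language-python">import pandas as pd

def backdoor_adjust(df, x_col, y_col, z_col, x_val):
    """Estimate E[Y | do(X = x_val)] with one discrete confounder Z:
    sum over z of E[Y | X = x_val, Z = z] * P(Z = z)."""
    p_z = df[z_col].value_counts(normalize=True)
    treated = df[df[x_col] == x_val]
    effect = 0.0
    for z_val, p in p_z.items():
        stratum = treated[treated[z_col] == z_val]
        if stratum.empty:
            raise ValueError(f"no data with X={x_val}, Z={z_val}: positivity violated")
        effect += stratum[y_col].mean() * p
    return effect

# The naive conditional estimate, for contrast:
# df[df[x_col] == x_val][y_col].mean()   # E[Y | X = x_val]</code></pre></div><p>The naive estimate weights each stratum by P(Z = z | X = x), the distribution among units that happened to receive x; the adjusted estimate weights by the population P(Z = z). That reweighting is the entire difference between seeing and doing.</p><p>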
For complex graphs with many variables, multiple intervention targets, and time-series structure, it can require significant algebraic manipulation.</p><p>Historically, translating a business question (&#8220;what happens to our loss ratio if we change our pricing structure?&#8221;) into a formal interventional query, and then computing that query from the causal model, required a specialist who understood both the business context and the mathematical machinery. This was the primary reason that causal inference remained in academic settings: the translation cost was too high.</p><p>What agents change here is the translation layer. An agent can take a natural-language business question, identify the relevant variables, formulate the appropriate do-expression, apply the do-calculus to derive an estimable expression, and return the result with a plain-language explanation. The domain expert does not need to know the do-calculus. They need to be able to evaluate whether the agent&#8217;s translation of their question is correct &#8212; which is a much lower bar.</p><p>The failure mode: agents that answer the wrong question confidently. A business question is often ambiguous between a conditional query and an interventional query. &#8220;What happens to our loss ratio when ESG scores are high?&#8221; could mean &#8220;what do we observe in cases where ESG scores happen to be high?&#8221; (conditional) or &#8220;what would happen if we forced ESG scores to be high?&#8221; (interventional). These have different answers. An agent that does not surface this ambiguity &#8212; and ask the user to resolve it &#8212; is a liability.</p><h3>Stage 5: Audit trail and documentation</h3><p>In regulated industries, the output of an analysis is not just the answer. It is the answer plus the full chain of reasoning that produced it. Every modelling assumption, every data transformation, every analytical choice must be documented and defensible.</p><p>For causal models, this is particularly demanding. The audit trail must cover: the variable selection rationale, the causal graph structure and the basis for each edge, the validation tests and their results, the specific interventional queries that were run, and the mapping from those queries to the business decisions they informed. In a manual process, this documentation is often incomplete, inconsistent, and produced after the fact.</p><p>What agents change here is that the audit trail can be automatic and contemporaneous. Every agent action &#8212; every literature search, every algorithm run, every graph modification, every query &#8212; is logged with a timestamp, the inputs, the outputs, and the reasoning. The documentation is not a separate task; it is a byproduct of the process. For a regulatory submission or an internal audit, this is not a minor convenience. It is the difference between a defensible analysis and one that cannot be reconstructed.</p><h2>What this architecture cannot do</h2><p>I want to be direct about the limitations, because they matter.</p><p>Agents cannot determine causal structure from data alone. The direction of causal edges is often underdetermined by observational data, and no amount of computational power changes this. Domain expertise is not optional; it is load-bearing.</p><p>Agents cannot validate their own translations. When an agent translates a business question into a formal query, it may translate it incorrectly &#8212; and it may do so confidently. The human review step at Stage 4 is not a formality. 
It is the primary defence against a class of errors that are invisible in the output but consequential in the decision.</p><p>Agents are not a substitute for experimental data. The do-calculus allows you to compute interventional effects from observational data under certain assumptions &#8212; primarily that the causal graph is correctly specified and that there are no unmeasured confounders. When these assumptions are violated, the estimates can be badly wrong. Agents cannot tell you when the assumptions are violated; only domain knowledge and, ultimately, experimental evidence can do that.</p><h2>The Bottom Line</h2><p>The case for this architecture is not that it makes causal inference easy. It doesn&#8217;t. The case is that it makes causal inference viable &#8212; that it removes the cost barriers that kept a rigorous and well-established methodology out of production for thirty years.</p><p>The methodology is not new. The infrastructure is. And the organisations that combine the two &#8212; that build systems where agents handle the process layer and domain experts hold the judgment layer &#8212; are building something that correlation-based AI cannot replicate: a rigorous, auditable, interventionally valid model of the systems they operate in.</p><p>In regulated industries, that is worth building. The window for building it first is open now.</p>]]></content:encoded></item><item><title><![CDATA[Your Model Knows What Happened. It Doesn't Know Why.]]></title><description><![CDATA[The gap between correlation and causation is not a technical detail &#8212; it is the whole problem]]></description><link>https://newsletter.wangari.global/p/your-model-knows-what-happened-it</link><guid isPermaLink="false">https://newsletter.wangari.global/p/your-model-knows-what-happened-it</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Thu, 30 Apr 2026 06:00:32 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/195231788/c70f7080055bd2234406046810eac535.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>There is a version of AI in financial services that is very good at finding patterns. It has been trained on years of data, it can process millions of variables, and it will tell you, with impressive confidence, what tends to happen next. What it cannot tell you is why. And in the moment you need to make a decision &#8212; to intervene, to change something, to act &#8212; &#8220;what tends to happen&#8221; is the wrong answer to the wrong question.</p><h2>The moment correlation breaks</h2><p>Correlation-based models are built on an implicit assumption: that the future will look enough like the past that past patterns will hold. This assumption is reasonable in stable conditions. It breaks precisely when you need it most &#8212; during structural shifts, regulatory changes, or market disruptions. More fundamentally, it breaks the moment you intervene. When you change your underwriting criteria, restructure a portfolio, or alter your ESG policy, you are not observing the world. You are changing it. A model trained on observation has nothing principled to say about what happens when you act.</p><p>This is not a failure of the model. It is a failure of the question. Correlation can tell you what co-occurs. It cannot tell you what will happen if you force something to change. 
That requires a different kind of reasoning &#8212; one that encodes not just patterns, but mechanisms.</p><h2>What causal models do differently</h2><p>A causal model does not just learn that two things tend to move together. It encodes the directional mechanism: this variable drives that one, through this pathway, under these conditions. Once you have that structure, you can ask the question that actually matters for decision-making: if we intervene here, what happens there? Not &#8220;what tends to happen when X is high?&#8221; but &#8220;what would happen if we set X to this value &#8212; deliberately, right now?&#8221;</p><p>For an actuary, this is the difference between a model that describes historical loss patterns and one that can tell you what happens to your loss ratio if you change your pricing structure. For an ESG analyst, it is the difference between a model that shows ESG scores correlating with returns and one that can tell you whether improving your sustainability practices will actually improve your financial performance &#8212; or whether both are being driven by something else entirely.</p><h2>The Bottom Line</h2><p>The methodology to build these models has existed for thirty years. What has changed is the infrastructure to apply it at scale &#8212; and that infrastructure is here now. The organisations that make this shift are not just getting better predictions. They are getting answers to questions that correlation-based systems cannot answer at all. In regulated industries, that is not a marginal improvement. It is a different game.</p>]]></content:encoded></item><item><title><![CDATA[Causal Inference is Finally There]]></title><description><![CDATA[Why causal inference is finally arriving in industry &#8212; thirty years after it was invented]]></description><link>https://newsletter.wangari.global/p/causal-inference-is-finally-there</link><guid isPermaLink="false">https://newsletter.wangari.global/p/causal-inference-is-finally-there</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Tue, 28 Apr 2026 06:01:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!OANR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ad27b6-ea22-4358-b72c-3af32313e973_1344x768.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OANR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ad27b6-ea22-4358-b72c-3af32313e973_1344x768.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OANR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ad27b6-ea22-4358-b72c-3af32313e973_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!OANR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ad27b6-ea22-4358-b72c-3af32313e973_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!OANR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ad27b6-ea22-4358-b72c-3af32313e973_1344x768.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!OANR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ad27b6-ea22-4358-b72c-3af32313e973_1344x768.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OANR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ad27b6-ea22-4358-b72c-3af32313e973_1344x768.heic" width="1344" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42ad27b6-ea22-4358-b72c-3af32313e973_1344x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:142218,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://wangari.substack.com/i/195230159?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ad27b6-ea22-4358-b72c-3af32313e973_1344x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OANR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ad27b6-ea22-4358-b72c-3af32313e973_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!OANR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ad27b6-ea22-4358-b72c-3af32313e973_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!OANR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ad27b6-ea22-4358-b72c-3af32313e973_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!OANR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ad27b6-ea22-4358-b72c-3af32313e973_1344x768.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div>
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Causal inference is like the lighthouse of statistics: You can see further, even over uncharted waters. Image generated with Leonardo AI</figcaption></figure></div><p>For three decades, a small community of researchers has known something that the rest of the data world is only now beginning to absorb: the most important question you can ask about your data is not &#8220;what correlates with what?&#8221; It is &#8220;what causes what?&#8221; The methodology to answer that question rigorously has existed since the early 1990s. The tools to apply it at enterprise scale have not &#8212; until now.</p><p>This is the story of a thirty-year gap between a scientific breakthrough and its practical arrival. And it is a story that matters enormously for anyone in financial services who is trying to build AI systems they can actually trust.</p><h2>The problem with correlation</h2><p>Correlation is the workhorse of modern data analysis. It is fast, scalable, and surprisingly powerful. When you train a model on historical data and ask it to predict future outcomes, you are essentially asking it to find and exploit correlations &#8212; patterns that held in the past and might hold in the future. This works well enough in stable conditions. It fails, often silently, when conditions change.</p><p>The deeper problem is that correlation cannot answer the question that actually matters in many regulated industries: what happens if we intervene? If you change your underwriting criteria, restructure a portfolio, or shift your ESG policy &#8212; you are not observing the world, you are changing it. A correlation-based model has nothing principled to say about what happens next. It can only extrapolate from the past. So if unprecedented conditions occur, it&#8217;s flummoxed: It was trained on a world where things co-occurred; it has no mechanism for reasoning about a world you have deliberately altered in previously unseen ways.</p><p>This is not a data quality problem. It is not a model size problem. It is a fundamental limitation of correlation as a mode of reasoning. And it is why industry professionals like actuaries, risk officers, and investment analysts have always maintained a healthy scepticism toward purely statistical models &#8212; even when they perhaps could not always articulate exactly why.</p><h2>What causal inference actually does</h2><p>Causal inference, in the technical sense developed by Judea Pearl and colleagues, is a framework for reasoning about interventions. Instead of asking &#8220;what tends to happen when X is high?&#8221;, it asks &#8220;what would happen if we set X to a specific value &#8212; holding everything else constant?&#8221; The two questions sound similar. They have very different answers, and they require very different mathematics.</p><p>The key tool is the Structural Causal Model: a formal representation of a system as a set of variables and the directional mechanisms that connect them. Not just correlations, but causes. The model encodes which variables drive which outcomes, through which pathways, and with what structure. Once you have that model, you can answer interventional questions directly &#8212; not by extrapolating from historical patterns, but by reasoning through the causal structure of the system.</p><p>For industry and financial services, this matters in ways that are immediately practical. 
A model of a manufacturing plant built on causal structure can tell you whether improving its sustainability practices will actually improve its financial performance &#8212; or whether both are driven by a third factor, like management quality or regulatory environment. A risk model built on causal structure can tell you which interventions will actually reduce tail risk &#8212; not just which variables happen to be correlated with it. These are the questions that senior decision-makers are actually asking. Correlation-based models cannot answer them.</p><h2>Why it took thirty years</h2><p>If the methodology was ready in the 1990s, why are we only now seeing it arrive in enterprise software? The honest answer is that applying causal inference at scale has always required an enormous amount of expert labor.</p><p>Building a causal model is not like training a neural network. You cannot simply feed it data and let it find patterns. You need to specify the causal structure of the system &#8212; which variables are causes, which are effects, which are confounders. This requires domain expertise, iterative validation, and careful reasoning about the mechanisms at play.</p><p>For a complex system with dozens of interacting variables, this process could take weeks of expert workshops. And that was before you got to the question of how to translate the resulting model into answers to specific business questions.</p><p>The bottleneck was never the mathematics. It was the cost of applying the mathematics to real-world problems. Causal inference was tractable in academic settings, where a team of specialists could spend months on a single model. It was not tractable in enterprise settings, where you need answers in days, not months, and where the domain experts who could validate the causal structure are also the people running the business.</p><h2>What changed: AI agents, of course</h2><p>The emergence of capable AI agents has changed this equation in a way that is genuinely new. Tasks that previously required weeks of expert time &#8212; synthesising domain literature to identify candidate variables, proposing and testing causal graph structures, running systematic validation checks, translating business questions into formal interventional queries &#8212; can now be completed in hours. The methodology has not changed. The infrastructure for applying it at scale has.</p><p>This is not the same as saying that AI agents can replace domain expertise. They cannot, and they should not. The judgment layer &#8212; validating the causal structure against real-world knowledge, deciding which interventions are worth modelling, interpreting results in context &#8212; remains human. What agents automate is the process layer: the high-volume, well-defined, error-prone work that was consuming most of the expert&#8217;s time without requiring most of the expert&#8217;s judgment.</p><p>The combination of mature causal methodology and modern agentic AI infrastructure is not a marginal improvement on existing approaches. It is a different class of tool &#8212; one that can answer questions that correlation-based systems cannot, at a cost that is commercially viable for the first time.</p><h2>The Bottom Line</h2><p>The organisations that build causal AI capabilities now are not just getting better analytics. 
They are building a fundamentally different relationship with their data &#8212; one where the question &#8220;why?&#8221; has a rigorous, auditable answer, not just a plausible-sounding one. In regulated industries, where the cost of a wrong answer is measured in capital requirements, regulatory penalties, and reputational damage, that difference is not academic. It is the whole game.</p><p>The methodology has been ready for thirty years. The infrastructure just caught up.</p><p>The window for first-mover advantage is open. It will not stay open.</p><div><hr></div><h2>Reads of the Week</h2><ul><li><p><a href="https://platforms.substack.com/p/the-problem-with-agentic-ai-in-2025">The problem with agentic AI in 2025</a>: In this essay, <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Sangeet Paul Choudary&quot;,&quot;id&quot;:3927722,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!T1l9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F54a044f6-e037-4f1d-8694-cd4ef885134d_600x400.png&quot;,&quot;uuid&quot;:&quot;b763e5f6-f683-41db-b83d-0e8a30ac8845&quot;}" data-component-name="MentionToDOM">Sangeet Paul Choudary</span> argues that most organisations are treating agentic AI as a faster version of robotic process automation &#8212; and missing the point entirely. His central claim is that the real value of agents is not in executing workflows more cheaply, but in eliminating the logic of workflows altogether, and that governance &#8212; not execution speed &#8212; is the primary performance driver of a well-designed agentic system. Directly relevant to anyone thinking about how AI agents should be deployed in regulated, high-stakes environments. </p></li><li><p><a href="https://practicalainvestor.substack.com/p/correlation-vs-causation-why-it-matters">Correlation vs. Causation: Why It Matters for Investors</a>: <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Alessio Sancetta&quot;,&quot;id&quot;:340155203,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/856a6d09-8353-41be-830b-82b6cf6197b0_1280x1280.jpeg&quot;,&quot;uuid&quot;:&quot;0572d46e-5af3-47c4-9b63-25dccb458f9e&quot;}" data-component-name="MentionToDOM">Alessio Sancetta</span>&#8217;s take makes the core argument with unusual clarity: correlation describes a pattern, but without a causal anchor, even robust-looking relationships can collapse the moment conditions change. The 2022 equity-bond drawdown is the worked example &#8212; a correlation that held for two decades, built on a conditional relationship that most practitioners had mistaken for a structural one. A useful complement to this week&#8217;s post, written for a portfolio construction audience rather than a technical one. </p></li><li><p><a href="https://www.grumpy-economist.com/p/causation-does-not-imply-variation">Causation Does Not Imply Variation</a>: <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;John H. 
Cochrane&quot;,&quot;id&quot;:18572918,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!XH8t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6edc5e62-79ae-4d03-b5f3-4732fbea4277_500x500.png&quot;,&quot;uuid&quot;:&quot;5e2a1541-d654-4199-a54f-97b2c911719f&quot;}" data-component-name="MentionToDOM">John H. Cochrane</span> offers a useful corrective in the other direction: just because you have identified a causal effect does not mean it explains much of the variation in the outcome you care about. Cochrane&#8217;s argument &#8212; that the causality revolution in econometrics has produced many well-identified but tiny effects, and that practitioners often jump from &#8220;this causes that&#8221; to &#8220;this explains that&#8221; without stopping to think &#8212; is an important caveat for anyone building causal models in production. Read it as a reminder that causal inference is a tool for answering specific questions, not a general-purpose explanation of the world.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Architecting for Autonomy: Beyond the Chatbot Paradigm]]></title><description><![CDATA[A deep-dive into the structural differences between conversational LLMs and agentic frameworks like OpenClaw and NanoClaw.]]></description><link>https://newsletter.wangari.global/p/architecting-for-autonomy-beyond</link><guid isPermaLink="false">https://newsletter.wangari.global/p/architecting-for-autonomy-beyond</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Fri, 24 Apr 2026 06:02:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!WjYy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4d278da-1e12-45f6-af02-5f606350a443_1344x768.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WjYy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4d278da-1e12-45f6-af02-5f606350a443_1344x768.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WjYy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4d278da-1e12-45f6-af02-5f606350a443_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!WjYy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4d278da-1e12-45f6-af02-5f606350a443_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!WjYy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4d278da-1e12-45f6-af02-5f606350a443_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!WjYy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4d278da-1e12-45f6-af02-5f606350a443_1344x768.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WjYy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4d278da-1e12-45f6-af02-5f606350a443_1344x768.heic" width="1344" 
height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e4d278da-1e12-45f6-af02-5f606350a443_1344x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:114640,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://wangari.substack.com/i/194498086?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4d278da-1e12-45f6-af02-5f606350a443_1344x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WjYy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4d278da-1e12-45f6-af02-5f606350a443_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!WjYy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4d278da-1e12-45f6-af02-5f606350a443_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!WjYy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4d278da-1e12-45f6-af02-5f606350a443_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!WjYy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4d278da-1e12-45f6-af02-5f606350a443_1344x768.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">AI is no longer just your tech support. If you build it right, it can start building for you. Image generated with Leonardo AI</figcaption></figure></div><p>The transition from conversational AI to agentic AI is not merely a change in user interface; it is a fundamental architectural shift. 
For the past two years, the dominant paradigm has been the stateless, prompt-response loop. A user provides a prompt, the Large Language Model (LLM) generates a response, and the interaction ends. The system&#8217;s &#8220;memory&#8221; is limited to the context window of the current session.</p><p>Agentic frameworks like OpenClaw and NanoClaw break this paradigm. They introduce persistent memory, autonomous task planning, and the ability to execute actions across external systems. This shift from passive generation to active execution introduces profound new challenges in system architecture, state management, and security.</p><p>In this deep dive, we will examine the mechanics of the &#8220;Agent Loop,&#8221; explore how memory and context are managed without traditional databases, and analyze the architectural trade-offs between monolithic agent frameworks (OpenClaw) and lightweight, isolated approaches (NanoClaw).</p><h3>The Anatomy of the Agent Loop</h3><p>At the core of any autonomous agent is the Agent Loop&#8212;a continuous cycle of observation, reasoning, and action. Unlike a standard LLM call, which is a single, stateless generation pass, the Agent Loop is iterative and stateful.</p><p>When a message or trigger arrives, the agent does not immediately generate a final response. Instead, it enters a reasoning phase. It assembles context from its environment, including conversation history, workspace files, and available tools. It then queries the LLM not for an answer, but for a plan.</p><p>The LLM, acting as the reasoning engine, evaluates the context and determines the next necessary step. If the task requires external data or action, the LLM outputs a tool call (e.g., a JSON object specifying an API endpoint and parameters). The agent framework intercepts this tool call, executes the action (e.g., querying a database, sending an email), and appends the result to the context.</p><p>This loop repeats&#8212;often up to 20 times per request in frameworks like OpenClaw&#8212;until the LLM determines that the objective has been met and generates a final response to the user.</p><p>This iterative process is what enables agents to handle complex, multi-step workflows. However, it also introduces significant latency and cost, as each step requires a separate LLM inference call. More importantly, it creates a massive attack surface. If the LLM&#8217;s reasoning is compromised&#8212;for example, through a prompt injection attack hidden in a retrieved document&#8212;the agent may execute malicious tool calls with its delegated authority.</p><h3>State Management Without Databases</h3><p>One of the most fascinating architectural choices in OpenClaw is its approach to state management. Traditional enterprise applications rely on relational or NoSQL databases to manage state and persist data. OpenClaw, by default, eschews this approach in favor of plain text Markdown files.</p><p>In the OpenClaw architecture, everything from the agent&#8217;s core instructions (AGENTS.md) to its personality (SOUL.md) and long-term memory (MEMORY.md) is stored as Markdown in a local workspace directory.</p><p>This design choice has several profound implications:</p><ol><li><p>Transparency and Version Control: Because the entire state of the agent is represented as plain text, it can be easily inspected, audited, and version-controlled using standard tools like Git. 
Developers can see exactly what the agent &#8220;knows&#8221; at any given time.</p></li><li><p>Context Injection: When the agent needs to recall past interactions, it doesn&#8217;t query a conventional application database. Instead, it uses a small local SQLite index of vector embeddings to perform semantic search across its Markdown files, injecting the relevant text directly into the LLM&#8217;s context window.</p></li><li><p>Concurrency Challenges: Relying on file system operations for state management introduces significant concurrency issues. If multiple asynchronous processes attempt to update the agent&#8217;s memory simultaneously, race conditions and file corruption can occur. OpenClaw mitigates this by serializing the agent loop per session&#8212;processing one task at a time, in order.</p></li></ol><p>While this file-based approach is elegant in its simplicity, it scales poorly in multi-tenant enterprise environments where high throughput and robust transaction management are required.</p><h3>The Monolith vs. The Micro-VM: OpenClaw and NanoClaw</h3><p>As the security implications of autonomous agents have become apparent, the architectural debate has centered on isolation. How do we prevent an agent from exceeding its intended scope?</p><p>OpenClaw represents the monolithic approach. It is a sprawling framework with hundreds of thousands of lines of code, designed to manage multiple messaging platforms, tool integrations, and agent sessions within a single Node.js process (the Gateway). Security in OpenClaw is primarily handled at the application level, relying on internal rules and permissions to restrict agent behavior.</p><p>This monolithic design is powerful and extensible, but it is also fragile. A vulnerability in any one of its dependencies or integrations can compromise the entire Gateway, granting an attacker access to all active agent sessions and their associated credentials.</p><p>NanoClaw emerged as a direct response to this fragility. It adopts a fundamentally different architectural philosophy: OS-level isolation.</p><p>Instead of running all agents within a single process, NanoClaw runs each agent in its own isolated container (using Docker or Apple Containers). The codebase is intentionally minimalist&#8212;often under 5,000 lines&#8212;reducing the attack surface and making security audits practical.</p><p>If a NanoClaw agent is compromised via prompt injection or a malicious tool, the blast radius is confined to that specific container. The attacker cannot pivot to the host operating system or access the memory of other agents.</p><h3>The Limits of Containerization</h3><p>While NanoClaw&#8217;s containerized approach provides robust protection against host compromise, it is crucial to understand its limitations. Containerization solves the problem of system security, but it does not solve the problem of identity security.</p><p>Consider an agent deployed within a NanoClaw container and granted an OAuth token to access a corporate CRM system. The container prevents the agent from reading the host&#8217;s /etc/passwd file, but it does nothing to prevent the agent from deleting every record in the CRM if it is manipulated into doing so.</p><p>The agent is operating exactly as designed, using the legitimate credentials it was provided. 
The container is intact, but the enterprise data is gone.</p><p>This highlights the core architectural challenge of agentic AI: we must move beyond securing the execution environment and begin securing the actions themselves.</p><h3>Building Verifiable AI Agents</h3><p>To safely deploy autonomous agents in enterprise environments, developers must adopt a defense-in-depth architecture that addresses both system isolation and identity governance.</p><ol><li><p><strong>Explicit Identity Boundaries:</strong> Every agent must be treated as a distinct Non-Human Identity (NHI) with its own ephemeral credentials. Long-lived API keys and broad OAuth scopes must be deprecated in favor of just-in-time, least-privilege access tokens.</p></li><li><p><strong>Verifiable Decision Paths:</strong> The Agent Loop must be instrumented to provide a verifiable audit trail of its reasoning. It is not enough to log the tool calls an agent makes; we must log the context and the LLM outputs that justified those calls. This allows security teams to reconstruct the agent&#8217;s &#8220;intent&#8221; during an incident investigation.</p></li><li><p><strong>Semantic Circuit Breakers:</strong> We cannot rely solely on the LLM to police its own behavior. Agent architectures must incorporate deterministic, semantic circuit breakers&#8212;independent validation layers that inspect proposed tool calls before they are executed. If an agent attempts an action that violates predefined safety invariants (e.g., transferring funds above a certain threshold, modifying production infrastructure), the circuit breaker must halt execution and require human intervention.</p></li></ol><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;f11dd667-5913-4963-ac42-1e475327f1c4&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># Example: A conceptual semantic circuit breaker
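
# Note: is_action_safe, is_action_authorized, request_human_approval and
# perform_action are assumed to be supplied by the surrounding platform;
# only the exception type is defined here to keep the sketch self-contained.
class SecurityException(Exception):
    """Raised when a proposed action violates a deterministic safety invariant."""
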
def execute_tool_call(agent_intent, proposed_action, context):
    # 1. Validate the action against deterministic safety invariants
    if not is_action_safe(proposed_action):
        raise SecurityException("Action violates safety invariants.")
    
    # 2. Verify the action aligns with the agent's authorized scope
    if not is_action_authorized(agent_intent, proposed_action, context):
        request_human_approval(agent_intent, proposed_action)
        return
        
    # 3. Execute the action
    return perform_action(proposed_action)</code></pre></div><h3>The Bottom Line</h3><p>The shift to agentic AI requires a fundamental rethinking of enterprise architecture. We are moving from systems that process data to systems that make decisions and take actions.</p><p>While lightweight, containerized frameworks like NanoClaw offer significant improvements over monolithic designs, they are only part of the solution. True security in the agentic era requires us to govern the identity and the actions of the software itself. We must build systems that are not just isolated, but verifiable, ensuring that autonomy always operates within clearly defined and strictly enforced boundaries.</p><div><hr></div><h2>I&#8217;m Launching a Course!</h2><p>So many AI projects die. And that&#8217;s not the fault of the tech nerds: They built the demo, and it worked. Still, 90% (yes, really) of all AI models never make it into production. So let&#8217;s dig deep into the big organizational underbellies, and let&#8217;s find out how we can make those numbers a bit better.</p><p>That&#8217;s the challenge I&#8217;ll be tackling in a new course starting April 21 at GenAI Academy, where we walk through how to actually move an agentic AI system from demo to production &#8212; including the organizational architecture required to make it work. This is for technical leaders, senior engineers, product managers, and AI/ML team leads. If you haven&#8217;t joined yet, it&#8217;s not too late to sign up!</p><p>I&#8217;m really excited to be able to bring what I&#8217;ve seen from the inside and outside to you in this format. You&#8217;ll experience me teaching live over 6 weeks! You&#8217;ll find all the details here: <a href="https://academy.genai.works/courses/from-demo-to-production/details?utm_campaign=academy_launch&amp;utm_source=instructor&amp;utm_medium=ari_joury&amp;utm_content=from_demo_to_production">From Demo to Production</a>.</p>]]></content:encoded></item><item><title><![CDATA[The Illusion of the Isolated Agent]]></title><description><![CDATA[Why containerizing AI won't save you from the real risks of autonomy.]]></description><link>https://newsletter.wangari.global/p/the-illusion-of-the-isolated-agent</link><guid isPermaLink="false">https://newsletter.wangari.global/p/the-illusion-of-the-isolated-agent</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Thu, 23 Apr 2026 06:01:36 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/194498369/936990be8334ebbab129fa40027b7476.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>I remember the exact moment I realized the chatbot era was over. It was a quiet Tuesday afternoon when a colleague showed me a terminal window running a new open-source tool called OpenClaw. They didn&#8217;t type a prompt asking for a summary. They typed: &#8220;Prepare the weekly sales update.&#8221; The system didn&#8217;t just generate text; it executed the task across multiple systems, without a single human click in between.</p><p>For a brief moment, it felt like magic. Then, the reality of enterprise security set in.</p><p>As the hype around autonomous agents like OpenClaw grows, a counter-narrative has emerged: the promise of the &#8220;secure, local agent.&#8221; Tools like NanoClaw are being pitched as the safe alternative for enterprises. Their core value proposition is isolation. 
By running each agent in its own container&#8212;a secure, OS-level sandbox&#8212;they promise to keep the agent from breaking out and wreaking havoc on your host system.</p><p>It&#8217;s a compelling pitch. It&#8217;s also dangerously incomplete.</p><h2>The Container Fallacy</h2><p>The problem with focusing on containerization is that it solves the wrong problem. Yes, putting an agent in a secure box prevents it from directly attacking the server it runs on. But the real risk of an autonomous agent isn&#8217;t that it will escape its box. The real risk is what it does with the permissions you gave it.</p><p>If you give an agent access to your CRM, your email server, and your financial databases so it can &#8220;prepare the weekly sales update,&#8221; it doesn&#8217;t matter how secure its local container is. The agent now holds the keys to your enterprise.</p><p>If that agent is manipulated via a prompt injection attack, or if it simply hallucinates a destructive command, it will execute that command using the legitimate, authorized access you provided. The logs will show that an authorized account performed the action. The container will have done its job perfectly, isolating the agent while the agent systematically dismantles your data integrity.</p><h2>Identity is the New Perimeter</h2><p>We are still trying to apply legacy security concepts to a fundamentally new paradigm. We think of security as a perimeter&#8212;a wall around our applications or a container around our agents. But when software acts with delegated authority across multiple systems, the perimeter dissolves.</p><p>In the era of autonomous AI, identity is the new perimeter.</p><p>The challenge isn&#8217;t keeping the agent in a box; it&#8217;s governing the agent&#8217;s identity. We need to treat every AI agent as a distinct Non-Human Identity (NHI) with its own credentials, its own strictly scoped permissions, and its own audit logs. We need systems that can monitor not just what an agent is doing, but why it is doing it, enforcing circuit breakers that require human intervention for high-stakes operations.</p><h2>The Bottom Line</h2><p>Containerizing an AI agent is like putting a bank robber in a vault and handing them the combination. The vault is secure, but the assets are still gone. True enterprise security for autonomous agents requires a fundamental shift from isolating the software to governing its identity and its actions. Until we build architectures that can manage non-human identities at scale, the &#8220;secure local agent&#8221; will remain an illusion.</p><div><hr></div><h2>I&#8217;m Launching a Course!</h2><p>So many AI projects die. And that&#8217;s not the fault of the tech nerds: They built the demo, and it worked. Still, 90% (yes, really) of all AI models never make it into production. So let&#8217;s dig deep into the big organizational underbellies, and let&#8217;s find out how we can make those numbers a bit better.</p><p>That&#8217;s the challenge I&#8217;ll be tackling in a new course starting April 21 at GenAI Academy, where we walk through how to actually move an agentic AI system from demo to production &#8212; including the organizational architecture required to make it work. This is for technical leaders, senior engineers, product managers, and AI/ML team leads.</p><p>I&#8217;m really excited to be able to bring what I&#8217;ve seen from the inside and outside to you in this format. You&#8217;ll experience me teaching live over 6 weeks! 
You&#8217;ll find all the details here: <a href="https://academy.genai.works/courses/from-demo-to-production/details?utm_campaign=academy_launch&amp;utm_source=instructor&amp;utm_medium=ari_joury&amp;utm_content=from_demo_to_production">From Demo to Production</a>. It&#8217;s not too late to sign up &#8212; recordings of previous sessions are available to all participants.</p>]]></content:encoded></item><item><title><![CDATA[The Day the Agents Escaped the Sandbox]]></title><description><![CDATA[Why OpenClaw is forcing enterprises to rethink identity, security, and what it means to automate work.]]></description><link>https://newsletter.wangari.global/p/the-day-the-agents-escaped-the-sandbox</link><guid isPermaLink="false">https://newsletter.wangari.global/p/the-day-the-agents-escaped-the-sandbox</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Tue, 21 Apr 2026 06:02:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!K69j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92a33b88-bc50-478e-857f-e4c3e11a920c_1344x768.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K69j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92a33b88-bc50-478e-857f-e4c3e11a920c_1344x768.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K69j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92a33b88-bc50-478e-857f-e4c3e11a920c_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!K69j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92a33b88-bc50-478e-857f-e4c3e11a920c_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!K69j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92a33b88-bc50-478e-857f-e4c3e11a920c_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!K69j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92a33b88-bc50-478e-857f-e4c3e11a920c_1344x768.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K69j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92a33b88-bc50-478e-857f-e4c3e11a920c_1344x768.heic" width="1344" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92a33b88-bc50-478e-857f-e4c3e11a920c_1344x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:115321,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://wangari.substack.com/i/194497374?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92a33b88-bc50-478e-857f-e4c3e11a920c_1344x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K69j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92a33b88-bc50-478e-857f-e4c3e11a920c_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!K69j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92a33b88-bc50-478e-857f-e4c3e11a920c_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!K69j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92a33b88-bc50-478e-857f-e4c3e11a920c_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!K69j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92a33b88-bc50-478e-857f-e4c3e11a920c_1344x768.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Agents are moving AI out of the cute chat interface and into the real world. Don&#8217;t let them sneak up behind you. Image generated with Leonardo AI</figcaption></figure></div><p>I remember the exact moment I realized the chatbot era was over. It wasn&#8217;t a grand announcement or a glossy keynote. 
It was a quiet Tuesday afternoon when a colleague showed me a terminal window running a new open-source tool called OpenClaw. They didn&#8217;t type a prompt asking for a summary or a polite email draft. They typed: &#8220;Prepare the weekly sales update.&#8221;</p><p>What happened next was fundamentally different from anything I had seen before. The system didn&#8217;t just generate text. It broke the objective into steps. It went beyond even the Claude tricks that had blown my mind before. This thing pulled data from an internal CRM, structured the information, validated the outputs against historical records, and drafted an email to stakeholders. It didn&#8217;t just advise; it <em>did the thing</em>. It acted with delegated authority across multiple systems, without a single human click in between.</p><p>For a brief moment, it felt like magic. Then, the reality of enterprise security set in, and the magic quickly turned into a cold sweat.</p><p>If one software agent touches five different systems, does it carry one identity or many? Who approves its access? How is its activity logged and reviewed? And most importantly, what defines acceptable behavior when the agent itself decides the next step?</p><p>We are witnessing a paradigm shift in financial services and enterprise operations. We are moving from AI as a passive assistant to AI as an autonomous agent. And as tools like OpenClaw gain traction, they are exposing the fragility of our current enterprise identity models.</p><h2>The Illusion of the Human-Initiated Workflow</h2><p>For decades, enterprise security has been built on a single, foundational assumption: humans initiate actions. Our entire architecture&#8212;from single sign-on (SSO) to role-based access control (RBAC)&#8212;is designed around the idea that a person logs in, requests access to a resource, performs a task, and logs out. Permissions are scoped to the individual&#8217;s role, and audit logs trace actions back to human intent.</p><p>Autonomous agents break this model entirely.</p><p>OpenClaw and its enterprise equivalents don&#8217;t wait for a human to click a button. They operate continuously, grinding through long, multistep workflows. They inherit permissions, often broadly scoped, and use them to navigate across collaboration tools, internal applications, and external services. They sit between systems, moving data and triggering actions in ways that traditional security tools simply cannot see.</p><p>When an agent acts independently, the concept of &#8220;intent&#8221; becomes incredibly difficult to reconstruct. If an agent hallucinates or is manipulated via a prompt injection attack, it might execute a series of unauthorized actions&#8212;like attempting a crypto transaction or exfiltrating sensitive data&#8212;at machine speed. The logs will show that the actions were performed by an authorized account, but they won&#8217;t explain why.</p><h2>The Engine Room vs. The Front Door</h2><p>The problem isn&#8217;t that we lack security tools; it&#8217;s that our tools are looking in the wrong place.</p><p>Most enterprise security stacks are designed to monitor the &#8220;front door&#8221;&#8212;application configurations, user login events, and permission settings. This made sense when risk lived inside discrete systems.
But the attack surface has moved.</p><p>The real risk now lies in the &#8220;engine room&#8221;&#8212;the runtime layer where AI agents move sensitive data between systems, where OAuth tokens grant persistent cross-platform access, and where a single compromised integration can cascade silently across an entire supply chain.</p><p>Recent data paints a stark picture: A 2026 survey of 500 U.S. enterprise CISOs revealed that <a href="https://kenhuangus.substack.com/p/the-agentic-ecosystem-security-gap">99.4% of organizations</a> experienced at least one SaaS or AI ecosystem security incident in the previous year. Despite running an average of 13 dedicated security tools, nearly a third of these organizations experienced unauthorized data exfiltration through SaaS-to-AI integrations.</p><p>Our legacy tools are blind to API-to-API data flows and cross-app data movement. They audit which permissions exist, but they cannot see what an agent actually does with those permissions at runtime.</p><h2>The Wake-Up Call for Financial Services</h2><p>For professionals in banking, insurance, and asset management, this shift is particularly acute. We operate in highly regulated environments where strict access controls and human-in-the-loop approvals are not just best practices; they are legal requirements.</p><p>The promise of agentic AI in financial services is immense. Imagine an Account Servicing Agent that instantly handles profile updates and document fulfillment, or a Dispute Resolution Agent that automatically classifies cases and gathers evidence [2]. These tools can drastically reduce manual handling and improve customer service.</p><p>But the risks are equally profound. If an autonomous agent is granted broad access to customer financial data and internal transaction systems, a single vulnerability could lead to catastrophic consequences. We cannot simply deploy these agents and hope our existing security posture will hold.</p><h2>The Bottom Line</h2><p>The era of autonomous AI agents is here, and it is not waiting for our security models to catch up. Tools like OpenClaw have made it clear that the value of cross-system automation is too great for enterprises to ignore.</p><p>But we must recognize that agent security is, fundamentally, identity security. We need to move beyond the illusion of the human-initiated workflow and build architectures that can govern non-human identities at scale. We need explicit identity boundaries, configurable controls for agent behavior, and real-time visibility into decision paths.</p><p>The advantage in the coming years will not belong to the organizations that deploy the most agents. It will belong to those that figure out how to deploy them safely.</p><div><hr></div><h2>I&#8217;m Launching a Course!</h2><p>So many AI projects die. And that&#8217;s not the fault of the tech nerds: They built the demo, and it worked. Still, 90% (yes, really) of all AI models never make it into production. So let&#8217;s dig deep into the big organizational underbellies, and let&#8217;s find out how we can make those numbers a bit better.</p><p>That&#8217;s the challenge I&#8217;ll be tackling in a new course starting April 21 <strong>(today!)</strong> at GenAI Academy, where we walk through how to actually move an agentic AI system from demo to production &#8212; including the organizational architecture required to make it work. This is for technical leaders, senior engineers, product managers, and AI/ML team leads. 
It&#8217;s not too late to sign up &#8212; and your company might have the budget to cover the course expense.</p><p>I&#8217;m really excited to be able to bring what I&#8217;ve seen from the inside and outside to you in this format. You&#8217;ll experience me teaching live over 6 weeks! You&#8217;ll find all the details here: <a href="https://academy.genai.works/courses/from-demo-to-production/details?utm_campaign=academy_launch&amp;utm_source=instructor&amp;utm_medium=ari_joury&amp;utm_content=from_demo_to_production">From Demo to Production</a>.</p><div><hr></div><h2>Reads of the Week</h2><ul><li><p><a href="https://kenhuangus.substack.com/p/the-agentic-ecosystem-security-gap">The Agentic Ecosystem Security Gap</a>: In this deep dive for Agentic AI, Ken Huang breaks down a startling report revealing that 99.4% of surveyed enterprises experienced a SaaS or AI security incident last year. He argues that current security tools are blind to the &#8220;engine room&#8221; where AI agents operate across systems, a critical blind spot for financial institutions relying on legacy identity models. If you want to understand why your current security stack won&#8217;t protect you from autonomous agents, read this.</p></li><li><p>In this piece for Cashless: Fintech, CBDC and AI at the speed of Asia, Rich Turrin explores the harsh reality of AI agent deployment in the banking sector, arguing that executives will bypass assistive AI in favor of autonomous agents to cut costs. He connects the theoretical capabilities of agents to concrete banking roles, from customer consultation to dispute resolution. <a href="https://richturrin.substack.com/p/your-banking-job-and-ai-agents-human">Your Banking Job and AI Agents</a> is a sobering look at the immediate impact of autonomy on the financial workforce.</p></li><li><p>A structural transformation is necessary to secure AI-native operations, argues Ben Lorica &#32599;&#29790;&#21345; in <a href="https://gradientflow.substack.com/p/security-for-ai-native-companies">The 6 security shifts AI teams can&#8217;t ignore in 2026</a>. He explains how the shift to agentic systems creates vulnerabilities like &#8220;goal hijacking&#8221; and demands a Zero Trust strategy that treats every agent as a distinct identity.
This is essential reading for anyone tasked with integrating AI agents into enterprise access management frameworks (including myself).</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Nerds Are Losing Their Last Refuge]]></title><description><![CDATA[Computers are becoming more human &#8212; and work is becoming less logical as a result.]]></description><link>https://newsletter.wangari.global/p/nerds-are-losing-their-last-refuge</link><guid isPermaLink="false">https://newsletter.wangari.global/p/nerds-are-losing-their-last-refuge</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Fri, 17 Apr 2026 06:01:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!OI3R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c7f964d-908a-4c81-ac13-2df1836c3282_1344x768.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OI3R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c7f964d-908a-4c81-ac13-2df1836c3282_1344x768.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OI3R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c7f964d-908a-4c81-ac13-2df1836c3282_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!OI3R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c7f964d-908a-4c81-ac13-2df1836c3282_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!OI3R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c7f964d-908a-4c81-ac13-2df1836c3282_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!OI3R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c7f964d-908a-4c81-ac13-2df1836c3282_1344x768.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OI3R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c7f964d-908a-4c81-ac13-2df1836c3282_1344x768.heic" width="1344" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c7f964d-908a-4c81-ac13-2df1836c3282_1344x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:281429,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://wangari.substack.com/i/193805381?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c7f964d-908a-4c81-ac13-2df1836c3282_1344x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!OI3R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c7f964d-908a-4c81-ac13-2df1836c3282_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!OI3R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c7f964d-908a-4c81-ac13-2df1836c3282_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!OI3R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c7f964d-908a-4c81-ac13-2df1836c3282_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!OI3R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c7f964d-908a-4c81-ac13-2df1836c3282_1344x768.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tech work was once a safe haven for people who have difficulties relating to complicated human beings. Image generated with Leonardo AI</figcaption></figure></div><p>For decades, programming, physics, math, and engineering allowed people to live mostly in logical space. If you were analytical, introverted, neurodivergent, or simply uncomfortable with the messy dynamics of human interaction, the computer became a stable partner. It was a refuge.</p><p>I know this from personal experience. My path through particle physics and then into AI and data science was, in part, a path toward a world that made sense. A world where the rules were clear, the feedback was objective, and the right answer was always, in principle, discoverable. The computer did not have bad days. It did not misread your tone. It did not hold grudges.</p><p>But that world is disappearing. And the shift is more profound than most technical professionals have yet fully reckoned with.</p><h2>The Nerd Refuge: Why It Existed</h2><p>The appeal of technical fields to analytical and introverted people was not accidental. It was structural. Old computers were deterministic. 
You wrote a function, and it executed in exactly the same way every time. If it failed, the failure had a cause, and that cause was traceable. The feedback loop was immediate, objective, and, crucially, free of social judgment.</p><p>This attracted people who were uncomfortable with ambiguity. People who found the social dynamics of human interaction exhausting or unpredictable. People who wanted to be evaluated on the quality of their reasoning, not on their ability to navigate office politics or read a room.</p><p>The result was a culture. Engineering departments, physics labs, and quantitative finance desks became places where a certain kind of person could thrive. The brilliant but socially awkward developer. The quant who hates meetings. The engineer who only wants Jira tickets. These archetypes were not just personality quirks; they were adaptations to an environment that rewarded a specific kind of intelligence.</p><h2>AI Changes the Nature of Computers</h2><p>New computers are probabilistic. They are contextual. They are conversational. We now interact with machines much like we interact with people. When you prompt a large language model, you are not executing a command; you are guiding a conversation. The output is not guaranteed to be identical every time. It depends on the context, the phrasing, and the underlying probability distributions of the model&#8217;s training.</p><p>This shift is not merely technical. It is epistemological. The old model of computation was based on the idea that a machine could be fully specified. You could, in principle, trace every output back to every input. The new model is based on the idea that a machine learns patterns from data and generates responses that are statistically likely, not logically certain.</p><p>This has profound implications for how we build and evaluate AI systems. You cannot simply read the code to understand why a model behaves the way it does. You have to observe it, test it, and interpret its outputs in context. You have to develop intuitions about its failure modes and edge cases. You have to think probabilistically, not deterministically.</p><h2>The Irony of Human Complexity in Technical Work</h2><p>The irony is that the more human computers become, the more technical work involves judgment, ambiguity, and interpretation. In other words, it involves human complexity.</p><p>Consider the process of building an AI agent. You are no longer just writing code to perform a specific task. You are designing a system that must interpret intent, handle edge cases gracefully, and make decisions based on incomplete information. You must think about how the system will behave when a user asks it something unexpected. You must anticipate the ways in which the system&#8217;s outputs might be misinterpreted or misused.</p><p>This requires a level of empathy and systemic understanding that was previously the domain of product managers and designers. The technical professional must now bridge the gap between the deterministic world of traditional software and the probabilistic world of AI. They must understand not just how to build the system, but how the system will behave in the wild, interacting with unpredictable human users in unpredictable contexts.</p><p>The bottleneck in technical work has shifted. It is no longer about writing the code. It is about problem definition, system design, and evaluation. 
It is about the human coordination required to turn a working demo into a reliable system inside an organization.</p><h2>Robotics Won&#8217;t Save Us</h2><p>You might think that robotics offers a remaining refuge of purely mechanical engineering. The physical world, at least, is deterministic. A robot arm that picks up a component either succeeds or fails. The physics is clear.</p><p>But even robotics is becoming AI-driven, software-mediated, and model-dependent. The physical world is being abstracted into data, and the machines that navigate it are increasingly relying on the same probabilistic models that power conversational AI. Modern robotic systems use deep learning for perception, reinforcement learning for control, and large language models for task planning. The boundary between the physical and the digital is blurring, and the skills required to navigate both are converging.</p><p>The refuge of purely mechanical engineering is shrinking. Even in the most hardware-adjacent domains, the work is increasingly about designing systems that learn, adapt, and make decisions under uncertainty.</p><h2>What This Means for Nerd Culture</h2><p>This shift presents three possible futures for nerd culture and the technical professions.</p><p>The first is retreat. Some technical professionals will seek out the remaining pockets of purely deterministic work. Low-level systems programming, theoretical mathematics, formal verification&#8212;these are areas where the old rules still apply. This is a legitimate path, but it is a narrowing one. The frontier of technical work is moving rapidly away from pure determinism.</p><p>The second is resistance. Some will cling to the old ways of working, arguing that AI is a fad or that it cannot replace the rigor of traditional engineering. This is understandable, but it is ultimately a losing position. The tools are changing, and the organizations that do not adapt will be left behind.</p><p>The third is evolution. Some will embrace the ambiguity and complexity of the new landscape. They will learn to design systems that integrate human and machine intelligence, leveraging the strengths of both. They will develop new skills&#8212;communication, empathy, strategic thinking&#8212;not because they have abandoned their technical identity, but because they have expanded it.</p><p>This third group will dominate the future of technical work.</p><h2>The Evolution of the Technical Professional</h2><p>The evolution into systems thinkers requires a fundamental shift in mindset. It means moving away from a focus on individual components and towards a holistic understanding of the entire system. It means recognizing that the technical architecture is inextricably linked to the organizational architecture.</p><p>This is not an easy transition. It requires developing new skills, such as communication, empathy, and strategic thinking. It requires learning to navigate the messy, ambiguous world of human interaction that many technical professionals initially sought to avoid. It requires tolerating uncertainty and making decisions with incomplete information.</p><p>But it is a necessary transition. And it is worth noting that many of the skills that technical professionals have developed&#8212;rigorous thinking, attention to detail, the ability to decompose complex problems&#8212;are highly transferable to this new landscape. 
The challenge is not to abandon these skills, but to apply them in a broader context.</p><p>The organizations that succeed in the AI era will be the ones that can effectively integrate human and machine intelligence. And that requires technical professionals who can bridge the gap between the two. Not everyone has to become a communicator. But the interface between humans and machines must be owned by someone who understands both sides.</p><h2>The Bottom Line</h2><p>For decades, nerds escaped into machines because machines were simpler than humans. Now the machines are learning to talk back. The refuge of pure logic is disappearing, replaced by a new landscape of probabilistic complexity.</p><p>The challenge for technical professionals is not to resist this change, but to embrace it. The skills that made you valuable in the old world&#8212;rigorous thinking, deep focus, the ability to decompose complex problems&#8212;are still valuable. But they need to be applied in a broader context, one that includes the messy, ambiguous reality of human organizations and probabilistic AI systems.</p><p>The best technical professionals of the next decade will be those who can design systems, think clearly, and bridge the gap between humans and machines. Not because they have abandoned their technical identity, but because they have expanded it to meet the demands of a new era.</p><div><hr></div><h2>I&#8217;m Launching a Course!</h2><p>So many AI projects die. And that&#8217;s not the fault of the tech nerds: They built the demo, and it worked. Still, 90% (yes, really) of all AI models never make it into production. So let&#8217;s dig deep into the big organizational underbellies, and let&#8217;s find out how we can make those numbers a bit better.</p><p>That&#8217;s the challenge I&#8217;ll be tackling in a new course starting April 21 at GenAI Academy, where we walk through how to actually move an agentic AI system from demo to production &#8212; including the organizational architecture required to make it work. This is for technical leaders, senior engineers, product managers, and AI/ML team leads.</p><p>I&#8217;m really excited to be able to bring what I&#8217;ve seen from the inside and outside to you in this format. You&#8217;ll experience me teaching live over 6 weeks! You&#8217;ll find all the details here: <a href="https://academy.genai.works/courses/from-demo-to-production/details?utm_campaign=academy_launch&amp;utm_source=instructor&amp;utm_medium=ari_joury&amp;utm_content=from_demo_to_production">From Demo to Production</a>.</p>]]></content:encoded></item><item><title><![CDATA[The End of the Quiet Engineer]]></title><description><![CDATA[When the code writes itself, the people who only wanted to write code are left exposed.]]></description><link>https://newsletter.wangari.global/p/the-end-of-the-quiet-engineer</link><guid isPermaLink="false">https://newsletter.wangari.global/p/the-end-of-the-quiet-engineer</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Thu, 16 Apr 2026 06:01:05 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/193805476/dd54d215ca56c5c2aaba4a9a2423a0e0.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>For decades, technical organizations had a quiet deal with their engineers. If you were good enough technically, you could mostly stay in the world of logic. 
The brilliant but socially awkward developer, the quant who hates meetings, the engineer who only wants Jira tickets&#8212;these archetypes worked because technical work was scarce.</p><p>But that deal is breaking down.</p><h3>The Nature of Technical Work Has Changed</h3><p>AI does two things simultaneously: it makes technical production easier, and it makes interpretation and framing harder. The bottleneck in software development is no longer writing the code itself. Instead, the bottleneck has shifted to problem definition, system design, and evaluation.</p><p>When an AI agent can generate a working component from a well-scoped prompt, the sheer volume of code an individual can produce skyrockets. But this acceleration exposes a new constraint: human coordination. The very people who entered technical fields to avoid the messy, ambiguous world of human interaction are now finding that their jobs require them to navigate it constantly.</p><h3>The Leadership Problem</h3><p>Now leaders face a difficult question: what do we do with people who entered technical fields precisely to avoid this kind of work?</p><p>Organizations are experimenting with three responses. The first is to simply replace them, driven by the narrative of AI productivity gains. This destroys deep institutional knowledge. The second is to force them to become extroverts, expecting every engineer to present and coordinate. This alienates neurodivergent talent and deep thinkers.</p><p>The third response&#8212;and the only sustainable one&#8212;is to redesign technical organizations entirely. Instead of flattening roles and expecting everyone to be a generalist communicator, forward-thinking organizations are creating new structures: technical translators, architect roles, AI system designers, and evaluation specialists.</p><p>Not everyone has to become a communicator. But the interface between humans and machines must be owned.</p><h3>The Bottom Line</h3><p>Strong organizations will protect their deep thinkers. They will pair them with translators and upgrade the system architecture around AI, rather than simply flattening roles. 
AI is not eliminating engineers; it is forcing organizations to learn how to work with them differently.</p>]]></content:encoded></item><item><title><![CDATA[What Do You Do With Your Nerds When AI Changes the Rules?]]></title><description><![CDATA[AI is reshaping technical teams&#8212;and leadership has to change too.]]></description><link>https://newsletter.wangari.global/p/what-do-you-do-with-your-nerds-when</link><guid isPermaLink="false">https://newsletter.wangari.global/p/what-do-you-do-with-your-nerds-when</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Tue, 14 Apr 2026 06:01:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KA7t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37681b26-15a2-4cce-8ed8-329befda3bc7_1344x768.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KA7t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37681b26-15a2-4cce-8ed8-329befda3bc7_1344x768.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KA7t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37681b26-15a2-4cce-8ed8-329befda3bc7_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!KA7t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37681b26-15a2-4cce-8ed8-329befda3bc7_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!KA7t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37681b26-15a2-4cce-8ed8-329befda3bc7_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!KA7t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37681b26-15a2-4cce-8ed8-329befda3bc7_1344x768.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KA7t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37681b26-15a2-4cce-8ed8-329befda3bc7_1344x768.heic" width="1344" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37681b26-15a2-4cce-8ed8-329befda3bc7_1344x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:271233,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://wangari.substack.com/i/193804189?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37681b26-15a2-4cce-8ed8-329befda3bc7_1344x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!KA7t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37681b26-15a2-4cce-8ed8-329befda3bc7_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!KA7t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37681b26-15a2-4cce-8ed8-329befda3bc7_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!KA7t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37681b26-15a2-4cce-8ed8-329befda3bc7_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!KA7t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37681b26-15a2-4cce-8ed8-329befda3bc7_1344x768.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Your company&#8217;s tech gave technical talent a refuge. Now it&#8217;s being torn down by AI. Image generated with Leonardo AI</figcaption></figure></div><p>For decades, technical organizations had a quiet deal with their engineers. If you were good enough technically, you could mostly stay in the world of logic. We all know the archetypes: the brilliant but socially awkward developer, the quant who hates meetings, the engineer who only wants Jira tickets. This worked because technical work was scarce, and the ability to translate human ambiguity into machine certainty was a rare and highly valued skill.</p><p>But that deal is breaking down. The arrival of generative AI is fundamentally altering the nature of technical work, and it is doing so in a way that directly challenges the traditional refuge of the analytical mind.</p><h2>AI Changes the Nature of Technical Work</h2><p>AI does two things simultaneously: it makes technical production easier, and it makes interpretation and framing harder. The bottleneck in software development and data science is moving rapidly. It is no longer about writing the code itself. 
Instead, the bottleneck has shifted to problem definition, system design, and evaluation.</p><p>When an AI agent can generate a working component, its tests, and a deployment configuration from a well-scoped prompt, the sheer volume of code an individual can produce skyrockets. But this acceleration exposes a new constraint. As Michael Novati recently observed, the real bottleneck in the AI era is human. It is the coordination inside organizations, the alignment of incentives, and the ability to clearly articulate what needs to be built in the first place.</p><p>This shift means that technical work now requires significantly more human coordination. The very people who entered technical fields to avoid the messy, ambiguous world of human interaction are now finding that their jobs require them to navigate it constantly.</p><h2>The Leadership Problem</h2><p>Now leaders face a difficult question: what do we do with people who entered technical fields precisely to avoid this kind of work?</p><p>Organizations are currently experimenting with three possible responses. The first is to simply replace them. This is the narrative of layoffs driven by AI productivity gains. The problem with this approach is that it destroys deep institutional knowledge. You might gain short-term efficiency, but you lose the people who actually understand how your systems work under the hood.</p><p>The second response is to force them to become extroverts. Suddenly, every engineer is expected to present, coordinate, and lead meetings. The problem here is equally severe: you lose people who are brilliant but wired differently. You alienate the neurodivergent talent and the deep thinkers who thrive in focused, uninterrupted work.</p><p>The third response&#8212;and the only sustainable one&#8212;is to redesign technical organizations entirely.</p><h2>Redesigning Technical Organizations</h2><p>This is the interesting path. Instead of flattening roles and expecting everyone to be a generalist communicator, forward-thinking organizations are creating new structures. They are introducing roles like technical translators, architect roles, AI system designers, and evaluation specialists.</p><p>Not everyone has to become a communicator. But the interface between humans and machines must be owned. As Priyanka Vergadia points out, the old model of rigid, specialized silos is giving way to more fluid, cross-functional cells. In these new structures, you need both &#8220;M-shaped&#8221; engineers who can orchestrate across domains and &#8220;T-shaped&#8221; specialists who go deep into complex, non-promptable problems.</p><h2>The Bottom Line</h2><p>Strong organizations will protect their deep thinkers. They will pair them with translators and upgrade the system architecture around AI, rather than simply flattening roles and hoping for the best. This approach preserves cognitive diversity, which is more critical now than ever.</p><p>AI is not eliminating engineers. It is forcing organizations to learn how to work with them differently.</p><div><hr></div><h2>I&#8217;m Launching a Course!</h2><p>So many AI projects die. And that&#8217;s not the fault of the tech nerds: They built the demo, and it worked. Still, 90% (yes, really) of all AI models never make it into production. 
So let&#8217;s dig deep into the big organizational underbellies, and let&#8217;s find out how we can make those numbers a bit better.</p><p>That&#8217;s the challenge I&#8217;ll be tackling in a new course starting April 21 at GenAI Academy, where we walk through how to actually move an agentic AI system from demo to production &#8212; including the organizational architecture required to make it work. This is for technical leaders, senior engineers, product managers, and AI/ML team leads.</p><p>I&#8217;m really excited to be able to bring what I&#8217;ve seen from the inside and outside to you in this format. You&#8217;ll experience me teaching live over 6 weeks! You&#8217;ll find all the details here: <a href="https://academy.genai.works/courses/from-demo-to-production/details?utm_campaign=academy_launch&amp;utm_source=instructor&amp;utm_medium=ari_joury&amp;utm_content=from_demo_to_production">From Demo to Production</a>.</p><div><hr></div><h2>Reads of the Week</h2><ul><li><p><a href="https://priyankavergadia.substack.com/p/ai-is-changing-engineering-teams">AI Is Reshaping Engineering Orgs. Here&#8217;s How to Stay Ahead</a>: In this piece for The Cloud Girl, Priyanka Vergadia argues that the traditional pyramid structure of engineering teams is being replaced by a &#8220;Cellular AI Org Model.&#8221; She explains how cross-functional, outcome-focused teams paired with autonomous agents are the future of technical work. This is essential reading for any leader trying to understand how to structure their teams for the AI era.</p></li><li><p><a href="https://rdel.substack.com/p/rdel-99-how-has-ai-impacted-engineering">RDEL #99: How has AI impacted engineering leadership in 2025?</a>: Lizzie Matusov breaks down the findings from the 2025 LeadDev Engineering Leadership Report. She highlights that while AI adoption is widespread, its transformative impact on productivity hasn&#8217;t fully materialized yet, requiring leaders to treat AI adoption as an organizational change rather than just a tooling choice.
It&#8217;s a sobering look at the reality of AI integration in enterprise environments.</p></li><li><p><a href="https://michaelnovati.substack.com/p/the-real-bottleneck-in-the-ai-era">The Real Bottleneck in the AI Era Is Human</a>: In this beautiful essay, Michael Novati explores why the massive acceleration in coding speed hasn&#8217;t translated to a proportional increase in shipped products. He argues that the true bottleneck is the human system surrounding production&#8212;coordination, trust, and regulation. This piece perfectly captures the tension between machine speed and human friction.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Why Agentic Systems Fail Without Structure]]></title><description><![CDATA[How to build structural priors for LLMs and avoid the spaghetti code trap.]]></description><link>https://newsletter.wangari.global/p/why-agentic-systems-fail-without</link><guid isPermaLink="false">https://newsletter.wangari.global/p/why-agentic-systems-fail-without</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Fri, 10 Apr 2026 06:00:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EK1X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf699d49-d82d-425e-bba1-bfb391867347_1344x768.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!EK1X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf699d49-d82d-425e-bba1-bfb391867347_1344x768.heic" width="1344" height="768" alt=""><figcaption class="image-caption">Garbage in, garbage out. Quality but unordered in, disorder out. AI needs order and guardrails. Image generated with Leonardo AI</figcaption></figure></div><p>If you have ever watched an autonomous AI agent try to refactor a legacy codebase, you know the pain.
I have been facing this challenge daily because my company, Wangari Global, automates complex financial and ESG reporting workflows. The fundamental problem we solve is the tension between the efficiency of Large Language Models (LLMs) and the strict compliance requirements of regulated industries.</p><p>LLMs are probabilistic engines designed to generate plausible text, not deterministic calculators designed to guarantee truth. In finance and insurance, a single hallucinated number or fabricated regulatory interpretation can lead to severe compliance failures. We cannot simply prompt an LLM to &#8220;analyze this data and write a report.&#8221; We need a fundamentally different architecture.</p><h2>The Spaghetti Code Trap</h2><p>The core observation from recent agentic engineering failures is simple: if your codebase has no consistent patterns, agents cannot infer them. (This <a href="https://levelup.gitconnected.com/the-things-no-one-tells-you-about-agentic-engineering-2d745d1706ea">human-written article</a> explains it much better than I do &#8212; it&#8217;s behind the Medium paywall, though you can get some free reads.)</p><p>LLMs rely entirely on pattern inference. They are statistical engines that predict the next token based on the context they are given. When you deploy an agent into a messy system&#8212;a historical jambalaya of library preferences, inconsistent API wrappers, and varying coding styles&#8212;you destroy the patterns the LLM needs to function.</p><p>Let&#8217;s say your project uses both fetch and Axios for API calls, and throws in some TanStack for good measure. Will your next generated API call use fetch or Axios? Who knows? The agent&#8217;s behavior becomes stochastic. It might generate a pile of sort-of-working stuff, but it won&#8217;t be consistent, and it certainly won&#8217;t be reliable.</p><p>This is the spaghetti code trap. The messier your existing code is, the less effective your agents will be, and your technical debt will compound like never before.</p><h2>The Need for Structural Priors</h2><p>To solve this, we need to introduce a concept that is well-known in machine learning but often ignored in agentic engineering: structural priors.</p><p>In machine learning, an inductive bias (or prior) is a set of assumptions that the learning algorithm uses to predict outputs for inputs it has not encountered. In Bayesian inference, priors represent our beliefs about a parameter before we see the data. In distributed systems, architecture constraints ensure that components interact predictably.</p><p>Agentic systems require structural priors. We must impose constraints on the LLM to guide its inference and ensure deterministic behavior.</p><p>Examples of structural priors include:</p><ul><li><p>Consistent APIs</p></li><li><p>Schema-first design</p></li><li><p>Typed interfaces</p></li><li><p>Deterministic pipelines</p></li><li><p>Causal graphs</p></li></ul><h2>Building a Deterministic Pipeline</h2><p>At Wangari Global, we use a &#8220;deterministic-first&#8221; architecture to impose structural priors on our agentic workflows. Instead of asking an AI to act as an autonomous analyst, we deploy AI strictly as a communication layer.</p><p>Here is how we structure a deterministic pipeline conceptually:</p><ol><li><p>Calculating the Facts: Traditional, auditable code processes raw data and computes core financial figures deterministically. The LLM does not do math.</p></li><li><p>Organizing the Metrics: Verified figures are structured into a machine-readable format, such as a strict JSON schema.</p></li><li><p>Issuing Clear Instructions: The AI receives a strict template and the verified data, with explicit instructions not to add external information. We use low temperature settings to reduce variance.</p></li><li><p>Writing the Narrative: The AI translates the verified numbers into clear, human-readable prose.</p></li><li><p>Final Review: An automated process checks every number in the generated text against the original verified dataset, rejecting any output with discrepancies.</p></li></ol>
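<p>To make those five steps concrete, here is a minimal sketch of the pipeline in code. It is illustrative, not our production system: the metric names, the regex check, and the call_llm helper (a stand-in for whatever model client you use) are all assumptions for the example.</p><div class="highlighted_code_block"><pre class="shiki"><code class="language-python"># Minimal sketch of a deterministic-first pipeline; names are illustrative.
import json
import re

def build_report(meter_readings: list[float]) -> str:
    # 1. Calculate the facts: plain, auditable code computes every figure.
    metrics = {"water_consumption_m3": round(sum(meter_readings), 2)}

    # 2. Organize the metrics into a strict, machine-readable payload.
    payload = json.dumps(metrics)

    # 3. and 4. Issue clear instructions; the LLM only verbalizes the numbers.
    prompt = (
        "Write one paragraph describing these verified figures. "
        "Use every number exactly as given and add no other numbers.\n" + payload
    )
    draft = call_llm(prompt, temperature=0.0)  # call_llm wraps your model API

    # 5. Final review: every number in the draft must come from the verified set.
    numbers_in_draft = {float(n) for n in re.findall(r"\d+(?:\.\d+)?", draft)}
    if not numbers_in_draft.issubset(set(metrics.values())):
        raise ValueError("narrative contains figures outside the verified dataset")
    return draft</code></pre></div>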
<h2>The Causal Connection</h2><p>This concept of structural priors is exactly why causal systems work. They impose structure on inference.</p><p>Consider the difference between naive machine learning and structured causal inference. In naive inference, a model infers patterns from messy data without understanding the underlying structure. It might find a correlation between ice cream sales and shark attacks, but it doesn&#8217;t know that summer heat is the hidden confounder driving both.</p><p>In structured causal inference, we impose a structural prior&#8212;a causal graph or Directed Acyclic Graph (DAG)&#8212;to guide the estimation.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;aaf4aff6-001f-4911-9ce3-709274a35f33&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># Structured causal inference
# We impose a structural prior (the causal DAG) to guide the estimation.
# Sketch using the open-source DoWhy library; df (a pandas DataFrame of
# observations) and causal_dag (the DAG specification) are assumed given.
from dowhy import CausalModel

model = CausalModel(data=df, treatment="water_recycling",
                    outcome="water_consumption", graph=causal_dag)
identified_estimand = model.identify_effect()  # adjustment set implied by the DAG
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
)</code></pre></div><p>By letting the data speak through causal graphs, we give our human decision-makers the clarity they need to govern the agents effectively. We move from asking &#8220;what happened?&#8221; to &#8220;why did it happen, and what if we change it?&#8221;</p><p>If you want to deploy autonomous agents in a regulated environment, you cannot rely on the magic of the model. You must build structural priors into your architecture. By separating the calculation of facts from the generation of narrative, and enforcing an automated review layer, we can harness the power of LLMs without exposing our organizations to unacceptable regulatory risk.</p><p>We don&#8217;t need the AI to be an analyst; we just need it to be a very reliable translator. And to do that, we must give it the structure it needs to succeed.</p><h2>More on the Spaghetti Code Trap</h2><p>Let&#8217;s delve deeper into why the spaghetti code trap is so pernicious. When we talk about &#8220;messy code,&#8221; we often think of it as a human problem&#8212;it&#8217;s hard for developers to read, maintain, and extend. But for an LLM, messy code is an epistemological problem.</p><p>LLMs are essentially highly sophisticated pattern-matching engines. They learn the statistical distribution of tokens in their training data and use that to predict the next token in a sequence. When you provide an LLM with a prompt, you are essentially giving it a starting point and asking it to continue the pattern.</p><p>If your codebase is consistent&#8212;if it uses the same naming conventions, the same architectural patterns, the same libraries&#8212;the LLM can easily infer the pattern and generate code that fits seamlessly into your project. But if your codebase is a mess, the LLM has no clear pattern to follow. It might pick up on a pattern from one part of the codebase and apply it to another, resulting in inconsistent and buggy code.</p><p>This is why agentic engineering often fails in legacy systems. The agents are trying to build on a foundation of sand. They are trying to infer patterns where none exist. And the result is exactly what you would expect: a stochastic mess.</p><h2>The Role of Inductive Bias</h2><p>To understand how to fix this, we need to look at the concept of inductive bias in machine learning. Inductive bias refers to the set of assumptions that a learning algorithm uses to predict outputs for inputs it has not encountered.</p><p>For example, a linear regression model has a strong inductive bias: it assumes that the relationship between the input variables and the output variable is linear. A decision tree has a different inductive bias: it assumes that the relationship can be modeled as a series of hierarchical decisions.</p><p>LLMs have a very weak inductive bias. They are designed to be general-purpose pattern matchers, capable of learning almost any pattern if given enough data. This is what makes them so powerful, but it is also what makes them so fragile. Without a strong inductive bias, they are easily confused by noise and inconsistency.</p><p>When we impose structural priors on an LLM, we are essentially giving it an inductive bias. We are telling it, &#8220;Assume that the code should follow this specific pattern.&#8221; This constrains the LLM&#8217;s search space and makes its output much more predictable and reliable.</p><h2>Implementing Structural Priors in Practice</h2><p>So, how do we implement structural priors in practice? 
It requires a shift in how we think about software architecture.</p><p>Instead of building monolithic applications where everything is tightly coupled, we need to build modular systems with clear, well-defined interfaces. We need to use schema-first design, where the data structures are defined upfront and strictly enforced. We need to use typed languages, where the compiler can catch errors before the code is even run.</p>
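<p>To make &#8220;defined upfront and strictly enforced&#8221; concrete, here is a minimal sketch using the Pydantic library. The metric fields and their bounds are invented for the example; the point is that malformed or extra fields fail loudly instead of being silently absorbed.</p><div class="highlighted_code_block"><pre class="shiki"><code class="language-python"># Schema-first sketch with Pydantic; the metric fields are invented examples.
from pydantic import BaseModel, ConfigDict, Field

class VerifiedMetrics(BaseModel):
    model_config = ConfigDict(extra="forbid")  # reject fields we did not compute

    reporting_year: int = Field(ge=2000, le=2100)
    water_consumption_m3: float = Field(ge=0.0)
    water_recycling_rate: float = Field(ge=0.0, le=1.0)

# Parsing raises a ValidationError instead of letting bad data drift downstream.
metrics = VerifiedMetrics.model_validate_json(raw_payload)  # raw_payload: JSON string</code></pre></div>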
<p>And most importantly, we need to build deterministic pipelines. As I mentioned earlier, a deterministic pipeline separates the calculation of facts from the generation of narrative. It ensures that the core logic of the application is handled by traditional, auditable code, while the LLM is relegated to the role of a communication layer.</p><p>This approach requires more upfront engineering effort, but it pays off in the long run. It makes the system much more robust, much easier to maintain, and much less prone to the kind of catastrophic failures that can occur when an autonomous agent goes rogue.</p><h2>The Future of Agentic Engineering</h2><p>As we move further into the era of agentic engineering, the importance of structural priors will only grow. We are already seeing the limits of what can be achieved with raw LLM power alone. The next wave of innovation will come from combining LLMs with structured, deterministic systems.</p><p>This is the core philosophy behind Wangari Global. We believe that the true power of AI lies not in its ability to generate code or text, but in its ability to help humans make better decisions. And to do that, we need systems that are reliable, auditable, and transparent.</p><p>We need systems that are built on a foundation of solid engineering principles, not just the magic of the model. We need systems that embrace the power of structural priors.</p><p>In the end, agentic AI is just a tool. It is a very powerful tool, but it is still just a tool. It is up to us, the engineers and architects, to use it responsibly. And that means building systems that are designed for reliability, not just speed. It means embracing the discipline of structural priors and rejecting the chaos of the spaghetti code trap.</p>]]></content:encoded></item><item><title><![CDATA[The Hidden Cost of AI Productivity]]></title><description><![CDATA[Why the multithreaded mind is burning out engineers faster than ever.]]></description><link>https://newsletter.wangari.global/p/the-hidden-cost-of-ai-productivity</link><guid isPermaLink="false">https://newsletter.wangari.global/p/the-hidden-cost-of-ai-productivity</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:00:31 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/193081783/74e8156f183e0b6ddca597571a2580b5.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>We thought AI would reduce our cognitive load. We thought it would do the heavy lifting so we could relax, or at least focus on the &#8220;fun&#8221; parts of the job. We thought it was an efficiency tool.</p><p>But AI doesn&#8217;t reduce thinking. It just changes the type of thinking we do.</p><p>Before agentic development, an engineer spent their brainpower on deep, focused problem-solving. They sat down with a single task: writing a correct function, solving a logical puzzle, or hunting down the root cause of a bug. It was a linear process. You could hold the entire context of the problem in your head at once.</p><p>Agentic coding changes that entirely. The mental mode is no longer deep problem-solving. The new mental mode is rapid judgment.</p><h2>The Multithreaded Mind</h2><p>You are no longer the person writing the code; you are the person supervising the agents that write the code. And that means you are context-switching constantly. Every time the AI generates a file of code&#8212;and it generates them very, very fast&#8212;you have to make a micro-decision.</p><p>Do I accept this change? Is this code actually safe? Does it break our enterprise architecture? Does it violate our compliance guardrails? Did the AI actually understand the legacy spaghetti code it just tried to refactor, or did it just hallucinate a plausible-looking solution?</p><p>You are making these decisions dozens, maybe hundreds of times a day. You are fragmenting your attention to levels that require downright multithreaded thinking. And humans are terrible at multithreaded thinking.</p><h2>The Trust Deficit</h2><p>This matters immensely for those of us working in regulated industries&#8212;in banking, in insurance, in ESG reporting. In these industries, trust is hard-won, and the cost of being wrong is incredibly high.</p><p>When you deploy agentic AI in a regulated environment, the burden of trust falls entirely on the human orchestrator. But when you are suffering from decision fatigue&#8212;when your multithreaded mind is stretched to its absolute limit&#8212;that chain of trust starts to break down. You start rubber-stamping pull requests. You start assuming the AI got it right because it usually gets it right. And that is exactly when a compliance failure happens.</p><h2>The Bottom Line</h2><p>Agentic development doesn&#8217;t remove cognitive load. It converts engineering into decision orchestration. In the AI era, the scarce resource isn&#8217;t compute. It isn&#8217;t the ability to generate text or code. The scarce resource is clarity of thought.</p>]]></content:encoded></item><item><title><![CDATA[AI Didn't Solve Software.
It Moved the Bottleneck.]]></title><description><![CDATA[Why agentic engineering is exposing the weakest layer of your organization.]]></description><link>https://newsletter.wangari.global/p/ai-didnt-solve-software-it-moved</link><guid isPermaLink="false">https://newsletter.wangari.global/p/ai-didnt-solve-software-it-moved</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Tue, 07 Apr 2026 06:02:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!s7hU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b6e78d-279c-446b-9a7f-1876eb2e29b2_1344x768.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s7hU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b6e78d-279c-446b-9a7f-1876eb2e29b2_1344x768.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s7hU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b6e78d-279c-446b-9a7f-1876eb2e29b2_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!s7hU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b6e78d-279c-446b-9a7f-1876eb2e29b2_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!s7hU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b6e78d-279c-446b-9a7f-1876eb2e29b2_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!s7hU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b6e78d-279c-446b-9a7f-1876eb2e29b2_1344x768.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s7hU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b6e78d-279c-446b-9a7f-1876eb2e29b2_1344x768.heic" width="1344" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/84b6e78d-279c-446b-9a7f-1876eb2e29b2_1344x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:247616,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://wangari.substack.com/i/193073810?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b6e78d-279c-446b-9a7f-1876eb2e29b2_1344x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s7hU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b6e78d-279c-446b-9a7f-1876eb2e29b2_1344x768.heic 424w, 
https://substackcdn.com/image/fetch/$s_!s7hU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b6e78d-279c-446b-9a7f-1876eb2e29b2_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!s7hU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b6e78d-279c-446b-9a7f-1876eb2e29b2_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!s7hU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84b6e78d-279c-446b-9a7f-1876eb2e29b2_1344x768.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">We&#8217;re currently building high-speed trains without having the railways ready, so to speak. Image generated with Leonardo AI</figcaption></figure></div><p>Every Tuesday morning, two executives look at their AI dashboards. One sees a massive spike in developer productivity, with agents writing thousands of lines of code overnight. The other sees a chaotic tangle of unverified assumptions, compounding technical debt, and a compliance team drowning in pull requests. The first executive thinks they have solved software. The second realizes they have just moved the bottleneck.</p><p>We have spent the last two years obsessed with the speed of generation. We measured success by how fast an LLM could write a Python script or draft a quarterly report. And by that metric, we won. Coding is cheap now. Execution is no longer the scarce resource. But as the cost of production approaches zero, the cost of coordination skyrockets.</p><p>Agentic AI didn&#8217;t eliminate the friction in our organizations; it simply pushed it downstream. And in doing so, it exposed a fundamental truth: the real constraint was never our ability to write code. It was our ability to make decisions.</p><h2>The Great Shift in Bottlenecks</h2><p>Before the agentic era, the bottlenecks in software development were clear. They were writing code, debugging logic, and implementing features.
If you wanted to move faster, you hired more engineers or adopted better frameworks. Developer productivity was the ultimate metric.</p><p>Agentic coding changes that entirely. When you deploy autonomous agents into your workflows, the code generation bottleneck vanishes. But it is immediately replaced by a decision bottleneck.</p><p>Every time an agent generates files and files of code, someone has to make a decision. Do we accept these changes? Do the tests truly cover what needs to be covered? Does this align with our enterprise architecture? Does it violate our compliance guardrails? The bottleneck has shifted from coding skill to system design skill, from developer productivity to organizational coherence.</p><p>We are building guardrails around a very fast machine, very much like trying to lay the tracks in front of a speeding train, set the signals up, and check if they&#8217;re working, all at the same time.</p><h2>The Decision Multiplier</h2><p>There is a pervasive myth that AI will make decisions easier for us. In reality, agentic AI multiplies the number of decisions we need to make.</p><p>When an agentic system operates at scale, it doesn&#8217;t just execute tasks; it surfaces ambiguities. It forces us to confront the messy, undocumented assumptions that hold our legacy systems together. If your codebase or your business logic is a historical jambalaya of conflicting preferences, the agent won&#8217;t fix it. It will simply generate a stochastic mess of sort-of-working stuff.</p><p>This is where the traditional management models break down. You cannot manage an agentic workflow with the same Agile ceremonies you used for human developers. The agents will have conversations between themselves, make assumptions, omit to ask you, develop a solution, and claim that it&#8217;s all done. Suddenly, you have 13 stories worth of code on a branch, and no one has the historical context to verify if it&#8217;s actually correct.</p><h2>The Causal Imperative</h2><p>This shift in bottlenecks is exactly why we focus so heavily on causal intelligence at Wangari Global.</p><p>When execution is cheap, the real competitive advantage shifts to clarity of reasoning. If your organization cannot clearly articulate why a decision should be made, or what the structural drivers of a problem are, all the agentic AI in the world will only help you make the wrong decisions faster.</p><p>Most organizations are still using data to describe what happened or predict what might happen based on historical correlations. But when you are orchestrating a superhuman workforce of AI agents, correlation is not enough. You need causal discovery. You need to be able to test hypotheses about the true drivers of your business: If we change X, what happens to Y?</p><p>By letting the data speak through causal graphs, we impose structure on our inference. We give our human decision-makers the clarity they need to govern the agents effectively.</p><h2>The Bottom Line</h2><p>The next generation of AI infrastructure won&#8217;t be about writing code faster. It will be about helping humans decide what the code should do.</p><p>If you are deploying agentic systems without upgrading your decision-making architecture, you are not innovating. You are just automating your technical debt. 
The organizations that win in this new era will be the ones that recognize that while AI has made execution cheap, clarity of thought remains the ultimate premium.</p><div><hr></div><h2>Reads of the Week</h2><ul><li><p><a href="https://platforms.substack.com/p/the-problem-with-agentic-ai-in-2025">The problem with agentic AI in 2025</a>: In this essay for Platforms, AI, and the Economics of BigTech, Sangeet Paul Choudary argues that treating agentic AI as mere task automation misses its true potential as a coordination technology. He uses the brilliant analogy of canals versus railroads to illustrate why we need new systemic architectures, not just faster execution. Though it&#8217;s from last year, it still feels timely.</p></li><li><p><a href="https://aiworkplacewellness.substack.com/p/the-quiet-rise-of-ai-fatigue">The Quiet Rise of AI Fatigue</a>: In this essay for AI Technostress, Paul Chaney unpacks the productivity paradox of AI through the story of a software engineer who shipped more code than ever in his career &#8212; and burned out harder than ever. His central insight maps directly to the bottleneck shift: AI removes mechanical work and replaces it with an endless stream of evaluative judgment, turning every engineer into a reviewer at an assembly line that never stops. Essential reading for any leader who thinks AI fatigue is someone else&#8217;s problem.</p></li><li><p><a href="https://akvanewsletter.substack.com/p/is-ai-actually-making-human-to-work">Is AI Actually Making Human Work More Intensive?</a>: Abhishek Veeramalla (AKVA) applies the Jevons Paradox to AI adoption &#8212; arguing that as the cost of intelligence collapses, total cognitive consumption rises rather than falls, because organizations simply start projects that were never economically viable before. The result is a flywheel of increasing work, where every task completed by AI reveals ten more for the human to manage.
A sharp, data-grounded read for anyone trying to understand why 77% of employees report AI has increased their workload.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Building the Anti-Hallucination Pipeline: Post-Hoc Verification in Practice]]></title><description><![CDATA[How to design multi-model architectures that catch errors before your clients do.]]></description><link>https://newsletter.wangari.global/p/building-the-anti-hallucination-pipeline</link><guid isPermaLink="false">https://newsletter.wangari.global/p/building-the-anti-hallucination-pipeline</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Fri, 03 Apr 2026 13:16:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!hWzZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c2bc7a-b6b1-491e-ac4c-bd46831081b8_1344x768.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hWzZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c2bc7a-b6b1-491e-ac4c-bd46831081b8_1344x768.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hWzZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c2bc7a-b6b1-491e-ac4c-bd46831081b8_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!hWzZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c2bc7a-b6b1-491e-ac4c-bd46831081b8_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!hWzZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c2bc7a-b6b1-491e-ac4c-bd46831081b8_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!hWzZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c2bc7a-b6b1-491e-ac4c-bd46831081b8_1344x768.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hWzZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c2bc7a-b6b1-491e-ac4c-bd46831081b8_1344x768.heic" width="1344" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05c2bc7a-b6b1-491e-ac4c-bd46831081b8_1344x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:232432,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://wangari.substack.com/i/192626514?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c2bc7a-b6b1-491e-ac4c-bd46831081b8_1344x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!hWzZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c2bc7a-b6b1-491e-ac4c-bd46831081b8_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!hWzZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c2bc7a-b6b1-491e-ac4c-bd46831081b8_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!hWzZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c2bc7a-b6b1-491e-ac4c-bd46831081b8_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!hWzZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c2bc7a-b6b1-491e-ac4c-bd46831081b8_1344x768.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">AI is mostly great and sometimes needs some serious weeding. Image generated with Leonardo AI</figcaption></figure></div><p>If you have ever copy-pasted information from an AI-generated summary into a client report, only to discover later that the model fabricated a regulatory citation or hallucinated an ESG metric, you know the pain. I have been facing this challenge daily because my company performs data analysis on public corporate data. We rely on AI to parse complex documents, extract relevant sustainability metrics, and synthesize causal relationships. But we cannot afford to be wrong. When a fabricated citation makes it into a final document, the cost is immense. We pay a hidden &#8220;hallucination tax&#8221; in the form of manual verification, eroded trust, and potential liability.</p><p>The standard advice for dealing with hallucinations is to &#8220;write better prompts&#8221; or &#8220;use retrieval-augmented generation (RAG).&#8221; But as anyone who has built these systems knows, even the best RAG pipelines hallucinate. The models are fundamentally designed to give you an answer&#8212;any answer&#8212;rather than admit they do not know.
They will confidently stitch together disparate facts to form a coherent, yet entirely fictional, narrative.</p><p>To solve this, we need to shift our architectural thinking. Instead of trying to build a single, perfect model, we need to build verification pipelines. We need to put the AI on trial. We must move away from the paradigm of &#8220;trust the AI&#8221; and embrace a new standard: post-hoc verification. This means assuming the initial output is flawed and actively working to break it before it ever reaches a human reader.</p><h2>The Architecture of Over-Compliance</h2><p>Before we build a solution, we need to understand the problem at a technical level. Why do these models hallucinate so confidently? Recent research from Tsinghua University identified specific neurons, which they call H-Neurons, that are responsible for hallucination. These neurons do not encode false information; rather, they encode the drive to comply with the user&#8217;s prompt.</p><p>This means hallucination, sycophancy (agreeing with a false premise), and jailbreak vulnerability are all driven by the exact same underlying mechanism: over-compliance. The model wants to please you. If you ask it for a citation supporting a specific claim, and that citation does not exist, the H-Neurons will push the model to invent one rather than disappoint you with a refusal.</p><p>Crucially, standard safety training&#8212;the alignment process that every major AI company performs before releasing a model&#8212;does not restructure these neurons. The researchers measured what happens to these neurons during alignment and found a parameter stability score of 0.97 out of 1.0. The models are mathematically incentivised to please you, and safety training merely adds a thin layer of behavioral guardrails over this fundamental drive.</p><p>Therefore, any architecture that relies on a single model&#8217;s output is inherently risky. We must design systems that assume the initial output is flawed and actively work to break it. We need a system of checks and balances, much like the peer-review process in academia or the adversarial system in law.</p><h3>Technique 1: Multi-Model Consensus (The Independent Council)</h3><p>The first step in a robust verification pipeline is to stop relying on a single model family. Different architectures (for example, dense Transformers versus Mixture of Experts) and different training data distributions lead to different failure patterns. If Claude, GPT, and Grok all make a mistake, they rarely make the exact same mistake.</p><p>By querying three different models in parallel and comparing their outputs, we can surface uncertainty. When they disagree, we flag the response for human review or further automated refinement. This is the &#8220;Independent Council&#8221; approach.</p><p>Instead of writing complex routing logic from scratch, you can orchestrate this using modern agentic frameworks. The goal is to send the exact same prompt to, say, Claude 3.5 Sonnet, GPT-4o, and Grok 1.5 simultaneously. You then need a synthesis step that looks at all three answers and identifies where they converge and where they diverge. If all three models confidently assert a fact, the probability of it being a hallucination drops significantly. If one model invents a citation that the other two omit, you have caught a hallucination in the act.</p>
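<p>To make the fan-out concrete, here is a sketch using the official Anthropic and OpenAI Python SDKs; xAI exposes an OpenAI-compatible endpoint, so the same client class works for Grok. The model identifiers are placeholders, and the synthesis step is deliberately left as a stub.</p><div class="highlighted_code_block"><pre class="shiki"><code class="language-python"># "Independent Council" sketch: one prompt, three models, answered in parallel.
import asyncio
import os

from anthropic import AsyncAnthropic
from openai import AsyncOpenAI

anthropic_client = AsyncAnthropic()
openai_client = AsyncOpenAI()
xai_client = AsyncOpenAI(base_url="https://api.x.ai/v1",
                         api_key=os.environ["XAI_API_KEY"])

async def ask_claude(prompt: str) -> str:
    msg = await anthropic_client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

async def ask_openai_style(client: AsyncOpenAI, model: str, prompt: str) -> str:
    resp = await client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

async def council(prompt: str) -> list[str]:
    answers = await asyncio.gather(
        ask_claude(prompt),
        ask_openai_style(openai_client, "gpt-4o", prompt),  # placeholder id
        ask_openai_style(xai_client, "grok-3", prompt),     # placeholder id
    )
    # Synthesis stub: pass convergent answers through, flag divergence for review.
    return list(answers)</code></pre></div>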
<h3>Technique 2: Adversarial Refinement (The Devil&#8217;s Advocate)</h3><p>Once you have a baseline answer, or a consensus answer from your Independent Council, the next step is to attack it. We use a separate model instance&#8212;prompted specifically to be highly critical and sceptical&#8212;to find flaws, logical leaps, or unsupported claims in the original output.</p><p>This is the &#8220;Devil&#8217;s Advocate&#8221; pass. The adversarial model is not trying to answer the original question; it is only trying to break the proposed answer. You prompt this model to act as a ruthless fact-checker. Its only job is to list the weaknesses in the text.</p><p>After the Devil&#8217;s Advocate generates its critique, you pass that critique back to a synthesizer model to refine the original answer. This loop can be repeated multiple times until the Devil&#8217;s Advocate can no longer find significant flaws. This adversarial process forces the final output to be much more defensible and strips away the over-confident fluff that models tend to generate.</p><h3>Technique 3: Live Claim Verification</h3><p>The most dangerous hallucinations are fabricated citations, statistics, or dates. To catch these, we must extract the specific factual claims from the text and verify them against live web sources or a trusted internal database.</p><p>This is not AI checking AI; this is AI extracting claims, and traditional search verifying them. You first prompt a model to extract all verifiable factual claims from the text into a structured format, like a JSON list. Then, you iterate through that list, running a web search or a database query for each claim. Finally, you evaluate whether the search results actually support the claim.</p><p>If a claim cannot be verified by an external source, it is flagged or removed from the final output. This grounds the model&#8217;s output in reality and ensures that every statistic or citation has a verifiable origin.</p>
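<p>Here is a compact sketch of how Techniques 2 and 3 can be wired together. The prompts are illustrative; ask stands in for any single-model call, and search for any web-search helper that returns supporting sources (or nothing).</p><div class="highlighted_code_block"><pre class="shiki"><code class="language-python"># Devil's Advocate loop (Technique 2) plus claim verification (Technique 3).
# `ask` maps a prompt to text; `search` returns supporting sources for a claim.
import json

def refine(draft: str, ask, max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        critique = ask(
            "You are a ruthless fact-checker. List every unsupported claim, "
            "logical leap, or over-confident assertion in this text:\n" + draft
        )
        if "no significant flaws" in critique.lower():
            break  # the critic has run out of objections
        draft = ask(
            "Revise the text so it addresses every point in the critique.\n"
            "Text:\n" + draft + "\n\nCritique:\n" + critique
        )
    return draft

def unverified_claims(text: str, ask, search) -> list[str]:
    claims = json.loads(ask(
        "Extract every verifiable factual claim from this text "
        "as a JSON array of strings:\n" + text
    ))
    return [claim for claim in claims if not search(claim)]  # flag or remove these</code></pre></div>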
<h2>Building the Pipeline with Claude Code</h2><p>Writing the boilerplate code to orchestrate these multi-model calls, adversarial loops, and search integrations can be tedious. This is where AI-assisted coding workflows become invaluable. Instead of manually writing the asynchronous API calls and JSON parsing logic, you can use tools like Claude Code to scaffold the entire architecture from your terminal.</p><p>Imagine you want to build this verification pipeline. You can open your terminal, initialize Claude Code, and describe the workflow: &#8220;Build a Python script using the Anthropic and OpenAI SDKs. It should take a user prompt, send it asynchronously to both Claude and GPT-4, wait for the responses, and then pass both responses to a third &#8216;Devil&#8217;s Advocate&#8217; Claude instance to find contradictions between them.&#8221;</p><p>Claude Code will navigate your file system, create the necessary Python files, write the asynchronous orchestration code, and even set up the environment variables for your API keys. It handles the tedious parts of pipeline engineering&#8212;like managing concurrent requests and structuring the prompts&#8212;allowing you to focus on the architectural design.</p><p>As we move toward more complex agentic systems, the value is no longer in writing the individual lines of code, but in designing the workflow. You act as the architect, defining the stages of verification, while Claude Code acts as the builder, assembling the pipeline. You can iterate rapidly, asking Claude Code to add a live search verification step or to implement a retry mechanism for failed API calls, building a robust system in a fraction of the time it would take manually.</p><h2>Case Study: The Triall AI Pipeline in Action</h2><p>You do not have to build this from scratch. Products are emerging that package these pipelines into a single service. A prime example is Triall AI, built by Maarten Rischen, a reader of this newsletter. Triall implements a comprehensive pipeline that combines all the techniques discussed above.</p><p>To see how this works in practice, I ran a highly specific, research-heavy prompt through Triall&#8217;s &#8220;Full Power&#8221; mode: <em>&#8220;What is the current scientific consensus on using ESG scores to predict long-term financial returns? Please cite specific studies.&#8221;</em> This is exactly the kind of question where models confidently fabricate academic citations.</p><p>Triall did not just give me an answer; it showed me the work. The platform uses a stage-by-stage pipeline tracker: Analyze &#8594; Query &#8594; Converge &#8594; Review &#8594; Synthesis &#8594; R1 &#8594; R2 &#8594; R3 &#8594; Verify &#8594; Synth &#8594; Valid &#8594; Polish &#8594; DA (Devil&#8217;s Advocate) &#8594; Walk.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!-Enc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b0f4c7b-e73b-4aa1-a339-0cd2885446f0_2566x1584.png" width="1456" height="899" alt=""><figcaption class="image-caption">Screenshot of Triall AI at work</figcaption></figure></div><p>In the council phase, Claude Opus 4.6, GPT 5.4, and Grok 4.20 Beta all tackled the question independently. They converged on rejecting a simple affirmative consensus, noting that ESG scores are noisy and methodologically flawed.
But the real magic happened in the peer review and adversarial stages.</p><p>The independent audit log showed that the system caught 15 weaknesses across 3 rounds of critique. It detected and corrected for bias, and filled gaps the first draft missed entirely. Most impressively, the Devil&#8217;s Advocate found 3 blind spots that were addressed in the final output, and it raised 4 counterarguments.</p><p>During the live claim verification stage, the system flagged a fabricated citation. One of the models had confidently cited a non-existent paper: &#8220;Bruno, Esposito &amp; Guillin (2022).&#8221; The peer review process caught this fabrication, noting it lacked journal details, and the live web search confirmed it was a hallucination. The final output was stripped of this fake citation and grounded entirely in verified sources, with a live status bar proudly displaying: &#8220;5/5 claims confirmed by web sources.&#8221;</p><p>This is what post-hoc verification looks like. It is rigorous, transparent, and infinitely more trustworthy than a single model&#8217;s output.</p><h2>Comparing the Approaches</h2><p>If you are designing your own verification architecture, you need to weigh the trade-offs of each technique.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;9c0830cb-af43-4a42-91da-5dd20f00735f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">| Technique               | Best For                                                                                | Limitations                                                                                                           |
| ----------------------- | --------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------- |
| Multi-Model Consensus   | Surfacing uncertainty in complex reasoning tasks and avoiding single-model bias.        | High cost, high latency, requires parsing disparate outputs from different providers.                                 |
| Adversarial Refinement  | Catching logical leaps, over-confident assertions, and structural flaws in an argument. | Can be overly aggressive, leading to "watered down" final answers if the critic is too strict.                        |
| Live Claim Verification | Catching fabricated statistics, dates, and academic citations.                          | Slow, dependent on the quality of the search index; web search can sometimes validate widely repeated misinformation. |</code></pre></div><h2>Synthesis: Combining Approaches for Maximum Reliability</h2><p>No single technique is a silver bullet. Multi-model consensus is great for catching reasoning errors, but if all models were trained on the same flawed internet data, they might all agree on a falsehood. Adversarial refinement improves logic but cannot verify external facts. Live claim verification grounds the text in reality but struggles with abstract reasoning.</p><p>The most robust architectures combine all three. You start with a multi-model council to generate a diverse set of candidate answers. You synthesize the best elements into a draft. You pass that draft through an adversarial refinement loop to tighten the logic. Finally, you extract the factual claims from the refined draft and verify them against live sources.</p><p>This is computationally expensive. It takes longer to run, and it costs more in API credits. But in financial services, the cost of a hallucination far outweighs the cost of a few extra API calls.</p><h2>The Bottom Line</h2><p>The era of the single-prompt, single-model workflow is ending, especially for technical and financial professionals. As we integrate AI deeper into our core operations, our focus must shift from prompt engineering to pipeline engineering. We can no longer accept the output of a single model at face value, knowing that its underlying architecture is optimised for compliance rather than truth. </p><p>By building multi-model consensus, adversarial refinement, and live verification into our architectures&#8212;whether by using tools like Claude Code to scaffold our own systems or by adopting platforms like Triall AI&#8212;we can finally start trusting the outputs we generate. Trust nothing, verify everything, and put the AI on trial.</p>]]></content:encoded></item><item><title><![CDATA[Why single-model workflows are a liability]]></title><description><![CDATA[And how to use AI instead, constructively, by letting it work against itself]]></description><link>https://newsletter.wangari.global/p/why-single-model-workflows-are-a</link><guid isPermaLink="false">https://newsletter.wangari.global/p/why-single-model-workflows-are-a</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Thu, 02 Apr 2026 06:01:33 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/192628702/ebeff009a9dca086dadd1c15d2056622.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>When ChatGPT or Claude confidently tells you something wrong, it hasn&#8217;t made a mistake. The model is doing exactly what its neurons are optimised to do: produce an answer you will accept.</p><p>In the financial services sector, we are deploying these models to summarise regulatory documents, extract ESG data, and draft client reports. But when being wrong has actual consequences&#8212;when a fabricated citation or a hallucinated regulatory requirement makes it into a final document&#8212;the cost is immense. We are paying a hidden &#8220;hallucination tax&#8221; in the form of manual verification, eroded trust, and potential liability.</p><h2>The Over-Compliance Problem</h2><p>Recent research from Tsinghua University found that fewer than 0.01% of neurons in a language model are responsible for hallucination. 
What these neurons encode isn&#8217;t wrong information. It&#8217;s the drive to give you an answer, any answer, rather than say &#8220;I don&#8217;t know.&#8221;</p><p>The researchers tested these neurons against four failure types: hallucination, sycophancy (agreeing with you when you&#8217;re wrong), false premise acceptance, and jailbreak vulnerability. The same neurons drove all four. Hallucination and sycophancy are the same behaviour at the neuron level. It is simply over-compliance. And safety training doesn&#8217;t fix it. The models are fundamentally built to please us, even if it means making things up.</p><h2>The Post-Hoc Verification Shift</h2><p>Instead of trying to build a single, perfect model that never hallucinates, the new paradigm is to assume the model will hallucinate, and build systems to catch it after the fact. We are moving from &#8220;trust the AI&#8221; to &#8220;put the AI on trial.&#8221;</p><p>I was recently inspired by one of our readers, Maarten Rischen, who built a product called <a href="https://triall.ai">Triall AI</a>. Maarten kept catching models fabricating sources, so he automated a cross-referencing process. Triall puts models on trial through a nine-stage process. Three different AI models (like Claude Opus, Grok, and GPT) answer your question independently. Because they have different architectures, they have different failure patterns. When they disagree, that is valuable information.</p><p>The models then blind peer-review each other. The best answer gets attacked by an adversarial critic. Finally, specific claims are checked against live web sources. Not AI checking AI, but real sources checking AI.</p><h2>The Bottom Line</h2><p>We need to stop treating AI like a junior associate whose work we must painstakingly review, and start treating it like a system of components that can check each other. If the stakes are high, one model is a liability. 
Three models, arguing it out, is a strategy.</p>]]></content:encoded></item><item><title><![CDATA[The Hallucination Tax: Why We Need to Put AI on Trial]]></title><description><![CDATA[When being wrong has actual consequences, single-model AI is no longer enough.]]></description><link>https://newsletter.wangari.global/p/the-hallucination-tax-why-we-need</link><guid isPermaLink="false">https://newsletter.wangari.global/p/the-hallucination-tax-why-we-need</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Tue, 31 Mar 2026 06:02:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wylS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa7a43d-11ee-4fae-b7be-267d18959c53_1344x768.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wylS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa7a43d-11ee-4fae-b7be-267d18959c53_1344x768.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wylS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa7a43d-11ee-4fae-b7be-267d18959c53_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!wylS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa7a43d-11ee-4fae-b7be-267d18959c53_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!wylS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa7a43d-11ee-4fae-b7be-267d18959c53_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!wylS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa7a43d-11ee-4fae-b7be-267d18959c53_1344x768.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wylS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa7a43d-11ee-4fae-b7be-267d18959c53_1344x768.heic" width="1344" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfa7a43d-11ee-4fae-b7be-267d18959c53_1344x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:211728,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://wangari.substack.com/i/192625746?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa7a43d-11ee-4fae-b7be-267d18959c53_1344x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wylS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa7a43d-11ee-4fae-b7be-267d18959c53_1344x768.heic 424w, 
https://substackcdn.com/image/fetch/$s_!wylS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa7a43d-11ee-4fae-b7be-267d18959c53_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!wylS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa7a43d-11ee-4fae-b7be-267d18959c53_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!wylS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfa7a43d-11ee-4fae-b7be-267d18959c53_1344x768.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">AI, like lawyers, can get things confidently wrong &#8212; which is why one usually needs more than one. Image generated with Leonardo AI</figcaption></figure></div><p>Imagine you&#8217;re sitting at the dinner table with your kids, and they&#8217;re telling you about the day they had. They recount a fascinating story about a history lesson, complete with dates, names, and a dramatic conclusion. It sounds perfectly plausible. But then you ask a follow-up question, and they hesitate. They add a detail that contradicts the first part. You realise they aren&#8217;t lying maliciously; they just want to give you a good story. They are optimising for your approval, not for the truth.</p><p>This is exactly what happens every time you ask a frontier AI model a complex question. When ChatGPT or Claude confidently tells you something wrong, it hasn&#8217;t made a mistake. The model is doing exactly what its neurons are optimised to do: produce an answer you will accept.</p><p>In the financial services sector, we have a massive problem with this. We are deploying these models to summarise regulatory documents, extract ESG data, and draft client reports. But when being wrong has actual consequences&#8212;when a fabricated citation or a hallucinated regulatory requirement makes it into a final document&#8212;the cost is immense. 
We are paying a hidden &#8220;hallucination tax&#8221; in the form of manual verification, eroded trust, and potential liability.</p><h2>The Over-Compliance Problem</h2><p>Recent research from Tsinghua University found that fewer than 0.01% of neurons in a language model are responsible for hallucination. They call them H-Neurons. What these neurons encode isn&#8217;t wrong information. It&#8217;s the drive to give you an answer, any answer, rather than say &#8220;I don&#8217;t know.&#8221;</p><p>The researchers tested these neurons against four failure types: hallucination, sycophancy (agreeing with you when you&#8217;re wrong), false premise acceptance, and jailbreak vulnerability. The same neurons drove all four. Amplify them, and all four get worse. Suppress them, and all four improve. Hallucination and sycophancy are the same behaviour at the neuron level. It is simply over-compliance.</p><p>And here is the kicker: safety training doesn&#8217;t fix it. The researchers measured what happens to these neurons during alignment&#8212;the safety training every AI company performs before release. The H-Neurons showed a parameter stability score of 0.97 out of 1.0 through the entire process. Safety training doesn&#8217;t restructure them. The models are fundamentally built to please us, even if it means making things up.</p><h2>The Post-Hoc Verification Shift</h2><p>At Wangari, we spend a lot of time building error-checking into our systems. We design causal inference models that force the data to explain the &#8220;why&#8221; before we trust the &#8220;what.&#8221; But the broader market is starting to realise that for general-purpose AI, we need a different approach: post-hoc verification.</p><p>Instead of trying to build a single, perfect model that never hallucinates, the new paradigm is to assume the model will hallucinate, and build systems to catch it after the fact. We are moving from &#8220;trust the AI&#8221; to &#8220;put the AI on trial.&#8221;</p><p>I was recently inspired by one of our fantastically creative readers of this very blog, Maarten Rischen, who built a product called <a href="https://triall.ai">Triall AI</a>. (Full disclosure: this isn&#8217;t sponsored, I&#8217;m just trying the next hot thing in AI built by someone in our community). Maarten kept catching models fabricating sources in his own work, so he started manually cross-referencing outputs between Claude, Grok, and ChatGPT. When they disagreed, he knew he had a problem. So, he automated the process.</p><h2>Three Models, One Verdict</h2><p>Triall doesn&#8217;t try to fix individual models. It puts them on trial through a nine-stage process. Three different AI models (like Claude Opus, Grok, and GPT) answer your question independently. Because they have different architectures, they have different failure patterns. When they disagree, that is valuable information.</p><p>But it goes further. The models then blind peer-review each other. They check for false confidence and unchallenged assumptions. The best answer gets attacked by an adversarial critic. A different model refines what survives. Finally, specific claims are checked against live web sources. Not AI checking AI, but real sources checking AI.</p><p>This is the exact workflow lawyers and researchers are starting to adopt manually. As Adam David Long recently argued, AI isn&#8217;t &#8220;a thing&#8221; that gets smarter; it&#8217;s a pool of capabilities. You don&#8217;t hire &#8220;an AI&#8221; to write a legal brief. 
You use one model to draft, a second to punch holes in it, a third to check the citations, and a fourth to flag risky claims. You review the final set of options, not a single raw output.</p><div><hr></div><p><em><strong>Research Shoutout &#8212; Scaling Sustainable Digital Platforms</strong></em></p><p><em>We are conducting academic research on how sustainable digital platforms grow and scale responsibly. If your company embeds environmental or social goals into its core business model, we&#8217;d love to speak with you.</em></p><p><em>The study involves 2&#8211;3 short interviews with key employees. Participation is anonymous, confidential, and low time commitment &#8212; and you&#8217;ll receive early access to our findings.</em></p><p><em>Interested? Reach out to us directly:</em></p><p><em>&#8226;Ari Joury, Cofounder &amp; CEO, Wangari Global &#8212; <a href="mailto:ari.joury@wangari.global">ari.joury@wangari.global</a></em></p><p><em>&#8226;Melanie Gertschen, PhD Candidate, University of Bern &#8212; <a href="mailto:melanie.gertschen@unibe.ch">melanie.gertschen@unibe.ch</a></em></p><div><hr></div><h2>The Bottom Line</h2><p>We need to stop treating AI like a junior associate whose work we must painstakingly review, and start treating it like a system of components that can check each other. The lawyers, academics, and financial professionals who thrive in the next five years won&#8217;t be the ones who find &#8220;trustworthy AI.&#8221; They will be the ones who build verification structures that work even when any single component is wrong.</p><p>Whether you use a dedicated tool like Triall, enterprise platforms like Maxim AI, or simply build your own multi-model workflows, the era of accepting a single AI&#8217;s output at face value is over. If the stakes are high, one model is a liability. Three models, arguing it out, is a strategy.</p><div><hr></div><h2>Reads of the Week</h2><ul><li><p><a href="https://substack.com/home/post/p-191386796">LLM Hallucinations Still Exist, Just on a Higher Level</a>: In this piece for his newsletter, <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Florin Andrei&quot;,&quot;id&quot;:126067811,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a787e12-0f8d-48ed-ae22-24d971c88e71_316x316.jpeg&quot;,&quot;uuid&quot;:&quot;9846d711-8626-4660-bd6f-cf318ccad496&quot;}" data-component-name="MentionToDOM"></span> argues that while current models are mostly hallucination-free at the syntax level, they still fail spectacularly when coordinating complex, multi-component systems. This is highly relevant for financial data pipelines where a single logic error cascades through the entire workflow. 
It is a sobering reminder that as our systems get more complex, our testing must evolve.</p></li><li><p><a href="https://lawsnap.substack.com/p/trust-but-verify-how-to-get-reliable">Trust but Verify: How to Get Reliable Work From AI</a>: In this essay for <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;LawSnap&quot;,&quot;id&quot;:6831590,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/71e2930d-97c6-49af-9534-203f1626fb61_312x276.jpeg&quot;,&quot;uuid&quot;:&quot;0fe0f9ec-a106-4026-8bbb-6ec9126c7d9c&quot;}" data-component-name="MentionToDOM"></span>, Adam David Long breaks down why professionals need to stop looking for &#8220;trustworthy AI&#8221; and start designing verification workflows. He argues that AI is a pool of capabilities, not a single entity, and we should use multiple models to draft, attack, and verify work. This is the exact mindset shift required for anyone using AI for high-stakes financial or regulatory analysis.</p></li><li><p><a href="https://avisainewsletter.substack.com/p/reduce-ai-hallucinations">How to Stop AI From Making Things Up</a>: In this practical guide, <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Avi Hakhamanesh&quot;,&quot;id&quot;:52368873,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/a38e7719-18c5-49d6-b6e2-59ad0948acd4_1080x1080.jpeg&quot;,&quot;uuid&quot;:&quot;b49be8cc-e86b-4ad3-b3eb-486824e57a7e&quot;}" data-component-name="MentionToDOM"></span> explores how models are optimised to be &#8220;good test-takers&#8221; who guess rather than admit uncertainty. She provides concrete prompting strategies to force models to cite sources, flag uncertainty, and welcome missing data. 
If you are building internal AI tools for your team, these prompt additions are mandatory reading.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Measuring Platform Sustainability – Quantifiably]]></title><description><![CDATA[How to build data pipelines that capture the true impact of digital ecosystems]]></description><link>https://newsletter.wangari.global/p/measuring-platform-sustainability</link><guid isPermaLink="false">https://newsletter.wangari.global/p/measuring-platform-sustainability</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Fri, 27 Mar 2026 07:02:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!XYSx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25845d85-167e-4636-b516-b52dba62b816_1344x768.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XYSx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25845d85-167e-4636-b516-b52dba62b816_1344x768.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XYSx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25845d85-167e-4636-b516-b52dba62b816_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!XYSx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25845d85-167e-4636-b516-b52dba62b816_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!XYSx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25845d85-167e-4636-b516-b52dba62b816_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!XYSx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25845d85-167e-4636-b516-b52dba62b816_1344x768.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XYSx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25845d85-167e-4636-b516-b52dba62b816_1344x768.heic" width="1344" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25845d85-167e-4636-b516-b52dba62b816_1344x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:221770,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://wangari.substack.com/i/191854738?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25845d85-167e-4636-b516-b52dba62b816_1344x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!XYSx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25845d85-167e-4636-b516-b52dba62b816_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!XYSx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25845d85-167e-4636-b516-b52dba62b816_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!XYSx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25845d85-167e-4636-b516-b52dba62b816_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!XYSx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25845d85-167e-4636-b516-b52dba62b816_1344x768.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">How do you evaluate digital architecture quantitatively? Here&#8217;s how. Image generated with Leonardo AI</figcaption></figure></div><p>If you have ever tried to measure the environmental or social impact of a digital platform, you know the headache. It is one thing to calculate the direct emissions of your own servers, your office buildings, or your employee travel. Those are bounded problems. You gather the utility bills, you apply standard emission factors, and you get a number. It is an entirely different beast to quantify the impact of the thousands, or millions, of users interacting within your ecosystem.</p><p>At Wangari, we frequently encounter this challenge when modeling ESG data for financial institutions. A platform might look incredibly &#8220;green&#8221; on paper because its direct footprint&#8212;its Scope 1 and 2 emissions&#8212;is vanishingly small. But if its core business logic incentivizes unsustainable behavior among its users, that platform is carrying hidden systemic risks. 
Think of an e-commerce marketplace that optimizes its algorithms purely for rapid consumption and next-day delivery, regardless of the carbon cost, or a social network whose engagement model inadvertently rewards polarization.</p><p>The problem is that traditional sustainability metrics were designed for linear supply chains, not multi-sided digital ecosystems. In a linear model, a widget moves from factory to warehouse to consumer, and you can track the carbon at each step. In a platform model, value is created through interactions between users, and the platform&#8217;s primary role is orchestration. To truly understand a platform&#8217;s impact, we need data pipelines that can capture indirect effects, network behaviors, and complex causal relationships.</p><p>In this post, I will walk through how we approach this problem technically. We will move from the foundational challenge of data extraction to more sophisticated ecosystem modeling, looking at three specific techniques: network-based attribution, causal inference for platform interventions, and natural language processing for qualitative assessment.</p><h2>The Challenge of Ecosystem Data</h2><p>The first hurdle is simply getting the data. Platform ecosystems are notoriously messy. You are dealing with unstructured data from user reviews, inconsistent reporting from third-party vendors, fragmented API endpoints, and data silos that refuse to talk to one another.</p><p>Before we can even begin to model impact, we need to wrangle this data into a usable format. This often involves extracting data from PDFs (like vendor sustainability reports) or scraping web data. As I have written about before, automating this extraction is crucial for building scalable pipelines. You cannot rely on manual data entry when you are dealing with thousands of ecosystem participants.</p><p>Once we have the raw data, the real work begins: attributing impact. If a user buys a product on an e-commerce platform, how much of the carbon footprint of that transaction belongs to the platform, and how much belongs to the seller or the buyer? If a platform&#8217;s algorithm recommends a high-carbon product over a low-carbon alternative, how do we quantify that algorithmic influence? This is where we need to move beyond simple accounting and start thinking about causal inference and network dynamics.</p><h2>Technique 1: Network-Based Attribution</h2><p>One approach to the attribution problem is to use network analysis to model the flow of impact through the platform. By representing the platform as a graph, where nodes are users or vendors and edges are transactions, we can start to quantify how the platform&#8217;s design influences overall ecosystem behavior.</p><p>This is particularly useful for identifying &#8220;super-spreaders&#8221; of impact&#8212;nodes in the network that have a disproportionate influence on the ecosystem&#8217;s overall footprint. In a financial context, this might be a specific asset manager whose portfolio choices ripple through the market. In an e-commerce context, it might be a high-volume vendor with inefficient logistics.</p><p>Here is a simplified example of how you might structure this using Python and the NetworkX library. 
We will build a directed graph from transaction data and calculate node centrality to find our high-impact participants.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;a78de164-9a04-45c2-a1d8-f0bdfbb9112d&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import networkx as nx
import pandas as pd
import numpy as np

# 1. Load transaction data (simplified for demonstration)
# In a real scenario, this would be pulled from a data warehouse
# Columns: seller_id, buyer_id, transaction_value, estimated_carbon
data = {
    'seller_id': ['S1', 'S1', 'S2', 'S3', 'S1', 'S2'],
    'buyer_id': ['B1', 'B2', 'B1', 'B3', 'B4', 'B4'],
    'transaction_value': [100, 150, 200, 50, 300, 120],
    'estimated_carbon': [10, 15, 25, 5, 35, 12] # kg CO2e
}
transactions = pd.DataFrame(data)

# 2. Create a directed graph
# We use a DiGraph because transactions have a clear direction (seller to buyer)
G = nx.from_pandas_edgelist(
    transactions, 
    source='seller_id', 
    target='buyer_id', 
    edge_attr=['transaction_value', 'estimated_carbon'],
    create_using=nx.DiGraph()
)

# 3. Calculate node centrality
# Degree centrality measures how connected a node is, counting incoming
# and outgoing links alike; nx.out_degree_centrality would isolate a
# seller's supply-side connections specifically.
centrality = nx.degree_centrality(G)
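
# Optional alternative: a carbon-weighted PageRank lets influence propagate
# through the network rather than counting every edge equally, one way to
# soften the equal-weight caveat noted after this block. Treating kg CO2e
# as the edge weight is an assumption about what "influence" means here.
carbon_pagerank = nx.pagerank(G, weight='estimated_carbon')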

# 4. Calculate carbon-weighted influence
# Simple centrality isn't enough; we need to weight it by the actual impact.
carbon_influence = {}
for node in G.nodes():
    if G.out_degree(node) &gt; 0: # Focus on sellers
        # Sum the carbon of all outgoing edges
        total_carbon = sum(G[u][v]['estimated_carbon'] for u, v in G.out_edges(node))
        carbon_influence[node] = total_carbon * centrality[node]

# 5. Identify high-impact nodes
threshold = np.percentile(list(carbon_influence.values()), 75) # Top 25%
high_impact_nodes = {k: v for k, v in carbon_influence.items() if v &gt;= threshold}

print(f"Identified {len(high_impact_nodes)} high-impact nodes in the ecosystem.")
for node, score in high_impact_nodes.items():
    print(f"Node {node}: Influence Score {score:.2f}")</code></pre></div><p>Caveats: This approach assumes you have reliable transaction-level data, which is often not the case. It also simplifies the attribution problem by treating all edges equally, whereas in reality, the platform&#8217;s influence might vary significantly depending on the type of transaction. Furthermore, network analysis can become computationally expensive as the graph scales to millions of nodes, requiring distributed computing frameworks like Apache Spark or specialized graph databases.</p><div><hr></div><h3><em>Research Shoutout &#8212; Scaling Sustainable Digital Platforms</em></h3><p><em>We are conducting academic research on how sustainable digital platforms grow and scale responsibly. If your company embeds environmental or social goals into its core business model, we&#8217;d love to speak with you.</em></p><p><em>The study involves 2&#8211;3 short interviews with key employees. Participation is anonymous, confidential, and low time commitment &#8212; and you&#8217;ll receive early access to our findings.</em></p><p><em>Interested? Reach out to us directly:</em></p><ul><li><p><em>Ari Joury, Cofounder &amp; CEO, Wangari Global &#8212; <a href="mailto:ari.joury@wangari.global">ari.joury@wangari.global</a></em></p></li><li><p><em>Melanie Gertschen, PhD Candidate, University of Bern &#8212; <a href="mailto:melanie.gertschen@unibe.ch">melanie.gertschen@unibe.ch</a></em></p></li></ul><div><hr></div><h2>Technique 2: Difference-in-Differences for Platform Interventions</h2><p>If a platform introduces a new feature designed to promote sustainability&#8212;say, a &#8220;green shipping&#8221; option at checkout, or a dashboard that shows users their carbon footprint&#8212;how do we know if it actually worked? Did it change behavior, or did users just ignore it? This is a classic causal inference problem.</p><p>We cannot simply look at the total carbon footprint before and after the feature launch. Other factors might have changed simultaneously&#8212;a seasonal dip in sales, a broader economic downturn, or a change in the underlying energy grid. To isolate the causal effect of the platform&#8217;s intervention, we need a more rigorous statistical approach.</p><p>We can use a Difference-in-Differences (DiD) model. This technique compares the behavior of users who were exposed to the new feature (the treatment group) with those who were not (the control group), both before and after the intervention. By comparing the difference in their trajectories, we can strip away external confounding factors.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;1cee7e11-5376-4de6-b0ca-5221b711c384&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

# 1. Simulate user behavior data
# In reality, this requires careful experimental design (A/B testing) or a quasi-experimental setup
np.random.seed(42)
n_users = 1000
data = pd.DataFrame({
    'user_id': np.repeat(np.arange(n_users), 2),
    'time_period': np.tile([0, 1], n_users), # 0=pre-intervention, 1=post-intervention
    'treated': np.repeat(np.random.binomial(1, 0.5, n_users), 2) # 50% in treatment group
})
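# Each user contributes one pre- and one post-intervention row, i.e. a balanced panel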

# Simulate the outcome variable (e.g., carbon footprint per user)
# Base footprint + time trend + treatment effect + noise
base_footprint = 50
time_trend = -5 * data['time_period'] # General downward trend for everyone
treatment_effect = -10 * (data['time_period'] * data['treated']) # The actual impact of our feature
noise = np.random.normal(0, 5, len(data))

data['carbon_footprint'] = base_footprint + time_trend + treatment_effect + noise

# 2. Create the interaction term
# This term isolates the effect of being in the treatment group AFTER the intervention
data['did'] = data['time_period'] * data['treated']
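# (Equivalently, smf.ols accepts 'carbon_footprint ~ time_period * treated',
# which expands to both main effects plus their interaction; the explicit
# column above just keeps the DiD term easy to inspect in the output.)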

# 3. Run the DiD regression
# We control for time period and treatment group assignment
model = smf.ols('carbon_footprint ~ time_period + treated + did', data=data).fit()

print(model.summary())
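
# Pull the DiD estimate and its confidence interval out directly;
# .params and .conf_int() are standard statsmodels results accessors.
did_estimate = model.params['did']
did_ci = model.conf_int().loc['did']
print(f"Estimated treatment effect: {did_estimate:.2f} "
      f"(95% CI: [{did_ci[0]:.2f}, {did_ci[1]:.2f}])")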

# The coefficient for 'did' represents the causal impact of the intervention.
# In our simulation, we expect it to be close to -10.</code></pre></div><p>Caveats: DiD relies heavily on the &#8220;parallel trends&#8221; assumption&#8212;that the treatment and control groups would have followed the exact same trajectory if the intervention had not happened. In dynamic platform environments, this assumption is often violated by network effects. If the treatment group changes their behavior, they might influence the control group (spillover effects), muddying the results. Validating the parallel trends assumption using historical data is a critical prerequisite before trusting a DiD model.</p><h2>Technique 3: NLP for Qualitative Impact Assessment</h2><p>Not all impact can be quantified in tons of carbon or dollars. Much of a platform&#8217;s social impact is qualitative, found in the messy, unstructured text of user reviews, community forums, support tickets, or social media mentions. Does the platform foster a sense of community, or does it drive isolation? Are vendors feeling squeezed by algorithmic changes?</p><p>We can use Natural Language Processing (NLP) to extract sentiment and thematic trends from this unstructured text, providing a proxy for social impact that quantitative metrics often miss.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;ef9fe13f-02a2-4c50-a1b0-e3c8cb06eb47&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from transformers import pipeline
import pandas as pd

# 1. Load user reviews (simulated)
reviews_data = {
    'review_id': [1, 2, 3, 4],
    'text': [
        "The new sustainable packaging is great, but the shipping took forever.",
        "I love the community features, I've met so many great local sellers.",
        "The algorithm keeps pushing cheap, disposable items. It's frustrating.",
        "Customer support was incredibly helpful when I had an issue with my return."
    ]
}
reviews = pd.DataFrame(reviews_data)

# 2. Initialize sentiment analysis pipeline
# We use a pre-trained model from Hugging Face for demonstration
# In production, you would likely fine-tune a model on your specific domain
sentiment_analyzer = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
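# Note: this checkpoint handles at most 512 tokens per input; for longer
# reviews, passing truncation=True when calling the pipeline is a
# reasonable safeguard.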

# 3. Apply to the reviews
sample_reviews = reviews['text'].tolist()
results = sentiment_analyzer(sample_reviews)

# 4. Aggregate and analyze results
reviews['sentiment_label'] = [r['label'] for r in results]
reviews['sentiment_score'] = [r['score'] for r in results]
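
# Optional: flag low-confidence classifications for human review, a rough
# proxy for the mixed-sentiment problem discussed after this block
# (the 0.9 threshold is arbitrary)
reviews['needs_review'] = reviews['sentiment_score'] &lt; 0.9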

print(reviews[['text', 'sentiment_label']])

positive_count = sum(1 for r in results if r['label'] == 'POSITIVE')
print(f"\nOverall Positive sentiment ratio: {positive_count / len(results):.2f}")</code></pre></div><p>Caveats: Out-of-the-box sentiment models often struggle with the nuanced language of specific domains. As seen in the first simulated review (&#8221;The new sustainable packaging is great, but the shipping took forever&#8221;), a single piece of text can contain mixed sentiments about different aspects of the platform. A simple positive/negative classification is insufficient here. To truly capture qualitative impact, you need aspect-based sentiment analysis, which identifies what the user is talking about (packaging vs. shipping) before assigning a sentiment score.</p><h2>Comparing the Approaches</h2><p>To build a comprehensive measurement strategy, it is essential to understand where each technique excels and where it falls short.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;55baadcb-5011-42be-a4be-ae8737669ab7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">| Technique                 | Best For                                                                                      | Key Limitation                                                                                                | Data Requirements                                                                                 |
| ------------------------- | --------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| Network Analysis          | Understanding ecosystem structure, identifying key influencers, and mapping systemic risk.    | Computationally expensive at scale; assumes equal weight of connections unless carefully tuned.               | Granular, transaction-level data mapping relationships between entities.                          |
| Difference-in-Differences | Evaluating the causal impact of specific platform features, policy changes, or interventions. | Relies on strict parallel trends assumptions; vulnerable to spillover effects in highly connected networks.   | Longitudinal data with clear pre/post intervention periods and distinct treatment/control groups. |
| NLP / Sentiment Analysis  | Capturing qualitative social impact, user perception, and emerging thematic issues.           | Struggles with domain-specific nuance, sarcasm, and aspect-level attribution without significant fine-tuning. | Large volumes of unstructured text data (reviews, forums, support tickets).                       |</code></pre></div><h2>Combining Approaches for Robust Measurement</h2><p>The reality is that no single technique is sufficient for measuring platform sustainability. The most robust data pipelines combine these approaches into a cohesive architecture.</p><p>For example, you might start by using NLP to identify a recurring issue in user reviews&#8212;perhaps a sudden spike in complaints about excessive packaging waste from third-party vendors. You could then use network analysis to trace which specific clusters of vendors are driving this issue, identifying the structural bottlenecks in the ecosystem. Finally, if the platform implements a new packaging policy targeted at those specific vendors, you would use a Difference-in-Differences model to measure the causal effectiveness of that policy over time.</p><p>This multi-layered approach is what we strive for at Wangari. It is computationally intensive, requires careful data engineering, and demands a deep understanding of both statistical assumptions and platform dynamics. But it is the only way to move beyond superficial ESG metrics.</p><h2>The Bottom Line</h2><p>Measuring platform sustainability is fundamentally a data engineering and causal inference challenge. It requires moving beyond the boundaries of the firm to analyze the entire ecosystem. </p><p>By leveraging tools like network analysis to map structure, causal models to isolate impact, and NLP to capture qualitative nuance, we can start to quantify the unquantifiable. This provides the rigorous, defensible evidence needed to build truly resilient digital platforms&#8212;platforms that don&#8217;t just optimize for engagement, but optimize for enduring value.</p>]]></content:encoded></item><item><title><![CDATA[Stop Bolting On. Start Building In.]]></title><description><![CDATA[The companies that will outlast this decade designed sustainability into their foundations &#8212; not their footnotes]]></description><link>https://newsletter.wangari.global/p/stop-bolting-on-start-building-in</link><guid isPermaLink="false">https://newsletter.wangari.global/p/stop-bolting-on-start-building-in</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Thu, 26 Mar 2026 07:02:04 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/191853483/c19e73bdf9f18408c8633032b66b6f08.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Imagine building a house in an earthquake zone. You wouldn&#8217;t construct the entire building first and then try to bolt on seismic reinforcements at the end. You would design the foundation and framing from day one to withstand the shocks.</p><p>Yet, when it comes to building digital platforms, we often do exactly the opposite. We optimize for rapid scale and user acquisition, and only later&#8212;usually when regulators or investors start asking uncomfortable questions&#8212;do we try to retrofit sustainability into the business model. This is the equivalent of bolting on seismic reinforcements after the house is built. It is expensive, inefficient, and ultimately fragile.</p><h2>The Retrofit Trap</h2><p>In my work at Wangari, I have seen this pattern repeatedly. 
Companies launch with a brilliant core logic, scale rapidly, and then the externalities become obvious. The carbon footprint of their server usage balloons, or the social impact of their gig-worker model draws scrutiny.</p><p>The typical response is the retrofit: hire a sustainability team, buy carbon offsets, and publish a glossy ESG report. But the core business logic remains unchanged. The platform is still fundamentally designed to maximize a single metric regardless of the broader impact.</p><p>This is not just a moral failing; it is a strategic vulnerability. In a world of tightening capital and shifting regulatory landscapes, platforms that rely on retrofitted sustainability are exposed to transition risks and reputational damage. They are building on a fault line.</p><h2>Designing for Resilience</h2><p>The alternative is what we might call &#8220;sustainable by design.&#8221; This means embedding environmental and social goals directly into the core business logic from the outset. It means recognizing that ecological responsibility and economic success can actually reinforce each other.</p><p>Consider a digital platform that optimizes logistics. A retrofitted approach might involve buying carbon offsets for the delivery fleet. A sustainable-by-design approach would involve building the algorithm to prioritize the most carbon-efficient routes. The sustainability is not an add-on; it is the product.</p><div><hr></div><h3><em>Research Shoutout &#8212; Scaling Sustainable Digital Platforms</em></h3><p><em>We are conducting academic research on how sustainable digital platforms grow and scale responsibly. If your company embeds environmental or social goals into its core business model, we&#8217;d love to speak with you.</em></p><p><em>The study involves 2&#8211;3 short interviews with key employees. Participation is anonymous, confidential, and low time commitment &#8212; and you&#8217;ll receive early access to our findings.</em></p><p><em>Interested? Reach out to us directly:</em></p><ul><li><p><em>Ari Joury, Cofounder &amp; CEO, Wangari Global &#8212; <a href="mailto:ari.joury@wangari.global">ari.joury@wangari.global</a></em></p></li><li><p><em>Melanie Gertschen, PhD Candidate, University of Bern &#8212; <a href="mailto:melanie.gertschen@unibe.ch">melanie.gertschen@unibe.ch</a></em></p></li></ul><div><hr></div><h2>The Ecosystem Advantage</h2><p>Digital platforms don&#8217;t just connect two parties; they create entire economies. When a platform embeds sustainability into its core, it influences the behavior of every participant in its ecosystem. It can incentivize suppliers to adopt greener practices and nudge consumers toward more sustainable choices. It creates a race to the top.</p><p>We are seeing an increasing number of platform start-ups doing exactly this. They are proving that you can build a highly profitable, rapidly scaling business while simultaneously addressing some of the defining challenges of our time.</p><h2>The Bottom Line</h2><p>The era of &#8220;move fast and break things&#8221; is over. The future belongs to those who can move fast and build resilient things. We need to stop looking at sustainability as a compliance exercise and start looking at it as a fundamental architectural principle. 
Because in the long run, the only growth that matters is the growth that can be sustained.</p>]]></content:encoded></item><item><title><![CDATA[The Invisible Architecture of Sustainable Growth]]></title><description><![CDATA[Why the most resilient digital platforms don't treat sustainability as an afterthought]]></description><link>https://newsletter.wangari.global/p/the-invisible-architecture-of-sustainable</link><guid isPermaLink="false">https://newsletter.wangari.global/p/the-invisible-architecture-of-sustainable</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Tue, 24 Mar 2026 07:02:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GPWT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd96d40-cc3f-4b37-9346-3489c3b8e2a3_1344x768.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GPWT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd96d40-cc3f-4b37-9346-3489c3b8e2a3_1344x768.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GPWT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd96d40-cc3f-4b37-9346-3489c3b8e2a3_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!GPWT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd96d40-cc3f-4b37-9346-3489c3b8e2a3_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!GPWT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd96d40-cc3f-4b37-9346-3489c3b8e2a3_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!GPWT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd96d40-cc3f-4b37-9346-3489c3b8e2a3_1344x768.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GPWT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd96d40-cc3f-4b37-9346-3489c3b8e2a3_1344x768.heic" width="1344" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3dd96d40-cc3f-4b37-9346-3489c3b8e2a3_1344x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:204438,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://wangari.substack.com/i/191852636?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd96d40-cc3f-4b37-9346-3489c3b8e2a3_1344x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!GPWT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd96d40-cc3f-4b37-9346-3489c3b8e2a3_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!GPWT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd96d40-cc3f-4b37-9346-3489c3b8e2a3_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!GPWT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd96d40-cc3f-4b37-9346-3489c3b8e2a3_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!GPWT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd96d40-cc3f-4b37-9346-3489c3b8e2a3_1344x768.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Sustainable digital architecture might not look glamorous, but it truly is. Image generated with Leonardo AI</figcaption></figure></div><p>Imagine you are building a house in an earthquake zone. You wouldn&#8217;t build the entire structure first and then try to bolt on some seismic reinforcements at the end. You would design the foundation, the framing, and the materials from day one to withstand the shocks.</p><p>Yet, when it comes to building digital platforms, we often do exactly the opposite. We optimize for rapid scale, user acquisition, and network effects, and only later&#8212;usually when regulators or investors start asking uncomfortable questions&#8212;do we try to retrofit sustainability into the business model. This is the equivalent of bolting on seismic reinforcements after the house is built. It is expensive, inefficient, and ultimately fragile.</p><h3>The Retrofit Trap</h3><p>In my work at Wangari, and previously dealing with systemic risks at a large insurer, I have seen this pattern repeatedly. Companies launch with a brilliant core logic: connect buyers and sellers, optimize a supply chain, or democratize access to a service. They scale rapidly. 
Then, the externalities become obvious. The carbon footprint of their server usage balloons. The social impact of their gig-worker model draws scrutiny. The supply chain they optimized turns out to be environmentally destructive.</p><p>The typical response is the retrofit. A sustainability team is hired. Carbon offsets are purchased. A glossy ESG report is published. But the core business logic remains unchanged. The platform is still fundamentally designed to maximize a single metric&#8212;usually transaction volume or user engagement&#8212;regardless of the broader impact.</p><p>This is not just a moral failing; it is a strategic vulnerability. In a world of tightening capital, increasing climate shocks, and shifting regulatory landscapes, platforms that rely on retrofitted sustainability are fragile. They are exposed to transition risks, reputational damage, and sudden regulatory shifts. They are building on a fault line.</p><h3>Designing for Resilience</h3><p>The alternative is what we might call &#8220;sustainable by design.&#8221; This means embedding environmental and social goals directly into the core business logic from the outset. It means recognizing that ecological responsibility and economic success are not mutually exclusive, but can actually reinforce each other.</p><p>Consider a digital platform that optimizes logistics. A retrofitted approach might involve buying carbon offsets for the delivery fleet. A sustainable-by-design approach would involve building the algorithm to prioritize the most carbon-efficient routes, or integrating circular economy principles to minimize packaging waste. The sustainability is not an add-on; it is the product.</p><p>This requires a fundamental shift in how we think about scale. We are so conditioned to worship at the altar of exponential growth that we often ignore the quality of that growth. But scale without resilience is just a larger target for systemic shocks.</p><div><hr></div><h3><em>Research Shoutout &#8212; Scaling Sustainable Digital Platforms</em></h3><p><em>We are conducting academic research on how sustainable digital platforms grow and scale responsibly. If your company embeds environmental or social goals into its core business model, we&#8217;d love to speak with you.</em></p><p><em>The study involves 2&#8211;3 short interviews with key employees. Participation is anonymous, confidential, and low time commitment &#8212; and you&#8217;ll receive early access to our findings.</em></p><p><em>Interested? Reach out to us directly:</em></p><ul><li><p><em>Ari Joury, Cofounder &amp; CEO, Wangari Global &#8212; <a href="mailto:ari.joury@wangari.global">ari.joury@wangari.global</a></em></p></li><li><p><em>Melanie Gertschen, PhD Candidate, University of Bern &#8212; <a href="mailto:melanie.gertschen@unibe.ch">melanie.gertschen@unibe.ch</a></em></p></li></ul><div><hr></div><h3>The Ecosystem Advantage</h3><p>One of the most powerful aspects of digital platforms is their ability to orchestrate ecosystems. They don&#8217;t just connect two parties; they create entire economies. This is where the sustainable-by-design approach truly shines.</p><p>When a platform embeds sustainability into its core, it doesn&#8217;t just improve its own footprint; it influences the behavior of every participant in its ecosystem. It can incentivize suppliers to adopt greener practices. It can nudge consumers toward more sustainable choices. It can create a race to the top, rather than a race to the bottom.</p><p>This is not theoretical. 
We are seeing an increasing number of platform start-ups that are doing exactly this. They are proving that you can build a highly profitable, rapidly scaling business while simultaneously addressing some of the defining challenges of our time. They are the ones building the earthquake-proof houses.</p><h3>The Bottom Line</h3><p>The era of &#8220;move fast and break things&#8221; is over. We have broken enough things. The future belongs to those who can move fast and build resilient things.</p><p>For financial services professionals, investors, and platform founders, the imperative is clear. We need to stop looking at sustainability as a compliance exercise or a marketing tool. We need to start looking at it as a fundamental architectural principle. We need to ask not just how fast a platform is growing, but how it is growing. Because in the long run, the only growth that matters is the growth that can be sustained.</p><div><hr></div><h1>Reads of the Week</h1><ul><li><p><a href="https://platformprofessional.substack.com/p/inside-ebays-circular-economy-strategy">Inside eBay&#8217;s Circular Economy Strategy</a>: In this deep dive for Platform Professional, <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Peter C. Evans&quot;,&quot;id&quot;:4071878,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!wkg-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e9e6468-098e-4fbe-beca-1dd8cd8dadf1_754x754.jpeg&quot;,&quot;uuid&quot;:&quot;462bbd40-355f-4ffa-b853-0f8b2e667ce1&quot;}" data-component-name="MentionToDOM"></span> breaks down how eBay has pivoted back to its roots to become one of the world&#8217;s largest circular platforms, driving $30 billion annually in resale and refurbished goods. He details how the company is using AI and strategic acquisitions to embed sustainability directly into its core marketplace. For Wangari readers, it is a masterclass in how a platform can align its fundamental business model with environmental impact at scale.</p></li><li><p><a href="https://johnelkington.substack.com/p/hello-to-2026and-to-trumps-entirely">Hello To 2026&#8212;And To Trump&#8217;s Entirely Accidental Gift</a>: Sustainability pioneer <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;John Elkington&quot;,&quot;id&quot;:739960,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/10043f1e-71bb-4192-af53-b8d6547b623d_160x190.png&quot;,&quot;uuid&quot;:&quot;36a6c309-cfc7-4209-9548-aaff7920f476&quot;}" data-component-name="MentionToDOM"></span> argues that the current political and regulatory pushback against ESG is actually a necessary stress test for the movement. He suggests that the disruption will force successful companies to mature, moving away from superficial commitments toward genuine systemic resilience. 
It is a provocative, optimistic read that perfectly captures why we need to design for shocks rather than just optimizing for the status quo.</p></li><li><p><a href="https://newpublic.substack.com/p/we-need-a-revolution-in-social-media">We need a revolution in social media business models</a>: Former Meta director Deepti Doshi offers a candid look at how surveillance advertising and engagement-driven business models inevitably degrade digital communities. She argues that no amount of good intentions or community management can overcome a core architecture designed to maximize attention at all costs. This piece is a stark reminder that if we want platforms to serve society, we have to change the underlying incentives that fund them.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[How to Build Zero-Hallucination AI]]></title><description><![CDATA[Or why deterministic pipelines, structured prompts, and validation guardrails are the only responsible path forward]]></description><link>https://newsletter.wangari.global/p/how-to-build-zero-hallucination-ai</link><guid isPermaLink="false">https://newsletter.wangari.global/p/how-to-build-zero-hallucination-ai</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Fri, 20 Mar 2026 07:02:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VrsY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd25fd1f-3a47-46bd-aaa2-1a3a527b5944_1344x768.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!VrsY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd25fd1f-3a47-46bd-aaa2-1a3a527b5944_1344x768.heic" width="1344" height="768" alt="">
src="https://substackcdn.com/image/fetch/$s_!VrsY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd25fd1f-3a47-46bd-aaa2-1a3a527b5944_1344x768.heic" width="1344" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd25fd1f-3a47-46bd-aaa2-1a3a527b5944_1344x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:291603,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://wangari.substack.com/i/191130519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd25fd1f-3a47-46bd-aaa2-1a3a527b5944_1344x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VrsY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd25fd1f-3a47-46bd-aaa2-1a3a527b5944_1344x768.heic 424w, https://substackcdn.com/image/fetch/$s_!VrsY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd25fd1f-3a47-46bd-aaa2-1a3a527b5944_1344x768.heic 848w, https://substackcdn.com/image/fetch/$s_!VrsY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd25fd1f-3a47-46bd-aaa2-1a3a527b5944_1344x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!VrsY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd25fd1f-3a47-46bd-aaa2-1a3a527b5944_1344x768.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">AI can shine a light into the thicket &#8212; or make it worse, if not used wisely. 
Image generated with Leonardo AI</figcaption></figure></div><p>Large Language Models are reshaping the financial services industry. They can summarize dense regulatory documents, draft client-facing reports, and explain complex accounting movements in plain English &#8212; tasks that previously required hours of skilled analyst time. But they come with a fundamental flaw that is unacceptable in a regulated environment: they hallucinate.</p><p>An LLM hallucination is not a random glitch. It is a structural property of how these models work. They are trained to predict the next most probable token in a sequence, not to retrieve verified facts. When the model encounters a gap in its knowledge &#8212; an obscure regulatory reference, a specific numeric figure, a recent policy change &#8212; it fills that gap with whatever sounds statistically plausible. The output is fluent, confident, and potentially entirely wrong [1].</p><p>In most consumer applications, this is an annoyance. In finance, it is a liability. A fabricated figure in a quarterly commentary, a misquoted discount rate in an actuarial report, or an invented regulatory citation in a compliance document can trigger audit failures, regulatory sanctions, and material financial loss. Google&#8217;s Bard erased roughly &#163;100 billion of Alphabet&#8217;s market capitalisation in a single afternoon after hallucinating a fact about the James Webb Space Telescope during a live demo [1]. Air Canada was held legally responsible in court for a refund policy its chatbot invented [1]. The question is not whether your LLM will hallucinate. Research suggests it will do so in anywhere from 3% to 41% of finance-related queries [1]. The question is whether you have built a system that catches it before it causes harm.</p><p>This article presents a practical, production-oriented architecture for doing exactly that.</p><h2>Why Finance Cannot Tolerate Probabilistic Outputs</h2><p>Before examining the solution, it is worth being precise about the problem. The financial services industry is built on a principle that is fundamentally at odds with how LLMs operate: every number must be traceable to a source.</p><p>Under frameworks like IFRS 17 for insurance contracts, or the FCA&#8217;s Consumer Duty in the UK, firms are not just expected to produce accurate outputs &#8212; they are expected to demonstrate how those outputs were produced. An audit trail is not optional; it is a regulatory requirement. An LLM that generates a plausible-sounding explanation of a CSM movement, drawing on its training data rather than the firm&#8217;s actual figures, fails this requirement entirely, even if the explanation happens to be correct.</p><p>The challenge, then, is not to find a more accurate LLM. It is to design a system in which the LLM&#8217;s role is so tightly constrained that hallucination becomes structurally impossible.</p><h2>The Architecture: Deterministic First, Language Model Last</h2><p>The core principle of a hallucination-free pipeline is simple: the LLM should never be asked to know anything. It should only be asked to say something, based on facts that have already been computed and verified by deterministic code.</p><p>This inverts the typical approach, where an LLM is given a broad question and trusted to retrieve and reason over relevant information.
Instead, the pipeline separates the computation layer from the communication layer entirely.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;f9553165-6ce9-4c7d-8e5d-3f40839e69ed&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">deterministic calculations
        &#8595;
structured metrics
        &#8595;
template / rule generation
        &#8595;
LLM summarization layer
        &#8595;
validation checks</code></pre></div><p>Each stage has a clearly defined responsibility, and no stage is permitted to introduce information that has not been verified by the stage before it.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;aa7b8e55-82ef-4702-b0fb-ae2508a3a171&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">| Stage                      | Responsibility                                                                                     | Hallucination Risk                                                                  |
| -------------------------- | -------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
| Deterministic calculations | Compute all financial metrics from raw data using auditable code                                    | None &#8212; purely mathematical                                                         |
| Structured metrics         | Package outputs into a typed, validated data structure (e.g., a Python dict or Pydantic model)     | None &#8212; data is already verified                                                     |
| Template / rule generation | Populate a prompt template with the structured metrics, providing explicit context and constraints | None &#8212; the prompt is constructed programmatically                                   |
| LLM summarization          | Generate human-readable narrative based *only* on the inputs provided in the prompt                | Minimal &#8212; the LLM cannot introduce external facts if the prompt is well-constrained |
| Validation checks          | Parse the LLM's output and verify that all numeric claims match the source metrics                 | Catches any residual errors before the output is used                               |</code></pre></div><p>The key insight is that the LLM is not a reasoning engine in this pipeline. It is a prose generator. Its job is to turn a structured set of verified facts into a coherent paragraph of English. That is a task it performs extremely well, and it is a task that does not require it to know anything beyond what it has been explicitly told.</p><h2>Technical Deep Dive: A Minimal Pipeline for IFRS 17 Commentary</h2><p>To make this concrete, consider the challenge of generating automated commentary on insurance liability movements under IFRS 17. This is a genuinely difficult reporting task. The standard is technically demanding, requiring insurers to track the Contractual Service Margin (CSM) &#8212; the unearned profit deferred over the life of a group of contracts &#8212; alongside discount rate sensitivities, risk adjustments, and experience variances [2]. Explaining these movements in plain language for a board report or investor disclosure is exactly the kind of task where an LLM can add real value, provided it is given the right inputs.</p><p>The pipeline begins with a deterministic calculation function:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;2f28c349-a763-48a8-b959-bb71f3ecb346&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import pandas as pd

def compute_ifrs17_metrics(df: pd.DataFrame) -&gt; dict:
    """
    Compute key IFRS 17 metrics from a portfolio DataFrame.
    All calculations are deterministic and auditable.
    Returns a structured dictionary of verified figures.
    """
    csm_opening = df["csm_opening"].sum()
    csm_closing = df["csm_closing"].sum()
    csm_delta = csm_closing - csm_opening

    discount_opening = df["discount_opening"].sum()
    discount_closing = df["discount_closing"].sum()
    discount_delta = discount_closing - discount_opening

    return {
        "csm_opening": csm_opening,
        "csm_closing": csm_closing,
        "csm_delta": csm_delta,
        "discount_opening": discount_opening,
        "discount_closing": discount_closing,
        "discount_delta": discount_delta,
    }

# df: the portfolio-level DataFrame loaded upstream; it must contain the
# csm_* and discount_* columns referenced above.
metrics = compute_ifrs17_metrics(df)</code></pre></div><p>This function is entirely deterministic. Given the same input DataFrame, it will always produce the same output. It can be unit-tested, version-controlled, and audited. No LLM is involved at this stage.</p><p>The structured output is then injected into a tightly constrained prompt template:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;00be31a3-52ca-408e-9a9b-0c5db7afa2da&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">prompt = f"""
You are an IFRS 17 reporting assistant. Your task is to explain the change in
insurance liability for the current reporting period. Use only the figures
provided below. Do not introduce any external information, assumptions, or
estimates. Do not perform any calculations.

Reporting period metrics:
- CSM opening balance: {metrics['csm_opening']:,.0f}
- CSM closing balance: {metrics['csm_closing']:,.0f}
- CSM net change: {metrics['csm_delta']:,.0f}
- Discount rate impact (opening): {metrics['discount_opening']:,.0f}
- Discount rate impact (closing): {metrics['discount_closing']:,.0f}
- Discount rate net change: {metrics['discount_delta']:,.0f}

Write a two-paragraph commentary suitable for inclusion in a board report.
"""

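# llm() is a stand-in, not a concrete API: wire it to whichever chat-completion
# client your stack uses, pinning the model ID and a low temperature so the
# call is reproducible for the audit trail.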
response = llm(prompt)</code></pre></div><p>Notice the explicit constraints in the prompt: &#8220;use only the figures provided below,&#8221; &#8220;do not introduce any external information,&#8221; &#8220;do not perform any calculations.&#8221; These instructions are not just good practice &#8212; they are the primary mechanism by which hallucination is prevented at the LLM layer. The model is given no latitude to improvise.</p><p>The LLM&#8217;s output might then read something like:</p><blockquote><p>&#8220;During the reporting period, the Contractual Service Margin decreased by $1,500,000, moving from an opening balance of $42,300,000 to a closing balance of $40,800,000. This reduction reflects the release of deferred profit as insurance services were provided to policyholders over the period, consistent with the Group&#8217;s coverage pattern.</p><p>The discount rate contributed a net positive movement of $500,000 to the liability measurement. The increase in the discount rate applied to the liability for remaining coverage reduced the present value of future fulfilment cash flows, partially offsetting the CSM release and resulting in a net decrease in total insurance contract liabilities for the period.&#8221;</p></blockquote><p>This commentary is accurate, traceable, and audit-ready &#8212; because every number in it came from a deterministic calculation, not from the LLM&#8217;s training data.</p><h2>Guardrails: The Final Line of Defence</h2><p>Even with a well-constrained prompt, a production system should implement automated validation checks before the LLM&#8217;s output is used. Think of these as the equivalent of a compiler&#8217;s type checker: the code may look correct, but you still run the tests.</p><p>Numeric verification is the most straightforward guardrail. A simple regular expression or number-extraction routine can parse the LLM&#8217;s output and compare every figure against the source metrics dictionary. Any discrepancy &#8212; even a rounding difference &#8212; should trigger a flag for human review. This check is fast, cheap, and catches the most common class of residual error; a minimal sketch appears below.</p><p>Structured outputs take this a step further. Rather than generating free prose and then parsing it, the LLM can be instructed to return a JSON object with a predefined schema. A Pydantic model can then validate the output programmatically, ensuring that all required fields are present, all numeric values fall within expected ranges, and the output is machine-readable for downstream systems. This approach is particularly valuable in automated pipelines where the commentary feeds directly into a reporting database or document generation system.</p><p>Retrieval-Augmented Generation (RAG) addresses a different class of risk: the need to reference external documents such as regulatory guidance, internal policies, or prior-period disclosures. Rather than allowing the LLM to draw on its training data &#8212; which may be outdated, incomplete, or simply wrong &#8212; a RAG system retrieves the relevant passages from a pre-approved, version-controlled knowledge base and injects them directly into the prompt [3]. The LLM is then constrained to cite only what it has been given. This is particularly important for regulatory commentary, where a fabricated reference to a non-existent guidance note could have serious consequences.</p>
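<p>To make the first two guardrails concrete, here is a minimal sketch, reusing the metrics dictionary and the response string from the pipeline above. The regex, the tolerance, and the CommentaryOutput fields are illustrative assumptions (the schema check assumes Pydantic v2), not a prescribed implementation.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import re
from pydantic import BaseModel, ValidationError

def extract_numbers(text: str) -&gt; list[float]:
    """Pull numeric tokens (commas and decimals included) out of free prose."""
    return [float(tok.replace(",", "")) for tok in re.findall(r"\d[\d,]*\.?\d*", text)]

def verify_numbers(commentary: str, metrics: dict, tolerance: float = 0.5) -&gt; list[float]:
    """Return every figure in the commentary that matches no source metric.
    An empty list means the output passed; anything else is flagged for
    human review. Comparison is on absolute values, because prose reports
    signed deltas as "decreased by" rather than with a minus sign."""
    allowed = {round(abs(v)) for v in metrics.values()}
    return [
        n for n in extract_numbers(commentary)
        if not any(abs(n - a) &lt;= tolerance for a in allowed)
    ]

# Structured-output variant: ask the LLM for JSON and validate the schema.
class CommentaryOutput(BaseModel):
    csm_opening: float
    csm_closing: float
    csm_delta: float
    narrative: str

def validate_structured(raw_json: str) -&gt; CommentaryOutput | None:
    """Return a validated object, or None so the caller can reject the output."""
    try:
        return CommentaryOutput.model_validate_json(raw_json)
    except ValidationError:
        return None

unmatched = verify_numbers(response, metrics)
if unmatched:
    print(f"Flag for human review: unverified figures {unmatched}")</code></pre></div><p>In production, the allowed set would be derived from the exact formatted strings injected into the prompt rather than raw floats, but the principle is unchanged: any figure that cannot be traced back to the metrics dictionary blocks the output.</p>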
<p>Agent validators represent the most sophisticated layer of the stack. A second LLM agent, operating independently of the primary commentary agent, can be tasked with reviewing the output against the source data and a set of validation rules. This agent is not asked to generate prose; it is asked to answer a binary question: &#8220;Does this commentary accurately reflect the provided metrics?&#8221; If the answer is no, the output is rejected and the primary agent is asked to regenerate. This pattern mirrors the four-eyes principle that is already standard practice in financial controls; a sketch of the reviewer loop follows below.</p>
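<p>Here is a minimal sketch of that reviewer loop. It reuses the placeholder llm() helper and the prompt variable from the pipeline above; the JSON verdict format and the retry budget are illustrative assumptions.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import json

VALIDATOR_PROMPT = """
You are a reviewing agent. Below are verified source metrics and a draft
commentary. Reply with exactly one JSON object: {"approved": true} if every
claim in the commentary is supported by the metrics, {"approved": false}
otherwise. Do not add any other text.

Metrics: %s

Commentary: %s
"""

def validate_with_agent(commentary: str, metrics: dict, max_retries: int = 2) -&gt; str | None:
    """Four-eyes check: an independent LLM call approves or rejects the draft.
    Returns an approved commentary, or None so the caller can escalate."""
    draft = commentary
    for _ in range(max_retries + 1):
        verdict = llm(VALIDATOR_PROMPT % (json.dumps(metrics), draft))
        try:
            approved = json.loads(verdict).get("approved") is True
        except json.JSONDecodeError:
            approved = False  # a malformed verdict counts as a rejection
        if approved:
            return draft
        draft = llm(prompt)  # regenerate with the original, constrained prompt
    return None  # retries exhausted: escalate to human review</code></pre></div><p>Because the reviewing agent never writes prose itself, a confused validator can only block an output; it cannot introduce new claims.</p>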
<div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;3c4b835e-cb37-4dc3-80c6-17f3b5f63c55&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">| Guardrail                      | What It Catches                                                | Complexity |
| ------------------------------ | -------------------------------------------------------------- | ---------- |
| Numeric verification           | Figures in the output that do not match source metrics         | Low        |
| Structured outputs             | Missing fields, out-of-range values, malformed responses       | Low&#8211;Medium |
| Retrieval-Augmented Generation | Fabricated regulatory references, outdated policy citations    | Medium     |
| Agent validators               | Semantic inaccuracies, logical inconsistencies, missed context | High       |</code></pre></div><p>A production system does not need to implement all four simultaneously. For most automated commentary use cases, numeric verification and structured outputs provide sufficient coverage. RAG becomes essential when the commentary must reference external documents. Agent validators are most valuable in high-stakes workflows &#8212; board-level reporting, regulatory submissions &#8212; where the cost of an error is highest.</p><h2>Governance and Auditability: The Non-Negotiable Layer</h2><p>Technical guardrails are necessary but not sufficient. A truly production-ready system must also address the governance requirements that financial regulators increasingly impose on AI-assisted workflows.</p><p>Every output generated by the pipeline should carry a full audit trail: the version of the calculation code used, the input data hash, the prompt template version, the LLM model identifier and temperature setting, and the results of each validation check. This metadata should be stored alongside the output and be retrievable on demand. When an auditor asks &#8220;how was this commentary generated?&#8221;, the answer should be a complete, reproducible record &#8212; not &#8220;the AI wrote it.&#8221;</p><p>Firms operating under the FCA&#8217;s Consumer Duty or equivalent frameworks should also consider how they will handle cases where the validation checks fail. A clear escalation path &#8212; from automated rejection, to human review, to senior sign-off &#8212; should be defined before the system goes live. The governance framework is as important as the technical architecture.</p><h2>The Bottom Line: Benefits Without the Risks</h2><p>The promise of LLMs in finance is real. The ability to generate clear, accurate, audit-ready commentary at scale &#8212; across thousands of contracts, portfolios, or client accounts &#8212; represents a genuine step change in operational efficiency. But that promise can only be realised if the architecture is designed from the ground up to prevent hallucination, not merely to detect it after the fact.</p><p>The pipeline described in this article &#8212; deterministic calculations feeding structured metrics into a constrained prompt, with validation guardrails at the output layer &#8212; provides a practical, implementable path to that goal. It is not a theoretical framework; it is a pattern that can be built, tested, and deployed with standard Python tooling and a well-chosen LLM API.</p><p>The key discipline is one of role clarity. Deterministic code computes. Structured data carries. Prompt templates constrain. The LLM narrates. Validators verify. 
When each component does only its job, the system as a whole becomes trustworthy &#8212; and that is the only acceptable standard for AI in finance.</p>]]></content:encoded></item><item><title><![CDATA[AI is a Communication Tool, Not an Analyst]]></title><description><![CDATA[A framework for deploying AI in finance that regulators and your board can trust]]></description><link>https://newsletter.wangari.global/p/ai-is-a-communication-tool-not-an</link><guid isPermaLink="false">https://newsletter.wangari.global/p/ai-is-a-communication-tool-not-an</guid><dc:creator><![CDATA[Ari Joury]]></dc:creator><pubDate>Thu, 19 Mar 2026 07:02:23 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/191133473/27e84274373431c8c379c98024ed0e78.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Artificial intelligence is rapidly entering the core workflows of financial institutions. Boards, CFOs, and innovation teams are increasingly interested in one particular capability: automatically generating commentary on financial results, portfolio performance, and regulatory changes.</p><p>The promise is compelling. Instead of analysts spending hours assembling reports, AI could produce clear summaries in seconds. Expertise that currently exists in small teams could scale across entire organizations.</p><p>But there is a problem.</p><p>Large Language Models &#8212; the engines behind tools like ChatGPT and Claude &#8212; have a fundamental limitation: they are not built to guarantee factual accuracy. They are built to generate plausible language. When these systems lack information, they do not return an error. They improvise.</p><p>In a regulated industry such as finance, that is unacceptable.</p><p>An invented number in a financial report, an incorrect interpretation of a regulatory requirement, or a fabricated policy detail could expose a firm to compliance failures, audit issues, and reputational damage. For this reason, some institutions have reacted defensively by banning AI tools altogether.</p><p>That response misses the point.</p><p>The real challenge is not whether to use AI, but how to design systems around it. Financial institutions do not need AI to calculate numbers &#8212; they already have deterministic systems that do this reliably. What AI is uniquely good at is communicating information clearly.</p><p>The safest architecture therefore separates these two roles.</p><p>In a trustworthy AI system, all calculations happen first in traditional, auditable code. Financial metrics are computed deterministically and stored as verified data. These figures form a single source of truth.</p><p>Only after this step does AI enter the process.</p><p>The AI receives a structured dataset containing the verified numbers and a strict template describing how the commentary should be written. Its task is narrow: transform the validated facts into readable language. It is not allowed to introduce new information or perform independent calculations.</p><p>Finally, the system performs an automated validation step. Every number in the AI-generated text is checked against the original dataset. If any discrepancy appears &#8212; even a single digit &#8212; the output is rejected and flagged for human review.</p><p>This approach fundamentally changes the role of AI in financial workflows. 
Instead of acting as an autonomous analyst, it becomes a communication layer on top of deterministic systems.</p><p>The result is powerful.</p><p>Institutions can generate large volumes of commentary across portfolios, reports, and internal dashboards almost instantly. Analysts are freed from repetitive reporting tasks and can focus on higher-value analysis. At the same time, the system maintains a complete audit trail for every statement produced.</p><p>In other words, firms gain scalability without sacrificing reliability.</p><p>As AI adoption accelerates, the competitive advantage will not go to organizations that experiment with the most powerful models. It will go to those that build the most trustworthy systems around them.</p><p>In finance, innovation succeeds only when it preserves accountability. AI is no exception.</p>]]></content:encoded></item></channel></rss>