Should Causal Inference Be Assumption-First or Data-First?
Human assumptions introduce bias, but data-first approaches are vulnerable to spuriousness
As with many techniques in statistics and modeling, one of the trickiest questions in causal inference is: where should we begin?
Some argue that the only safe starting point is with expert assumptions. Draw the graph of how you believe the world works, encode domain knowledge about what causes what, and then let the data test or refine it. Robert Ness, in Causal AI, insists this is non-negotiable: a causal graph is not just a data artifact, it’s a statement of belief about the underlying data-generating process.
But in practice, platforms like RootCause flip the script. While they allow for expert knowledge, they advise against anchoring your analysis in expert priors too early. Instead, they recommend visualizing dependencies straight from the data—contracts lead to charges, churn follows missed payments—and only later layering on causal logic.
Their argument is that experts are biased, blind to surprises, and prone to codifying their own misconceptions into the graph.
This sets up a tension at the heart of modern causal modeling: should we trust human assumptions, or let the data speak first?
The answer matters. It shapes not just the graphs we draw, but the inferences, interventions, and strategies we derive from them.
The Case for Assumptions
Assumptions are not a weakness in causal inference—they are the starting point. Every causal analysis rests on beliefs about how the world works, whether we make them explicit or not. By insisting that we draw a directed acyclic graph (DAG) before touching the data, experts like Robert Ness force us to put those beliefs on the table.
This transparency has two virtues. First, it makes causal reasoning explainable: instead of saying “the model found this relationship,” we can say “given our assumption that contracts precede charges, here’s the estimated effect.” Second, it lets us include variables that may not even appear in the dataset but are known to matter—such as macroeconomic conditions in a credit risk model, or regulation in a sustainability analysis.
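Putting those beliefs on the table can be as literal as writing the edges down and checking that they form a DAG. The sketch below is illustrative only: the telco-style variable names (including `macro_conditions`, which need not appear in any dataset) are invented, not drawn from a real model.

```python
# A minimal sketch: expert assumptions encoded as explicit edges,
# plus a cycle check, since a causal graph must be acyclic.
# All variable names here are hypothetical.
from collections import defaultdict

# Each edge is a stated belief: "cause -> effect".
edges = [
    ("contract", "monthly_charges"),
    ("monthly_charges", "churn"),
    ("missed_payments", "churn"),
    ("macro_conditions", "missed_payments"),  # believed to matter, absent from the data
]

def is_acyclic(edges):
    """Depth-first search with coloring: returns False if any cycle exists."""
    graph = defaultdict(list)
    for cause, effect in edges:
        graph[cause].append(effect)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def visit(node):
        color[node] = GRAY
        for child in graph[node]:
            if color[child] == GRAY:  # back edge: a cycle
                return False
            if color[child] == WHITE and not visit(child):
                return False
        color[node] = BLACK
        return True

    return all(visit(n) for n in list(graph) if color[n] == WHITE)

assert is_acyclic(edges)
```

The point is not the code but the discipline: every arrow is a falsifiable claim, visible to anyone who reads the edge list.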
Without expert priors, models can drift into spuriousness, mistaking correlation for causation. Worse, they may fail to generalize, breaking down the moment the environment shifts. A carefully scoped DAG provides guardrails, orienting the analysis toward mechanisms rather than patterns.
In other words, assumptions are not just baggage from the past—they are the scaffolding that makes causal inference robust, interpretable, and ultimately useful for decision-making.
The Case for Data
And yet, assumptions cut both ways. Experts are human, with blind spots shaped by incentives, experience, and bias. Encode those priors too early, and the DAG becomes a codified worldview—rigid, self-confirming, and resistant to surprise. This is the danger that data-first approaches seek to avoid.
Platforms like RootCause start with what the data itself reveals: dependencies, time-series patterns, structural constraints that are undeniable (for example, for a telco there are no monthly charges without a contract, no churn without a customer). By mapping these observable relationships before asserting causal arrows, practitioners reduce the risk of building castles on faulty priors. The method is inductive: let the structure emerge, then see which relationships persist under different views of the data.
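A minimal sketch of that inductive first pass might look like the following. The synthetic "telco" columns are invented for illustration, and the output is deliberately only an undirected dependency skeleton: causal arrows come later.

```python
# A data-first sketch: screen pairwise associations and keep only
# edges that clear a threshold. Synthetic, hypothetical data.
import random
import statistics
from itertools import combinations

def pearson(xs, ys):
    """Plain Pearson correlation, stdlib only."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(42)
n = 1000
tenure = [random.gauss(24, 8) for _ in range(n)]
charges = [2.0 * t + random.gauss(0, 5) for t in tenure]         # charges track tenure
churn_score = [-0.05 * c + random.gauss(0, 1) for c in charges]  # churn depends on charges
weather = [random.gauss(0, 1) for _ in range(n)]                 # irrelevant noise column

data = {"tenure": tenure, "charges": charges,
        "churn_score": churn_score, "weather": weather}

# Undirected skeleton: which pairs move together at all?
THRESHOLD = 0.3
skeleton = [(a, b) for a, b in combinations(data, 2)
            if abs(pearson(data[a], data[b])) > THRESHOLD]
```

Note what this screen does and doesn't do: it correctly drops the irrelevant `weather` column, but it also keeps the indirect tenure–churn association, which is exactly the kind of edge that later causal reasoning must question or prune.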
The advantage of this stance is humility. Data can surface counterintuitive patterns—hidden drivers of churn, non-obvious confounders, or time-lags that experts overlook. It also democratizes causal modeling, lowering the barrier for organizations that lack deep domain expertise.
The risk of spuriousness remains, but the gain is openness: a willingness to let the data challenge what we think we know.
Why Both Views Struggle in Isolation
Assumption-first and data-first approaches both promise clarity, but either one taken alone quickly runs into limits.
When we rely solely on expert priors, the model becomes as strong—or as fragile—as the expertise behind it. If the assumptions are wrong, the entire DAG is brittle. A misplaced causal relationship can cascade through the analysis, producing confident but misleading results. In corporate settings, this can reinforce groupthink: the model tells us what we already believe, while blind spots remain invisible.
On the other hand, a purely data-first workflow risks drowning in spuriousness. Statistical discovery tools can uncover hundreds of candidate relationships, many of which are artifacts of the sample rather than genuine causal links. Without an anchor in domain knowledge, the graph may shift each time new data arrives, undermining trust. Worse, patterns that look causal in one environment may vanish in another, leaving the model unstable when deployed.
The reality is that causality resists shortcuts. Expert priors without data are unfalsifiable; data without priors is uninterpretable. Both sides illuminate part of the problem but often fail to offer a self-contained solution. That’s why modern causal inference increasingly seeks a synthesis: an iterative loop where assumptions and evidence challenge and refine one another.
Towards a Synthesis
If assumptions alone bias us and data alone misleads us, then the only durable path forward is an integration of the two. The synthesis begins by acknowledging that every DAG is a hypothesis, not a fact. Experts can propose an initial structure, but that structure must be stress-tested against the data: do the implied conditional independencies hold? Do alternative graphs fit better? Conversely, data-driven discoveries must be subjected to expert scrutiny: are the suggested causal relationships plausible in light of how the world actually works?
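One concrete form of that stress-test: a proposed chain such as contract → charges → churn implies that contract and churn are independent given charges. For roughly linear relationships this can be checked with a first-order partial correlation. The data and coefficients below are invented for illustration only.

```python
# Stress-testing an implied conditional independence of a hypothesized
# chain: contract -> charges -> churn. Synthetic linear-Gaussian data.
import random
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation, stdlib only."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)
n = 2000
contract = [random.gauss(0, 1) for _ in range(n)]
charges = [0.8 * x + random.gauss(0, 0.5) for x in contract]
churn = [0.7 * c + random.gauss(0, 0.5) for c in charges]

r_xy = pearson(contract, churn)   # marginal association
r_xz = pearson(contract, charges)
r_yz = pearson(churn, charges)

# First-order partial correlation: association between contract and churn
# once charges is held fixed. A value near zero supports the chain.
partial = (r_xy - r_xz * r_yz) / ((1 - r_xz**2) * (1 - r_yz**2)) ** 0.5

print(f"marginal r = {r_xy:.2f}, partial r given charges = {partial:.2f}")
```

If the partial correlation had stayed large, the chain hypothesis would be in trouble: the data would be telling the expert that contract affects churn through some path the graph does not show.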
This back-and-forth can take many forms. Some workflows start with expert priors and use data to refute them, pruning away fragile assumptions. Others begin with broad data-driven discovery, then narrow the search space by layering in domain knowledge. Either way, the process is iterative: assumptions guide discovery, discovery reshapes assumptions.
The advantage of synthesis is robustness. Expert priors provide stability, ensuring the model doesn’t collapse into noise. Data provides adaptability, catching surprises and preventing dogmatism.
Together they form a dialogue, not a hierarchy—neither assumptions nor data get the last word. The workflow is messier, yes. But it’s also productive: the causal graph evolves through negotiation between human insight and empirical evidence, producing models that are both interpretable and resilient.
The Bottom Line: It’s Still Not Easy
So where does this leave us? Neither camp offers a silver bullet. If you only start with expert priors, you risk re-baking old biases into new models. Start only with the data, and you invite spurious arrows, false dependencies, and shifting graphs that crumble outside the training set. Each approach is brittle in isolation.
The most promising path is iterative. Begin with expert knowledge to set guardrails, then use the data to test, refute, and surprise you. Refine the graph, rerun the analysis, and repeat. In this cycle, assumptions and evidence check each other’s excesses: experts prevent nonsense graphs, data prevents overconfidence.
Still, it’s not easy. The tension between assumption-first and data-first reflects something deeper about causal inference: there is no neutral ground. Every graph encodes judgment calls—about what counts as relevant, what gets ignored, and what relationships are deemed plausible.
We may never fully resolve the debate. But perhaps that’s the point: causal modeling isn’t about choosing sides, it’s about cultivating a disciplined dialogue between human insight and machine evidence. The hard part is accepting that the first finding is never the last.