Does Causal Inference Overly Rely on Assumptions?
What recent research papers on causal inference teach us
Causal inference has become one of the most exciting areas of data science. (I’m biased, but I stand by that claim.)
With the rise of libraries like DoWhy, EconML, and newer ML-heavy approaches such as CausalPFN or DAG-aware transformers, it feels like we’re moving closer to an “automated” solution: feed in data, get out causal answers.
But one thing about all this has been bothering me: every causal estimate rests on assumptions. Any causal model needs to tick the boxes of ignorability (no unmeasured confounders), overlap (treatment and control groups actually comparable), and correct model specification (no hidden functional form mistakes). Violate any of these, and the entire causal edifice can wobble.
Ummm, how do we avoid violating these assumptions in the real world of messy data?
My take is that it all lives on a spectrum, and as long as you don’t violate ignorability, overlap, and correct model specification too badly, you’re probably doing okay. (Luckily, you can stress-test your model to find out how bad it is.)
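To make “stress-test” concrete: libraries like DoWhy ship refutation tests that perturb your model and check whether the estimate survives. Here is a minimal sketch on invented toy data (the variable names and data-generating process are made up for illustration):

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel

# Hypothetical toy data: x confounds both treatment t and outcome y.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
t = (x + rng.normal(size=2000) > 0).astype(int)
y = 2 * t + x + rng.normal(size=2000)
df = pd.DataFrame({"x": x, "t": t, "y": y})

model = CausalModel(data=df, treatment="t", outcome="y", common_causes=["x"])
estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")

# Stress tests: a robust estimate should barely move under these refuters.
for refuter in ("random_common_cause", "placebo_treatment_refuter"):
    print(model.refute_estimate(estimand, estimate, method_name=refuter))
```

If adding a random common cause shifts the estimate noticeably, or a placebo treatment produces a non-zero “effect”, that is your cue to distrust the pipeline.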
Now, with machine learning everywhere, the question becomes: are new ML-heavy methods making us more robust to these model failures, or are they amplifying the risks?
Recent research suggests… both. Some papers show clever ways to relax or bypass assumptions. Others remind us that without diagnostics and sensitivity checks, automation may hide fragility rather than fix it. Here’s an overview of the latest thinking and what it means for practitioners (like myself) trying to apply causal inference outside the lab.
Why Assumptions Matter
Causal inference relies heavily on assumptions. It’s like “here’s how I think the world works, let’s test whether I’m right.” That first part of the sentence is the assumption.
The problem is that in applied settings, assumptions almost never hold perfectly:
Overlap can fail when certain groups never receive treatment.
Ignorability can fail because key confounders weren’t measured.
Model specification can fail if functional forms are misspecified or proxies are noisy.
Each violation undermines our estimates in ways that no amount of computational power can simply fix. This fragility has been noted since the earliest days of causal inference, but what’s changing is the ambition.
With ML-heavy methods like causal forests, transformers, or pretrained causal predictors, we’re increasingly trying to automate causal estimation. The more automated the pipeline, the easier it becomes to forget that fragile foundations still underpin it.
As Ghosh & Rothenhäusler (2025) argue in their assumption-robust causal inference framework, robustness is not about eliminating assumptions, but rather about acknowledging uncertainty across many possible adjustment sets.
Ignorability and the Problem of Hidden Confounders
Real-world datasets almost always miss something, for example, unrecorded patient history in healthcare, unmeasured sentiment in financial markets, or unobserved culture in organizational studies. How, then, do you satisfy ignorability?
There are some solutions in the works for this conundrum. One promising stream is proximal causal inference, which leverages proxy variables to account for hidden confounders. Classic proximal methods assumed all proxies were valid, but new work on fortified proximal inference relaxes this: even if some proxies are invalid, it’s possible to identify causal effects under weaker conditions. This makes the methods far more usable in messy observational data.
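To see why proxies help but are no free lunch, here is a toy simulation (all names and numbers are invented): adjusting for a noisy proxy of a hidden confounder removes part of the bias, and the residual gap is exactly what proximal methods with valid proxies aim to close.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 50_000

U = rng.normal(size=n)                         # hidden confounder
W = U + rng.normal(size=n)                     # noisy proxy of U
T = (U + rng.normal(size=n) > 0).astype(float)
Y = 2.0 * T + 3.0 * U + rng.normal(size=n)     # true effect of T is 2.0

def treatment_coef(controls):
    """OLS coefficient on T, controlling for the given covariates."""
    X = np.column_stack([T] + controls)
    return LinearRegression().fit(X, Y).coef_[0]

print("naive, no adjustment:   ", round(treatment_coef([]), 2))   # badly biased
print("adjusting for proxy W:  ", round(treatment_coef([W]), 2))  # partially corrected
print("oracle, adjusting for U:", round(treatment_coef([U]), 2))  # ~2.0
```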
Another direction is sensitivity analysis. Instead of assuming ignorability, researchers quantify how large unmeasured confounding would have to be to overturn results. This quite naturally extends into bounding frameworks: rather than point estimates, analysts report intervals that remain valid even if ignorability is partly violated.
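One widely used example of this style of reasoning is the E-value: the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both treatment and outcome to fully explain away an observed effect. A minimal sketch:

```python
import math

def e_value(rr: float) -> float:
    """E-value for an observed risk ratio rr: how strong an unmeasured
    confounder must be (risk-ratio scale, with both treatment and
    outcome) to explain the association away entirely."""
    rr = max(rr, 1 / rr)  # symmetric handling of protective effects
    return rr + math.sqrt(rr * (rr - 1))

# An observed risk ratio of 1.8 would need a confounder associated with
# both treatment and outcome at roughly RR = 3 to be explained away.
print(round(e_value(1.8), 2))  # 3.0
```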
The key insight is that ignorability is rarely black-and-white. Modern methods try to blur the edges: either by tolerating invalid proxies, or by bounding effects under uncertainty. These are not silver bullets, but they represent progress toward causal inference that is honest about the limits of our data.
Overlap and Positivity
Another foundational assumption is overlap, also called positivity: every unit must have a realistic chance of receiving each treatment level, so that treated units have comparable controls and the causal effect can actually be quantified.
Violations of overlap are common. In medicine, certain drugs may never be prescribed to specific subpopulations. In finance, only large firms may adopt sustainability disclosures, leaving no “untreated” peers of similar size.
Recent research is confronting this. For example, Rafieian et al. (2025) propose a matrix completion approach to correct for bias when overlap fails: by treating the lack of overlap as a missing-data problem and assuming the treatment-effect surface has a low-rank structure, their method recovers causal effects in settings where standard weighting or trimming would fail.
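The sketch below is not their estimator, just a generic “soft-impute” routine that conveys the core idea: treat the block of unobserved outcomes as missing matrix entries and fill them in under a low-rank assumption.

```python
import numpy as np

def soft_impute(M, mask, lam=0.1, n_iters=500):
    """Fill in entries of M where mask is False, assuming M is
    approximately low-rank (iterative SVD soft-thresholding)."""
    Z = np.where(mask, M, 0.0)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)          # shrink singular values
        Z = np.where(mask, M, (U * s) @ Vt)   # keep observed entries fixed
    return Z

# Toy example: a rank-1 outcome surface with a structurally missing block,
# mimicking a subgroup whose treated outcomes are never observed.
rng = np.random.default_rng(1)
M = rng.normal(size=(20, 1)) @ rng.normal(size=(1, 8))
mask = np.ones(M.shape, dtype=bool)
mask[:10, :4] = False                         # no overlap for this block
M_hat = soft_impute(M, mask)
print(np.abs(M_hat[~mask] - M[~mask]).max())  # recovery error on the block
```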
In parallel, McClean & Díaz (2025) address violations of positivity in longitudinal data with new estimands (cumulative cross-world weighted effects), which are identifiable even without assuming full positivity, though with interpretability trade-offs.
The lesson is that overlap violations don’t just increase variance — they change the target of inference. Modern methods help salvage estimands in partial-overlap scenarios. Nevertheless, one must bear in mind that the effect is being estimated for a narrower population than originally intended.
Model Misspecification in the Age of ML
Even if ignorability and overlap hold, causal inference still relies on correct model specification. Traditional approaches assumed parametric models (e.g. logistic regression for propensity scores). If those models were wrong, estimates were biased.
Machine learning was supposed to solve this. By using highly flexible learners (such as random forests, boosting, and neural nets) we can approximate complex functions without hand-specifying them. This is the logic behind double/debiased machine learning and causal forests.
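The core trick is cross-fitting: residualize both the outcome and the treatment on covariates with flexible learners, then relate the residuals. A minimal sketch on invented toy data (the data-generating process is made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_ate(Y, T, X, n_splits=5, seed=0):
    """Cross-fitted double ML for a partially linear model: residualize
    Y and T on X out-of-fold, then regress Y-residuals on T-residuals."""
    y_res, t_res = np.zeros_like(Y), np.zeros_like(T)
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        m_y = RandomForestRegressor(random_state=seed).fit(X[train], Y[train])
        m_t = RandomForestRegressor(random_state=seed).fit(X[train], T[train])
        y_res[test] = Y[test] - m_y.predict(X[test])
        t_res[test] = T[test] - m_t.predict(X[test])
    return float(t_res @ y_res / (t_res @ t_res))

# Confounded toy data with a true effect of 1.5.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
T = X[:, 0] + rng.normal(size=2000)
Y = 1.5 * T + np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(size=2000)
print(dml_ate(Y, T, X))  # should land near 1.5
```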
The problem is that this flexibility doesn’t erase misspecification: ML models can still extrapolate poorly, overfit, or embed their own biases. (If we throw inadequate models at ever more complex problems, we might just amplify erroneous thinking!)
A recent example is the rise of transformer-based causal estimators. These promise automated effect estimation across diverse data-generating processes. But as other researchers have shown in their evaluations of differentiable causal discovery, such models are surprisingly fragile under misspecified data-generating assumptions. They can detect spurious relationships or miss causal edges entirely.
The trend in current research is to build doubly robust estimators that hedge against model failure: if either the propensity score model or the outcome model is correct, estimates remain valid.
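A minimal sketch of the classic doubly robust (AIPW) estimator is below; for brevity the nuisance models are fit in-sample, whereas in practice you would cross-fit them just like in the double-ML sketch above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def aipw_ate(Y, T, X):
    """Augmented IPW estimate of the ATE (T must be a 0/1 array):
    consistent if either the propensity model or the outcome models
    are correctly specified."""
    e = GradientBoostingClassifier().fit(X, T).predict_proba(X)[:, 1]
    e = np.clip(e, 0.01, 0.99)  # guard against extreme weights
    mu1 = GradientBoostingRegressor().fit(X[T == 1], Y[T == 1]).predict(X)
    mu0 = GradientBoostingRegressor().fit(X[T == 0], Y[T == 0]).predict(X)
    return float(np.mean(mu1 - mu0
                         + T * (Y - mu1) / e
                         - (1 - T) * (Y - mu0) / (1 - e)))
```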
But as any practitioner knows, “at least one is right” is still a big assumption. Diagnostics, cross-validation, and sanity checks remain indispensable at this stage.
What Practitioners Can Do
So, what can we do in all this mess? The key is not to abandon causal inference just because assumptions wobble, but to adopt a more transparent workflow.
Diagnose overlap. Before estimating effects, check propensity score distributions or covariate balance (a minimal sketch follows after this list). Severe violations should trigger trimming, weighting adjustments, or matrix-completion-style methods.
Run sensitivity analyses. Tools like Rosenbaum bounds or modern implementations in R/Python allow you to quantify how robust results are to hidden confounding.
Be explicit about populations. If overlap is limited, make clear that the effect applies to a narrower subgroup.
Leverage robust estimators. Doubly robust methods (e.g., double ML, causal forests) guard against misspecification in one nuisance model, though not both. Use cross-validation and diagnostics to avoid blind trust.
Communicate uncertainty. Present intervals, not just point estimates. Decision-makers value clarity about what assumptions underlie the numbers and how results might shift if those assumptions fail.
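To make the first two items concrete, here is a minimal sketch of quick overlap and balance diagnostics (the |SMD| > 0.1 threshold is a common rule of thumb, not a law):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def overlap_diagnostics(X, T):
    """Two quick checks: propensity-score range per treatment group, and
    the standardized mean difference (SMD) of each covariate."""
    ps = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
    for g in (0, 1):
        p = ps[T == g]
        print(f"group {g}: propensity scores in [{p.min():.3f}, {p.max():.3f}]")
    mu1, mu0 = X[T == 1].mean(axis=0), X[T == 0].mean(axis=0)
    sd = np.sqrt((X[T == 1].var(axis=0) + X[T == 0].var(axis=0)) / 2)
    for j, s in enumerate((mu1 - mu0) / sd):
        flag = "  <-- check balance" if abs(s) > 0.1 else ""
        print(f"covariate {j}: SMD = {s:+.2f}{flag}")
```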
Ultimately, robust causal inference is less about clever estimators than about disciplined practice: checking assumptions, reporting sensitivity, and staying honest about limits.
The Bottom Line: Maybe We Can’t Automate Causation at All
Causal inference is not—and may never be—a push-button exercise. Every estimate rests on assumptions about data, design, and context. The newest ML-heavy methods are powerful, but they don’t make assumptions disappear. At best, they change the trade-offs: more flexibility, but also more ways to go wrong if the underlying conditions fail.
The good news is that the field is moving fast. Recent research is actively developing ways to cope with assumption violations. The methods are wide-ranging, from matrix completion that addresses weak overlap, to proximal methods that tolerate invalid proxies, to assumption-robust approaches that combine multiple adjustment sets, all the way to bounding frameworks that provide honest intervals when certainty is impossible.
What I’m retaining from it is this: Don’t treat causal ML like predictive ML. You can’t just optimize accuracy and call it a day. Instead, embrace a workflow that tests assumptions, runs sensitivity analyses, and communicates uncertainty honestly.
If causal inference is about answering “what works,” then the deeper question is always “under what assumptions does it work?” Understanding that, and staying transparent about it, is in my view the only way to unleash the huge power that causal inference truly brings.