Stop Trusting ML Predictions for the 2026 World Cup. Here's Why.
I wrote a book about soccer analytics with machine learning. The World Cup is where most of it breaks.

Every four years, the same thing happens.
A major football analytics site publishes its World Cup predictions. A few academic groups release theirs. Goldman Sachs releases one too, because Goldman Sachs releases one of everything. They run thousands of simulations, train models on Bundesliga data, layer in xG and shot-quality metrics, and then they tell you, with a confidence interval, that Brazil has a 17.2% chance of lifting the trophy.
Then the World Cup happens, and it doesn’t.
I’m not going to argue that ML can’t predict football. I just wrote a book about exactly that. What I’m going to argue is something narrower and more uncomfortable: most of the techniques that make ML useful in club football break down for the World Cup, and the analytics community has spent two decades pretending this isn’t true.
There are incentives for this (sports betting, anyone?), but that doesn’t make it true.
The training data is wrong
Football ML lives and dies on data. The richest dataset we have is the club game: five major European leagues, multiple cup competitions, roughly 1,800 top-flight league matches a year across those five leagues combined. Decades of it. Coaches who manage 50 matches a season for ten years. Players who play together every single week.
International football has approximately none of that.
A World Cup squad spends about 25 days together in the year leading up to the tournament. The starting eleven you’ll see against Argentina in June has almost certainly never played together as that exact lineup before this calendar year. Your fancy possession network metric? It was trained on teams that had 200 matches of muscle memory. The model doesn’t know that the back four you’re feeding it has played together six times.
The technical name for this is “distribution shift.” The honest name is “we have no idea what we’re predicting.” Most public World Cup models paper over this by aggregating individual player ratings into team strength scores. That sort of works for the group stage. It collapses in the knockout rounds, where formations get cagey, managers go conservative, and one substitution rewires the whole tactical setup.
If you’re going to deploy ML in a domain, the first question to ask is whether your training distribution matches your deployment distribution. For the World Cup, the honest answer is not even close.
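A crude first check on that question is to compare how one feature is distributed in the training data versus the deployment data. The sketch below computes a two-sample Kolmogorov-Smirnov statistic by hand. The feature (matches a starting lineup has played together) and every number in it are invented for illustration, not taken from any real dataset.

```python
from bisect import bisect_right

def ks_statistic(train, deploy):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(train), sorted(deploy)
    gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

# Hypothetical feature: matches each starting lineup has played together.
club_lineups = [40, 55, 62, 80, 95, 120, 150, 200]   # made-up club values
national_lineups = [1, 2, 3, 4, 6, 6, 8, 10]         # made-up international values

shift = ks_statistic(club_lineups, national_lineups)
print(f"KS statistic: {shift:.2f}")  # 0 = identical samples, 1 = no overlap at all
```

A value near 1 on a feature the model leans on is a sign that the model is being asked about a population it never saw.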
xG was never built for this
Expected goals is the best metric football analytics has produced. I use it in the book, I think it’s genuinely a step forward, and I’d defend it against anyone who calls it nonsense.
But xG is a shot-level metric trained on Premier League and Bundesliga shot data. It was designed for repeated trials in similar contexts. The World Cup gives you seven matches, maybe, for the team you care about. Half of those matches end in 1–0 results or worse. Aggregate-xG noise dominates the signal at small sample sizes — this is a basic statistical fact that gets quietly ignored when people pump World Cup predictions through their xG-based pipelines.
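To put a rough number on that small-sample noise, here is a minimal simulation sketch. It assumes goals in a match are Poisson-distributed around a fixed xG per match, which is a simplification, and the 1.4 xG figure is an arbitrary placeholder. It compares how far a team’s goals-per-match average wanders over 7 matches versus a 38-match league season.

```python
import math
import random
import statistics

def poisson(rng, lam):
    """Draw one Poisson sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def goals_per_match_spread(n_matches, xg_per_match=1.4, n_sims=5_000, seed=0):
    """Std. dev. of simulated goals per match when the true xG per match is fixed."""
    rng = random.Random(seed)
    averages = [
        sum(poisson(rng, xg_per_match) for _ in range(n_matches)) / n_matches
        for _ in range(n_sims)
    ]
    return statistics.pstdev(averages)

spread_world_cup = goals_per_match_spread(7)    # a deep tournament run
spread_league = goals_per_match_spread(38)      # a full league season
print(f"7 matches:  +/- {spread_world_cup:.2f} goals per match")
print(f"38 matches: +/- {spread_league:.2f} goals per match")
```

Under these assumptions the spread shrinks roughly with the square root of the number of matches, so the seven-match average is more than twice as noisy as the season-long one even when the underlying xG is identical.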
There’s a deeper problem. xG models are trained on “normal” attacking play in club football. They have no idea what to do with extra time in a knockout match where one team has been parking the bus for 30 minutes and is now trying to score on a single counter. The data those models learned from doesn’t contain very many of those situations. International knockout football contains a lot of them.
You can absolutely build an xG model that works for the World Cup. You just can’t use the Premier League one and assume it transfers.
Form doesn’t transfer
Here’s a thought experiment. A striker scores 32 goals in La Liga from August to May. His national team plays four friendlies in the meantime, in which he scores zero. Which is the predictive signal for what he’ll do at the World Cup?
Most public models implicitly say: weight the 32. Use his “true talent” as inferred from his club output. Plug it into the international model.
This is wrong for a specific reason. The 32 goals were scored in a system, with a manager he sees every day, with teammates who know exactly where to put the through-ball. He arrives at the World Cup as an extraordinary player attached to a team that has practiced his preferred runs perhaps twice. International form for international striker output is the more honest signal, even though the sample is brutal.
The football analytics community has known this for years. Every analyst will tell you so in private. None of the public predictive models I’ve seen meaningfully correct for it, because the obvious correction (downweighting club performance) catastrophically reduces the signal you have to work with.
You don’t get to wish away the small-sample problem by averaging in irrelevant data.
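The trade-off in the striker thought experiment can be made concrete with a small sketch. The blending rule below is my own illustration, not how any particular public model works: it counts each club match as a fraction of an international match and pools the two. The 32 goals and the four goalless friendlies come from the example above; the 38 club matches (a full La Liga season) are an assumption.

```python
def blended_rate(club_goals, club_matches, intl_goals, intl_matches, club_weight):
    """Goals-per-match estimate that counts club matches at a discount.

    club_weight = 1.0 treats a club match like an international one;
    club_weight = 0.0 ignores club output entirely.
    """
    effective_goals = club_weight * club_goals + intl_goals
    effective_matches = club_weight * club_matches + intl_matches
    return effective_goals / effective_matches

# 32 club goals and 0 goals in 4 friendlies are from the thought experiment;
# 38 club matches is an assumed full league season.
for weight in (1.0, 0.25, 0.05, 0.0):
    rate = blended_rate(32, 38, 0, 4, weight)
    effective_n = weight * 38 + 4
    print(f"club weight {weight:.2f}: {rate:.2f} goals/match "
          f"from {effective_n:.1f} effective matches")
```

As the club weight falls, the estimate rests on barely more than four matches. That is the small-sample problem in numbers: the more honest the weighting, the less data there is to stand on.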
So what does work?
That’s a reasonable question for ambitious people to ask. If most of football ML breaks for the World Cup, what’s actually predictive?
A few things are not quite as fashionable as ML pipelines but get closer to the right answer:
Squad market value. Boring, embarrassing, true. The total Transfermarkt valuation of a squad is one of the strongest publicly available predictors of tournament progression. Not because money buys winners, but because it’s a market-aggregated bet on individual quality, made by people with real money on the line. ML models often beat this baseline by 1–2 percentage points after thousands of features. It’s worth asking what you’ve actually added.
Manager continuity. Teams whose manager has been in place for 18+ months consistently outperform teams with new appointments. This is partly because they have a system, partly because the players have trust, and partly because the manager has had time to identify and stop using their worst players. It’s hard to put into a model cleanly; it shows up anyway when you do.
Tournament experience as a team. Not as individuals. The cohort of players who’ve played a knockout international match together has more predictive power than aggregate caps. Spain’s 2010 team had two and a half cycles of building. France’s 2018 team had two. The 2022 Argentina team had a manager who’d been in place for four years. There’s a pattern, even though “tournament experience as a team” doesn’t fit cleanly into a feature vector.
Group draw difficulty. Trivially obvious, but most ML models bake this into other features rather than respecting it as the structural variable it is. Whether your route to the semifinal goes through Brazil or through Costa Rica matters more than any in-game metric.
If your World Cup model can’t beat a simple weighted combination of those four, it isn’t doing anything that justifies its training cost.
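For reference, a baseline of that kind fits in a few lines. The sketch below is one possible shape for it, not a fitted model: the field names, the weights, and the three example teams are all placeholders I made up, and each feature is min-max scaled across the field before weighting.

```python
from dataclasses import dataclass

@dataclass
class Team:
    name: str
    squad_value: float        # total squad valuation, any consistent unit
    manager_months: float     # months the current manager has been in place
    shared_knockouts: float   # knockout internationals the core has played together
    draw_ease: float          # higher = easier projected route

# Illustrative weights only; they are not fitted to any data.
WEIGHTS = {
    "squad_value": 0.4,
    "manager_months": 0.2,
    "shared_knockouts": 0.2,
    "draw_ease": 0.2,
}

def min_max(values):
    """Scale values to [0, 1]; a constant feature maps to 0.5 for every team."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def baseline_scores(teams):
    """Weighted sum of the four scaled features, one score per team."""
    scores = {t.name: 0.0 for t in teams}
    for feature, weight in WEIGHTS.items():
        scaled = min_max([getattr(t, feature) for t in teams])
        for team, value in zip(teams, scaled):
            scores[team.name] += weight * value
    return scores

# Made-up teams and numbers, purely to show the mechanics.
teams = [
    Team("Team A", 900, 48, 6, 0.3),
    Team("Team B", 400, 10, 1, 0.8),
    Team("Team C", 650, 24, 3, 0.5),
]
scores = baseline_scores(teams)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

In practice the weights would be fitted on past tournaments. The point of the sketch is how little machinery the comparison baseline needs.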
The deeper thing
Football analytics keeps wanting to be Moneyball. It’s been trying for fifteen years. There’s been real progress — modern shot maps, possession value, EPV-style models, automated tracking data — and I’m not the guy who’s going to tell you it was wasted.
But the World Cup is the part of football where Moneyball logic breaks the hardest, because Moneyball relied on the law of large numbers. 162 baseball games. Repeated trials. The dice converge to their fair value over a long enough season.
The World Cup is seven matches per team. Maybe two of them are knockout games that go to penalties. The dice don’t converge over seven throws. They land somewhere, and you live with where they landed.
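The arithmetic behind that is easy to check. Here is a minimal sketch, assuming independent games with no draws and an arbitrary 55% per-game win probability for the better team. It computes the exact binomial probability that the better team ends up with a winning record over 7 games versus 162.

```python
from math import comb

def prob_winning_record(n_games, p_win=0.55):
    """P(a team with per-game win probability p_win wins more than half of n_games)."""
    need = n_games // 2 + 1
    return sum(
        comb(n_games, k) * p_win**k * (1 - p_win) ** (n_games - k)
        for k in range(need, n_games + 1)
    )

p_seven = prob_winning_record(7)       # a World Cup run
p_season = prob_winning_record(162)    # a baseball season
print(f"7 games:   {p_seven:.0%} chance the better team has a winning record")
print(f"162 games: {p_season:.0%} chance the better team has a winning record")
```

With the same edge per game, the long season lets quality show through far more reliably than seven games do.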
I wrote a whole book about how to use ML in soccer well — what to model, what not to, how to set up your training data, what to do about the messy stuff. Soccer Analytics with Machine Learning, out from O’Reilly at the end of June. About a tenth of it is World Cup-specific. The rest is the part that does work — the part you can use on the league football that fills the other 47 weeks of your year.
By the time the World Cup actually starts, half the predictive models you’ll see will already have been falsified by the warm-up matches. Watch the football. Watch the predictions. See for yourself which ones got Argentina–Saudi Arabia 2022.
I’d be impressed by anyone who got it right. I just wouldn’t pay them to do it again.
Meanwhile, at Wangari
If you’re curious, the book I mentioned earlier is coming out from O’Reilly Media in a couple of weeks! An unedited early release is already available on the O’Reilly Learning Platform (it’ll be replaced by the final version as soon as we’ve finished the last touches with the book’s production team).
I’ll let you know when the final version is out. It will be available wherever books are sold.
Reads of the Week
The Promise and Limits of Machine Learning in Football Attacking Analysis: In this deep dive for The xG Football Club, Alex Marin Felices reviews a landmark academic paper examining how supervised and unsupervised machine learning has been applied to attacking performance in professional football — from pass-pattern clustering to off-ball scoring opportunity models. He argues that while the field has moved well beyond simple event counts, most models still struggle to capture the contextual and interactive dynamics that actually drive goals. For a Wangari audience thinking about the gap between data richness and decision-making quality, this is a sharp reminder that more data does not automatically mean better insight.
Does Liverpool FC Have a Data Science Problem?: In this essay, data scientist Karthik S traces Liverpool’s post-Ian-Graham recruitment decline and argues the club is suffering from a classic failure of “non-agentic data science” — where models inform but do not recommend, and the critical translation layer between analysts and decision-makers has essentially broken down. The piece is a compelling case study in what happens when the head of a high-stakes data function moves on and institutional knowledge does not transfer cleanly. Anyone building or inheriting a data team in a high-variance, low-volume decision environment — think insurance underwriting or credit risk — will find the parallels uncomfortably familiar.
Football Fans Are Drowning in Data, Starved of Wisdom: In this essay for xGuff (Expected Guff), Thomas Aston applies the Data–Information–Knowledge–Wisdom (DIKW) pyramid to the explosion of football analytics and finds the pyramid severely bottom-heavy: vast quantities of event and tracking data at the base, but precious little wisdom at the tip. He walks through vivid examples of how correct statistics are routinely used to reach wildly incorrect conclusions — from manager-sacking survival rates to the ongoing xG wars on TalkSport — and asks whether the volume of information is actually making the sport harder to understand. It is a timely provocation for anyone in data-heavy industries who has ever wondered whether their dashboards are generating knowledge or just more noise.


