What predicting a soccer match teaches us about real-life AI

And how to make good predictions work in enterprise settings

Jun 23, 2026

Building a reliable system is really, really hard. Image generated with Leonardo AI

If you want to build a machine learning model to predict the outcome of a soccer match, you will quickly discover that the algorithm you choose—whether it is a random forest, a support vector machine, or a deep neural network—is the least important part of the process.

The algorithm is a commodity. The real challenge, the part that determines whether your model is a predictive powerhouse or a random number generator, is deciding what data to feed it.

This is the art and science of feature engineering. And as I detail in my new book, *Soccer Analytics with Python* (published this week by O’Reilly Media), it is the exact same challenge that enterprise AI teams face when trying to automate complex business processes.

Whether you are trying to predict a goal at the World Cup or automate a Solvency II regulatory report, the fundamental problem is the same: how do you separate the signal from the noise?

The Illusion of “More Data”

There is a persistent myth in the AI industry that more data automatically leads to better models.

In soccer, we have access to an overwhelming amount of data. We can track the x,y coordinates of every player on the pitch 25 times per second. We know how many passes a midfielder completed, how many tackles a defender won, and the exact velocity of every shot.

But if you feed all of this raw data into a model, it will likely fail. It will find spurious correlations—perhaps the team wearing blue won more often on Tuesdays when it was raining—and it will overfit to the noise.

The same is true in the enterprise. An insurance company has petabytes of historical claims data, customer demographics, and macroeconomic indicators. But throwing all of that data into a large language model will not magically produce a reliable underwriting agent.
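
The soccer version of this failure is easy to reproduce with synthetic data. The sketch below is an illustration under stated assumptions, not an analysis of real matches: it invents 50 training matches, 200 features that are pure noise, and coin-flip outcomes. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n_train, n_test, n_features = 50, 1000, 200

# Hypothetical setup: every feature is random noise, and match outcomes
# (+1 = win, -1 = loss) are coin flips unrelated to the features.
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.choice([-1.0, 1.0], size=n_train)
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.choice([-1.0, 1.0], size=n_test)

# Least-squares fit. With more features than matches, the model can
# reproduce the training labels exactly, even though there is no signal.
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def accuracy(X, y, w):
    """Share of outcomes whose sign the linear model predicts correctly."""
    return float(np.mean(np.sign(X @ w) == y))


train_acc = accuracy(X_train, y_train, weights)
test_acc = accuracy(X_test, y_test, weights)

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The model should look perfect on the matches it has seen and perform at roughly chance level on new ones. That is what memorizing noise looks like.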

Engineering the Signal

Feature engineering is the process of transforming raw data into meaningful signals that a model can actually use.

In soccer, a raw statistic like “total distance run” is mostly noise. A player might run 12 kilometers in a match, but if they are constantly out of position, that running is detrimental to the team.

A much stronger signal is “packing”—a metric that measures how many opponents a player bypasses with a forward pass or dribble. Packing requires complex spatial analysis to calculate, but it is highly predictive of a team’s offensive success. It is an engineered feature that captures the *context* of the action, not just the action itself.
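
To give a feel for what such a feature involves, here is a deliberately simplified sketch. It is not the official packing calculation: it assumes the attacking team plays toward increasing x, looks only at forward passes, and counts an opponent as bypassed when they sit between the start and end of the pass along the length of the pitch. The `Position` class and `packing_score` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    x: float  # metres along the pitch, increasing toward the opponent's goal
    y: float  # metres across the pitch (unused in this simplified version)


def packing_score(pass_start: Position, pass_end: Position,
                  opponents: list[Position]) -> int:
    """Count opponents bypassed by a forward pass (simplified assumption).

    An opponent counts as bypassed if they stood between the ball and
    their own goal before the pass and are behind the ball after it.
    """
    if pass_end.x <= pass_start.x:
        return 0  # backward or square passes bypass nobody in this sketch
    return sum(1 for opp in opponents if pass_start.x < opp.x < pass_end.x)


# Made-up positions for four opposing players.
opponents = [Position(30, 20), Position(45, 34), Position(52, 10), Position(80, 30)]
score = packing_score(Position(40, 30), Position(60, 25), opponents)
print(score)  # 2: the opponents at x=45 and x=52 are taken out of the game
```

Even this toy version needs the position of every opponent at the moment of the pass, which is why the feature cannot be read off a simple event log.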

In the enterprise, feature engineering is equally critical. A raw text field containing a customer’s email is noise. An engineered feature that extracts the specific regulatory clause the customer is referencing, and maps it to an internal compliance taxonomy, is a signal.
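
A minimal sketch of that kind of extraction might look like the following. Everything in it is an assumption made for illustration: the taxonomy entries are invented rather than taken from any real regulation or compliance system, and a production pipeline would need far more robust parsing than a single regular expression.

```python
import re
from typing import Optional

# Hypothetical internal taxonomy: article number -> compliance category.
# The mapping is invented for illustration only.
CLAUSE_TAXONOMY = {
    "35": "supervisory_reporting",
    "45": "own_risk_assessment",
    "51": "public_disclosure",
}

# Matches references such as "Article 45", "Art. 45", or "art 45".
ARTICLE_PATTERN = re.compile(r"\bArt(?:icle|\.)?\s*(\d+)", re.IGNORECASE)


def extract_clause_feature(email_body: str) -> Optional[str]:
    """Turn free text into a categorical feature, or None if no clause is cited."""
    match = ARTICLE_PATTERN.search(email_body)
    if match is None:
        return None
    return CLAUSE_TAXONOMY.get(match.group(1), "unmapped_clause")


print(extract_clause_feature("Could you confirm how Article 45 applies to our group?"))
# own_risk_assessment
print(extract_clause_feature("Please see Art. 99 of the directive."))
# unmapped_clause
print(extract_clause_feature("Just checking in on my policy renewal."))
# None
```

The raw email is unchanged. What changes is that the model now receives one clean, domain-meaningful category instead of a wall of text.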

The Domain Expertise Advantage

The most important lesson I learned while writing *Soccer Analytics with Python* is that you cannot engineer good features without deep domain expertise.

A brilliant data scientist who knows nothing about soccer will build a terrible predictive model. They will not know that a pass backward to the goalkeeper is sometimes a brilliant tactical move to reset the press, rather than a sign of offensive failure.

Similarly, a brilliant AI engineer who knows nothing about actuarial science will build a terrible regulatory reporting agent. They will not understand the nuanced difference between two seemingly identical financial metrics, or the specific regulatory context that dictates how a certain risk must be calculated.

This is why the most successful enterprise AI deployments are not built by isolated data science teams. They are built by cross-functional teams where domain experts—the actuaries, the compliance officers, the underwriters—work hand-in-hand with the engineers to define the features that actually matter.

The Limits of Machine Learning

Even with perfect feature engineering, machine learning has its limits. As Alex Marin Felices points out in his analysis of football attacking performance, [machine learning models often struggle to represent the interactive and dynamic nature of play](https://thexgfootballclub.substack.com/p/the-promise-and-limits-of-machine) [1]. They can identify patterns, but they cannot always capture the complex, multi-agent interactions that define a match.

This limitation is why causal inference is becoming increasingly important. To truly understand the game—or the business—we must move beyond correlative models and build systems that can answer “what if” questions.

The Causal Dimension of Feature Engineering

Feature engineering is not just about finding better correlations. At its most sophisticated, it is about encoding causal knowledge into the model.

In soccer, a truly powerful feature is not just a statistical summary of past events; it is a variable that captures a causal mechanism. The “packing” metric is powerful precisely because it measures a causal driver of attacking success: bypassing defenders. It is not just correlated with goals; it is causally related to the creation of goal-scoring opportunities.

In the enterprise, the same principle applies. The most powerful features are not the ones with the highest correlation to the target variable in the training data. They are the ones that capture the true causal mechanisms driving the outcome. Identifying these features requires deep domain expertise and a willingness to go beyond the data to understand the underlying business logic.

This is why at Wangari, we always begin a new engagement by working closely with domain experts to map out the causal structure of the problem before we write a single line of code.
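
The gap between a highly correlated feature and a causal one can be shown with a small simulation. The data-generating process below is entirely made up: a hidden `quality` variable drives both a `proxy_feature` and the outcome, while a separate `causal_feature` drives the outcome directly. None of these names refer to real soccer or business variables.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 10_000

# Hypothetical data-generating process (an assumption for illustration):
# hidden quality drives both the proxy feature and the outcome.
quality = rng.normal(size=n)
proxy_feature = quality + 0.3 * rng.normal(size=n)  # reflects quality, causes nothing
causal_feature = rng.normal(size=n)                 # directly drives the outcome
noise = rng.normal(size=n)


def outcome(quality, proxy, causal, noise):
    """Structural equation: the proxy argument is deliberately unused."""
    return 2.0 * quality + 1.0 * causal + noise


y = outcome(quality, proxy_feature, causal_feature, noise)

corr_proxy = np.corrcoef(proxy_feature, y)[0, 1]
corr_causal = np.corrcoef(causal_feature, y)[0, 1]

# "What if" question: shift each feature by one unit, hold everything else fixed.
effect_proxy = np.mean(outcome(quality, proxy_feature + 1, causal_feature, noise) - y)
effect_causal = np.mean(outcome(quality, proxy_feature, causal_feature + 1, noise) - y)

print(f"correlation with outcome: proxy={corr_proxy:.2f}, causal={corr_causal:.2f}")
print(f"effect of +1 intervention: proxy={effect_proxy:.2f}, causal={effect_causal:.2f}")
```

In this toy world the proxy has the higher correlation with the outcome, yet changing it does nothing, while changing the causal feature moves the outcome by one unit. A model selected purely on correlation would favor the wrong variable.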

Which Brings Me to… The Book Launch

This week, *Soccer Analytics with Machine Learning* officially hits the shelves.

We timed the release to coincide with the start of the World Cup, a time when the entire world is focused on the beautiful game. But my hope is that the book reaches an audience far beyond sports fans.

The book uses soccer as a vehicle to teach the fundamental principles of machine learning—logistic regression, simulation, deep learning, and yes, feature engineering. It is designed to bridge the gap between academic theory and practical application, proving that you can learn complex data science concepts without getting bogged down in abstract mathematics.

If you are struggling to understand how machine learning actually works in the real world, I invite you to pick up a copy. You might just find that the lessons learned on the pitch are exactly what you need in the boardroom.

Meanwhile, at Wangari

Our book is finally out! You can find it on the O’Reilly Learning platform and everywhere books are sold, as an e-book for now and in print from mid-July.

This happened just in time, as the World Cup gets into full swing. An unintended but fun side effect: one of my pieces, an exclusive run for Towards Data Science, actually got picked up by the Daily Mail. I’m used to publishing on fairly niche topics, so it was fun to see my work in mainstream media!

Reads of the Week

The Promise and Limits of Machine Learning in Football Attacking Analysis by Alex Marin Felices: A critical look at why machine learning models often struggle to capture the dynamic, interactive nature of soccer. Felices argues that while ML is great for identifying correlations, it falls short when trying to model the complex, multi-agent decision-making that defines a successful attack.

Opinion: Inside the World Cup’s AI offside revolution by Clemente Lisi: An exploration of FIFA’s ambitious plan to use AI-enabled 3D avatars for the 2026 World Cup. Lisi discusses the tension between technological precision and the human element of refereeing, a debate that mirrors discussions about AI governance in the enterprise.

Hidden Technical Debt in Agentic Systems by Miguel Otero Pedrido: A reminder that the true complexity of AI automation lies not in the model, but in the surrounding infrastructure. Pedrido’s analysis is a sobering look at why the “glue code” holding an AI system together is often its weakest link.
