What Soccer Taught Me About Machine Learning

Why the world's most popular game is the perfect laboratory for understanding predictive modeling and data science.

May 26, 2026

Soccer and machine learning have more in common than you might think. Image generated with Leonardo AI

If you want to understand the complexities of machine learning, you could start by studying linear algebra, calculus, and probability theory. You could immerse yourself in academic papers on gradient descent and backpropagation.

Or, you could watch a soccer match.

At first glance, the chaotic, fluid nature of soccer seems entirely disconnected from the structured world of data science. But beneath the surface of every pass, tackle, and shot on goal lies a rich tapestry of data waiting to be analyzed. In fact, the challenges inherent in modeling a soccer match closely mirror the challenges of building robust machine learning systems for the enterprise.

This realization is what drove me to co-author my upcoming book, Soccer Analytics with Python (O’Reilly Media). The goal was not just to write a book for sports fans, but to use the universal language of soccer to demystify machine learning concepts that often feel abstract and inaccessible.

Frankly, it was also just fun to write as a way to combine my soccer-playing teenage years with my code-crunching twenties. But hey, here we are, old, wise and having fun with soccer analytics.

The Beautiful Game as a Data Problem

Consider the fundamental problem of predicting a match outcome. It is not a simple deterministic equation. It is a highly probabilistic scenario influenced by dozens of interacting variables: player form, tactical formations, weather conditions, historical matchups, and the sheer unpredictability of human behavior.
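To make "probabilistic, not deterministic" concrete, here is a minimal sketch that simulates one fixture many times. It assumes each team's goals follow an independent Poisson distribution, a common simplification rather than the method used in the book, and the scoring rates (1.6 and 1.1 goals per match) are made-up numbers for illustration.

```python
import math
import random


def sample_goals(rate, rng):
    """Draw a goal count from a Poisson distribution (Knuth's method)."""
    threshold = math.exp(-rate)
    goals, product = 0, rng.random()
    while product > threshold:
        goals += 1
        product *= rng.random()
    return goals


def simulate_match(home_rate, away_rate, n_sims=10_000, seed=42):
    """Estimate outcome probabilities by replaying the match n_sims times."""
    rng = random.Random(seed)
    counts = {"home": 0, "draw": 0, "away": 0}
    for _ in range(n_sims):
        home = sample_goals(home_rate, rng)
        away = sample_goals(away_rate, rng)
        if home > away:
            counts["home"] += 1
        elif home < away:
            counts["away"] += 1
        else:
            counts["draw"] += 1
    return {outcome: n / n_sims for outcome, n in counts.items()}


# Hypothetical scoring rates: the home side averages 1.6 goals, the away side 1.1
probs = simulate_match(1.6, 1.1)
print(probs)
```

Even with a clear favorite, a meaningful share of the simulated matches end in a draw or an upset, which is why the output is a set of probabilities rather than a single predicted result.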

When we attempt to model this, we encounter the exact same issues that plague enterprise data scientists.

We must deal with feature engineering—deciding which variables actually matter. Is a team’s recent possession percentage more predictive than their historical expected goals (xG)? We must grapple with the bias-variance tradeoff, ensuring our model is complex enough to capture the nuances of the game without overfitting to the noise of a single anomalous match.

We must also confront the limitations of purely correlative models. A model might find a strong correlation between a specific player wearing yellow boots and their team winning. But without causal reasoning, the model cannot distinguish between a meaningless coincidence and a true driver of performance. As Alex Marin Felices points out, while machine learning models can identify correlations between performance indicators and success, they often struggle to represent the interactive and dynamic nature of attacking play.
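The yellow-boots problem is easy to reproduce. The sketch below is a toy simulation under stated assumptions: match results are coin flips, and the 200 "traits" are random yes/no attributes with no connection to the result. The numbers of matches and traits are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)
N_MATCHES = 30   # a short run of fixtures
N_TRAITS = 200   # hypothetical irrelevant traits, such as boot colour


def coin_flips(n):
    """Return n independent True/False values with equal probability."""
    return [rng.random() < 0.5 for _ in range(n)]


def agreement(trait, wins):
    """Fraction of matches in which the trait and a win coincide."""
    return sum(t == w for t, w in zip(trait, wins)) / len(wins)


wins = coin_flips(N_MATCHES)
traits = [coin_flips(N_MATCHES) for _ in range(N_TRAITS)]

# Search every meaningless trait and keep the one that "predicts" wins best
best_in_sample = max(agreement(t, wins) for t in traits)

# On new matches the trait is independent of the result again
holdout = agreement(coin_flips(N_MATCHES), coin_flips(N_MATCHES))

print(f"Best trait on past matches: {best_in_sample:.0%}")
print(f"A trait on fresh matches:   {holdout:.0%}")
```

Search enough meaningless variables over a small sample and one of them will look predictive. A purely correlative model has no way to tell that trait apart from a real driver of performance.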

From the Pitch to the Boardroom

The lessons learned from analyzing soccer data translate directly to the corporate boardroom, where executives are evaluating nascent machine learning and AI projects.

In the book, we explore techniques like logistic regression, random forests, and deep learning, applying them to real-world soccer datasets. We build models to predict match outcomes, evaluate player performance, and even test betting strategies.

But the underlying principles are universal. The same random forest algorithm used to predict whether a striker will score from a specific location on the pitch can be used by an insurance company to predict the likelihood of a claim. The same simulation techniques used to model different tactical scenarios can be used by a financial institution to stress-test their portfolio against market shocks.

By grounding these concepts in a domain that is intuitive and engaging, we can strip away the intimidating jargon and focus on the core mechanics of how machine learning actually works.

The Importance of Context

Perhaps the most important lesson soccer teaches us about data science is the critical importance of context.

In soccer, a raw statistic like “total passes completed” is almost meaningless without context. Were those passes progressive, breaking through the opponent’s defensive lines, or were they safe, lateral passes between defenders?
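A small sketch shows how the same pass count can hide very different behavior. The definition used here (a pass is progressive if it moves the ball at least 10 metres toward the opponent's goal) is an assumption for illustration, since data providers define the term differently. The `Pass` class and both teams' passes are invented.

```python
from dataclasses import dataclass

PROGRESSIVE_GAIN = 10.0  # hypothetical threshold in metres; definitions vary


@dataclass
class Pass:
    start_x: float  # metres from own goal line where the pass starts
    end_x: float    # metres from own goal line where the pass ends


def is_progressive(p, min_gain=PROGRESSIVE_GAIN):
    """True if the pass moves the ball at least min_gain metres upfield."""
    return (p.end_x - p.start_x) >= min_gain


# Invented data: both teams complete four passes
team_a = [Pass(30, 32), Pass(32, 29), Pass(29, 31), Pass(31, 30)]
team_b = [Pass(30, 45), Pass(45, 44), Pass(44, 62), Pass(62, 80)]

for name, passes in (("Team A", team_a), ("Team B", team_b)):
    progressive = sum(is_progressive(p) for p in passes)
    print(f"{name}: {len(passes)} passes, {progressive} progressive")
```

Both teams post the same raw total, yet only one of them is actually advancing the ball.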

Similarly, in enterprise AI, data without context is a liability. A model trained on historical financial data might identify a pattern, but without understanding the underlying economic context—the regulatory changes, the market dynamics, the causal relationships—that pattern is likely to be misleading.

This is why at Wangari, we emphasize causal AI. We believe that true intelligence requires understanding the why behind the data, not just the what. Whether you are analyzing a soccer match or automating complex regulatory reporting, context is the difference between a model that merely describes the past and a system that can reliably navigate the future.

The Future of Sports Analytics

The integration of AI into sports is accelerating rapidly. For the 2026 World Cup, FIFA plans to create AI-enabled 3D avatars of every player to ensure precise player identification and tracking for semi-automated offside decisions. This level of data capture represents a massive leap forward, but it also highlights the tension between technological precision and human judgment.

As decisions become more exact, they can also feel more arbitrary to spectators. If an attacker is ruled offside because a 3D scan shows a shoulder fractionally further forward than previously assumed, it raises questions about fairness and the role of technology in the game. This mirrors the challenges we face in enterprise AI, where highly accurate models can sometimes produce decisions that are difficult for humans to interpret or trust.

The Evolution of Expected Goals

One of the most fascinating developments in soccer analytics is the evolution of the “Expected Goals” (xG) metric. Early xG models were relatively simple, relying primarily on the distance and angle of the shot relative to the goal.
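As a rough sketch of what such an early model looks like, the code below computes the two features for a shot and passes them through a logistic function. The coordinate convention and the coefficients are assumptions chosen for illustration; they are not fitted to real data and are not taken from any published xG model.

```python
import math

GOAL_WIDTH = 7.32  # metres between the posts


def shot_distance(x, y):
    """Distance from the shot to the centre of the goal.

    x is the distance from the goal line and y is the sideways offset
    from the centre of the goal, both in metres.
    """
    return math.hypot(x, y)


def shot_angle(x, y):
    """Angle in radians between the lines from the shot to each post."""
    return math.atan2(GOAL_WIDTH * x, x**2 + y**2 - (GOAL_WIDTH / 2) ** 2)


def expected_goals(x, y, intercept=-1.0, w_distance=-0.1, w_angle=1.2):
    """Logistic model on distance and angle (illustrative coefficients)."""
    logit = intercept + w_distance * shot_distance(x, y) + w_angle * shot_angle(x, y)
    return 1.0 / (1.0 + math.exp(-logit))


# Three hypothetical shots: central and close, central and far, close but wide
for label, (x, y) in {
    "central, 11 m out": (11.0, 0.0),
    "central, 25 m out": (25.0, 0.0),
    "wide, 11 m out": (11.0, 15.0),
}.items():
    print(f"{label}: xG = {expected_goals(x, y):.2f}")
```

In practice the coefficients would be estimated from thousands of historical shots, for example with logistic regression. The point here is only that two geometric features already rank shots in a sensible order.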

Today, state-of-the-art xG models are vastly more sophisticated. They incorporate the positions of all defenders and the goalkeeper, the velocity of the pass preceding the shot, and even the specific body part used to strike the ball. This evolution perfectly illustrates the concept of feature engineering—the continuous process of refining the inputs to a model to capture more of the underlying reality.

In the enterprise, we see a similar evolution. Early predictive models of, say, customer conversion relied on simple demographic data. Today, advanced models incorporate behavioral data, network graphs, and unstructured text analysis. The goal is always the same: to move from a crude approximation of reality to a high-fidelity representation.

Bridging the Gap

Writing Soccer Analytics with Python has been a fascinating exercise in translation. It has reinforced my belief that the most complex technical concepts can be made accessible when framed through the right lens.

The book is designed for anyone who wants to develop a solid foundation in machine learning, whether you are a student, an analyst, or simply a fan of the game. It bridges the gap between academic principles and practical applications, proving that you don’t need a PhD in particle physics to understand how to build predictive models.

The beautiful game is more than just a sport. It is a masterclass in probability, strategy, and the power of data.

Meanwhile, at Wangari

While I have been busy writing about soccer analytics, the core focus at Wangari remains on solving the hardest data challenges in the enterprise.

If you are a technical leader looking to bridge the gap between AI prototypes and production systems, my upcoming course is designed for you.

From Demo to Production: Operationalize an Enterprise-Grade Agentic AI Reporting System launches on June 9th. Over 6 weeks, we will cover the orchestration, evaluation, and governance frameworks necessary to build reliable AI systems.

Enrollment is open now at GenAI Academy.

And if you are interested in exploring machine learning through the lens of soccer, my new book, Soccer Analytics with Python, will be published by O’Reilly Media in late June, just in time for the World Cup.

Reads of the Week

The Promise and Limits of Machine Learning in Football Attacking Analysis by Alex Marin Felices: A critical review of how machine learning is applied to analyze attacking performance and the challenges of representing dynamic play. Felices rightly points out that while models excel at finding correlations, they often fail to capture the complex, multi-agent interactions that define a successful attack.
Opinion: Inside the World Cup’s AI offside revolution by Clemente Lisi: An insightful look at FIFA’s plan to use AI-enabled 3D avatars for the 2026 World Cup and the implications for the game. Lisi explores the tension between technological precision and the human element of refereeing, a debate that mirrors discussions about AI governance in the enterprise.
Hidden Technical Debt in Agentic Systems by Miguel Otero Pedrido: A reminder that whether in sports analytics or enterprise AI, the model is just a small part of the overall system complexity. Pedrido’s analysis of the infrastructure required to support autonomous agents is a must-read for anyone moving beyond simple API calls.
