Funny distributions: Professional soccer player size, footedness, and birthdays
Using athlete data to trial workflows and how/when to use Python or SQL or QGIS… or ChatGPT.
One in a series of projects highlighting my progress as a self-taught programmer.
I began teaching myself Python, SQL, and GIS in spring 2023 — starting from zero. I therefore welcome feedback on these projects and review for errors. And I’d be interested in taking a crack at your own data and geospatial questions too. Please get in touch.
29 September 2023
Background
The purpose of this project was to run some reps on a simple dataset to improve my sense of when to employ which of the tools I’ve been learning the last months. But I chose fun data: A list of football/soccer players from the top European leagues that included attributes like height, weight, footedness, and birthday. What could I find out about this elite athlete population with a few SQL queries—or should it be Python scripts? And should I merge and display findings with pandas, matplotlib, numpy, and seaborn libraries—or could QGIS handle the data machinations and make an interesting map?
As New York Times soccer reporter Rory Smith explains in his newish book, Expected Goals, world soccer has trailed baseball and other sports on the analytics front, and huge datasets of metrics and in-game events are hard to come by. But a quick search landed me a few files from a 2019 paper by Luca Pappalardo et al., “A public data set of spatio-temporal match events in soccer competitions.” I decided to leave the spatio-temporal data for another day, and instead settled on three questions for the file covering 3,603 players from “season 2017/2018 of five national soccer competitions in Europe: Spanish first division, Italian first division, English first division, German first division, French first division”:
Are today’s soccer stars shorter than yesterday’s?
Do some countries produce a higher proportion of left-footed players than others?
Do I see evidence of the “relative age effect,” meaning, as FiveThirtyEight explains, “to be a professional athlete in most sports, it helps to be born at the right time of year”?
Project outputs
This surprised me. It’s not a multi-cohort historical study of any kind (so not really about yesterday and today’s stars), but in this 2017-18 season snapshot, I saw a clear trend: Older players were generally taller than younger players.
According to this dataset, players about 37 years old averaged roughly 6’ in height, while their 18- and 19-year-old colleagues were more like 5’10” to 5’11”. We’re only talking a 4 cm difference, but the trend is there. (FYI, I pared back the raw dataset in one key respect: No goalkeepers, since they tend to be big guys. Also, I didn’t scan for outliers, but I excluded absurdly high and low values from my queries as well as birth years with fewer than 10 players.)
Again, since practicing SQL was my goal, I did not put much “why?” research into this finding, but I did have one thought: Are defenders responsible? Teams prize tall defenders and especially, perhaps, experienced tall defenders. And maybe defenders tend to have longer careers than midfielders and forwards, whose vital speed and agility fades by 35. In a quick check, I indeed found that defenders were, on average, a touch taller and older than midfielders and forwards. (Defenders and midfielders were roughly equally represented in the dataset, both more common than forwards.)
With knowledge of the height trend, I was less surprised to see similar results with weight. Older players tended to be heavier:
And what about footedness? Though foot preference is apparently less studied than hand dominance, I found one paper that concluded “about 12.1% of people were left-footed” worldwide. Other studies put the number higher, and in my dataset, the figure was almost exactly 25 percent. But was that true on a country-by-country basis? I probed footedness according to player country of birth:
I found the variation away from the dataset’s average curious, even though I ignored countries with fewer than 10 players across the five top European leagues. There were 11 Moroccans and Greeks, of whom 54.5 and 45.5 percent, respectively, were left-footed. Only one of the 12 U.S.-born players was left-footed (New Jersey native Giuseppe Rossi, of Italian immigrant parents, who represented Italy internationally). Similar numbers applied to DRC, so perhaps my count floor should have been higher.
For my final question, I wanted to probe a phenomenon first described in mainstream media by, I believe, Malcolm Gladwell in Outliers, namely that players at the older end of their youth sports cohort are overrepresented in professional sports. For example, as FiveThirtyEight summarized, one study found “58 percent of National Hockey League players were born in the first six months of the year, when the selection year in youth matches begins, and just 42 percent in the last six months.”
Now I did not research the youth soccer systems of each of the 115 countries represented in the dataset, but if most of them—if most of the countries heavily represented in top European leagues—tend to group kids by birth year (e.g. a kid born on January 1 of year x plays in the same league as a kid born on December 31 of year x) then the “relative age effect” indeed makes an appearance:
Just like the NHL, it looks like birthdays earlier in the year are overrepresented in top flight soccer, at least for the 2017-18 season. (The most common birthday was January 1, which I suspected meant the date had been used as a placeholder in some cases—so I dropped it from the bar chart.)
Techniques
My real question for this project was, when examining a basic table of data, am I better off probing with SQL or Python, and if/when I want to visualize my findings, do I need to change over to another tool, and if so, how do I do so efficiently?
The answer was clear: Most of my projects are best kicked off with SQL queries in DB Browser—but only after I have a Python-cleaned file I know something about. And once I hit upon a query that produces results I find interesting, I generally need Python again and/or QGIS to produce charts and maps.
For this project, my process looked something like this:
(1) Data cleaning and first-look
Python to read and flatten my raw data (it was a JSON file) into a pandas dataframe, with which I printed column titles, several rows (df.head()), and a few unique values lists (.unique()) to get a sense of the data.
Python to ditch fields I didn’t need (e.g., alternate country code formats) and replace periods with underscores in column titles (knowing periods cause problems in SQL).
Python to export .csv and .db versions of my leaner, cleaner dataset.
(2) Data exploration
SQL in DB Browser to probe the .db data file with various queries.
(3) Data processing
Python, using pandas and filtering, grouping, and aggregating methods, to replicate SQL queries and create dataframes specific to the questions I examined, like a table of average player heights and weights by birth year—all in anticipation of using Python plotting libraries to present findings.
Python, embedding SQL queries into the script with the sqlite3 library, to replicate SQL queries in anticipation of using Python plotting libraries to present findings.
ChatGPT to help ensure I accurately converted SQL queries into Python pandas and sqlite3 queries.
SQL in DB Browser to export .csv files of query results bound for QGIS, like a table indicating number of left- and right-footed players in each country.
Until I’m more fluent in subqueries and other techniques, SQL is not the easiest place for me to add certain calculated columns, such as a column converting number left-footed to percent left-footed. In these situations I have variously (i) manually added columns to a csv in Excel, (ii) added columns useful for calculations in Python in my previous cleaning step, or (iii) waited till my data was in QGIS, where column editing and the expression editor allow me to make calculations I need.
(4) Data presentation
Python, using numpy, matplotlib, and seaborn libraries, to plot charts and add best-fit lines.
QGIS to present data spatially.
A frequent challenge for me here is attaching spatial information (e.g. point coordinates, WKT) to datasets containing only place names. For instance, my footedness csv was organized by country, but QGIS has no idea where “England” is or how it is shaped unless my file includes location and geometry vector information. This is easily resolved with a join between the csv attribute table layer and a map vector layer, but the technique relies on country names matching exactly throughout both sources, which I can rarely count on. Is it Bosnia and Herzegovina or Bosnia-Herzegovina? Is it Dem. Rep. of Congo, Congo-Kinshasa, or Democratic Republic of the Congo? I’ve used ChatGPT, Python scripts, and my own human eyes and typing to standardize names, but sometimes I get lucky: Many layers I work with include ISO-standardized 3-letter codes. BIH and COD.
But that wasn’t enough here! “England,” “Scotland,” “Wales,” and “Northern Ireland” may be countries in the world of soccer, but most global-scale datasets, vector and otherwise, aggregate these as the “United Kingdom.” To resolve, I applied the join process to country vector files of multiple administrative levels, then, after some column deletion and renaming (Refactor Fields), I used Merge Vector Layers to get England, Scotland, Wales, and Northern Ireland onto the same geospatial layer as Bosnia-Herzegovina, Democratic Republic of the Congo, etc. with all the footedness data attached.
Data sources
The official citations for the paper and data from Pappalardo et al. is:
Pappalardo et al., (2019) A public data set of spatio-temporal match events in soccer competitions, Nature Scientific Data 6:236, https://www.nature.com/articles/s41597-019-0247-7
Pappalardo et al. (2019)PlayeRank: Data-driven Performance Evaluation and Player Ranking in Soccer via a Machine Learning Approach. ACM Transactions on Intellingent Systems and Technologies (TIST) 10, 5, Article 59 (September 2019), 27 pages. DOI: https://doi.org/10.1145/3343172
My maps came from GADM.