International Football Matches Exploratory Data Analysis with Python

Iniesta and Xavi about to take kick off in a football match between Spain and Portugal

During the Christmas break of 2021, I stumbled upon an interesting football dataset. Football (soccer) is a sport I am passionate about, but I didn’t want to mix business with pleasure. I downloaded the dataset then, but it was not until recently that I decided to ditch the whole business-and-pleasure talk. Who says work cannot be fun? The full project is available here.

The dataset was obtained from Kaggle in January 2022. It contains international football results from 1872 to 2022. The notebook for this project is on my GitHub.

  1. This dataset is only for international men’s football.
  2. Only national team stats are available, no clubs are in the dataset.
  3. No player information, just national teams.

I used Python for this project, with the pandas library for efficient cleaning and manipulation and seaborn for visualization in order to gain insights.

The csv file contains 43,086 rows and 9 columns. The columns are date, home_team, away_team, home_score, away_score, tournament, city, country and neutral.

A couple of questions are to be answered, such as:

  1. Which team(s) scored the most goals in FIFA World Cup 2014?
  2. Country with most world cup wins ever
  3. The tournament with most goals
  4. Best teams in major tournaments
  5. Correlation between home ground and home_win

Other insights will be derived during the course of the analysis.

Firstly, all the necessary libraries were imported.

The csv file was read into a pandas dataframe. A quick check on the first five rows of the dataframe using the head function showed that the first match here was played in 1872. A quick Google search confirmed that this was the first official international football match in history.
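As a rough sketch of this loading step, the snippet below reads a two-row excerpt from an in-memory string instead of the real file. The excerpt and the `results.csv` file name mentioned in the comment are assumptions for illustration; the column names are the nine listed earlier.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt standing in for the downloaded file.
# With the real data this would be something like pd.read_csv("results.csv").
sample_csv = io.StringIO(
    "date,home_team,away_team,home_score,away_score,tournament,city,country,neutral\n"
    "1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False\n"
    "1873-03-08,England,Scotland,4,2,Friendly,London,England,False\n"
)

# parse_dates turns the date column into real timestamps instead of strings
df = pd.read_csv(sample_csv, parse_dates=["date"])

first_rows = df.head()  # first five rows (only two exist in this sample)
print(first_rows)
```

Parsing the dates up front makes later filters by year much simpler.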

First ever football match!

The data is sorted by the date column, from the earliest football matches to the latest available at the time: 1872 to 2021.

The info function was used to check the total number of columns, the datatype of each column as well as the number of rows.

Both home_score and away_score are stored as floats. Goals are whole numbers, so the scores read better as integers, and I converted both columns from float to integer.

The data types of the columns are checked again after the conversion.
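One way the conversion could look is sketched below on a tiny made-up frame. Note the assumption: it uses pandas' nullable `Int64` dtype rather than plain `int`, because a plain integer cast fails while rows with missing scores are still present. The notebook may well have done this differently.

```python
import pandas as pd

# Made-up scores; the last row mimics a match with no result recorded yet
scores = pd.DataFrame({
    "home_score": [0.0, 4.0, None],
    "away_score": [0.0, 2.0, None],
})

# "Int64" (capital I) is the nullable integer dtype, so missing values survive
scores = scores.astype({"home_score": "Int64", "away_score": "Int64"})

print(scores.dtypes)  # check the data types again after the conversion
```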

The dataframe was checked for null/missing values. It can be seen that there are a few missing values. I proceeded to check for the missing values.

By digging into the 5 rows with a missing home_score, I found that all the other missing values are contained in these rows as well: 2 missing values in home_team, 2 in away_team, 5 in away_score and 2 in neutral.

I proceeded to drop these 5 rows from the dataframe and continue the analysis using the cleaned data. The missing values are perhaps due to insufficient information as at the time of preparation of the dataset because these five rows are the most recent entries in the whole dataset.

I ran a quick check for missing values after dropping these rows. The cleaned data now has 43,081 rows.
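The check, inspect and drop sequence can be sketched as follows. The three-row frame is invented for illustration, with one incomplete row playing the role of the five most recent entries.

```python
import pandas as pd

# Invented sample: the last row is incomplete, like the newest dataset entries
matches = pd.DataFrame({
    "home_team": ["Scotland", "England", None],
    "away_team": ["England", "Scotland", None],
    "home_score": [0.0, 4.0, None],
    "away_score": [0.0, 2.0, None],
})

missing_per_column = matches.isnull().sum()  # missing values per column
rows_missing_score = matches[matches["home_score"].isnull()]  # the incomplete rows

# Drop every row that has at least one missing value, then renumber the index
cleaned = matches.dropna().reset_index(drop=True)

print(missing_per_column)
print(len(cleaned))
```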

Creating a copy of the cleaned data.

Using value_counts function to check the top 20 tournaments with most matches.

Top 20 tournaments with most matches played
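A minimal sketch of the counting step, using a handful of invented tournament labels in place of the full column:

```python
import pandas as pd

# Invented labels standing in for the dataset's tournament column
tournaments = pd.Series(
    ["Friendly", "Friendly", "FIFA World Cup", "Friendly",
     "Copa América", "FIFA World Cup"],
    name="tournament",
)

# value_counts sorts by frequency, so head(20) gives the 20 busiest tournaments
top_tournaments = tournaments.value_counts().head(20)
print(top_tournaments)
```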

A function to show the winning team name or return a draw in the case of a drawn match.

Function to determine the winner

A function to show the losing team name or return a draw in the case of a drawn match.

Function to determine the loser

Creating a new column to show the winner and loser using the above function. Checking for the newly added winner column.
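The two helper functions might look roughly like this. The names `get_winner` and `get_loser` and the "Draw" label are assumptions for this sketch, and the sample rows are illustrative only.

```python
import pandas as pd

matches = pd.DataFrame({
    "home_team": ["Scotland", "England", "Brazil"],
    "away_team": ["England", "Scotland", "Germany"],
    "home_score": [0, 4, 1],
    "away_score": [0, 2, 7],
})

def get_winner(row):
    """Return the winning team's name, or "Draw" when the scores are level."""
    if row["home_score"] > row["away_score"]:
        return row["home_team"]
    if row["home_score"] < row["away_score"]:
        return row["away_team"]
    return "Draw"

def get_loser(row):
    """Return the losing team's name, or "Draw" when the scores are level."""
    if row["home_score"] > row["away_score"]:
        return row["away_team"]
    if row["home_score"] < row["away_score"]:
        return row["home_team"]
    return "Draw"

# axis=1 passes one match (row) at a time to each function
matches["winner"] = matches.apply(get_winner, axis=1)
matches["loser"] = matches.apply(get_loser, axis=1)
print(matches[["home_team", "away_team", "winner", "loser"]])
```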

Defining a function to return the scoreline using the home_score and away_score as inputs.

Scoreline function

Using the code to create a new column scoreline which returns the match scoreline.
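A scoreline helper can be as small as the sketch below. The function name `get_scoreline` and the "home-away" string format are assumptions, since the original screenshot is not reproduced here.

```python
import pandas as pd

matches = pd.DataFrame({"home_score": [0, 4, 1], "away_score": [0, 2, 7]})

def get_scoreline(home_score, away_score):
    """Format two scores as a 'home-away' string, for example '4-2'."""
    return f"{home_score}-{away_score}"

# Build the new column one match at a time from the two score columns
matches["scoreline"] = [
    get_scoreline(home, away)
    for home, away in zip(matches["home_score"], matches["away_score"])
]
print(matches)
```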

Calculating the total number of goals scored in the cleaned dataset.

Function for total goals
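Assuming the total is simply home goals plus away goals per match, the calculation could be sketched like this. The column name `total_goals` is an assumption.

```python
import pandas as pd

matches = pd.DataFrame({"home_score": [0, 4, 1], "away_score": [0, 2, 7]})

# Goals per match, stored as a new column for later grouping
matches["total_goals"] = matches["home_score"] + matches["away_score"]

# Goals across every match in the (sample) dataset
goals_in_dataset = int(matches["total_goals"].sum())
print(goals_in_dataset)
```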

Drilling down to the most popular sporting event in the world: the FIFA World Cup.

Total number of goals scored in the world cup since its inception.

Total goals scored in FIFA World Cup history
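Restricting the data to one tournament is a boolean filter on the tournament column. The sketch below uses invented rows and assumes a `total_goals` column already exists.

```python
import pandas as pd

# Invented rows; total_goals is assumed to have been computed beforehand
matches = pd.DataFrame({
    "tournament": ["Friendly", "FIFA World Cup", "FIFA World Cup", "Copa América"],
    "total_goals": [3, 8, 1, 2],
})

# Keep only World Cup matches, then add up their goals
world_cup = matches[matches["tournament"] == "FIFA World Cup"]
world_cup_goals = int(world_cup["total_goals"].sum())
print(world_cup_goals)
```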

Further drilling down to the world cup in 2014.

Checking for the match that produced most goals in the 2014 FIFA world cup.

Checking for the top 10 matches that produced most goals.

Matches with most goals in FIFA World Cup 2014
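To narrow things to a single edition and rank its matches, the year can be taken from the date column and `nlargest` can do the ranking. The rows below are a small illustrative sample, not the full 2014 fixture list.

```python
import pandas as pd

matches = pd.DataFrame({
    "date": ["2010-07-11", "2014-06-12", "2014-06-13",
             "2014-07-08", "2014-07-13", "2014-09-03"],
    "home_team": ["Netherlands", "Brazil", "Spain", "Brazil", "Germany", "Germany"],
    "away_team": ["Spain", "Croatia", "Netherlands", "Germany", "Argentina", "Argentina"],
    "home_score": [0, 3, 1, 1, 1, 2],
    "away_score": [1, 1, 5, 7, 0, 4],
    "tournament": ["FIFA World Cup"] * 5 + ["Friendly"],
})
matches["date"] = pd.to_datetime(matches["date"])
matches["total_goals"] = matches["home_score"] + matches["away_score"]

# World Cup matches played in 2014 only (the 2014 friendly is filtered out)
is_wc_2014 = (matches["tournament"] == "FIFA World Cup") & (matches["date"].dt.year == 2014)
wc_2014 = matches[is_wc_2014]

# Up to ten highest-scoring matches, highest first
top_matches = wc_2014.nlargest(10, "total_goals")
print(top_matches[["home_team", "away_team", "total_goals"]])
```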

A function to return the total number of goals scored by a country as well as details of each match is shown below.

Function to get the total number of goals scored by each country
Total goals scored by each team in the FIFA World Cup 2014
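A per-team helper along these lines would add a team's home goals to its away goals and return the matches it played. The name `goals_by_team` is hypothetical and the sample covers only four matches.

```python
import pandas as pd

# Four illustrative World Cup 2014 matches
wc_2014 = pd.DataFrame({
    "home_team": ["Brazil", "Spain", "Brazil", "Germany"],
    "away_team": ["Croatia", "Netherlands", "Germany", "Argentina"],
    "home_score": [3, 1, 1, 1],
    "away_score": [1, 5, 7, 0],
})

def goals_by_team(df, team):
    """Return (goals scored by team, the matches that team played)."""
    home_goals = df.loc[df["home_team"] == team, "home_score"].sum()
    away_goals = df.loc[df["away_team"] == team, "away_score"].sum()
    team_matches = df[(df["home_team"] == team) | (df["away_team"] == team)]
    return int(home_goals + away_goals), team_matches

germany_goals, germany_matches = goals_by_team(wc_2014, "Germany")
print(germany_goals)
print(germany_matches)
```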

Now to the questions…

1. Country with most goals in FIFA World Cup 2014

A function was written to help show the countries with most goals in FIFA World Cup 2014. Also, seaborn was used to visualize the top 6 countries with most goals in the tournament.

Function to show the most prolific teams in FIFA World Cup 2014
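One possible way to rank every team at once is to group home and away goals separately and add the two results. This is a sketch on four sample matches, and the plotting step is left out; the resulting series could be handed to seaborn's barplot.

```python
import pandas as pd

wc_2014 = pd.DataFrame({
    "home_team": ["Brazil", "Spain", "Brazil", "Germany"],
    "away_team": ["Croatia", "Netherlands", "Germany", "Argentina"],
    "home_score": [3, 1, 1, 1],
    "away_score": [1, 5, 7, 0],
})

home_goals = wc_2014.groupby("home_team")["home_score"].sum()
away_goals = wc_2014.groupby("away_team")["away_score"].sum()

# fill_value=0 keeps teams that appear only at home or only away
goals_per_team = (
    home_goals.add(away_goals, fill_value=0)
    .astype(int)
    .sort_values(ascending=False)
)

top_scorers = goals_per_team.head(6)  # the six most prolific teams
print(top_scorers)
```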

2. Country that has won the most FIFA World Cup matches

The answer to this lies in the winner column created earlier. However, this column also includes entries for drawn games. I created a new dataframe that filters out the drawn games and used the value_counts function to get the countries with the most wins in FIFA World Cup history.
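The filter-then-count idea can be sketched as below, assuming drawn matches are labelled "Draw" in the winner column. The five rows are invented.

```python
import pandas as pd

# Invented World Cup rows with a precomputed winner column
world_cup = pd.DataFrame({
    "winner": ["Brazil", "Netherlands", "Germany", "Germany", "Draw"],
})

# Remove drawn games so "Draw" is not counted as if it were a team
decided = world_cup[world_cup["winner"] != "Draw"]

wins_per_team = decided["winner"].value_counts()
print(wins_per_team)
```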

3. The tournament with most goals

There are lots of tournaments in this dataset. To determine which one has the most goals, the newly created total goals column is used together with the groupby function.
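Summing goals per tournament might look like the following sketch. The rows are invented and `total_goals` is the assumed name of the goals column.

```python
import pandas as pd

matches = pd.DataFrame({
    "tournament": ["Friendly", "Friendly", "Friendly",
                   "FIFA World Cup", "Copa América"],
    "total_goals": [2, 3, 4, 8, 1],
})

# Sum the goals within each tournament, then sort with the highest first
goals_by_tournament = (
    matches.groupby("tournament")["total_goals"]
    .sum()
    .sort_values(ascending=False)
)
print(goals_by_tournament)
```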

4. Best teams in major tournaments

Major tournaments are the apex tournaments held in each continental region, as well as the World Cup. The qualification tournaments for these major tournaments are excluded, and friendlies are not major tournaments. UEFA (Europe) seems to have two tournaments, the UEFA Euro and the UEFA Nations League; the UEFA Nations League is excluded from the major tournaments. Similarly, Africa has a lesser championship alongside the major African Cup of Nations. These lesser tournaments are not included.

There are 6 qualification regions, and the single major tournament for each region is included. The FIFA World Cup is included, as well as the Confederations Cup.

Africa: African Cup of Nations

Europe: UEFA Euro

Asia: AFC Asian Cup

Oceania: Oceania Nations Cup

North and Central America and the Caribbean: Gold Cup

South America: Copa América

World: FIFA World Cup

World: Confederations Cup

Dataframe for major tournaments
Most wins in major tournaments
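A sketch of building the major-tournaments dataframe with `isin` follows. The labels are taken from the list above, and whether they match the dataset's exact spellings (for example "UEFA Euro") is an assumption. The sample rows are invented.

```python
import pandas as pd

# Labels from the list above; exact spellings in the dataset are assumed
MAJOR_TOURNAMENTS = [
    "African Cup of Nations", "UEFA Euro", "AFC Asian Cup",
    "Oceania Nations Cup", "Gold Cup", "Copa América",
    "FIFA World Cup", "Confederations Cup",
]

matches = pd.DataFrame({
    "tournament": ["FIFA World Cup", "UEFA Nations League", "African Cup of Nations",
                   "Friendly", "Copa América", "UEFA Euro"],
    "winner": ["Germany", "Portugal", "Nigeria", "Brazil", "Draw", "Germany"],
})

# Keep only matches played in one of the major tournaments
major = matches[matches["tournament"].isin(MAJOR_TOURNAMENTS)]

# Ignore drawn games, then count wins per team and keep the top 13
decided = major[major["winner"] != "Draw"]
most_wins = decided["winner"].value_counts().head(13)
print(most_wins)
```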

PS: I checked for the top 13 most wins as against my initial plan to only show the top 10 teams. My reason? I want to always see Nigeria in the list of best teams. I love Naija.

5. Compare home wins and away wins

We expect home wins to be more than away wins due to the home advantage in football matches. By how much does this home advantage influence wins? We will visualize this using seaborn.

A whopping 48.64% of matches played end in home wins, compared to only 28.32% that end in away wins.
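The shares of home wins, away wins and draws can be computed with `value_counts(normalize=True)`. The sketch below uses four invented matches, so its numbers will not match the figures above. It also counts every match, since it is not stated here whether neutral-venue games were filtered out using the neutral column.

```python
import pandas as pd

matches = pd.DataFrame({
    "home_score": [2, 1, 0, 3],
    "away_score": [1, 1, 2, 0],
})

def get_outcome(row):
    """Label a match as 'Home win', 'Away win' or 'Draw'."""
    if row["home_score"] > row["away_score"]:
        return "Home win"
    if row["home_score"] < row["away_score"]:
        return "Away win"
    return "Draw"

matches["outcome"] = matches.apply(get_outcome, axis=1)

# normalize=True returns fractions; multiply by 100 for percentages
outcome_share = matches["outcome"].value_counts(normalize=True) * 100
print(outcome_share.round(2))
```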

There’s so much information contained in this dataset, and only by digging into it can we find it. Python is a fantastic tool for this! I am looking forward to making big predictions using Python and becoming a great pundit and football analyst using data. I hope you enjoyed this project as much as I did. I doubt that. I wish you more wins at home and away from home in all your endeavors.

The complete Python code for this analysis is available on my GitHub. The notebook has some insights not covered here. Thanks for your time.



Olakunle Yusuf

I am a data analyst with strong analytical skills. I recently earned Google Data Analytics Professional Certificate. SQL | R | Python | Tableau