Premier league season 2022-23 match data analysis#
Author: Chenzhang Zhao chenzz1@uci.edu
Course Project, UC Irvine, Math 10, S23
Introduction#
Introduce your project here. Maybe 3 sentences.
As a soccer fan who has been watching English premier league for over ten years, I am interested in discovering the correlation between match statitics and the win/loss of a team. By studying the data gathered from previous matches, I am also hoping to analyze the style of some particular teams and predict their future performances. In my project, I delve into the match statitics for premier league in the past season (2022-23) to demonstrate the overall team performance and pay attention to details of a few particular teams to analyze their records throughout the whole season. By implementing a number of methods learned in Math 10 and some outside resources, I try to elaborate by process step by step and introduce visualization skills to better show the outcome.
Main project#
You can either have all one section or divide into multiple sections. To make new sections, use ##
in a markdown cell. Double-click this cell for an example of using ##
In the first part I would import all the python libraries needed.
import pandas as pd
import altair as alt
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
Files reading#
We first want to read each file and take a look at what they look like.
df_home = pd.read_csv("2023_home_teams_stats.csv")
df_matchday = pd.read_csv("2023_matchday_results.csv")
df_standing = pd.read_csv("2023_PL_standings.csv")
df_home.sample(5)
fixture id | Home team id | Home team name | Shots on Goal | Shots off Goal | Total Shots | Blocked Shots | Shots insidebox | Shots outsidebox | Fouls | Corner Kicks | Offsides | Ball Possession | Yellow Cards | Red Cards | Goalkeeper Saves | Total passes | Passes accurate | Passes % | expected_goals | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
183 | 868129 | 49 | Chelsea | 2.0 | 3.0 | 8 | 3.0 | 6.0 | 2.0 | 9 | 3.0 | 2.0 | 48% | 2.0 | NaN | 2.0 | 553 | 471 | 85% | NaN |
286 | 868232 | 50 | Manchester City | 8.0 | 3.0 | 17 | 6.0 | 14.0 | 3.0 | 9 | 7.0 | 3.0 | 68% | 1.0 | NaN | NaN | 754 | 692 | 92% | 2.96 |
1 | 867947 | 36 | Fulham | 3.0 | 2.0 | 9 | 4.0 | 7.0 | 2.0 | 7 | 4.0 | 4.0 | 33% | 2.0 | NaN | 1.0 | 294 | 181 | 62% | NaN |
165 | 868111 | 45 | Everton | 6.0 | 4.0 | 12 | 2.0 | 7.0 | 5.0 | 9 | 2.0 | 1.0 | 58% | 1.0 | NaN | 2.0 | 505 | 427 | 85% | NaN |
151 | 868097 | 51 | Brighton | 2.0 | 1.0 | 7 | 4.0 | 4.0 | 3.0 | 14 | 11.0 | 3.0 | 65% | 2.0 | NaN | NaN | 526 | 454 | 86% | NaN |
df_matchday
Unnamed: 0 | fixture.id | fixture.date | teams.home.id | teams.home.name | teams.home.winner | teams.away.id | teams.away.name | teams.away.winner | goals.home | goals.away | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 867946 | 2022-08-05T19:00:00+00:00 | 52 | Crystal Palace | False | 42 | Arsenal | True | 0 | 2 |
1 | 1 | 867947 | 2022-08-06T11:30:00+00:00 | 36 | Fulham | NaN | 40 | Liverpool | NaN | 2 | 2 |
2 | 2 | 867948 | 2022-08-06T14:00:00+00:00 | 35 | Bournemouth | True | 66 | Aston Villa | False | 2 | 0 |
3 | 3 | 867949 | 2022-08-06T14:00:00+00:00 | 63 | Leeds | True | 39 | Wolves | False | 2 | 1 |
4 | 4 | 867950 | 2022-08-07T13:00:00+00:00 | 46 | Leicester | NaN | 55 | Brentford | NaN | 2 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
375 | 375 | 868321 | 2023-05-28T15:30:00+00:00 | 45 | Everton | True | 35 | Bournemouth | False | 1 | 0 |
376 | 376 | 868322 | 2023-05-28T15:30:00+00:00 | 63 | Leeds | False | 47 | Tottenham | True | 1 | 4 |
377 | 377 | 868323 | 2023-05-28T15:30:00+00:00 | 46 | Leicester | True | 48 | West Ham | False | 2 | 1 |
378 | 378 | 868324 | 2023-05-28T15:30:00+00:00 | 33 | Manchester United | True | 36 | Fulham | False | 2 | 1 |
379 | 379 | 868325 | 2023-05-28T15:30:00+00:00 | 41 | Southampton | NaN | 40 | Liverpool | NaN | 4 | 4 |
380 rows × 11 columns
df_standing
rank | points | goalsDiff | team_id | team.name | matches_played | wins | draws | losses | goals_for | ... | home_draw | home_lose | home_goals_for | home_goals_against | away | away_wins | away_draw | away_lose | away_goals_for | away_goals_against | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 89 | 61 | 50 | Manchester City | 38 | 28 | 5 | 5 | 94 | ... | 1 | 1 | 60 | 17 | 19 | 11 | 4 | 4 | 34 | 16 |
1 | 2 | 84 | 45 | 42 | Arsenal | 38 | 26 | 6 | 6 | 88 | ... | 3 | 2 | 53 | 25 | 19 | 12 | 3 | 4 | 35 | 18 |
2 | 3 | 75 | 15 | 33 | Manchester United | 38 | 23 | 6 | 9 | 58 | ... | 3 | 1 | 36 | 10 | 19 | 8 | 3 | 8 | 22 | 33 |
3 | 4 | 71 | 35 | 34 | Newcastle | 38 | 19 | 14 | 5 | 68 | ... | 6 | 2 | 36 | 14 | 19 | 8 | 8 | 3 | 32 | 19 |
4 | 5 | 67 | 28 | 40 | Liverpool | 38 | 19 | 10 | 9 | 75 | ... | 5 | 1 | 46 | 17 | 19 | 6 | 5 | 8 | 29 | 30 |
5 | 6 | 62 | 19 | 51 | Brighton | 38 | 18 | 8 | 12 | 72 | ... | 4 | 5 | 37 | 21 | 19 | 8 | 4 | 7 | 35 | 32 |
6 | 7 | 61 | 5 | 66 | Aston Villa | 38 | 18 | 7 | 13 | 51 | ... | 2 | 5 | 33 | 21 | 19 | 6 | 5 | 8 | 18 | 25 |
7 | 8 | 60 | 7 | 47 | Tottenham | 38 | 18 | 6 | 14 | 70 | ... | 1 | 6 | 37 | 25 | 19 | 6 | 5 | 8 | 33 | 38 |
8 | 9 | 59 | 12 | 55 | Brentford | 38 | 15 | 14 | 9 | 58 | ... | 7 | 2 | 35 | 18 | 19 | 5 | 7 | 7 | 23 | 28 |
9 | 10 | 52 | 2 | 36 | Fulham | 38 | 15 | 7 | 16 | 55 | ... | 5 | 6 | 31 | 29 | 19 | 7 | 2 | 10 | 24 | 24 |
10 | 11 | 45 | -9 | 52 | Crystal Palace | 38 | 11 | 12 | 15 | 40 | ... | 7 | 5 | 21 | 23 | 19 | 4 | 5 | 10 | 19 | 26 |
11 | 12 | 44 | -9 | 49 | Chelsea | 38 | 11 | 11 | 16 | 38 | ... | 7 | 6 | 20 | 19 | 19 | 5 | 4 | 10 | 18 | 28 |
12 | 13 | 41 | -27 | 39 | Wolves | 38 | 11 | 8 | 19 | 31 | ... | 3 | 7 | 19 | 20 | 19 | 2 | 5 | 12 | 12 | 38 |
13 | 14 | 40 | -13 | 48 | West Ham | 38 | 11 | 7 | 20 | 42 | ... | 4 | 7 | 26 | 24 | 19 | 3 | 3 | 13 | 16 | 31 |
14 | 15 | 39 | -34 | 35 | Bournemouth | 38 | 11 | 6 | 21 | 37 | ... | 4 | 9 | 20 | 28 | 19 | 5 | 2 | 12 | 17 | 43 |
15 | 16 | 38 | -30 | 65 | Nottingham Forest | 38 | 9 | 11 | 18 | 38 | ... | 6 | 5 | 27 | 24 | 19 | 1 | 5 | 13 | 11 | 44 |
16 | 17 | 36 | -23 | 45 | Everton | 38 | 8 | 12 | 18 | 34 | ... | 3 | 10 | 16 | 27 | 19 | 2 | 9 | 8 | 18 | 30 |
17 | 18 | 34 | -17 | 46 | Leicester | 38 | 9 | 7 | 22 | 51 | ... | 4 | 10 | 23 | 27 | 19 | 4 | 3 | 12 | 28 | 41 |
18 | 19 | 31 | -30 | 63 | Leeds | 38 | 7 | 10 | 21 | 48 | ... | 7 | 7 | 26 | 37 | 19 | 2 | 3 | 14 | 22 | 41 |
19 | 20 | 25 | -37 | 41 | Southampton | 38 | 6 | 7 | 25 | 36 | ... | 5 | 12 | 19 | 37 | 19 | 4 | 2 | 13 | 17 | 36 |
20 rows × 23 columns
Overall Team Status#
I want to demonstrate each team’s overall performance in a season.
I noticed that the column name “team.name” might cause parsing issues with altair, so I renamed it to “team_name” for future convenience.
df_standing = df_standing.rename(columns={'team.name': 'team_name'})
Here, to visualize clearly the final points for each team, I draw an altair chart to show the relation between points and ranking, with tooltip showing each team’s name.
df_final_ranking_dataframe = df_standing[['rank', 'points']].copy()
df_final_ranking_dataframe
rank | points | |
---|---|---|
0 | 1 | 89 |
1 | 2 | 84 |
2 | 3 | 75 |
3 | 4 | 71 |
4 | 5 | 67 |
5 | 6 | 62 |
6 | 7 | 61 |
7 | 8 | 60 |
8 | 9 | 59 |
9 | 10 | 52 |
10 | 11 | 45 |
11 | 12 | 44 |
12 | 13 | 41 |
13 | 14 | 40 |
14 | 15 | 39 |
15 | 16 | 38 |
16 | 17 | 36 |
17 | 18 | 34 |
18 | 19 | 31 |
19 | 20 | 25 |
Using seaborn as an extra part, I plot the final points data using scatterplot according to the team’s final ranking. The plot is increasing beacuse the higher the ranking is, the higher the final point is. I also use altair in this case because it is easier to use a tooltip.
sns.scatterplot(data=df_final_ranking_dataframe, x='points', y='rank', hue='points')
plt.xlabel('Points')
plt.ylabel('Rank')
plt.title('Final Ranking')
plt.gca().invert_yaxis()
# Set the y-axis limits and tick labels
plt.ylim(20.5, 0.5)
plt.yticks(list(range(20, 0, -1)))
# Show the plot
plt.show()
df_final_ranking = alt.Chart(df_final_ranking_dataframe).mark_circle().encode(
x=alt.X("points:Q", scale=alt.Scale(zero=False)),
y=alt.Y("rank:N", scale=alt.Scale(zero=False)),
color=alt.Color("points"),
tooltip=['rank:N', 'team_name:N', 'points:N']
)
df_final_ranking
Individual team performance after each match#
Then I want to see how each team performs after each match. I make a line chart to show each team’s points after each match using data from df_matchday dataset.
To plot the lines, I first want to add three columns in df_matchday dataset, which are “teams_draw”, “teams_home_points”, “teams_away_points” to manually calculate how many points each team get after each game. Also, I have to rename some columns to avoid parsing issues.
df_matchday
Unnamed: 0 | fixture.id | fixture.date | teams.home.id | teams.home.name | teams.home.winner | teams.away.id | teams.away.name | teams.away.winner | goals.home | goals.away | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 867946 | 2022-08-05T19:00:00+00:00 | 52 | Crystal Palace | False | 42 | Arsenal | True | 0 | 2 |
1 | 1 | 867947 | 2022-08-06T11:30:00+00:00 | 36 | Fulham | NaN | 40 | Liverpool | NaN | 2 | 2 |
2 | 2 | 867948 | 2022-08-06T14:00:00+00:00 | 35 | Bournemouth | True | 66 | Aston Villa | False | 2 | 0 |
3 | 3 | 867949 | 2022-08-06T14:00:00+00:00 | 63 | Leeds | True | 39 | Wolves | False | 2 | 1 |
4 | 4 | 867950 | 2022-08-07T13:00:00+00:00 | 46 | Leicester | NaN | 55 | Brentford | NaN | 2 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
375 | 375 | 868321 | 2023-05-28T15:30:00+00:00 | 45 | Everton | True | 35 | Bournemouth | False | 1 | 0 |
376 | 376 | 868322 | 2023-05-28T15:30:00+00:00 | 63 | Leeds | False | 47 | Tottenham | True | 1 | 4 |
377 | 377 | 868323 | 2023-05-28T15:30:00+00:00 | 46 | Leicester | True | 48 | West Ham | False | 2 | 1 |
378 | 378 | 868324 | 2023-05-28T15:30:00+00:00 | 33 | Manchester United | True | 36 | Fulham | False | 2 | 1 |
379 | 379 | 868325 | 2023-05-28T15:30:00+00:00 | 41 | Southampton | NaN | 40 | Liverpool | NaN | 4 | 4 |
380 rows × 11 columns
We will be dealing with this dataframe for a lot of times, so it’s better to look at some features of this particular dataframe.
df_matchday.dtypes
Unnamed: 0 int64
fixture.id int64
fixture.date object
teams.home.id int64
teams.home.name object
teams.home.winner object
teams.away.id int64
teams.away.name object
teams.away.winner object
goals.home int64
goals.away int64
dtype: object
df_matchday.describe
<bound method NDFrame.describe of Unnamed: 0 fixture.id fixture.date teams.home.id \
0 0 867946 2022-08-05T19:00:00+00:00 52
1 1 867947 2022-08-06T11:30:00+00:00 36
2 2 867948 2022-08-06T14:00:00+00:00 35
3 3 867949 2022-08-06T14:00:00+00:00 63
4 4 867950 2022-08-07T13:00:00+00:00 46
.. ... ... ... ...
375 375 868321 2023-05-28T15:30:00+00:00 45
376 376 868322 2023-05-28T15:30:00+00:00 63
377 377 868323 2023-05-28T15:30:00+00:00 46
378 378 868324 2023-05-28T15:30:00+00:00 33
379 379 868325 2023-05-28T15:30:00+00:00 41
teams.home.name teams.home.winner teams.away.id teams.away.name \
0 Crystal Palace False 42 Arsenal
1 Fulham NaN 40 Liverpool
2 Bournemouth True 66 Aston Villa
3 Leeds True 39 Wolves
4 Leicester NaN 55 Brentford
.. ... ... ... ...
375 Everton True 35 Bournemouth
376 Leeds False 47 Tottenham
377 Leicester True 48 West Ham
378 Manchester United True 36 Fulham
379 Southampton NaN 40 Liverpool
teams.away.winner goals.home goals.away
0 True 0 2
1 NaN 2 2
2 False 2 0
3 False 2 1
4 NaN 2 2
.. ... ... ...
375 False 1 0
376 True 1 4
377 False 2 1
378 False 2 1
379 NaN 4 4
[380 rows x 11 columns]>
list(df_matchday.columns)
['Unnamed: 0',
'fixture.id',
'fixture.date',
'teams.home.id',
'teams.home.name',
'teams.home.winner',
'teams.away.id',
'teams.away.name',
'teams.away.winner',
'goals.home',
'goals.away']
for index in df_matchday.columns:
new_index = index.replace(".", "_")
df_matchday.rename(columns={index: new_index}, inplace=True)
df_matchday
Unnamed: 0 | fixture_id | fixture_date | teams_home_id | teams_home_name | teams_home_winner | teams_away_id | teams_away_name | teams_away_winner | goals_home | goals_away | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 867946 | 2022-08-05T19:00:00+00:00 | 52 | Crystal Palace | False | 42 | Arsenal | True | 0 | 2 |
1 | 1 | 867947 | 2022-08-06T11:30:00+00:00 | 36 | Fulham | NaN | 40 | Liverpool | NaN | 2 | 2 |
2 | 2 | 867948 | 2022-08-06T14:00:00+00:00 | 35 | Bournemouth | True | 66 | Aston Villa | False | 2 | 0 |
3 | 3 | 867949 | 2022-08-06T14:00:00+00:00 | 63 | Leeds | True | 39 | Wolves | False | 2 | 1 |
4 | 4 | 867950 | 2022-08-07T13:00:00+00:00 | 46 | Leicester | NaN | 55 | Brentford | NaN | 2 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
375 | 375 | 868321 | 2023-05-28T15:30:00+00:00 | 45 | Everton | True | 35 | Bournemouth | False | 1 | 0 |
376 | 376 | 868322 | 2023-05-28T15:30:00+00:00 | 63 | Leeds | False | 47 | Tottenham | True | 1 | 4 |
377 | 377 | 868323 | 2023-05-28T15:30:00+00:00 | 46 | Leicester | True | 48 | West Ham | False | 2 | 1 |
378 | 378 | 868324 | 2023-05-28T15:30:00+00:00 | 33 | Manchester United | True | 36 | Fulham | False | 2 | 1 |
379 | 379 | 868325 | 2023-05-28T15:30:00+00:00 | 41 | Southampton | NaN | 40 | Liverpool | NaN | 4 | 4 |
380 rows × 11 columns
Here I noticed if two teams end up with a draw, the “teams_home_winner” column would be nan, so I make a “teams_draw” column to determine if the match results in a draw.
df_matchday["teams_draw"] = df_matchday["teams_home_winner"].isna()
df_matchday["teams_draw"]
0 False
1 True
2 False
3 False
4 True
...
375 False
376 False
377 False
378 False
379 True
Name: teams_draw, Length: 380, dtype: bool
In the next two block I make two new columns “teams_home_points” and “teams_away_points” to determine how many points each team received from each match they played. The winner receives three points, the loser receives 0 point, and each team receives one point if the result is a draw.
for i in range(0,380):
if df_matchday.loc[i, "goals_home"] > df_matchday.loc[i, "goals_away"]:
df_matchday.loc[i, "teams_home_points"] = 3
elif df_matchday.loc[i, "goals_home"] < df_matchday.loc[i, "goals_away"]:
df_matchday.loc[i, "teams_home_points"] = 0
else:
df_matchday.loc[i, "teams_home_points"] = 1
df_matchday["teams_home_points"]
0 0.0
1 1.0
2 3.0
3 3.0
4 1.0
...
375 3.0
376 0.0
377 3.0
378 3.0
379 1.0
Name: teams_home_points, Length: 380, dtype: float64
for i in range(0,380):
if df_matchday.loc[i, "goals_home"] > df_matchday.loc[i, "goals_away"]:
df_matchday.loc[i, "teams_away_points"] = 0
elif df_matchday.loc[i, "goals_home"] < df_matchday.loc[i, "goals_away"]:
df_matchday.loc[i, "teams_away_points"] = 3
else:
df_matchday.loc[i, "teams_away_points"] = 1
df_matchday["teams_away_points"]
0 3.0
1 1.0
2 0.0
3 0.0
4 1.0
...
375 0.0
376 3.0
377 0.0
378 0.0
379 1.0
Name: teams_away_points, Length: 380, dtype: float64
Now I try to separate the teams using “teams_home_id”. I check if the length of the numpy array is equal to 20, which is the number of teams in the league.
teams_id_array = df_matchday["teams_home_id"].unique()
teams_id_array
array([52, 36, 35, 63, 46, 34, 47, 45, 33, 48, 42, 66, 55, 51, 49, 40, 50,
65, 41, 39])
len(teams_id_array)
20
Here I check the matches for a specific team with id “52” for a whole season.
df_sub = df_matchday[(df_matchday["teams_home_id"] == 52) | (df_matchday["teams_away_id"] == 52)].copy()
df_sub
Unnamed: 0 | fixture_id | fixture_date | teams_home_id | teams_home_name | teams_home_winner | teams_away_id | teams_away_name | teams_away_winner | goals_home | goals_away | teams_draw | teams_home_points | teams_away_points | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 867946 | 2022-08-05T19:00:00+00:00 | 52 | Crystal Palace | False | 42 | Arsenal | True | 0 | 2 | False | 0.0 | 3.0 |
15 | 15 | 867961 | 2022-08-15T19:00:00+00:00 | 40 | Liverpool | NaN | 52 | Crystal Palace | NaN | 1 | 1 | True | 1.0 | 1.0 |
21 | 21 | 867967 | 2022-08-20T14:00:00+00:00 | 52 | Crystal Palace | True | 66 | Aston Villa | False | 3 | 1 | False | 3.0 | 0.0 |
36 | 36 | 867982 | 2022-08-27T14:00:00+00:00 | 50 | Manchester City | True | 52 | Crystal Palace | False | 4 | 2 | False | 3.0 | 0.0 |
46 | 46 | 867992 | 2022-08-30T18:30:00+00:00 | 52 | Crystal Palace | NaN | 55 | Brentford | NaN | 1 | 1 | True | 1.0 | 1.0 |
56 | 56 | 868002 | 2022-09-03T14:00:00+00:00 | 34 | Newcastle | NaN | 52 | Crystal Palace | NaN | 0 | 0 | True | 1.0 | 1.0 |
62 | 62 | 868008 | 2023-01-18T20:00:00+00:00 | 52 | Crystal Palace | NaN | 33 | Manchester United | NaN | 1 | 1 | True | 1.0 | 1.0 |
72 | 72 | 868018 | 2023-03-15T19:30:00+00:00 | 51 | Brighton | True | 52 | Crystal Palace | False | 1 | 0 | False | 3.0 | 0.0 |
82 | 82 | 868028 | 2022-10-01T14:00:00+00:00 | 52 | Crystal Palace | False | 49 | Chelsea | True | 1 | 2 | False | 0.0 | 3.0 |
94 | 94 | 868040 | 2022-10-09T13:00:00+00:00 | 52 | Crystal Palace | True | 63 | Leeds | False | 2 | 1 | False | 3.0 | 0.0 |
104 | 104 | 868050 | 2022-10-15T11:30:00+00:00 | 46 | Leicester | NaN | 52 | Crystal Palace | NaN | 0 | 0 | True | 1.0 | 1.0 |
116 | 116 | 868062 | 2022-10-18T19:15:00+00:00 | 52 | Crystal Palace | True | 39 | Wolves | False | 2 | 1 | False | 3.0 | 0.0 |
122 | 122 | 868068 | 2022-10-22T14:00:00+00:00 | 45 | Everton | True | 52 | Crystal Palace | False | 3 | 0 | False | 3.0 | 0.0 |
134 | 134 | 868080 | 2022-10-29T14:00:00+00:00 | 52 | Crystal Palace | True | 41 | Southampton | False | 1 | 0 | False | 3.0 | 0.0 |
148 | 148 | 868094 | 2022-11-06T14:00:00+00:00 | 48 | West Ham | False | 52 | Crystal Palace | True | 1 | 2 | False | 0.0 | 3.0 |
156 | 156 | 868102 | 2022-11-12T15:00:00+00:00 | 65 | Nottingham Forest | True | 52 | Crystal Palace | False | 1 | 0 | False | 3.0 | 0.0 |
164 | 164 | 868110 | 2022-12-26T15:00:00+00:00 | 52 | Crystal Palace | False | 36 | Fulham | True | 0 | 3 | False | 0.0 | 3.0 |
170 | 170 | 868116 | 2022-12-31T15:00:00+00:00 | 35 | Bournemouth | False | 52 | Crystal Palace | True | 0 | 2 | False | 0.0 | 3.0 |
184 | 184 | 868130 | 2023-01-04T20:00:00+00:00 | 52 | Crystal Palace | False | 47 | Tottenham | True | 0 | 4 | False | 0.0 | 3.0 |
193 | 193 | 868139 | 2023-01-15T14:00:00+00:00 | 49 | Chelsea | True | 52 | Crystal Palace | False | 1 | 0 | False | 3.0 | 0.0 |
202 | 202 | 868148 | 2023-01-21T17:30:00+00:00 | 52 | Crystal Palace | NaN | 34 | Newcastle | NaN | 0 | 0 | True | 1.0 | 1.0 |
215 | 215 | 868161 | 2023-02-04T15:00:00+00:00 | 33 | Manchester United | True | 52 | Crystal Palace | False | 2 | 1 | False | 3.0 | 0.0 |
222 | 222 | 868168 | 2023-02-11T15:00:00+00:00 | 52 | Crystal Palace | NaN | 51 | Brighton | NaN | 1 | 1 | True | 1.0 | 1.0 |
231 | 231 | 868177 | 2023-02-18T15:00:00+00:00 | 55 | Brentford | NaN | 52 | Crystal Palace | NaN | 1 | 1 | True | 1.0 | 1.0 |
241 | 241 | 868187 | 2023-02-25T19:45:00+00:00 | 52 | Crystal Palace | NaN | 40 | Liverpool | NaN | 0 | 0 | True | 1.0 | 1.0 |
251 | 251 | 868197 | 2023-03-04T15:00:00+00:00 | 66 | Aston Villa | True | 52 | Crystal Palace | False | 1 | 0 | False | 3.0 | 0.0 |
261 | 261 | 868207 | 2023-03-11T17:30:00+00:00 | 52 | Crystal Palace | False | 50 | Manchester City | True | 0 | 1 | False | 0.0 | 3.0 |
270 | 270 | 868216 | 2023-03-19T14:00:00+00:00 | 42 | Arsenal | True | 52 | Crystal Palace | False | 4 | 1 | False | 3.0 | 0.0 |
284 | 284 | 868230 | 2023-04-01T14:00:00+00:00 | 52 | Crystal Palace | True | 46 | Leicester | False | 2 | 1 | False | 3.0 | 0.0 |
293 | 293 | 868239 | 2023-04-09T13:00:00+00:00 | 63 | Leeds | False | 52 | Crystal Palace | True | 1 | 5 | False | 0.0 | 3.0 |
306 | 306 | 868252 | 2023-04-15T14:00:00+00:00 | 41 | Southampton | False | 52 | Crystal Palace | True | 0 | 2 | False | 0.0 | 3.0 |
314 | 314 | 868260 | 2023-04-22T14:00:00+00:00 | 52 | Crystal Palace | NaN | 45 | Everton | NaN | 0 | 0 | True | 1.0 | 1.0 |
325 | 325 | 868271 | 2023-04-25T18:30:00+00:00 | 39 | Wolves | True | 52 | Crystal Palace | False | 2 | 0 | False | 3.0 | 0.0 |
334 | 334 | 868280 | 2023-04-29T11:30:00+00:00 | 52 | Crystal Palace | True | 48 | West Ham | False | 4 | 3 | False | 3.0 | 0.0 |
347 | 347 | 868293 | 2023-05-06T14:00:00+00:00 | 47 | Tottenham | True | 52 | Crystal Palace | False | 1 | 0 | False | 3.0 | 0.0 |
354 | 354 | 868300 | 2023-05-13T14:00:00+00:00 | 52 | Crystal Palace | True | 35 | Bournemouth | False | 2 | 0 | False | 3.0 | 0.0 |
362 | 362 | 868308 | 2023-05-20T14:00:00+00:00 | 36 | Fulham | NaN | 52 | Crystal Palace | NaN | 2 | 2 | True | 1.0 | 1.0 |
374 | 374 | 868320 | 2023-05-28T15:30:00+00:00 | 52 | Crystal Palace | NaN | 65 | Nottingham Forest | NaN | 1 | 1 | True | 1.0 | 1.0 |
Now I try to use cumsum method to calculate the cumulative for a specific teamw ith id “52” after each match. I store the values in a new column called “teams_total_points”.
df_sub.loc[:, "teams_total_points"] = df_sub.apply(lambda row: row["teams_home_points"] if row["teams_home_id"] == 52 else row["teams_away_points"], axis=1).cumsum()
df_sub
Unnamed: 0 | fixture_id | fixture_date | teams_home_id | teams_home_name | teams_home_winner | teams_away_id | teams_away_name | teams_away_winner | goals_home | goals_away | teams_draw | teams_home_points | teams_away_points | teams_total_points | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 867946 | 2022-08-05T19:00:00+00:00 | 52 | Crystal Palace | False | 42 | Arsenal | True | 0 | 2 | False | 0.0 | 3.0 | 0.0 |
15 | 15 | 867961 | 2022-08-15T19:00:00+00:00 | 40 | Liverpool | NaN | 52 | Crystal Palace | NaN | 1 | 1 | True | 1.0 | 1.0 | 1.0 |
21 | 21 | 867967 | 2022-08-20T14:00:00+00:00 | 52 | Crystal Palace | True | 66 | Aston Villa | False | 3 | 1 | False | 3.0 | 0.0 | 4.0 |
36 | 36 | 867982 | 2022-08-27T14:00:00+00:00 | 50 | Manchester City | True | 52 | Crystal Palace | False | 4 | 2 | False | 3.0 | 0.0 | 4.0 |
46 | 46 | 867992 | 2022-08-30T18:30:00+00:00 | 52 | Crystal Palace | NaN | 55 | Brentford | NaN | 1 | 1 | True | 1.0 | 1.0 | 5.0 |
56 | 56 | 868002 | 2022-09-03T14:00:00+00:00 | 34 | Newcastle | NaN | 52 | Crystal Palace | NaN | 0 | 0 | True | 1.0 | 1.0 | 6.0 |
62 | 62 | 868008 | 2023-01-18T20:00:00+00:00 | 52 | Crystal Palace | NaN | 33 | Manchester United | NaN | 1 | 1 | True | 1.0 | 1.0 | 7.0 |
72 | 72 | 868018 | 2023-03-15T19:30:00+00:00 | 51 | Brighton | True | 52 | Crystal Palace | False | 1 | 0 | False | 3.0 | 0.0 | 7.0 |
82 | 82 | 868028 | 2022-10-01T14:00:00+00:00 | 52 | Crystal Palace | False | 49 | Chelsea | True | 1 | 2 | False | 0.0 | 3.0 | 7.0 |
94 | 94 | 868040 | 2022-10-09T13:00:00+00:00 | 52 | Crystal Palace | True | 63 | Leeds | False | 2 | 1 | False | 3.0 | 0.0 | 10.0 |
104 | 104 | 868050 | 2022-10-15T11:30:00+00:00 | 46 | Leicester | NaN | 52 | Crystal Palace | NaN | 0 | 0 | True | 1.0 | 1.0 | 11.0 |
116 | 116 | 868062 | 2022-10-18T19:15:00+00:00 | 52 | Crystal Palace | True | 39 | Wolves | False | 2 | 1 | False | 3.0 | 0.0 | 14.0 |
122 | 122 | 868068 | 2022-10-22T14:00:00+00:00 | 45 | Everton | True | 52 | Crystal Palace | False | 3 | 0 | False | 3.0 | 0.0 | 14.0 |
134 | 134 | 868080 | 2022-10-29T14:00:00+00:00 | 52 | Crystal Palace | True | 41 | Southampton | False | 1 | 0 | False | 3.0 | 0.0 | 17.0 |
148 | 148 | 868094 | 2022-11-06T14:00:00+00:00 | 48 | West Ham | False | 52 | Crystal Palace | True | 1 | 2 | False | 0.0 | 3.0 | 20.0 |
156 | 156 | 868102 | 2022-11-12T15:00:00+00:00 | 65 | Nottingham Forest | True | 52 | Crystal Palace | False | 1 | 0 | False | 3.0 | 0.0 | 20.0 |
164 | 164 | 868110 | 2022-12-26T15:00:00+00:00 | 52 | Crystal Palace | False | 36 | Fulham | True | 0 | 3 | False | 0.0 | 3.0 | 20.0 |
170 | 170 | 868116 | 2022-12-31T15:00:00+00:00 | 35 | Bournemouth | False | 52 | Crystal Palace | True | 0 | 2 | False | 0.0 | 3.0 | 23.0 |
184 | 184 | 868130 | 2023-01-04T20:00:00+00:00 | 52 | Crystal Palace | False | 47 | Tottenham | True | 0 | 4 | False | 0.0 | 3.0 | 23.0 |
193 | 193 | 868139 | 2023-01-15T14:00:00+00:00 | 49 | Chelsea | True | 52 | Crystal Palace | False | 1 | 0 | False | 3.0 | 0.0 | 23.0 |
202 | 202 | 868148 | 2023-01-21T17:30:00+00:00 | 52 | Crystal Palace | NaN | 34 | Newcastle | NaN | 0 | 0 | True | 1.0 | 1.0 | 24.0 |
215 | 215 | 868161 | 2023-02-04T15:00:00+00:00 | 33 | Manchester United | True | 52 | Crystal Palace | False | 2 | 1 | False | 3.0 | 0.0 | 24.0 |
222 | 222 | 868168 | 2023-02-11T15:00:00+00:00 | 52 | Crystal Palace | NaN | 51 | Brighton | NaN | 1 | 1 | True | 1.0 | 1.0 | 25.0 |
231 | 231 | 868177 | 2023-02-18T15:00:00+00:00 | 55 | Brentford | NaN | 52 | Crystal Palace | NaN | 1 | 1 | True | 1.0 | 1.0 | 26.0 |
241 | 241 | 868187 | 2023-02-25T19:45:00+00:00 | 52 | Crystal Palace | NaN | 40 | Liverpool | NaN | 0 | 0 | True | 1.0 | 1.0 | 27.0 |
251 | 251 | 868197 | 2023-03-04T15:00:00+00:00 | 66 | Aston Villa | True | 52 | Crystal Palace | False | 1 | 0 | False | 3.0 | 0.0 | 27.0 |
261 | 261 | 868207 | 2023-03-11T17:30:00+00:00 | 52 | Crystal Palace | False | 50 | Manchester City | True | 0 | 1 | False | 0.0 | 3.0 | 27.0 |
270 | 270 | 868216 | 2023-03-19T14:00:00+00:00 | 42 | Arsenal | True | 52 | Crystal Palace | False | 4 | 1 | False | 3.0 | 0.0 | 27.0 |
284 | 284 | 868230 | 2023-04-01T14:00:00+00:00 | 52 | Crystal Palace | True | 46 | Leicester | False | 2 | 1 | False | 3.0 | 0.0 | 30.0 |
293 | 293 | 868239 | 2023-04-09T13:00:00+00:00 | 63 | Leeds | False | 52 | Crystal Palace | True | 1 | 5 | False | 0.0 | 3.0 | 33.0 |
306 | 306 | 868252 | 2023-04-15T14:00:00+00:00 | 41 | Southampton | False | 52 | Crystal Palace | True | 0 | 2 | False | 0.0 | 3.0 | 36.0 |
314 | 314 | 868260 | 2023-04-22T14:00:00+00:00 | 52 | Crystal Palace | NaN | 45 | Everton | NaN | 0 | 0 | True | 1.0 | 1.0 | 37.0 |
325 | 325 | 868271 | 2023-04-25T18:30:00+00:00 | 39 | Wolves | True | 52 | Crystal Palace | False | 2 | 0 | False | 3.0 | 0.0 | 37.0 |
334 | 334 | 868280 | 2023-04-29T11:30:00+00:00 | 52 | Crystal Palace | True | 48 | West Ham | False | 4 | 3 | False | 3.0 | 0.0 | 40.0 |
347 | 347 | 868293 | 2023-05-06T14:00:00+00:00 | 47 | Tottenham | True | 52 | Crystal Palace | False | 1 | 0 | False | 3.0 | 0.0 | 40.0 |
354 | 354 | 868300 | 2023-05-13T14:00:00+00:00 | 52 | Crystal Palace | True | 35 | Bournemouth | False | 2 | 0 | False | 3.0 | 0.0 | 43.0 |
362 | 362 | 868308 | 2023-05-20T14:00:00+00:00 | 36 | Fulham | NaN | 52 | Crystal Palace | NaN | 2 | 2 | True | 1.0 | 1.0 | 44.0 |
374 | 374 | 868320 | 2023-05-28T15:30:00+00:00 | 52 | Crystal Palace | NaN | 65 | Nottingham Forest | NaN | 1 | 1 | True | 1.0 | 1.0 | 45.0 |
df_sub.rename(columns={"Unnamed: 0": "game_number"}, inplace=True)
For future convenience and clarity, I replace the previous values in column “game_number”(previously settled value with no particular meaning) for a specific team with values from 1-38, which represent the matches 1-38.
game_number_mapping = {game_number: i + 1 for i, game_number in enumerate(df_sub['game_number'].unique())}
df_sub.loc[:, 'game_number'] = df_sub['game_number'].replace(game_number_mapping)
df_sub
game_number | fixture_id | fixture_date | teams_home_id | teams_home_name | teams_home_winner | teams_away_id | teams_away_name | teams_away_winner | goals_home | goals_away | teams_draw | teams_home_points | teams_away_points | teams_total_points | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 867946 | 2022-08-05T19:00:00+00:00 | 52 | Crystal Palace | False | 42 | Arsenal | True | 0 | 2 | False | 0.0 | 3.0 | 0.0 |
15 | 2 | 867961 | 2022-08-15T19:00:00+00:00 | 40 | Liverpool | NaN | 52 | Crystal Palace | NaN | 1 | 1 | True | 1.0 | 1.0 | 1.0 |
21 | 3 | 867967 | 2022-08-20T14:00:00+00:00 | 52 | Crystal Palace | True | 66 | Aston Villa | False | 3 | 1 | False | 3.0 | 0.0 | 4.0 |
36 | 4 | 867982 | 2022-08-27T14:00:00+00:00 | 50 | Manchester City | True | 52 | Crystal Palace | False | 4 | 2 | False | 3.0 | 0.0 | 4.0 |
46 | 5 | 867992 | 2022-08-30T18:30:00+00:00 | 52 | Crystal Palace | NaN | 55 | Brentford | NaN | 1 | 1 | True | 1.0 | 1.0 | 5.0 |
56 | 6 | 868002 | 2022-09-03T14:00:00+00:00 | 34 | Newcastle | NaN | 52 | Crystal Palace | NaN | 0 | 0 | True | 1.0 | 1.0 | 6.0 |
62 | 7 | 868008 | 2023-01-18T20:00:00+00:00 | 52 | Crystal Palace | NaN | 33 | Manchester United | NaN | 1 | 1 | True | 1.0 | 1.0 | 7.0 |
72 | 8 | 868018 | 2023-03-15T19:30:00+00:00 | 51 | Brighton | True | 52 | Crystal Palace | False | 1 | 0 | False | 3.0 | 0.0 | 7.0 |
82 | 9 | 868028 | 2022-10-01T14:00:00+00:00 | 52 | Crystal Palace | False | 49 | Chelsea | True | 1 | 2 | False | 0.0 | 3.0 | 7.0 |
94 | 10 | 868040 | 2022-10-09T13:00:00+00:00 | 52 | Crystal Palace | True | 63 | Leeds | False | 2 | 1 | False | 3.0 | 0.0 | 10.0 |
104 | 11 | 868050 | 2022-10-15T11:30:00+00:00 | 46 | Leicester | NaN | 52 | Crystal Palace | NaN | 0 | 0 | True | 1.0 | 1.0 | 11.0 |
116 | 12 | 868062 | 2022-10-18T19:15:00+00:00 | 52 | Crystal Palace | True | 39 | Wolves | False | 2 | 1 | False | 3.0 | 0.0 | 14.0 |
122 | 13 | 868068 | 2022-10-22T14:00:00+00:00 | 45 | Everton | True | 52 | Crystal Palace | False | 3 | 0 | False | 3.0 | 0.0 | 14.0 |
134 | 14 | 868080 | 2022-10-29T14:00:00+00:00 | 52 | Crystal Palace | True | 41 | Southampton | False | 1 | 0 | False | 3.0 | 0.0 | 17.0 |
148 | 15 | 868094 | 2022-11-06T14:00:00+00:00 | 48 | West Ham | False | 52 | Crystal Palace | True | 1 | 2 | False | 0.0 | 3.0 | 20.0 |
156 | 16 | 868102 | 2022-11-12T15:00:00+00:00 | 65 | Nottingham Forest | True | 52 | Crystal Palace | False | 1 | 0 | False | 3.0 | 0.0 | 20.0 |
164 | 17 | 868110 | 2022-12-26T15:00:00+00:00 | 52 | Crystal Palace | False | 36 | Fulham | True | 0 | 3 | False | 0.0 | 3.0 | 20.0 |
170 | 18 | 868116 | 2022-12-31T15:00:00+00:00 | 35 | Bournemouth | False | 52 | Crystal Palace | True | 0 | 2 | False | 0.0 | 3.0 | 23.0 |
184 | 19 | 868130 | 2023-01-04T20:00:00+00:00 | 52 | Crystal Palace | False | 47 | Tottenham | True | 0 | 4 | False | 0.0 | 3.0 | 23.0 |
193 | 20 | 868139 | 2023-01-15T14:00:00+00:00 | 49 | Chelsea | True | 52 | Crystal Palace | False | 1 | 0 | False | 3.0 | 0.0 | 23.0 |
202 | 21 | 868148 | 2023-01-21T17:30:00+00:00 | 52 | Crystal Palace | NaN | 34 | Newcastle | NaN | 0 | 0 | True | 1.0 | 1.0 | 24.0 |
215 | 22 | 868161 | 2023-02-04T15:00:00+00:00 | 33 | Manchester United | True | 52 | Crystal Palace | False | 2 | 1 | False | 3.0 | 0.0 | 24.0 |
222 | 23 | 868168 | 2023-02-11T15:00:00+00:00 | 52 | Crystal Palace | NaN | 51 | Brighton | NaN | 1 | 1 | True | 1.0 | 1.0 | 25.0 |
231 | 24 | 868177 | 2023-02-18T15:00:00+00:00 | 55 | Brentford | NaN | 52 | Crystal Palace | NaN | 1 | 1 | True | 1.0 | 1.0 | 26.0 |
241 | 25 | 868187 | 2023-02-25T19:45:00+00:00 | 52 | Crystal Palace | NaN | 40 | Liverpool | NaN | 0 | 0 | True | 1.0 | 1.0 | 27.0 |
251 | 26 | 868197 | 2023-03-04T15:00:00+00:00 | 66 | Aston Villa | True | 52 | Crystal Palace | False | 1 | 0 | False | 3.0 | 0.0 | 27.0 |
261 | 27 | 868207 | 2023-03-11T17:30:00+00:00 | 52 | Crystal Palace | False | 50 | Manchester City | True | 0 | 1 | False | 0.0 | 3.0 | 27.0 |
270 | 28 | 868216 | 2023-03-19T14:00:00+00:00 | 42 | Arsenal | True | 52 | Crystal Palace | False | 4 | 1 | False | 3.0 | 0.0 | 27.0 |
284 | 29 | 868230 | 2023-04-01T14:00:00+00:00 | 52 | Crystal Palace | True | 46 | Leicester | False | 2 | 1 | False | 3.0 | 0.0 | 30.0 |
293 | 30 | 868239 | 2023-04-09T13:00:00+00:00 | 63 | Leeds | False | 52 | Crystal Palace | True | 1 | 5 | False | 0.0 | 3.0 | 33.0 |
306 | 31 | 868252 | 2023-04-15T14:00:00+00:00 | 41 | Southampton | False | 52 | Crystal Palace | True | 0 | 2 | False | 0.0 | 3.0 | 36.0 |
314 | 32 | 868260 | 2023-04-22T14:00:00+00:00 | 52 | Crystal Palace | NaN | 45 | Everton | NaN | 0 | 0 | True | 1.0 | 1.0 | 37.0 |
325 | 33 | 868271 | 2023-04-25T18:30:00+00:00 | 39 | Wolves | True | 52 | Crystal Palace | False | 2 | 0 | False | 3.0 | 0.0 | 37.0 |
334 | 34 | 868280 | 2023-04-29T11:30:00+00:00 | 52 | Crystal Palace | True | 48 | West Ham | False | 4 | 3 | False | 3.0 | 0.0 | 40.0 |
347 | 35 | 868293 | 2023-05-06T14:00:00+00:00 | 47 | Tottenham | True | 52 | Crystal Palace | False | 1 | 0 | False | 3.0 | 0.0 | 40.0 |
354 | 36 | 868300 | 2023-05-13T14:00:00+00:00 | 52 | Crystal Palace | True | 35 | Bournemouth | False | 2 | 0 | False | 3.0 | 0.0 | 43.0 |
362 | 37 | 868308 | 2023-05-20T14:00:00+00:00 | 36 | Fulham | NaN | 52 | Crystal Palace | NaN | 2 | 2 | True | 1.0 | 1.0 | 44.0 |
374 | 38 | 868320 | 2023-05-28T15:30:00+00:00 | 52 | Crystal Palace | NaN | 65 | Nottingham Forest | NaN | 1 | 1 | True | 1.0 | 1.0 | 45.0 |
Just a reminder for the list of all team ids.
list(teams_id_array)
[52,
36,
35,
63,
46,
34,
47,
45,
33,
48,
42,
66,
55,
51,
49,
40,
50,
65,
41,
39]
Now using the above method for one team, I use a for loop to create 20 sub dataframes for all the 20 teams in the league. I use distinct team ids to accomplish this and store all of the sub dataframes in one list called “dataframe”.
dataframes = []
for index in teams_id_array:
df_sub = df_matchday[(df_matchday["teams_home_id"] == index) | (df_matchday["teams_away_id"] == index)].copy()
df_sub.rename(columns={"Unnamed: 0": "game_number"}, inplace=True)
df_sub.loc[:, "teams_total_points"] = df_sub.apply(lambda row: row["teams_home_points"] if row["teams_home_id"] == index else row["teams_away_points"], axis=1).cumsum()
game_number_mapping = {game_number: i + 1 for i, game_number in enumerate(df_sub['game_number'].unique())}
df_sub.loc[:, 'game_number'] = df_sub['game_number'].replace(game_number_mapping)
dataframes.append(df_sub)
Take a look at one sub dataframe to see if it looks the way I want it to be.
dataframes[3]
game_number | fixture_id | fixture_date | teams_home_id | teams_home_name | teams_home_winner | teams_away_id | teams_away_name | teams_away_winner | goals_home | goals_away | teams_draw | teams_home_points | teams_away_points | teams_total_points | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 1 | 867949 | 2022-08-06T14:00:00+00:00 | 63 | Leeds | True | 39 | Wolves | False | 2 | 1 | False | 3.0 | 0.0 | 3.0 |
18 | 2 | 867964 | 2022-08-13T14:00:00+00:00 | 41 | Southampton | NaN | 63 | Leeds | NaN | 2 | 2 | True | 1.0 | 1.0 | 4.0 |
24 | 3 | 867970 | 2022-08-21T13:00:00+00:00 | 63 | Leeds | True | 49 | Chelsea | False | 3 | 0 | False | 3.0 | 0.0 | 7.0 |
33 | 4 | 867979 | 2022-08-27T14:00:00+00:00 | 51 | Brighton | True | 63 | Leeds | False | 1 | 0 | False | 3.0 | 0.0 | 7.0 |
43 | 5 | 867989 | 2022-08-30T19:00:00+00:00 | 63 | Leeds | NaN | 45 | Everton | NaN | 1 | 1 | True | 1.0 | 1.0 | 8.0 |
51 | 6 | 867997 | 2022-09-03T14:00:00+00:00 | 55 | Brentford | True | 63 | Leeds | False | 5 | 2 | False | 3.0 | 0.0 | 8.0 |
64 | 7 | 868010 | 2023-04-04T18:45:00+00:00 | 63 | Leeds | True | 65 | Nottingham Forest | False | 2 | 1 | False | 3.0 | 0.0 | 11.0 |
75 | 8 | 868021 | 2023-02-08T20:00:00+00:00 | 33 | Manchester United | NaN | 63 | Leeds | NaN | 2 | 2 | True | 1.0 | 1.0 | 12.0 |
84 | 9 | 868030 | 2022-10-02T15:30:00+00:00 | 63 | Leeds | NaN | 66 | Aston Villa | NaN | 0 | 0 | True | 1.0 | 1.0 | 13.0 |
94 | 10 | 868040 | 2022-10-09T13:00:00+00:00 | 52 | Crystal Palace | True | 63 | Leeds | False | 2 | 1 | False | 3.0 | 0.0 | 13.0 |
103 | 11 | 868049 | 2022-10-16T13:00:00+00:00 | 63 | Leeds | False | 42 | Arsenal | True | 0 | 1 | False | 0.0 | 3.0 | 13.0 |
115 | 12 | 868061 | 2022-10-20T19:15:00+00:00 | 46 | Leicester | True | 63 | Leeds | False | 2 | 0 | False | 3.0 | 0.0 | 13.0 |
123 | 13 | 868069 | 2022-10-23T13:00:00+00:00 | 63 | Leeds | False | 36 | Fulham | True | 2 | 3 | False | 0.0 | 3.0 | 13.0 |
137 | 14 | 868083 | 2022-10-29T18:45:00+00:00 | 40 | Liverpool | False | 63 | Leeds | True | 1 | 2 | False | 0.0 | 3.0 | 16.0 |
143 | 15 | 868089 | 2022-11-05T15:00:00+00:00 | 63 | Leeds | True | 35 | Bournemouth | False | 4 | 3 | False | 3.0 | 0.0 | 19.0 |
157 | 16 | 868103 | 2022-11-12T15:00:00+00:00 | 47 | Tottenham | True | 63 | Leeds | False | 4 | 3 | False | 3.0 | 0.0 | 19.0 |
166 | 17 | 868112 | 2022-12-28T20:00:00+00:00 | 63 | Leeds | False | 50 | Manchester City | True | 1 | 3 | False | 0.0 | 3.0 | 19.0 |
175 | 18 | 868121 | 2022-12-31T15:00:00+00:00 | 34 | Newcastle | NaN | 63 | Leeds | NaN | 0 | 0 | True | 1.0 | 1.0 | 20.0 |
186 | 19 | 868132 | 2023-01-04T19:45:00+00:00 | 63 | Leeds | NaN | 48 | West Ham | NaN | 2 | 2 | True | 1.0 | 1.0 | 21.0 |
190 | 20 | 868136 | 2023-01-13T20:00:00+00:00 | 66 | Aston Villa | True | 63 | Leeds | False | 2 | 1 | False | 3.0 | 0.0 | 21.0 |
204 | 21 | 868150 | 2023-01-22T14:00:00+00:00 | 63 | Leeds | NaN | 55 | Brentford | NaN | 0 | 0 | True | 1.0 | 1.0 | 22.0 |
217 | 22 | 868163 | 2023-02-05T14:00:00+00:00 | 65 | Nottingham Forest | True | 63 | Leeds | False | 1 | 0 | False | 3.0 | 0.0 | 22.0 |
224 | 23 | 868170 | 2023-02-12T14:00:00+00:00 | 63 | Leeds | False | 33 | Manchester United | True | 0 | 2 | False | 0.0 | 3.0 | 22.0 |
234 | 24 | 868180 | 2023-02-18T15:00:00+00:00 | 45 | Everton | True | 63 | Leeds | False | 1 | 0 | False | 3.0 | 0.0 | 22.0 |
244 | 25 | 868190 | 2023-02-25T15:00:00+00:00 | 63 | Leeds | True | 41 | Southampton | False | 1 | 0 | False | 3.0 | 0.0 | 25.0 |
254 | 26 | 868200 | 2023-03-04T15:00:00+00:00 | 49 | Chelsea | True | 63 | Leeds | False | 1 | 0 | False | 3.0 | 0.0 | 25.0 |
264 | 27 | 868210 | 2023-03-11T15:00:00+00:00 | 63 | Leeds | NaN | 51 | Brighton | NaN | 2 | 2 | True | 1.0 | 1.0 | 26.0 |
279 | 28 | 868225 | 2023-03-18T15:00:00+00:00 | 39 | Wolves | False | 63 | Leeds | True | 2 | 4 | False | 0.0 | 3.0 | 29.0 |
281 | 29 | 868227 | 2023-04-01T14:00:00+00:00 | 42 | Arsenal | True | 63 | Leeds | False | 4 | 1 | False | 3.0 | 0.0 | 29.0 |
293 | 30 | 868239 | 2023-04-09T13:00:00+00:00 | 63 | Leeds | False | 52 | Crystal Palace | True | 1 | 5 | False | 0.0 | 3.0 | 29.0 |
303 | 31 | 868249 | 2023-04-17T19:00:00+00:00 | 63 | Leeds | False | 40 | Liverpool | True | 1 | 6 | False | 0.0 | 3.0 | 29.0 |
315 | 32 | 868261 | 2023-04-22T11:30:00+00:00 | 36 | Fulham | True | 63 | Leeds | False | 2 | 1 | False | 3.0 | 0.0 | 29.0 |
321 | 33 | 868267 | 2023-04-25T19:00:00+00:00 | 63 | Leeds | NaN | 46 | Leicester | NaN | 1 | 1 | True | 1.0 | 1.0 | 30.0 |
330 | 34 | 868276 | 2023-04-30T13:00:00+00:00 | 35 | Bournemouth | True | 63 | Leeds | False | 4 | 1 | False | 3.0 | 0.0 | 30.0 |
344 | 35 | 868290 | 2023-05-06T14:00:00+00:00 | 50 | Manchester City | True | 63 | Leeds | False | 2 | 1 | False | 3.0 | 0.0 | 30.0 |
356 | 36 | 868302 | 2023-05-13T11:30:00+00:00 | 63 | Leeds | NaN | 34 | Newcastle | NaN | 2 | 2 | True | 1.0 | 1.0 | 31.0 |
368 | 37 | 868314 | 2023-05-21T12:30:00+00:00 | 48 | West Ham | True | 63 | Leeds | False | 3 | 1 | False | 3.0 | 0.0 | 31.0 |
376 | 38 | 868322 | 2023-05-28T15:30:00+00:00 | 63 | Leeds | False | 47 | Tottenham | True | 1 | 4 | False | 0.0 | 3.0 | 31.0 |
Now I try to draw a graph for each of the sub dataframe using a for loop. Each of the sub chart represents a single team. Here I use the alt.concat method as an extra part.
charts = []
for df_sub in dataframes:
chart = alt.Chart(df_sub).mark_line().encode(
x='game_number',
y='teams_total_points'
)
charts.append(chart)
combined_chart = alt.concat(*charts, columns=5) # Concatenate the individual line charts
combined_chart
Now I try to combine all the 20 lines in one single chart using layer method. I use different colors representing different teams and add tooltip showing game number and points after each game. Each line represents the cumulative points for a teams through the season.
layers = [] # List to store the individual line layers
# Define a list of colors for differentiating teams
colors = ['blue', 'red', 'green', 'orange', 'purple', 'cyan', 'magenta', 'yellow', 'black', 'brown', 'pink', 'gray', 'olive', 'teal', 'navy', 'gold', 'indigo', 'lime', 'maroon', 'silver']
for i, df_sub in enumerate(dataframes):
team_color = alt.Color(value=colors[i % len(colors)]) # Assign a unique color for each team
tooltip = [alt.Tooltip("game_number:N"), alt.Tooltip("teams_total_points:Q")] # Add tooltips for game_number and teams_total_points
layer = alt.Chart(df_sub).mark_line().encode(
x='game_number',
y='teams_total_points',
color=team_color,
tooltip=tooltip
).properties(
title=f"Team {i+1}"
)
layers.append(layer)
combined_chart = alt.layer(*layers) # Combine all the line layers into one chart
combined_chart
The x-axis is too vague in the chart above, so i modify the code to change one unit on x-axis from 5 to 1, so that we can better see the team’s points after each match.
layers = []
colors = ['blue', 'red', 'green', 'orange', 'purple', 'cyan', 'magenta', 'yellow', 'black', 'brown', 'pink', 'gray', 'olive', 'teal', 'navy', 'gold', 'indigo', 'lime', 'maroon', 'silver']
for i, df_sub in enumerate(dataframes):
team_color = alt.Color(value=colors[i % len(colors)]) # Assign a unique color for each team
tooltip = [alt.Tooltip("game_number:N"), alt.Tooltip("teams_total_points:Q")] # Add tooltips for game_number and teams_total_points
layer = alt.Chart(df_sub).mark_line().encode(
x=alt.X('game_number:Q', axis=alt.Axis(values=list(range(41)))), # Show tick mark for every unit from 0 to 40
y='teams_total_points',
color=team_color,
tooltip=tooltip
).properties(
title=f"Teams' overall performance"
)
layers.append(layer)
combined_chart = alt.layer(*layers) # Combine all the line layers into one chart
combined_chart
Classify the teams into different categories#
I want to see how the teams can be divided into different categories using Kmeans. I manually set the teams to three categories (“Champions league qualifier”: usually top 4 in the league; “European league competitors”: top 10 teams(exclusive of the previous 4 teams) who could compete for other european cups; and “Fighting for relegation”: the last 10 teams who would fight for relegation(the last 3 teams would demote to EFL chamionship).)
I use Kmeans to classify the teams as an extra component in my project.
kmeans = KMeans(n_clusters=3)
kmeans.fit(df_final_ranking_dataframe)
KMeans(n_clusters=3)
labels = kmeans.labels_
df_final_ranking_dataframe['cluster'] = labels
df_final_ranking_dataframe
rank | points | cluster | |
---|---|---|---|
0 | 1 | 89 | 1 |
1 | 2 | 84 | 1 |
2 | 3 | 75 | 1 |
3 | 4 | 71 | 1 |
4 | 5 | 67 | 2 |
5 | 6 | 62 | 2 |
6 | 7 | 61 | 2 |
7 | 8 | 60 | 2 |
8 | 9 | 59 | 2 |
9 | 10 | 52 | 2 |
10 | 11 | 45 | 0 |
11 | 12 | 44 | 0 |
12 | 13 | 41 | 0 |
13 | 14 | 40 | 0 |
14 | 15 | 39 | 0 |
15 | 16 | 38 | 0 |
16 | 17 | 36 | 0 |
17 | 18 | 34 | 0 |
18 | 19 | 31 | 0 |
19 | 20 | 25 | 0 |
cluster_labels = {
0: 'Fighting for relegation',
1: 'Champions League Qualifier',
2: 'European leagues competitors'
}
# Create a new column with meaningful cluster labels
df_final_ranking_dataframe['cluster_label'] = df_final_ranking_dataframe['cluster'].map(cluster_labels)
df_final_ranking_dataframe
rank | points | cluster | cluster_label | |
---|---|---|---|---|
0 | 1 | 89 | 1 | Champions League Qualifier |
1 | 2 | 84 | 1 | Champions League Qualifier |
2 | 3 | 75 | 1 | Champions League Qualifier |
3 | 4 | 71 | 1 | Champions League Qualifier |
4 | 5 | 67 | 2 | European leagues competitors |
5 | 6 | 62 | 2 | European leagues competitors |
6 | 7 | 61 | 2 | European leagues competitors |
7 | 8 | 60 | 2 | European leagues competitors |
8 | 9 | 59 | 2 | European leagues competitors |
9 | 10 | 52 | 2 | European leagues competitors |
10 | 11 | 45 | 0 | Fighting for relegation |
11 | 12 | 44 | 0 | Fighting for relegation |
12 | 13 | 41 | 0 | Fighting for relegation |
13 | 14 | 40 | 0 | Fighting for relegation |
14 | 15 | 39 | 0 | Fighting for relegation |
15 | 16 | 38 | 0 | Fighting for relegation |
16 | 17 | 36 | 0 | Fighting for relegation |
17 | 18 | 34 | 0 | Fighting for relegation |
18 | 19 | 31 | 0 | Fighting for relegation |
19 | 20 | 25 | 0 | Fighting for relegation |
With the above data we see that the Kmeans method successfully divide the teams into three categories based on their final performances and the separation is close to our prediction.
Now I want to visualize the outcome of the cluster generated by Kmeans using DecisionTreeClassifier. I use max_leaf_nodes=3 since there are 3 clusters in total.
X = df_final_ranking_dataframe[['points']]
y = df_final_ranking_dataframe['cluster_label']
clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=2)
clf.fit(X, y)
DecisionTreeClassifier(max_leaf_nodes=3, random_state=2)
The tree below splits the data into three categories: “fighting for relegation”, “European leagues competitors”, and “Champions league qualifiers” as generated by Kmeans clustering.
fig = plt.figure()
_ = plot_tree(clf,
feature_names=clf.feature_names_in_,
class_names=clf.classes_,
filled=True)
Teams’ performances at home and away#
Many people says a team normally performs at home than away since the phenomenon aroused by their fans could be encouraging. Is it really the case?
With my math from Math 130A: Probability 1 class, I know the expected points received by a team from a match is 0 * 1/3 + 1 * 1/3 + 3 * 1/3 is 4/3, which is approximately 1.33. Using the new columns we create in df_matchday dataframe, we can calculate the mean points each team receives from a home game and an away game.
df_matchday["teams_home_points"].mean()
1.681578947368421
df_matchday["teams_away_points"].mean()
1.0894736842105264
Apparently the average points a team gets from a home game is much higher than that from an away game, so that it is more challenging for teams to perform better away than home. We want to continue to analyze if teams that perform better away results in higher points at the end of the season.
Now I want to see if the teams that perform better at away gets higher points at the end of the season.
First recall our teams_id_array contains all the indices of the teams, we make 20 sub dataframes, each corresponding to a team and containg all the team’s away matches.
teams_id_array
array([52, 36, 35, 63, 46, 34, 47, 45, 33, 48, 42, 66, 55, 51, 49, 40, 50,
65, 41, 39])
dataframes_away = []
for index in list(teams_id_array):
df_sub = df_matchday[df_matchday["teams_away_id"] == index].copy()
dataframes_away.append(df_sub)
Just take a look at one specific sub dataframe to see if it is really what we want.
dataframes_away[1]
Unnamed: 0 | fixture_id | fixture_date | teams_home_id | teams_home_name | teams_home_winner | teams_away_id | teams_away_name | teams_away_winner | goals_home | goals_away | teams_draw | teams_home_points | teams_away_points | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
19 | 19 | 867965 | 2022-08-13T14:00:00+00:00 | 39 | Wolves | NaN | 36 | Fulham | NaN | 0 | 0 | True | 1.0 | 1.0 |
30 | 30 | 867976 | 2022-08-27T16:30:00+00:00 | 42 | Arsenal | True | 36 | Fulham | False | 2 | 1 | False | 3.0 | 0.0 |
58 | 58 | 868004 | 2022-09-03T14:00:00+00:00 | 47 | Tottenham | True | 36 | Fulham | False | 2 | 1 | False | 3.0 | 0.0 |
77 | 77 | 868023 | 2022-09-16T19:00:00+00:00 | 65 | Nottingham Forest | False | 36 | Fulham | True | 2 | 3 | False | 0.0 | 3.0 |
99 | 99 | 868045 | 2022-10-09T13:00:00+00:00 | 48 | West Ham | True | 36 | Fulham | False | 3 | 1 | False | 3.0 | 0.0 |
123 | 123 | 868069 | 2022-10-23T13:00:00+00:00 | 63 | Leeds | False | 36 | Fulham | True | 2 | 3 | False | 0.0 | 3.0 |
144 | 144 | 868090 | 2022-11-05T15:00:00+00:00 | 50 | Manchester City | True | 36 | Fulham | False | 2 | 1 | False | 3.0 | 0.0 |
164 | 164 | 868110 | 2022-12-26T15:00:00+00:00 | 52 | Crystal Palace | False | 36 | Fulham | True | 0 | 3 | False | 0.0 | 3.0 |
187 | 187 | 868133 | 2023-01-03T19:45:00+00:00 | 46 | Leicester | False | 36 | Fulham | True | 0 | 1 | False | 0.0 | 3.0 |
196 | 196 | 868142 | 2023-01-15T14:00:00+00:00 | 34 | Newcastle | True | 36 | Fulham | False | 1 | 0 | False | 3.0 | 0.0 |
213 | 213 | 868159 | 2023-02-03T20:00:00+00:00 | 49 | Chelsea | NaN | 36 | Fulham | NaN | 0 | 0 | True | 1.0 | 1.0 |
232 | 232 | 868178 | 2023-02-18T15:00:00+00:00 | 51 | Brighton | False | 36 | Fulham | True | 0 | 1 | False | 0.0 | 3.0 |
252 | 252 | 868198 | 2023-03-06T20:00:00+00:00 | 55 | Brentford | True | 36 | Fulham | False | 3 | 2 | False | 3.0 | 0.0 |
275 | 275 | 868221 | 2023-05-03T19:00:00+00:00 | 40 | Liverpool | True | 36 | Fulham | False | 1 | 0 | False | 3.0 | 0.0 |
280 | 280 | 868226 | 2023-04-01T14:00:00+00:00 | 35 | Bournemouth | True | 36 | Fulham | False | 2 | 1 | False | 3.0 | 0.0 |
302 | 302 | 868248 | 2023-04-15T14:00:00+00:00 | 45 | Everton | False | 36 | Fulham | True | 1 | 3 | False | 0.0 | 3.0 |
326 | 326 | 868272 | 2023-04-25T18:45:00+00:00 | 66 | Aston Villa | True | 36 | Fulham | False | 1 | 0 | False | 3.0 | 0.0 |
359 | 359 | 868305 | 2023-05-13T14:00:00+00:00 | 41 | Southampton | False | 36 | Fulham | True | 0 | 2 | False | 0.0 | 3.0 |
378 | 378 | 868324 | 2023-05-28T15:30:00+00:00 | 33 | Manchester United | True | 36 | Fulham | False | 2 | 1 | False | 3.0 | 0.0 |
We only want to choose the “teams_away_id” and “teams_away_points” columns to make our prediction, so we first extract both columns to make a new dataframe.
sub_dataframe_away = dataframes_away[1][["teams_away_id", "teams_away_points"]]
sub_dataframe_away
teams_away_id | teams_away_points | |
---|---|---|
19 | 36 | 1.0 |
30 | 36 | 0.0 |
58 | 36 | 0.0 |
77 | 36 | 3.0 |
99 | 36 | 0.0 |
123 | 36 | 3.0 |
144 | 36 | 0.0 |
164 | 36 | 3.0 |
187 | 36 | 3.0 |
196 | 36 | 0.0 |
213 | 36 | 1.0 |
232 | 36 | 3.0 |
252 | 36 | 0.0 |
275 | 36 | 0.0 |
280 | 36 | 0.0 |
302 | 36 | 3.0 |
326 | 36 | 0.0 |
359 | 36 | 3.0 |
378 | 36 | 0.0 |
We want to reshape the dataframe so that the row is the id of the team and the column labels are match 1-19 (one team plays 19 away games and 19 home games each season). I use the pivot method in dataframes to have the match numbers as columns and reset the index to make the team IDs a column.
I refer to ChatGPT here to teach me about the pivot method as an extra component in my project.
sub_dataframe_away = sub_dataframe_away.copy()
sub_dataframe_away.loc[:, 'match_number'] = range(1, 20)
df_reshaped = sub_dataframe_away.pivot(index='teams_away_id', columns='match_number', values='teams_away_points')
df_reshaped = df_reshaped.reset_index()
df_reshaped.columns = ['team_id'] + list(range(1, 20))
df_reshaped.loc[:, 'team_id'] = df_reshaped['team_id']
df_reshaped
team_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 36 | 1.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 |
Above is how to perform the pivot method to reshape one of the teams. I use a for loop below iterating over the indices in the dataframes_away list to make 20 sub dataframes like the one above. Then I use pd.concat method to combine all 20 sub dataframes into one dataframe.
processed_dataframes = []
# Loop over each dataframe in your list
for idx, df in enumerate(dataframes_away):
sub_dataframe_away = df[["teams_away_id", "teams_away_points"]].copy()
sub_dataframe_away.loc[:, 'match_number'] = range(1, 20)
# Pivot the dataframe to have the match numbers as columns
df_reshaped = sub_dataframe_away.pivot(index='teams_away_id', columns='match_number', values='teams_away_points')
# Reset the index to make the team IDs a column
df_reshaped = df_reshaped.reset_index()
# Rename the columns to have the y labels from 1 to 19
df_reshaped.columns = ['team_id'] + list(range(1, 20))
df_reshaped.loc[:, 'team_id'] = df_reshaped['team_id']
# Append the reshaped dataframe to the list
processed_dataframes.append(df_reshaped)
# Concatenate all processed dataframes
final_df = pd.concat(processed_dataframes, ignore_index=True)
final_df
team_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 52 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 |
1 | 36 | 1.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 |
2 | 35 | 0.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 |
3 | 63 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 46 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
5 | 34 | 1.0 | 1.0 | 0.0 | 3.0 | 3.0 | 1.0 | 3.0 | 3.0 | 3.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 1.0 | 1.0 |
6 | 47 | 1.0 | 3.0 | 1.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 1.0 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 |
7 | 45 | 0.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 3.0 | 1.0 |
8 | 33 | 0.0 | 3.0 | 3.0 | 1.0 | 0.0 | 3.0 | 1.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 | 0.0 | 3.0 |
9 | 48 | 0.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 0.0 |
10 | 42 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 |
11 | 66 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 3.0 | 1.0 | 3.0 | 1.0 | 0.0 | 0.0 | 1.0 |
12 | 55 | 1.0 | 0.0 | 1.0 | 3.0 | 1.0 | 0.0 | 0.0 | 1.0 | 3.0 | 3.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 | 3.0 |
13 | 51 | 3.0 | 3.0 | 0.0 | 3.0 | 1.0 | 0.0 | 0.0 | 3.0 | 3.0 | 3.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 |
14 | 49 | 3.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 |
15 | 40 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 | 0.0 | 0.0 | 3.0 | 3.0 | 3.0 | 1.0 |
16 | 50 | 3.0 | 1.0 | 1.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 3.0 | 3.0 | 3.0 | 1.0 | 3.0 | 3.0 | 0.0 |
17 | 65 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
18 | 41 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
19 | 39 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 1.0 | 0.0 | 3.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I combine final_df to df_standing to fit the away data above to predict the final result.
combined_df_away_result = pd.concat([final_df, df_standing], axis=1)
combined_df_away_result
team_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | home_draw | home_lose | home_goals_for | home_goals_against | away | away_wins | away_draw | away_lose | away_goals_for | away_goals_against | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 52 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 | 3.0 | ... | 1 | 1 | 60 | 17 | 19 | 11 | 4 | 4 | 34 | 16 |
1 | 36 | 1.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | ... | 3 | 2 | 53 | 25 | 19 | 12 | 3 | 4 | 35 | 18 |
2 | 35 | 0.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 3 | 1 | 36 | 10 | 19 | 8 | 3 | 8 | 22 | 33 |
3 | 63 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 3.0 | 0.0 | 1.0 | ... | 6 | 2 | 36 | 14 | 19 | 8 | 8 | 3 | 32 | 19 |
4 | 46 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | ... | 5 | 1 | 46 | 17 | 19 | 6 | 5 | 8 | 29 | 30 |
5 | 34 | 1.0 | 1.0 | 0.0 | 3.0 | 3.0 | 1.0 | 3.0 | 3.0 | 3.0 | ... | 4 | 5 | 37 | 21 | 19 | 8 | 4 | 7 | 35 | 32 |
6 | 47 | 1.0 | 3.0 | 1.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 1.0 | ... | 2 | 5 | 33 | 21 | 19 | 6 | 5 | 8 | 18 | 25 |
7 | 45 | 0.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 1 | 6 | 37 | 25 | 19 | 6 | 5 | 8 | 33 | 38 |
8 | 33 | 0.0 | 3.0 | 3.0 | 1.0 | 0.0 | 3.0 | 1.0 | 0.0 | 3.0 | ... | 7 | 2 | 35 | 18 | 19 | 5 | 7 | 7 | 23 | 28 |
9 | 48 | 0.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 5 | 6 | 31 | 29 | 19 | 7 | 2 | 10 | 24 | 24 |
10 | 42 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 1.0 | 3.0 | 3.0 | 3.0 | ... | 7 | 5 | 21 | 23 | 19 | 4 | 5 | 10 | 19 | 26 |
11 | 66 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.0 | 0.0 | 3.0 | ... | 7 | 6 | 20 | 19 | 19 | 5 | 4 | 10 | 18 | 28 |
12 | 55 | 1.0 | 0.0 | 1.0 | 3.0 | 1.0 | 0.0 | 0.0 | 1.0 | 3.0 | ... | 3 | 7 | 19 | 20 | 19 | 2 | 5 | 12 | 12 | 38 |
13 | 51 | 3.0 | 3.0 | 0.0 | 3.0 | 1.0 | 0.0 | 0.0 | 3.0 | 3.0 | ... | 4 | 7 | 26 | 24 | 19 | 3 | 3 | 13 | 16 | 31 |
14 | 49 | 3.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | 1.0 | 0.0 | 0.0 | ... | 4 | 9 | 20 | 28 | 19 | 5 | 2 | 12 | 17 | 43 |
15 | 40 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | ... | 6 | 5 | 27 | 24 | 19 | 1 | 5 | 13 | 11 | 44 |
16 | 50 | 3.0 | 1.0 | 1.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 3.0 | ... | 3 | 10 | 16 | 27 | 19 | 2 | 9 | 8 | 18 | 30 |
17 | 65 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 4 | 10 | 23 | 27 | 19 | 4 | 3 | 12 | 28 | 41 |
18 | 41 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | ... | 7 | 7 | 26 | 37 | 19 | 2 | 3 | 14 | 22 | 41 |
19 | 39 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | ... | 5 | 12 | 19 | 37 | 19 | 4 | 2 | 13 | 17 | 36 |
20 rows × 43 columns
X = combined_df_away_result.iloc[:, 1:20]
y = combined_df_away_result[['points']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LogisticRegression()
clf.fit(X, y.values.ravel())
LogisticRegression()
clf.score(X_test, y_test)
1.0
combined_df_away_result['predicted_result'] = clf.predict(X)
combined_df_away_result[['points', 'predicted_result']]
points | predicted_result | |
---|---|---|
0 | 89 | 89 |
1 | 84 | 84 |
2 | 75 | 75 |
3 | 71 | 71 |
4 | 67 | 67 |
5 | 62 | 62 |
6 | 61 | 61 |
7 | 60 | 60 |
8 | 59 | 59 |
9 | 52 | 52 |
10 | 45 | 45 |
11 | 44 | 44 |
12 | 41 | 41 |
13 | 40 | 40 |
14 | 39 | 39 |
15 | 38 | 38 |
16 | 36 | 36 |
17 | 34 | 34 |
18 | 31 | 31 |
19 | 25 | 25 |
Incredibly my above data has a score of 1.0 under LogisticRegressiom, which is a sign of overfitting. I think even though the answer suggests that better performance away does suggest better points at the end of the season, there might be some problems in fitting my data. I have three explanations for this problem: 1) the test size might be too small since it only contains 20 sets of data 2) I might have mistakenly leak the test data to train the data 3) It is also possible that the dataset is not difficult to learn.
Predicting match results using df_home data#
Then I want to find if the home team results can be predicted using logistic regression.
I first combine the df_matchday and df_home dataframesusing pd.concat. Then I choose some of the useful columns (‘Shots on Goal’, ‘Shots off Goal’, ‘Total Shots’, ‘Blocked Shots’, ‘Shots insidebox’, ‘Shots outsidebox’, ‘Corner Kicks’, ‘Offsides’) from the df_matchday dataframe to predict the “teams_home_points” column in df_home. I drop the rows which contain “nan” in those columns.
combined_df_matchday = pd.concat([df_home, df_matchday], axis=1)
combined_df_matchday.dropna(subset=['Shots on Goal', 'Shots off Goal', 'Total Shots', 'Blocked Shots', 'Shots insidebox', 'Shots outsidebox', 'Corner Kicks', 'Offsides'], inplace=True)
combined_df_matchday
fixture id | Home team id | Home team name | Shots on Goal | Shots off Goal | Total Shots | Blocked Shots | Shots insidebox | Shots outsidebox | Fouls | ... | teams_home_name | teams_home_winner | teams_away_id | teams_away_name | teams_away_winner | goals_home | goals_away | teams_draw | teams_home_points | teams_away_points | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 867946 | 52 | Crystal Palace | 2.0 | 2.0 | 10 | 6.0 | 9.0 | 1.0 | 16 | ... | Crystal Palace | False | 42 | Arsenal | True | 0 | 2 | False | 0.0 | 3.0 |
1 | 867947 | 36 | Fulham | 3.0 | 2.0 | 9 | 4.0 | 7.0 | 2.0 | 7 | ... | Fulham | NaN | 40 | Liverpool | NaN | 2 | 2 | True | 1.0 | 1.0 |
2 | 867948 | 35 | Bournemouth | 3.0 | 2.0 | 7 | 2.0 | 6.0 | 1.0 | 18 | ... | Bournemouth | True | 66 | Aston Villa | False | 2 | 0 | False | 3.0 | 0.0 |
4 | 867950 | 46 | Leicester | 5.0 | 6.0 | 14 | 3.0 | 5.0 | 9.0 | 6 | ... | Leicester | NaN | 55 | Brentford | NaN | 2 | 2 | True | 1.0 | 1.0 |
5 | 867951 | 34 | Newcastle | 10.0 | 8.0 | 23 | 5.0 | 16.0 | 7.0 | 9 | ... | Newcastle | True | 65 | Nottingham Forest | False | 2 | 0 | False | 3.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
373 | 868319 | 49 | Chelsea | 5.0 | 9.0 | 22 | 8.0 | 13.0 | 9.0 | 9 | ... | Chelsea | NaN | 34 | Newcastle | NaN | 1 | 1 | True | 1.0 | 1.0 |
374 | 868320 | 52 | Crystal Palace | 3.0 | 8.0 | 15 | 4.0 | 9.0 | 6.0 | 9 | ... | Crystal Palace | NaN | 65 | Nottingham Forest | NaN | 1 | 1 | True | 1.0 | 1.0 |
376 | 868322 | 63 | Leeds | 2.0 | 7.0 | 19 | 10.0 | 14.0 | 5.0 | 8 | ... | Leeds | False | 47 | Tottenham | True | 1 | 4 | False | 0.0 | 3.0 |
377 | 868323 | 46 | Leicester | 4.0 | 6.0 | 13 | 3.0 | 8.0 | 5.0 | 8 | ... | Leicester | True | 48 | West Ham | False | 2 | 1 | False | 3.0 | 0.0 |
378 | 868324 | 33 | Manchester United | 8.0 | 7.0 | 21 | 6.0 | 12.0 | 9.0 | 14 | ... | Manchester United | True | 36 | Fulham | False | 2 | 1 | False | 3.0 | 0.0 |
281 rows × 34 columns
X = combined_df_matchday[['Shots on Goal', 'Shots off Goal', 'Total Shots', 'Blocked Shots', 'Shots insidebox', 'Shots outsidebox', 'Corner Kicks', 'Offsides']]
y = combined_df_matchday[['teams_home_points']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=500)
clf.fit(X, y.values.ravel())
LogisticRegression(max_iter=500)
We can see in the score of the prediction is not high (only above 50 percent), which means that football matches are unpredictable.
clf.score(X, y)
0.5338078291814946
From the coefficients below we can see that “shots on goal“‘s influence on the final result is the greatest, while other factors have fluctuating importance.
clf.coef_
array([[-1.50886102e-01, 3.00392034e-02, -3.33541894e-02,
8.74927091e-02, -2.45592879e-02, -8.79490146e-03,
1.93229115e-02, -7.55692733e-02],
[-2.60558822e-02, 4.86229868e-02, 1.25542066e-02,
-1.00128981e-02, -2.38983995e-02, 3.64526061e-02,
-9.08832105e-05, 7.93281577e-02],
[ 1.76941984e-01, -7.86621902e-02, 2.07999828e-02,
-7.74798110e-02, 4.84576875e-02, -2.76577046e-02,
-1.92320283e-02, -3.75888439e-03]])
The classifier identifies three different classes from training the data. It can provide unique class labels present in the dataset.
clf.classes_
array([0., 1., 3.])
Summary#
Either summarize what you did, or summarize the results. Maybe 3 sentences.
In this project I mainly summarize the dataset of premier league season 2022-23. I make dataframes and draw figures of teams’ overall performances and points related to matches. In addition, I also try to fit and predict the distribution of overall data to cluster the teams and predict the results of matches and final performances using LogisticRegression. I refer to outside references to learn about extra ideas such as Kmeans, seaborn, and pivot method to better analyze my data. My result summarizes and better visualizes the teams’ performance and shows some predictable and unpredictable elements in soccer.
References#
Your code above should include references. Here is some additional space for references.
What is the source of your dataset(s)?
My source is from the following website with URL https://www.kaggle.com/datasets/afnanurrahim/premier-league-2022-23
List any other references that you found helpful.
Previous students’ work such as Songhan (Hanson) Hu’s project (this one teaches me about how to better use Kmeans method) from Spring 2022 with URL https://christopherdavisuci.github.io/UCI-Math-10-S22/Proj/StudentProjects/SonghanHu.
Premier league final report telling me about team’s possible classification such as the URL https://www.sportingnews.com/us/soccer/news/english-premier-league-table-2022-23-updated-epl-standings/jh3khedtjm3bbxccudntwq9h
ChatGPT that helps me debug and teach me about pivot method.
Submission#
Using the Share button at the top right, enable Comment privileges for anyone with a link to the project. Then submit that link on Canvas.
Created in Deepnote