Analyzing Winning Factors in the Premier League#

Author: Zheyu Wu

Course Project, UC Irvine, Math 10, S23



The English Premier League (EPL) is one of the most popular football leagues in the world. It consists of 20 teams, each of which plays 38 matches over a season: every team faces every other team once at home and once away. A win earns 3 points, a draw earns 1 point for each team, and a loss earns none. The team with the most points at the end of the season wins the league. For this project, I want to investigate which factors contribute the most to winning matches in the Premier League.
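As a small illustration of the scoring rule described above, the sketch below computes a team's season points from its record. The function name `league_points` is made up for this example. The dataset used later stores wins and losses but no draws, so draws are inferred here as 38 minus wins minus losses, which is an assumption of this sketch.

```python
MATCHES_PER_SEASON = 38  # each team plays the other 19 teams twice


def league_points(wins, losses, matches=MATCHES_PER_SEASON):
    """Return season points: 3 per win, 1 per draw, 0 per loss."""
    # Assumption: draws are not stored, so infer them from the match count.
    draws = matches - wins - losses
    if draws < 0:
        raise ValueError("wins + losses exceeds matches played")
    return 3 * wins + draws


# Manchester United, 2006-2007 in the dataset: 28 wins, 5 losses
print(league_points(28, 5))  # 89
```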

Importing and Cleaning the Data#

import pandas as pd
import numpy as np
import altair as alt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from mplsoccer.pitch import Pitch
import seaborn as sns
df = pd.read_csv('stats.csv')
team wins losses goals total_yel_card total_red_card total_scoring_att ontarget_scoring_att hit_woodwork att_hd_goal ... total_cross corner_taken touches big_chance_missed clearance_off_line dispossessed penalty_save total_high_claim punches season
0 Manchester United 28.0 5.0 83.0 60.0 1.0 698.0 256.0 21.0 12.0 ... 918.0 258.0 25686.0 NaN 1.0 NaN 2.0 37.0 25.0 2006-2007
1 Chelsea 24.0 3.0 64.0 62.0 4.0 636.0 216.0 14.0 16.0 ... 897.0 231.0 24010.0 NaN 2.0 NaN 1.0 74.0 22.0 2006-2007
2 Liverpool 20.0 10.0 57.0 44.0 0.0 668.0 214.0 15.0 8.0 ... 1107.0 282.0 24150.0 NaN 1.0 NaN 0.0 51.0 27.0 2006-2007
3 Arsenal 19.0 8.0 63.0 59.0 3.0 638.0 226.0 19.0 10.0 ... 873.0 278.0 25592.0 NaN 1.0 NaN 0.0 88.0 27.0 2006-2007
4 Tottenham Hotspur 17.0 12.0 57.0 48.0 3.0 520.0 184.0 6.0 5.0 ... 796.0 181.0 22200.0 NaN 2.0 NaN 0.0 51.0 24.0 2006-2007
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
235 Huddersfield Town 9.0 19.0 28.0 62.0 3.0 362.0 109.0 8.0 5.0 ... 765.0 165.0 22619.0 21.0 6.0 416.0 2.0 31.0 24.0 2017-2018
236 Swansea City 8.0 21.0 28.0 51.0 1.0 338.0 103.0 8.0 3.0 ... 694.0 150.0 22775.0 26.0 1.0 439.0 3.0 44.0 15.0 2017-2018
237 Southampton 7.0 16.0 37.0 63.0 2.0 450.0 145.0 15.0 7.0 ... 800.0 227.0 24639.0 37.0 4.0 379.0 1.0 29.0 13.0 2017-2018
238 Stoke City 7.0 19.0 35.0 62.0 1.0 384.0 132.0 8.0 8.0 ... 598.0 136.0 20368.0 33.0 3.0 402.0 0.0 27.0 14.0 2017-2018
239 West Bromwich Albion 6.0 19.0 31.0 73.0 1.0 378.0 114.0 7.0 10.0 ... 784.0 176.0 20552.0 28.0 3.0 446.0 0.0 40.0 5.0 2017-2018

240 rows × 42 columns

# drop every row that has a missing value, then check the resulting shape
df = df.dropna(axis=0).copy()
df.shape
(160, 42)
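The check-then-drop step can be sketched on a made-up table (the toy values are not from stats.csv): `isna().sum()` counts missing entries per column, and `dropna(axis=0)` removes every row that has at least one. In the real data this step reduces 240 rows to 160.

```python
import numpy as np
import pandas as pd

# Toy table: one row lacks a statistic, as the early seasons do in stats.csv.
toy = pd.DataFrame({
    "team": ["A", "B", "C"],
    "wins": [20.0, 15.0, 9.0],
    "dispossessed": [np.nan, 410.0, 439.0],
})

missing_per_column = toy.isna().sum()   # number of NaN values in each column
cleaned = toy.dropna(axis=0).copy()     # keep only the complete rows

print(missing_per_column)
print(cleaned.shape)
```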

To keep the analysis representative of established Premier League teams, I only analyze teams that appear in at least three seasons of the data. Teams that stay for fewer seasons are relatively weak and probably not on the same level as the other Premier League teams.

team_seasons = df.groupby('team')['season'].nunique().reset_index()
filtered_teams = team_seasons[team_seasons['season'] >= 3]['team']
df = df[df['team'].isin(filtered_teams)]
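An equivalent way to write this filter, sketched here on a made-up table rather than the real data, is `groupby(...).transform("nunique")`, which attaches each team's season count to every row so the filter is a single boolean mask.

```python
import pandas as pd

# Toy table: team B appears in only two seasons.
toy = pd.DataFrame({
    "team": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "season": ["2010-2011", "2011-2012", "2012-2013",
               "2010-2011", "2011-2012",
               "2010-2011", "2011-2012", "2012-2013", "2013-2014"],
})

# Number of distinct seasons per team, repeated on every row of that team.
seasons_per_team = toy.groupby("team")["season"].transform("nunique")
kept = toy[seasons_per_team >= 3]

print(sorted(kept["team"].unique()))
```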

After cleaning the dataset, I look at each team's average statistics to get a rough idea of which factors I should expect to have a positive or negative impact on winning.

df_avg = df.groupby('team', as_index=False).mean(numeric_only=True)
team wins losses goals total_yel_card total_red_card total_scoring_att ontarget_scoring_att hit_woodwork att_hd_goal ... backward_pass total_cross corner_taken touches big_chance_missed clearance_off_line dispossessed penalty_save total_high_claim punches
11 Manchester City 24.625 6.375 82.625 65.500 2.750 650.00 227.000 18.500 8.500 ... 3179.500 813.75 267.625 29239.875 58.000 3.125 487.375 1.000 41.750 26.125
12 Manchester United 22.500 7.000 68.750 62.625 2.125 549.50 195.375 15.875 12.000 ... 2941.750 910.00 229.375 27705.625 47.500 3.125 461.500 0.375 33.250 14.250
4 Chelsea 21.875 7.750 69.875 60.750 3.000 626.25 214.500 14.250 10.875 ... 2889.875 812.00 239.000 27226.375 43.625 4.000 463.750 0.250 55.125 19.250

3 rows × 41 columns

So my rough expectation is that “attacking” statistics should have a positive impact on wins and “defending” statistics a negative one. The correlation of each averaged statistic with wins is:

wins                    1.000000
losses                 -0.968590
goals                   0.968633
total_yel_card         -0.257762
total_red_card         -0.180572
total_scoring_att       0.833338
ontarget_scoring_att    0.900001
hit_woodwork            0.718569
att_hd_goal             0.502109
att_pen_goal            0.645460
att_freekick_goal       0.349527
att_ibox_goal           0.967825
att_obox_goal           0.681283
goal_fastbreak          0.643157
total_offside           0.211573
clean_sheet             0.930604
goals_conceded         -0.917405
saves                  -0.205303
outfielder_block       -0.768150
interception           -0.210378
total_tackle            0.087433
last_man_tackle         0.131081
total_clearance        -0.614563
head_clearance         -0.377820
own_goals              -0.470851
penalty_conceded       -0.554425
pen_goals_conceded     -0.531732
total_pass              0.841452
total_through_ball      0.776743
total_long_balls       -0.413157
backward_pass           0.739098
total_cross             0.444769
corner_taken            0.901280
touches                 0.849492
big_chance_missed       0.858744
clearance_off_line     -0.642533
dispossessed            0.514228
penalty_save           -0.285913
total_high_claim       -0.386829
punches                -0.010322
Name: wins, dtype: float64
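The code that produced the correlations above is not shown. One plausible way to obtain such a column, sketched here on a made-up four-team table rather than the real `df_avg`, is `DataFrame.corr` followed by selecting the `wins` column.

```python
import pandas as pd

# Toy per-team averages; the numbers are invented for illustration.
toy_avg = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "wins": [24.0, 20.0, 12.0, 8.0],
    "goals": [80.0, 65.0, 45.0, 30.0],
    "goals_conceded": [30.0, 38.0, 55.0, 62.0],
})

# Pearson correlation of every numeric column with wins.
corr_with_wins = toy_avg.corr(numeric_only=True)["wins"]
print(corr_with_wins)
```

A value near 1 or -1 only says the two averages move together across teams; it does not say which one causes the other.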

Altair Chart to Visualize the Correlation#

selected_columns = [col for col in df.columns if col not in ['wins', 'season','team','losses']]

From Worksheet 7, I learned how to build an Altair chart made of several sub-charts, one per column name. Here each sub-chart plots one of the selected_columns against wins, which shows the correlations graphically.

def make_chart(c):
    chart = alt.Chart(df_avg).mark_circle().encode(
        x = "wins",
        y = c,
        color = alt.Color("team", scale=alt.Scale(scheme="darkblue")),
        tooltip = ["team","wins"]
    )
    return chart
chart_list = [make_chart(c) for c in selected_columns]
total_chart = alt.vconcat(*chart_list)

Feature Importance for Wins#

I will use a random forest to analyze the importance of each feature for wins in the Premier League.

X_train, X_test, y_train, y_test = train_test_split(df_avg[selected_columns], df_avg['wins'], test_size=0.3, random_state=22)

I used ChatGPT to find code that helps decide how many leaf nodes to allow. Based on the resulting train and test errors, I can choose an appropriate number for predicting my data.

df_err = pd.DataFrame(columns=['leaves', 'error', 'set'])

for i in range(2, 41):
    regressor = RandomForestRegressor(n_estimators=100, max_leaf_nodes=i, random_state=42).fit(X_train, y_train)
    y_train_pred = regressor.predict(X_train)
    train_error = mean_squared_error(y_train, y_train_pred)
    y_test_pred = regressor.predict(X_test)
    test_error = mean_squared_error(y_test, y_test_pred)
    df_err.loc[len(df_err)] = {'leaves': i, 'error': train_error, 'set': 'train'}
    df_err.loc[len(df_err)] = {'leaves': i, 'error': test_error, 'set': 'test'}

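# the encodings were cut off in this cell; they are inferred here from the df_err columns
c = alt.Chart(df_err).mark_line().encode(
    x="leaves",
    y="error",
    color="set"
)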
   leaves     error    set
0       2  3.399707  train
1       2  2.408625   test
2       3  1.794364  train
3       3  2.094462   test
4       4  1.110214  train
..    ...       ...    ...
73     38  2.098586   test
74     39  0.678814  train
75     39  2.098586   test
76     40  0.678814  train
77     40  2.098586   test

[78 rows x 3 columns]

The best choice of max_leaf_nodes should be 5
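One way to back up a choice like this is to read the leaf count with the lowest test error out of a table shaped like `df_err`. The sketch below uses a few invented error values in the same three-column layout, not the real results.

```python
import pandas as pd

# Toy error table in the same layout as df_err; values are illustrative only.
toy_err = pd.DataFrame({
    "leaves": [2, 2, 3, 3, 4, 4, 5, 5],
    "error": [3.4, 2.4, 1.8, 2.1, 1.1, 2.0, 0.9, 1.9],
    "set": ["train", "test"] * 4,
})

# Only the test rows say how well the model generalizes.
test_rows = toy_err[toy_err["set"] == "test"]
# astype(float) guards against an object-dtype column built row by row.
best_row = test_rows["error"].astype(float).idxmin()
best_leaves = int(test_rows.loc[best_row, "leaves"])

print(best_leaves)
```

Training error keeps falling as leaves are added, so it is the test curve that should guide the choice.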

rfr = RandomForestRegressor(n_estimators=1000, max_leaf_nodes=5, random_state=42).fit(X_train, y_train)
RandomForestRegressor(max_leaf_nodes=5, n_estimators=1000, random_state=42)
pd.Series(rfr.feature_importances_, index=selected_columns).sort_values(ascending=False)
outfielder_block        0.090093
goals_conceded          0.086272
goals                   0.065844
total_scoring_att       0.065017
total_through_ball      0.062856
backward_pass           0.059776
att_ibox_goal           0.059725
ontarget_scoring_att    0.059307
clean_sheet             0.057036
att_obox_goal           0.048713
big_chance_missed       0.048022
total_pass              0.042571
touches                 0.041945
corner_taken            0.033134
clearance_off_line      0.029099
hit_woodwork            0.025168
att_freekick_goal       0.022858
att_pen_goal            0.018243
goal_fastbreak          0.016872
penalty_conceded        0.009769
total_clearance         0.006979
own_goals               0.006838
penalty_save            0.005824
pen_goals_conceded      0.005630
total_high_claim        0.005070
dispossessed            0.004506
total_tackle            0.003362
att_hd_goal             0.003285
total_yel_card          0.002817
total_offside           0.002622
interception            0.002109
saves                   0.001893
last_man_tackle         0.001621
total_long_balls        0.001520
total_cross             0.001337
head_clearance          0.001103
total_red_card          0.000681
punches                 0.000482
dtype: float64

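We can see that “outfielder_block” and “goals_conceded” have the highest feature importance for wins, and both are defending statistics. The random forest model therefore suggests that defending is key to winning, although the top importances are close to those of several attacking statistics such as “goals” and “total_scoring_att”.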

LinearRegression Model#

lin = LinearRegression().fit(df_avg[selected_columns], df_avg['wins'])
pd.Series(lin.coef_, index=lin.feature_names_in_)
goals                   0.170069
total_yel_card         -0.008121
total_red_card         -0.067695
total_scoring_att      -0.004115
ontarget_scoring_att    0.085436
hit_woodwork           -0.033554
att_hd_goal            -0.023789
att_pen_goal           -0.039289
att_freekick_goal      -0.021382
att_ibox_goal           0.062600
att_obox_goal           0.124503
goal_fastbreak          0.091819
total_offside          -0.031666
clean_sheet             0.089106
goals_conceded         -0.122839
saves                   0.015306
outfielder_block       -0.078234
interception            0.001499
total_tackle           -0.003998
last_man_tackle         0.161247
total_clearance        -0.005070
head_clearance          0.022121
own_goals               0.013613
penalty_conceded        0.032624
pen_goals_conceded      0.075956
total_pass              0.001883
total_through_ball     -0.028828
total_long_balls       -0.001374
backward_pass          -0.002561
total_cross             0.002376
corner_taken           -0.065405
touches                -0.001482
big_chance_missed      -0.021757
clearance_off_line     -0.101263
dispossessed           -0.011044
penalty_save           -0.051346
total_high_claim        0.011235
punches                -0.012115
dtype: float64

The results show that “goals” has the largest positive coefficient, and the goal-related inputs “att_obox_goal” (goals from outside the penalty area), “goal_fastbreak” (goals from counterattacks) and “att_ibox_goal” (goals from inside the penalty area) are positive as well. “goals_conceded” has the most negative coefficient, which is reasonable since it reflects a team's defending level. What I did not expect is that the “passing” inputs have coefficients close to zero, so they appear to matter little for wins.
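One caveat when reading coefficients this way: raw linear regression coefficients depend on each input's units, so a statistic counted in the thousands (like total_pass) gets a small coefficient even if it matters. A common remedy, not used in the analysis above, is to standardize the inputs first. The sketch below uses synthetic data; the feature values and the target formula are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
goals = rng.normal(50, 10, n)             # synthetic feature, scale of tens
total_pass = rng.normal(15000, 3000, n)   # synthetic feature, scale of thousands
# Synthetic target: one standard deviation of either feature adds about 3 wins.
wins = 0.3 * goals + 0.001 * total_pass

X = np.column_stack([goals, total_pass])
raw_coef = LinearRegression().fit(X, wins).coef_
std_coef = LinearRegression().fit(StandardScaler().fit_transform(X), wins).coef_

print(raw_coef)  # very different magnitudes, driven by the units
print(std_coef)  # similar magnitudes once both inputs share a scale
```

With standardized inputs, each coefficient reads as the change in wins per one standard deviation of that input, which makes the magnitudes comparable across statistics.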

Pass Map Visualization#

After finding the key inputs that contribute to wins, I am curious about the passing inputs. total_pass and total_cross have small positive coefficients, while the other passing inputs (total_through_ball, total_long_balls and backward_pass) have negative ones. So I want to go further on this topic by drawing a pass map on the pitch below.

# Here I look at KDB's passes for Man City vs TOT to see what a top midfielder's passes look like
df2 = pd.read_csv('KDBpasses.csv')
player minute second x y type outcome endX endY
0 KDB 45 0 50 50 Pass Successful 40 43
1 KDB 45 25 63 48 Pass Successful 75 83
2 KDB 46 4 74 58 Pass Successful 71 65
3 KDB 46 7 76 68 Pass Successful 95 78
4 KDB 46 55 100 1 Pass Unsuccessful 96 38
5 KDB 47 42 73 25 Pass Successful 70 54
6 KDB 49 47 75 43 Pass Unsuccessful 82 38
7 KDB 50 10 79 31 Pass Successful 87 24
8 KDB 50 17 82 52 Pass Successful 93 68
9 KDB 50 24 98 62 Pass Unsuccessful 98 61
10 KDB 55 22 30 12 Pass Successful 22 24
11 KDB 56 24 38 42 Pass Successful 37 51
12 KDB 56 36 53 44 Pass Successful 43 62
13 KDB 56 43 61 63 Pass Successful 64 63
14 KDB 57 2 96 33 Pass Successful 91 34
15 KDB 62 25 46 45 Pass Successful 38 43
16 KDB 62 31 49 67 Pass Successful 61 96
17 KDB 63 36 64 22 Pass Successful 63 10
18 KDB 63 40 74 7 Pass Successful 66 9
19 KDB 63 42 70 6 Pass Successful 63 9
20 KDB 63 43 67 4 Pass Successful 62 9
21 KDB 63 48 69 4 Pass Successful 62 9
22 KDB 66 27 65 36 Pass Successful 71 51
23 KDB 67 40 73 58 Pass Unsuccessful 87 41
24 KDB 68 19 66 67 Pass Successful 60 69
25 KDB 69 25 71 79 Pass Successful 63 72
26 KDB 71 29 64 30 Pass Successful 69 60
27 KDB 71 46 81 57 Pass Successful 96 71
28 KDB 73 44 50 50 Pass Successful 47 52
29 KDB 76 11 67 43 Pass Successful 75 41
30 KDB 76 25 77 41 Pass Unsuccessful 84 37
31 KDB 77 59 100 100 Pass Successful 92 96
32 KDB 78 3 88 96 Pass Successful 69 82
33 KDB 78 9 83 96 Pass Successful 81 83
34 KDB 78 58 64 51 Pass Successful 69 83
35 KDB 79 5 67 67 Pass Successful 66 54
36 KDB 79 19 70 56 Pass Successful 85 87
37 KDB 79 31 76 84 Pass Successful 65 73
38 KDB 80 26 100 1 Pass Successful 92 5
39 KDB 80 30 82 3 Pass Successful 76 17
40 KDB 81 14 59 4 Pass Successful 69 4
41 KDB 81 20 80 6 Pass Successful 81 19
42 KDB 83 14 71 40 Pass Unsuccessful 88 55
43 KDB 87 24 58 42 Pass Successful 66 74
44 KDB 88 22 100 1 Pass Successful 73 26
45 KDB 90 18 71 25 Pass Unsuccessful 91 52
46 KDB 91 18 66 31 Pass Successful 76 20
47 KDB 91 46 74 38 Pass Unsuccessful 76 36
48 KDB 94 7 57 37 Pass Unsuccessful 90 54
#Converting our data to fit the football map
df2['x'] = df2['x']*1.2
df2['y'] = df2['y']*0.8
df2['endX'] = df2['endX']*1.2
df2['endY'] = df2['endY']*0.8
player minute second x y type outcome endX endY
0 KDB 45 0 60.0 40.0 Pass Successful 48.0 34.4
1 KDB 45 25 75.6 38.4 Pass Successful 90.0 66.4
2 KDB 46 4 88.8 46.4 Pass Successful 85.2 52.0
3 KDB 46 7 91.2 54.4 Pass Successful 114.0 62.4
4 KDB 46 55 120.0 0.8 Pass Unsuccessful 115.2 30.4
5 KDB 47 42 87.6 20.0 Pass Successful 84.0 43.2
6 KDB 49 47 90.0 34.4 Pass Unsuccessful 98.4 30.4
7 KDB 50 10 94.8 24.8 Pass Successful 104.4 19.2
8 KDB 50 17 98.4 41.6 Pass Successful 111.6 54.4
9 KDB 50 24 117.6 49.6 Pass Unsuccessful 117.6 48.8
10 KDB 55 22 36.0 9.6 Pass Successful 26.4 19.2
11 KDB 56 24 45.6 33.6 Pass Successful 44.4 40.8
12 KDB 56 36 63.6 35.2 Pass Successful 51.6 49.6
13 KDB 56 43 73.2 50.4 Pass Successful 76.8 50.4
14 KDB 57 2 115.2 26.4 Pass Successful 109.2 27.2
15 KDB 62 25 55.2 36.0 Pass Successful 45.6 34.4
16 KDB 62 31 58.8 53.6 Pass Successful 73.2 76.8
17 KDB 63 36 76.8 17.6 Pass Successful 75.6 8.0
18 KDB 63 40 88.8 5.6 Pass Successful 79.2 7.2
19 KDB 63 42 84.0 4.8 Pass Successful 75.6 7.2
20 KDB 63 43 80.4 3.2 Pass Successful 74.4 7.2
21 KDB 63 48 82.8 3.2 Pass Successful 74.4 7.2
22 KDB 66 27 78.0 28.8 Pass Successful 85.2 40.8
23 KDB 67 40 87.6 46.4 Pass Unsuccessful 104.4 32.8
24 KDB 68 19 79.2 53.6 Pass Successful 72.0 55.2
25 KDB 69 25 85.2 63.2 Pass Successful 75.6 57.6
26 KDB 71 29 76.8 24.0 Pass Successful 82.8 48.0
27 KDB 71 46 97.2 45.6 Pass Successful 115.2 56.8
28 KDB 73 44 60.0 40.0 Pass Successful 56.4 41.6
29 KDB 76 11 80.4 34.4 Pass Successful 90.0 32.8
30 KDB 76 25 92.4 32.8 Pass Unsuccessful 100.8 29.6
31 KDB 77 59 120.0 80.0 Pass Successful 110.4 76.8
32 KDB 78 3 105.6 76.8 Pass Successful 82.8 65.6
33 KDB 78 9 99.6 76.8 Pass Successful 97.2 66.4
34 KDB 78 58 76.8 40.8 Pass Successful 82.8 66.4
35 KDB 79 5 80.4 53.6 Pass Successful 79.2 43.2
36 KDB 79 19 84.0 44.8 Pass Successful 102.0 69.6
37 KDB 79 31 91.2 67.2 Pass Successful 78.0 58.4
38 KDB 80 26 120.0 0.8 Pass Successful 110.4 4.0
39 KDB 80 30 98.4 2.4 Pass Successful 91.2 13.6
40 KDB 81 14 70.8 3.2 Pass Successful 82.8 3.2
41 KDB 81 20 96.0 4.8 Pass Successful 97.2 15.2
42 KDB 83 14 85.2 32.0 Pass Unsuccessful 105.6 44.0
43 KDB 87 24 69.6 33.6 Pass Successful 79.2 59.2
44 KDB 88 22 120.0 0.8 Pass Successful 87.6 20.8
45 KDB 90 18 85.2 20.0 Pass Unsuccessful 109.2 41.6
46 KDB 91 18 79.2 24.8 Pass Successful 91.2 16.0
47 KDB 91 46 88.8 30.4 Pass Unsuccessful 91.2 28.8
48 KDB 94 7 68.4 29.6 Pass Unsuccessful 108.0 43.2
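The factors 1.2 and 0.8 come from mapping coordinates given on a 0 to 100 scale onto the 120 by 80 StatsBomb pitch. A minimal sketch of that conversion, with a helper name (`to_statsbomb`) invented for this example:

```python
def to_statsbomb(x, y, length=120.0, width=80.0):
    """Map 0-100 percentage coordinates onto a length x width pitch."""
    return x * length / 100, y * width / 100


# First pass in the table: starts at (50, 50), the centre spot.
print(to_statsbomb(50, 50))  # (60.0, 40.0)
```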
# draw the pitch
pitch = Pitch(pitch_type='statsbomb', pitch_color='#22312b', line_color='#c7d5cc')
fig, ax = pitch.draw(figsize=(16, 11), constrained_layout=True, tight_layout=False)

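for x in range(len(df2['x'])):
    if df2['outcome'][x] == 'Successful':
        # Successful passes are in color cyan
        plt.plot((df2['x'][x], df2['endX'][x]), (df2['y'][x], df2['endY'][x]), color='cyan')
    if df2['outcome'][x] == 'Unsuccessful':
        # Unsuccessful passes are in color red
        plt.plot((df2['x'][x], df2['endX'][x]), (df2['y'][x], df2['endY'][x]), color='red')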

plt.title('KDB Pass Map vs TOT',color='white',size=20)
Text(0.5, 1.0, 'KDB Pass Map vs TOT')

From this map, we can see that most of the successful passes are backward passes, which are less likely to lead to a goal, while about half of the passes are forward passes played close to the box. These forward passes should have more impact on scoring, which is the attacking input that contributes most to winning. After some research, I found that such passes are called progressive passes, so I need to identify them by following this tutorial.

# compare each pass's start and end distance to the goal to decide which passes are progressive
df2['beginning'] = np.sqrt(np.square(120-df2['x'])+np.square(40-df2['y']))
df2['end'] = np.sqrt(np.square(120 - df2['endX'])+np.square(40-df2['endY']))
df2['progressive'] = df2['end'] / df2['beginning'] < .75
player minute second x y type outcome endX endY beginning end progressive
0 KDB 45 0 60.0 40.0 Pass Successful 48.0 34.4 60.000000 72.217449 False
1 KDB 45 25 75.6 38.4 Pass Successful 90.0 66.4 44.428819 39.961982 False
2 KDB 46 4 88.8 46.4 Pass Successful 85.2 52.0 31.849647 36.810868 False
3 KDB 46 7 91.2 54.4 Pass Successful 114.0 62.4 32.199379 23.189653 True
4 KDB 46 55 120.0 0.8 Pass Unsuccessful 115.2 30.4 39.200000 10.733126 True
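The rule used here marks a pass as progressive when it ends at less than 75% of its starting distance from the centre of the goal at (120, 40). The same rule for a single pass, as a sketch with a function name (`is_progressive`) made up for this example:

```python
import math

GOAL_X, GOAL_Y = 120.0, 40.0  # centre of the attacked goal on a 120 x 80 pitch


def is_progressive(x, y, end_x, end_y, ratio=0.75):
    """True if the pass ends at less than `ratio` of its starting distance to goal."""
    start_dist = math.hypot(GOAL_X - x, GOAL_Y - y)
    end_dist = math.hypot(GOAL_X - end_x, GOAL_Y - end_y)
    return end_dist / start_dist < ratio


print(is_progressive(60.0, 40.0, 48.0, 34.4))   # row 0, a backward pass: False
print(is_progressive(91.2, 54.4, 114.0, 62.4))  # row 3: True
```

A pass that starts exactly at (120, 40) would divide by zero in this sketch; no pass in this dataset starts there.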
df3 = df2.loc[df2['progressive'] == True]
player minute second x y type outcome endX endY beginning end progressive
3 KDB 46 7 91.2 54.4 Pass Successful 114.0 62.4 32.199379 23.189653 True
4 KDB 46 55 120.0 0.8 Pass Unsuccessful 115.2 30.4 39.200000 10.733126 True
23 KDB 67 40 87.6 46.4 Pass Unsuccessful 104.4 32.8 33.026050 17.181385 True
27 KDB 71 46 97.2 45.6 Pass Successful 115.2 56.8 23.477649 17.472264 True
42 KDB 83 14 85.2 32.0 Pass Unsuccessful 105.6 44.0 35.707702 14.945233 True
45 KDB 90 18 85.2 20.0 Pass Unsuccessful 109.2 41.6 40.137763 10.917875 True
48 KDB 94 7 68.4 29.6 Pass Unsuccessful 108.0 43.2 52.637629 12.419340 True

As we can see, most of the progressive passes that Kevin De Bruyne attempted in the second half of this match (5 of 7) were unsuccessful.
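The count can be checked directly from the outcome column of the seven rows listed above. This is a small tally sketch; the list is copied from that table.

```python
from collections import Counter

# Outcomes of the seven progressive passes, in table order.
outcomes = ["Successful", "Unsuccessful", "Unsuccessful", "Successful",
            "Unsuccessful", "Unsuccessful", "Unsuccessful"]

counts = Counter(outcomes)
success_rate = counts["Successful"] / len(outcomes)

print(counts["Unsuccessful"], "of", len(outcomes), "unsuccessful")
print(round(success_rate, 2))
```

Seven passes from one half of one match is a very small sample, so this rate is only a description of this match.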

# Visualizing the progressive passes made by KDB
pitch = Pitch(pitch_type='statsbomb',pitch_color='#22312b', line_color='#c7d5cc')
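fig,ax = pitch.draw(figsize=(16, 11), constrained_layout=True, tight_layout=False)
# draw the progressive passes as lines (call inferred from the LineCollection output below; styling arguments unknown)
pitch.lines(df3['x'], df3['y'], df3['endX'], df3['endY'], ax=ax)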

<matplotlib.collections.LineCollection at 0x7fe204dc3650>

Therefore, not all passes create chances to score. Progressive passes (such as crosses) are the “key” passes, while backward passes move the ball away from the goal, which may be why they show a negative coefficient for wins in the linear model.


Overall, this project analyzes the important factors that contribute to winning matches in the Premier League. Scoring goals is the key attacking input for winning, while outfielder_block and goals_conceded are the key defending inputs.


Your code above should include references. Here is some additional space for references.

  • What is the source of your dataset(s)?


  • List any other references that you found helpful.

Progressive Pass Def
Create a soccer pitch


