Men’s Volleyball Performance Prediction#
Author: Mingyang Yi
Course Project, UC Irvine, Math 10, S23
Introduction#
As a volleyball fan, I’m very interested in which factor affects the result the most, and I also want to predict a team’s performance.
Data Cleaning#
import pandas as pd
df=pd.read_csv("mensvolleyball-PlusLiga08-23.csv")
This is one way to delete part of each column name; the approach comes from the pandas (PyData) documentation.
df2 = df.filter(regex="T1|Winner|Team_1|Date").copy()
I need to split the data into two DataFrames later (one per team), so I flip the Winner flag for team 1. Taking an explicit .copy() above avoids pandas’ SettingWithCopyWarning that appears when assigning to a filtered view.
df2['Winner'] = df2['Winner'].replace({1: 0, 0: 1})
df2.columns = df2.columns.str.replace("1", "")
df2
Date | Team_ | T_Score | T_Sum | T_BP | T_Ratio | T_Srv_Sum | T_Srv_Err | T_Srv_Ace | T_Srv_Eff | ... | T_Rec_Perf | T_Att_Sum | T_Att_Err | T_Att_Blk | T_Att_Kill | T_Att_Kill_Perc | T_Att_Eff | T_Blk_Sum | T_Blk_As | Winner | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01.10.2022, 14:45 | AZS Olsztyn | 1 | 60.0 | 17.0 | 11.0 | 79.0 | 18 | 6.0 | -13% | ... | 25% | 100 | 7.0 | 14.0 | 47.0 | 47% | 26% | 7.0 | 11 | 0 |
1 | 30.09.2022, 17:30 | Jastrzębski Węgiel | 3 | 51.0 | 17.0 | 27.0 | 77.0 | 15 | 4.0 | -7% | ... | 16% | 88 | 4.0 | 1.0 | 43.0 | 48% | 43% | 4.0 | 8 | 1 |
2 | 01.10.2022, 20:30 | LUK Lublin | 2 | 76.0 | 23.0 | 35.0 | 109.0 | 16 | 3.0 | -9% | ... | 21% | 115 | 6.0 | 10.0 | 63.0 | 54% | 40% | 10.0 | 9 | 0 |
3 | 02.10.2022, 14:45 | Warta Zawiercie | 3 | 66.0 | 16.0 | 22.0 | 98.0 | 21 | 5.0 | -16% | ... | 12% | 92 | 8.0 | 7.0 | 52.0 | 56% | 40% | 9.0 | 11 | 1 |
4 | 03.10.2022, 17:30 | BBTS Bielsko-Biała | 1 | 63.0 | 22.0 | 17.0 | 100.0 | 19 | 7.0 | -7% | ... | 23% | 97 | 5.0 | 10.0 | 48.0 | 49% | 34% | 8.0 | 10 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2634 | 20.03.2010, 17:00 | Pamapol Wielton Wieluń | 3 | 50.0 | 74.0 | 6.0 | 11.0 | 2,00 | 37.0 | 0 | ... | 18 | 48% | 67.0 | 4.0 | 7.0 | 35 | 52% | 9.0 | 3,00 | 1 |
2635 | 19.03.2010, 18:00 | ZAKSA Kędzierzyn-Koźle | 3 | 54.0 | 74.0 | 4.0 | 11.0 | 1,33 | 46.0 | 2 | ... | 18 | 39% | 74.0 | 4.0 | 9.0 | 41 | 55% | 9.0 | 3,00 | 1 |
2636 | 20.03.2010, 17:00 | PGE Skra Bełchatów | 3 | 54.0 | 75.0 | 5.0 | 12.0 | 1,67 | 54.0 | 5 | ... | 15 | 27% | 69.0 | 3.0 | 5.0 | 41 | 59% | 8.0 | 2,67 | 1 |
2637 | 20.03.2010, 17:00 | Asseco Resovia | 3 | 55.0 | 73.0 | 8.0 | 6.0 | 2,67 | 49.0 | 1 | ... | 19 | 38% | 88.0 | 5.0 | 7.0 | 42 | 48% | 5.0 | 1,67 | 1 |
2638 | 20.03.2010, 14:45 | Chemik Bydgoszcz | 0 | 43.0 | 64.0 | 1.0 | 12.0 | 0,33 | 65.0 | 1 | ... | 26 | 40% | 89.0 | 9.0 | 7.0 | 41 | 46% | 1.0 | 0,33 | 0 |
2639 rows × 23 columns
df3=df.filter(regex='T2|Winner|Team_2|Date')
df3.columns = df3.columns.str.replace("2", "")
df3
Date | Team_ | T_Score | T_Sum | T_BP | T_Ratio | T_Srv_Sum | T_Srv_Err | T_Srv_Ace | T_Srv_Eff | ... | T_Rec_Perf | T_Att_Sum | T_Att_Err | T_Att_Blk | T_Att_Kill | T_Att_Kill_Perc | T_Att_Eff | T_Blk_Sum | T_Blk_As | Winner | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01.10.2022, 14:45 | ZAKSA Kędzierzyn-Koźle | 3 | 69 | 30 | 38 | 96 | 11 | 10 | 2% | ... | 26% | 88 | 7 | 7 | 45 | 51% | 35% | 14 | 11 | 1 |
1 | 30.09.2022, 17:30 | GKS Katowice | 0 | 48 | 16 | 16 | 70 | 16 | 4 | -11% | ... | 20% | 91 | 8 | 4 | 43 | 47% | 34% | 1 | 17 | 0 |
2 | 01.10.2022, 20:30 | Czarni Radom | 3 | 82 | 23 | 40 | 104 | 19 | 9 | -5% | ... | 18% | 128 | 10 | 10 | 63 | 49% | 33% | 10 | 13 | 1 |
3 | 02.10.2022, 14:45 | PGE Skra Bełchatów | 2 | 71 | 21 | 25 | 103 | 23 | 8 | -8% | ... | 9% | 102 | 9 | 9 | 56 | 54% | 37% | 7 | 14 | 0 |
4 | 03.10.2022, 17:30 | Cuprum Lubin | 3 | 80 | 30 | 32 | 103 | 26 | 12 | -8% | ... | 22% | 109 | 7 | 8 | 58 | 53% | 39% | 10 | 10 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2634 | 20.03.2010, 17:00 | AZS Częstochowa | 0 | 34 | 52 | 0 | 15 | 0,00 | 60 | 3 | ... | 26 | 43% | 70 | 9 | 9 | 27 | 39% | 7 | 2,33 | 0 |
2635 | 19.03.2010, 18:00 | AZS Olsztyn | 0 | 39 | 57 | 2 | 11 | 0,67 | 63 | 4 | ... | 14 | 22% | 80 | 10 | 9 | 28 | 35% | 9 | 3,00 | 0 |
2636 | 20.03.2010, 17:00 | Jadar Radom | 0 | 43 | 67 | 4 | 13 | 1,33 | 63 | 5 | ... | 11 | 17% | 66 | 7 | 8 | 35 | 53% | 5 | 1,67 | 0 |
2637 | 20.03.2010, 17:00 | Projekt Warszawa | 0 | 37 | 59 | 1 | 10 | 0,33 | 67 | 8 | ... | 16 | 23% | 82 | 8 | 6 | 31 | 38% | 6 | 2,00 | 0 |
2638 | 20.03.2010, 14:45 | Jastrzębski Węgiel | 3 | 50 | 66 | 1 | 9 | 0,33 | 52 | 1 | ... | 26 | 50% | 73 | 7 | 1 | 42 | 58% | 7 | 2,33 | 1 |
2639 rows × 23 columns
df1 = pd.concat([df2, df3], axis=0, ignore_index=True)
perc_cols = ['T_Srv_Eff', 'T_Rec_Pos', 'T_Rec_Perf', 'T_Att_Kill_Perc', 'T_Att_Eff', 'T_Att_Sum']
for col in perc_cols:
df1[col] = pd.to_numeric(df1[col].str.replace('%', ''))
float_cols = ['T_Srv_Err', 'T_Blk_As']
for col in float_cols:
df1[col] = pd.to_numeric(df1[col].str.replace(',', '.'))
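The two string cleanups above (stripping “%” signs and turning decimal commas into dots) can be sketched on a tiny toy frame; the values here are invented, not taken from the dataset:

```python
import pandas as pd

# Toy frame mimicking the raw text columns (values invented for illustration)
toy = pd.DataFrame({"T_Srv_Eff": ["-13%", "2%"], "T_Blk_As": ["3,00", "1,67"]})

# Strip the percent sign, then convert the remaining digits to numbers
toy["T_Srv_Eff"] = pd.to_numeric(toy["T_Srv_Eff"].str.replace("%", "", regex=False))

# Replace the European decimal comma with a dot before converting
toy["T_Blk_As"] = pd.to_numeric(toy["T_Blk_As"].str.replace(",", ".", regex=False))

print(toy["T_Srv_Eff"].tolist())  # [-13, 2]
print(toy["T_Blk_As"].tolist())   # [3.0, 1.67]
```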
I convert the raw point counts into shares of each team’s total points (T_Sum), because some matches run five sets while others finish in three, so raw counts are not directly comparable.
df1["T_Att_Kill"]=df1["T_Att_Kill"]/df1["T_Sum"]
df1["T_Blk_Sum"]=df1["T_Blk_Sum"]/df1["T_Sum"]
df1["T_Srv_Ace"]=df1["T_Srv_Ace"]/df1["T_Sum"]
df1
Date | Team_ | T_Score | T_Sum | T_BP | T_Ratio | T_Srv_Sum | T_Srv_Err | T_Srv_Ace | T_Srv_Eff | ... | T_Rec_Perf | T_Att_Sum | T_Att_Err | T_Att_Blk | T_Att_Kill | T_Att_Kill_Perc | T_Att_Eff | T_Blk_Sum | T_Blk_As | Winner | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01.10.2022, 14:45 | AZS Olsztyn | 1 | 60.0 | 17.0 | 11.0 | 79.0 | 18.00 | 0.100000 | -13 | ... | 25 | 100 | 7.0 | 14.0 | 0.783333 | 47 | 26 | 0.116667 | 11.00 | 0 |
1 | 30.09.2022, 17:30 | Jastrzębski Węgiel | 3 | 51.0 | 17.0 | 27.0 | 77.0 | 15.00 | 0.078431 | -7 | ... | 16 | 88 | 4.0 | 1.0 | 0.843137 | 48 | 43 | 0.078431 | 8.00 | 1 |
2 | 01.10.2022, 20:30 | LUK Lublin | 2 | 76.0 | 23.0 | 35.0 | 109.0 | 16.00 | 0.039474 | -9 | ... | 21 | 115 | 6.0 | 10.0 | 0.828947 | 54 | 40 | 0.131579 | 9.00 | 0 |
3 | 02.10.2022, 14:45 | Warta Zawiercie | 3 | 66.0 | 16.0 | 22.0 | 98.0 | 21.00 | 0.075758 | -16 | ... | 12 | 92 | 8.0 | 7.0 | 0.787879 | 56 | 40 | 0.136364 | 11.00 | 1 |
4 | 03.10.2022, 17:30 | BBTS Bielsko-Biała | 1 | 63.0 | 22.0 | 17.0 | 100.0 | 19.00 | 0.111111 | -7 | ... | 23 | 97 | 5.0 | 10.0 | 0.761905 | 49 | 34 | 0.126984 | 10.00 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5273 | 20.03.2010, 17:00 | AZS Częstochowa | 0 | 34.0 | 52.0 | 0.0 | 15.0 | 0.00 | 1.764706 | 3 | ... | 26 | 43 | 70.0 | 9.0 | 0.264706 | 27 | 39 | 0.205882 | 2.33 | 0 |
5274 | 19.03.2010, 18:00 | AZS Olsztyn | 0 | 39.0 | 57.0 | 2.0 | 11.0 | 0.67 | 1.615385 | 4 | ... | 14 | 22 | 80.0 | 10.0 | 0.230769 | 28 | 35 | 0.230769 | 3.00 | 0 |
5275 | 20.03.2010, 17:00 | Jadar Radom | 0 | 43.0 | 67.0 | 4.0 | 13.0 | 1.33 | 1.465116 | 5 | ... | 11 | 17 | 66.0 | 7.0 | 0.186047 | 35 | 53 | 0.116279 | 1.67 | 0 |
5276 | 20.03.2010, 17:00 | Projekt Warszawa | 0 | 37.0 | 59.0 | 1.0 | 10.0 | 0.33 | 1.810811 | 8 | ... | 16 | 23 | 82.0 | 8.0 | 0.162162 | 31 | 38 | 0.162162 | 2.00 | 0 |
5277 | 20.03.2010, 14:45 | Jastrzębski Węgiel | 3 | 50.0 | 66.0 | 1.0 | 9.0 | 0.33 | 1.040000 | 1 | ... | 26 | 50 | 73.0 | 7.0 | 0.020000 | 42 | 58 | 0.140000 | 2.33 | 1 |
5278 rows × 23 columns
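The per-point normalization above can be checked on toy numbers; dividing by T_Sum puts three-set and five-set matches on the same scale (the values below are hypothetical):

```python
import pandas as pd

# Hypothetical raw counts: kills and total points won (T_Sum) for two matches
toy = pd.DataFrame({"T_Att_Kill": [47.0, 63.0], "T_Sum": [60.0, 76.0]})

# Express kills as a share of the team's total points, as done for df1
toy["T_Att_Kill"] = toy["T_Att_Kill"] / toy["T_Sum"]
print(toy["T_Att_Kill"].round(3).tolist())  # [0.783, 0.829]
```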
Model Training#
In this section I train a decision tree classifier on the per-match statistics to predict the Winner label. I first confirm the data has no missing values, then compare train and test accuracy to check for overfitting, and finally tune the number of leaf nodes using an error curve.
df1.isna().any(axis=0)
Date False
Team_ False
T_Score False
T_Sum False
T_BP False
T_Ratio False
T_Srv_Sum False
T_Srv_Err False
T_Srv_Ace False
T_Srv_Eff False
T_Rec_Sum False
T_Rec_Err False
T_Rec_Pos False
T_Rec_Perf False
T_Att_Sum False
T_Att_Err False
T_Att_Blk False
T_Att_Kill False
T_Att_Kill_Perc False
T_Att_Eff False
T_Blk_Sum False
T_Blk_As False
Winner False
dtype: bool
The first few columns record the total points and sets a team won, which directly encode the match outcome, so they are not good predictors.
X = df1.loc[:, "T_Srv_Sum":"T_Blk_As"]
y=df1["Winner"]
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(max_depth=5)
clf.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=5)
a=clf.score(X_train, y_train)
b=clf.score(X_test, y_test)
print(a,b)
0.8889152060634771 0.8702651515151515
The train and test accuracies are close (0.889 vs. 0.870), so the model is not badly overfitting and we can move on to prediction.
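The overfitting check used here (comparing train and test accuracy) can be illustrated on synthetic data; this is a sketch, not the project’s dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the match statistics
X_demo, y_demo = make_classification(n_samples=1000, n_features=16, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# A depth-limited tree keeps train and test accuracy close...
shallow = DecisionTreeClassifier(max_depth=5, random_state=0).fit(Xtr, ytr)
gap = shallow.score(Xtr, ytr) - shallow.score(Xte, yte)

# ...while an unconstrained tree memorizes the training set (a sign of overfitting)
deep = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
print(gap, deep.score(Xtr, ytr))  # small gap vs. a perfect 1.0 training score
```

A small train–test gap, as in the 0.889 vs. 0.870 scores above, is what suggests the depth-5 tree is not badly overfitting.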
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
df_err = pd.DataFrame(columns=["leaves", "error", "set"])
Next I compute the train and test error curves as a function of the number of leaf nodes:
for n in range(2, 200):
clf2 = DecisionTreeClassifier(max_leaf_nodes=n, random_state=42)
clf2.fit(X_train, y_train)
train_error = 1 - clf2.score(X_train, y_train)
test_error = 1 - clf2.score(X_test, y_test)
d_train = {"leaves": n, "error": train_error, "set":"train"}
d_test = {"leaves": n, "error": test_error, "set":"test"}
df_err.loc[len(df_err)] = d_train
df_err.loc[len(df_err)] = d_test
import altair as alt
c = alt.Chart(df_err).mark_line().encode(
x="leaves",
y="error",
color="set"
)
c
The sweet spot (lowest test error) is at approximately 17 leaf nodes.
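Instead of eyeballing the chart, the sweet spot can be read off an error table like df_err with idxmin; the sketch below uses a few invented rows:

```python
import pandas as pd

# Hypothetical slice of an error table shaped like df_err (error values invented)
df_err_demo = pd.DataFrame({
    "leaves": [10, 17, 25, 50],
    "error":  [0.130, 0.118, 0.121, 0.127],
    "set":    ["test", "test", "test", "test"],
})

# Pick the leaf count with the smallest test error
test_rows = df_err_demo[df_err_demo["set"] == "test"]
best = test_rows.loc[test_rows["error"].idxmin(), "leaves"]
print(best)  # 17
```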
clf1 = DecisionTreeClassifier(max_depth=5, max_leaf_nodes=17)
clf1.fit(X, y)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=17)
clf1.score(X, y)
0.8819628647214854
fig = plt.figure(figsize=(100,200))
plot_tree(
clf1,
feature_names=clf1.feature_names_in_,
filled=True
);
Use Logistic Regression to Predict#
from sklearn.linear_model import LogisticRegression
clf3 = LogisticRegression(max_iter=1000)  # raise max_iter so the lbfgs solver converges
clf3.fit(X, y)
LogisticRegression(max_iter=1000)
df1["Pred"] = clf3.predict(X)
df1
Date | Team_ | T_Score | T_Sum | T_BP | T_Ratio | T_Srv_Sum | T_Srv_Err | T_Srv_Ace | T_Srv_Eff | ... | T_Att_Sum | T_Att_Err | T_Att_Blk | T_Att_Kill | T_Att_Kill_Perc | T_Att_Eff | T_Blk_Sum | T_Blk_As | Winner | Pred | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01.10.2022, 14:45 | AZS Olsztyn | 1 | 60.0 | 17.0 | 11.0 | 79.0 | 18.00 | 0.100000 | -13 | ... | 100 | 7.0 | 14.0 | 0.783333 | 47 | 26 | 0.116667 | 11.00 | 0 | 0 |
1 | 30.09.2022, 17:30 | Jastrzębski Węgiel | 3 | 51.0 | 17.0 | 27.0 | 77.0 | 15.00 | 0.078431 | -7 | ... | 88 | 4.0 | 1.0 | 0.843137 | 48 | 43 | 0.078431 | 8.00 | 1 | 1 |
2 | 01.10.2022, 20:30 | LUK Lublin | 2 | 76.0 | 23.0 | 35.0 | 109.0 | 16.00 | 0.039474 | -9 | ... | 115 | 6.0 | 10.0 | 0.828947 | 54 | 40 | 0.131579 | 9.00 | 0 | 0 |
3 | 02.10.2022, 14:45 | Warta Zawiercie | 3 | 66.0 | 16.0 | 22.0 | 98.0 | 21.00 | 0.075758 | -16 | ... | 92 | 8.0 | 7.0 | 0.787879 | 56 | 40 | 0.136364 | 11.00 | 1 | 0 |
4 | 03.10.2022, 17:30 | BBTS Bielsko-Biała | 1 | 63.0 | 22.0 | 17.0 | 100.0 | 19.00 | 0.111111 | -7 | ... | 97 | 5.0 | 10.0 | 0.761905 | 49 | 34 | 0.126984 | 10.00 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5273 | 20.03.2010, 17:00 | AZS Częstochowa | 0 | 34.0 | 52.0 | 0.0 | 15.0 | 0.00 | 1.764706 | 3 | ... | 43 | 70.0 | 9.0 | 0.264706 | 27 | 39 | 0.205882 | 2.33 | 0 | 0 |
5274 | 19.03.2010, 18:00 | AZS Olsztyn | 0 | 39.0 | 57.0 | 2.0 | 11.0 | 0.67 | 1.615385 | 4 | ... | 22 | 80.0 | 10.0 | 0.230769 | 28 | 35 | 0.230769 | 3.00 | 0 | 0 |
5275 | 20.03.2010, 17:00 | Jadar Radom | 0 | 43.0 | 67.0 | 4.0 | 13.0 | 1.33 | 1.465116 | 5 | ... | 17 | 66.0 | 7.0 | 0.186047 | 35 | 53 | 0.116279 | 1.67 | 0 | 0 |
5276 | 20.03.2010, 17:00 | Projekt Warszawa | 0 | 37.0 | 59.0 | 1.0 | 10.0 | 0.33 | 1.810811 | 8 | ... | 23 | 82.0 | 8.0 | 0.162162 | 31 | 38 | 0.162162 | 2.00 | 0 | 0 |
5277 | 20.03.2010, 14:45 | Jastrzębski Węgiel | 3 | 50.0 | 66.0 | 1.0 | 9.0 | 0.33 | 1.040000 | 1 | ... | 50 | 73.0 | 7.0 | 0.020000 | 42 | 58 | 0.140000 | 2.33 | 1 | 1 |
5278 rows × 24 columns
df1[df1["Winner"] == df1["Pred"]]
Date | Team_ | T_Score | T_Sum | T_BP | T_Ratio | T_Srv_Sum | T_Srv_Err | T_Srv_Ace | T_Srv_Eff | ... | T_Att_Sum | T_Att_Err | T_Att_Blk | T_Att_Kill | T_Att_Kill_Perc | T_Att_Eff | T_Blk_Sum | T_Blk_As | Winner | Pred | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01.10.2022, 14:45 | AZS Olsztyn | 1 | 60.0 | 17.0 | 11.0 | 79.0 | 18.00 | 0.100000 | -13 | ... | 100 | 7.0 | 14.0 | 0.783333 | 47 | 26 | 0.116667 | 11.00 | 0 | 0 |
1 | 30.09.2022, 17:30 | Jastrzębski Węgiel | 3 | 51.0 | 17.0 | 27.0 | 77.0 | 15.00 | 0.078431 | -7 | ... | 88 | 4.0 | 1.0 | 0.843137 | 48 | 43 | 0.078431 | 8.00 | 1 | 1 |
2 | 01.10.2022, 20:30 | LUK Lublin | 2 | 76.0 | 23.0 | 35.0 | 109.0 | 16.00 | 0.039474 | -9 | ... | 115 | 6.0 | 10.0 | 0.828947 | 54 | 40 | 0.131579 | 9.00 | 0 | 0 |
4 | 03.10.2022, 17:30 | BBTS Bielsko-Biała | 1 | 63.0 | 22.0 | 17.0 | 100.0 | 19.00 | 0.111111 | -7 | ... | 97 | 5.0 | 10.0 | 0.761905 | 49 | 34 | 0.126984 | 10.00 | 0 | 0 |
5 | 02.10.2022, 20:30 | Stal Nysa | 3 | 68.0 | 23.0 | 29.0 | 100.0 | 22.00 | 0.102941 | -12 | ... | 105 | 5.0 | 4.0 | 0.823529 | 53 | 44 | 0.073529 | 9.00 | 1 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5273 | 20.03.2010, 17:00 | AZS Częstochowa | 0 | 34.0 | 52.0 | 0.0 | 15.0 | 0.00 | 1.764706 | 3 | ... | 43 | 70.0 | 9.0 | 0.264706 | 27 | 39 | 0.205882 | 2.33 | 0 | 0 |
5274 | 19.03.2010, 18:00 | AZS Olsztyn | 0 | 39.0 | 57.0 | 2.0 | 11.0 | 0.67 | 1.615385 | 4 | ... | 22 | 80.0 | 10.0 | 0.230769 | 28 | 35 | 0.230769 | 3.00 | 0 | 0 |
5275 | 20.03.2010, 17:00 | Jadar Radom | 0 | 43.0 | 67.0 | 4.0 | 13.0 | 1.33 | 1.465116 | 5 | ... | 17 | 66.0 | 7.0 | 0.186047 | 35 | 53 | 0.116279 | 1.67 | 0 | 0 |
5276 | 20.03.2010, 17:00 | Projekt Warszawa | 0 | 37.0 | 59.0 | 1.0 | 10.0 | 0.33 | 1.810811 | 8 | ... | 23 | 82.0 | 8.0 | 0.162162 | 31 | 38 | 0.162162 | 2.00 | 0 | 0 |
5277 | 20.03.2010, 14:45 | Jastrzębski Węgiel | 3 | 50.0 | 66.0 | 1.0 | 9.0 | 0.33 | 1.040000 | 1 | ... | 50 | 73.0 | 7.0 | 0.020000 | 42 | 58 | 0.140000 | 2.33 | 1 | 1 |
4489 rows × 24 columns
4489/5278
0.8505115574081091
The prediction accuracy (4489 correctly classified rows out of 5278) is about 85%, which is fairly high.
from sklearn.metrics import mean_absolute_error
mean_absolute_error(clf3.predict(X_test), y_test)
0.1571969696969697
mean_absolute_error(clf3.predict(X_train), y_train)
0.14756039791567976
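For 0/1 labels, mean absolute error is exactly the misclassification rate, so 1 − MAE is the accuracy; a toy check (values invented):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Toy binary labels and predictions
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])  # one mistake out of five

mae = mean_absolute_error(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
print(mae, acc)  # 0.2 0.8 -- for 0/1 labels they sum to 1
```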
Most Important Factor#
In men’s volleyball, many people complain that too many service aces hurt the flow of the matches. I want to know whether the service ace is the dominant factor.
clf1.feature_importances_
array([0.05616922, 0. , 0.64921671, 0. , 0.03569843,
0.00612576, 0. , 0. , 0.00312938, 0. ,
0. , 0.0188901 , 0.00658239, 0.22418801, 0. ,
0. ])
pd.Series(clf1.feature_importances_, index=X.columns)
T_Srv_Sum 0.056169
T_Srv_Err 0.000000
T_Srv_Ace 0.649217
T_Srv_Eff 0.000000
T_Rec_Sum 0.035698
T_Rec_Err 0.006126
T_Rec_Pos 0.000000
T_Rec_Perf 0.000000
T_Att_Sum 0.003129
T_Att_Err 0.000000
T_Att_Blk 0.000000
T_Att_Kill 0.018890
T_Att_Kill_Perc 0.006582
T_Att_Eff 0.224188
T_Blk_Sum 0.000000
T_Blk_As 0.000000
dtype: float64
df4 = pd.DataFrame({"importance": clf1.feature_importances_, "factors": clf1.feature_names_in_})
df4
importance | factors | |
---|---|---|
0 | 0.056169 | T_Srv_Sum |
1 | 0.000000 | T_Srv_Err |
2 | 0.649217 | T_Srv_Ace |
3 | 0.000000 | T_Srv_Eff |
4 | 0.035698 | T_Rec_Sum |
5 | 0.006126 | T_Rec_Err |
6 | 0.000000 | T_Rec_Pos |
7 | 0.000000 | T_Rec_Perf |
8 | 0.003129 | T_Att_Sum |
9 | 0.000000 | T_Att_Err |
10 | 0.000000 | T_Att_Blk |
11 | 0.018890 | T_Att_Kill |
12 | 0.006582 | T_Att_Kill_Perc |
13 | 0.224188 | T_Att_Eff |
14 | 0.000000 | T_Blk_Sum |
15 | 0.000000 | T_Blk_As |
alt.Chart(df4).mark_bar().encode(
x="factors",
y="importance",
tooltip=["importance"]
).properties(
title = 'Importance of Factors'
)
The result shows that the service ace is the most important factor in men’s volleyball, by this tree’s impurity-based importance.
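Impurity-based importances from a single tree can be fragile, so one hedge is to cross-check with permutation importance (how much the score drops when a column is shuffled). A sketch on synthetic data, where the first two columns are informative by construction:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

# shuffle=False keeps the 2 informative features in the first two columns
X_demo, y_demo = make_classification(n_samples=500, n_features=4, n_informative=2,
                                     n_redundant=0, shuffle=False, random_state=0)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_demo, y_demo)

# Average score drop over 10 shuffles of each column
result = permutation_importance(tree, X_demo, y_demo, n_repeats=10, random_state=0)
print(result.importances_mean.round(3))  # informative columns should dominate
```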
df1_1=df1.head(2000)
alt.Chart(df1_1).mark_circle().encode(
x=alt.X("T_Srv_Ace", scale=alt.Scale(zero=False)),
y=alt.Y("T_Att_Eff", scale=alt.Scale(zero=False)),
color="Winner",
tooltip=["Team_", "Date", "T_Att_Eff","T_Srv_Ace"],
).interactive()
You may find this chart a little odd because it splits into two clusters: some matches run five sets while others run three, which produces the two groups. I was surprised by the figure, because I expected that more service aces would mean a better chance of winning. Instead, it shows that when aces make up a larger share of total points, the team is less likely to win; meanwhile, the more effective the attack, the more likely the team is to win.
Predict Team Performance#
Many people don’t pay attention to defensive skills when they watch games, so I want to check whether defensive skills matter for a strong team.
alt.Chart(df1_1).mark_circle().encode(
x=alt.X("T_Blk_Sum", scale=alt.Scale(zero=False)),
y=alt.Y("T_Rec_Pos", scale=alt.Scale(zero=False)),
color="Winner",
tooltip=["Team_", "Date", "T_Att_Eff","T_Srv_Ace"],
).interactive()
There is no obvious relation between the two defensive skills, though the plot weakly suggests more dark blue dots toward the top right. Let’s focus on one team.
I use XGBoost, a gradient-boosted ensemble of decision trees, because it combines many shallow trees into a strong classifier and works well on tabular prediction tasks like this one.
pip install xgboost
Z=df1[["T_Blk_Sum","T_Rec_Pos"]]
X_train1, X_test1, y_train1, y_test1 = train_test_split(Z, y, test_size=0.2)
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
model1 = xgb.XGBClassifier()
model1.fit(X_train1, y_train1)
predictions = model1.predict(X_test1)
t=model1.predict(Z)
accuracy = accuracy_score(y, t)
print("Accuracy:", accuracy)
Accuracy: 0.7391057218643425
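Note that the accuracy above is computed on all 5278 rows, including the 80% the model was trained on, so it is likely optimistic. A cleaner estimate scores only the held-out split; the sketch below uses synthetic data and sklearn’s GradientBoostingClassifier as a stand-in for XGBClassifier:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for xgb.XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Two synthetic features playing the role of T_Blk_Sum and T_Rec_Pos
X_demo, y_demo = make_classification(n_samples=600, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)

# Score only the held-out 20% so training rows do not inflate the estimate
test_acc = accuracy_score(yte, model.predict(Xte))
print(test_acc)
```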
df1["p"]=model1.predict(Z)
df5=df1[df1["Winner"]==1]
df5
Date | Team_ | T_Score | T_Sum | T_BP | T_Ratio | T_Srv_Sum | T_Srv_Err | T_Srv_Ace | T_Srv_Eff | ... | T_Att_Err | T_Att_Blk | T_Att_Kill | T_Att_Kill_Perc | T_Att_Eff | T_Blk_Sum | T_Blk_As | Winner | Pred | p | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 30.09.2022, 17:30 | Jastrzębski Węgiel | 3 | 51.0 | 17.0 | 27.0 | 77.0 | 15.00 | 0.078431 | -7 | ... | 4.0 | 1.0 | 0.843137 | 48 | 43 | 0.078431 | 8.00 | 1 | 1 | 0 |
3 | 02.10.2022, 14:45 | Warta Zawiercie | 3 | 66.0 | 16.0 | 22.0 | 98.0 | 21.00 | 0.075758 | -16 | ... | 8.0 | 7.0 | 0.787879 | 56 | 40 | 0.136364 | 11.00 | 1 | 0 | 0 |
5 | 02.10.2022, 20:30 | Stal Nysa | 3 | 68.0 | 23.0 | 29.0 | 100.0 | 22.00 | 0.102941 | -12 | ... | 5.0 | 4.0 | 0.823529 | 53 | 44 | 0.073529 | 9.00 | 1 | 1 | 0 |
6 | 02.10.2022, 17:30 | Trefl Gdańsk | 3 | 59.0 | 22.0 | 33.0 | 73.0 | 16.00 | 0.101695 | -6 | ... | 2.0 | 8.0 | 0.847458 | 60 | 48 | 0.050847 | 10.00 | 1 | 1 | 0 |
7 | 01.10.2022, 17:30 | Asseco Resovia | 3 | 52.0 | 23.0 | 25.0 | 74.0 | 14.00 | 0.115385 | -8 | ... | 7.0 | 5.0 | 0.692308 | 52 | 34 | 0.192308 | 8.00 | 1 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5269 | 14.03.2010, 14:45 | Asseco Resovia | 3 | 60.0 | 78.0 | 6.0 | 8.0 | 2.00 | 1.016667 | 2 | ... | 85.0 | 5.0 | 0.050000 | 48 | 56 | 0.100000 | 2.00 | 1 | 1 | 0 |
5270 | 14.03.2010, 14:45 | Jastrzębski Węgiel | 3 | 73.0 | 87.0 | 4.0 | 15.0 | 1.00 | 0.917808 | 3 | ... | 100.0 | 4.0 | 0.095890 | 58 | 58 | 0.150685 | 2.75 | 1 | 1 | 1 |
5271 | 14.03.2010, 14:45 | ZAKSA Kędzierzyn-Koźle | 3 | 73.0 | 97.0 | 6.0 | 14.0 | 1.50 | 0.904110 | 4 | ... | 96.0 | 7.0 | 0.082192 | 51 | 53 | 0.219178 | 4.00 | 1 | 1 | 1 |
5272 | 13.03.2010, 18:00 | PGE Skra Bełchatów | 3 | 67.0 | 97.0 | 3.0 | 14.0 | 0.75 | 1.044776 | 2 | ... | 104.0 | 8.0 | 0.164179 | 53 | 51 | 0.164179 | 2.75 | 1 | 1 | 1 |
5277 | 20.03.2010, 14:45 | Jastrzębski Węgiel | 3 | 50.0 | 66.0 | 1.0 | 9.0 | 0.33 | 1.040000 | 1 | ... | 73.0 | 7.0 | 0.020000 | 42 | 58 | 0.140000 | 2.33 | 1 | 1 | 1 |
2639 rows × 25 columns
I find the team that won the most games.
a=df5["Team_"].value_counts()
a
ZAKSA Kędzierzyn-Koźle 324
PGE Skra Bełchatów 321
Asseco Resovia 290
Jastrzębski Węgiel 286
Projekt Warszawa 194
AZS Olsztyn 172
Trefl Gdańsk 171
Chemik Bydgoszcz 129
Cuprum Lubin 110
Czarni Radom 108
Warta Zawiercie 91
AZS Częstochowa 83
GKS Katowice 74
Społem Kielce 71
MKS Będzin 44
BBTS Bielsko-Biała 35
Ślepsk Malow Suwałki 32
Stal Nysa 26
LUK Lublin 22
Stocznia Szczecin 19
Jadar Radom 17
Pamapol Wielton Wieluń 14
Barkom Każany Lwów 6
Name: Team_, dtype: int64
df_2 = df1.loc[df1['Team_'] == 'ZAKSA Kędzierzyn-Koźle']
df_2
Date | Team_ | T_Score | T_Sum | T_BP | T_Ratio | T_Srv_Sum | T_Srv_Err | T_Srv_Ace | T_Srv_Eff | ... | T_Att_Err | T_Att_Blk | T_Att_Kill | T_Att_Kill_Perc | T_Att_Eff | T_Blk_Sum | T_Blk_As | Winner | Pred | p | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14 | 04.10.2022, 21:00 | ZAKSA Kędzierzyn-Koźle | 3 | 59.0 | 23.0 | 37.0 | 73.0 | 9.00 | 0.050847 | -5 | ... | 5.0 | 4.0 | 0.762712 | 59 | 47 | 0.186441 | 7.00 | 1 | 1 | 0 |
29 | 13.10.2022, 17:30 | ZAKSA Kędzierzyn-Koźle | 3 | 53.0 | 20.0 | 34.0 | 73.0 | 7.00 | 0.075472 | 0 | ... | 3.0 | 2.0 | 0.735849 | 56 | 49 | 0.188679 | 5.00 | 1 | 1 | 1 |
44 | 30.10.2022, 14:45 | ZAKSA Kędzierzyn-Koźle | 3 | 50.0 | 22.0 | 26.0 | 73.0 | 11.00 | 0.120000 | 0 | ... | 2.0 | 6.0 | 0.660000 | 50 | 37 | 0.220000 | 9.00 | 1 | 1 | 1 |
59 | 06.11.2022, 14:45 | ZAKSA Kędzierzyn-Koźle | 3 | 62.0 | 20.0 | 26.0 | 95.0 | 14.00 | 0.032258 | -9 | ... | 8.0 | 8.0 | 0.790323 | 49 | 33 | 0.177419 | 18.00 | 1 | 1 | 1 |
74 | 19.11.2022, 17:30 | ZAKSA Kędzierzyn-Koźle | 3 | 60.0 | 15.0 | 40.0 | 74.0 | 9.00 | 0.033333 | -6 | ... | 5.0 | 3.0 | 0.800000 | 57 | 48 | 0.166667 | 6.00 | 1 | 1 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5235 | 10.01.2010, 14:45 | ZAKSA Kędzierzyn-Koźle | 3 | 59.0 | 74.0 | 11.0 | 14.0 | 3.67 | 0.694915 | 0 | ... | 68.0 | 2.0 | 0.084746 | 44 | 65 | 0.067797 | 1.33 | 1 | 1 | 1 |
5245 | 12.02.2010, 18:00 | ZAKSA Kędzierzyn-Koźle | 3 | 72.0 | 103.0 | 8.0 | 13.0 | 1.60 | 1.125000 | 11 | ... | 102.0 | 5.0 | 0.166667 | 54 | 53 | 0.138889 | 2.00 | 1 | 1 | 0 |
5254 | 20.02.2010, 17:00 | ZAKSA Kędzierzyn-Koźle | 3 | 67.0 | 91.0 | 3.0 | 16.0 | 0.75 | 1.074627 | 1 | ... | 124.0 | 6.0 | 0.164179 | 56 | 45 | 0.134328 | 2.25 | 1 | 1 | 0 |
5263 | 06.03.2010, 18:00 | ZAKSA Kędzierzyn-Koźle | 3 | 77.0 | 106.0 | 12.0 | 18.0 | 2.40 | 1.077922 | 3 | ... | 115.0 | 8.0 | 0.207792 | 52 | 45 | 0.168831 | 2.60 | 1 | 0 | 1 |
5271 | 14.03.2010, 14:45 | ZAKSA Kędzierzyn-Koźle | 3 | 73.0 | 97.0 | 6.0 | 14.0 | 1.50 | 0.904110 | 4 | ... | 96.0 | 7.0 | 0.082192 | 51 | 53 | 0.219178 | 4.00 | 1 | 1 | 1 |
445 rows × 25 columns
alt.Chart(df_2).mark_circle().encode(
x="T_Blk_Sum",
y="T_Rec_Pos",
color=alt.Color("p", title="Predicted win"),
tooltip = ("T_Blk_Sum", "T_Rec_Pos")
).properties(
width=500,
height=200,
)
There is not much connection between blocking and reception skills; the dark blue dots sit somewhat higher, but the relation is weak, so defensive skills do not appear as important as offensive skills.
Summary#
After modeling, I found that the share of points won by service aces is the most important feature for men’s volleyball, but a larger ace share is actually associated with losing; by contrast, attack efficiency matters more for winning a game. Therefore, practicing attacking skills seems more essential than practicing serving. We also found that blocking and reception have no significant effect, so in this data defensive skills do not show the same importance as offensive ones.
References#
Dataset source:
https://www.kaggle.com/code/kacpergregorowicz/predicting-volleyball-match-winners/notebook
Other references that I found helpful:
https://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.DataFrame.filter.html