Dota2 match outcome prediction based on mid player performance¶
Author: Yi Ding
Course Project, UC Irvine, Math 10, S22
Introduction¶
In this project, I intend to find the features in players’ performance data that are most strongly correlated with the results of the matches. After that, I use these features together with some scikit-learn tools, a Decision Tree and K-Nearest Neighbors, to predict the match outcomes.
Main portion of the project¶
import altair as alt
import glob
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
files=glob.glob("*.csv")
df=pd.concat([pd.read_csv(n).dropna() for n in files],ignore_index=True)
df.corr()["win"].sort_values(ascending=False)
win 1.000000
kda 0.532235
gpm 0.507651
kpm 0.499565
towerdamage 0.486888
xpm 0.464428
assists 0.442029
kills 0.418474
herokill 0.418282
towerkill 0.353701
killbytower 0.353544
roshankill 0.216538
level 0.197641
tefipar 0.170175
laneff 0.163227
totalgold 0.147461
totalxp 0.124276
tengold 0.119722
rune_pickups 0.116892
herodamage 0.101018
tendn 0.086132
ancikill 0.081159
tenxp 0.079259
denies 0.069989
senkill 0.053741
apm 0.042617
courkill 0.040109
necrokill 0.028446
firstblood 0.027637
neukill 0.022762
obskill 0.022390
heroheal 0.012792
heroid 0.003412
obsuse -0.002103
obsplaced -0.005328
roam -0.005937
senpurch -0.009090
last_hits -0.016502
obspurch -0.021447
senplaced -0.027812
matchid -0.030580
senuse -0.036714
accountid -0.044914
radiwin -0.045053
lanekill -0.055853
duration -0.110246
tppurch -0.190447
deaths -0.499130
Name: win, dtype: float64
Since all the data files follow the same pattern, I use glob to collect them and avoid repetition, then use concat to combine all of the datasets into one DataFrame.
Since a larger absolute value of the correlation coefficient indicates a stronger correlation with the outcome, I choose the following features to predict the result of a match.
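As a side note (not part of the original analysis), a similar feature list could be generated programmatically by keeping the columns whose absolute correlation with win exceeds a threshold; the 0.4 cutoff below is just an illustrative assumption.
corr_win = df.corr()["win"].drop("win")   # correlation of every column with the outcome
strong_features = corr_win[corr_win.abs() > 0.4].index.tolist()   # arbitrary 0.4 cutoff
print(strong_features)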
cols=["gpm","kpm","kda","towerdamage","xpm","assists","kills","deaths"]
chart_list=[]
for c in cols:
    chart = alt.Chart(df).mark_bar(
        opacity=0.6
    ).encode(
        x = alt.X(c,bin=alt.Bin(maxbins=50)),
        y = alt.Y("count()",stack=None),
        color = alt.Color("win:N",title="result")
    )
    chart_list.append(chart)
alt.vconcat(*chart_list)
Here I use layered histograms to visualize the relationship between those features and the match outcome, where 0 stands for a loss and 1 stands for a win. From the Altair charts, we can easily tell that the histograms of winning games are shifted further to the right than those of losing games, which means winning games tend to have higher kda, kills, etc. The exception is the histogram for deaths, where winning games are shifted to the left; this makes sense, since fewer deaths go with a higher win rate.
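To back up this visual impression with numbers (a small optional check, using the df and cols defined above), we can compare the average feature values of losses and wins.
df.groupby("win")[cols].mean()   # average feature values for losses (0) vs. wins (1)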
X_train,X_test,y_train,y_test=train_test_split(df[cols],df["win"],train_size=0.8)
train_error_dict={}
train_accuracy={}
test_error_dict={}
test_accuracy={}
storage={}
for n in range(2,40):
    Dtree=DecisionTreeClassifier(max_leaf_nodes=n)
    storage[n]=Dtree
    Dtree.fit(X_train,y_train)
    train_error_dict[n]=log_loss(y_train,Dtree.predict_proba(X_train))
    train_accuracy[n]=Dtree.score(X_train,y_train)
    test_error_dict[n]=log_loss(y_test,Dtree.predict_proba(X_test))
    test_accuracy[n]=Dtree.score(X_test,y_test)
df_train = pd.DataFrame({"y":train_error_dict, "type": "train","accuracy":train_accuracy})
df_test = pd.DataFrame({"y":test_error_dict, "type": "test","accuracy":test_accuracy})
df_error = pd.concat([df_train, df_test]).reset_index()
alt.Chart(df_error).mark_line().encode(
    x="index",
    y="y",
    color="type",
).properties(
    width=800,
    height=300
)
From this error curve chart, we can conclude that the model starts to overfit once max_leaf_nodes exceeds roughly 13.
Choosing 9 leaf nodes gives a relatively good model, with a test accuracy of 85.47%.
My dataset is relatively small, so there may be noticeable uncertainty in these numbers.
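As a small optional sketch (not part of the original analysis), the leaf-node count could also be chosen programmatically from the dictionaries built above, by taking the value with the lowest test log loss.
best_n = min(test_error_dict, key=test_error_dict.get)   # n with the lowest test log loss
print(best_n, test_accuracy[best_n])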
print(train_accuracy[9])
print(test_accuracy[9])
0.8945530726256983
0.8547486033519553
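As another optional check (using the storage dictionary of fitted trees above), the feature importances of the 9-leaf tree show which columns it actually relies on.
pd.Series(storage[9].feature_importances_, index=cols).sort_values(ascending=False)   # importance of each feature in the 9-leaf tree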
scaler=StandardScaler()
scaler.fit(df[cols])
X_scaled=scaler.transform(df[cols])
I want to use KNN in this part, and KNN is sensitive to the scale of the features. Thus, I standardize the data first.
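As a side note (an alternative, not what is done in this project), wrapping the scaler and the classifier in a single Pipeline would fit the scaler on the training data only, which avoids leaking test-set statistics into the scaling step; the sketch below reuses the earlier X_train/X_test split.
from sklearn.pipeline import make_pipeline
# Scaling happens inside the pipeline, so only the training data determines the scaling.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=30))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))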
Xtrain,Xtest,ytrain,ytest=train_test_split(X_scaled,df["win"],train_size=0.8)
trainError={}
testError={}
s={}
for k in range(10,200,10):
    KNN=KNeighborsClassifier(n_neighbors=k)
    KNN.fit(Xtrain,ytrain)
    s[k]=KNN
    trainError[k]=log_loss(ytrain,KNN.predict_proba(Xtrain))
    testError[k]=log_loss(ytest,KNN.predict_proba(Xtest))
dfTrain = pd.DataFrame({"y":trainError, "type": "train"})
dfTest = pd.DataFrame({"y":testError, "type": "test"})
dfError = pd.concat([dfTrain, dfTest]).reset_index()
alt.Chart(dfError).mark_line().encode(
    x="index",
    y="y",
    color="type",
)
For KNN, the model overfits when k is small. In this chart, for example, at k=10 the error on the training set is very small while the error on the test set is much larger, which is a sign of overfitting. As k keeps increasing, both errors increase.
So for our data, choosing k around 30 gives a good model. In addition, since my dataset is small, uncertainty can sometimes dominate.
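As with the decision tree, a small optional sketch (not part of the original analysis) could pick k programmatically from the dictionaries built above instead of reading it off the chart.
best_k = min(testError, key=testError.get)   # k with the lowest test log loss
print(best_k, s[best_k].score(Xtest, ytest))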
print(s[30].score(Xtrain,ytrain))
print(s[30].score(Xtest,ytest))
0.8931564245810056
0.8770949720670391
For k=30, we get a test accuracy of 87.7%.
Summary¶
I used a Decision Tree and K-Nearest Neighbors to predict the match outcome based on mid player performance. To find a good model, I used Altair charts to display the error curves of both algorithms, which let me analyze when each model overfits and when it performs well.
References¶
the datasets are from Dota2 Midplayer Performance
the inspiration of using altair charts to visualize correlation is from Fork of Decision tree classifier
the code of drawing overlapped altair charts is adapted from Layered Histogram
the background on KNN is from K-Nearest Neighbors Algorithm and A Practical Introduction to K-Nearest Neighbors Algorithm for Regression (with Python code)