Overwatch 2#
Author: Luke Galang (lrgalang@uci.edu)
Course Project, UC Irvine, Math 10, S23
Introduction#
This project uses data from Blizzard Entertainment’s Overwatch 2, a video game in which a variety of heroes face off to complete a series of objectives using their different abilities. My goal is to apply what we learned in Math 10 to show visually whether there are any relationships in the data and to predict probabilities associated with certain skill tiers or heroes. I also want to model relationships between certain statistics in the dataset to see whether those stats are correlated.
Main Portion of Project#
Load Data#
import pandas as pd
import numpy as np
import altair as alt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import log_loss
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("Overwatch2.csv")
df
| | Hero | Skill Tier | KDA Ratio | Pick Rate, % | Win Rate, % | Eliminations / 10min | Objective Kills / 10min | Objective Time / 10min | Damage / 10min | Healing / 10min | ... | Jagged Blade Accuracy, % | Carnage Kills / 10min | Wound Uptime, % | Rampage Kills / 10min | Focusing Beam Accuracy, % | Focusing Beam Kills / 10min | Sticky Bomb Accuracy, % | Sticky Bomb Kills / 10min | Duplicate Kills / 10min | Role |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ana | All | 4.49 | 10.18 | 50.99 | 9.46 | 4.05 | 63 | 2676 | 8686.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Support |
1 | Ana | Bronze | 3.87 | 3.71 | 43.97 | 8.37 | 3.80 | 61 | 2508 | 7483.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Support |
2 | Ana | Silver | 4.01 | 3.78 | 46.68 | 8.88 | 4.08 | 64 | 2573 | 7875.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Support |
3 | Ana | Gold | 4.36 | 5.46 | 48.90 | 9.21 | 4.14 | 65 | 2610 | 8251.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Support |
4 | Ana | Platinum | 4.56 | 9.33 | 50.45 | 9.53 | 4.18 | 65 | 2661 | 8650.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Support |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
283 | Echo | Gold | 2.61 | 0.62 | 46.33 | 17.62 | 6.52 | 62 | 7874 | NaN | ... | NaN | NaN | NaN | NaN | 36.42 | 7.41 | 25.0 | 8.36 | 3.40 | Damage |
284 | Echo | Platinum | 2.67 | 0.86 | 45.16 | 18.34 | 6.44 | 60 | 8249 | NaN | ... | NaN | NaN | NaN | NaN | 37.88 | 7.77 | 26.0 | 8.65 | 3.57 | Damage |
285 | Echo | Diamond | 2.73 | 1.57 | 47.70 | 18.85 | 6.22 | 57 | 8639 | NaN | ... | NaN | NaN | NaN | NaN | 39.25 | 8.07 | 27.0 | 8.83 | 3.69 | Damage |
286 | Echo | Master | 2.74 | 2.05 | 50.15 | 19.17 | 5.98 | 53 | 8929 | NaN | ... | NaN | NaN | NaN | NaN | 39.86 | 8.25 | 27.0 | 8.84 | 3.82 | Damage |
287 | Echo | Grandmaster | 2.90 | 3.32 | 50.45 | 19.25 | 5.51 | 49 | 9419 | NaN | ... | NaN | NaN | NaN | NaN | 41.25 | 8.25 | 27.0 | 8.81 | 3.65 | Damage |
288 rows Ă— 131 columns
df.columns
Index(['Hero', 'Skill Tier', 'KDA Ratio', 'Pick Rate, %', 'Win Rate, %',
'Eliminations / 10min', 'Objective Kills / 10min',
'Objective Time / 10min', 'Damage / 10min', 'Healing / 10min',
...
'Jagged Blade Accuracy, %', 'Carnage Kills / 10min', 'Wound Uptime, %',
'Rampage Kills / 10min', 'Focusing Beam Accuracy, %',
'Focusing Beam Kills / 10min', 'Sticky Bomb Accuracy, %',
'Sticky Bomb Kills / 10min', 'Duplicate Kills / 10min', 'Role'],
dtype='object', length=131)
df.shape
(288, 131)
Clean Data#
As you can see from the data, there are columns that apply only to certain heroes and their specific abilities, so we drop them from the dataset. We also drop the “All” rows in the “Skill Tier” column so that the data reflects individual ranks, and we rename some of the columns to make them shorter. Note that the columns after “Win Rate, %” are all measured per 10 minutes, so we drop that suffix for simplicity. Let’s name this new dataframe df_ow.
df = df.dropna(axis=1)
df_ow = df[df["Skill Tier"] != "All"].copy()
df_ow.rename({"KDA Ratio":"KDA", "Pick Rate, %": "Pick Rate", "Win Rate, %" : "Win Rate", "Eliminations / 10min" : "Eliminations", "Objective Kills / 10min": "Objective Kills","Objective Time / 10min": "Objective Time", "Damage / 10min" : "Damage", "Deaths / 10min" : "Deaths"}, axis=1, inplace=True)
df_ow
| | Hero | Skill Tier | KDA | Pick Rate | Win Rate | Eliminations | Objective Kills | Objective Time | Damage | Deaths | Role |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Ana | Bronze | 3.87 | 3.71 | 43.97 | 8.37 | 3.80 | 61 | 2508 | 6.47 | Support |
2 | Ana | Silver | 4.01 | 3.78 | 46.68 | 8.88 | 4.08 | 64 | 2573 | 6.36 | Support |
3 | Ana | Gold | 4.36 | 5.46 | 48.90 | 9.21 | 4.14 | 65 | 2610 | 6.13 | Support |
4 | Ana | Platinum | 4.56 | 9.33 | 50.45 | 9.53 | 4.18 | 65 | 2661 | 5.89 | Support |
5 | Ana | Diamond | 4.60 | 16.50 | 51.30 | 9.70 | 4.11 | 63 | 2709 | 5.74 | Support |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
283 | Echo | Gold | 2.61 | 0.62 | 46.33 | 17.62 | 6.52 | 62 | 7874 | 8.28 | Damage |
284 | Echo | Platinum | 2.67 | 0.86 | 45.16 | 18.34 | 6.44 | 60 | 8249 | 8.24 | Damage |
285 | Echo | Diamond | 2.73 | 1.57 | 47.70 | 18.85 | 6.22 | 57 | 8639 | 8.15 | Damage |
286 | Echo | Master | 2.74 | 2.05 | 50.15 | 19.17 | 5.98 | 53 | 8929 | 8.11 | Damage |
287 | Echo | Grandmaster | 2.90 | 3.32 | 50.45 | 19.25 | 5.51 | 49 | 9419 | 7.70 | Damage |
252 rows Ă— 11 columns
df_ow.columns
Index(['Hero', 'Skill Tier', 'KDA', 'Pick Rate', 'Win Rate', 'Eliminations',
'Objective Kills', 'Objective Time', 'Damage', 'Deaths', 'Role'],
dtype='object')
df_ow.shape
(252, 11)
Visualize Data#
Context :#
A common question about the game’s ranked system is “Why am I ranked so low? I feel like I am better than most players since I have more kills than them.” As mentioned before, the game also revolves around completing objectives, where an objective kill is an elimination earned while contesting the objective. Another common debate is whether to focus on kills overall or to focus on the objective, so let us visualize the data with x = Eliminations and y = Objective Kills and see if there is a trend.
Note: Skill Tiers from lowest to highest are Bronze, Silver, Gold, Platinum, Diamond, Master, then Grandmaster#
brush = alt.selection_interval()
c = alt.Chart(df_ow).mark_point().encode(
x='Eliminations',
y='Objective Kills',
color="Skill Tier"
).add_selection(brush)
c1= alt.Chart(df_ow).mark_bar().encode(
x = 'Skill Tier',
y='Eliminations'
).transform_filter(brush)
c2= alt.Chart(df_ow).mark_bar().encode(
x = 'Skill Tier',
y='Objective Kills'
).transform_filter(brush)
c|c1|c2
The bar graphs show the distribution of Eliminations and Objective Kills across skill tiers. Toward the higher ranks the number of eliminations is relatively high and fairly similar across tiers, while for Objective Kills the two highest ranks actually record fewer eliminations on the objective. Based on the scatterplot, there appears to be a positive relationship between Eliminations and Objective Kills, so we can try to model the trend with a linear regression.
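As a quick numeric check of this visual trend (a small aside, not part of the original notebook), we can compute the correlation between the two columns; a value well above 0 would support fitting a line.
# Pearson correlation between Eliminations and Objective Kills in the cleaned data
df_ow["Eliminations"].corr(df_ow["Objective Kills"])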
Linear Regression#
reg=LinearRegression()
reg.fit(df_ow[["Eliminations"]],df_ow["Objective Kills"])
reg.predict(df_ow[["Eliminations"]])
df_ow["Prediction"]=reg.predict(df_ow[["Eliminations"]])
c3 = alt.Chart(df_ow).mark_circle().encode(
x="Eliminations",
y="Objective Kills"
)
c4=alt.Chart(df_ow).mark_line(color="red").encode(
x="Eliminations",
y="Prediction"
)
c3+c4
From the linear regression model, we see that there is a positive relationship between eliminations and objective kills across a hero’s skill tiers. With this information, let us see if we can predict a hero’s role from their eliminations and objective kills.
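We can also read the direction of the relationship straight off the fitted model (a short sketch using the reg object fitted above); a positive coefficient confirms the upward trend in the chart.
# Slope and intercept of the fitted line: Objective Kills ≈ coef * Eliminations + intercept
reg.coef_, reg.intercept_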
col = ["Hero", "Role", "Prediction"]
df_ow[col]
| | Hero | Role | Prediction |
---|---|---|---|
1 | Ana | Support | 3.649432 |
2 | Ana | Support | 3.823823 |
3 | Ana | Support | 3.936665 |
4 | Ana | Support | 4.046087 |
5 | Ana | Support | 4.104217 |
... | ... | ... | ... |
283 | Echo | Damage | 6.812416 |
284 | Echo | Damage | 7.058616 |
285 | Echo | Damage | 7.233008 |
286 | Echo | Damage | 7.342430 |
287 | Echo | Damage | 7.369785 |
252 rows Ă— 3 columns
Since heroes in the support role tend to have fewer eliminations, let’s focus on that role.
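As a rough check of that claim (a sketch, not part of the original analysis), we can compare the average number of eliminations by role; supports should come out lowest.
# Mean eliminations per role in the cleaned data
df_ow.groupby("Role")["Eliminations"].mean()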
Logistic Regression#
We use logistic regression to see whether we can predict if a hero is a support.
df_ow["Support"]=df_ow["Role"].apply(str)
df_ow["isSupport_bool"]=(df["Role"] == "Support")
df_ow["isSupport"] = df_ow["isSupport_bool"].astype(int)
df_ow["isSupport"].mean()
0.2222222222222222
The proportion of rows in the dataset that belong to support heroes (about 22%), which acts as a baseline probability.
c5 = alt.Chart(df_ow).mark_circle().encode(
x="Eliminations",
y="isSupport"
)
c6 = alt.Chart(df_ow).mark_circle().encode(
x="Objective Kills",
y="isSupport"
)
clf = LogisticRegression()
clf.fit(df_ow[["Eliminations"]],df_ow["isSupport"])
df_ow["pred_log"] = clf.predict(df_ow[["Eliminations"]])
c7 = alt.Chart(df_ow).mark_circle(color= "red").encode(
x = "Eliminations",
y = "pred_log"
)
arr = clf.predict_proba(df_ow[["Eliminations"]])
df_ow["pred_proba1"] = arr[:,1]
c8 = alt.Chart(df_ow).mark_circle(color= "red").encode(
x = "Eliminations",
y = "pred_proba1"
)
clf.fit(df_ow[["Objective Kills"]],df_ow["isSupport"])
df_ow["pred_log"] = clf.predict(df_ow[["Objective Kills"]])
c9 = alt.Chart(df_ow).mark_circle(color= "red").encode(
x = "Objective Kills",
y = "pred_log"
)
arr = clf.predict_proba(df_ow[["Objective Kills"]])
df_ow["pred_proba2"] = arr[:,1]
c10 = alt.Chart(df_ow).mark_circle(color= "red").encode(
x = "Objective Kills",
y = "pred_proba2"
)
c5|c6
Shows the distribution of which heroes are supports in each rank, based on their eliminations (c5) and their objective kills (c6).
c7|c5+c8
Shows which heroes are predicted to be supports based on their eliminations (c7) and shows the logistic regression curve overlaid on c5 (c5+c8). The prediction in c7 is binary, while the predicted-probability curve suggests that heroes classified as supports typically get around 8-14 eliminations.
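To make that transition point more concrete (a sketch, not part of the original analysis; clf was refit on Objective Kills above, so we fit a fresh model here), we can refit a one-feature model on Eliminations and solve for the value where the predicted probability of being a support crosses 0.5, which is -intercept/coefficient.
# Hypothetical helper model: refit on Eliminations alone and find the 50% decision boundary
clf_elim = LogisticRegression()
clf_elim.fit(df_ow[["Eliminations"]], df_ow["isSupport"])
-clf_elim.intercept_[0] / clf_elim.coef_[0, 0]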
c9|c6+c10
Shows which heroes are predicted to be supports based on their objective kills (c9) and shows the logistic regression curve overlaid on c6 (c6+c10). The prediction is again binary and follows a curve similar to the Eliminations model. The predicted-probability curve, however, is stretched out more than the Eliminations curve, which suggests this prediction is a little less sharp than the one based on Eliminations.
Train Test Split#
Now let us see if we can classify whether or not a hero is a support based on Eliminations and Objective Kills using a train/test split. We can use “isSupport_bool” since this boolean column tells us whether a hero is a support.
cols = ["Eliminations", "Objective Kills"]
X_train, X_test, y_train, y_test = train_test_split(df_ow[cols], df_ow["isSupport_bool"], test_size=0.2, random_state=0)
clf1=LogisticRegression()
clf1.fit(X_train, y_train)
(clf1.predict(X_test) == y_test).sum()
50
We made the correct prediction 50 times on the test set.
(clf1.predict(X_test) == y_test).sum()/len(X_test)
0.9803921568627451
The proportion of the test set where we made the correct prediction.
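Equivalently (a small aside, not part of the original workflow), scikit-learn’s built-in score method reports the same test accuracy.
# Mean accuracy on the test set; should match the proportion computed above
clf1.score(X_test, y_test)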
clf1.coef_
Eliminations_coef,ObjectiveKills_coef=clf1.coef_[0]
ObjectiveKills_coef
0.8360825461137451
Specifying the index to retrieve the correct coefficient for each feature.
sigmoid = lambda x: 1/(1+np.exp(-x))
Eliminations = 14
ObjectiveKills = 4
sigmoid(Eliminations_coef*Eliminations+ObjectiveKills_coef*ObjectiveKills+clf1.intercept_)
array([0.19970528])
The model’s prediction when a hero has 14 eliminations and 4 objective kills: about a 20% chance that this is a support hero.
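As a sanity check (a sketch; the input must use the same column names the model was trained on), the same probability can be obtained directly from the fitted classifier.
# Probability of the "support" class for a hero with 14 eliminations and 4 objective kills;
# this should match the manual sigmoid computation above
clf1.predict_proba(pd.DataFrame([[14, 4]], columns=cols))[:, 1]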
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
loss_train=log_loss(y_train, clf.predict_proba(X_train))
loss_test=log_loss(y_test,clf.predict_proba(X_test))
loss_train
0.07970796733173745
Shows the measure of uncertainty (log loss) on the training data. Since this value is quite low, there is a low chance of error, which is good.
loss_test
0.11020020342756341
Shows the measure of uncertainty (log loss) on the test data. This value is nearly as low as the training loss; the small gap makes sense since the test set contains fewer points than the training set.
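For reference, for a binary target the log loss reported above is the average cross-entropy of the predicted probabilities,
$$\text{log loss} = -\frac{1}{n}\sum_{i=1}^{n}\big[y_i\log(p_i) + (1-y_i)\log(1-p_i)\big],$$
where $y_i$ is 1 if the hero is a support and $p_i$ is the predicted probability of that; lower values mean the predicted probabilities track the true labels more closely.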
KNeighborsRegressor#
Now let us look at a different aspect of the game: how many times a hero is eliminated over the course of a match. We will use a KNeighborsRegressor to see if there is a relationship between Eliminations and Objective Kills on one side and Deaths on the other. This will also tell us whether there is any overfitting or underfitting in the data.
X_train1, X_test1, y_train1, y_test1 = train_test_split(df_ow[cols], df_ow["Deaths"], train_size=0.25)
reg = KNeighborsRegressor(n_neighbors=10)
reg.fit(X_train, y_train)
KNeighborsRegressor(n_neighbors=10)
reg.predict(X_test)
array([0. , 1. , 1. , 0. , 0. , 0. , 0.1, 0.1, 0.1, 1. , 0. , 0. , 0. ,
0. , 0. , 0.4, 0.1, 1. , 0. , 0. , 0. , 0. , 0.3, 1. , 0. , 0. ,
1. , 0. , 0.8, 0.4, 0. , 0.1, 0. , 0. , 1. , 0.2, 0. , 1. , 0. ,
0.2, 0. , 1. , 0.1, 0.2, 0. , 0. , 0.2, 0. , 0.1, 0. , 0.9])
The model’s predicted values for each row of the test data.
X_test.shape
(51, 2)
mean_absolute_error(reg.predict(X_test), y_test)
0.07647058823529412
mean_absolute_error(reg.predict(X_train), y_train)
0.06766169154228856
reg1=LinearRegression()
reg1.fit(X_train1,y_train1)
X_train1["Pred"]=reg1.predict(X_train1)
X_test1["Pred"]=reg1.predict(X_test1)
Let’s plot the “Pred” line, the model’s predicted deaths, against each of the two predictor columns.
c10 = alt.Chart(X_test1).mark_line().encode(
x="Eliminations",
y="Pred"
)
c11 =alt.Chart(X_test1).mark_line(color="red").encode(
x="Objective Kills",
y="Pred"
)
c10 + c11
From the graphs we can see that the two lines are close, but Objective Kills appears to have the stronger influence on predicted deaths, since the predicted line rises as Objective Kills increases. Next we will use a function to retrieve the mean absolute errors of the train and test data so we can draw a test error curve and check for overfitting or underfitting.
def get_scores(k):
reg = KNeighborsRegressor(n_neighbors=k)
reg.fit(X_train, y_train)
train_error = mean_absolute_error(reg.predict(X_train), y_train)
test_error = mean_absolute_error(reg.predict(X_test), y_test)
return (train_error, test_error)
get_scores(10)
(0.06766169154228856, 0.07647058823529412)
df_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})
When we plot a test error curve, we want higher flexibility (and therefore higher variance) on the right. Since higher values of K correspond to lower flexibility, we add a new column to the dataframe containing the reciprocals of the K values, so that flexibility increases from left to right.
for i in df_scores.index:
df_scores.loc[i,["train_error","test_error"]] = get_scores(df_scores.loc[i,"k"])
df_scores
| | k | train_error | test_error |
---|---|---|---|
0 | 1 | 0.000000 | 0.078431 |
1 | 2 | 0.034826 | 0.068627 |
2 | 3 | 0.046434 | 0.091503 |
3 | 4 | 0.053483 | 0.083333 |
4 | 5 | 0.054726 | 0.078431 |
... | ... | ... | ... |
144 | 145 | 0.216435 | 0.222718 |
145 | 146 | 0.216656 | 0.223476 |
146 | 147 | 0.216435 | 0.223556 |
147 | 148 | 0.216754 | 0.223768 |
148 | 149 | 0.217069 | 0.224240 |
149 rows Ă— 3 columns
df_scores["kinv"] = 1/df_scores.k
ctrain = alt.Chart(df_scores).mark_line().encode(
x = "kinv",
y = "train_error"
)
ctest = alt.Chart(df_scores).mark_line(color="orange").encode(
x = "kinv",
y = "test_error"
)
ctrain+ctest
The blue curve is the training error and the orange curve is the test error. From the graph, we observe that underfitting occurs for very high values of K (low flexibility) and overfitting occurs for very small values of K (high flexibility).
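If we wanted a single value of K from this curve (just a sketch; the exact answer depends on the random train/test split), we could read off the K with the smallest test error.
# K value achieving the lowest test error on this particular split
df_scores.loc[df_scores["test_error"].idxmin(), "k"]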
Summary#
To sum everything up, we found a positive relationship between Eliminations and Objective Kills. We also computed probabilities to figure out whether or not a hero belongs to the “Support” role, which makes up about 22% of the dataset. Lastly, we saw that underfitting occurs in our model for high values of K and overfitting occurs for small values of K.