Overwatch 2#

Author: Luke Galang (lrgalang@uci.edu)

Course Project, UC Irvine, Math 10, S23

Introduction#

This dataset comes from Blizzard Entertainment’s Overwatch 2, a video game in which a multitude of heroes face off to accomplish a series of objectives using their different abilities. Using what we learned in Math 10, I want to show visually whether there are any relationships in the data and to predict probabilities associated with certain skill tiers or heroes. I also want to model relationships between certain statistics in the dataset to understand whether those stats are correlated.

Main Portion of Project#

Load Data#

import pandas as pd
import numpy as np
import altair as alt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import log_loss
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("Overwatch2.csv")
df
Hero Skill Tier KDA Ratio Pick Rate, % Win Rate, % Eliminations / 10min Objective Kills / 10min Objective Time / 10min Damage / 10min Healing / 10min ... Jagged Blade Accuracy, % Carnage Kills / 10min Wound Uptime, % Rampage Kills / 10min Focusing Beam Accuracy, % Focusing Beam Kills / 10min Sticky Bomb Accuracy, % Sticky Bomb Kills / 10min Duplicate Kills / 10min Role
0 Ana All 4.49 10.18 50.99 9.46 4.05 63 2676 8686.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Support
1 Ana Bronze 3.87 3.71 43.97 8.37 3.80 61 2508 7483.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Support
2 Ana Silver 4.01 3.78 46.68 8.88 4.08 64 2573 7875.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Support
3 Ana Gold 4.36 5.46 48.90 9.21 4.14 65 2610 8251.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Support
4 Ana Platinum 4.56 9.33 50.45 9.53 4.18 65 2661 8650.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Support
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
283 Echo Gold 2.61 0.62 46.33 17.62 6.52 62 7874 NaN ... NaN NaN NaN NaN 36.42 7.41 25.0 8.36 3.40 Damage
284 Echo Platinum 2.67 0.86 45.16 18.34 6.44 60 8249 NaN ... NaN NaN NaN NaN 37.88 7.77 26.0 8.65 3.57 Damage
285 Echo Diamond 2.73 1.57 47.70 18.85 6.22 57 8639 NaN ... NaN NaN NaN NaN 39.25 8.07 27.0 8.83 3.69 Damage
286 Echo Master 2.74 2.05 50.15 19.17 5.98 53 8929 NaN ... NaN NaN NaN NaN 39.86 8.25 27.0 8.84 3.82 Damage
287 Echo Grandmaster 2.90 3.32 50.45 19.25 5.51 49 9419 NaN ... NaN NaN NaN NaN 41.25 8.25 27.0 8.81 3.65 Damage

288 rows × 131 columns

df.columns
Index(['Hero', 'Skill Tier', 'KDA Ratio', 'Pick Rate, %', 'Win Rate, %',
       'Eliminations / 10min', 'Objective Kills / 10min',
       'Objective Time / 10min', 'Damage / 10min', 'Healing / 10min',
       ...
       'Jagged Blade Accuracy, %', 'Carnage Kills / 10min', 'Wound Uptime, %',
       'Rampage Kills / 10min', 'Focusing Beam Accuracy, %',
       'Focusing Beam Kills / 10min', 'Sticky Bomb Accuracy, %',
       'Sticky Bomb Kills / 10min', 'Duplicate Kills / 10min', 'Role'],
      dtype='object', length=131)
df.shape
(288, 131)

Clean Data#

As you can see from the data, there are columns that only apply to certain heroes and their specific abilities, so we drop them from the dataset. We also drop the “All” rows in the “Skill Tier” column so that the data reflects individual ranks. Next we rename some of the columns to make them shorter; note that the columns after “Win Rate, %” are measured over a span of 10 minutes, so we drop that suffix for simplicity. Let’s name this new dataframe df_ow.

df = df.dropna(axis=1)  # drop columns containing any missing values (the hero-specific ability stats)
df_ow = df[df["Skill Tier"] != "All"].copy()  # keep only individual ranks
# shorten the column names and drop the "/ 10min" suffix
df_ow.rename({"KDA Ratio":"KDA", "Pick Rate, %": "Pick Rate", "Win Rate, %" : "Win Rate", "Eliminations / 10min" : "Eliminations", "Objective Kills / 10min": "Objective Kills","Objective Time / 10min": "Objective Time", "Damage / 10min" : "Damage", "Deaths / 10min" : "Deaths"}, axis=1, inplace=True)
df_ow
Hero Skill Tier KDA Pick Rate Win Rate Eliminations Objective Kills Objective Time Damage Deaths Role
1 Ana Bronze 3.87 3.71 43.97 8.37 3.80 61 2508 6.47 Support
2 Ana Silver 4.01 3.78 46.68 8.88 4.08 64 2573 6.36 Support
3 Ana Gold 4.36 5.46 48.90 9.21 4.14 65 2610 6.13 Support
4 Ana Platinum 4.56 9.33 50.45 9.53 4.18 65 2661 5.89 Support
5 Ana Diamond 4.60 16.50 51.30 9.70 4.11 63 2709 5.74 Support
... ... ... ... ... ... ... ... ... ... ... ...
283 Echo Gold 2.61 0.62 46.33 17.62 6.52 62 7874 8.28 Damage
284 Echo Platinum 2.67 0.86 45.16 18.34 6.44 60 8249 8.24 Damage
285 Echo Diamond 2.73 1.57 47.70 18.85 6.22 57 8639 8.15 Damage
286 Echo Master 2.74 2.05 50.15 19.17 5.98 53 8929 8.11 Damage
287 Echo Grandmaster 2.90 3.32 50.45 19.25 5.51 49 9419 7.70 Damage

252 rows × 11 columns

df_ow.columns
Index(['Hero', 'Skill Tier', 'KDA', 'Pick Rate', 'Win Rate', 'Eliminations',
       'Objective Kills', 'Objective Time', 'Damage', 'Deaths', 'Role'],
      dtype='object')
df_ow.shape
(252, 11)

Visualize Data#

Context :#

A common question about the game’s ranked system is “Why am I ranked so low? I feel like I am better than most players since I have more kills than them.” As mentioned before, this game also focuses on completing an objective, where an objective kill is an elimination earned while contesting the objective. Another common argument is whether to focus on kills as a whole or on the objective, so let us visualize this by plotting x = Eliminations against y = Objective Kills and see if there is a trend present.

Note: Skill Tiers from lowest to highest are Bronze, Silver, Gold, Platinum, Diamond, Master, then Grandmaster#

brush = alt.selection_interval()
c = alt.Chart(df_ow).mark_point().encode(
    x='Eliminations',
    y='Objective Kills',
    color="Skill Tier"
).add_selection(brush)

c1= alt.Chart(df_ow).mark_bar().encode(
    x = 'Skill Tier',
    y='Eliminations'
).transform_filter(brush)

c2= alt.Chart(df_ow).mark_bar().encode(
    x = 'Skill Tier',
    y='Objective Kills'
).transform_filter(brush)

c|c1|c2

The bar graphs show how Eliminations and Objective Kills are distributed across the skill tiers. Toward the higher ranks, elimination counts are relatively high and similar to one another, while for Objective Kills the two highest ranks actually record fewer eliminations on the objective. The scatterplot suggests a positive relationship between Eliminations and Objective Kills, so we can try to model that trend with a linear regression.
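
If we want a numeric version of what the bar charts suggest, one quick sketch (assuming the cleaned df_ow from above) is to average the two stats within each skill tier:

# Rough numeric check of the bar charts: average Eliminations and
# Objective Kills within each skill tier, ordered from lowest to highest rank.
tier_order = ["Bronze", "Silver", "Gold", "Platinum", "Diamond", "Master", "Grandmaster"]
df_ow.groupby("Skill Tier")[["Eliminations", "Objective Kills"]].mean().reindex(tier_order)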

Linear Regression#

reg=LinearRegression()
reg.fit(df_ow[["Eliminations"]],df_ow["Objective Kills"])
reg.predict(df_ow[["Eliminations"]])
df_ow["Prediction"]=reg.predict(df_ow[["Eliminations"]])
c3 = alt.Chart(df_ow).mark_circle().encode(
    x="Eliminations",
    y="Objective Kills"
)
c4=alt.Chart(df_ow).mark_line(color="red").encode(
    x="Eliminations",
    y="Prediction"
)
c3+c4

From the linear regression model, we see that there is a positive relationship between eliminations and objective kills across heroes and skill tiers. With this information, let us see if we can predict a hero’s role from their eliminations and objective kills.
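
One way to back up this claim numerically is to look at the fitted slope and intercept (a quick sketch; the exact numbers depend on the fit above):

# Slope and intercept of the fitted Eliminations -> Objective Kills line.
# A positive slope confirms the upward trend seen in the chart.
print(f"slope: {reg.coef_[0]:.3f}, intercept: {reg.intercept_:.3f}")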

col = ["Hero", "Role", "Prediction"]
df_ow[col] 
Hero Role Prediction
1 Ana Support 3.649432
2 Ana Support 3.823823
3 Ana Support 3.936665
4 Ana Support 4.046087
5 Ana Support 4.104217
... ... ... ...
283 Echo Damage 6.812416
284 Echo Damage 7.058616
285 Echo Damage 7.233008
286 Echo Damage 7.342430
287 Echo Damage 7.369785

252 rows × 3 columns

Since heroes in the support role tend to have lower elimination counts, let’s focus on that role.
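
As a quick sanity check of this claim, we can compare the average eliminations for each role (a sketch; the exact values depend on the data loaded above):

# Average eliminations per role; Support should come out lowest.
df_ow.groupby("Role")["Eliminations"].mean().sort_values()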

Logistic Regression#

We use logistic regression to see whether we can predict if a hero is a support.

df_ow["Support"]=df_ow["Role"].apply(str)
df_ow["isSupport_bool"]=(df["Role"] == "Support")
df_ow["isSupport"] = df_ow["isSupport_bool"].astype(int)
df_ow["isSupport"].mean()
0.2222222222222222

The fraction of rows in the dataset that belong to support heroes (about 22%).

c5 = alt.Chart(df_ow).mark_circle().encode(
    x="Eliminations",
    y="isSupport"
)
c6 = alt.Chart(df_ow).mark_circle().encode(
    x="Objective Kills",
    y="isSupport"
)
clf = LogisticRegression()
clf.fit(df_ow[["Eliminations"]],df_ow["isSupport"])
df_ow["pred_log"] = clf.predict(df_ow[["Eliminations"]])
c7 = alt.Chart(df_ow).mark_circle(color= "red").encode(
    x = "Eliminations",
    y = "pred_log"
)
arr = clf.predict_proba(df_ow[["Eliminations"]])
df_ow["pred_proba1"] = arr[:,1]
c8 = alt.Chart(df_ow).mark_circle(color= "red").encode(
    x = "Eliminations",
    y = "pred_proba1"
)
clf.fit(df_ow[["Objective Kills"]],df_ow["isSupport"])
df_ow["pred_log"] = clf.predict(df_ow[["Objective Kills"]])
c9 = alt.Chart(df_ow).mark_circle(color= "red").encode(
    x = "Objective Kills",
    y = "pred_log"
)
arr = clf.predict_proba(df_ow[["Objective Kills"]])
df_ow["pred_proba2"] = arr[:,1]
c10 = alt.Chart(df_ow).mark_circle(color= "red").encode(
    x = "Objective Kills",
    y = "pred_proba2"
)
c5|c6

These charts show which hero/rank rows are supports, plotted against their eliminations (c5) and their objective kills (c6).

c7|c5+c8

The left chart (c7) shows the model’s 0/1 predictions of whether a hero is a support based on eliminations, and the right chart (c5+c8) overlays the predicted-probability curve on the actual labels from c5. The curve has the usual logistic shape and shows that the heroes most likely to be supports are those with relatively low elimination counts, roughly 8–14 eliminations per 10 minutes.

c9|c6+c10

The left chart (c9) shows the 0/1 predictions based on objective kills, and the right chart (c6+c10) overlays the predicted-probability curve on the labels from c6. The curve has the same general shape as the one for Eliminations, but it is stretched over a wider range, which suggests the prediction based on Objective Kills is a bit less sharp than the one based on Eliminations.

Train Test Split#

Now let us see if we can classify whether a hero is a support based on Eliminations and Objective Kills using a train/test split. We can use “isSupport_bool”, since this boolean column tells us whether a hero is a support.

cols = ["Eliminations", "Objective Kills"]
X_train, X_test, y_train, y_test = train_test_split(df_ow[cols], df_ow["isSupport_bool"], test_size=0.2, random_state=0)
clf1=LogisticRegression()
clf1.fit(X_train, y_train) 
(clf1.predict(X_test) == y_test).sum()
50

The model’s prediction was correct for 50 of the 51 test rows.

(clf1.predict(X_test) == y_test).sum()/len(X_test)
0.9803921568627451

The proportion of the test set for which we made the correct prediction (about 98%).
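
To judge whether 98% is actually good, it helps to compare it against the naive baseline of always guessing “not a support” (a quick sketch; since only about 22% of rows are supports, this baseline is already fairly high):

# Baseline: accuracy of always predicting "not a support" on the test set.
# The logistic model's ~98% accuracy should be compared against this number.
(~y_test).mean()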

clf1.coef_
Eliminations_coef,ObjectiveKills_coef=clf1.coef_[0]
ObjectiveKills_coef
0.8360825461137451

Specifying the index to retrieve the correct coefficients.

sigmoid = lambda x: 1/(1+np.exp(-x))
Eliminations = 14
ObjectiveKills = 4
sigmoid(Eliminations_coef*Eliminations+ObjectiveKills_coef*ObjectiveKills+clf1.intercept_)
array([0.19970528])

The model’s prediction when a hero has 14 eliminations and 4 objective kills: roughly a 20% probability that this is a support hero.
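
The same probability can also be obtained directly from the fitted classifier, which is a useful cross-check of the hand-computed sigmoid (a sketch; new_hero is just a hypothetical one-row input):

# Cross-check of the manual sigmoid computation: predict_proba on the same
# hypothetical hero (14 eliminations, 4 objective kills) should give
# roughly the same value as above.
new_hero = pd.DataFrame([[14, 4]], columns=cols)
clf1.predict_proba(new_hero)[:, 1]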

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
loss_train=log_loss(y_train, clf.predict_proba(X_train))
loss_test=log_loss(y_test,clf.predict_proba(X_test))
loss_train
0.07970796733173745

Shows the measure of uncertainty on the training data. Since this value is quite low, the classifier assigns high probability to the correct class on the training set, which is good.

loss_test
0.11020020342756341

Shows the measure of uncertainty on the test data. It is slightly higher than the training loss, which is expected since the model was fit on the training data, but it is still low, so the classifier generalizes well to the test set.
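
For reference, log_loss is just the average negative log-probability assigned to the true class, so the test value above can be reproduced by hand (a minimal sketch, assuming the binary 0/1 encoding of isSupport_bool):

# Log loss written out by hand for the binary case: the average of
# -log(probability assigned to the true class). Probabilities are clipped
# away from 0 and 1 so the logarithm stays finite, mirroring sklearn.
p = np.clip(clf.predict_proba(X_test)[:, 1], 1e-15, 1 - 1e-15)
y_true = y_test.astype(int).to_numpy()
manual_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
manual_loss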

KNeighborsRegressor#

Now let us look into a different aspect of the game: how many times a hero is eliminated during the course of a match. We will use a KNeighborsRegressor to see whether there is a relationship between Eliminations and Objective Kills on the one hand and Deaths on the other. This will also let us check whether there is any overfitting or underfitting in the data.

X_train1, X_test1, y_train1, y_test1 = train_test_split(df_ow[cols], df_ow["Deaths"], train_size=0.25)
reg = KNeighborsRegressor(n_neighbors=10)
reg.fit(X_train, y_train)
KNeighborsRegressor(n_neighbors=10)
reg.predict(X_test)
array([0. , 1. , 1. , 0. , 0. , 0. , 0.1, 0.1, 0.1, 1. , 0. , 0. , 0. ,
       0. , 0. , 0.4, 0.1, 1. , 0. , 0. , 0. , 0. , 0.3, 1. , 0. , 0. ,
       1. , 0. , 0.8, 0.4, 0. , 0.1, 0. , 0. , 1. , 0.2, 0. , 1. , 0. ,
       0.2, 0. , 1. , 0.1, 0.2, 0. , 0. , 0.2, 0. , 0.1, 0. , 0.9])

The regressor’s predictions on the test data.

X_test.shape
(51, 2)
mean_absolute_error(reg.predict(X_test), y_test)
0.07647058823529412
mean_absolute_error(reg.predict(X_train), y_train)
0.06766169154228856
reg1=LinearRegression()
reg1.fit(X_train1,y_train1)
X_train1["Pred"]=reg1.predict(X_train1)
X_test1["Pred"]=reg1.predict(X_test1)

Let’s plot the “Pred” line, the predicted deaths, against each of the two predictor columns.

c10 = alt.Chart(X_test1).mark_line().encode(
    x="Eliminations",
    y="Pred"
)
c11 =alt.Chart(X_test1).mark_line(color="red").encode(
    x="Objective Kills",
    y="Pred"
)
c10 + c11
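
Before interpreting the chart, we can also peek at the raw coefficients of reg1 for the two predictors (a sketch only; the features are on different scales, so this is a loose guide rather than a formal comparison):

# Rough check of which predictor drives the fitted deaths model more.
# Note: Eliminations and Objective Kills are on different scales, so the
# raw coefficients are only a loose guide to relative importance.
dict(zip(cols, reg1.coef_))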

From the graphs we can see that the two lines are close, but deaths appear to be influenced more by objective kills, since the predicted line increases as objective kills increase. Next we will use a function to retrieve the mean absolute errors of the train and test data and build a test error curve to check for overfitting or underfitting.

def get_scores(k):
    reg = KNeighborsRegressor(n_neighbors=k)
    reg.fit(X_train, y_train)
    train_error = mean_absolute_error(reg.predict(X_train), y_train)
    test_error = mean_absolute_error(reg.predict(X_test), y_test)
    return (train_error, test_error)
get_scores(10)
(0.06766169154228856, 0.07647058823529412)
df_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})

When we plot a test error curve, we want the model’s flexibility (and hence its variance) to increase from left to right. Since higher values of K correspond to lower flexibility, we add a new column to the dataframe containing the reciprocals of the K values and use that as the x-axis.

for i in df_scores.index:
    df_scores.loc[i,["train_error","test_error"]] = get_scores(df_scores.loc[i,"k"])
df_scores
k train_error test_error
0 1 0.000000 0.078431
1 2 0.034826 0.068627
2 3 0.046434 0.091503
3 4 0.053483 0.083333
4 5 0.054726 0.078431
... ... ... ...
144 145 0.216435 0.222718
145 146 0.216656 0.223476
146 147 0.216435 0.223556
147 148 0.216754 0.223768
148 149 0.217069 0.224240

149 rows × 3 columns

df_scores["kinv"] = 1/df_scores.k
ctrain = alt.Chart(df_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
)
ctest = alt.Chart(df_scores).mark_line(color="orange").encode(
    x = "kinv",
    y = "test_error"
)
ctrain+ctest

The blue curve is the training error and the orange curve is the test error. From the graph, we observe that underfitting occurs for very high values of K and overfitting for small values of K.
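
To make the sweet spot concrete, one quick sketch is to pull out the value of K that minimizes the test error (the exact value depends on the particular train/test split):

# The K with the smallest test error sits between the overfitting (small K)
# and underfitting (large K) regimes.
best_k = df_scores.loc[df_scores["test_error"].idxmin(), "k"]
best_k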

Summary#


To sum everything up, we found a positive relationship between Eliminations and Objective Kills. We also computed predicted probabilities for whether a hero belongs to the “Support” role, which makes up about 22% of the rows in the dataset. Lastly, we saw that the model underfits for high values of K and overfits for small values of K.

References#


  • Dataset source: Kaggle


