Overwatch 2#

Author: Luke Galang (lrgalang@uci.edu)

Course Project, UC Irvine, Math 10, S23

Introduction#

This dataset comes from Blizzard Entertainment’s Overwatch 2, a video game in which a multitude of heroes face off to accomplish a series of objectives using their different abilities. Using what we learned in Math 10, I want to show visually whether there are any relationships in the data and to predict probabilities associated with certain skill tiers or heroes. I also want to model relationships between certain statistics in the dataset to understand whether those stats are correlated.

Main Portion of Project#

Load Data#

import pandas as pd
import numpy as np
import altair as alt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import log_loss
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("Overwatch2.csv")
df
Hero Skill Tier KDA Ratio Pick Rate, % Win Rate, % Eliminations / 10min Objective Kills / 10min Objective Time / 10min Damage / 10min Healing / 10min ... Jagged Blade Accuracy, % Carnage Kills / 10min Wound Uptime, % Rampage Kills / 10min Focusing Beam Accuracy, % Focusing Beam Kills / 10min Sticky Bomb Accuracy, % Sticky Bomb Kills / 10min Duplicate Kills / 10min Role
0 Ana All 4.49 10.18 50.99 9.46 4.05 63 2676 8686.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Support
1 Ana Bronze 3.87 3.71 43.97 8.37 3.80 61 2508 7483.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Support
2 Ana Silver 4.01 3.78 46.68 8.88 4.08 64 2573 7875.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Support
3 Ana Gold 4.36 5.46 48.90 9.21 4.14 65 2610 8251.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Support
4 Ana Platinum 4.56 9.33 50.45 9.53 4.18 65 2661 8650.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN Support
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
283 Echo Gold 2.61 0.62 46.33 17.62 6.52 62 7874 NaN ... NaN NaN NaN NaN 36.42 7.41 25.0 8.36 3.40 Damage
284 Echo Platinum 2.67 0.86 45.16 18.34 6.44 60 8249 NaN ... NaN NaN NaN NaN 37.88 7.77 26.0 8.65 3.57 Damage
285 Echo Diamond 2.73 1.57 47.70 18.85 6.22 57 8639 NaN ... NaN NaN NaN NaN 39.25 8.07 27.0 8.83 3.69 Damage
286 Echo Master 2.74 2.05 50.15 19.17 5.98 53 8929 NaN ... NaN NaN NaN NaN 39.86 8.25 27.0 8.84 3.82 Damage
287 Echo Grandmaster 2.90 3.32 50.45 19.25 5.51 49 9419 NaN ... NaN NaN NaN NaN 41.25 8.25 27.0 8.81 3.65 Damage

288 rows × 131 columns

df.columns
Index(['Hero', 'Skill Tier', 'KDA Ratio', 'Pick Rate, %', 'Win Rate, %',
       'Eliminations / 10min', 'Objective Kills / 10min',
       'Objective Time / 10min', 'Damage / 10min', 'Healing / 10min',
       ...
       'Jagged Blade Accuracy, %', 'Carnage Kills / 10min', 'Wound Uptime, %',
       'Rampage Kills / 10min', 'Focusing Beam Accuracy, %',
       'Focusing Beam Kills / 10min', 'Sticky Bomb Accuracy, %',
       'Sticky Bomb Kills / 10min', 'Duplicate Kills / 10min', 'Role'],
      dtype='object', length=131)
df.shape
(288, 131)

Clean Data#

As you can see from the data, there are columns that only apply to certain heroes and their specific abilities, so we drop them from the dataset. We also drop the “All” rows in the “Skill Tier” column so that the data reflects individual ranks. Next we rename some of the columns to make them shorter; note that the columns after “Win Rate, %” are measured over a span of 10 minutes, so we drop that suffix for simplicity. Let’s name this new dataframe df_ow.

df = df.dropna(axis=1)  # drop columns containing any missing values (the hero-specific ability stats)
df_ow = df[df["Skill Tier"] != "All"].copy()  # keep only individual ranks
# shorten the column names and drop the "/ 10min" suffix
df_ow.rename({"KDA Ratio":"KDA", "Pick Rate, %": "Pick Rate", "Win Rate, %" : "Win Rate", "Eliminations / 10min" : "Eliminations", "Objective Kills / 10min": "Objective Kills","Objective Time / 10min": "Objective Time", "Damage / 10min" : "Damage", "Deaths / 10min" : "Deaths"}, axis=1, inplace=True)
df_ow
Hero Skill Tier KDA Pick Rate Win Rate Eliminations Objective Kills Objective Time Damage Deaths Role
1 Ana Bronze 3.87 3.71 43.97 8.37 3.80 61 2508 6.47 Support
2 Ana Silver 4.01 3.78 46.68 8.88 4.08 64 2573 6.36 Support
3 Ana Gold 4.36 5.46 48.90 9.21 4.14 65 2610 6.13 Support
4 Ana Platinum 4.56 9.33 50.45 9.53 4.18 65 2661 5.89 Support
5 Ana Diamond 4.60 16.50 51.30 9.70 4.11 63 2709 5.74 Support
... ... ... ... ... ... ... ... ... ... ... ...
283 Echo Gold 2.61 0.62 46.33 17.62 6.52 62 7874 8.28 Damage
284 Echo Platinum 2.67 0.86 45.16 18.34 6.44 60 8249 8.24 Damage
285 Echo Diamond 2.73 1.57 47.70 18.85 6.22 57 8639 8.15 Damage
286 Echo Master 2.74 2.05 50.15 19.17 5.98 53 8929 8.11 Damage
287 Echo Grandmaster 2.90 3.32 50.45 19.25 5.51 49 9419 7.70 Damage

252 rows × 11 columns

df_ow.columns
Index(['Hero', 'Skill Tier', 'KDA', 'Pick Rate', 'Win Rate', 'Eliminations',
       'Objective Kills', 'Objective Time', 'Damage', 'Deaths', 'Role'],
      dtype='object')
df_ow.shape
(252, 11)

Visualize Data#

Context :#

A common question about the game’s ranked system is “Why am I ranked so low? I feel like I am better than most players since I have more kills than them.” As mentioned before, this game also focuses on completing an objective, where an objective kill is an elimination earned while contesting the objective. Another common argument is whether to focus on kills as a whole or on the objective, so let us visualize this by plotting x = Eliminations against y = Objective Kills and see if there is a trend present.

Note: Skill Tiers from lowest to highest are Bronze, Silver, Gold, Platinum, Diamond, Master, then Grandmaster#

brush = alt.selection_interval()
c = alt.Chart(df_ow).mark_point().encode(
    x='Eliminations',
    y='Objective Kills',
    color="Skill Tier"
).add_selection(brush)

c1= alt.Chart(df_ow).mark_bar().encode(
    x = 'Skill Tier',
    y='Eliminations'
).transform_filter(brush)

c2= alt.Chart(df_ow).mark_bar().encode(
    x = 'Skill Tier',
    y='Objective Kills'
).transform_filter(brush)

c|c1|c2

The bar graphs show how Eliminations and Objective Kills are distributed across the skill tiers. Toward the higher ranks, elimination counts are relatively high and similar to one another, while for Objective Kills the two highest ranks actually record fewer eliminations on the objective. The scatterplot suggests a positive relationship between Eliminations and Objective Kills, so we can try to model that trend with a linear regression.
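
If we want a numeric version of what the bar charts suggest, one quick sketch (assuming the cleaned df_ow from above) is to average the two stats within each skill tier:

# Rough numeric check of the bar charts: average Eliminations and
# Objective Kills within each skill tier, ordered from lowest to highest rank.
tier_order = ["Bronze", "Silver", "Gold", "Platinum", "Diamond", "Master", "Grandmaster"]
df_ow.groupby("Skill Tier")[["Eliminations", "Objective Kills"]].mean().reindex(tier_order)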

Linear Regression#

reg=LinearRegression()
reg.fit(df_ow[["Eliminations"]],df_ow["Objective Kills"])
reg.predict(df_ow[["Eliminations"]])
df_ow["Prediction"]=reg.predict(df_ow[["Eliminations"]])
c3 = alt.Chart(df_ow).mark_circle().encode(
    x="Eliminations",
    y="Objective Kills"
)
c4=alt.Chart(df_ow).mark_line(color="red").encode(
    x="Eliminations",
    y="Prediction"
)
c3+c4

From the linear regression model, we see that there is a positive relationship between eliminations and objective kills across heroes and skill tiers. With this information, let us see if we can predict a hero’s role from their eliminations and objective kills.
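
One way to back up this claim numerically is to look at the fitted slope and intercept (a quick sketch; the exact numbers depend on the fit above):

# Slope and intercept of the fitted Eliminations -> Objective Kills line.
# A positive slope confirms the upward trend seen in the chart.
print(f"slope: {reg.coef_[0]:.3f}, intercept: {reg.intercept_:.3f}")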

col = ["Hero", "Role", "Prediction"]
df_ow[col] 
Hero Role Prediction
1 Ana Support 3.649432
2 Ana Support 3.823823
3 Ana Support 3.936665
4 Ana Support 4.046087
5 Ana Support 4.104217
... ... ... ...
283 Echo Damage 6.812416
284 Echo Damage 7.058616
285 Echo Damage 7.233008
286 Echo Damage 7.342430
287 Echo Damage 7.369785

252 rows × 3 columns

Since heroes in the support role tend to have lower elimination counts, let’s focus on that role.
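
As a quick sanity check of this claim, we can compare the average eliminations for each role (a sketch; the exact values depend on the data loaded above):

# Average eliminations per role; Support should come out lowest.
df_ow.groupby("Role")["Eliminations"].mean().sort_values()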

Logistic Regression#

We use logistic regression to see whether we can predict if a hero is a support.

df_ow["Support"]=df_ow["Role"].apply(str)
df_ow["isSupport_bool"]=(df["Role"] == "Support")
df_ow["isSupport"] = df_ow["isSupport_bool"].astype(int)
df_ow["isSupport"].mean()
0.2222222222222222

The fraction of rows in the dataset that belong to support heroes (about 22%).

c5 = alt.Chart(df_ow).mark_circle().encode(
    x="Eliminations",
    y="isSupport"
)
c6 = alt.Chart(df_ow).mark_circle().encode(
    x="Objective Kills",
    y="isSupport"
)
clf = LogisticRegression()
clf.fit(df_ow[["Eliminations"]],df_ow["isSupport"])
df_ow["pred_log"] = clf.predict(df_ow[["Eliminations"]])
c7 = alt.Chart(df_ow).mark_circle(color= "red").encode(
    x = "Eliminations",
    y = "pred_log"
)
arr = clf.predict_proba(df_ow[["Eliminations"]])
df_ow["pred_proba1"] = arr[:,1]
c8 = alt.Chart(df_ow).mark_circle(color= "red").encode(
    x = "Eliminations",
    y = "pred_proba1"
)
clf.fit(df_ow[["Objective Kills"]],df_ow["isSupport"])
df_ow["pred_log"] = clf.predict(df_ow[["Objective Kills"]])
c9 = alt.Chart(df_ow).mark_circle(color= "red").encode(
    x = "Objective Kills",
    y = "pred_log"
)
arr = clf.predict_proba(df_ow[["Objective Kills"]])
df_ow["pred_proba2"] = arr[:,1]
c10 = alt.Chart(df_ow).mark_circle(color= "red").encode(
    x = "Objective Kills",
    y = "pred_proba2"
)
c5|c6

These charts show which hero/rank rows are supports, plotted against their eliminations (c5) and their objective kills (c6).

c7|c5+c8

The left chart (c7) shows the model’s 0/1 predictions of whether a hero is a support based on eliminations, and the right chart (c5+c8) overlays the predicted-probability curve on the actual labels from c5. The curve has the usual logistic shape and shows that the heroes most likely to be supports are those with relatively low elimination counts, roughly 8–14 eliminations per 10 minutes.

c9|c6+c10

The left chart (c9) shows the 0/1 predictions based on objective kills, and the right chart (c6+c10) overlays the predicted-probability curve on the labels from c6. The curve has the same general shape as the one for Eliminations, but it is stretched over a wider range, which suggests the prediction based on Objective Kills is a bit less sharp than the one based on Eliminations.

Train Test Split#

Now let us see if we can classify whether a hero is a support based on Eliminations and Objective Kills using a train/test split. We can use “isSupport_bool”, since this boolean column tells us whether a hero is a support.

cols = ["Eliminations", "Objective Kills"]
X_train, X_test, y_train, y_test = train_test_split(df_ow[cols], df_ow["isSupport_bool"], test_size=0.2, random_state=0)
clf1=LogisticRegression()
clf1.fit(X_train, y_train) 
(clf1.predict(X_test) == y_test).sum()
50

The model’s prediction was correct for 50 of the 51 test rows.

(clf1.predict(X_test) == y_test).sum()/len(X_test)
0.9803921568627451

The proportion of the test set for which we made the correct prediction (about 98%).
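
To judge whether 98% is actually good, it helps to compare it against the naive baseline of always guessing “not a support” (a quick sketch; since only about 22% of rows are supports, this baseline is already fairly high):

# Baseline: accuracy of always predicting "not a support" on the test set.
# The logistic model's ~98% accuracy should be compared against this number.
(~y_test).mean()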

clf1.coef_
Eliminations_coef,ObjectiveKills_coef=clf1.coef_[0]
ObjectiveKills_coef
0.8360825461137451

Specifying the index to retrieve the correct coefficients.

sigmoid = lambda x: 1/(1+np.exp(-x))
Eliminations = 14
ObjectiveKills = 4
sigmoid(Eliminations_coef*Eliminations+ObjectiveKills_coef*ObjectiveKills+clf1.intercept_)
array([0.19970528])

The model’s prediction when a hero has 14 eliminations and 4 objective kills: roughly a 20% probability that this is a support hero.
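
The same probability can also be obtained directly from the fitted classifier, which is a useful cross-check of the hand-computed sigmoid (a sketch; new_hero is just a hypothetical one-row input):

# Cross-check of the manual sigmoid computation: predict_proba on the same
# hypothetical hero (14 eliminations, 4 objective kills) should give
# roughly the same value as above.
new_hero = pd.DataFrame([[14, 4]], columns=cols)
clf1.predict_proba(new_hero)[:, 1]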

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
loss_train=log_loss(y_train, clf.predict_proba(X_train))
loss_test=log_loss(y_test,clf.predict_proba(X_test))
loss_train
0.07970796733173745

Shows the measure of uncertainty on the training data. Since this value is quite low, the classifier assigns high probability to the correct class on the training set, which is good.

loss_test
0.11020020342756341

Shows the measure of uncertainty on the test data. It is slightly higher than the training loss, which is expected since the model was fit on the training data, but it is still low, so the classifier generalizes well to the test set.
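
For reference, log_loss is just the average negative log-probability assigned to the true class, so the test value above can be reproduced by hand (a minimal sketch, assuming the binary 0/1 encoding of isSupport_bool):

# Log loss written out by hand for the binary case: the average of
# -log(probability assigned to the true class). Probabilities are clipped
# away from 0 and 1 so the logarithm stays finite, mirroring sklearn.
p = np.clip(clf.predict_proba(X_test)[:, 1], 1e-15, 1 - 1e-15)
y_true = y_test.astype(int).to_numpy()
manual_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
manual_loss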

KNeighborsRegressor#

Now let us look into a different aspect of the game: how many times a hero is eliminated during the course of a match. We will use a KNeighborsRegressor to see whether there is a relationship between Eliminations and Objective Kills on the one hand and Deaths on the other. This will also let us check whether there is any overfitting or underfitting in the data.

X_train1, X_test1, y_train1, y_test1 = train_test_split(df_ow[cols], df_ow["Deaths"], train_size=0.25)
reg = KNeighborsRegressor(n_neighbors=10)
reg.fit(X_train, y_train)
KNeighborsRegressor(n_neighbors=10)
reg.predict(X_test)
array([0. , 1. , 1. , 0. , 0. , 0. , 0.1, 0.1, 0.1, 1. , 0. , 0. , 0. ,
       0. , 0. , 0.4, 0.1, 1. , 0. , 0. , 0. , 0. , 0.3, 1. , 0. , 0. ,
       1. , 0. , 0.8, 0.4, 0. , 0.1, 0. , 0. , 1. , 0.2, 0. , 1. , 0. ,
       0.2, 0. , 1. , 0.1, 0.2, 0. , 0. , 0.2, 0. , 0.1, 0. , 0.9])

The regressor’s predictions on the test data.

X_test.shape
(51, 2)
mean_absolute_error(reg.predict(X_test), y_test)
0.07647058823529412
mean_absolute_error(reg.predict(X_train), y_train)
0.06766169154228856
reg1=LinearRegression()
reg1.fit(X_train1,y_train1)
X_train1["Pred"]=reg1.predict(X_train1)
X_test1["Pred"]=reg1.predict(X_test1)

Let’s plot the “Pred” line, the predicted deaths, against each of the two predictor columns.

c10 = alt.Chart(X_test1).mark_line().encode(
    x="Eliminations",
    y="Pred"
)
c11 =alt.Chart(X_test1).mark_line(color="red").encode(
    x="Objective Kills",
    y="Pred"
)
c10 + c11
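
Before interpreting the chart, we can also peek at the raw coefficients of reg1 for the two predictors (a sketch only; the features are on different scales, so this is a loose guide rather than a formal comparison):

# Rough check of which predictor drives the fitted deaths model more.
# Note: Eliminations and Objective Kills are on different scales, so the
# raw coefficients are only a loose guide to relative importance.
dict(zip(cols, reg1.coef_))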

From the graphs we can see that the two lines are close, but deaths appear to be influenced more by objective kills, since the predicted line increases as objective kills increase. Next we will use a function to retrieve the mean absolute errors of the train and test data and build a test error curve to check for overfitting or underfitting.

def get_scores(k):
    reg = KNeighborsRegressor(n_neighbors=k)
    reg.fit(X_train, y_train)
    train_error = mean_absolute_error(reg.predict(X_train), y_train)
    test_error = mean_absolute_error(reg.predict(X_test), y_test)
    return (train_error, test_error)
get_scores(10)
(0.06766169154228856, 0.07647058823529412)
df_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})

When we plot a test error curve, we want the model’s flexibility (and hence its variance) to increase from left to right. Since higher values of K correspond to lower flexibility, we add a new column to the dataframe containing the reciprocals of the K values and use that as the x-axis.

for i in df_scores.index:
    df_scores.loc[i,["train_error","test_error"]] = get_scores(df_scores.loc[i,"k"])
df_scores
k train_error test_error
0 1 0.000000 0.078431
1 2 0.034826 0.068627
2 3 0.046434 0.091503
3 4 0.053483 0.083333
4 5 0.054726 0.078431
... ... ... ...
144 145 0.216435 0.222718
145 146 0.216656 0.223476
146 147 0.216435 0.223556
147 148 0.216754 0.223768
148 149 0.217069 0.224240

149 rows × 3 columns

df_scores["kinv"] = 1/df_scores.k
ctrain = alt.Chart(df_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
)
ctest = alt.Chart(df_scores).mark_line(color="orange").encode(
    x = "kinv",
    y = "test_error"
)
ctrain+ctest

The blue curve is the training error and the orange curve is the test error. From the graph, we observe that underfitting occurs for very high values of K and overfitting for small values of K.
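
To make the sweet spot concrete, one quick sketch is to pull out the value of K that minimizes the test error (the exact value depends on the particular train/test split):

# The K with the smallest test error sits between the overfitting (small K)
# and underfitting (large K) regimes.
best_k = df_scores.loc[df_scores["test_error"].idxmin(), "k"]
best_k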

Summary#


To sum everything up, we found a positive relationship between Eliminations and Objective Kills. We also computed predicted probabilities for whether a hero belongs to the “Support” role, which makes up about 22% of the rows in the dataset. Lastly, we saw that the model underfits for high values of K and overfits for small values of K.

References#


  • Dataset source: Kaggle


