Pokemon Types with Machine Learning

Author: Jia Bao Zhen

Course Project, UC Irvine, Math 10, W22

Introduction

In this project I explore whether the stats of a Pokemon (the total and the six individual stats) can predict its type. I also test whether the stats can predict a Pokemon's secondary type, where one exists, since not all Pokemon have two types. Finally, I test whether the total stat alone can predict whether a Pokemon has dual typing.

Main portion of the project

import pandas as pd
import altair as alt
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
pokemon = pd.read_csv('Pokemon.csv')
pokemon['Dual Type'] = ~pokemon['Type 2'].isna()
pokemon.head()
   #                   Name  Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary  Dual Type
0  1              Bulbasaur   Grass  Poison    318  45      49       49       65       65     45           1      False       True
1  2                Ivysaur   Grass  Poison    405  60      62       63       80       80     60           1      False       True
2  3               Venusaur   Grass  Poison    525  80      82       83      100      100     80           1      False       True
3  3  VenusaurMega Venusaur   Grass  Poison    625  80     100      123      122      120     80           1      False       True
4  4             Charmander    Fire     NaN    309  39      52       43       60       50     65           1      False      False

Overview of the dataset: some Pokemon have NA in the Type 2 column because not every Pokemon has two types. I do not drop those Pokemon here. Instead, I added a boolean column called Dual Type that is True if the Pokemon has a Type 2 and False otherwise.
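
As a small illustration of how the Dual Type flag behaves, the sketch below applies the same expression to a tiny hand-made frame. The three rows are made up for the example and are not taken from Pokemon.csv. It also shows that notna() gives the same result as ~isna().

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; only the 'Type 2' column matters here.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Type 2': ['Poison', np.nan, 'Flying'],
})

# Same expression as in the notebook: True where a second type exists.
toy['Dual Type'] = ~toy['Type 2'].isna()

# Equivalent spelling without the negation.
same_flag = toy['Type 2'].notna()
print(toy['Dual Type'].tolist())
```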

choice = alt.selection_multi(fields=['Type 1'], bind='legend')

hist1 = alt.Chart(pokemon).mark_bar(size=10).encode(
    x = 'Total',
    y = 'count()',
    color = 'Type 1',
    opacity = alt.condition(choice, alt.value(1), alt.value(0.2))
).add_selection(
    choice
).properties(
    title='Pokemon Type and Total Stat Distribution'
)
hist2 = alt.Chart(pokemon).mark_bar(size=10).encode(
    x = 'Total',
    y = 'count()',
    color = 'Type 1'
).transform_filter(choice).properties(
    title='Pokemon Type and Total Stat Distribution'
)
hist1 | hist2

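Create two interactive histograms and concatenate them side by side to show the distribution of total stats by Pokemon type. Clicking a type in the legend highlights it in the left chart and filters the right chart to that type.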

type1 = alt.Chart(pokemon).mark_bar(size=10).encode(
    x = alt.X('Type 1', sort='y'),
    y = 'count()',
    color = alt.Color('Type 1', legend=None)
).properties(
    title='Number of Pokemon by Type 1'
)
type1

A bar chart of the number of Pokemon by first type, sorted from the lowest count to the highest. It shows that Water is the most common primary type.

data = pokemon.copy()
data.dropna(inplace=True)
poke_type2 = pd.DataFrame(data)
poke_type2.reset_index(inplace=True)
poke_type2.head()
   index  #                   Name  Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary  Dual Type
0      0  1              Bulbasaur   Grass  Poison    318  45      49       49       65       65     45           1      False       True
1      1  2                Ivysaur   Grass  Poison    405  60      62       63       80       80     60           1      False       True
2      2  3               Venusaur   Grass  Poison    525  80      82       83      100      100     80           1      False       True
3      3  3  VenusaurMega Venusaur   Grass  Poison    625  80     100      123      122      120     80           1      False       True
4      6  6              Charizard    Fire  Flying    534  78      84       78      109       85    100           1      False       True
print(f'The number of Pokemon in this dataset with a second type is {poke_type2.shape[0]}')
The number of Pokemon in this dataset with a second type is 414

Copy the data, drop the rows with NA values (the Pokemon without a second type), and store the result in a new data frame.
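
One thing worth noting is that dropna() with no arguments removes a row if any column is missing. If Type 2 is the only column with missing values, the result is the same, but restricting the drop to that column makes the intent explicit. A minimal sketch on made-up rows (the frame below is an assumption for the example, not the real data):

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Type 1': ['Grass', 'Fire', 'Fire', 'Water'],
    'Type 2': ['Poison', np.nan, 'Flying', np.nan],
})

# Keep only rows that have a second type; drop=True discards the old index
# instead of storing it in a new 'index' column.
with_type2 = toy.dropna(subset=['Type 2']).reset_index(drop=True)
print(len(with_type2))
```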

type2 = alt.Chart(poke_type2).mark_bar(size=10).encode(
    x = alt.X('Type 2', sort='y'),
    y = 'count()',
    color = alt.Color('Type 2', legend=None)
).properties(
    title='Number of Pokemon by Type 2'
)
type2

A bar chart of the number of Pokemon by second type, sorted from the lowest count to the highest. It covers fewer Pokemon than the previous chart because not all Pokemon have a second type. It shows that Flying is the most common secondary type.

corr_data = (pokemon.drop(columns=['#', 'Name', 'Type 1', 'Type 2', 'Dual Type', 'Legendary', 'Generation'])
    .corr().stack()
    .reset_index()
    .rename(columns={0: 'Correlation', 'level_0' : 'Var1', 'level_1' : 'Var2'})
    )

corr_data['Correlation'] = corr_data['Correlation'].round(2)

corr_data.head()
Var1 Var2 Correlation
0 Total Total 1.00
1 Total HP 0.62
2 Total Attack 0.74
3 Total Defense 0.61
4 Total Sp. Atk 0.75

Calculate the pairwise correlations between the stat columns with '.corr()', then use '.stack()' and '.reset_index()' to turn the correlation matrix into a long table that Altair can plot.
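
To make the reshaping step concrete, here is a minimal sketch on a two-column toy frame (the numbers are invented for the example). It shows the wide correlation matrix turning into one row per pair of variables.

```python
import pandas as pd

# Toy stats invented for illustration.
toy = pd.DataFrame({
    'HP': [45, 60, 80, 39],
    'Attack': [49, 62, 82, 52],
})

wide = toy.corr()        # 2 x 2 matrix, one row and one column per stat
long = (wide.stack()     # Series indexed by (row stat, column stat)
            .reset_index()
            .rename(columns={0: 'Correlation', 'level_0': 'Var1', 'level_1': 'Var2'}))
print(long.shape)
```

Each row of the long table holds one cell of the matrix, which is the shape Altair expects for a heatmap with Var1 on one axis and Var2 on the other.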

base = alt.Chart(corr_data).encode(
    x = 'Var1:O',
    y = 'Var2:O'
)

text = base.mark_text().encode(
    text = 'Correlation',
    color = alt.condition(
        alt.datum.Correlation > 0.5,  # field name must match the 'Correlation' column exactly
        alt.value('white'),
        alt.value('black')
    )
)

corr_plot = base.mark_rect().encode(
    color = 'Correlation:Q'
).properties(
    title='Correlation by Pokemon Stats',
    width=350,
    height=350
)

corr_plot + text

Create a correlation heatmap to represent the correlation between different stats and total stats.

I tried to make the heatmap interactive by adapting the code here, but the attempt was unsuccessful.

poke_stats = ['Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
X = pokemon[poke_stats]
y = pokemon['Type 1']
scaler = StandardScaler()
scaler.fit(X)
StandardScaler()

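Fit a StandardScaler on the numeric columns (Total, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) so they can be put on a common scale. Note that the scaler is only fitted here and its transform is never applied, so the model below is trained on the unscaled values.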
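
Fitting a StandardScaler only learns the column means and standard deviations; the data changes only when transform is called. The sketch below, on made-up numbers, shows the two steps separately. The array and variable names are assumptions for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix invented for illustration (two columns, four rows).
X_toy = np.array([[318.0, 45.0], [405.0, 60.0], [525.0, 80.0], [625.0, 80.0]])

scaler = StandardScaler()
scaler.fit(X_toy)                    # learns mean_ and scale_; X_toy is unchanged
X_scaled = scaler.transform(X_toy)   # returns a new, standardized array

print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

For a distance-based model such as K-nearest neighbors, the features passed to fit and predict would need to be the transformed values for the scaling to have any effect.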

X_train, X_test, y_train, y_test = train_test_split(X,y)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
KNeighborsClassifier()

Split the data into training and test sets and fit a K-nearest neighbors classifier on the training data.

pokemon['pred_type1'] = pd.Series(clf.predict(X_test))

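Predict the Pokemon types for the test data. Assigning the predictions directly raised a length error because the test set is smaller than the full data frame, so I wrapped them in a pandas Series. A Series built this way gets a fresh index starting at 0, so the predictions are stored in the first rows of the data frame rather than in the rows of the test Pokemon.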

pokemon['pred_type1'] = pokemon['pred_type1'].ffill()

Because the prediction Series is shorter than the data frame, the remaining rows have no prediction and are null. I filled them with a forward fill, which copies the last available prediction into every null row that follows it, so that every row has a value for graphing.
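
A possible alternative, sketched here on made-up data, is to give the prediction Series the index of the test rows so that each prediction lands on the Pokemon it belongs to, leaving the training rows null. The toy frame, the names clf_toy, X_tr, X_te, and the choice of n_neighbors=3 are assumptions for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy data invented for illustration.
toy = pd.DataFrame({
    'Total': [300, 310, 320, 500, 510, 520, 305, 515],
    'Type 1': ['Bug', 'Bug', 'Bug', 'Dragon', 'Dragon', 'Dragon', 'Bug', 'Dragon'],
})
X_toy, y_toy = toy[['Total']], toy['Type 1']
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

clf_toy = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)

# index=X_te.index keeps each prediction attached to its own row.
toy['pred_type1'] = pd.Series(clf_toy.predict(X_te), index=X_te.index)

# Only the test rows carry a prediction; compare actual and predicted there.
test_rows = toy.loc[X_te.index]
print(test_rows)
```

Comparing actual and predicted types on test_rows alone avoids the need to fill null values before graphing.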

log_loss(pokemon['Type 1'], clf.predict_proba(X))
5.289917201155853

The log loss, computed here on the full data set (training rows included), is quite high, which suggests that the model is inadequate for predicting a Pokemon's type.
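
For a sense of scale, a model that assigns equal probability to each of K classes has a log loss of ln(K). The sketch below computes that baseline. K = 18 is an assumption about the number of distinct Type 1 values; in the notebook it could be checked with pokemon['Type 1'].nunique().

```python
import math

K = 18  # assumed number of distinct 'Type 1' classes; verify with nunique()

# Uniform guess: every class gets probability 1/K, so the loss is -ln(1/K).
uniform_log_loss = -math.log(1 / K)
print(round(uniform_log_loss, 2))  # 2.89
```

Under that assumption, the reported value is well above the uniform baseline. One likely contributor is that a K-nearest neighbors classifier with few neighbors often assigns probability 0 to the true class, and log loss penalizes such confident misses heavily.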

pred_type1_graph = alt.Chart(pokemon).mark_circle().encode(
    x = alt.X('Type 1', title = 'Actual Type 1'),
    y = alt.Y('pred_type1', title = 'Predicted Type 1')
).properties(
    title='Predicted Pokemon Types by Pokemon Stats'
)

pred_type1_graph

From this model, we can see that it is hard to predict the first type of a Pokemon from its stats alone, because Pokemon with similar stats can have different types. Since this is a classification problem, I used a scatter plot of actual type against predicted type to show which types the model predicts. For example, reading up from Bug on the x-axis (the actual Type 1), Bug-type Pokemon were predicted to be Water, Rock, Normal, Grass, Ghost, Fire, Electric, and Dark in addition to their true type of Bug.
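
The scatter plot shows which combinations of actual and predicted type occur, but not how often. A count table is one way to add that information. The sketch below uses pd.crosstab on made-up labels; the two Series are assumptions for the example, not the model's output.

```python
import pandas as pd

# Toy labels invented for illustration.
actual = pd.Series(['Bug', 'Bug', 'Water', 'Water', 'Fire'], name='Actual')
predicted = pd.Series(['Bug', 'Water', 'Water', 'Water', 'Bug'], name='Predicted')

# Rows are actual types, columns are predicted types, cells are counts.
table = pd.crosstab(actual, predicted)
print(table)
```

Cells on the diagonal count correct predictions, and everything off the diagonal is a misclassification.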

X2 = poke_type2[poke_stats]
y2 = poke_type2['Type 2']
scaler = StandardScaler()
scaler.fit(X2)
StandardScaler()

Fit a StandardScaler on the numeric columns (Total, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) of the Pokemon that have a second type. As with Type 1, the scaler is fitted but never applied, so the model below is trained on the unscaled values.

X2_train, X2_test, y2_train, y2_test = train_test_split(X2,y2)
clf2 = KNeighborsClassifier()
clf2.fit(X2_train, y2_train)
KNeighborsClassifier()

Split the Type 2 data into training and test sets and fit a second K-nearest neighbors classifier on the training data.

poke_type2['pred_type2'] = pd.Series(clf2.predict(X2_test))

Predict the Type 2 of the Pokemon that have one. As with Type 1, direct assignment produced a length error, so I wrapped the predictions in a pandas Series.

poke_type2['pred_type2'] = poke_type2['pred_type2'].ffill()

Again, forward-fill the null values so that every row has a predicted Type 2 for the sake of graphing.

log_loss(poke_type2['Type 2'], clf2.predict_proba(X2))
6.141105960309548

Not surprisingly, the log loss for predicting Type 2 is also quite high, suggesting that this model is not suitable for predicting Pokemon typing.

pred_type2_graph = alt.Chart(poke_type2).mark_circle().encode(
    x = alt.X('Type 2', title = 'Actual Type 2'),
    y = alt.Y('pred_type2', title = 'Predicted Type 2')
).properties(
    title='Predicted Pokemon Types by Pokemon Stats'
)

pred_type2_graph

As with the previous model, it is difficult to use a Pokemon's stats to predict its secondary type. The model seems to perform worse here: for Pokemon whose actual Type 2 is Bug, the only predicted Type 2 is Flying. If we had not filled the null values after predicting on the test data, there would have been no predicted Type 2 for those Pokemon at all.

X3 = pokemon[['Total']]
y3 = pokemon['Dual Type']
X3_train, X3_test, y3_train, y3_test = train_test_split(X3,y3)
clf3 = KNeighborsClassifier()
clf3.fit(X3_train, y3_train)
KNeighborsClassifier()
pokemon['pred_dual'] = pd.Series(clf3.predict(X3_test))
log_loss(pokemon['Dual Type'], clf3.predict_proba(X3))
1.196846442141802
pred_dual_type = (pokemon[['Dual Type', 'pred_dual']]).copy()
pdt = pred_dual_type.value_counts(normalize=True)
pdt
Dual Type  pred_dual
False      True         0.295
           False        0.235
True       False        0.235
           True         0.235
dtype: float64

Here I test whether the total stat of a Pokemon can predict whether that Pokemon has two types. Among the rows that have a prediction, 29.5% are single-type Pokemon that the model predicted to be dual-type, and 23.5% are single-type Pokemon that it correctly predicted to be single-type. For dual-type Pokemon, correct and incorrect predictions are equally common at 23.5% each. Overall, the prediction matches the actual value in 47% of rows, which is about what a coin flip would give.
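
The overall accuracy implied by the table can be worked out by adding the cells where the actual and predicted values agree. A minimal sketch, with the proportions copied by hand from the output above into a plain dictionary:

```python
# Proportions copied from the value_counts output above;
# keys are (actual Dual Type, predicted Dual Type).
pdt_values = {
    (False, True): 0.295,
    (False, False): 0.235,
    (True, False): 0.235,
    (True, True): 0.235,
}

# Accuracy is the share of rows where actual and predicted agree.
accuracy = sum(p for (actual, pred), p in pdt_values.items() if actual == pred)
print(round(accuracy, 3))  # 0.47
```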

Summary

In this project, I created graphs to show the distribution of Pokemon by type. Some Pokemon have a second type, so I accommodated that with graphs of the distribution of second types. I then fitted K-nearest neighbors classifiers to try to predict both types from the Pokemon's stats, and the results give little indication that stats can reliably predict a Pokemon's type. I also fitted a model to see whether the total stat can predict if a Pokemon has two types or only one. The odds of a correct prediction were roughly the same as the odds of an incorrect one, so Pokemon stats do not seem to be a good predictor of Pokemon typing.

Something that surprised me came from filling the null values in the prediction columns. For Type 1, the most frequent value after filling was Bug, even though the most common actual Type 1 in the data set is Water. For Type 2, the most frequent value was Flying, which is also the most common actual Type 2.

References

I found the dataset on Kaggle.

I found some graphs that I liked and wanted to recreate from here.

I found code to recreate some of the aforementioned graphs in Altair online here.
