Pokemon Types with Machine Learning
Author: Jia Bao Zhen
Course Project, UC Irvine, Math 10, W22
Introduction
The aspect of this data set that I want to explore is whether the total stats of a Pokemon can determine its type. I also want to see whether the total stats can predict a Pokemon's secondary type, where applicable, since not all Pokemon have two types. Finally, I want to see whether the total stats can predict whether a Pokemon has dual typing or not.
Main portion of the project
import pandas as pd
import altair as alt
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
pokemon = pd.read_csv('Pokemon.csv')
pokemon['Dual Type'] = ~pokemon['Type 2'].isna()
pokemon.head()
|   | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | Dual Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False | True |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False | True |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False | True |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False | True |
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False | False |
Overview of the dataset: the Type 2 column contains NA values for some Pokemon because not every Pokemon has two types. We will not drop the Pokemon that are missing a second type from this dataset, but I have added a boolean column called Dual Type that is True if the Pokemon has a Type 2 and False otherwise.
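As a small illustration of how the Dual Type flag is derived, the sketch below applies the same ~isna() expression to a toy frame. The frame toy is an assumption made for illustration; its three rows are copied by hand from the tables in this notebook rather than read from Pokemon.csv.

```python
import numpy as np
import pandas as pd

# Toy frame (hand-copied rows, not loaded from Pokemon.csv)
toy = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Charizard'],
    'Type 2': ['Poison', np.nan, 'Flying'],
})

# isna() marks the missing second types; ~ inverts the boolean mask
toy['Dual Type'] = ~toy['Type 2'].isna()

dual_flags = toy['Dual Type'].tolist()
print(dual_flags)
```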
choice = alt.selection_multi(fields=['Type 1'], bind='legend')
hist1 = alt.Chart(pokemon).mark_bar(size=10).encode(
x = 'Total',
y = 'count()',
color = 'Type 1',
opacity = alt.condition(choice, alt.value(1), alt.value(0.2))
).add_selection(
choice
).properties(
title='Pokemon Type and Total Stat Distribution'
)
hist2 = alt.Chart(pokemon).mark_bar(size=10).encode(
x = 'Total',
y = 'count()',
color = 'Type 1'
).transform_filter(choice).properties(
title='Pokemon Type and Total Stat Distribution'
)
hist1 | hist2
Create two interactive histograms and concatenate them to show the distribution of total stats by Pokemon type. Clicking a type in the legend highlights it in the left chart and filters the right chart to that type.
type1 = alt.Chart(pokemon).mark_bar(size=10).encode(
x = alt.X('Type 1', sort='y'),
y = 'count()',
color = alt.Color('Type 1', legend=None)
).properties(
title='Number of Pokemon by Type 1'
)
type1
Graph to show the number of Pokemon by their first type, sorted from lowest to highest count. From it we can see that there are a lot of Pokemon whose main type is Water.
data = pokemon.copy()
data.dropna(inplace=True)
poke_type2 = pd.DataFrame(data)
poke_type2.reset_index(inplace=True)
poke_type2.head()
|   | index | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | Dual Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False | True |
| 1 | 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False | True |
| 2 | 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False | True |
| 3 | 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False | True |
| 4 | 6 | 6 | Charizard | Fire | Flying | 534 | 78 | 84 | 78 | 109 | 85 | 100 | 1 | False | True |
print(f'The number of Pokemon in this dataset with a second type is {poke_type2.shape[0]}')
The number of Pokemon in this dataset with a second type is 414
Copy the data, drop the rows with NA values, and store the Pokemon that have a secondary type in a new data frame.
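Calling dropna() without arguments removes a row if any column is missing, so it works as intended here only as long as Type 2 is the only column with gaps. A minimal sketch that restricts the check to Type 2 and renumbers the rows without keeping the old index as a column; the toy frame and the name with_type2 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hand-copied rows, not loaded from Pokemon.csv)
toy = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Charizard'],
    'Type 2': ['Poison', np.nan, 'Flying'],
    'Total': [318, 309, 534],
})

# Drop only the rows that lack a second type, then renumber from 0;
# drop=True discards the old index instead of adding an 'index' column
with_type2 = toy.dropna(subset=['Type 2']).reset_index(drop=True)
print(with_type2)
```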
type2 = alt.Chart(poke_type2).mark_bar(size=10).encode(
x = alt.X('Type 2', sort='y'),
y = 'count()',
color = alt.Color('Type 2', legend=None)
).properties(
title='Number of Pokemon by Type 2'
)
type2
Graph to show the number of Pokemon by their second type, sorted from lowest to highest count. This visualization does not contain the same number of Pokemon as the previous graph, since not all Pokemon have a second type. However, from it we can see that there are a lot of Pokemon whose secondary type is Flying.
corr_data = (pokemon.drop(columns=['#', 'Name', 'Type 1', 'Type 2', 'Dual Type', 'Legendary', 'Generation'])
.corr().stack()
.reset_index()
.rename(columns={0: 'Correlation', 'level_0' : 'Var1', 'level_1' : 'Var2'})
)
corr_data['Correlation'] = corr_data['Correlation'].round(2)
corr_data.head()
|   | Var1 | Var2 | Correlation |
|---|---|---|---|
| 0 | Total | Total | 1.00 |
| 1 | Total | HP | 0.62 |
| 2 | Total | Attack | 0.74 |
| 3 | Total | Defense | 0.61 |
| 4 | Total | Sp. Atk | 0.75 |
Calculate the pairwise correlations between the stat columns with .corr(), then use .stack() and .reset_index() to reshape the matrix into long format so that it can be graphed in Altair.
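To make the reshaping step concrete, the sketch below runs the same corr, stack, reset_index chain on a two-column toy frame. The values and the name corr_long are assumptions made for illustration; the columns are named directly instead of being renamed from level_0 and level_1.

```python
import pandas as pd

# Toy frame with two stat columns (illustrative values)
toy = pd.DataFrame({
    'HP': [45, 60, 80, 39],
    'Attack': [49, 62, 82, 52],
})

corr_long = (
    toy.corr()         # square correlation matrix
       .stack()        # Series indexed by (row label, column label)
       .reset_index()  # back to a flat frame, one row per pair
)
corr_long.columns = ['Var1', 'Var2', 'Correlation']
print(corr_long)
```

A 2 x 2 matrix becomes four rows, one per (Var1, Var2) pair, which is the shape Altair expects for a heatmap.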
base = alt.Chart(corr_data).encode(
x = 'Var1:O',
y = 'Var2:O'
)
text = base.mark_text().encode(
text = 'Correlation',
color = alt.condition(
alt.datum.Correlation > 0.5,  # datum field names are case-sensitive
alt.value('white'),
alt.value('black')
)
)
corr_plot = base.mark_rect().encode(
color = 'Correlation:Q'
).properties(
title='Correlation by Pokemon Stats',
width=350,
height=350
)
corr_plot + text
Create a correlation heatmap to represent the correlation between different stats and total stats.
I tried to make the heatmap interactive by adapting the code here, but the attempt was unsuccessful.
poke_stats = ['Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
X = pokemon[poke_stats]
y = pokemon['Type 1']
scaler = StandardScaler()
scaler.fit(X)
StandardScaler()
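Fit a StandardScaler on the numeric columns (Total, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) so that they can be put on a common scale. Note that fit only learns each column's mean and standard deviation; the cells that follow pass the unscaled X to the classifier.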
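Fitting a scaler only stores the column means and standard deviations; the data is actually rescaled when transform (or fit_transform) is called and the result is passed on to the model. A minimal sketch, assuming a toy array X_toy whose two columns are the Total and HP values from the first rows of the table above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales (Total, HP)
X_toy = np.array([[318.0, 45.0], [405.0, 60.0], [525.0, 80.0], [625.0, 80.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)  # fit, then return a rescaled copy

# After scaling, every column has mean 0 and standard deviation 1
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means, col_stds)
```

In a train/test workflow the scaler is usually fit on the training rows only and then applied to both splits.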
X_train, X_test, y_train, y_test = train_test_split(X,y)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
KNeighborsClassifier()
Fit the classifier to the training data.
pokemon['pred_type1'] = pd.Series(clf.predict(X_test))
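Predict the Pokemon types for the testing data. Assigning the predictions directly produced a length error, because the test set is shorter than the full data frame, so I wrapped them in a pandas Series. Note that this Series gets a fresh index starting at 0, so the predictions land in the first rows of pokemon rather than in the rows that were sampled for the test set.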
pokemon['pred_type1'] = pokemon['pred_type1'].fillna(method='ffill')
Because the predictions cover only the test set, the rest of the pred_type1 column is null. The fillna call with method='ffill' forward-fills those rows, so each null row receives the last non-null predicted value above it rather than the most frequent predicted type.
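One way to avoid the length mismatch without filling any values is to keep the predictions on the test rows only, reusing the index of the test split so that each prediction sits beside the true label of the same row. A minimal sketch on a small synthetic data set; the frame toy, the names pred and results, and random_state=0 are assumptions made for illustration and are not part of the analysis above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Pokemon data (40 random rows)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Total': rng.integers(200, 700, size=40),
    'Type 1': rng.choice(['Water', 'Bug', 'Fire'], size=40),
})

X_toy, y_toy = toy[['Total']], toy['Type 1']
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

clf_toy = KNeighborsClassifier().fit(X_tr, y_tr)

# Build the Series on X_te.index so each prediction lines up with its own row
pred = pd.Series(clf_toy.predict(X_te), index=X_te.index, name='pred_type1')
results = pd.DataFrame({'actual': y_te, 'predicted': pred})
print(results.head())
```

The results frame can be passed to Altair in place of the full data frame, so that the plot compares each test row's prediction with its own actual type.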
log_loss(pokemon['Type 1'], clf.predict_proba(X))
5.289917201155853
The log loss is considerably high, which suggests that the model is inadequate for predicting the Pokemon type.
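For a sense of scale, a model that assigns equal probability to each of K classes has a log loss of ln(K). The sketch below checks this with log_loss; K = 18 is an assumption about the number of distinct types and could be replaced by pokemon['Type 1'].nunique().

```python
import numpy as np
from sklearn.metrics import log_loss

K = 18  # assumed number of distinct types
labels = np.arange(K)

# One sample per class, each predicted with uniform probability 1/K
y_true = np.arange(K)
uniform = np.full((K, K), 1.0 / K)

baseline = log_loss(y_true, uniform, labels=labels)
print(round(baseline, 3), round(np.log(K), 3))
```

A log loss above this baseline means the predicted probabilities score worse than uniform guessing. With k-nearest neighbors this often happens because the true class receives probability 0 whenever none of the neighbors carry that label.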
pred_type1_graph = alt.Chart(pokemon).mark_circle().encode(
x = alt.X('Type 1', title = 'Actual Type 1'),
y = alt.Y('pred_type1', title = 'Predicted Type 1')
).properties(
title='Predicted Pokemon Types by Pokemon Stats'
)
pred_type1_graph
From this model, we can see that it is hard to predict a Pokemon's first type from its stats alone, because Pokemon with the same stats can have different types. Since this is a classification problem, I chose a scatter plot to show which types the model predicts for each actual type. For example, reading along the x-axis, which is the actual type 1, we can see that Pokemon whose actual type is Bug have been predicted to be Water, Rock, Normal, Grass, Ghost, Fire, Electric, and Dark, in addition to their true type of Bug.
X2 = poke_type2[poke_stats]
y2 = poke_type2['Type 2']
scaler = StandardScaler()
scaler.fit(X2)
StandardScaler()
Fit a StandardScaler again, this time on the numeric columns (Total, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) of the type 2 subset. As before, the fitted scaler is not applied to X2 in the cells that follow.
X2_train, X2_test, y2_train, y2_test = train_test_split(X2,y2)
clf2 = KNeighborsClassifier()
clf2.fit(X2_train, y2_train)
KNeighborsClassifier()
Fit the classifier to the training data.
poke_type2['pred_type2'] = pd.Series(clf2.predict(X2_test))
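Predict the type 2 of the Pokemon that have one. As with type 1 above, assigning the predictions directly produced a length error, so I wrapped them in a pandas Series; the predictions are again placed in the first rows of the data frame rather than in the rows sampled for the test set.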
poke_type2['pred_type2'] = poke_type2['pred_type2'].fillna(method='ffill')
Again fill the null values for the sake of graphing. As before, the forward fill copies the last non-null predicted type 2 into the null rows rather than the most frequent one.
log_loss(poke_type2['Type 2'], clf2.predict_proba(X2))
6.141105960309548
It is not really a surprise that the log loss for predicting type 2 is also quite high, which suggests that this model is not suitable for predicting Pokemon typing.
pred_type2_graph = alt.Chart(poke_type2).mark_circle().encode(
x = alt.X('Type 2', title = 'Actual Type 2'),
y = alt.Y('pred_type2', title = 'Predicted Type 2')
).properties(
title='Predicted Pokemon Types by Pokemon Stats'
)
pred_type2_graph
As with the previous model, it is difficult to use a Pokemon's stats to predict its secondary type. This model seems to have performed worse: for Pokemon whose actual type 2 is Bug, the only predicted type 2 is Flying, and if we had not filled the null values left after predicting on the test data, there would have been no predicted type 2 for those Pokemon at all.
X3 = pokemon[['Total']]
y3 = pokemon['Dual Type']
X3_train, X3_test, y3_train, y3_test = train_test_split(X3,y3)
clf3 = KNeighborsClassifier()
clf3.fit(X3_train, y3_train)
KNeighborsClassifier()
pokemon['pred_dual'] = pd.Series(clf3.predict(X3_test))
log_loss(pokemon['Dual Type'], clf3.predict_proba(X3))
1.196846442141802
pred_dual_type = (pokemon[['Dual Type', 'pred_dual']]).copy()
pdt = pred_dual_type.value_counts(normalize=True)
pdt
Dual Type  pred_dual
False      True         0.295
           False        0.235
True       False        0.235
           True         0.235
dtype: float64
Here we tried to see whether the total stat of a Pokemon can predict whether that Pokemon has two types. In 29.5% of the compared rows the model predicted dual typing for a Pokemon that has only one type, while in 23.5% it correctly predicted a single type. For Pokemon that do have two types, the model was right as often as it was wrong, at 23.5% each. Overall, the prediction matches the actual label in 23.5% + 23.5% = 47% of the rows, which is roughly what a coin flip would achieve.
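The same kind of table, along with the overall accuracy, can be computed with pd.crosstab once the actual and predicted flags are aligned row by row. A minimal sketch; the eight actual and predicted values below are hypothetical and are not results from the model above.

```python
import pandas as pd

# Hypothetical actual and predicted flags for eight test rows
actual = pd.Series([True, True, False, False, True, False, True, False],
                   name='Dual Type')
predicted = pd.Series([True, False, False, True, True, True, False, False],
                      name='pred_dual')

# Share of rows in each (actual, predicted) cell; all cells sum to 1
table = pd.crosstab(actual, predicted, normalize='all')

# Accuracy is the share of rows on the diagonal (actual == predicted)
accuracy = (actual == predicted).mean()
diag_sum = table.to_numpy().trace()
print(table)
print(accuracy)
```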
Summary
In this project, I created graphs to show the distribution of Pokemon by type. Some Pokemon have a second type, so I accommodated that by creating graphs illustrating the distribution of secondary types. I then fitted k-nearest neighbors classifiers to try to predict both types from a Pokemon's stats, but there is not much indication that the stats can properly predict the type of a Pokemon. I also fitted a model to see whether it can predict if a Pokemon has two types or only one, and its predictions were correct about as often as they were incorrect, so Pokemon stats do not seem to be a good variable for predicting Pokemon typing.
Something that surprised me was that, after I filled in the null values in the prediction columns, the most frequent predicted type 1 was Bug even though the most common actual type 1 in the data set is Water. For type 2, on the other hand, the most frequent predicted type was Flying, which is also the most common actual type 2.
References
I found the dataset on Kaggle.
I found some graphs that I liked and wanted to recreate from here
I found code to recreate some of the aforementioned graphs in Altair online here
Created in Deepnote