Pokemon¶
Author: Wenqi Zhao
Course Project, UC Irvine, Math 10, S22
Introduction¶
I chose a dataset about Pokemon, the Japanese animated franchise that many of us watched during childhood. The dataset records each Pokemon's features: name, type, HP, attack, defense, special attack, special defense, speed, and total. In this project I apply machine-learning techniques such as linear and logistic regression, using a Pokemon's stats (attack, defense, speed, and so on) as inputs to predict other quantities, such as its generation and its special attack.
Main portion of the project¶
Import and Clean Dataset
import pandas as pd
df = pd.read_csv("Pokemon.csv")  # read the dataset
df = df.dropna(axis=1)  # clean the data: drop columns containing missing values
df
| | # | Name | Type 1 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 795 | 719 | Diancie | Rock | 600 | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True |
| 796 | 719 | DiancieMega Diancie | Rock | 700 | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True |
| 797 | 720 | HoopaHoopa Confined | Psychic | 600 | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True |
| 798 | 720 | HoopaHoopa Unbound | Psychic | 680 | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |
| 799 | 721 | Volcanion | Fire | 600 | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True |

800 rows × 12 columns
df.describe()
| | # | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation |
|---|---|---|---|---|---|---|---|---|---|
| count | 800.000000 | 800.00000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.00000 |
| mean | 362.813750 | 435.10250 | 69.258750 | 79.001250 | 73.842500 | 72.820000 | 71.902500 | 68.277500 | 3.32375 |
| std | 208.343798 | 119.96304 | 25.534669 | 32.457366 | 31.183501 | 32.722294 | 27.828916 | 29.060474 | 1.66129 |
| min | 1.000000 | 180.00000 | 1.000000 | 5.000000 | 5.000000 | 10.000000 | 20.000000 | 5.000000 | 1.00000 |
| 25% | 184.750000 | 330.00000 | 50.000000 | 55.000000 | 50.000000 | 49.750000 | 50.000000 | 45.000000 | 2.00000 |
| 50% | 364.500000 | 450.00000 | 65.000000 | 75.000000 | 70.000000 | 65.000000 | 70.000000 | 65.000000 | 3.00000 |
| 75% | 539.250000 | 515.00000 | 80.000000 | 100.000000 | 90.000000 | 95.000000 | 90.000000 | 90.000000 | 5.00000 |
| max | 721.000000 | 780.00000 | 255.000000 | 190.000000 | 230.000000 | 194.000000 | 230.000000 | 180.000000 | 6.00000 |
df.columns
Index(['#', 'Name', 'Type 1', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk',
'Sp. Def', 'Speed', 'Generation', 'Legendary'],
dtype='object')
import altair as alt
brush = alt.selection_interval()  # interactive brush selection
c1 = alt.Chart(df).mark_point().encode(
    x="Attack",
    y="Defense",
    color="Name"
).add_selection(brush)
c2 = alt.Chart(df).mark_bar().encode(
    x="Name",
    y="Defense"
).transform_filter(brush)  # the bar chart shows only the points selected by the brush
c1 | c2
# reference: Week 6 Monday notebook
from sklearn.linear_model import LinearRegression  # import
reg = LinearRegression()  # instantiate
reg.fit(df[["Attack"]], df["Defense"])  # fit: Attack as input, Defense as output
reg.predict(df[["Attack"]])  # make our prediction
df["Pred"] = reg.predict(df[["Attack"]])  # store the predictions in a new column named "Pred"
df.head()
| | # | Name | Type 1 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | Pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False | 61.197881 |
| 1 | 2 | Ivysaur | Grass | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False | 66.676987 |
| 2 | 3 | Venusaur | Grass | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False | 75.106382 |
| 3 | 3 | VenusaurMega Venusaur | Grass | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False | 82.692838 |
| 4 | 4 | Charmander | Fire | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False | 62.462290 |
# draw the scatter plot together with the linear regression prediction line
c = alt.Chart(df).mark_circle().encode(
    x="Attack",
    y="Defense"
)
c1 = alt.Chart(df).mark_line(color="red").encode(
    x="Attack",
    y="Pred"
)
c + c1
The graph above confirms a positive trend between Attack and Defense across Pokemon.
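As a quick check on that trend, we can inspect the fitted slope directly (a minimal sketch using the reg object fitted above):

# A positive slope confirms that predicted Defense rises with Attack.
print(f"slope: {reg.coef_}")
print(f"intercept: {reg.intercept_}")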
Next, I am interested in predicting a Pokemon's generation from its Attack and Defense.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
Before making any further predictions, we convert the generations to strings instead of integers: the numbers in the Generation column are labels rather than quantities, so they should not carry numerical meaning. We store the result in a renamed column to keep the two versions from getting mixed up.
df["Generation1"]=df["Generation"].apply(str)
cols=["Attack","Defense"] #Make a sub dataframe that only containes the necessary input that we want
df["is_1"]=(df["Generation1"]=="1") #Make the new colnmn that returns "True" if the pokemon's generation is one, otherwise returns "False"
We split the data into training and test sets so that the model can be evaluated on data it was not trained on, which gives a more honest estimate of prediction accuracy.
X_train, X_test, y_train, y_test = train_test_split(df[cols], df["is_1"], test_size=0.2, random_state=0)
from sklearn.linear_model import LogisticRegression  # import
clf = LogisticRegression()  # instantiate
clf.fit(X_train, y_train)  # fit
(clf.predict(X_test) == y_test).sum()  # how many test predictions were correct?
131
# The proportion of the test set that we predicted correctly
(clf.predict(X_test) == y_test).sum()/len(X_test)
0.81875
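Before reading too much into that number, it helps to compare it with the majority-class baseline: most Pokemon are not first-generation, so a model that always predicts False already scores fairly well. A minimal sanity check (not in the original analysis):

# Accuracy of always predicting the majority class ("not Generation 1")
baseline = (y_test == False).mean()
print(f"baseline accuracy: {baseline:.4f}")
print(f"model accuracy: {(clf.predict(X_test) == y_test).mean():.4f}")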
Since clf.coef_ returns a length-2 array (one coefficient per input feature), we unpack it by index to make sure each coefficient is matched with the right feature.
clf.coef_
Attack_coef, Defense_coef = clf.coef_[0]
Defense_coef
-0.004358914889161834
What does our model predict if the Attack is 140 and the Defense is 87?
sigmoid = lambda x: 1/(1+np.exp(-x))  # the logistic (sigmoid) function
Attack = 140
Defense = 87
sigmoid(Attack_coef*Attack + Defense_coef*Defense + clf.intercept_)
array([0.22464458])
Hence, our model predicts that a Pokemon with Attack 140 and Defense 87 has about a 22% chance of being a first-generation Pokemon. We can double-check this with predict_proba.
clf.predict_proba(pd.DataFrame([[Attack, Defense]], columns=cols))  # wrapping the input in a DataFrame keeps the feature names; the first entry is the probability of not being first-generation, and the second matches the sigmoid result above
array([[0.77535542, 0.22464458]])
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
loss_train = log_loss(y_train, clf.predict_proba(X_train))
loss_test = log_loss(y_test, clf.predict_proba(X_test))
loss_train  # the log loss on the training data is about 0.379
0.37936574298643205
loss_test  # the log loss on the test data is about 2.110
2.110437076931544
The log loss on the test data is much larger than on the training data, which is a sign of overfitting. Moreover, the small coefficients from the logistic regression suggest there is little relationship between the inputs (Attack and Defense) and the output (the Pokemon's generation). It is not the case that larger Attack and Defense values correspond to later generations; instead, different generations contain Pokemon with similar Attack and Defense values.
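One way to shrink that train/test gap would be to smooth the classifier by averaging over more neighbors; the sketch below uses a hypothetical n_neighbors=30 rather than a tuned value:

# A larger neighborhood lowers variance (less overfitting) at the cost of some bias.
clf_smooth = KNeighborsClassifier(n_neighbors=30)
clf_smooth.fit(X_train, y_train)
print(log_loss(y_train, clf_smooth.predict_proba(X_train)))
print(log_loss(y_test, clf_smooth.predict_proba(X_test)))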
Since Attack and Defense tell us little about a Pokemon's generation, I am next interested in how they relate to a Pokemon's Sp. Atk (special attack).
df[cols]
| | Attack | Defense |
|---|---|---|
| 0 | 49 | 49 |
| 1 | 62 | 63 |
| 2 | 82 | 83 |
| 3 | 100 | 123 |
| 4 | 52 | 43 |
| ... | ... | ... |
| 795 | 100 | 150 |
| 796 | 160 | 110 |
| 797 | 110 | 60 |
| 798 | 160 | 60 |
| 799 | 110 | 120 |

800 rows × 2 columns
As before, I split the dataset into training and test sets before making predictions (here only 25% of the rows are used for training).
X_train1, X_test1, y_train1, y_test1 = train_test_split(df[cols], df["Sp. Atk"], train_size=0.25)
X_train1
| | Attack | Defense |
|---|---|---|
| 765 | 55 | 52 |
| 587 | 57 | 55 |
| 156 | 85 | 100 |
| 511 | 132 | 105 |
| 389 | 70 | 130 |
| ... | ... | ... |
| 267 | 134 | 110 |
| 509 | 62 | 50 |
| 668 | 30 | 55 |
| 223 | 85 | 200 |
| 406 | 75 | 60 |

200 rows × 2 columns
Fit the linear regression, then compute the MSE and MAE on the training set to evaluate the prediction performance.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
reg1 = LinearRegression()
reg1.fit(X_train1, y_train1)
MSE1 = mean_squared_error(y_train1, reg1.predict(X_train1))
MAE1 = mean_absolute_error(y_train1, reg1.predict(X_train1))
# Store the predictions in separate copies for charting, so that the feature
# DataFrames keep only the two input columns for the models fitted below.
train_chart = X_train1.copy()
test_chart = X_test1.copy()
train_chart["Pred"] = reg1.predict(X_train1)
test_chart["Pred"] = reg1.predict(X_test1)
print(f"the coefficients of reg1 are {reg1.coef_}")
print(f"the intercept of reg1 is {reg1.intercept_}.")
print(f"The mean squared error is {MSE1:.3f}")
print(f"The mean absolute error is {MAE1:.3f}")
the coefficients of reg1 are [0.45471745 0.01710352]
the intercept of reg1 is 33.51731900993459.
The mean squared error is 789.539
The mean absolute error is 22.997
In this case the MSE is much greater than the MAE because squaring penalizes outliers very heavily (and the MSE is measured in squared units).
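A tiny illustration of that point, using hypothetical residuals rather than values from the dataset: a single large error dominates the MSE but only nudges the MAE.

# One outlier among four residuals: the squared term blows up the MSE.
errors = np.array([2, -3, 1, 40])
print("MAE:", np.mean(np.abs(errors)))  # 11.5
print("MSE:", np.mean(errors**2))       # 403.5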
# Chart the prediction on the training data
# draw the scatter plot and the regression-line prediction
c3 = alt.Chart(train_chart).mark_circle().encode(
    x="Attack",
    y="Defense"
)
c4 = alt.Chart(train_chart).mark_line(color="red").encode(
    x="Attack",
    y="Pred"
)
c3 + c4
This chart suggests that the linear prediction is underfitting: the straight line is too rigid to follow the data, so a more flexible model might do better. Overall, the output Sp. Atk has a positive relationship with the inputs: as Attack and Defense increase, the special attack also tends to increase.
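One standard way to add that flexibility would be polynomial features; the sketch below (a hypothetical degree-2 pipeline, not part of the original analysis) shows the idea. The notebook instead addresses the underfitting with a K-nearest-neighbors regressor below.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 terms (Attack^2, Attack*Defense, Defense^2, ...) let the
# linear model fit a curved surface instead of a plane.
poly_reg = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_reg.fit(X_train1, y_train1)
print(mean_absolute_error(y_train1, poly_reg.predict(X_train1)))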
Use KNeighborsRegressor to address the underfitting of the previous prediction.
from sklearn.neighbors import KNeighborsRegressor
reg2 = KNeighborsRegressor(n_neighbors=10, weights="uniform")
reg2.fit(X_train1, y_train1)
MAE2_test = mean_absolute_error(reg2.predict(X_test1), y_test1)
MAE2_train = mean_absolute_error(reg2.predict(X_train1), y_train1)
MSE2_test = mean_squared_error(reg2.predict(X_test1), y_test1)
MSE2_train = mean_squared_error(reg2.predict(X_train1), y_train1)
print(f"the mean squared error for X_train1 is {MSE2_test}")
print(f"the mean squared error X_test1 is {MSE2_train}")
print(f"the mean absolute error for X_train1 is {MAE2_test}")
print(f"the mean absolute error X_test1 is {MAE2_train}")
the mean squared error for X_train1 is 922.33005
the mean squared error X_test1 is 764.9995499999999
the mean absolute error for X_train1 is 24.20116666666667
the mean absolute error X_test1 is 22.3525
The MSEs are similar and the MAEs are also similar, with reg2 performing just slightly better on the training data. That suggests that for this training set we are not badly overfitting when using K=10.
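Since these numbers come from a single random split, a more stable estimate could be obtained with cross-validation; a quick sketch (not part of the original analysis):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated MAE for the K=10 regressor; scikit-learn reports
# negated errors because it always maximizes a score.
scores = cross_val_score(reg2, df[cols], df["Sp. Atk"],
                         scoring="neg_mean_absolute_error", cv=5)
print(-scores.mean())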
X_train1["Pred"]=reg2.predict(X_train1)
X_test1["Pred"]=reg2.predict(X_test1)
# Chart the KNN prediction on the training data
c4 = alt.Chart(train_chart).mark_circle().encode(
    x="Attack",
    y="Defense"
)
c5 = alt.Chart(train_chart).mark_line(color="red").encode(
    x="Attack",
    y="Pred"
)
c4 + c5
# The chart now suggests some overfitting: the prediction sticks so closely to the data we provided that it may fail to capture the overall trend.
c6 = alt.Chart(test_chart).mark_circle().encode(
    x="Attack",
    y="Defense"
)
c7 = alt.Chart(test_chart).mark_line(color="red").encode(
    x="Attack",
    y="Pred"
)
c6 + c7
# A similar pattern appears in the test chart.
Determine which input affects the special attack more: Attack or Defense.
c8 = alt.Chart(test_chart).mark_line().encode(
    x="Attack",
    y="Pred"
)
c9 = alt.Chart(test_chart).mark_line(color="red").encode(
    x="Defense",
    y="Pred"
)
c8 + c9
From the chart above, the Attack value appears to influence the predicted special attack more than the Defense value: the blue line (Attack) rises steadily as the predicted special attack increases, while the red line (Defense) shows no clear pattern. This agrees with the reg1 coefficients found earlier (about 0.45 for Attack versus 0.02 for Defense).
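Because Attack and Defense happen to be on similar scales here, the raw coefficients are roughly comparable; in general one would standardize the features first. A sketch using the StandardScaler imported earlier:

# Standardizing puts both features on a common scale, so the fitted
# coefficients become directly comparable effect sizes.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train1[cols])
reg_scaled = LinearRegression().fit(X_scaled, y_train1)
print(reg_scaled.coef_)  # the first entry (Attack) should dominate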
Test different values of n_neighbors and see how the train and test MAEs vary.
def get_scores(k):
    reg3 = KNeighborsRegressor(n_neighbors=k)
    reg3.fit(X_train1, y_train1)
    train_error = mean_absolute_error(reg3.predict(X_train1), y_train1)
    test_error = mean_absolute_error(reg3.predict(X_test1), y_test1)
    return (train_error, test_error)
get_scores(20)
(22.612250000000003, 24.16025)
df_scores = pd.DataFrame({"k": range(1, 150), "train_error": np.nan, "test_error": np.nan})
for i in df_scores.index:
    df_scores.loc[i, ["train_error", "test_error"]] = get_scores(df_scores.loc[i, "k"])
df_scores["kinv"] = 1/df_scores.k
# When we plot a test error curve, we want higher flexibility (higher variance) on the right.
# Since higher values of K correspond to lower flexibility, we add a column containing the reciprocals of the K values.
ctrain = alt.Chart(df_scores).mark_line().encode(
    x="kinv",
    y="train_error"
)
ctest = alt.Chart(df_scores).mark_line(color="orange").encode(
    x="kinv",
    y="test_error"
)
ctrain + ctest
The blue curve is the training error, while the orange curve is the test error. From the graph, we observe that underfitting occurs for very high values of K (the left of the chart) and overfitting for small values of K (the right).
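To pick a concrete K from this curve, one could simply read off the row with the smallest test error (a quick sketch):

# The K with the lowest test error is a reasonable choice for this split.
best = df_scores.loc[df_scores["test_error"].idxmin()]
print(best)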
Summary¶
For my project, I first found a positive relationship between each Pokemon's Attack and Defense. A logistic regression using Attack and Defense predicted whether a Pokemon is first-generation with about 81.9% accuracy on the test set, although the small coefficients suggest the relationship is weak. There is also a positive relationship between a Pokemon's Attack and Defense and its special attack, with Attack having the larger influence on the predicted Sp. Atk value.
References¶
Dataset source: the Pokemon dataset on Kaggle: https://www.kaggle.com/datasets/abcsds/pokemon
Portions of the code were adapted from the following sources:
- Linear regression: Week 6 Monday lecture notebook: https://christopherdavisuci.github.io/UCI-Math-10-S22/Week6/Week6-Monday.html
- K-nearest neighbors regressor: Winter 2022 Math 10 notebook: https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html and https://scikit-learn.org/stable/modules/neighbors.html
- Logistic regression for classification: https://christopherdavisuci.github.io/UCI-Math-10-S22/Week7/Week7-Friday.html
- Analyzing prediction performance: https://christopherdavisuci.github.io/UCI-Math-10-S22/Week7/Week7-Monday.html
- Analyzing overfitting and underfitting: https://christopherdavisuci.github.io/UCI-Math-10-S22/Week7/Week7-Wednesday.html

Other helpful references:
- K-nearest neighbors regressor: https://scikit-learn.org/stable/modules/neighbors.html
- Overfitting and underfitting: https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/
- MSE and MAE: https://stackoverflow.com/questions/66426928/how-to-calculate-mean-absolute-error-mae-and-mean-signed-error-mse-using-pan