Pokemon¶
Author: Wenqi Zhao
Course Project, UC Irvine, Math 10, S22
Introduction¶
I chose a dataset about Pokemon, the Japanese animated franchise that many of us watched during childhood. The dataset records each Pokemon's features: name, type, HP, attack, defense, special attack, special defense, speed, and total. In this project I apply machine-learning techniques such as linear and logistic regression, using a Pokemon's stats (attack, defense, speed, and so on) as inputs to predict other quantities, such as its generation and its special attack.
Main portion of the project¶
Import and Clean Dataset
import pandas as pd
df = pd.read_csv("Pokemon.csv")  # read the dataset
df = df.dropna(axis=1)  # clean the data: drop columns containing missing values
df
| | # | Name | Type 1 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 795 | 719 | Diancie | Rock | 600 | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True |
| 796 | 719 | DiancieMega Diancie | Rock | 700 | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True |
| 797 | 720 | HoopaHoopa Confined | Psychic | 600 | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True |
| 798 | 720 | HoopaHoopa Unbound | Psychic | 680 | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |
| 799 | 721 | Volcanion | Fire | 600 | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True |

800 rows × 12 columns
df.describe()
| | # | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation |
|---|---|---|---|---|---|---|---|---|---|
| count | 800.000000 | 800.00000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.00000 |
| mean | 362.813750 | 435.10250 | 69.258750 | 79.001250 | 73.842500 | 72.820000 | 71.902500 | 68.277500 | 3.32375 |
| std | 208.343798 | 119.96304 | 25.534669 | 32.457366 | 31.183501 | 32.722294 | 27.828916 | 29.060474 | 1.66129 |
| min | 1.000000 | 180.00000 | 1.000000 | 5.000000 | 5.000000 | 10.000000 | 20.000000 | 5.000000 | 1.00000 |
| 25% | 184.750000 | 330.00000 | 50.000000 | 55.000000 | 50.000000 | 49.750000 | 50.000000 | 45.000000 | 2.00000 |
| 50% | 364.500000 | 450.00000 | 65.000000 | 75.000000 | 70.000000 | 65.000000 | 70.000000 | 65.000000 | 3.00000 |
| 75% | 539.250000 | 515.00000 | 80.000000 | 100.000000 | 90.000000 | 95.000000 | 90.000000 | 90.000000 | 5.00000 |
| max | 721.000000 | 780.00000 | 255.000000 | 190.000000 | 230.000000 | 194.000000 | 230.000000 | 180.000000 | 6.00000 |
df.columns
Index(['#', 'Name', 'Type 1', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk',
'Sp. Def', 'Speed', 'Generation', 'Legendary'],
dtype='object')
import altair as alt
brush = alt.selection_interval()  # interactive brush selection
c1 = alt.Chart(df).mark_point().encode(
    x="Attack",
    y="Defense",
    color="Name"
).add_selection(brush)
c2 = alt.Chart(df).mark_bar().encode(
    x="Name",
    y="Defense"
).transform_filter(brush)  # the bar chart shows only the points selected by the brush
c1 | c2
# reference: Week 6 Monday notebook
from sklearn.linear_model import LinearRegression  # import
reg = LinearRegression()  # instantiate
reg.fit(df[["Attack"]], df["Defense"])  # fit: Attack as input, Defense as output
reg.predict(df[["Attack"]])  # make our prediction
df["Pred"] = reg.predict(df[["Attack"]])  # store the predictions in a new column named "Pred"
df.head()
| | # | Name | Type 1 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | Pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False | 61.197881 |
| 1 | 2 | Ivysaur | Grass | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False | 66.676987 |
| 2 | 3 | Venusaur | Grass | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False | 75.106382 |
| 3 | 3 | VenusaurMega Venusaur | Grass | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False | 82.692838 |
| 4 | 4 | Charmander | Fire | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False | 62.462290 |
# draw the scatter plot together with the linear regression prediction line
c = alt.Chart(df).mark_circle().encode(
    x="Attack",
    y="Defense"
)
c1 = alt.Chart(df).mark_line(color="red").encode(
    x="Attack",
    y="Pred"
)
c + c1
The graph above confirms a positive trend between Attack and Defense across Pokemon.
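As a quick check on that trend, we can inspect the fitted slope directly (a minimal sketch using the reg object fitted above):

# A positive slope confirms that predicted Defense rises with Attack.
print(f"slope: {reg.coef_}")
print(f"intercept: {reg.intercept_}")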
Next, I am interested in predicting a Pokemon's generation from its Attack and Defense.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
Before making any further predictions, we convert the generations to strings instead of integers: the numbers in the Generation column are labels rather than quantities, so they should not carry numerical meaning. We store the result in a renamed column to keep the two versions from getting mixed up.
df["Generation1"]=df["Generation"].apply(str)
cols=["Attack","Defense"] #Make a sub dataframe that only containes the necessary input that we want
df["is_1"]=(df["Generation1"]=="1") #Make the new colnmn that returns "True" if the pokemon's generation is one, otherwise returns "False"
We split the data into training and test sets so that the model can be evaluated on data it was not trained on, which gives a more honest estimate of prediction accuracy.
X_train, X_test, y_train, y_test = train_test_split(df[cols], df["is_1"], test_size=0.2, random_state=0)
from sklearn.linear_model import LogisticRegression  # import
clf = LogisticRegression()  # instantiate
clf.fit(X_train, y_train)  # fit
(clf.predict(X_test) == y_test).sum()  # how many test predictions were correct?
131
# The proportion of the test set that we predicted correctly
(clf.predict(X_test) == y_test).sum()/len(X_test)
0.81875
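Before reading too much into that number, it helps to compare it with the majority-class baseline: most Pokemon are not first-generation, so a model that always predicts False already scores fairly well. A minimal sanity check (not in the original analysis):

# Accuracy of always predicting the majority class ("not Generation 1")
baseline = (y_test == False).mean()
print(f"baseline accuracy: {baseline:.4f}")
print(f"model accuracy: {(clf.predict(X_test) == y_test).mean():.4f}")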
Since clf.coef_ returns a length-2 array (one coefficient per input feature), we unpack it by index to make sure each coefficient is matched with the right feature.
clf.coef_
Attack_coef, Defense_coef = clf.coef_[0]
Defense_coef
-0.004358914889161834
What does our model predict if the Attack is 140 and the Defense is 87?
sigmoid = lambda x: 1/(1+np.exp(-x))  # the logistic (sigmoid) function
Attack = 140
Defense = 87
sigmoid(Attack_coef*Attack + Defense_coef*Defense + clf.intercept_)
array([0.22464458])
Hence, our model predicts that a Pokemon with Attack 140 and Defense 87 has about a 22% chance of being a first-generation Pokemon. We can double-check this with predict_proba.
clf.predict_proba(pd.DataFrame([[Attack, Defense]], columns=cols))  # wrapping the input in a DataFrame keeps the feature names; the first entry is the probability of not being first-generation, and the second matches the sigmoid result above
array([[0.77535542, 0.22464458]])
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
loss_train = log_loss(y_train, clf.predict_proba(X_train))
loss_test = log_loss(y_test, clf.predict_proba(X_test))
loss_train  # the log loss on the training data is about 0.379
0.37936574298643205
loss_test  # the log loss on the test data is about 2.110
2.110437076931544
The log loss on the test data is much larger than on the training data, which is a sign of overfitting. Moreover, the small coefficients from the logistic regression suggest there is little relationship between the inputs (Attack and Defense) and the output (the Pokemon's generation). It is not the case that larger Attack and Defense values correspond to later generations; instead, different generations contain Pokemon with similar Attack and Defense values.
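One way to shrink that train/test gap would be to smooth the classifier by averaging over more neighbors; the sketch below uses a hypothetical n_neighbors=30 rather than a tuned value:

# A larger neighborhood lowers variance (less overfitting) at the cost of some bias.
clf_smooth = KNeighborsClassifier(n_neighbors=30)
clf_smooth.fit(X_train, y_train)
print(log_loss(y_train, clf_smooth.predict_proba(X_train)))
print(log_loss(y_test, clf_smooth.predict_proba(X_test)))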
Since Attack and Defense tell us little about a Pokemon's generation, I am next interested in how they relate to a Pokemon's Sp. Atk (special attack).
df[cols]
| | Attack | Defense |
|---|---|---|
| 0 | 49 | 49 |
| 1 | 62 | 63 |
| 2 | 82 | 83 |
| 3 | 100 | 123 |
| 4 | 52 | 43 |
| ... | ... | ... |
| 795 | 100 | 150 |
| 796 | 160 | 110 |
| 797 | 110 | 60 |
| 798 | 160 | 60 |
| 799 | 110 | 120 |

800 rows × 2 columns
As before, I split the dataset into training and test sets before making predictions (here only 25% of the rows are used for training).
X_train1, X_test1, y_train1, y_test1 = train_test_split(df[cols], df["Sp. Atk"], train_size=0.25)
X_train1
| | Attack | Defense |
|---|---|---|
| 765 | 55 | 52 |
| 587 | 57 | 55 |
| 156 | 85 | 100 |
| 511 | 132 | 105 |
| 389 | 70 | 130 |
| ... | ... | ... |
| 267 | 134 | 110 |
| 509 | 62 | 50 |
| 668 | 30 | 55 |
| 223 | 85 | 200 |
| 406 | 75 | 60 |

200 rows × 2 columns
Fit the linear regression, then compute the MSE and MAE on the training set to evaluate the prediction performance.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
reg1 = LinearRegression()
reg1.fit(X_train1, y_train1)
MSE1 = mean_squared_error(y_train1, reg1.predict(X_train1))
MAE1 = mean_absolute_error(y_train1, reg1.predict(X_train1))
# Store the predictions in separate copies for charting, so that the feature
# DataFrames keep only the two input columns for the models fitted below.
train_chart = X_train1.copy()
test_chart = X_test1.copy()
train_chart["Pred"] = reg1.predict(X_train1)
test_chart["Pred"] = reg1.predict(X_test1)
print(f"the coefficients of reg1 are {reg1.coef_}")
print(f"the intercept of reg1 is {reg1.intercept_}.")
print(f"The mean squared error is {MSE1:.3f}")
print(f"The mean absolute error is {MAE1:.3f}")
the coefficients of reg1 are [0.45471745 0.01710352]
the intercept of reg1 is 33.51731900993459.
The mean squared error is 789.539
The mean absolute error is 22.997
In this case the MSE is much greater than the MAE because squaring penalizes outliers very heavily (and the MSE is measured in squared units).
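A tiny illustration of that point, using hypothetical residuals rather than values from the dataset: a single large error dominates the MSE but only nudges the MAE.

# One outlier among four residuals: the squared term blows up the MSE.
errors = np.array([2, -3, 1, 40])
print("MAE:", np.mean(np.abs(errors)))  # 11.5
print("MSE:", np.mean(errors**2))       # 403.5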
# Chart the prediction on the training data
# draw the scatter plot and the regression-line prediction
c3 = alt.Chart(train_chart).mark_circle().encode(
    x="Attack",
    y="Defense"
)
c4 = alt.Chart(train_chart).mark_line(color="red").encode(
    x="Attack",
    y="Pred"
)
c3 + c4
This chart suggests that the linear prediction is underfitting: the straight line is too rigid to follow the data, so a more flexible model might do better. Overall, the output Sp. Atk has a positive relationship with the inputs: as Attack and Defense increase, the special attack also tends to increase.
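One standard way to add that flexibility would be polynomial features; the sketch below (a hypothetical degree-2 pipeline, not part of the original analysis) shows the idea. The notebook instead addresses the underfitting with a K-nearest-neighbors regressor below.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 terms (Attack^2, Attack*Defense, Defense^2, ...) let the
# linear model fit a curved surface instead of a plane.
poly_reg = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_reg.fit(X_train1, y_train1)
print(mean_absolute_error(y_train1, poly_reg.predict(X_train1)))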
Use KNeighborsRegressor to address the underfitting of the previous prediction.
from sklearn.neighbors import KNeighborsRegressor
reg2 = KNeighborsRegressor(n_neighbors=10, weights="uniform")
reg2.fit(X_train1, y_train1)
MAE2_test = mean_absolute_error(reg2.predict(X_test1), y_test1)
MAE2_train = mean_absolute_error(reg2.predict(X_train1), y_train1)
MSE2_test = mean_squared_error(reg2.predict(X_test1), y_test1)
MSE2_train = mean_squared_error(reg2.predict(X_train1), y_train1)
print(f"the mean squared error for X_train1 is {MSE2_test}")
print(f"the mean squared error X_test1 is {MSE2_train}")
print(f"the mean absolute error for X_train1 is {MAE2_test}")
print(f"the mean absolute error X_test1 is {MAE2_train}")
the mean squared error for X_train1 is 922.33005
the mean squared error X_test1 is 764.9995499999999
the mean absolute error for X_train1 is 24.20116666666667
the mean absolute error X_test1 is 22.3525
The MSEs are similar and the MAEs are also similar, with reg2 performing just slightly better on the training data. That suggests that for this training set we are not badly overfitting when using K=10.
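Since these numbers come from a single random split, a more stable estimate could be obtained with cross-validation; a quick sketch (not part of the original analysis):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated MAE for the K=10 regressor; scikit-learn reports
# negated errors because it always maximizes a score.
scores = cross_val_score(reg2, df[cols], df["Sp. Atk"],
                         scoring="neg_mean_absolute_error", cv=5)
print(-scores.mean())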
X_train1["Pred"]=reg2.predict(X_train1)
X_test1["Pred"]=reg2.predict(X_test1)
# Chart the KNN prediction on the training data
c4 = alt.Chart(train_chart).mark_circle().encode(
    x="Attack",
    y="Defense"
)
c5 = alt.Chart(train_chart).mark_line(color="red").encode(
    x="Attack",
    y="Pred"
)
c4 + c5
# The chart now suggests some overfitting: the prediction sticks so closely to the data we provided that it may fail to capture the overall trend.
c6 = alt.Chart(test_chart).mark_circle().encode(
    x="Attack",
    y="Defense"
)
c7 = alt.Chart(test_chart).mark_line(color="red").encode(
    x="Attack",
    y="Pred"
)
c6 + c7
# A similar pattern appears in the test chart.
Determine which input affects the special attack more: Attack or Defense.
c8 = alt.Chart(test_chart).mark_line().encode(
    x="Attack",
    y="Pred"
)
c9 = alt.Chart(test_chart).mark_line(color="red").encode(
    x="Defense",
    y="Pred"
)
c8 + c9
From the chart above, the Attack value appears to influence the predicted special attack more than the Defense value: the blue line (Attack) rises steadily as the predicted special attack increases, while the red line (Defense) shows no clear pattern. This agrees with the reg1 coefficients found earlier (about 0.45 for Attack versus 0.02 for Defense).
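Because Attack and Defense happen to be on similar scales here, the raw coefficients are roughly comparable; in general one would standardize the features first. A sketch using the StandardScaler imported earlier:

# Standardizing puts both features on a common scale, so the fitted
# coefficients become directly comparable effect sizes.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train1[cols])
reg_scaled = LinearRegression().fit(X_scaled, y_train1)
print(reg_scaled.coef_)  # the first entry (Attack) should dominate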
Test different values of n_neighbors and see how the train and test MAEs vary.
def get_scores(k):
    reg3 = KNeighborsRegressor(n_neighbors=k)
    reg3.fit(X_train1, y_train1)
    train_error = mean_absolute_error(reg3.predict(X_train1), y_train1)
    test_error = mean_absolute_error(reg3.predict(X_test1), y_test1)
    return (train_error, test_error)
get_scores(20)
(22.612250000000003, 24.16025)
df_scores = pd.DataFrame({"k": range(1, 150), "train_error": np.nan, "test_error": np.nan})
for i in df_scores.index:
    df_scores.loc[i, ["train_error", "test_error"]] = get_scores(df_scores.loc[i, "k"])
df_scores["kinv"] = 1/df_scores.k
# When we plot a test error curve, we want higher flexibility (higher variance) on the right.
# Since higher values of K correspond to lower flexibility, we add a column containing the reciprocals of the K values.
ctrain = alt.Chart(df_scores).mark_line().encode(
    x="kinv",
    y="train_error"
)
ctest = alt.Chart(df_scores).mark_line(color="orange").encode(
    x="kinv",
    y="test_error"
)
ctrain + ctest
The blue curve is the training error, while the orange curve is the test error. From the graph, we observe that underfitting occurs for very high values of K (the left of the chart) and overfitting for small values of K (the right).
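To pick a concrete K from this curve, one could simply read off the row with the smallest test error (a quick sketch):

# The K with the lowest test error is a reasonable choice for this split.
best = df_scores.loc[df_scores["test_error"].idxmin()]
print(best)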
Summary¶
For my project, I first found a positive relationship between each Pokemon's Attack and Defense. A logistic regression using Attack and Defense predicted whether a Pokemon is first-generation with about 81.9% accuracy on the test set, although the small coefficients suggest the relationship is weak. There is also a positive relationship between a Pokemon's Attack and Defense and its special attack, with Attack having the larger influence on the predicted Sp. Atk value.
References¶
Dataset source: the Pokemon dataset on Kaggle: https://www.kaggle.com/datasets/abcsds/pokemon
Portions of the code were adapted from the following sources:
- Linear regression: Week 6 Monday lecture notebook: https://christopherdavisuci.github.io/UCI-Math-10-S22/Week6/Week6-Monday.html
- K-nearest neighbors regressor: Winter 2022 Math 10 notebook: https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html and https://scikit-learn.org/stable/modules/neighbors.html
- Logistic regression for classification: https://christopherdavisuci.github.io/UCI-Math-10-S22/Week7/Week7-Friday.html
- Analyzing prediction performance: https://christopherdavisuci.github.io/UCI-Math-10-S22/Week7/Week7-Monday.html
- Analyzing overfitting and underfitting: https://christopherdavisuci.github.io/UCI-Math-10-S22/Week7/Week7-Wednesday.html

Other helpful references:
- K-nearest neighbors regressor: https://scikit-learn.org/stable/modules/neighbors.html
- Overfitting and underfitting: https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/
- MSE and MAE: https://stackoverflow.com/questions/66426928/how-to-calculate-mean-absolute-error-mae-and-mean-signed-error-mse-using-pan