
Author: Wenqi Zhao

Course Project, UC Irvine, Math 10, S22


I chose a dataset about Pokemon, a Japanese animated TV series that most of us watched during our childhood. The dataset contains each Pokemon's features: name, type, attack, defense, total, and so on. In my project, I use machine learning methods such as linear regression and logistic regression, which take a Pokemon's features (such as attack and defense) as input, to predict other properties of the Pokemon, such as its defense, special attack, and generation.

Main portion of the project

Import and Clean Dataset
import pandas as pd
df=pd.read_csv("Pokemon.csv") #Read the dataset
df=df.dropna(axis=1) #Clean the data by dropping every column that has missing values (this removes "Type 2")
# Name Type 1 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire 309 39 52 43 60 50 65 1 False
... ... ... ... ... ... ... ... ... ... ... ... ...
795 719 Diancie Rock 600 50 100 150 100 150 50 6 True
796 719 DiancieMega Diancie Rock 700 50 160 110 160 110 110 6 True
797 720 HoopaHoopa Confined Psychic 600 80 110 60 150 130 70 6 True
798 720 HoopaHoopa Unbound Psychic 680 80 160 60 170 130 80 6 True
799 721 Volcanion Fire 600 80 110 120 130 90 70 6 True

800 rows × 12 columns

A Brief Intro of my Dataset
Name: the name of each pokemon.
Type 1: each pokemon has a type, which determines its weakness/resistance to attacks.
Type 2: some pokemon are dual type and have a second type (this column has missing values, so it was removed by dropna above).
Total: the sum of all stats that come after this, a general guide to how strong a pokemon is.
HP: hit points, or health, which defines how much damage a pokemon can withstand before fainting.
Attack: the base modifier for normal attacks (e.g. Scratch, Punch).
Defense: the base damage resistance against normal attacks.
Sp. Atk: special attack, the base modifier for special attacks (e.g. fire blast, bubble beam).
Sp. Def: the base damage resistance against special attacks.
Speed: determines which pokemon attacks first each round.
# Total HP Attack Defense Sp. Atk Sp. Def Speed Generation
count 800.000000 800.00000 800.000000 800.000000 800.000000 800.000000 800.000000 800.000000 800.00000
mean 362.813750 435.10250 69.258750 79.001250 73.842500 72.820000 71.902500 68.277500 3.32375
std 208.343798 119.96304 25.534669 32.457366 31.183501 32.722294 27.828916 29.060474 1.66129
min 1.000000 180.00000 1.000000 5.000000 5.000000 10.000000 20.000000 5.000000 1.00000
25% 184.750000 330.00000 50.000000 55.000000 50.000000 49.750000 50.000000 45.000000 2.00000
50% 364.500000 450.00000 65.000000 75.000000 70.000000 65.000000 70.000000 65.000000 3.00000
75% 539.250000 515.00000 80.000000 100.000000 90.000000 95.000000 90.000000 90.000000 5.00000
max 721.000000 780.00000 255.000000 190.000000 230.000000 194.000000 230.000000 180.000000 6.00000
Index(['#', 'Name', 'Type 1', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk',
       'Sp. Def', 'Speed', 'Generation', 'Legendary'],
Make charts to see each Pokemon's attack and defense value
import altair as alt
brush = alt.selection_interval()
c1 = alt.Chart(df).mark_point().encode(

c2= alt.Chart(df).mark_bar().encode(
    x = 'Name',

From the graphs above, we can tell that there might be a positive relationship between a pokemon's attack and defense. To examine this relationship, I fit a regression line using linear regression, with Attack as the input and Defense as the output.
#reference: week6 Monday notebook
from sklearn.linear_model import LinearRegression #Import
reg = LinearRegression() #Create the model
reg.fit(df[["Attack"]], df["Defense"]) #Fit with Attack as input and Defense as output
df["Pred"] = reg.predict(df[["Attack"]]) #Store the predictions in a new column named "Pred"
# Name Type 1 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary Pred
0 1 Bulbasaur Grass 318 45 49 49 65 65 45 1 False 61.197881
1 2 Ivysaur Grass 405 60 62 63 80 80 60 1 False 66.676987
2 3 Venusaur Grass 525 80 82 83 100 100 80 1 False 75.106382
3 3 VenusaurMega Venusaur Grass 625 80 100 123 122 120 80 1 False 82.692838
4 4 Charmander Fire 309 39 52 43 60 50 65 1 False 62.462290
Make Charts with our Linear Regression line
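#draw the scatter plot together with the linear regression line
c = alt.Chart(df).mark_circle().encode(x="Attack", y="Defense")
#the encodings here and the line layer below are assumed, since the rest of this cell is missing
c + alt.Chart(df).mark_line(color="red").encode(x="Attack", y="Pred")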

From the graph above, we can confirm that there is a positive trend between Attack and Defense across the Pokemon.
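As a rough illustration of what LinearRegression computes when there is a single input, the slope and intercept of the least-squares line can be worked out by hand. The sketch below uses only the standard library; the function name fit_line and the five sample points are made up for the example and are not rows from Pokemon.csv.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    # slope = (sum of co-deviations of x and y) / (sum of squared deviations of x)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (x_bar, y_bar)
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical (attack, defense) pairs, invented for this example
attack = [40, 60, 80, 100, 120]
defense = [50, 55, 70, 75, 90]

slope, intercept = fit_line(attack, defense)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

A positive slope is what a "positive trend" means for the fitted line: each extra point of the input raises the predicted output by the slope.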

Next, I am interested in predicting a Pokemon's generation from its Attack and Defense.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

Before making any further predictions, we convert the generation values from int to str, because the numbers in the Generation column are labels for a generation rather than quantities with numerical meaning. We store the strings in a new column with a different name, Generation1, so that the two columns are not confused.

df["Generation1"]=df["Generation"].astype(str) #String copy of the generation labels
cols=["Attack","Defense"] #The input columns that we want to use
df["is_1"]=(df["Generation1"]=="1") #Make a new column that is True if the pokemon's generation is 1 and False otherwise

Since the dataset is reasonably large, we can split it into a training set and a test set. Fitting on one part and checking the predictions on the other tells us how well the model performs on data it has not seen.

X_train, X_test, y_train, y_test = train_test_split(df[cols], df["is_1"], test_size=0.2, random_state=0)
from sklearn.linear_model import LogisticRegression #import
clf = LogisticRegression() #create
clf.fit(X_train, y_train) #fit
(clf.predict(X_test) == y_test).sum() # How many test predictions were correct?
# The proportion of the test set that we predicted correctly
(clf.predict(X_test) == y_test).sum()/len(X_test)

Since clf.coef_ has shape (1, 2), a single row holding one coefficient per input, we need to index into it to get the Attack coefficient and the Defense coefficient.


What does our model predict for a pokemon whose attack is 140 and whose defense is 87?

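sigmoid = lambda x: 1/(1+np.exp(-x))
Attack = 140
Defense = 87
Attack_coef, Defense_coef = clf.coef_[0] #one coefficient per input column
sigmoid(Attack_coef*Attack + Defense_coef*Defense + clf.intercept_[0])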

Hence, our model predicts that a pokemon with Attack 140 and Defense 87 has about a 22% chance of being a 1st-generation Pokemon. We can double check this with predict_proba.

clf.predict_proba([[Attack,Defense]]) #The first entry says there is a 77.54% chance that this pokemon is not 1st generation; the second entry (22.46%) matches the result of the sigmoid function.
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but LogisticRegression was fitted with feature names
  "X does not have valid feature names, but"
array([[0.77535542, 0.22464458]])
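The hand computation that predict_proba confirms follows the logistic model: the probability of the True class is the sigmoid of a linear combination of the inputs. Here is a minimal standard-library sketch of that formula. The coefficient and intercept values are placeholders chosen for illustration, not the values fitted in this notebook, and first_gen_probability is a hypothetical helper name.

```python
import math

def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1 / (1 + math.exp(-x))

def first_gen_probability(attack, defense, coef, intercept):
    """P(class True) under a two-feature logistic model."""
    z = coef[0] * attack + coef[1] * defense + intercept
    return sigmoid(z)

# Placeholder parameters, assumed for the example
# (NOT the fitted clf.coef_ / clf.intercept_ from this notebook)
coef = (-0.01, 0.005)
intercept = -0.5

p_true = first_gen_probability(140, 87, coef, intercept)
p_false = 1 - p_true  # the two class probabilities sum to 1
print(round(p_true, 4), round(p_false, 4))
```

With a fitted model, coef would be clf.coef_[0] and intercept would be clf.intercept_[0]. The two numbers then correspond to the second and first entries of the row returned by predict_proba, because the classes are ordered False, True.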
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
loss_train=log_loss(y_train, clf.predict_proba(X_train))
loss_train #the log loss on the training set is about 0.379
loss_test=log_loss(y_test, clf.predict_proba(X_test))
loss_test #the log loss on the test set is about 2.110

Since the log loss on the test data (about 2.110) is much larger than on the training data (about 0.379), there is a sign of overfitting. Also, from the coefficients of the logistic regression, we can tell that there is little relationship between the inputs (attack and defense) and the output (the Pokemon's generation). Contrary to what we might expect, larger attack and defense values do not indicate a later generation. Instead, pokemon from different generations can have similar attack and defense values.
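To see why a k-nearest-neighbors classifier can have a much larger log loss on unseen data, it helps to look at how log loss is computed: each sample contributes minus the log of the probability assigned to its true class, so a confident wrong prediction is very costly. The following sketch uses invented probabilities. The clipping constant eps is an assumption made here to avoid log(0) and is not meant to reproduce scikit-learn's exact behavior.

```python
import math

def binary_log_loss(y_true, p_true, eps=1e-15):
    """Average negative log-probability assigned to the true labels."""
    total = 0.0
    for y, p in zip(y_true, p_true):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -math.log(p) if y else -math.log(1 - p)
    return total / len(y_true)

labels = [True, False, False, True]

# Invented predicted probabilities of the True class
cautious = [0.6, 0.4, 0.4, 0.6]        # right every time, but never very sure
overconfident = [1.0, 0.0, 1.0, 0.0]   # completely sure, and wrong on the last two

loss_cautious = binary_log_loss(labels, cautious)
loss_overconfident = binary_log_loss(labels, overconfident)
print(loss_cautious, loss_overconfident)
```

KNeighborsClassifier with its default of 5 neighbors and uniform weights can only output the probabilities 0, 0.2, 0.4, 0.6, 0.8 and 1, so on test points it can assign probability 0 to the true class, and each such point adds a very large term to the test log loss.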

Since there is little connection between a Pokemon's Attack and Defense and its generation, I am interested in finding out how attack and defense are related to the special attack of the pokemon.

Next, I wonder how Attack and Defense are related to the Sp. Atk (which means special attack) of a pokemon
Attack Defense
0 49 49
1 62 63
2 82 83
3 100 123
4 52 43
... ... ...
795 100 150
796 160 110
797 110 60
798 160 60
799 110 120

800 rows × 2 columns

I again split my dataset into a training set and a test set before making my prediction.

X_train1, X_test1, y_train1, y_test1 = train_test_split(df[cols], df["Sp. Atk"], train_size=0.25)
Attack Defense
765 55 52
587 57 55
156 85 100
511 132 105
389 70 130
... ... ...
267 134 110
509 62 50
668 30 55
223 85 200
406 75 60

200 rows × 2 columns

Make our prediction and find the MSE and MAE to evaluate the prediction performance

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
reg1 = LinearRegression() #create the model
reg1.fit(X_train1, y_train1) #fit on the training set
MSE1 = mean_squared_error(y_train1,reg1.predict(X_train1))
MAE1 = mean_absolute_error(y_train1,reg1.predict(X_train1))
print(f"the coefficients of reg1 are {reg1.coef_}")
print(f"the intercept of reg1 is {reg1.intercept_}.")
print(f'The Mean square error is {MSE1:.3f}')
print(f'The Mean absolute error is {MAE1:.3f}')
the coefficients of reg1 are [0.45471745 0.01710352]
the intercept of reg1 is 33.51731900993459.
The Mean square error is 789.539
The Mean absolute error is 22.997

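In this case the MSE is much greater than the MAE because the MSE squares each error, so it is in squared units and punishes outliers very heavily. Taking the square root gives an RMSE of about 28.1, which is in the same units as the MAE of about 23.0.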
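The effect of a single large error on the two metrics can be seen in a small hand computation. This is only a sketch with invented numbers (four errors of 10 and one error of 100), using plain Python rather than the scikit-learn functions used in the notebook.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Invented values: four small errors of 10 and one large error of 100
y_true = [60, 70, 80, 90, 200]
y_pred = [70, 60, 90, 80, 100]

print("MAE :", mae(y_true, y_pred))
print("MSE :", mse(y_true, y_pred))
print("RMSE:", mse(y_true, y_pred) ** 0.5)
```

Here the single error of 100 accounts for 10000 of the 10400 total squared error, but only 100 of the 140 total absolute error, which is why the MSE reacts so much more strongly to outliers.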

#Make chart of both the prediction
 #draw the linear regression line prediction chart
c3 = alt.Chart(X_train1).mark_circle().encode(

This chart suggests that my prediction is underfitting and needs a more flexible model. Overall, the output (special attack) has a positive relationship with the inputs (attack and defense): as the attack and defense increase, the special attack is also likely to increase.

Use KNeighborsRegressor to address the underfitting of the previous prediction.
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
reg2 = KNeighborsRegressor(n_neighbors=10, weights='uniform')
reg2.fit(X_train1, y_train1) #fit on the training set
MAE2_test=mean_absolute_error(reg2.predict(X_test1), y_test1)
MAE2_train=mean_absolute_error(reg2.predict(X_train1), y_train1)
MSE2_test=mean_squared_error(reg2.predict(X_test1), y_test1)
MSE2_train=mean_squared_error(reg2.predict(X_train1), y_train1)
print(f"the mean squared error for X_test1 is {MSE2_test}")
print(f"the mean squared error for X_train1 is {MSE2_train}")
print(f"the mean absolute error for X_test1 is {MAE2_test}")
print(f"the mean absolute error for X_train1 is {MAE2_train}")
the mean squared error for X_test1 is 922.33005
the mean squared error for X_train1 is 764.9995499999999
the mean absolute error for X_test1 is 24.20116666666667
the mean absolute error for X_train1 is 22.3525

The MSEs are similar and the MAEs are also similar, with reg2 performing just slightly better on the training data. That suggests that, for this training set, we are not overfitting the data when using K=10.
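For reference, the prediction rule behind a k-nearest-neighbors regressor with uniform weights can be sketched in a few lines: find the k training points closest to the query and average their targets. The helper name knn_predict and the six training rows below are invented for illustration, and the sketch ignores details such as tie handling and the efficient neighbor search that scikit-learn uses.

```python
def knn_predict(train_X, train_y, query, k):
    """Average the targets of the k training points closest to query."""
    def sq_dist(p, q):
        # Squared Euclidean distance (gives the same ordering as the distance itself)
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Indices of the training points, sorted from nearest to farthest
    order = sorted(range(len(train_X)), key=lambda i: sq_dist(train_X[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

# Invented (attack, defense) rows and special-attack targets
train_X = [(50, 50), (55, 45), (60, 65), (100, 90), (110, 120), (130, 95)]
train_y = [40, 50, 60, 90, 100, 110]

weak = knn_predict(train_X, train_y, query=(52, 48), k=3)
strong = knn_predict(train_X, train_y, query=(115, 100), k=3)
print(weak, strong)
```

Because each prediction is a local average, the model can follow curved trends that a single straight line cannot, and the choice of k controls how local that average is.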

#Make chart of both the prediction
 #draw the linear regression line prediction chart
c4 = alt.Chart(X_train1).mark_circle().encode(
#The chart still suggests overfitting: the predictions stick too closely to the data we provided, so they fail to predict the future trend.
c6 = alt.Chart(X_test1).mark_circle().encode(
# a similar result appears in the test chart

Determine which value affects the special attack more : Attack or Defense

c8 = alt.Chart(X_test1).mark_line().encode(

From the chart above, we can see that the attack value influences the special attack value more than the defense value does, as the blue line rises as the predicted special attack increases.

Test different values of n_neighbors and see how the MAE on the training set (train_error) and on the test set (test_error) varies.
def get_scores(k):
    reg3 = KNeighborsRegressor(n_neighbors=k)
    reg3.fit(X_train1, y_train1)
    train_error = mean_absolute_error(reg3.predict(X_train1), y_train1)
    test_error = mean_absolute_error(reg3.predict(X_test1), y_test1)
    return (train_error, test_error)
(22.612250000000003, 24.16025)
df_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})
for i in df_scores.index:
    df_scores.loc[i,["train_error","test_error"]] = get_scores(df_scores.loc[i,"k"])
df_scores["kinv"] = 1/df_scores.k 
#When we plot a test error curve, we want higher flexibility (which also means higher variance) on the right. Since higher values of K correspond to lower flexibility, we add a new column to the dataframe containing the reciprocals of the K values.
ctrain = alt.Chart(df_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
)
ctest = alt.Chart(df_scores).mark_line(color="orange").encode(
    x = "kinv",
    y = "test_error"
)
ctrain + ctest #layer the two curves in one chart

The blue curve is the training error, while the orange curve is the test error. From the graph, we observe that underfitting occurs for very high values of K and overfitting occurs for small values of K.
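The two extremes of K can be checked by hand on a toy example. In the sketch below (one-dimensional inputs, invented data, hypothetical helper names), K=1 reproduces every training target exactly, while K equal to the number of training points predicts the overall mean for every input.

```python
def knn_predict(train_X, train_y, query, k):
    """Mean target of the k training points nearest to query (1-D inputs)."""
    order = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - query))
    return sum(train_y[i] for i in order[:k]) / k

def train_mae(train_X, train_y, k):
    """Mean absolute error of the k-NN predictions on the training points themselves."""
    errors = [abs(knn_predict(train_X, train_y, x, k) - y)
              for x, y in zip(train_X, train_y)]
    return sum(errors) / len(errors)

# Invented 1-D data with distinct inputs
xs = [10, 20, 30, 40, 50, 60]
ys = [12, 25, 28, 45, 48, 70]

for k in (1, 3, len(xs)):
    print(f"k={k}, 1/k={1/k:.2f}, training MAE={train_mae(xs, ys, k):.2f}")
```

The zero training error at K=1 relies on the training inputs being distinct. In the Pokemon data several rows can share the same Attack and Defense values, so the training error at K=1 need not be exactly zero there.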


Either summarize what you did, or summarize the results. Maybe 3 sentences.

For my project, I first found that there is a positive relationship between each Pokemon's attack and defense. Then I found that, given a Pokemon's attack and defense, the model predicts whether it belongs to the first generation correctly about 81.9% of the time. There is also a positive relationship between a Pokemon's attack and defense and its special attack, and a Pokemon's attack has more influence on its special attack than its defense does.


  • Were any portions of the code or ideas taken from another source? List those sources here and say how they were used.

Linear regression code: Week 6 Monday's lecture notebook: https://christopherdavisuci.github.io/UCI-Math-10-S22/Week6/Week6-Monday.html
K-nearest neighbors regressor code: Winter 2022 Math 10 notebook: https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html and https://scikit-learn.org/stable/modules/neighbors.html
Logistic regression for classification: https://christopherdavisuci.github.io/UCI-Math-10-S22/Week7/Week7-Friday.html
Analyzing the performance of a prediction: https://christopherdavisuci.github.io/UCI-Math-10-S22/Week7/Week7-Monday.html
Analyzing overfitting or underfitting: https://christopherdavisuci.github.io/UCI-Math-10-S22/Week7/Week7-Wednesday.html

  • List other references that you found helpful.

K-nearest neighbors regressor: https://scikit-learn.org/stable/modules/neighbors.html
Overfitting and underfitting: https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/
MSE and MAE: https://stackoverflow.com/questions/66426928/how-to-calculate-mean-absolute-error-mae-and-mean-signed-error-mse-using-pan

