Dance of COVID-19: A Regression Model of Vaccine Hesitancy
Author: Jiahui Sheng
Course Project, UC Irvine, Math 10, S22
Introduction
To investigate and predict vaccine hesitancy in America, a Kaggle dataset is used that contains variables such as party affiliation and sex. First, in the data preprocessing step, missing values are filled with the 'pad' (forward-fill) method and categorical text is converted to numbers with pandas. Next, KMeans clustering is applied to expand the feature set. Finally, the data is split into a training set and a test set and fed to a multilayer perceptron trained with the L2 (mean squared error) loss, and the results are evaluated to determine whether the model is overfitting.
Main portion of the project
The data preprocessing. The preprocessing with pandas includes missing-value filling, text conversion, and data scaling. Missing values are filled with the 'pad' method, which propagates the most recent valid value forward into each gap. Text conversion maps text to numbers (in this dataset the text is categorical, so it can be mapped directly). The scaling method is min-max normalization.
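A minimal sketch of the min-max normalization step described above (illustrative, assuming the dataframe df has already been loaded and made fully numeric; the notebook below does not show this step explicitly):
numeric_cols = df.select_dtypes("number").columns  # scale only the numeric columns
df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].min()) / (df[numeric_cols].max() - df[numeric_cols].min())  # rescale each column to [0, 1]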
The KMeans clustering. Because this is a regression problem, the data carries no cluster labels. Therefore, the silhouette method is used to determine the optimal k value, and the resulting KMeans labels are appended to the data as an additional feature column.
The regression. A multilayer perceptron is used because of its strong fitting ability. In addition, the dataset is split into a training set and a test set so that overfitting can be detected by comparing the two losses.
Import all packages needed for the analysis
import numpy as np
import pandas as pd
import random
from scipy.cluster import vq
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import torch
import torch.nn as nn
from tqdm import trange
import altair as alt
Read the file and show it
df = pd.read_csv("vh_data15.csv",encoding="utf8")
df
County_Density | Vaccine_Trust_Index | Personal_Responsibility | Trust_Science_Apolitical | Trust_Science_Politicians | Trust_Science_Media | Trust_Science_Community | Trust_National | Trust_State | Trust_Local | ... | Pandemic_Impact_Network | Infected_Personal | Infected_Network | Biden | Trump | Party_ID | Household_Income | Vaccine_Required | Evangelical | Vaccine_Hesitant | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 137.851795 | 0.000000 | 10.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | No | Yes | Republican | 1.0 | 0 | 1.0 | 1 |
1 | 38.751406 | 9.000000 | 5.0 | 8.0 | 4.0 | 6.0 | 9.0 | 8.0 | 2.0 | 7.0 | ... | 5.0 | 0.0 | 0.0 | Yes | No | Democrat | 3.0 | 0 | 0.0 | 0 |
2 | 18.103752 | 8.666667 | 7.0 | 6.0 | 1.0 | 1.0 | 6.0 | 7.0 | 7.0 | 9.0 | ... | 6.0 | 0.0 | 1.0 | No | Yes | Republican | 6.0 | 0 | 0.0 | 0 |
3 | 26.912917 | 4.000000 | 7.0 | 6.0 | 6.0 | 4.0 | 6.0 | 6.0 | 6.0 | 6.0 | ... | 7.0 | 0.0 | 0.0 | Yes | No | Democrat | 4.0 | 0 | 1.0 | 1 |
4 | 1541.026670 | 7.000000 | 6.0 | 6.0 | 1.0 | 2.0 | 6.0 | 6.0 | 2.0 | 7.0 | ... | 2.5 | 0.0 | 1.0 | Yes | Yes | Republican | 3.0 | 0 | 1.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3348 | 470.033815 | 1.666667 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | 4.0 | ... | 7.5 | 0.0 | 0.0 | No | Yes | Independent | 8.0 | 0 | 1.0 | 1 |
3349 | 225.323560 | 9.333333 | 2.0 | 9.0 | 1.0 | 8.0 | 9.0 | 6.0 | 7.0 | 7.0 | ... | 3.5 | 0.0 | 1.0 | Yes | No | Republican | 2.0 | 0 | 0.0 | 0 |
3350 | 108.474026 | 9.000000 | 8.0 | 6.0 | 3.0 | 3.0 | 5.0 | 6.0 | 7.0 | 6.0 | ... | 6.0 | 0.0 | 0.0 | No | Yes | Republican | 2.0 | 1 | 0.0 | 0 |
3351 | 63.385677 | 9.333333 | 5.0 | 5.0 | 4.0 | 6.0 | 9.0 | 8.0 | 8.0 | 7.0 | ... | 3.5 | 0.0 | 1.0 | Yes | No | Democrat | 12.0 | 1 | 0.0 | 0 |
3352 | 599.558111 | 9.000000 | 10.0 | 1.0 | 0.0 | 0.0 | 10.0 | 9.0 | 9.0 | 10.0 | ... | 9.0 | 1.0 | 1.0 | No | Yes | Republican | 5.0 | 0 | 1.0 | 1 |
3353 rows × 42 columns
Data preprocessing: fill NaN values with the 'pad' method, which fills each gap with the preceding valid value
df = df.fillna(method="pad")
df
County_Density | Vaccine_Trust_Index | Personal_Responsibility | Trust_Science_Apolitical | Trust_Science_Politicians | Trust_Science_Media | Trust_Science_Community | Trust_National | Trust_State | Trust_Local | ... | Pandemic_Impact_Network | Infected_Personal | Infected_Network | Biden | Trump | Party_ID | Household_Income | Vaccine_Required | Evangelical | Vaccine_Hesitant | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 137.851795 | 0.000000 | 10.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | No | Yes | Republican | 1.0 | 0 | 1.0 | 1 |
1 | 38.751406 | 9.000000 | 5.0 | 8.0 | 4.0 | 6.0 | 9.0 | 8.0 | 2.0 | 7.0 | ... | 5.0 | 0.0 | 0.0 | Yes | No | Democrat | 3.0 | 0 | 0.0 | 0 |
2 | 18.103752 | 8.666667 | 7.0 | 6.0 | 1.0 | 1.0 | 6.0 | 7.0 | 7.0 | 9.0 | ... | 6.0 | 0.0 | 1.0 | No | Yes | Republican | 6.0 | 0 | 0.0 | 0 |
3 | 26.912917 | 4.000000 | 7.0 | 6.0 | 6.0 | 4.0 | 6.0 | 6.0 | 6.0 | 6.0 | ... | 7.0 | 0.0 | 0.0 | Yes | No | Democrat | 4.0 | 0 | 1.0 | 1 |
4 | 1541.026670 | 7.000000 | 6.0 | 6.0 | 1.0 | 2.0 | 6.0 | 6.0 | 2.0 | 7.0 | ... | 2.5 | 0.0 | 1.0 | Yes | Yes | Republican | 3.0 | 0 | 1.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3348 | 470.033815 | 1.666667 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | 4.0 | ... | 7.5 | 0.0 | 0.0 | No | Yes | Independent | 8.0 | 0 | 1.0 | 1 |
3349 | 225.323560 | 9.333333 | 2.0 | 9.0 | 1.0 | 8.0 | 9.0 | 6.0 | 7.0 | 7.0 | ... | 3.5 | 0.0 | 1.0 | Yes | No | Republican | 2.0 | 0 | 0.0 | 0 |
3350 | 108.474026 | 9.000000 | 8.0 | 6.0 | 3.0 | 3.0 | 5.0 | 6.0 | 7.0 | 6.0 | ... | 6.0 | 0.0 | 0.0 | No | Yes | Republican | 2.0 | 1 | 0.0 | 0 |
3351 | 63.385677 | 9.333333 | 5.0 | 5.0 | 4.0 | 6.0 | 9.0 | 8.0 | 8.0 | 7.0 | ... | 3.5 | 0.0 | 1.0 | Yes | No | Democrat | 12.0 | 1 | 0.0 | 0 |
3352 | 599.558111 | 9.000000 | 10.0 | 1.0 | 0.0 | 0.0 | 10.0 | 9.0 | 9.0 | 10.0 | ... | 9.0 | 1.0 | 1.0 | No | Yes | Republican | 5.0 | 0 | 1.0 | 1 |
3353 rows × 42 columns
Because some columns contain strings, use replace to map the strings to numbers
df = df.replace({"No":0,"Yes":1,"Republican":0,"Democrat":1,"Independent":2,"Libertarian":3,'Other party':4})
df
County_Density | Vaccine_Trust_Index | Personal_Responsibility | Trust_Science_Apolitical | Trust_Science_Politicians | Trust_Science_Media | Trust_Science_Community | Trust_National | Trust_State | Trust_Local | ... | Pandemic_Impact_Network | Infected_Personal | Infected_Network | Biden | Trump | Party_ID | Household_Income | Vaccine_Required | Evangelical | Vaccine_Hesitant | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 137.851795 | 0.000000 | 10.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 1.0 | 0 | 1.0 | 1 |
1 | 38.751406 | 9.000000 | 5.0 | 8.0 | 4.0 | 6.0 | 9.0 | 8.0 | 2.0 | 7.0 | ... | 5.0 | 0.0 | 0.0 | 1 | 0 | 1 | 3.0 | 0 | 0.0 | 0 |
2 | 18.103752 | 8.666667 | 7.0 | 6.0 | 1.0 | 1.0 | 6.0 | 7.0 | 7.0 | 9.0 | ... | 6.0 | 0.0 | 1.0 | 0 | 1 | 0 | 6.0 | 0 | 0.0 | 0 |
3 | 26.912917 | 4.000000 | 7.0 | 6.0 | 6.0 | 4.0 | 6.0 | 6.0 | 6.0 | 6.0 | ... | 7.0 | 0.0 | 0.0 | 1 | 0 | 1 | 4.0 | 0 | 1.0 | 1 |
4 | 1541.026670 | 7.000000 | 6.0 | 6.0 | 1.0 | 2.0 | 6.0 | 6.0 | 2.0 | 7.0 | ... | 2.5 | 0.0 | 1.0 | 1 | 1 | 0 | 3.0 | 0 | 1.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3348 | 470.033815 | 1.666667 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | 4.0 | ... | 7.5 | 0.0 | 0.0 | 0 | 1 | 2 | 8.0 | 0 | 1.0 | 1 |
3349 | 225.323560 | 9.333333 | 2.0 | 9.0 | 1.0 | 8.0 | 9.0 | 6.0 | 7.0 | 7.0 | ... | 3.5 | 0.0 | 1.0 | 1 | 0 | 0 | 2.0 | 0 | 0.0 | 0 |
3350 | 108.474026 | 9.000000 | 8.0 | 6.0 | 3.0 | 3.0 | 5.0 | 6.0 | 7.0 | 6.0 | ... | 6.0 | 0.0 | 0.0 | 0 | 1 | 0 | 2.0 | 1 | 0.0 | 0 |
3351 | 63.385677 | 9.333333 | 5.0 | 5.0 | 4.0 | 6.0 | 9.0 | 8.0 | 8.0 | 7.0 | ... | 3.5 | 0.0 | 1.0 | 1 | 0 | 1 | 12.0 | 1 | 0.0 | 0 |
3352 | 599.558111 | 9.000000 | 10.0 | 1.0 | 0.0 | 0.0 | 10.0 | 9.0 | 9.0 | 10.0 | ... | 9.0 | 1.0 | 1.0 | 0 | 1 | 0 | 5.0 | 0 | 1.0 | 1 |
3353 rows × 42 columns
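A design note on the mapping above: encoding Party_ID as the integers 0 through 4 imposes an artificial ordering on a nominal variable. A hedged alternative sketch (not what this notebook does) is one-hot encoding with pandas:
df = pd.get_dummies(df, columns=["Party_ID"], prefix="Party")  # hypothetical: one indicator column per party instead of ordinal codes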
Separate the features and the labels
labels = df['Vaccine_Hesitant'] # the target column
df = df[df.columns[:-1]] # drop the target so only the 41 feature columns remain
df
labels
0 1
1 0
2 0
3 1
4 0
..
3348 1
3349 0
3350 0
3351 0
3352 1
Name: Vaccine_Hesitant, Length: 3353, dtype: int64
Generate the summary information and descriptive statistics of the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3353 entries, 0 to 3352
Data columns (total 41 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 County_Density 3353 non-null float64
1 Vaccine_Trust_Index 3353 non-null float64
2 Personal_Responsibility 3353 non-null float64
3 Trust_Science_Apolitical 3353 non-null float64
4 Trust_Science_Politicians 3353 non-null float64
5 Trust_Science_Media 3353 non-null float64
6 Trust_Science_Community 3353 non-null float64
7 Trust_National 3353 non-null float64
8 Trust_State 3353 non-null float64
9 Trust_Local 3353 non-null float64
10 Trust_Media 3353 non-null float64
11 Perceived_Risk 3353 non-null float64
12 Perceived_Network_Risk 3353 non-null float64
13 Doctor_Comfort 3353 non-null float64
14 Fear_Needles 3353 non-null float64
15 Condition_Pregnancy 3353 non-null float64
16 Condition_Asthma 3353 non-null float64
17 Condition_Lung 3353 non-null float64
18 Condition_Diabetes 3353 non-null float64
19 Condition_Immune 3353 non-null float64
20 Condition_Obesity 3353 non-null float64
21 Condition_Heart 3353 non-null float64
22 Condition_Organ 3353 non-null float64
23 County_Cases 3353 non-null float64
24 County_Cases2wk 3353 non-null float64
25 Male 3353 non-null int64
26 Race 3353 non-null int64
27 Age 3353 non-null float64
28 PS_Index 3353 non-null float64
29 Natural_Science_Literacy 3353 non-null float64
30 College_Degree 3353 non-null float64
31 Pandemic_Impact 3353 non-null float64
32 Pandemic_Impact_Network 3353 non-null float64
33 Infected_Personal 3353 non-null float64
34 Infected_Network 3353 non-null float64
35 Biden 3353 non-null int64
36 Trump 3353 non-null int64
37 Party_ID 3353 non-null int64
38 Household_Income 3353 non-null float64
39 Vaccine_Required 3353 non-null int64
40 Evangelical 3353 non-null float64
dtypes: float64(35), int64(6)
memory usage: 1.0 MB
df.describe()
County_Density | Vaccine_Trust_Index | Personal_Responsibility | Trust_Science_Apolitical | Trust_Science_Politicians | Trust_Science_Media | Trust_Science_Community | Trust_National | Trust_State | Trust_Local | ... | Pandemic_Impact | Pandemic_Impact_Network | Infected_Personal | Infected_Network | Biden | Trump | Party_ID | Household_Income | Vaccine_Required | Evangelical | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 3353.000000 | 3353.000000 | 3353.000000 | 3353.000000 | 3353.000000 | 3353.000000 | 3353.000000 | 3353.000000 | 3353.000000 | 3353.000000 | ... | 3353.000000 | 3353.000000 | 3353.000000 | 3353.000000 | 3353.000000 | 3353.000000 | 3353.000000 | 3353.000000 | 3353.000000 | 3353.000000 |
mean | 882.153305 | 7.583408 | 7.033701 | 5.454817 | 2.324187 | 3.589323 | 6.494483 | 5.090665 | 5.539815 | 6.025649 | ... | 4.771399 | 4.795407 | 0.050701 | 0.401432 | 0.615866 | 0.358187 | 1.084402 | 6.345959 | 0.211452 | 0.179242 |
std | 2638.189276 | 2.532152 | 2.477415 | 2.850460 | 2.225544 | 2.952058 | 2.540727 | 2.431302 | 2.539364 | 2.251478 | ... | 1.699539 | 1.684785 | 0.219419 | 0.490261 | 0.486462 | 0.479539 | 0.921941 | 3.684398 | 0.408399 | 0.383612 |
min | 0.928098 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
25% | 108.384222 | 6.333333 | 5.000000 | 3.000000 | 0.000000 | 1.000000 | 5.000000 | 4.000000 | 4.000000 | 5.000000 | ... | 4.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 |
50% | 299.588930 | 8.333333 | 7.000000 | 5.000000 | 2.000000 | 4.000000 | 7.000000 | 5.000000 | 6.000000 | 6.000000 | ... | 5.000000 | 5.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 6.000000 | 0.000000 | 0.000000 |
75% | 715.275377 | 9.333333 | 9.000000 | 8.000000 | 4.000000 | 5.000000 | 8.000000 | 7.000000 | 8.000000 | 8.000000 | ... | 5.500000 | 5.500000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 10.000000 | 0.000000 | 0.000000 |
max | 27819.804800 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | ... | 10.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 12.000000 | 1.000000 | 1.000000 |
8 rows × 41 columns
To determine the number of clusters k, following the literature, the silhouette method is applied for k from 2 up to 19
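For each sample $i$, the silhouette coefficient is $s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$, where $a(i)$ is the mean distance from $i$ to the other points in its own cluster and $b(i)$ is the mean distance from $i$ to the points in the nearest other cluster. silhouette_score averages $s(i)$ over all samples, so values close to 1 indicate well-separated clusters.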
K_Silhouette_List = [] # initialize the list of (k, silhouette score) pairs
for k_value in trange(2, 20):
    kmeans = KMeans(k_value, init='random')                # fit KMeans with k_value clusters
    cluster_labels = kmeans.fit_predict(df)                # obtain the cluster labels
    silhouette_avg = silhouette_score(df, cluster_labels)  # compute the average silhouette score
    K_Silhouette_List.append([k_value, silhouette_avg])    # record the pair
100%|██████████| 18/18 [00:56<00:00, 3.11s/it]
Visualize how the silhouette score changes with K
K_Silhouette_List = np.array(K_Silhouette_List).T
K_df = pd.DataFrame({'K':K_Silhouette_List[0],'silhouette':K_Silhouette_List[1]})
K_df
K | silhouette | |
---|---|---|
0 | 2.0 | 0.950613 |
1 | 3.0 | 0.890727 |
2 | 4.0 | 0.672800 |
3 | 5.0 | 0.658783 |
4 | 6.0 | 0.616108 |
5 | 7.0 | 0.537130 |
6 | 8.0 | 0.473735 |
7 | 9.0 | 0.536511 |
8 | 10.0 | 0.502654 |
9 | 11.0 | 0.516242 |
10 | 12.0 | 0.501662 |
11 | 13.0 | 0.448177 |
12 | 14.0 | 0.493592 |
13 | 15.0 | 0.449001 |
14 | 16.0 | 0.432740 |
15 | 17.0 | 0.410725 |
16 | 18.0 | 0.391229 |
17 | 19.0 | 0.411393 |
Use an interactive Altair chart to show the results
chart = alt.Chart(K_df).mark_point().encode(
    x='K',
    y='silhouette',
)
chart = chart.interactive() # make the chart pannable and zoomable
chart
Obtain the k value that maximizes the silhouette coefficient (searching over k = 2 through 9)
k = np.argmax(K_Silhouette_List[1][:8]) + 2 # argmax over the scores for k = 2..9; +2 converts the index back to a k value
print("The choice of K is", k)
The choice of K is 2
After finding the k value, use it to run the KMeans clustering. First, initialize the cluster centers
centers, kmeanslabels = vq.kmeans2(data = df, k = k, iter = 1, minit = 'random')
/shared-libs/python3.7/py/lib/python3.7/site-packages/scipy/cluster/vq.py:607: UserWarning: One of the clusters is empty. Re-run kmeans with a different initialization.
warnings.warn("One of the clusters is empty. "
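The warning above means the random initialization left one of the k clusters empty. Re-running the cell usually fixes this; a hedged alternative (not used below) is kmeans++-style seeding, which spreads the initial centers apart:
centers, kmeanslabels = vq.kmeans2(data = df, k = k, iter = 1, minit = '++')  # hypothetical alternative: '++' seeding rarely leaves a cluster empty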
Then run KMeans iteratively from the given centers
max_iter = 1000 # 1000 single-step updates so the clustering converges
for i in trange(1, max_iter):
    centers, kmeanslabels = vq.kmeans2(data = df, k = centers, iter = 1, minit = 'matrix') # warm-start from the previous centers
100%|██████████| 999/999 [00:02<00:00, 414.15it/s]
Finally, obtain the final KMeans labels
kmeanslabels = pd.Series(kmeanslabels)
kmeanslabels
0 1
1 1
2 1
3 1
4 1
..
3348 1
3349 1
3350 1
3351 1
3352 1
Length: 3353, dtype: int32
Add the KMeans labels as the last column and show the dataframe
df = df.assign(kmeanslabel=kmeanslabels)
df
County_Density | Vaccine_Trust_Index | Personal_Responsibility | Trust_Science_Apolitical | Trust_Science_Politicians | Trust_Science_Media | Trust_Science_Community | Trust_National | Trust_State | Trust_Local | ... | Pandemic_Impact_Network | Infected_Personal | Infected_Network | Biden | Trump | Party_ID | Household_Income | Vaccine_Required | Evangelical | kmeanslabel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 137.851795 | 0.000000 | 10.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 1.0 | 0 | 1.0 | 1 |
1 | 38.751406 | 9.000000 | 5.0 | 8.0 | 4.0 | 6.0 | 9.0 | 8.0 | 2.0 | 7.0 | ... | 5.0 | 0.0 | 0.0 | 1 | 0 | 1 | 3.0 | 0 | 0.0 | 1 |
2 | 18.103752 | 8.666667 | 7.0 | 6.0 | 1.0 | 1.0 | 6.0 | 7.0 | 7.0 | 9.0 | ... | 6.0 | 0.0 | 1.0 | 0 | 1 | 0 | 6.0 | 0 | 0.0 | 1 |
3 | 26.912917 | 4.000000 | 7.0 | 6.0 | 6.0 | 4.0 | 6.0 | 6.0 | 6.0 | 6.0 | ... | 7.0 | 0.0 | 0.0 | 1 | 0 | 1 | 4.0 | 0 | 1.0 | 1 |
4 | 1541.026670 | 7.000000 | 6.0 | 6.0 | 1.0 | 2.0 | 6.0 | 6.0 | 2.0 | 7.0 | ... | 2.5 | 0.0 | 1.0 | 1 | 1 | 0 | 3.0 | 0 | 1.0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3348 | 470.033815 | 1.666667 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | 4.0 | ... | 7.5 | 0.0 | 0.0 | 0 | 1 | 2 | 8.0 | 0 | 1.0 | 1 |
3349 | 225.323560 | 9.333333 | 2.0 | 9.0 | 1.0 | 8.0 | 9.0 | 6.0 | 7.0 | 7.0 | ... | 3.5 | 0.0 | 1.0 | 1 | 0 | 0 | 2.0 | 0 | 0.0 | 1 |
3350 | 108.474026 | 9.000000 | 8.0 | 6.0 | 3.0 | 3.0 | 5.0 | 6.0 | 7.0 | 6.0 | ... | 6.0 | 0.0 | 0.0 | 0 | 1 | 0 | 2.0 | 1 | 0.0 | 1 |
3351 | 63.385677 | 9.333333 | 5.0 | 5.0 | 4.0 | 6.0 | 9.0 | 8.0 | 8.0 | 7.0 | ... | 3.5 | 0.0 | 1.0 | 1 | 0 | 1 | 12.0 | 1 | 0.0 | 1 |
3352 | 599.558111 | 9.000000 | 10.0 | 1.0 | 0.0 | 0.0 | 10.0 | 9.0 | 9.0 | 10.0 | ... | 9.0 | 1.0 | 1.0 | 0 | 1 | 0 | 5.0 | 0 | 1.0 | 1 |
3353 rows × 42 columns
Next, perform the regression. First, set the device, the number of epochs, and the other hyperparameters.
device = 'cpu' # no CUDA available, so run on the CPU
Epoch = 10 # number of epochs; a larger value would show the trend more clearly
max_iter = 50 # 50 training iterations per epoch, with a test-set evaluation after each epoch
Next, construct the model with PyTorch
model = nn.Sequential(
    nn.Linear(df.shape[1], 512), # input layer: one unit per feature column
    nn.ReLU(),
    nn.Linear(512, 512),         # two hidden layers of width 512
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 1)            # output layer: a single regression value
).to(device)
Because this is a regression problem, the mean squared error (MSE) is selected as the loss function
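Concretely, for predictions $\hat{y}_i$ and targets $y_i$, nn.MSELoss computes $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$, which is the squared L2 distance averaged over the $n$ samples.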
loss_fn = nn.MSELoss().to(device)
Define the Adam optimizer with a learning rate of 1e-4, since a larger learning rate makes the network overfit faster.
optimizer = torch.optim.Adam(model.parameters(), lr = 1e-4)
Split the data and construct the training and test sets (the first 90% of the rows for training, the last 10% for testing)
# training set: the first 90% of the rows
x = torch.FloatTensor(np.array(df[:int(0.9*df.shape[0])])).to(device)
y = torch.FloatTensor(np.array(labels[:int(0.9*df.shape[0])])).to(device).unsqueeze(-1)
# test set: the remaining 10%
x_test = torch.FloatTensor(np.array(df[int(0.9*df.shape[0]):])).to(device)
y_test = torch.FloatTensor(np.array(labels[int(0.9*df.shape[0]):])).to(device).unsqueeze(-1)
print(x.shape,y.shape)
print(x_test.shape,y_test.shape)
torch.Size([3017, 42]) torch.Size([3017, 1])
torch.Size([336, 42]) torch.Size([336, 1])
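Note that this split takes the first 90% of the rows in file order. A minimal sketch of a shuffled split (an alternative, not what is run below) would guard against any ordering in the raw data:
perm = np.random.permutation(df.shape[0]) # hypothetical: random permutation of the 3353 row indices
cut = int(0.9 * df.shape[0])              # 90/10 split point
x = torch.FloatTensor(df.to_numpy()[perm[:cut]]).to(device)
y = torch.FloatTensor(labels.to_numpy()[perm[:cut]]).to(device).unsqueeze(-1)
x_test = torch.FloatTensor(df.to_numpy()[perm[cut:]]).to(device)
y_test = torch.FloatTensor(labels.to_numpy()[perm[cut:]]).to(device).unsqueeze(-1)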
Now it is time to run the training and testing loop. Note that training the MLP means iterating the same update step over and over.
for epoch in range(Epoch):
    # training with the training set
    for it in trange(max_iter):
        y_pred = model(x)          # obtain the predicted values
        loss = loss_fn(y_pred, y)  # compute the loss
        optimizer.zero_grad()      # clear the gradients held by the optimizer
        loss.backward()            # backpropagate the loss
        optimizer.step()           # update the parameters
    # testing with the test set
    y_pred_test = model(x_test)
    loss_test = loss_fn(y_pred_test, y_test)
    print(f'Epoch{epoch}, Training Set Loss:{loss.item()}, Test Set Loss:{loss_test.item()}')
100%|██████████| 50/50 [00:33<00:00, 1.49it/s]
Epoch0, Training Set Loss:0.5529709458351135, Test Set Loss:0.8505077362060547
100%|██████████| 50/50 [00:33<00:00, 1.51it/s]
Epoch1, Training Set Loss:0.17739205062389374, Test Set Loss:0.21063323318958282
100%|██████████| 50/50 [00:32<00:00, 1.52it/s]
Epoch2, Training Set Loss:0.24123425781726837, Test Set Loss:0.1880612075328827
100%|██████████| 50/50 [00:33<00:00, 1.50it/s]
Epoch3, Training Set Loss:0.116368368268013, Test Set Loss:0.15852192044258118
100%|██████████| 50/50 [00:33<00:00, 1.51it/s]
Epoch4, Training Set Loss:0.10501353442668915, Test Set Loss:0.14747793972492218
100%|██████████| 50/50 [00:33<00:00, 1.49it/s]
Epoch5, Training Set Loss:0.10363566875457764, Test Set Loss:0.15894567966461182
100%|██████████| 50/50 [00:33<00:00, 1.50it/s]
Epoch6, Training Set Loss:0.0968862771987915, Test Set Loss:0.1493232548236847
100%|██████████| 50/50 [00:33<00:00, 1.50it/s]
Epoch7, Training Set Loss:0.09015238285064697, Test Set Loss:0.1419822871685028
100%|██████████| 50/50 [00:33<00:00, 1.51it/s]
Epoch8, Training Set Loss:0.08603479713201523, Test Set Loss:0.14332301914691925
100%|██████████| 50/50 [00:33<00:00, 1.51it/s]
Epoch9, Training Set Loss:0.08362097293138504, Test Set Loss:0.13961467146873474
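Since Vaccine_Hesitant is a 0/1 label, a quick sanity check (an illustrative addition, not part of the original run) is to threshold the regression output at 0.5 and compute the accuracy on the test set:
with torch.no_grad():                     # no gradients are needed for evaluation
    preds = (model(x_test) > 0.5).float() # threshold the regression output at 0.5
accuracy = (preds == y_test).float().mean().item()
print(f"Test accuracy at the 0.5 threshold: {accuracy:.3f}")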
Summary
The pandas library is used to read the data, fill in the missing values, and convert text to numbers, while KMeans provides the feature engineering. A multilayer perceptron handles the regression problem. The results show that the model begins to overfit after 2500 iterations, with a minimum mean squared error of 0.13364802300930023 on the test set.
References
What is the source of your dataset(s)? Answer: the dataset comes from Kaggle.
Were any portions of the code or ideas taken from another source? List those sources here and say how they were used. Answer: most of my code is original; the rest comes from the class or is adapted from YouTube tutorials (see the next question).
List other references that you found helpful.
Answer:
The PyTorch tutorial: https://www.youtube.com/watch?v=c36lUUr864M
The tqdm tutorial: https://www.youtube.com/watch?v=8zm4L3rVreI
The official document of PyTorch: https://pytorch.org/docs/stable/index.html
The definition of the silhouette method: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html