Dance of the COVID-19: The Regression Model of Vaccine Hesitancy


Author: Jiahui Sheng

Course Project, UC Irvine, Math 10, S22

Introduction

To investigate and predict vaccine hesitancy in the United States, a Kaggle dataset containing variables such as party affiliation and sex is used. First, during data preprocessing, missing values are filled with the "pad" (forward-fill) method and categorical text columns are converted to numbers with pandas. Next, KMeans clustering is applied to generate an additional feature. Finally, the data is split into a training set and a test set and fed to a multilayer perceptron that uses the mean squared error (L2) loss; the training and test losses are compared to determine whether the model is overfitting.

Main portion of the project

  • Data preprocessing. The preprocessing with pandas includes missing-value filling, text-to-number conversion, and data scaling. Missing values are filled with the ‘pad’ method, which propagates the last valid value forward. The text conversion maps each categorical string in this dataset to an integer code. The intended scaling method is min-max normalization (a sketch of this step is shown after this list).

  • KMeans clustering. Because this is a regression problem, the data has no cluster labels. To determine a suitable value of k, the silhouette method is used. The resulting KMeans labels are then appended to the data as an additional feature column.

  • Regression. A multilayer perceptron is used because of its strong fitting ability. The dataset is split into a training set and a test set so that overfitting can be detected.
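The first bullet mentions min-max normalization, but the notebook as run below does not include an explicit scaling step. The following is a minimal, illustrative sketch of how min-max scaling could be applied to the numeric feature columns with pandas; it is not part of the recorded run.

import pandas as pd

# Min-max normalization sketch: rescale every column to [0, 1] via (x - min) / (max - min).
def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    col_min = frame.min()
    col_max = frame.max()
    span = (col_max - col_min).replace(0, 1)  # avoid division by zero for constant columns
    return (frame - col_min) / span

# Example usage (after the text-to-number conversion shown later): df_scaled = min_max_scale(df)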

Import all packages needed for the analysis

import numpy as np
import pandas as pd
import random
from scipy.cluster import vq
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import torch
import torch.nn as nn
from tqdm import trange
import altair as alt

Read the CSV file and show it

df = pd.read_csv("vh_data15.csv",encoding="utf8")
df
County_Density Vaccine_Trust_Index Personal_Responsibility Trust_Science_Apolitical Trust_Science_Politicians Trust_Science_Media Trust_Science_Community Trust_National Trust_State Trust_Local ... Pandemic_Impact_Network Infected_Personal Infected_Network Biden Trump Party_ID Household_Income Vaccine_Required Evangelical Vaccine_Hesitant
0 137.851795 0.000000 10.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 ... 0.0 0.0 0.0 No Yes Republican 1.0 0 1.0 1
1 38.751406 9.000000 5.0 8.0 4.0 6.0 9.0 8.0 2.0 7.0 ... 5.0 0.0 0.0 Yes No Democrat 3.0 0 0.0 0
2 18.103752 8.666667 7.0 6.0 1.0 1.0 6.0 7.0 7.0 9.0 ... 6.0 0.0 1.0 No Yes Republican 6.0 0 0.0 0
3 26.912917 4.000000 7.0 6.0 6.0 4.0 6.0 6.0 6.0 6.0 ... 7.0 0.0 0.0 Yes No Democrat 4.0 0 1.0 1
4 1541.026670 7.000000 6.0 6.0 1.0 2.0 6.0 6.0 2.0 7.0 ... 2.5 0.0 1.0 Yes Yes Republican 3.0 0 1.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3348 470.033815 1.666667 8.0 0.0 0.0 0.0 0.0 2.0 2.0 4.0 ... 7.5 0.0 0.0 No Yes Independent 8.0 0 1.0 1
3349 225.323560 9.333333 2.0 9.0 1.0 8.0 9.0 6.0 7.0 7.0 ... 3.5 0.0 1.0 Yes No Republican 2.0 0 0.0 0
3350 108.474026 9.000000 8.0 6.0 3.0 3.0 5.0 6.0 7.0 6.0 ... 6.0 0.0 0.0 No Yes Republican 2.0 1 0.0 0
3351 63.385677 9.333333 5.0 5.0 4.0 6.0 9.0 8.0 8.0 7.0 ... 3.5 0.0 1.0 Yes No Democrat 12.0 1 0.0 0
3352 599.558111 9.000000 10.0 1.0 0.0 0.0 10.0 9.0 9.0 10.0 ... 9.0 1.0 1.0 No Yes Republican 5.0 0 1.0 1

3353 rows × 42 columns

Data preprocessing: fill NaN values with the preceding valid value (the 'pad' method)

df = df.fillna(method="pad")
df
County_Density Vaccine_Trust_Index Personal_Responsibility Trust_Science_Apolitical Trust_Science_Politicians Trust_Science_Media Trust_Science_Community Trust_National Trust_State Trust_Local ... Pandemic_Impact_Network Infected_Personal Infected_Network Biden Trump Party_ID Household_Income Vaccine_Required Evangelical Vaccine_Hesitant
0 137.851795 0.000000 10.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 ... 0.0 0.0 0.0 No Yes Republican 1.0 0 1.0 1
1 38.751406 9.000000 5.0 8.0 4.0 6.0 9.0 8.0 2.0 7.0 ... 5.0 0.0 0.0 Yes No Democrat 3.0 0 0.0 0
2 18.103752 8.666667 7.0 6.0 1.0 1.0 6.0 7.0 7.0 9.0 ... 6.0 0.0 1.0 No Yes Republican 6.0 0 0.0 0
3 26.912917 4.000000 7.0 6.0 6.0 4.0 6.0 6.0 6.0 6.0 ... 7.0 0.0 0.0 Yes No Democrat 4.0 0 1.0 1
4 1541.026670 7.000000 6.0 6.0 1.0 2.0 6.0 6.0 2.0 7.0 ... 2.5 0.0 1.0 Yes Yes Republican 3.0 0 1.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3348 470.033815 1.666667 8.0 0.0 0.0 0.0 0.0 2.0 2.0 4.0 ... 7.5 0.0 0.0 No Yes Independent 8.0 0 1.0 1
3349 225.323560 9.333333 2.0 9.0 1.0 8.0 9.0 6.0 7.0 7.0 ... 3.5 0.0 1.0 Yes No Republican 2.0 0 0.0 0
3350 108.474026 9.000000 8.0 6.0 3.0 3.0 5.0 6.0 7.0 6.0 ... 6.0 0.0 0.0 No Yes Republican 2.0 1 0.0 0
3351 63.385677 9.333333 5.0 5.0 4.0 6.0 9.0 8.0 8.0 7.0 ... 3.5 0.0 1.0 Yes No Democrat 12.0 1 0.0 0
3352 599.558111 9.000000 10.0 1.0 0.0 0.0 10.0 9.0 9.0 10.0 ... 9.0 1.0 1.0 No Yes Republican 5.0 0 1.0 1

3353 rows × 42 columns

Because some columns contain strings, use replace to map them to integer codes

df = df.replace({"No":0,"Yes":1,"Republican":0,"Democrat":1,"Independent":2,"Libertarian":3,'Other party':4})

df
County_Density Vaccine_Trust_Index Personal_Responsibility Trust_Science_Apolitical Trust_Science_Politicians Trust_Science_Media Trust_Science_Community Trust_National Trust_State Trust_Local ... Pandemic_Impact_Network Infected_Personal Infected_Network Biden Trump Party_ID Household_Income Vaccine_Required Evangelical Vaccine_Hesitant
0 137.851795 0.000000 10.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 ... 0.0 0.0 0.0 0 1 0 1.0 0 1.0 1
1 38.751406 9.000000 5.0 8.0 4.0 6.0 9.0 8.0 2.0 7.0 ... 5.0 0.0 0.0 1 0 1 3.0 0 0.0 0
2 18.103752 8.666667 7.0 6.0 1.0 1.0 6.0 7.0 7.0 9.0 ... 6.0 0.0 1.0 0 1 0 6.0 0 0.0 0
3 26.912917 4.000000 7.0 6.0 6.0 4.0 6.0 6.0 6.0 6.0 ... 7.0 0.0 0.0 1 0 1 4.0 0 1.0 1
4 1541.026670 7.000000 6.0 6.0 1.0 2.0 6.0 6.0 2.0 7.0 ... 2.5 0.0 1.0 1 1 0 3.0 0 1.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3348 470.033815 1.666667 8.0 0.0 0.0 0.0 0.0 2.0 2.0 4.0 ... 7.5 0.0 0.0 0 1 2 8.0 0 1.0 1
3349 225.323560 9.333333 2.0 9.0 1.0 8.0 9.0 6.0 7.0 7.0 ... 3.5 0.0 1.0 1 0 0 2.0 0 0.0 0
3350 108.474026 9.000000 8.0 6.0 3.0 3.0 5.0 6.0 7.0 6.0 ... 6.0 0.0 0.0 0 1 0 2.0 1 0.0 0
3351 63.385677 9.333333 5.0 5.0 4.0 6.0 9.0 8.0 8.0 7.0 ... 3.5 0.0 1.0 1 0 1 12.0 1 0.0 0
3352 599.558111 9.000000 10.0 1.0 0.0 0.0 10.0 9.0 9.0 10.0 ... 9.0 1.0 1.0 0 1 0 5.0 0 1.0 1

3353 rows × 42 columns
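A side note on the encoding choice: mapping Party_ID to the integers 0–4 imposes an artificial ordering on a nominal variable. A common alternative is one-hot encoding; the sketch below is illustrative only and is not used elsewhere in this notebook.

# Illustrative alternative: one-hot encode the nominal Party_ID column instead of
# keeping the 0-4 integer mapping above (this sketch is not part of the recorded run).
df_onehot = pd.get_dummies(df, columns=["Party_ID"], prefix="Party")
df_onehot.head()  # contains indicator columns such as Party_0, Party_1, ... instead of Party_ID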

Separate the features and the labels

labels = df['Vaccine_Hesitant']
df = df[df.columns[:-1]]
df
labels
0       1
1       0
2       0
3       1
4       0
       ..
3348    1
3349    0
3350    0
3351    0
3352    1
Name: Vaccine_Hesitant, Length: 3353, dtype: int64

Show basic information and summary statistics of the dataframe

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3353 entries, 0 to 3352
Data columns (total 41 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   County_Density             3353 non-null   float64
 1   Vaccine_Trust_Index        3353 non-null   float64
 2   Personal_Responsibility    3353 non-null   float64
 3   Trust_Science_Apolitical   3353 non-null   float64
 4   Trust_Science_Politicians  3353 non-null   float64
 5   Trust_Science_Media        3353 non-null   float64
 6   Trust_Science_Community    3353 non-null   float64
 7   Trust_National             3353 non-null   float64
 8   Trust_State                3353 non-null   float64
 9   Trust_Local                3353 non-null   float64
 10  Trust_Media                3353 non-null   float64
 11  Perceived_Risk             3353 non-null   float64
 12  Perceived_Network_Risk     3353 non-null   float64
 13  Doctor_Comfort             3353 non-null   float64
 14  Fear_Needles               3353 non-null   float64
 15  Condition_Pregnancy        3353 non-null   float64
 16  Condition_Asthma           3353 non-null   float64
 17  Condition_Lung             3353 non-null   float64
 18  Condition_Diabetes         3353 non-null   float64
 19  Condition_Immune           3353 non-null   float64
 20  Condition_Obesity          3353 non-null   float64
 21  Condition_Heart            3353 non-null   float64
 22  Condition_Organ            3353 non-null   float64
 23  County_Cases               3353 non-null   float64
 24  County_Cases2wk            3353 non-null   float64
 25  Male                       3353 non-null   int64  
 26  Race                       3353 non-null   int64  
 27  Age                        3353 non-null   float64
 28  PS_Index                   3353 non-null   float64
 29  Natural_Science_Literacy   3353 non-null   float64
 30  College_Degree             3353 non-null   float64
 31  Pandemic_Impact            3353 non-null   float64
 32  Pandemic_Impact_Network    3353 non-null   float64
 33  Infected_Personal          3353 non-null   float64
 34  Infected_Network           3353 non-null   float64
 35  Biden                      3353 non-null   int64  
 36  Trump                      3353 non-null   int64  
 37  Party_ID                   3353 non-null   int64  
 38  Household_Income           3353 non-null   float64
 39  Vaccine_Required           3353 non-null   int64  
 40  Evangelical                3353 non-null   float64
dtypes: float64(35), int64(6)
memory usage: 1.0 MB
df.describe()
County_Density Vaccine_Trust_Index Personal_Responsibility Trust_Science_Apolitical Trust_Science_Politicians Trust_Science_Media Trust_Science_Community Trust_National Trust_State Trust_Local ... Pandemic_Impact Pandemic_Impact_Network Infected_Personal Infected_Network Biden Trump Party_ID Household_Income Vaccine_Required Evangelical
count 3353.000000 3353.000000 3353.000000 3353.000000 3353.000000 3353.000000 3353.000000 3353.000000 3353.000000 3353.000000 ... 3353.000000 3353.000000 3353.000000 3353.000000 3353.000000 3353.000000 3353.000000 3353.000000 3353.000000 3353.000000
mean 882.153305 7.583408 7.033701 5.454817 2.324187 3.589323 6.494483 5.090665 5.539815 6.025649 ... 4.771399 4.795407 0.050701 0.401432 0.615866 0.358187 1.084402 6.345959 0.211452 0.179242
std 2638.189276 2.532152 2.477415 2.850460 2.225544 2.952058 2.540727 2.431302 2.539364 2.251478 ... 1.699539 1.684785 0.219419 0.490261 0.486462 0.479539 0.921941 3.684398 0.408399 0.383612
min 0.928098 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000
25% 108.384222 6.333333 5.000000 3.000000 0.000000 1.000000 5.000000 4.000000 4.000000 5.000000 ... 4.000000 4.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.000000 0.000000 0.000000
50% 299.588930 8.333333 7.000000 5.000000 2.000000 4.000000 7.000000 5.000000 6.000000 6.000000 ... 5.000000 5.000000 0.000000 0.000000 1.000000 0.000000 1.000000 6.000000 0.000000 0.000000
75% 715.275377 9.333333 9.000000 8.000000 4.000000 5.000000 8.000000 7.000000 8.000000 8.000000 ... 5.500000 5.500000 0.000000 1.000000 1.000000 1.000000 2.000000 10.000000 0.000000 0.000000
max 27819.804800 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 ... 10.000000 10.000000 1.000000 1.000000 1.000000 1.000000 4.000000 12.000000 1.000000 1.000000

8 rows × 41 columns

To determine the value of k, the silhouette method (chosen after a literature search) is applied, testing cluster counts from 2 up to 19

K_Silhouette_List = [] #initializing the silhouette list

for k_value in trange(2, 20):
    kmeans = KMeans(n_clusters=k_value, init='random') #fit KMeans with k_value clusters
    cluster_labels = kmeans.fit_predict(df) #obtain the cluster labels
    silhouette_avg = silhouette_score(df, cluster_labels) #calculate the average silhouette score
    K_Silhouette_List.append([k_value, silhouette_avg]) #record (k, score)
100%|██████████| 18/18 [00:56<00:00,  3.11s/it]

Visualize how the silhouette score changes with k

K_Silhouette_List = np.array(K_Silhouette_List).T
K_df = pd.DataFrame({'K':K_Silhouette_List[0],'silhouette':K_Silhouette_List[1]})
K_df
K silhouette
0 2.0 0.950613
1 3.0 0.890727
2 4.0 0.672800
3 5.0 0.658783
4 6.0 0.616108
5 7.0 0.537130
6 8.0 0.473735
7 9.0 0.536511
8 10.0 0.502654
9 11.0 0.516242
10 12.0 0.501662
11 13.0 0.448177
12 14.0 0.493592
13 15.0 0.449001
14 16.0 0.432740
15 17.0 0.410725
16 18.0 0.391229
17 19.0 0.411393

Use an interactive Altair chart to show the results

chart = alt.Chart(K_df).mark_point().encode(
    x='K',
    y='silhouette',
).properties(
    width=100,
    height=50
).interactive() #make the chart interactive (zoom and pan)
chart

Choose k as the value that maximizes the silhouette score

k = np.argmax(K_Silhouette_List[1][:8]) + 2 #pick the k with the largest silhouette score (offset by 2 because k starts at 2)
print("The choice of K is", k)
The choice of K is 2

With k chosen, run KMeans clustering. First, initialize the cluster centers with a single kmeans2 pass using random initialization

centers, kmeanslabels = vq.kmeans2(data = df, k = k, iter = 1, minit = 'random') 
/shared-libs/python3.7/py/lib/python3.7/site-packages/scipy/cluster/vq.py:607: UserWarning: One of the clusters is empty. Re-run kmeans with a different initialization.
  warnings.warn("One of the clusters is empty. "

Then run KMeans repeatedly with the given k, refining the centers from the previous pass

max_iter = 1000 #run many refinement passes to reduce the effect of the initial centers
for i in trange(1, max_iter):
    centers, kmeanslabels = vq.kmeans2(data = df, k = centers, iter = 1, minit = 'matrix') #reuse the previous centers as the initialization
100%|██████████| 999/999 [00:02<00:00, 414.15it/s]
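For comparison, the same clustering can usually be done in a single call with scikit-learn's KMeans, which was already imported above. This is an optional sketch, not part of the recorded run, and its labels may differ from the kmeans2 result because the initialization is random.

# Optional one-call equivalent with scikit-learn (illustrative; not part of the recorded run).
kmeans_alt = KMeans(n_clusters=k, n_init=10, max_iter=1000, random_state=0)
kmeanslabels_alt = kmeans_alt.fit_predict(df)  # one cluster label per row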

Finally, collect the resulting KMeans labels as a Series

kmeanslabels = pd.Series(kmeanslabels) 
kmeanslabels
0       1
1       1
2       1
3       1
4       1
       ..
3348    1
3349    1
3350    1
3351    1
3352    1
Length: 3353, dtype: int32

Add the labels as the last column and show the dataframe

df = df.assign(kmeanslabel=kmeanslabels)
df
County_Density Vaccine_Trust_Index Personal_Responsibility Trust_Science_Apolitical Trust_Science_Politicians Trust_Science_Media Trust_Science_Community Trust_National Trust_State Trust_Local ... Pandemic_Impact_Network Infected_Personal Infected_Network Biden Trump Party_ID Household_Income Vaccine_Required Evangelical kmeanslabel
0 137.851795 0.000000 10.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 ... 0.0 0.0 0.0 0 1 0 1.0 0 1.0 1
1 38.751406 9.000000 5.0 8.0 4.0 6.0 9.0 8.0 2.0 7.0 ... 5.0 0.0 0.0 1 0 1 3.0 0 0.0 1
2 18.103752 8.666667 7.0 6.0 1.0 1.0 6.0 7.0 7.0 9.0 ... 6.0 0.0 1.0 0 1 0 6.0 0 0.0 1
3 26.912917 4.000000 7.0 6.0 6.0 4.0 6.0 6.0 6.0 6.0 ... 7.0 0.0 0.0 1 0 1 4.0 0 1.0 1
4 1541.026670 7.000000 6.0 6.0 1.0 2.0 6.0 6.0 2.0 7.0 ... 2.5 0.0 1.0 1 1 0 3.0 0 1.0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3348 470.033815 1.666667 8.0 0.0 0.0 0.0 0.0 2.0 2.0 4.0 ... 7.5 0.0 0.0 0 1 2 8.0 0 1.0 1
3349 225.323560 9.333333 2.0 9.0 1.0 8.0 9.0 6.0 7.0 7.0 ... 3.5 0.0 1.0 1 0 0 2.0 0 0.0 1
3350 108.474026 9.000000 8.0 6.0 3.0 3.0 5.0 6.0 7.0 6.0 ... 6.0 0.0 0.0 0 1 0 2.0 1 0.0 1
3351 63.385677 9.333333 5.0 5.0 4.0 6.0 9.0 8.0 8.0 7.0 ... 3.5 0.0 1.0 1 0 1 12.0 1 0.0 1
3352 599.558111 9.000000 10.0 1.0 0.0 0.0 10.0 9.0 9.0 10.0 ... 9.0 1.0 1.0 0 1 0 5.0 0 1.0 1

3353 rows × 42 columns

Next comes the regression part. First, set the device, the number of epochs, and the number of iterations per epoch.

device = 'cpu' #no CUDA is used
Epoch = 10 #set this larger to see the loss curves more clearly
max_iter = 50 #perform 50 training iterations between each evaluation on the test set

Next, construct the model with PyTorch

model = nn.Sequential(
		nn.Linear(df.shape[1], 512),
		nn.ReLU(),
		nn.Linear(512,512),
		nn.ReLU(),
		nn.Linear(512,512),
		nn.ReLU(),
		nn.Linear(512,1)
	).to(device)
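As a quick sanity check (an addition, not part of the original run), the number of trainable parameters can be counted; with 42 input features and three 512-unit hidden layers this comes to roughly 0.55 million.

# Optional sanity check (not part of the recorded run): count trainable parameters.
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_params}")  # roughly 0.55 million for a 42-feature input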

Because this is a regression problem, the mean squared error (MSE) is used as the loss function

loss_fn = nn.MSELoss().to(device)
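For reference, MSE simply averages the squared differences between predictions and targets; the following is an illustrative hand-written equivalent of nn.MSELoss with its default reduction, not something used later in the notebook.

# Illustrative: MSE computed by hand, equivalent to nn.MSELoss(reduction='mean').
def mse_by_hand(pred, target):
    return ((pred - target) ** 2).mean()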

Define the Adam optimizer with a learning rate of 1e-4, because a larger learning rate can make training unstable and lead the network to overfit faster.

optimizer = torch.optim.Adam(model.parameters(), lr = 1e-4)

Split the data into a training set (first 90% of rows) and a test set (last 10%)

#For the training set
x = torch.FloatTensor(np.array(df[:int(0.9*df.shape[0])])).to(device)
y = torch.FloatTensor(np.array(labels[:int(0.9*df.shape[0])])).to(device).unsqueeze(-1)
#For the testing set
x_test = torch.FloatTensor(np.array(df[int(0.9*df.shape[0]):])).to(device)
y_test = torch.FloatTensor(np.array(labels[int(0.9*df.shape[0]):])).to(device).unsqueeze(-1)
print(x.shape,y.shape)
print(x_test.shape,y_test.shape)
torch.Size([3017, 42]) torch.Size([3017, 1])
torch.Size([336, 42]) torch.Size([336, 1])
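Note that the split above is sequential: the first 90% of rows form the training set and the last 10% the test set. If the rows are ordered in any systematic way, a shuffled split is usually safer. A minimal sketch (illustrative; the recorded run uses the sequential split above):

# Optional shuffled split sketch (not part of the recorded run).
perm = torch.randperm(df.shape[0])                       # random permutation of row indices
split = int(0.9 * df.shape[0])
feats = torch.FloatTensor(np.array(df))[perm]            # shuffled features
targs = torch.FloatTensor(np.array(labels)).unsqueeze(-1)[perm]  # shuffled targets
x_shuf, x_test_shuf = feats[:split], feats[split:]
y_shuf, y_test_shuf = targs[:split], targs[split:]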

Now it is time for the training and testing loop. Note that training an MLP means looping over the data many times.

for epoch in range(Epoch):
    #Training with the training set
    for it in trange(max_iter):
        y_pred = model(x) #obtain the predicted values
        loss = loss_fn(y_pred, y) #compute the loss
        optimizer.zero_grad() #clear the accumulated gradients
        loss.backward() #backpropagate the loss
        optimizer.step() #update the parameters

    #Evaluate on the test set
    y_pred_test = model(x_test)
    loss_test = loss_fn(y_pred_test, y_test)
    print(f'Epoch{epoch}, Training Set Loss:{loss.item()}, Test Set Loss:{loss_test.item()}')
100%|██████████| 50/50 [00:33<00:00,  1.49it/s]
Epoch0, Training Set Loss:0.5529709458351135, Test Set Loss:0.8505077362060547
100%|██████████| 50/50 [00:33<00:00,  1.51it/s]
Epoch1, Training Set Loss:0.17739205062389374, Test Set Loss:0.21063323318958282
100%|██████████| 50/50 [00:32<00:00,  1.52it/s]
Epoch2, Training Set Loss:0.24123425781726837, Test Set Loss:0.1880612075328827
100%|██████████| 50/50 [00:33<00:00,  1.50it/s]
Epoch3, Training Set Loss:0.116368368268013, Test Set Loss:0.15852192044258118
100%|██████████| 50/50 [00:33<00:00,  1.51it/s]
Epoch4, Training Set Loss:0.10501353442668915, Test Set Loss:0.14747793972492218
100%|██████████| 50/50 [00:33<00:00,  1.49it/s]
Epoch5, Training Set Loss:0.10363566875457764, Test Set Loss:0.15894567966461182
100%|██████████| 50/50 [00:33<00:00,  1.50it/s]
Epoch6, Training Set Loss:0.0968862771987915, Test Set Loss:0.1493232548236847
100%|██████████| 50/50 [00:33<00:00,  1.50it/s]
Epoch7, Training Set Loss:0.09015238285064697, Test Set Loss:0.1419822871685028
100%|██████████| 50/50 [00:33<00:00,  1.51it/s]
Epoch8, Training Set Loss:0.08603479713201523, Test Set Loss:0.14332301914691925
100%|██████████| 50/50 [00:33<00:00,  1.51it/s]Epoch9, Training Set Loss:0.08362097293138504, Test Set Loss:0.13961467146873474
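One detail worth noting (an addition, not something the recorded run does): the test-set forward pass above still tracks gradients. Wrapping it in torch.no_grad() avoids that overhead, and thresholding the regression output at 0.5 gives a rough accuracy for the binary Vaccine_Hesitant label.

# Optional evaluation sketch (illustrative; not part of the recorded run).
with torch.no_grad():                       # disable gradient tracking for evaluation
    y_pred_test = model(x_test)
    test_mse = loss_fn(y_pred_test, y_test).item()
    # Threshold the regression output at 0.5 to get a rough binary accuracy.
    accuracy = ((y_pred_test > 0.5).float() == y_test).float().mean().item()
print(f"Test MSE: {test_mse:.4f}, thresholded accuracy: {accuracy:.3f}")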

Summary

The pandas library is used to read the data, fill missing values, and convert text to numbers, while KMeans clustering provides the feature engineering. A multilayer perceptron performs the regression. The results show that the model begins to overfit after about 2500 iterations (a longer run than the 10 epochs shown above), with a minimum mean squared error of 0.13364802300930023 on the test set.

References

  • What is the source of your dataset(s)? Answer: The source of the dataset is Kaggle.

  • Were any portions of the code or ideas taken from another source? List those sources here and say how they were used. Answer: Most of the code is original, based on material from the class or on YouTube tutorials (see the next question).

  • List other references that you found helpful.

Answer:

The PyTorch tutorial: https://www.youtube.com/watch?v=c36lUUr864M

The tqdm tutorial: https://www.youtube.com/watch?v=8zm4L3rVreI

The official document of PyTorch: https://pytorch.org/docs/stable/index.html

The definition of the silhouette method: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
