Monday Worksheet
On Friday, we ran K-Means clustering on the numeric columns from the Spotify dataset. We should add an additional step, because it turns out that in Friday's version, one of the dimensions (“input variables”/“features”/“predictors”) was dominating all the others.
Warm-up with any and all
On Friday, we removed the rows that were bad in the Energy column, and we were lucky that this was good enough. We really should have removed the rows that were bad in any column. We can do that using either any or all.
- Using rng.choice and pd.DataFrame, make a 15x3 pandas DataFrame consisting of randomly chosen values of True and False. Call it df_bool.
- Try evaluating df_bool.any(axis = 1), df_bool.all(axis = 1), df_bool.any(axis = 0), and df_bool.all(axis = 0) to learn about these any and all methods. (One possible approach is sketched below.)
(This same syntax works for a NumPy array. We asked you to make a pandas DataFrame so that the values are easier to display in the notebook.)
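If you get stuck on this warm-up, here is one possible approach. It is only a sketch, not the only solution: the column names given to df_bool are arbitrary, and it assumes a NumPy random number generator created with np.random.default_rng.

import numpy as np
import pandas as pd

rng = np.random.default_rng()

# 15x3 array of values chosen randomly from [True, False]
arr = rng.choice([True, False], size=(15, 3))
df_bool = pd.DataFrame(arr, columns=["a", "b", "c"])

# axis=1 combines the entries within each row; axis=0 combines within each column
df_bool.any(axis=1)  # True for a row if at least one entry in that row is True
df_bool.all(axis=1)  # True for a row only if every entry in that row is True
df_bool.any(axis=0)  # one value for each of the three columns
df_bool.all(axis=0)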
Identifying the problem
In a Jupyter notebook, we will basically follow the procedure from Friday’s class, but there will be a few changes.
- Import the Spotify dataset as df (be sure to use the na_values argument, so that the appropriate columns become numeric automatically).
- Verify that the correct columns are numeric by using dtypes.
- Let's say a row is bad if any of its entries is NaN. Using either isna or notna, together with either any or all, remove the bad rows from the DataFrame. Call the result df2. Put a .copy() afterwards just to be safe.
- Check your answer by computing the shape of df2. It should be (1545, 23).
- Get a list numeric_cols containing the names of all the numeric columns from df2. Like on Friday, use is_numeric_dtype. Your list should have length 14.
- Make a new DataFrame df3 containing only the numeric columns of df2.
- Instantiate a KMeans object, specifying 10 clusters. Call the result kmeans. fit the object using df3.
- We are going to store the cluster values in the previous DataFrame df2, not in df3. That is better because df2 contains information like the artist name and song title. Use code of the following form. (A sketch of all of these steps together appears after the Altair code below.)
df2[???] = kmeans.predict(df2[???])
Plot the result using the following code with Altair. Make the colors look better by specifying the appropriate encoding type for the color column.
alt.Chart(df2).mark_circle().encode(
x = "Artist Followers",
y = "Valence",
color = "cluster"
)
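For reference, the steps in this section might look roughly like the following. This is only a sketch: the file name spotify_dataset.csv and the na_values value are assumptions (use whatever matches your copy of the data), and the ":N" suffix is one reasonable way to get a nominal encoding type so the cluster colors look better.

import pandas as pd
import altair as alt
from pandas.api.types import is_numeric_dtype
from sklearn.cluster import KMeans

# File name and na_values are assumptions; adjust them to match your dataset.
df = pd.read_csv("spotify_dataset.csv", na_values=" ")
df.dtypes  # check which columns became numeric

# A row is bad if any of its entries is NaN; keep only the good rows.
df2 = df[~df.isna().any(axis=1)].copy()
df2.shape  # should be (1545, 23)

numeric_cols = [c for c in df2.columns if is_numeric_dtype(df2[c])]  # length 14
df3 = df2[numeric_cols]

kmeans = KMeans(n_clusters=10)
kmeans.fit(df3)
df2["cluster"] = kmeans.predict(df2[numeric_cols])

alt.Chart(df2).mark_circle().encode(
    x = "Artist Followers",
    y = "Valence",
    color = "cluster:N"  # nominal encoding type, so clusters get distinct colors
)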
Our clusters were fit using 14-dimensional data, but one of those dimensions is dominating because its values are so much larger than the values in the others.
- Check the mean and standard deviation of all the different columns in df3 by using df3.mean(axis = ???) and df3.std(axis = ???). (If you want to get numbers that are easier to read, without the scientific notation, you can change the way pandas displays float values, as in this Stack Overflow answer.)
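For example, one common option (which may or may not be the one from that Stack Overflow answer) is pandas' display.float_format setting; the sketch below assumes two decimal places is enough detail.

import pandas as pd

# Display floats with two decimal places instead of scientific notation
pd.set_option("display.float_format", "{:.2f}".format)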
Fixing the problem using StandardScaler
This is such a common issue in Machine Learning (not just for clustering) that scikit-learn has a built-in tool for rescaling the different columns. The tool is called StandardScaler, and its syntax is very similar to the usual scikit-learn syntax.
- Import StandardScaler using the following code.
from sklearn.preprocessing import StandardScaler
- Instantiate a new StandardScaler object, and name it scaler. Do this the same way you instantiate a LinearRegression object or a KMeans object. You don't need to pass any arguments inside the parentheses.
- fit the object using scaler.fit(df3).
- Normally we would try scaler.predict(df3), but since we are not predicting anything, a different word is used. Evaluate scaler.transform(df3). Convert it to a pandas DataFrame and name the result df4. Give df4 the same column names as df3. (These steps are sketched below.)
You should think of df4 and df3 as containing the same data, with the only difference being that the data in df4 has been normalized.
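Putting those steps together, the rescaling might look roughly like this (a sketch, reusing the df3 from the previous section):

import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df3)

# transform returns a NumPy array, so wrap it in a DataFrame with the old column names
df4 = pd.DataFrame(scaler.transform(df3), columns=df3.columns)
# Each column of df4 should now have mean approximately 0 and standard deviation approximately 1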
- Evaluate the mean and standard deviation of the columns of df4.
- Instantiate a new KMeans object with however many clusters you want. fit the new object using df4, and then predict, also using df4. (It would be a little more robust to instead predict using df2, but then we would need another round of scaling and putting in column names.) Put the resulting cluster numbers into the “cluster” column of df2 (so overwrite the old “cluster” column).
- Redo the Altair plot from above. Add a tooltip specifying the song name and artist name. If everything went correctly, the colors should look less strictly ordered than before the scaling was done.
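One way to finish might look roughly like the following sketch. The tooltip column names "Song Name" and "Artist" are guesses, so check df2.columns for the exact spelling in your dataset.

import altair as alt
from sklearn.cluster import KMeans

kmeans2 = KMeans(n_clusters=10)  # or however many clusters you want
kmeans2.fit(df4)
df2["cluster"] = kmeans2.predict(df4)  # overwrite the old cluster column

alt.Chart(df2).mark_circle().encode(
    x = "Artist Followers",
    y = "Valence",
    color = "cluster:N",
    tooltip = ["Song Name", "Artist"]  # column names are guesses; check df2.columns
)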