# Monday Worksheet
On Friday, we ran K-Means clustering on the numeric columns of the Spotify dataset. We should add an additional step, because it turns out that in Friday's version, one of the dimensions ("input variables"/"features"/"predictors") was dominating all the others.
## Warm-up with `any` and `all`
On Friday, we removed the rows that were bad in the Energy column, and we were lucky that this was good enough. We really should have removed the rows that were bad in any column. We can do that using either `any` or `all`.
1. Using `rng.choice` and `pd.DataFrame`, make a 15x3 pandas DataFrame consisting of randomly chosen values of True and False. Call it `df_bool`.
2. Try evaluating `df_bool.any(axis = 1)`, `df_bool.all(axis = 1)`, `df_bool.any(axis = 0)`, `df_bool.all(axis = 0)` to learn about these `any` and `all` methods. (A sample sketch appears after this warm-up.)
(This same syntax works for a NumPy array. We asked you to convert it to a pandas DataFrame so that the values were easier to display in the notebook.)
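Here is a minimal sketch of this warm-up. It assumes `rng` is a NumPy random generator; the call to `np.random.default_rng` and the seed below are assumptions, so use whatever `rng` you already defined in class.

```python
import numpy as np
import pandas as pd

# Assumption: create a random generator; replace with the rng you already have.
rng = np.random.default_rng(seed=0)

# 15x3 DataFrame of randomly chosen True/False values.
df_bool = pd.DataFrame(rng.choice([True, False], size=(15, 3)))

# Row-wise: is at least one entry in the row True?  Are all entries True?
df_bool.any(axis=1)
df_bool.all(axis=1)

# Column-wise versions of the same questions.
df_bool.any(axis=0)
df_bool.all(axis=0)
```

In a notebook, evaluate each of the four `any`/`all` lines in its own cell so you can see the output of each one.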
## Identifying the problem
In a Jupyter notebook, we will basically follow the procedure from Friday’s class, but there will be a few changes.
1. Import the Spotify dataset as `df` (be sure to use the `na_values` argument, so that the appropriate columns become numeric automatically).
2. Verify that the correct columns are numeric by using `dtypes`.
3. Let's say a row is bad if any of its entries is NaN. Using either `isna` or `notna`, and either `any` or `all`, remove the bad rows from the DataFrame. Call the result `df2`. Put a `.copy()` afterwards just to be safe.
4. Check your answer by computing the `shape` of `df2`. It should be `(1545, 23)`.
5. Get a list `numeric_cols` containing the names of all the numeric columns from `df2`. Like on Friday, use `is_numeric_dtype`. Your list should have length 14.
6. Make a new DataFrame `df3` containing only the numeric columns of `df2`.
7. Instantiate a `KMeans` object, specifying 10 clusters. Call the result `kmeans`.
8. `fit` the object using `df3`.
9. We are going to store the `cluster` values into the previous DataFrame `df2`, not in `df3`. The reason that is better is that `df2` contains information like artist name and song title. Use code of the following form.

    ```python
    df2[???] = kmeans.predict(df2[???])
    ```

10. Plot the result using the following code with Altair. Make the colors look better by specifying the appropriate encoding type for the color column. (One possible version of these steps is sketched just after this list.)

    ```python
    alt.Chart(df2).mark_circle().encode(
        x = "Artist Followers",
        y = "Valence",
        color = "cluster"
    )
    ```
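If you get stuck, here is one possible version of the steps above. The file name and the `na_values` placeholder are assumptions (use whatever Friday's notebook used), and `cluster:N` is one reasonable choice of encoding type; try filling in the `???` blanks yourself first.

```python
import pandas as pd
import altair as alt
from pandas.api.types import is_numeric_dtype
from sklearn.cluster import KMeans

# Assumptions: adjust the file name and na_values to match Friday's notebook.
df = pd.read_csv("spotify_dataset.csv", na_values=" ")

# A row is "bad" if any entry is NaN; keep only the rows where every entry is not NaN.
df2 = df[df.notna().all(axis=1)].copy()
df2.shape  # should be (1545, 23)

# Names of the numeric columns (there should be 14 of them).
numeric_cols = [c for c in df2.columns if is_numeric_dtype(df2[c])]
df3 = df2[numeric_cols]

# Fit K-Means with 10 clusters and store the labels in df2, which still has
# artist names and song titles.
kmeans = KMeans(n_clusters=10)
kmeans.fit(df3)
df2["cluster"] = kmeans.predict(df2[numeric_cols])

# ":N" (nominal) tells Altair to treat the cluster numbers as categories.
alt.Chart(df2).mark_circle().encode(
    x="Artist Followers",
    y="Valence",
    color="cluster:N",
)
```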
Our clusters are fit using 14-dimensional data, but one of those dimensions is dominating because the values in it are so much larger than the others.
Check the mean and standard deviation of all the different columns in `df3` by using `df3.mean(axis = ???)` and `df3.std(axis = ???)`. (If you want to get numbers that are easier to read, without the scientific notation, you can change the way pandas displays float values, as in this Stack Overflow answer.)
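For example, with `axis=0` these methods report one value per column, and the dominating column will stand out. The `display.float_format` option is one common way to turn off scientific notation; the exact format string below is just an example.

```python
# One number per column: the dominating column has a much larger mean and std.
df3.mean(axis=0)
df3.std(axis=0)

# Optional: display floats without scientific notation.
pd.set_option("display.float_format", "{:,.2f}".format)
```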
## Fixing the problem using `StandardScaler`
This is such a common issue in Machine Learning (not just for clustering) that there is a built-in tool in scikit-learn to rescale the different columns. The tool is called `StandardScaler`, and its syntax is very similar to the usual scikit-learn syntax.
1. Import `StandardScaler` using the following code.

    ```python
    from sklearn.preprocessing import StandardScaler
    ```

2. Instantiate a new `StandardScaler` object, and name it `scaler`. Do this the same way you instantiate a `LinearRegression` object or a `KMeans` object. You don't need to pass any arguments inside the parentheses.
3. `fit` the object using `scaler.fit(df3)`.
4. Normally we would try `scaler.predict(df3)`, but since we are not predicting anything, a different word is used. Evaluate `scaler.transform(df3)`. Convert it to a pandas DataFrame and name the result `df4`. Give `df4` the same column names as `df3`. (A sketch of these steps appears just after this list.)
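A minimal sketch of these steps, assuming `df3` from the previous section:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Instantiate and fit the scaler on the numeric data.
scaler = StandardScaler()
scaler.fit(df3)

# transform (not predict) returns a NumPy array of rescaled values;
# wrap it in a DataFrame and reuse the column names from df3.
df4 = pd.DataFrame(scaler.transform(df3), columns=df3.columns)
```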
You should think of `df4` and `df3` as containing the same data; the only difference is that the data in `df4` has been normalized.
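As a quick sanity check (this anticipates the first step below), each column of `df4` should now have mean approximately 0 and standard deviation approximately 1:

```python
df4.mean(axis=0)  # approximately 0 for every column
df4.std(axis=0)   # approximately 1 for every column
```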
1. Evaluate the mean and standard deviation of the columns of `df4`.
2. Instantiate a new `KMeans` object with however many clusters you want.
3. `fit` the new object using `df4`, and then `predict`, also using `df4`. (It would be a little more robust to instead predict using `df2`, but then we would need another round of scaling and putting in column names.) Put the resulting cluster numbers into the "cluster" column of `df2` (so overwrite the old "cluster" column).
4. Redo the Altair plot from above. Add in a tooltip specifying the song name and artist name. If everything went correctly, the colors should look less strictly ordered than before the scaling was done. (A sketch of these last steps appears after this list.)
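Here is a minimal sketch of these last steps. The variable name `kmeans2`, the choice of 10 clusters, and the tooltip column names `"Song Name"` and `"Artist"` are assumptions; replace them with your own cluster count and with the actual column names in your DataFrame.

```python
from sklearn.cluster import KMeans
import altair as alt

# Fit a new K-Means model on the rescaled data (the cluster count is up to you).
kmeans2 = KMeans(n_clusters=10)
kmeans2.fit(df4)

# Overwrite the old "cluster" column in df2 with the new labels.
df2["cluster"] = kmeans2.predict(df4)

# Same chart as before, now with a tooltip.
# Assumption: "Song Name" and "Artist" are the column names; adjust as needed.
alt.Chart(df2).mark_circle().encode(
    x="Artist Followers",
    y="Valence",
    color="cluster:N",
    tooltip=["Song Name", "Artist"],
)
```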