Week 10 Monday#

Announcements#

  • I have office hours at 11am today, next door in ALP 3610.

  • Videos and video quizzes due.

  • Worksheets 15 and 16 due Tuesday by 11:59pm.

  • There won’t be a “Quiz 7”.

  • General plan for lectures this week: about 20 minutes of lecture, then time to work on the course project.

Don’t expect conclusive results in the course project#

If you’re coming up with your own research question, as opposed to following a tutorial, you shouldn’t expect to get conclusive (maybe not even interesting) results… that’s just how research goes a lot of the time.

Today I want to use the Spotify dataset. I think that’s a great dataset for Exploratory Data Analysis (EDA). I’ve tried for many hours to find an interesting Machine Learning question we can answer using this dataset (like classification: “Predict if the artist is Taylor Swift or Billie Eilish” or regression: “Predict the number of streams of a song”), but have never gotten any convincing results.

For the course project, you can decide for yourself if you’d rather investigate your own question or if you’d rather investigate someone else’s question and be more confident that you will get conclusive results. Both are good options.

Preparing the Spotify dataset for K-means clustering#

General guiding question: If we perform K-means clustering on the Spotify dataset, do songs by the same artist tend to appear in the same cluster?

The first step is to choose which columns we will use. They need to be numeric, and they can’t have any missing values.

import altair as alt
import pandas as pd
from pandas.api.types import is_numeric_dtype

Remember that missing values in this dataset are represented by a blank space " " (unlike an empty string "", which pandas would already recognize as missing). We also drop the rows containing missing values. Sometimes that is too drastic (it would be on the Titanic dataset, for example), but here it only removes about 10 rows.

df = pd.read_csv("spotify_dataset.csv", na_values=" ").dropna(axis=0).copy()

Let’s choose which columns (aka features, aka input variables) we want to use for our clustering. They should all be numeric. (If you want to use a categorical feature, first use OneHotEncoder.)
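In case you’re curious, here is a minimal sketch (not needed for today’s clustering) of how a categorical column like “Chord” could be one-hot encoded with scikit-learn. The sparse=False option is just so the result is an ordinary NumPy array rather than a sparse matrix.

from sklearn.preprocessing import OneHotEncoder

# Sketch only: turn the categorical "Chord" column into 0/1 indicator columns
encoder = OneHotEncoder(sparse=False)
chord_indicators = encoder.fit_transform(df[["Chord"]])
chord_indicators.shape  # one row per song, one column per distinct chord value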

# All numeric columns
num_cols = [c for c in df.columns if is_numeric_dtype(df[c])]
num_cols
['Index',
 'Highest Charting Position',
 'Number of Times Charted',
 'Artist Followers',
 'Popularity',
 'Danceability',
 'Energy',
 'Loudness',
 'Speechiness',
 'Acousticness',
 'Liveness',
 'Tempo',
 'Duration (ms)',
 'Valence']

Let’s use the columns from “Popularity” to the end. (I don’t want to use anything too closely related to the artist.)

I don’t think we’ve used the index method on a list before. Use the list method index to find the position at which “Popularity” occurs.

i = num_cols.index("Popularity")
i
4

Make a list cols which contains the numeric column names from “Popularity” to the end.

cols = num_cols[i:]
cols
['Popularity',
 'Danceability',
 'Energy',
 'Loudness',
 'Speechiness',
 'Acousticness',
 'Liveness',
 'Tempo',
 'Duration (ms)',
 'Valence']

Here is a reminder of what the data in this dataset looks like. Notice in particular how big the numbers are in the “Duration (ms)” column. It would be a bad idea to use clustering on this dataset without scaling.

df.head(3)
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord
0 1 1 8 2021-07-23--2021-07-30 Beggin' 48,633,449 Måneskin 3377762.0 3Wrjm47oTz2sjIgck11l5e ['indie rock italiano', 'italian pop'] ... 0.714 0.800 -4.808 0.0504 0.1270 0.3590 134.002 211560.0 0.589 B
1 2 2 3 2021-07-23--2021-07-30 STAY (with Justin Bieber) 47,248,719 The Kid LAROI 2230022.0 5HCyWlXZPP0y6Gqq8TgA20 ['australian hip hop'] ... 0.591 0.764 -5.484 0.0483 0.0383 0.1030 169.928 141806.0 0.478 C#/Db
2 3 1 11 2021-06-25--2021-07-02 good 4 u 40,162,559 Olivia Rodrigo 6266514.0 4ZtFanR9U6ndgddUvNcjcG ['pop'] ... 0.563 0.664 -5.044 0.1540 0.3350 0.0849 166.928 178147.0 0.688 A

3 rows × 23 columns

K-means clustering review#

We saw K-means clustering at the very beginning of the Machine Learning portion of Math 10. (It is the only example of unsupervised learning we have discussed.) Let’s review it now, and think about how it compares and contrasts with the supervised Machine Learning we have done (linear and polynomial regression, decision trees for classification and regression, random forests).

Remember that scaling is typically very important for K-means clustering. If two of your columns have the same unit (like money spent on rent and money spent on food), then maybe you don’t need to rescale those, but if the units are different (like money spent on rent and distance traveled to work), then I think it’s essential to rescale. (As far as I know, scaling has no impact on decision trees. Scaling before linear regression can be useful or not depending on the context: if you want to know which feature is most important, I think it makes sense to rescale, but if you want to interpret the precise coefficients that show up, it doesn’t.)
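To see concretely what StandardScaler does, here is a minimal sketch (using the df and cols from above): each column is shifted and rescaled so its mean is approximately 0 and its standard deviation is approximately 1, which keeps a large-valued column like “Duration (ms)” from dominating the distance computations.

from sklearn.preprocessing import StandardScaler

# Sketch: z-score each column, (x - mean)/std
scaler = StandardScaler()
scaled = scaler.fit_transform(df[cols])

scaled.mean(axis=0).round(2), scaled.std(axis=0).round(2)  # approximately all 0s and all 1s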

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

Recommended exercise: try to do what we’re doing below without using Pipeline. I think you’ll find that it involves much more typing. (A sketch of one approach appears after the Pipeline code below.)

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("kmeans", KMeans(n_clusters=6))
    ]
)
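In case you want to check your work on that recommended exercise, here is a minimal sketch of the equivalent scale-then-cluster steps without Pipeline (using the StandardScaler and KMeans imports from above). Notice that we have to keep track of the intermediate scaled array ourselves, and reuse the same fitted scaler whenever we predict.

# Sketch: the same steps without a Pipeline
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[cols])

kmeans = KMeans(n_clusters=6)
kmeans.fit(X_scaled)

# To predict, we must remember to rescale with the *same* fitted scaler
clusters = kmeans.predict(scaler.transform(df[cols]))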

Remember that we can’t fit on all the columns, because K-means clustering requires every column to be numeric.

pipe.fit(df)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_83/3268716185.py in <module>
----> 1 pipe.fit(df)

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    388         """
    389         fit_params_steps = self._check_fit_params(**fit_params)
--> 390         Xt = self._fit(X, y, **fit_params_steps)
    391         with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
    392             if self._final_estimator != "passthrough":

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    353                 message_clsname="Pipeline",
    354                 message=self._log_message(step_idx),
--> 355                 **fit_params_steps[name],
    356             )
    357             # Replace the transformer of the step with the fitted

/shared-libs/python3.7/py/lib/python3.7/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    347 
    348     def __call__(self, *args, **kwargs):
--> 349         return self.func(*args, **kwargs)
    350 
    351     def call_and_shelve(self, *args, **kwargs):

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    891     with _print_elapsed_time(message_clsname, message):
    892         if hasattr(transformer, "fit_transform"):
--> 893             res = transformer.fit_transform(X, y, **fit_params)
    894         else:
    895             res = transformer.fit(X, y, **fit_params).transform(X)

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    850         if y is None:
    851             # fit method of arity 1 (unsupervised transformation)
--> 852             return self.fit(X, **fit_params).transform(X)
    853         else:
    854             # fit method of arity 2 (supervised transformation)

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y, sample_weight)
    804         # Reset internal state before fitting
    805         self._reset()
--> 806         return self.partial_fit(X, y, sample_weight)
    807 
    808     def partial_fit(self, X, y=None, sample_weight=None):

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y, sample_weight)
    845             dtype=FLOAT_DTYPES,
    846             force_all_finite="allow-nan",
--> 847             reset=first_call,
    848         )
    849         n_features = X.shape[1]

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    564             raise ValueError("Validation should be done on X, y or both.")
    565         elif not no_val_X and no_val_y:
--> 566             X = check_array(X, **check_params)
    567             out = X
    568         elif no_val_X and not no_val_y:

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    744                     array = array.astype(dtype, casting="unsafe", copy=False)
    745                 else:
--> 746                     array = np.asarray(array, order=order, dtype=dtype)
    747             except ComplexWarning as complex_warning:
    748                 raise ValueError(

/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/generic.py in __array__(self, dtype)
   1897 
   1898     def __array__(self, dtype=None) -> np.ndarray:
-> 1899         return np.asarray(self._values, dtype=dtype)
   1900 
   1901     def __array_wrap__(

ValueError: could not convert string to float: '2021-07-23--2021-07-30'

Here we use a portion of the numeric columns. (Eventually I want to see how songs by a single artist get divided into different clusters, so I want to remove columns, especially “Artist Followers”, that are closely tied to the artist.)

pipe.fit(df[cols])
Pipeline(steps=[('scaler', StandardScaler()), ('kmeans', KMeans(n_clusters=6))])

Remember that K-means is an algorithm from unsupervised learning. That is reflected in our fit call: we gave only an input X, not any true labels y.
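As a quick sanity check (a sketch; named_steps and cluster_centers_ are standard scikit-learn attributes), we can pull the fitted KMeans object out of the pipeline and inspect its cluster centers, which live in the scaled feature space.

# The fitted KMeans object is stored inside the pipeline
kmeans = pipe.named_steps["kmeans"]
kmeans.cluster_centers_.shape  # (6, 10): 6 clusters, 10 scaled features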

Most frequent artists#

Which 25 artists appear most often in the dataset? Use value_counts, and take advantage of the fact that the results are sorted in decreasing order by default.

df["Artist"].value_counts()
Taylor Swift               52
Lil Uzi Vert               32
Justin Bieber              32
Juice WRLD                 30
Pop Smoke                  29
                           ..
Bad Bunny, Daddy Yankee     1
Capital Bra                 1
Normani                     1
Mora, Jhay Cortez           1
José Feliciano              1
Name: Artist, Length: 712, dtype: int64

I think we used this trick once before. We get a pandas Index (similar to a list) containing the top 25 artists.

top_artists = df["Artist"].value_counts().index[:25]
top_artists
Index(['Taylor Swift', 'Lil Uzi Vert', 'Justin Bieber', 'Juice WRLD',
       'Pop Smoke', 'BTS', 'Bad Bunny', 'Eminem', 'The Weeknd', 'Drake',
       'Billie Eilish', 'Ariana Grande', 'Selena Gomez', 'Doja Cat', 'J. Cole',
       'Dua Lipa', 'Tyler, The Creator', 'DaBaby', 'Lady Gaga', 'Kid Cudi',
       'Olivia Rodrigo', '21 Savage, Metro Boomin', 'Polo G', 'Mac Miller',
       'Lil Baby'],
      dtype='object')

Making an Altair chart#

I don’t know if we’ll have time, but let’s try to make a chart like this one from the Altair examples gallery so we can see which artists’ songs fall into which clusters. Only use the 25 most frequently occurring artists.

Here are the last 5 rows in the DataFrame.

df[-5:]
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord
1551 1552 195 1 2019-12-27--2020-01-03 New Rules 4,630,675 Dua Lipa 27167675.0 2ekn2ttSfGqwhhate0LSR0 ['dance pop', 'pop', 'uk pop'] ... 0.762 0.700 -6.021 0.0694 0.00261 0.1530 116.073 209320.0 0.608 A
1552 1553 196 1 2019-12-27--2020-01-03 Cheirosa - Ao Vivo 4,623,030 Jorge & Mateus 15019109.0 2PWjKmjyTZeDpmOUa3a5da ['sertanejo', 'sertanejo universitario'] ... 0.528 0.870 -3.123 0.0851 0.24000 0.3330 152.370 181930.0 0.714 B
1553 1554 197 1 2019-12-27--2020-01-03 Havana (feat. Young Thug) 4,620,876 Camila Cabello 22698747.0 1rfofaqEpACxVEHIZBJe6W ['dance pop', 'electropop', 'pop', 'post-teen ... ... 0.765 0.523 -4.333 0.0300 0.18400 0.1320 104.988 217307.0 0.394 D
1554 1555 198 1 2019-12-27--2020-01-03 Surtada - Remix Brega Funk 4,607,385 Dadá Boladão, Tati Zaqui, OIK 208630.0 5F8ffc8KWKNawllr5WsW0r ['brega funk', 'funk carioca'] ... 0.832 0.550 -7.026 0.0587 0.24900 0.1820 154.064 152784.0 0.881 F
1555 1556 199 1 2019-12-27--2020-01-03 Lover (Remix) [feat. Shawn Mendes] 4,595,450 Taylor Swift 42227614.0 3i9UVldZOE0aD0JnyfAZZ0 ['pop', 'post-teen pop'] ... 0.448 0.603 -7.176 0.0640 0.43300 0.0862 205.272 221307.0 0.422 G

5 rows × 23 columns

We will use the following Boolean Series (a pandas Series with True and False for its values) to perform Boolean indexing. Notice for example how there is a True in the rows with labels 1551 and 1555. Those artists (as we can see from the above slice) are Dua Lipa and Taylor Swift.

bool_ser = df["Artist"].isin(top_artists)
bool_ser
0       False
1       False
2        True
3       False
4       False
        ...  
1551     True
1552    False
1553    False
1554    False
1555     True
Name: Artist, Length: 1545, dtype: bool

Now we get the rows whose corresponding artist is among the top 25. We also use copy so we can add a column later without a SettingWithCopyWarning showing up. (Watch what happens in the first attempt below: we call predict on all of df, which produces 1545 values, too many for the 504 rows of df_top.)

df_top = df[bool_ser].copy()
df_top["pred"] = pipe.predict(df[cols])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_83/386168527.py in <module>
----> 1 df_top["pred"] = pipe.predict(df[cols])

/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   3161         else:
   3162             # set column
-> 3163             self._set_item(key, value)
   3164 
   3165     def _setitem_slice(self, key: slice, value):

/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/frame.py in _set_item(self, key, value)
   3240         """
   3241         self._ensure_valid_index(value)
-> 3242         value = self._sanitize_column(key, value)
   3243         NDFrame._set_item(self, key, value)
   3244 

/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
   3897 
   3898             # turn me into an ndarray
-> 3899             value = sanitize_index(value, self.index)
   3900             if not isinstance(value, (np.ndarray, Index)):
   3901                 if isinstance(value, list) and len(value) > 0:

/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/internals/construction.py in sanitize_index(data, index)
    750     if len(data) != len(index):
    751         raise ValueError(
--> 752             "Length of values "
    753             f"({len(data)}) "
    754             "does not match length of index "

ValueError: Length of values (1545) does not match length of index (504)

We now add a new column to df_top that contains the cluster numbers. Notice how we fit the clustering algorithm using every row in df, but we are only using predict on 504 of the rows. It’s essential to use the same columns, but there’s no need to use the same rows.

df_top["pred"] = pipe.predict(df_top[cols])

There is now a new column on the right side, containing the cluster values.

df_top
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord pred
2 3 1 11 2021-06-25--2021-07-02 good 4 u 40,162,559 Olivia Rodrigo 6266514.0 4ZtFanR9U6ndgddUvNcjcG ['pop'] ... 0.664 -5.044 0.1540 0.33500 0.0849 166.928 178147.0 0.688 A 1
6 7 3 16 2021-05-14--2021-05-21 Kiss Me More (feat. SZA) 29,356,736 Doja Cat 8640063.0 748mdHapucXQri7IAO8yFK ['dance pop', 'pop'] ... 0.701 -3.541 0.0286 0.23500 0.1230 110.968 208867.0 0.742 G#/Ab 1
8 9 3 8 2021-06-18--2021-06-25 Yonaguni 25,030,128 Bad Bunny 36142273.0 2JPLbjOn0wPCngEot2STUS ['latin', 'reggaeton', 'trap latino'] ... 0.648 -4.601 0.1180 0.27600 0.1350 179.951 206710.0 0.440 C#/Db 5
10 11 4 43 2021-05-07--2021-05-14 Levitating (feat. DaBaby) 23,518,010 Dua Lipa 27142474.0 463CkQjx2Zk1yXoBuierM9 ['dance pop', 'pop', 'uk pop'] ... 0.825 -3.787 0.0601 0.00883 0.0674 102.977 203064.0 0.915 F#/Gb 1
12 13 5 3 2021-07-09--2021-07-16 Permission to Dance 22,062,812 BTS 37106176.0 0LThjFY2iTtNdd4wviwVV2 ['k-pop', 'k-pop boy group'] ... 0.741 -5.330 0.0427 0.00544 0.3370 124.925 187585.0 0.646 A 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1529 1530 196 1 2020-01-10--2020-01-17 Kinda Crazy 4,866,777 Selena Gomez 28931149.0 59iGOjPSOcPLGl3vqEStUp ['dance pop', 'pop', 'post-teen pop'] ... 0.446 -10.304 0.0472 0.48400 0.1830 93.030 212436.0 0.534 B 2
1545 1546 128 1 2019-12-27--2020-01-03 Candy 5,632,102 Doja Cat 8671649.0 1VJwtWR6z7SpZRwipI12be ['dance pop', 'pop'] ... 0.516 -5.857 0.0444 0.51300 0.1630 124.876 190920.0 0.209 G#/Ab 5
1549 1550 187 1 2019-12-27--2020-01-03 Let Me Know (I Wonder Why Freestyle) 4,701,532 Juice WRLD 19102888.0 3wwo0bJvDSorOpNfzEkfXx ['chicago rap', 'melodic rap'] ... 0.537 -7.895 0.0832 0.17200 0.4180 125.028 215381.0 0.383 G 5
1551 1552 195 1 2019-12-27--2020-01-03 New Rules 4,630,675 Dua Lipa 27167675.0 2ekn2ttSfGqwhhate0LSR0 ['dance pop', 'pop', 'uk pop'] ... 0.700 -6.021 0.0694 0.00261 0.1530 116.073 209320.0 0.608 A 1
1555 1556 199 1 2019-12-27--2020-01-03 Lover (Remix) [feat. Shawn Mendes] 4,595,450 Taylor Swift 42227614.0 3i9UVldZOE0aD0JnyfAZZ0 ['pop', 'post-teen pop'] ... 0.603 -7.176 0.0640 0.43300 0.0862 205.272 221307.0 0.422 G 5

504 rows × 24 columns

Here is an adaptation of the Altair code from the above link. The stack="normalize" portion rescales each bar to the same total width, so we see proportions rather than raw counts.

I don’t know enough about these artists to see any clear interpretation of the results. Notice for example how the songs by Polo G and DaBaby are almost all in the same cluster (cluster 0), whereas no songs by Kid Cudi and Lady Gaga are in that cluster.

alt.Chart(df_top).mark_bar().encode(
    x=alt.X('count()', stack="normalize"),
    y='Artist',
    color='pred:N'
)
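If you want to inspect individual bars, one small variation (a sketch; the tooltip encoding is standard Altair) is to add a tooltip so that hovering shows the artist, the cluster, and the count of songs:

alt.Chart(df_top).mark_bar().encode(
    x=alt.X('count()', stack="normalize"),
    y='Artist',
    color='pred:N',
    tooltip=['Artist', 'pred:N', 'count()']
)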

Aside: Idea for extra component for your course project#

Random idea for an extra component for the project: use geojson data in Altair to make maps (randomly chosen reference). Here is a more systematic reference: the UW Visualization Curriculum. As far as I know, there aren’t yet any geojson examples in the Altair documentation (though there should be within the next few months).
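To give a sense of the syntax, here is a minimal sketch of a map in Altair, drawing US state boundaries from the vega_datasets sample data (the us_10m dataset and its "states" feature are standard Vega sample data, not anything from our Spotify work):

import altair as alt
from vega_datasets import data

# Load US state boundaries from a topojson file
states = alt.topo_feature(data.us_10m.url, feature="states")

alt.Chart(states).mark_geoshape(
    fill="lightgray",
    stroke="white"
).project("albersUsa").properties(
    width=500,
    height=300
)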