Week 10 Monday#

Announcements#

  • I have office hours at 11am today, next door in ALP 3610.

  • Videos and video quizzes due.

  • Worksheets 15 and 16 due Tuesday by 11:59pm.

  • There won’t be a “Quiz 7”.

  • General plan for lectures this week: about 20 minutes of lecture, then time to work on the course project.

Don’t expect conclusive results in the course project#

If you’re coming up with your own research question, as opposed to following a tutorial, you shouldn’t expect to get conclusive (maybe not even interesting) results… that’s just how research goes a lot of the time.

Today I want to use the Spotify dataset. I think that’s a great dataset for Exploratory Data Analysis (EDA). I’ve tried for many hours to find an interesting Machine Learning question we can answer using this dataset (like classification: “Predict if the artist is Taylor Swift or Billie Eilish” or regression: “Predict the number of streams of a song”), but have never gotten any convincing results.

For the course project, you can decide for yourself if you’d rather investigate your own question or if you’d rather investigate someone else’s question and be more confident that you will get conclusive results. Both are good options.

Preparing the Spotify dataset for K-means clustering#

General guiding question: If we perform K-means clustering on the Spotify dataset, do songs by the same artist tend to appear in the same cluster?

The first step is to choose which columns we will use. They need to be numeric, and they can’t have any missing values.

import altair as alt
import pandas as pd
from pandas.api.types import is_numeric_dtype

Remember that missing values in this dataset are represented by a blank space " " (unlike an empty string "", which pandas would already recognize as missing). We also drop the rows containing missing values. Sometimes that is too drastic (it would be on the Titanic dataset, for example), but here it only removes about 10 rows.

df = pd.read_csv("spotify_dataset.csv", na_values=" ").dropna(axis=0).copy()

Let’s choose which columns (aka features, aka input variables) we want to use for our clustering. They should all be numeric. (If you want to use a categorical feature, first use OneHotEncoder.)
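In case you’re curious, here is a minimal sketch (not needed for today’s clustering) of how a categorical column like “Chord” could be one-hot encoded with scikit-learn. The sparse=False option is just so the result is an ordinary NumPy array rather than a sparse matrix.

from sklearn.preprocessing import OneHotEncoder

# Sketch only: turn the categorical "Chord" column into 0/1 indicator columns
encoder = OneHotEncoder(sparse=False)
chord_indicators = encoder.fit_transform(df[["Chord"]])
chord_indicators.shape  # one row per song, one column per distinct chord value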

# All numeric columns
num_cols = [c for c in df.columns if is_numeric_dtype(df[c])]
num_cols
['Index',
 'Highest Charting Position',
 'Number of Times Charted',
 'Artist Followers',
 'Popularity',
 'Danceability',
 'Energy',
 'Loudness',
 'Speechiness',
 'Acousticness',
 'Liveness',
 'Tempo',
 'Duration (ms)',
 'Valence']

Let’s use the columns from “Popularity” to the end. (I don’t want to use anything too closely related to the artist.)

I don’t think we’ve used the index method on a list before. Use the list method index to find the position at which “Popularity” occurs.

i = num_cols.index("Popularity")
i
4

Make a list cols which contains the numeric column names from “Popularity” to the end.

cols = num_cols[i:]
cols
['Popularity',
 'Danceability',
 'Energy',
 'Loudness',
 'Speechiness',
 'Acousticness',
 'Liveness',
 'Tempo',
 'Duration (ms)',
 'Valence']

Here is a reminder of what the data in this dataset looks like. Notice in particular how big the numbers are in the “Duration (ms)” column. It would be a bad idea to use clustering on this dataset without scaling.

df.head(3)
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord
0 1 1 8 2021-07-23--2021-07-30 Beggin' 48,633,449 Måneskin 3377762.0 3Wrjm47oTz2sjIgck11l5e ['indie rock italiano', 'italian pop'] ... 0.714 0.800 -4.808 0.0504 0.1270 0.3590 134.002 211560.0 0.589 B
1 2 2 3 2021-07-23--2021-07-30 STAY (with Justin Bieber) 47,248,719 The Kid LAROI 2230022.0 5HCyWlXZPP0y6Gqq8TgA20 ['australian hip hop'] ... 0.591 0.764 -5.484 0.0483 0.0383 0.1030 169.928 141806.0 0.478 C#/Db
2 3 1 11 2021-06-25--2021-07-02 good 4 u 40,162,559 Olivia Rodrigo 6266514.0 4ZtFanR9U6ndgddUvNcjcG ['pop'] ... 0.563 0.664 -5.044 0.1540 0.3350 0.0849 166.928 178147.0 0.688 A

3 rows × 23 columns

K-means clustering review#

We saw K-means clustering at the very beginning of the Machine Learning portion of Math 10. (It is the only example of unsupervised learning we have discussed.) Let’s review it now, and think about how it compares and contrasts with the supervised Machine Learning we have done (linear and polynomial regression, decision trees for classification and regression, random forests).

Remember that scaling is typically very important for K-means clustering. If two of your columns have the same unit (like money spent on rent and money spent on food), then maybe you don’t need to rescale those, but if the units are different (like money spent on rent and distance traveled to work), then I think it’s essential to rescale. (As far as I know, scaling has no impact on decision trees. Scaling before linear regression can be useful or not depending on the context: if you want to know which feature is most important, I think it makes sense to rescale, but if you want to interpret the precise coefficients that show up, it doesn’t.)
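To see concretely what StandardScaler does, here is a minimal sketch (using the df and cols from above): each column is shifted and rescaled so its mean is approximately 0 and its standard deviation is approximately 1, which keeps a large-valued column like “Duration (ms)” from dominating the distance computations.

from sklearn.preprocessing import StandardScaler

# Sketch: z-score each column, (x - mean)/std
scaler = StandardScaler()
scaled = scaler.fit_transform(df[cols])

scaled.mean(axis=0).round(2), scaled.std(axis=0).round(2)  # approximately all 0s and all 1s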

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

Recommended exercise: try to do what we’re doing below without using Pipeline. I think you’ll find that it involves much more typing. (A sketch of one approach appears after the Pipeline code below.)

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("kmeans", KMeans(n_clusters=6))
    ]
)
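In case you want to check your work on that recommended exercise, here is a minimal sketch of the equivalent scale-then-cluster steps without Pipeline (using the StandardScaler and KMeans imports from above). Notice that we have to keep track of the intermediate scaled array ourselves, and reuse the same fitted scaler whenever we predict.

# Sketch: the same steps without a Pipeline
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[cols])

kmeans = KMeans(n_clusters=6)
kmeans.fit(X_scaled)

# To predict, we must remember to rescale with the *same* fitted scaler
clusters = kmeans.predict(scaler.transform(df[cols]))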

Remember that we can’t fit on all the columns, because K-means clustering requires every column to be numeric.

pipe.fit(df)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_83/3268716185.py in <module>
----> 1 pipe.fit(df)

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    388         """
    389         fit_params_steps = self._check_fit_params(**fit_params)
--> 390         Xt = self._fit(X, y, **fit_params_steps)
    391         with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
    392             if self._final_estimator != "passthrough":

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    353                 message_clsname="Pipeline",
    354                 message=self._log_message(step_idx),
--> 355                 **fit_params_steps[name],
    356             )
    357             # Replace the transformer of the step with the fitted

/shared-libs/python3.7/py/lib/python3.7/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    347 
    348     def __call__(self, *args, **kwargs):
--> 349         return self.func(*args, **kwargs)
    350 
    351     def call_and_shelve(self, *args, **kwargs):

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    891     with _print_elapsed_time(message_clsname, message):
    892         if hasattr(transformer, "fit_transform"):
--> 893             res = transformer.fit_transform(X, y, **fit_params)
    894         else:
    895             res = transformer.fit(X, y, **fit_params).transform(X)

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    850         if y is None:
    851             # fit method of arity 1 (unsupervised transformation)
--> 852             return self.fit(X, **fit_params).transform(X)
    853         else:
    854             # fit method of arity 2 (supervised transformation)

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y, sample_weight)
    804         # Reset internal state before fitting
    805         self._reset()
--> 806         return self.partial_fit(X, y, sample_weight)
    807 
    808     def partial_fit(self, X, y=None, sample_weight=None):

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y, sample_weight)
    845             dtype=FLOAT_DTYPES,
    846             force_all_finite="allow-nan",
--> 847             reset=first_call,
    848         )
    849         n_features = X.shape[1]

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    564             raise ValueError("Validation should be done on X, y or both.")
    565         elif not no_val_X and no_val_y:
--> 566             X = check_array(X, **check_params)
    567             out = X
    568         elif no_val_X and not no_val_y:

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    744                     array = array.astype(dtype, casting="unsafe", copy=False)
    745                 else:
--> 746                     array = np.asarray(array, order=order, dtype=dtype)
    747             except ComplexWarning as complex_warning:
    748                 raise ValueError(

/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/generic.py in __array__(self, dtype)
   1897 
   1898     def __array__(self, dtype=None) -> np.ndarray:
-> 1899         return np.asarray(self._values, dtype=dtype)
   1900 
   1901     def __array_wrap__(

ValueError: could not convert string to float: '2021-07-23--2021-07-30'

Here we use a portion of the numeric columns. (Eventually I want to see how songs by a single artist get divided into different clusters, so I want to remove columns, especially “Artist Followers”, that are closely tied to the artist.)

pipe.fit(df[cols])
Pipeline(steps=[('scaler', StandardScaler()), ('kmeans', KMeans(n_clusters=6))])

Remember that K-means is an algorithm from unsupervised learning. That is reflected in our fit call: we gave only an input X, not any true labels y.
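As a quick sanity check (a sketch; named_steps and cluster_centers_ are standard scikit-learn attributes), we can pull the fitted KMeans object out of the pipeline and inspect its cluster centers, which live in the scaled feature space.

# The fitted KMeans object is stored inside the pipeline
kmeans = pipe.named_steps["kmeans"]
kmeans.cluster_centers_.shape  # (6, 10): 6 clusters, 10 scaled features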

Most frequent artists#

Which 25 artists appear most often in the dataset? Use value_counts, and take advantage of the fact that the results are sorted in decreasing order by default.

df["Artist"].value_counts()
Taylor Swift               52
Lil Uzi Vert               32
Justin Bieber              32
Juice WRLD                 30
Pop Smoke                  29
                           ..
Bad Bunny, Daddy Yankee     1
Capital Bra                 1
Normani                     1
Mora, Jhay Cortez           1
José Feliciano              1
Name: Artist, Length: 712, dtype: int64

I think we used this trick once before. We get a pandas Index (similar to a list) containing the top 25 artists.

top_artists = df["Artist"].value_counts().index[:25]
top_artists
Index(['Taylor Swift', 'Lil Uzi Vert', 'Justin Bieber', 'Juice WRLD',
       'Pop Smoke', 'BTS', 'Bad Bunny', 'Eminem', 'The Weeknd', 'Drake',
       'Billie Eilish', 'Ariana Grande', 'Selena Gomez', 'Doja Cat', 'J. Cole',
       'Dua Lipa', 'Tyler, The Creator', 'DaBaby', 'Lady Gaga', 'Kid Cudi',
       'Olivia Rodrigo', '21 Savage, Metro Boomin', 'Polo G', 'Mac Miller',
       'Lil Baby'],
      dtype='object')

Making an Altair chart#

I don’t know if we’ll have time, but let’s try to make a chart like this one from the Altair examples gallery so we can see which artists’ songs fall into which clusters. Only use the 25 most frequently occurring artists.

Here are the last 5 rows in the DataFrame.

df[-5:]
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord
1551 1552 195 1 2019-12-27--2020-01-03 New Rules 4,630,675 Dua Lipa 27167675.0 2ekn2ttSfGqwhhate0LSR0 ['dance pop', 'pop', 'uk pop'] ... 0.762 0.700 -6.021 0.0694 0.00261 0.1530 116.073 209320.0 0.608 A
1552 1553 196 1 2019-12-27--2020-01-03 Cheirosa - Ao Vivo 4,623,030 Jorge & Mateus 15019109.0 2PWjKmjyTZeDpmOUa3a5da ['sertanejo', 'sertanejo universitario'] ... 0.528 0.870 -3.123 0.0851 0.24000 0.3330 152.370 181930.0 0.714 B
1553 1554 197 1 2019-12-27--2020-01-03 Havana (feat. Young Thug) 4,620,876 Camila Cabello 22698747.0 1rfofaqEpACxVEHIZBJe6W ['dance pop', 'electropop', 'pop', 'post-teen ... ... 0.765 0.523 -4.333 0.0300 0.18400 0.1320 104.988 217307.0 0.394 D
1554 1555 198 1 2019-12-27--2020-01-03 Surtada - Remix Brega Funk 4,607,385 Dadá Boladão, Tati Zaqui, OIK 208630.0 5F8ffc8KWKNawllr5WsW0r ['brega funk', 'funk carioca'] ... 0.832 0.550 -7.026 0.0587 0.24900 0.1820 154.064 152784.0 0.881 F
1555 1556 199 1 2019-12-27--2020-01-03 Lover (Remix) [feat. Shawn Mendes] 4,595,450 Taylor Swift 42227614.0 3i9UVldZOE0aD0JnyfAZZ0 ['pop', 'post-teen pop'] ... 0.448 0.603 -7.176 0.0640 0.43300 0.0862 205.272 221307.0 0.422 G

5 rows × 23 columns

We will use the following Boolean Series (a pandas Series with True and False for its values) to perform Boolean indexing. Notice for example how there is a True in the rows with labels 1551 and 1555. Those artists (as we can see from the above slice) are Dua Lipa and Taylor Swift.

bool_ser = df["Artist"].isin(top_artists)
bool_ser
0       False
1       False
2        True
3       False
4       False
        ...  
1551     True
1552    False
1553    False
1554    False
1555     True
Name: Artist, Length: 1545, dtype: bool

Now we get the rows whose corresponding artist is among the top 25. We also use copy so we can add a column later without a SettingWithCopyWarning showing up. (Watch what happens in the first attempt below: we call predict on all of df, which produces 1545 values, too many for the 504 rows of df_top.)

df_top = df[bool_ser].copy()
df_top["pred"] = pipe.predict(df[cols])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_83/386168527.py in <module>
----> 1 df_top["pred"] = pipe.predict(df[cols])

/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   3161         else:
   3162             # set column
-> 3163             self._set_item(key, value)
   3164 
   3165     def _setitem_slice(self, key: slice, value):

/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/frame.py in _set_item(self, key, value)
   3240         """
   3241         self._ensure_valid_index(value)
-> 3242         value = self._sanitize_column(key, value)
   3243         NDFrame._set_item(self, key, value)
   3244 

/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
   3897 
   3898             # turn me into an ndarray
-> 3899             value = sanitize_index(value, self.index)
   3900             if not isinstance(value, (np.ndarray, Index)):
   3901                 if isinstance(value, list) and len(value) > 0:

/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/internals/construction.py in sanitize_index(data, index)
    750     if len(data) != len(index):
    751         raise ValueError(
--> 752             "Length of values "
    753             f"({len(data)}) "
    754             "does not match length of index "

ValueError: Length of values (1545) does not match length of index (504)

We now add a new column to df_top that contains the cluster numbers. Notice how we fit the clustering algorithm using every row in df, but we are only using predict on 504 of the rows. It’s essential to use the same columns, but there’s no need to use the same rows.

df_top["pred"] = pipe.predict(df_top[cols])

There is now a new column on the right side, containing the cluster values.

df_top
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord pred
2 3 1 11 2021-06-25--2021-07-02 good 4 u 40,162,559 Olivia Rodrigo 6266514.0 4ZtFanR9U6ndgddUvNcjcG ['pop'] ... 0.664 -5.044 0.1540 0.33500 0.0849 166.928 178147.0 0.688 A 1
6 7 3 16 2021-05-14--2021-05-21 Kiss Me More (feat. SZA) 29,356,736 Doja Cat 8640063.0 748mdHapucXQri7IAO8yFK ['dance pop', 'pop'] ... 0.701 -3.541 0.0286 0.23500 0.1230 110.968 208867.0 0.742 G#/Ab 1
8 9 3 8 2021-06-18--2021-06-25 Yonaguni 25,030,128 Bad Bunny 36142273.0 2JPLbjOn0wPCngEot2STUS ['latin', 'reggaeton', 'trap latino'] ... 0.648 -4.601 0.1180 0.27600 0.1350 179.951 206710.0 0.440 C#/Db 5
10 11 4 43 2021-05-07--2021-05-14 Levitating (feat. DaBaby) 23,518,010 Dua Lipa 27142474.0 463CkQjx2Zk1yXoBuierM9 ['dance pop', 'pop', 'uk pop'] ... 0.825 -3.787 0.0601 0.00883 0.0674 102.977 203064.0 0.915 F#/Gb 1
12 13 5 3 2021-07-09--2021-07-16 Permission to Dance 22,062,812 BTS 37106176.0 0LThjFY2iTtNdd4wviwVV2 ['k-pop', 'k-pop boy group'] ... 0.741 -5.330 0.0427 0.00544 0.3370 124.925 187585.0 0.646 A 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1529 1530 196 1 2020-01-10--2020-01-17 Kinda Crazy 4,866,777 Selena Gomez 28931149.0 59iGOjPSOcPLGl3vqEStUp ['dance pop', 'pop', 'post-teen pop'] ... 0.446 -10.304 0.0472 0.48400 0.1830 93.030 212436.0 0.534 B 2
1545 1546 128 1 2019-12-27--2020-01-03 Candy 5,632,102 Doja Cat 8671649.0 1VJwtWR6z7SpZRwipI12be ['dance pop', 'pop'] ... 0.516 -5.857 0.0444 0.51300 0.1630 124.876 190920.0 0.209 G#/Ab 5
1549 1550 187 1 2019-12-27--2020-01-03 Let Me Know (I Wonder Why Freestyle) 4,701,532 Juice WRLD 19102888.0 3wwo0bJvDSorOpNfzEkfXx ['chicago rap', 'melodic rap'] ... 0.537 -7.895 0.0832 0.17200 0.4180 125.028 215381.0 0.383 G 5
1551 1552 195 1 2019-12-27--2020-01-03 New Rules 4,630,675 Dua Lipa 27167675.0 2ekn2ttSfGqwhhate0LSR0 ['dance pop', 'pop', 'uk pop'] ... 0.700 -6.021 0.0694 0.00261 0.1530 116.073 209320.0 0.608 A 1
1555 1556 199 1 2019-12-27--2020-01-03 Lover (Remix) [feat. Shawn Mendes] 4,595,450 Taylor Swift 42227614.0 3i9UVldZOE0aD0JnyfAZZ0 ['pop', 'post-teen pop'] ... 0.603 -7.176 0.0640 0.43300 0.0862 205.272 221307.0 0.422 G 5

504 rows × 24 columns

Here is an adaptation of the Altair code from the above link. The stack="normalize" portion rescales each bar to the same total width, so we see proportions rather than raw counts.

I don’t know enough about these artists to see any clear interpretation of the results. Notice for example how the songs by Polo G and DaBaby are almost all in the same cluster (cluster 0), whereas no songs by Kid Cudi and Lady Gaga are in that cluster.

alt.Chart(df_top).mark_bar().encode(
    x=alt.X('count()', stack="normalize"),
    y='Artist',
    color='pred:N'
)
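If you want to inspect individual bars, one small variation (a sketch; the tooltip encoding is standard Altair) is to add a tooltip so that hovering shows the artist, the cluster, and the count of songs:

alt.Chart(df_top).mark_bar().encode(
    x=alt.X('count()', stack="normalize"),
    y='Artist',
    color='pred:N',
    tooltip=['Artist', 'pred:N', 'count()']
)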

Aside: Idea for extra component for your course project#

Random idea for an extra component for the project: use geojson data in Altair to make maps (randomly chosen reference). Here is a more systematic reference: the UW Visualization Curriculum. As far as I know, there aren’t yet any geojson examples in the Altair documentation (though there should be within the next few months).
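To give a sense of the syntax, here is a minimal sketch of a map in Altair, drawing US state boundaries from the vega_datasets sample data (the us_10m dataset and its "states" feature are standard Vega sample data, not anything from our Spotify work):

import altair as alt
from vega_datasets import data

# Load US state boundaries from a topojson file
states = alt.topo_feature(data.us_10m.url, feature="states")

alt.Chart(states).mark_geoshape(
    fill="lightgray",
    stroke="white"
).project("albersUsa").properties(
    width=500,
    height=300
)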