Week 10 Monday#
Announcements#
I have office hours at 11am today, next door in ALP 3610.
Videos and video quizzes due.
Worksheets 15 and 16 due Tuesday by 11:59pm.
There won’t be a “Quiz 7”.
General plan for lectures this week: About 20 minutes lecture, then time to work on the course project.
Don’t expect conclusive results in the course project#
If you’re coming up with your own research question, as opposed to following a tutorial, you shouldn’t expect to get conclusive (maybe not even interesting) results… that’s just how research goes a lot of the time.
Today I want to use the Spotify dataset. I think that’s a great dataset for Exploratory Data Analysis (EDA). I’ve tried for many hours to find an interesting Machine Learning question we can answer using this dataset (like classification: “Predict if the artist is Taylor Swift or Billie Eilish” or regression: “Predict the number of streams of a song”), but have never gotten any convincing results.
For the course project, you can decide for yourself if you’d rather investigate your own question or if you’d rather investigate someone else’s question and be more confident that you will get conclusive results. Both are good options.
Preparing the Spotify dataset for K-means clustering#
General guiding question: If we perform K-means clustering on the Spotify dataset, do songs by the same artist tend to appear in the same cluster?
The first step is to choose what columns we will use. They need to be numeric columns and they need to not have any missing values.
import altair as alt
import pandas as pd
from pandas.api.types import is_numeric_dtype
Remember that missing values in this dataset are represented by a blank space `" "` (not an empty string `""`, which would be fine). Here we also drop the rows containing missing values. Sometimes that is too drastic (like on the Titanic dataset), but here it only removes about 10 rows.
df = pd.read_csv("spotify_dataset.csv", na_values=" ").dropna(axis=0).copy()
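To see what the `na_values=" "` convention does, here is a minimal sketch on a tiny made-up CSV (the artist names and numbers are invented for illustration): a cell containing a single blank space becomes `NaN`, and `dropna(axis=0)` then removes that row.

```python
import io
import pandas as pd

# Hypothetical mini-CSV: two cells contain a single blank space " "
csv_text = "Artist,Streams\nTaylor Swift,100\n ,200\nDrake, \n"

# With na_values=" ", those blank-space cells are read as missing
df_demo = pd.read_csv(io.StringIO(csv_text), na_values=" ")

print(df_demo.isna().sum().sum())   # 2 missing cells
print(len(df_demo.dropna(axis=0)))  # 1 fully non-missing row remains
```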
Let’s choose which columns (aka features, aka input variables) we want to use for our clustering. They should all be numeric. (If you want to use a categorical feature, first use `OneHotEncoder`.)
# All numeric columns
num_cols = [c for c in df.columns if is_numeric_dtype(df[c])]
num_cols
['Index',
'Highest Charting Position',
'Number of Times Charted',
'Artist Followers',
'Popularity',
'Danceability',
'Energy',
'Loudness',
'Speechiness',
'Acousticness',
'Liveness',
'Tempo',
'Duration (ms)',
'Valence']
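As an aside, pandas also offers `select_dtypes`, which accomplishes the same thing as the list comprehension without importing `is_numeric_dtype`. Here is a sketch on a tiny made-up DataFrame standing in for `df`:

```python
import pandas as pd

# Hypothetical mini-DataFrame with one non-numeric and two numeric columns
df_demo = pd.DataFrame({
    "Artist": ["A", "B"],     # object dtype, excluded
    "Popularity": [90, 85],   # numeric, kept
    "Tempo": [120.0, 98.5],   # numeric, kept
})

# select_dtypes("number") keeps exactly the numeric columns
num_cols_demo = list(df_demo.select_dtypes("number").columns)
print(num_cols_demo)  # ['Popularity', 'Tempo']
```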
Let’s use the columns from “Popularity” to the end. (I don’t want to use anything related too closely to the artist.)
I don’t think we’ve used the `index` method on a list before. Use it to find the position at which “Popularity” occurs.
i = num_cols.index("Popularity")
i
4
Make a list `cols` which contains the numeric column names from “Popularity” to the end.
cols = num_cols[i:]
cols
['Popularity',
'Danceability',
'Energy',
'Loudness',
'Speechiness',
'Acousticness',
'Liveness',
'Tempo',
'Duration (ms)',
'Valence']
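The same `.index`-then-slice pattern works on any plain list. A tiny made-up example (the list here is hypothetical, just a few column names for illustration):

```python
# A hypothetical list of column names
features = ["Index", "Artist Followers", "Popularity", "Tempo", "Valence"]

# .index returns the position of the first occurrence of the value
i = features.index("Popularity")
print(i)             # 2

# Slicing from that position gives everything from "Popularity" onward
print(features[i:])  # ['Popularity', 'Tempo', 'Valence']
```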
Here is a reminder of what the data in this dataset looks like. Notice in particular how big the numbers are in the “Duration (ms)” column. It would be a bad idea to use clustering on this dataset without scaling.
df.head(3)
Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48,633,449 | Måneskin | 3377762.0 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.714 | 0.800 | -4.808 | 0.0504 | 0.1270 | 0.3590 | 134.002 | 211560.0 | 0.589 | B |
1 | 2 | 2 | 3 | 2021-07-23--2021-07-30 | STAY (with Justin Bieber) | 47,248,719 | The Kid LAROI | 2230022.0 | 5HCyWlXZPP0y6Gqq8TgA20 | ['australian hip hop'] | ... | 0.591 | 0.764 | -5.484 | 0.0483 | 0.0383 | 0.1030 | 169.928 | 141806.0 | 0.478 | C#/Db |
2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40,162,559 | Olivia Rodrigo | 6266514.0 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.563 | 0.664 | -5.044 | 0.1540 | 0.3350 | 0.0849 | 166.928 | 178147.0 | 0.688 | A |
3 rows × 23 columns
K-means clustering review#
We saw K-means clustering at the very beginning of the Machine Learning portion of Math 10. (It is the only example of unsupervised learning we have discussed.) Let’s review K-means clustering. Think about how it compares and contrasts to the supervised Machine Learning we have done (linear and polynomial regression, decision trees for classification and regression, random forests).
Remember that scaling is typically very important for K-means clustering. If two of your columns have the same units (like money spent on rent and money spent on food), then maybe you don’t need to rescale them, but if the units are different (like money spent on rent and distance traveled to work), then I think it’s essential to rescale. (I don’t think scaling has any impact on decision trees. Scaling before linear regression can be useful or not depending on the context: if you want to know which feature is the most important, I think it makes sense to rescale; if you want to be able to interpret the precise coefficients that show up, then I don’t think it makes sense.)
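To see concretely what `StandardScaler` does, here is a sketch on made-up numbers chosen to mimic two columns on very different scales (roughly “Danceability” and “Duration (ms)”). After scaling, each column has mean 0 and standard deviation 1, so neither column dominates the Euclidean distances that K-means uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one column near 0-1, one in the hundreds of thousands
X = np.array([[0.7, 211560.0],
              [0.5, 141806.0],
              [0.6, 178147.0]])

# StandardScaler subtracts each column's mean and divides by its std
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0).round(10))  # [0. 0.]
print(X_scaled.std(axis=0).round(10))   # [1. 1.]
```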
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
Recommended exercise: try to do what we’re doing below without using `Pipeline`. I think you’ll find that it involves much more typing.
pipe = Pipeline(
[
("scaler", StandardScaler()),
("kmeans", KMeans(n_clusters=6))
]
)
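For comparison, here is a rough sketch of the same steps done by hand, without `Pipeline`, on synthetic stand-in data (the array `X` below is random, playing the role of `df[cols]`). Notice that you have to keep track of the fitted scaler yourself and remember to reuse it at prediction time.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic stand-in for df[cols]: 20 rows, 3 features on different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3)) * [1, 100, 10000]

# Without Pipeline, scale and cluster as separate manual steps
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=6)
kmeans.fit(X_scaled)

# At predict time, we must transform with the *same fitted* scaler
clusters = kmeans.predict(scaler.transform(X))
print(len(clusters))  # 20
```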
Remember that we can’t use all the columns, because K-means clustering requires numerical values in all the columns.
pipe.fit(df)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_83/3268716185.py in <module>
----> 1 pipe.fit(df)
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
388 """
389 fit_params_steps = self._check_fit_params(**fit_params)
--> 390 Xt = self._fit(X, y, **fit_params_steps)
391 with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
392 if self._final_estimator != "passthrough":
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
353 message_clsname="Pipeline",
354 message=self._log_message(step_idx),
--> 355 **fit_params_steps[name],
356 )
357 # Replace the transformer of the step with the fitted
/shared-libs/python3.7/py/lib/python3.7/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
347
348 def __call__(self, *args, **kwargs):
--> 349 return self.func(*args, **kwargs)
350
351 def call_and_shelve(self, *args, **kwargs):
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
891 with _print_elapsed_time(message_clsname, message):
892 if hasattr(transformer, "fit_transform"):
--> 893 res = transformer.fit_transform(X, y, **fit_params)
894 else:
895 res = transformer.fit(X, y, **fit_params).transform(X)
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
850 if y is None:
851 # fit method of arity 1 (unsupervised transformation)
--> 852 return self.fit(X, **fit_params).transform(X)
853 else:
854 # fit method of arity 2 (supervised transformation)
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y, sample_weight)
804 # Reset internal state before fitting
805 self._reset()
--> 806 return self.partial_fit(X, y, sample_weight)
807
808 def partial_fit(self, X, y=None, sample_weight=None):
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y, sample_weight)
845 dtype=FLOAT_DTYPES,
846 force_all_finite="allow-nan",
--> 847 reset=first_call,
848 )
849 n_features = X.shape[1]
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
564 raise ValueError("Validation should be done on X, y or both.")
565 elif not no_val_X and no_val_y:
--> 566 X = check_array(X, **check_params)
567 out = X
568 elif no_val_X and not no_val_y:
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
744 array = array.astype(dtype, casting="unsafe", copy=False)
745 else:
--> 746 array = np.asarray(array, order=order, dtype=dtype)
747 except ComplexWarning as complex_warning:
748 raise ValueError(
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/generic.py in __array__(self, dtype)
1897
1898 def __array__(self, dtype=None) -> np.ndarray:
-> 1899 return np.asarray(self._values, dtype=dtype)
1900
1901 def __array_wrap__(
ValueError: could not convert string to float: '2021-07-23--2021-07-30'
Here we use a portion of the numeric columns. (Eventually I want to see how songs by a single artist get divided into different clusters, so I want to remove columns like especially “Artist Followers” which will be extremely connected to the artist.)
pipe.fit(df[cols])
Pipeline(steps=[('scaler', StandardScaler()), ('kmeans', KMeans(n_clusters=6))])
Remember that K-means is an algorithm from unsupervised learning. That corresponds to the fact that in our `fit` call we only provided an input `X`, not any true labels `y`.
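A sketch of what just happened, on synthetic stand-in data (random numbers playing the role of `df[cols]`): `fit` takes only `X`, and afterward the cluster assigned to each fitting row is available on the `KMeans` step via `named_steps`.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic stand-in for df[cols]: 30 rows, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

pipe_demo = Pipeline([("scaler", StandardScaler()),
                      ("kmeans", KMeans(n_clusters=3))])
pipe_demo.fit(X)  # unsupervised: X only, no y

# One cluster label per fitting row, stored on the fitted KMeans step
labels = pipe_demo.named_steps["kmeans"].labels_
print(labels.shape)  # (30,)
```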
Most frequent artists#
Which 25 artists appear most often in the dataset? Use `value_counts`, and take advantage of the fact that by default the results are sorted in decreasing order of frequency.
df["Artist"].value_counts()
Taylor Swift 52
Lil Uzi Vert 32
Justin Bieber 32
Juice WRLD 30
Pop Smoke 29
..
Bad Bunny, Daddy Yankee 1
Capital Bra 1
Normani 1
Mora, Jhay Cortez 1
José Feliciano 1
Name: Artist, Length: 712, dtype: int64
I think we used this trick once before. We get a pandas Index (similar to a list) containing the top 25 artists.
top_artists = df["Artist"].value_counts().index[:25]
top_artists
Index(['Taylor Swift', 'Lil Uzi Vert', 'Justin Bieber', 'Juice WRLD',
'Pop Smoke', 'BTS', 'Bad Bunny', 'Eminem', 'The Weeknd', 'Drake',
'Billie Eilish', 'Ariana Grande', 'Selena Gomez', 'Doja Cat', 'J. Cole',
'Dua Lipa', 'Tyler, The Creator', 'DaBaby', 'Lady Gaga', 'Kid Cudi',
'Olivia Rodrigo', '21 Savage, Metro Boomin', 'Polo G', 'Mac Miller',
'Lil Baby'],
dtype='object')
Making an Altair chart#
I don’t know if we’ll have time, but let’s try to make a chart like this one from the Altair examples gallery so we can see which artists’ songs fall into which clusters. Only use the 25 most frequently occurring artists.
Here are the last 5 rows in the DataFrame.
df[-5:]
Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1551 | 1552 | 195 | 1 | 2019-12-27--2020-01-03 | New Rules | 4,630,675 | Dua Lipa | 27167675.0 | 2ekn2ttSfGqwhhate0LSR0 | ['dance pop', 'pop', 'uk pop'] | ... | 0.762 | 0.700 | -6.021 | 0.0694 | 0.00261 | 0.1530 | 116.073 | 209320.0 | 0.608 | A |
1552 | 1553 | 196 | 1 | 2019-12-27--2020-01-03 | Cheirosa - Ao Vivo | 4,623,030 | Jorge & Mateus | 15019109.0 | 2PWjKmjyTZeDpmOUa3a5da | ['sertanejo', 'sertanejo universitario'] | ... | 0.528 | 0.870 | -3.123 | 0.0851 | 0.24000 | 0.3330 | 152.370 | 181930.0 | 0.714 | B |
1553 | 1554 | 197 | 1 | 2019-12-27--2020-01-03 | Havana (feat. Young Thug) | 4,620,876 | Camila Cabello | 22698747.0 | 1rfofaqEpACxVEHIZBJe6W | ['dance pop', 'electropop', 'pop', 'post-teen ... | ... | 0.765 | 0.523 | -4.333 | 0.0300 | 0.18400 | 0.1320 | 104.988 | 217307.0 | 0.394 | D |
1554 | 1555 | 198 | 1 | 2019-12-27--2020-01-03 | Surtada - Remix Brega Funk | 4,607,385 | Dadá Boladão, Tati Zaqui, OIK | 208630.0 | 5F8ffc8KWKNawllr5WsW0r | ['brega funk', 'funk carioca'] | ... | 0.832 | 0.550 | -7.026 | 0.0587 | 0.24900 | 0.1820 | 154.064 | 152784.0 | 0.881 | F |
1555 | 1556 | 199 | 1 | 2019-12-27--2020-01-03 | Lover (Remix) [feat. Shawn Mendes] | 4,595,450 | Taylor Swift | 42227614.0 | 3i9UVldZOE0aD0JnyfAZZ0 | ['pop', 'post-teen pop'] | ... | 0.448 | 0.603 | -7.176 | 0.0640 | 0.43300 | 0.0862 | 205.272 | 221307.0 | 0.422 | G |
5 rows × 23 columns
We will use the following Boolean Series (a pandas Series with `True` and `False` for its values) to perform Boolean indexing. Notice for example how there is a `True` in the rows with labels 1551 and 1555. Those artists (as we can see from the above slice) are Dua Lipa and Taylor Swift.
bool_ser = df["Artist"].isin(top_artists)
bool_ser
0 False
1 False
2 True
3 False
4 False
...
1551 True
1552 False
1553 False
1554 False
1555 True
Name: Artist, Length: 1545, dtype: bool
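The `isin` pattern in miniature, on made-up values (the names below are just for illustration): it returns `True` exactly where a Series entry belongs to the given collection.

```python
import pandas as pd

# Hypothetical stand-ins for df["Artist"] and top_artists
artists = pd.Series(["Olivia Rodrigo", "Jorge & Mateus", "Taylor Swift"])
keep = pd.Index(["Taylor Swift", "Olivia Rodrigo"])

# True where the artist is in `keep`, False otherwise
mask = artists.isin(keep)
print(mask.tolist())  # [True, False, True]
```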
Now we get the rows whose corresponding artist is among the top 25. We also use `copy` so we can add a column to it later without any warnings showing up.
df_top = df[bool_ser].copy()
df_top["pred"] = pipe.predict(df[cols])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_83/386168527.py in <module>
----> 1 df_top["pred"] = pipe.predict(df[cols])
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
3161 else:
3162 # set column
-> 3163 self._set_item(key, value)
3164
3165 def _setitem_slice(self, key: slice, value):
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/frame.py in _set_item(self, key, value)
3240 """
3241 self._ensure_valid_index(value)
-> 3242 value = self._sanitize_column(key, value)
3243 NDFrame._set_item(self, key, value)
3244
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
3897
3898 # turn me into an ndarray
-> 3899 value = sanitize_index(value, self.index)
3900 if not isinstance(value, (np.ndarray, Index)):
3901 if isinstance(value, list) and len(value) > 0:
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/internals/construction.py in sanitize_index(data, index)
750 if len(data) != len(index):
751 raise ValueError(
--> 752 "Length of values "
753 f"({len(data)}) "
754 "does not match length of index "
ValueError: Length of values (1545) does not match length of index (504)
We now add a new column to `df_top` that contains the cluster numbers. Notice how we fit the clustering algorithm using every row in `df`, but we are only using `predict` on 504 of the rows. It’s essential to use the same columns, but there’s no need to use the same rows.
df_top["pred"] = pipe.predict(df_top[cols])
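The fit-on-all-rows, predict-on-a-subset idea in miniature, on synthetic stand-in data (random numbers playing the roles of `df[cols]` and `df_top[cols]`): as long as the columns match, `predict` happily accepts a different number of rows than `fit` saw.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_all = rng.normal(size=(50, 3))  # stand-in for df[cols]
X_sub = X_all[:10]                # stand-in for df_top[cols]: same columns, fewer rows

pipe_demo = Pipeline([("scaler", StandardScaler()),
                      ("kmeans", KMeans(n_clusters=4))])
pipe_demo.fit(X_all)              # fit using every row
preds = pipe_demo.predict(X_sub)  # predict on a subset: columns match, so this is fine
print(preds.shape)                # (10,)
```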
There is now a new column on the right side, containing the cluster values.
df_top
Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord | pred | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40,162,559 | Olivia Rodrigo | 6266514.0 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.664 | -5.044 | 0.1540 | 0.33500 | 0.0849 | 166.928 | 178147.0 | 0.688 | A | 1 |
6 | 7 | 3 | 16 | 2021-05-14--2021-05-21 | Kiss Me More (feat. SZA) | 29,356,736 | Doja Cat | 8640063.0 | 748mdHapucXQri7IAO8yFK | ['dance pop', 'pop'] | ... | 0.701 | -3.541 | 0.0286 | 0.23500 | 0.1230 | 110.968 | 208867.0 | 0.742 | G#/Ab | 1 |
8 | 9 | 3 | 8 | 2021-06-18--2021-06-25 | Yonaguni | 25,030,128 | Bad Bunny | 36142273.0 | 2JPLbjOn0wPCngEot2STUS | ['latin', 'reggaeton', 'trap latino'] | ... | 0.648 | -4.601 | 0.1180 | 0.27600 | 0.1350 | 179.951 | 206710.0 | 0.440 | C#/Db | 5 |
10 | 11 | 4 | 43 | 2021-05-07--2021-05-14 | Levitating (feat. DaBaby) | 23,518,010 | Dua Lipa | 27142474.0 | 463CkQjx2Zk1yXoBuierM9 | ['dance pop', 'pop', 'uk pop'] | ... | 0.825 | -3.787 | 0.0601 | 0.00883 | 0.0674 | 102.977 | 203064.0 | 0.915 | F#/Gb | 1 |
12 | 13 | 5 | 3 | 2021-07-09--2021-07-16 | Permission to Dance | 22,062,812 | BTS | 37106176.0 | 0LThjFY2iTtNdd4wviwVV2 | ['k-pop', 'k-pop boy group'] | ... | 0.741 | -5.330 | 0.0427 | 0.00544 | 0.3370 | 124.925 | 187585.0 | 0.646 | A | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1529 | 1530 | 196 | 1 | 2020-01-10--2020-01-17 | Kinda Crazy | 4,866,777 | Selena Gomez | 28931149.0 | 59iGOjPSOcPLGl3vqEStUp | ['dance pop', 'pop', 'post-teen pop'] | ... | 0.446 | -10.304 | 0.0472 | 0.48400 | 0.1830 | 93.030 | 212436.0 | 0.534 | B | 2 |
1545 | 1546 | 128 | 1 | 2019-12-27--2020-01-03 | Candy | 5,632,102 | Doja Cat | 8671649.0 | 1VJwtWR6z7SpZRwipI12be | ['dance pop', 'pop'] | ... | 0.516 | -5.857 | 0.0444 | 0.51300 | 0.1630 | 124.876 | 190920.0 | 0.209 | G#/Ab | 5 |
1549 | 1550 | 187 | 1 | 2019-12-27--2020-01-03 | Let Me Know (I Wonder Why Freestyle) | 4,701,532 | Juice WRLD | 19102888.0 | 3wwo0bJvDSorOpNfzEkfXx | ['chicago rap', 'melodic rap'] | ... | 0.537 | -7.895 | 0.0832 | 0.17200 | 0.4180 | 125.028 | 215381.0 | 0.383 | G | 5 |
1551 | 1552 | 195 | 1 | 2019-12-27--2020-01-03 | New Rules | 4,630,675 | Dua Lipa | 27167675.0 | 2ekn2ttSfGqwhhate0LSR0 | ['dance pop', 'pop', 'uk pop'] | ... | 0.700 | -6.021 | 0.0694 | 0.00261 | 0.1530 | 116.073 | 209320.0 | 0.608 | A | 1 |
1555 | 1556 | 199 | 1 | 2019-12-27--2020-01-03 | Lover (Remix) [feat. Shawn Mendes] | 4,595,450 | Taylor Swift | 42227614.0 | 3i9UVldZOE0aD0JnyfAZZ0 | ['pop', 'post-teen pop'] | ... | 0.603 | -7.176 | 0.0640 | 0.43300 | 0.0862 | 205.272 | 221307.0 | 0.422 | G | 5 |
504 rows × 24 columns
Here is the adaptation of the Altair code from the above link. The `stack="normalize"` portion makes each bar span the full width of the chart, so we see the proportion of each artist’s songs in each cluster rather than the raw counts.
I don’t know enough about these artists to see any clear interpretation of the results. Notice for example how the songs by Polo G and DaBaby are almost all in the same cluster (cluster 0), whereas no songs by Kid Cudi and Lady Gaga are in that cluster.
alt.Chart(df_top).mark_bar().encode(
x=alt.X('count()', stack="normalize"),
y='Artist',
color='pred:N'
)
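If you want the same information numerically rather than graphically, `pd.crosstab` tabulates how many of each artist’s songs landed in each cluster. A sketch on a made-up miniature of `df_top` (the names and cluster numbers below are invented for illustration):

```python
import pandas as pd

# Hypothetical miniature of df_top with just the two relevant columns
df_demo = pd.DataFrame({
    "Artist": ["Polo G", "Polo G", "Lady Gaga", "Kid Cudi"],
    "pred": [0, 0, 2, 1],
})

# Rows are artists, columns are cluster numbers, cells are song counts
table = pd.crosstab(df_demo["Artist"], df_demo["pred"])
print(table)
```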
Aside: Idea for extra component for your course project#
Random idea for an extra component for the project: use GeoJSON data in Altair to make maps (see this randomly chosen reference, or the more systematic UW Visualization Curriculum). As far as I know, there aren’t yet any GeoJSON examples in the Altair documentation, but there should be within the next few months.