Spotify Popular Songs Correlation#
Author: Daniel Lim, djlim2@uci.edu
Course Project, UC Irvine, Math 10, F22
Introduction#
This dataset contains a wide range of Spotify tracks, with columns describing the artist, the track's popularity, its duration, and so on. My goal is to see whether there is a correlation between a song's popularity and its genre, using audio features such as energy, valence, and danceability.
Cleaning Up Data#
We first need to clean up the data by dropping columns that won't be of use to us and removing rows with missing values.
import pandas as pd
import altair as alt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import mean_absolute_error
df = pd.read_csv('dataset.csv')
df
Unnamed: 0 | track_id | artists | album_name | track_name | popularity | duration_ms | explicit | danceability | energy | ... | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | track_genre | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 5SuOikwiRyPMVoIQDJUgSV | Gen Hoshino | Comedy | Comedy | 73 | 230666 | False | 0.676 | 0.4610 | ... | -6.746 | 0 | 0.1430 | 0.0322 | 0.000001 | 0.3580 | 0.7150 | 87.917 | 4 | acoustic |
1 | 1 | 4qPNDBW1i3p13qLCt0Ki3A | Ben Woodward | Ghost (Acoustic) | Ghost - Acoustic | 55 | 149610 | False | 0.420 | 0.1660 | ... | -17.235 | 1 | 0.0763 | 0.9240 | 0.000006 | 0.1010 | 0.2670 | 77.489 | 4 | acoustic |
2 | 2 | 1iJBSr7s7jYXzM8EGcbK5b | Ingrid Michaelson;ZAYN | To Begin Again | To Begin Again | 57 | 210826 | False | 0.438 | 0.3590 | ... | -9.734 | 1 | 0.0557 | 0.2100 | 0.000000 | 0.1170 | 0.1200 | 76.332 | 4 | acoustic |
3 | 3 | 6lfxq3CG4xtTiEg7opyCyx | Kina Grannis | Crazy Rich Asians (Original Motion Picture Sou... | Can't Help Falling In Love | 71 | 201933 | False | 0.266 | 0.0596 | ... | -18.515 | 1 | 0.0363 | 0.9050 | 0.000071 | 0.1320 | 0.1430 | 181.740 | 3 | acoustic |
4 | 4 | 5vjLSffimiIP26QG5WcN2K | Chord Overstreet | Hold On | Hold On | 82 | 198853 | False | 0.618 | 0.4430 | ... | -9.681 | 1 | 0.0526 | 0.4690 | 0.000000 | 0.0829 | 0.1670 | 119.949 | 4 | acoustic |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
113995 | 113995 | 2C3TZjDRiAzdyViavDJ217 | Rainy Lullaby | #mindfulness - Soft Rain for Mindful Meditatio... | Sleep My Little Boy | 21 | 384999 | False | 0.172 | 0.2350 | ... | -16.393 | 1 | 0.0422 | 0.6400 | 0.928000 | 0.0863 | 0.0339 | 125.995 | 5 | world-music |
113996 | 113996 | 1hIz5L4IB9hN3WRYPOCGPw | Rainy Lullaby | #mindfulness - Soft Rain for Mindful Meditatio... | Water Into Light | 22 | 385000 | False | 0.174 | 0.1170 | ... | -18.318 | 0 | 0.0401 | 0.9940 | 0.976000 | 0.1050 | 0.0350 | 85.239 | 4 | world-music |
113997 | 113997 | 6x8ZfSoqDjuNa5SVP5QjvX | Cesária Evora | Best Of | Miss Perfumado | 22 | 271466 | False | 0.629 | 0.3290 | ... | -10.895 | 0 | 0.0420 | 0.8670 | 0.000000 | 0.0839 | 0.7430 | 132.378 | 4 | world-music |
113998 | 113998 | 2e6sXL2bYv4bSz6VTdnfLs | Michael W. Smith | Change Your World | Friends | 41 | 283893 | False | 0.587 | 0.5060 | ... | -10.889 | 1 | 0.0297 | 0.3810 | 0.000000 | 0.2700 | 0.4130 | 135.960 | 4 | world-music |
113999 | 113999 | 2hETkH7cOfqmz3LqZDHZf5 | Cesária Evora | Miss Perfumado | Barbincor | 22 | 241826 | False | 0.526 | 0.4870 | ... | -10.204 | 0 | 0.0725 | 0.6810 | 0.000000 | 0.0893 | 0.7080 | 79.198 | 4 | world-music |
114000 rows × 21 columns
Since most of the columns we want to look at are already numeric, we don't need to convert them.
df.columns, df.dtypes
(Index(['Unnamed: 0', 'track_id', 'artists', 'album_name', 'track_name',
'popularity', 'duration_ms', 'explicit', 'danceability', 'energy',
'key', 'loudness', 'mode', 'speechiness', 'acousticness',
'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
'track_genre'],
dtype='object'),
Unnamed: 0 int64
track_id object
artists object
album_name object
track_name object
popularity int64
duration_ms int64
explicit bool
danceability float64
energy float64
key int64
loudness float64
mode int64
speechiness float64
acousticness float64
instrumentalness float64
liveness float64
valence float64
tempo float64
time_signature int64
track_genre object
dtype: object)
We drop these two columns at the start since they are just row and track identifiers and have nothing to do with the analysis.
df.drop(['Unnamed: 0', 'track_id'], axis=1, inplace=True)
df
artists | album_name | track_name | popularity | duration_ms | explicit | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | track_genre | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Gen Hoshino | Comedy | Comedy | 73 | 230666 | False | 0.676 | 0.4610 | 1 | -6.746 | 0 | 0.1430 | 0.0322 | 0.000001 | 0.3580 | 0.7150 | 87.917 | 4 | acoustic |
1 | Ben Woodward | Ghost (Acoustic) | Ghost - Acoustic | 55 | 149610 | False | 0.420 | 0.1660 | 1 | -17.235 | 1 | 0.0763 | 0.9240 | 0.000006 | 0.1010 | 0.2670 | 77.489 | 4 | acoustic |
2 | Ingrid Michaelson;ZAYN | To Begin Again | To Begin Again | 57 | 210826 | False | 0.438 | 0.3590 | 0 | -9.734 | 1 | 0.0557 | 0.2100 | 0.000000 | 0.1170 | 0.1200 | 76.332 | 4 | acoustic |
3 | Kina Grannis | Crazy Rich Asians (Original Motion Picture Sou... | Can't Help Falling In Love | 71 | 201933 | False | 0.266 | 0.0596 | 0 | -18.515 | 1 | 0.0363 | 0.9050 | 0.000071 | 0.1320 | 0.1430 | 181.740 | 3 | acoustic |
4 | Chord Overstreet | Hold On | Hold On | 82 | 198853 | False | 0.618 | 0.4430 | 2 | -9.681 | 1 | 0.0526 | 0.4690 | 0.000000 | 0.0829 | 0.1670 | 119.949 | 4 | acoustic |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
113995 | Rainy Lullaby | #mindfulness - Soft Rain for Mindful Meditatio... | Sleep My Little Boy | 21 | 384999 | False | 0.172 | 0.2350 | 5 | -16.393 | 1 | 0.0422 | 0.6400 | 0.928000 | 0.0863 | 0.0339 | 125.995 | 5 | world-music |
113996 | Rainy Lullaby | #mindfulness - Soft Rain for Mindful Meditatio... | Water Into Light | 22 | 385000 | False | 0.174 | 0.1170 | 0 | -18.318 | 0 | 0.0401 | 0.9940 | 0.976000 | 0.1050 | 0.0350 | 85.239 | 4 | world-music |
113997 | Cesária Evora | Best Of | Miss Perfumado | 22 | 271466 | False | 0.629 | 0.3290 | 0 | -10.895 | 0 | 0.0420 | 0.8670 | 0.000000 | 0.0839 | 0.7430 | 132.378 | 4 | world-music |
113998 | Michael W. Smith | Change Your World | Friends | 41 | 283893 | False | 0.587 | 0.5060 | 7 | -10.889 | 1 | 0.0297 | 0.3810 | 0.000000 | 0.2700 | 0.4130 | 135.960 | 4 | world-music |
113999 | Cesária Evora | Miss Perfumado | Barbincor | 22 | 241826 | False | 0.526 | 0.4870 | 1 | -10.204 | 0 | 0.0725 | 0.6810 | 0.000000 | 0.0893 | 0.7080 | 79.198 | 4 | world-music |
114000 rows × 19 columns
Checking for missing values in this column specifically, since a track without an artist makes no sense.
df['artists'].isnull().sum() # number of rows with a missing artist
1
df[df['artists'].isnull()]
artists | album_name | track_name | popularity | duration_ms | explicit | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | track_genre | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
65900 | NaN | NaN | NaN | 0 | 0 | False | 0.501 | 0.583 | 7 | -9.46 | 0 | 0.0605 | 0.69 | 0.00396 | 0.0747 | 0.734 | 138.391 | 4 | k-pop |
df.drop([65900], inplace=True) # dropped row with missing value
df
artists | album_name | track_name | popularity | duration_ms | explicit | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | track_genre | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Gen Hoshino | Comedy | Comedy | 73 | 230666 | False | 0.676 | 0.4610 | 1 | -6.746 | 0 | 0.1430 | 0.0322 | 0.000001 | 0.3580 | 0.7150 | 87.917 | 4 | acoustic |
1 | Ben Woodward | Ghost (Acoustic) | Ghost - Acoustic | 55 | 149610 | False | 0.420 | 0.1660 | 1 | -17.235 | 1 | 0.0763 | 0.9240 | 0.000006 | 0.1010 | 0.2670 | 77.489 | 4 | acoustic |
2 | Ingrid Michaelson;ZAYN | To Begin Again | To Begin Again | 57 | 210826 | False | 0.438 | 0.3590 | 0 | -9.734 | 1 | 0.0557 | 0.2100 | 0.000000 | 0.1170 | 0.1200 | 76.332 | 4 | acoustic |
3 | Kina Grannis | Crazy Rich Asians (Original Motion Picture Sou... | Can't Help Falling In Love | 71 | 201933 | False | 0.266 | 0.0596 | 0 | -18.515 | 1 | 0.0363 | 0.9050 | 0.000071 | 0.1320 | 0.1430 | 181.740 | 3 | acoustic |
4 | Chord Overstreet | Hold On | Hold On | 82 | 198853 | False | 0.618 | 0.4430 | 2 | -9.681 | 1 | 0.0526 | 0.4690 | 0.000000 | 0.0829 | 0.1670 | 119.949 | 4 | acoustic |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
113995 | Rainy Lullaby | #mindfulness - Soft Rain for Mindful Meditatio... | Sleep My Little Boy | 21 | 384999 | False | 0.172 | 0.2350 | 5 | -16.393 | 1 | 0.0422 | 0.6400 | 0.928000 | 0.0863 | 0.0339 | 125.995 | 5 | world-music |
113996 | Rainy Lullaby | #mindfulness - Soft Rain for Mindful Meditatio... | Water Into Light | 22 | 385000 | False | 0.174 | 0.1170 | 0 | -18.318 | 0 | 0.0401 | 0.9940 | 0.976000 | 0.1050 | 0.0350 | 85.239 | 4 | world-music |
113997 | Cesária Evora | Best Of | Miss Perfumado | 22 | 271466 | False | 0.629 | 0.3290 | 0 | -10.895 | 0 | 0.0420 | 0.8670 | 0.000000 | 0.0839 | 0.7430 | 132.378 | 4 | world-music |
113998 | Michael W. Smith | Change Your World | Friends | 41 | 283893 | False | 0.587 | 0.5060 | 7 | -10.889 | 1 | 0.0297 | 0.3810 | 0.000000 | 0.2700 | 0.4130 | 135.960 | 4 | world-music |
113999 | Cesária Evora | Miss Perfumado | Barbincor | 22 | 241826 | False | 0.526 | 0.4870 | 1 | -10.204 | 0 | 0.0725 | 0.6810 | 0.000000 | 0.0893 | 0.7080 | 79.198 | 4 | world-music |
113999 rows × 19 columns
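As an aside, instead of dropping the row by its hard-coded label, we could drop every row with a missing artist in one step; a minimal sketch of the alternative (not what was run above):
df = df.dropna(subset=['artists'])  # keep only rows where 'artists' is not missing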
df.columns
Index(['artists', 'album_name', 'track_name', 'popularity', 'duration_ms',
'explicit', 'danceability', 'energy', 'key', 'loudness', 'mode',
'speechiness', 'acousticness', 'instrumentalness', 'liveness',
'valence', 'tempo', 'time_signature', 'track_genre'],
dtype='object')
I'm grouping by the 'track_genre' column and computing the mean 'popularity' within each genre, just to get an idea of what the most popular genres are.
pop_score = df.groupby('track_genre')['popularity'].mean().sort_values(ascending=False)
pop_score
track_genre
pop-film 59.283000
k-pop 56.952953
chill 53.651000
sad 52.379000
grunge 49.594000
...
chicago-house 12.339000
detroit-techno 11.174000
latin 8.297000
romance 3.245000
iranian 2.210000
Name: popularity, Length: 114, dtype: float64
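To get a quick visual of the top of this ranking, we could plot the ten genres with the highest mean popularity; a short illustrative sketch using the pop_score Series above:
top10 = pop_score.head(10).reset_index()  # columns: track_genre, popularity
alt.Chart(top10).mark_bar().encode(
    x=alt.X('track_genre', sort='-y'),   # sort bars by descending mean popularity
    y='popularity',
    tooltip=['track_genre', 'popularity']
)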
df = df[['artists', 'track_name', 'popularity', 'danceability', 'energy', 'valence', 'track_genre']]
df # keep only the columns we want by indexing with a list of column labels
artists | track_name | popularity | danceability | energy | valence | track_genre | |
---|---|---|---|---|---|---|---|
0 | Gen Hoshino | Comedy | 73 | 0.676 | 0.4610 | 0.7150 | acoustic |
1 | Ben Woodward | Ghost - Acoustic | 55 | 0.420 | 0.1660 | 0.2670 | acoustic |
2 | Ingrid Michaelson;ZAYN | To Begin Again | 57 | 0.438 | 0.3590 | 0.1200 | acoustic |
3 | Kina Grannis | Can't Help Falling In Love | 71 | 0.266 | 0.0596 | 0.1430 | acoustic |
4 | Chord Overstreet | Hold On | 82 | 0.618 | 0.4430 | 0.1670 | acoustic |
... | ... | ... | ... | ... | ... | ... | ... |
113995 | Rainy Lullaby | Sleep My Little Boy | 21 | 0.172 | 0.2350 | 0.0339 | world-music |
113996 | Rainy Lullaby | Water Into Light | 22 | 0.174 | 0.1170 | 0.0350 | world-music |
113997 | Cesária Evora | Miss Perfumado | 22 | 0.629 | 0.3290 | 0.7430 | world-music |
113998 | Michael W. Smith | Friends | 41 | 0.587 | 0.5060 | 0.4130 | world-music |
113999 | Cesária Evora | Barbincor | 22 | 0.526 | 0.4870 | 0.7080 | world-music |
113999 rows × 7 columns
Main Portion of Project#
I take a random sample of 5,000 rows (the most Altair will plot by default) because I wanted to see whether the outcome stays the same or similar regardless of which sample is drawn; I re-ran this about 5 times and the results looked similar each time.
df = df.sample(5000, random_state=32454463)
df
artists | track_name | popularity | danceability | energy | valence | track_genre | |
---|---|---|---|---|---|---|---|
106440 | Lars Winnerbäck | Vem som helst blues | 39 | 0.462 | 0.795 | 0.6840 | swedish |
14671 | The Kiboomers | Baa Baa Black Sheep | 41 | 0.704 | 0.236 | 0.7220 | children |
58917 | Youth Code;King Yosef | Claw / Crawl | 20 | 0.509 | 0.953 | 0.0389 | industrial |
77494 | Vou pro Sereno;Xande De Pilares | Marinheiro Só / Cada Macaco no seu Galho (Chô ... | 46 | 0.405 | 0.748 | 0.7610 | pagode |
24942 | Drexciya | Red Hills of Lardossa | 7 | 0.695 | 0.837 | 0.2060 | detroit-techno |
... | ... | ... | ... | ... | ... | ... | ... |
27589 | Bladerunner | I Miss You | 18 | 0.549 | 0.718 | 0.0367 | drum-and-bass |
17240 | DJ Tray | Stop Playing - Jersey Club | 22 | 0.868 | 0.651 | 0.5450 | club |
11531 | Amy Winehouse | Me & Mr Jones | 66 | 0.583 | 0.486 | 0.5130 | british |
7826 | Pickin' On Series | Lovefool - Bluegrass Rendition of the Cardigans | 21 | 0.765 | 0.313 | 0.7540 | bluegrass |
105171 | bladecut | forever more | 42 | 0.672 | 0.385 | 0.0918 | study |
5000 rows × 7 columns
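As a rough check that the sample is representative, one could compare the mean popularity across a few different random states; a sketch, where df_full is assumed to be the cleaned DataFrame before the .sample call and the random_state values are arbitrary:
for rs in [0, 1, 2, 3, 4]:  # arbitrary seeds
    m = df_full.sample(5000, random_state=rs)['popularity'].mean()
    print(rs, round(m, 2))  # mean popularity of each sample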
alt.Chart(df).mark_bar().encode(
    x = 'energy',
    y = 'valence',
    color = alt.Color('popularity', scale=alt.Scale(scheme='turbo')),
    tooltip = ['track_genre']
)
Regardless of how many times it is run, it seems songs with low energy and valence tend to be less popular; those with medium (around 0.5) energy and valence range from not very popular to semi-popular (roughly 20-70); and those with high energy and valence range from about 0 to 40 in popularity.
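Since reading trends off a dense chart is subjective, we could also compute the correlations directly as a sanity check; a minimal sketch on the sampled df:
# correlation of popularity with each audio feature in the 5,000-row sample
df[['popularity', 'energy', 'valence', 'danceability']].corr()['popularity']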
brush = alt.selection_interval(encodings=["x"], init={"x": [0,100]}) # interval selection along the x-axis, initialized to span the full energy range
c1 = alt.Chart(df).mark_circle().encode(
    x="energy",
    y="popularity",
    color=alt.condition(brush, "track_genre", alt.value("orchid"))
).add_selection(brush) # attach the brush to the scatter plot
c2 = alt.Chart(df).mark_bar().encode(
    x="track_genre",
    y=alt.Y("count()", scale=alt.Scale(domain=[0,80])),
    color="track_genre"
).transform_filter(brush) # the bar chart only counts tracks inside the brushed energy range
alt.hconcat(c1,c2) # equivalent to c1|c2
We can see from this chart that, when all the data points are included, pop-film does not have the highest count, which is what we would have expected based on 'pop_score'. This might be because 'energy' is not the right feature to compare popularity against.
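One way to see whether this is just an artifact of the sample is to count how many tracks of each genre actually landed in it; a quick sketch (with 114 genres and 5,000 rows we would expect roughly 44 tracks per genre):
df['track_genre'].value_counts().head(10)  # most common genres in the sample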
cols = ['energy', 'valence', 'danceability']
I am now going to use OneHotEncoder to convert the 'track_genre' column into a NumPy array of 0/1 indicator columns so that genre can be included in LinearRegression.
encoder = OneHotEncoder()
encoder.fit(df[['track_genre']]) # 'track_genre' is a string column so we want to convert it into 1s and 0s
OneHotEncoder()
We convert the feature names into a list so that we can later use them as new column names in df2.
new_cols = list(encoder.get_feature_names_out())
new_cols
['track_genre_acoustic',
'track_genre_afrobeat',
'track_genre_alt-rock',
'track_genre_alternative',
'track_genre_ambient',
'track_genre_anime',
'track_genre_black-metal',
'track_genre_bluegrass',
'track_genre_blues',
'track_genre_brazil',
'track_genre_breakbeat',
'track_genre_british',
'track_genre_cantopop',
'track_genre_chicago-house',
'track_genre_children',
'track_genre_chill',
'track_genre_classical',
'track_genre_club',
'track_genre_comedy',
'track_genre_country',
'track_genre_dance',
'track_genre_dancehall',
'track_genre_death-metal',
'track_genre_deep-house',
'track_genre_detroit-techno',
'track_genre_disco',
'track_genre_disney',
'track_genre_drum-and-bass',
'track_genre_dub',
'track_genre_dubstep',
'track_genre_edm',
'track_genre_electro',
'track_genre_electronic',
'track_genre_emo',
'track_genre_folk',
'track_genre_forro',
'track_genre_french',
'track_genre_funk',
'track_genre_garage',
'track_genre_german',
'track_genre_gospel',
'track_genre_goth',
'track_genre_grindcore',
'track_genre_groove',
'track_genre_grunge',
'track_genre_guitar',
'track_genre_happy',
'track_genre_hard-rock',
'track_genre_hardcore',
'track_genre_hardstyle',
'track_genre_heavy-metal',
'track_genre_hip-hop',
'track_genre_honky-tonk',
'track_genre_house',
'track_genre_idm',
'track_genre_indian',
'track_genre_indie',
'track_genre_indie-pop',
'track_genre_industrial',
'track_genre_iranian',
'track_genre_j-dance',
'track_genre_j-idol',
'track_genre_j-pop',
'track_genre_j-rock',
'track_genre_jazz',
'track_genre_k-pop',
'track_genre_kids',
'track_genre_latin',
'track_genre_latino',
'track_genre_malay',
'track_genre_mandopop',
'track_genre_metal',
'track_genre_metalcore',
'track_genre_minimal-techno',
'track_genre_mpb',
'track_genre_new-age',
'track_genre_opera',
'track_genre_pagode',
'track_genre_party',
'track_genre_piano',
'track_genre_pop',
'track_genre_pop-film',
'track_genre_power-pop',
'track_genre_progressive-house',
'track_genre_psych-rock',
'track_genre_punk',
'track_genre_punk-rock',
'track_genre_r-n-b',
'track_genre_reggae',
'track_genre_reggaeton',
'track_genre_rock',
'track_genre_rock-n-roll',
'track_genre_rockabilly',
'track_genre_romance',
'track_genre_sad',
'track_genre_salsa',
'track_genre_samba',
'track_genre_sertanejo',
'track_genre_show-tunes',
'track_genre_singer-songwriter',
'track_genre_ska',
'track_genre_sleep',
'track_genre_songwriter',
'track_genre_soul',
'track_genre_spanish',
'track_genre_study',
'track_genre_swedish',
'track_genre_synth-pop',
'track_genre_tango',
'track_genre_techno',
'track_genre_trance',
'track_genre_trip-hop',
'track_genre_turkish',
'track_genre_world-music']
df2 = df.copy() # make a copy of df to be safe
df2[new_cols] = encoder.transform(df[["track_genre"]]).toarray() # transform 'track_genre' into a dense 0/1 array and store it in the new columns
df2
artists | track_name | popularity | danceability | energy | valence | track_genre | track_genre_acoustic | track_genre_afrobeat | track_genre_alt-rock | ... | track_genre_spanish | track_genre_study | track_genre_swedish | track_genre_synth-pop | track_genre_tango | track_genre_techno | track_genre_trance | track_genre_trip-hop | track_genre_turkish | track_genre_world-music | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
106440 | Lars Winnerbäck | Vem som helst blues | 39 | 0.462 | 0.795 | 0.6840 | swedish | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
14671 | The Kiboomers | Baa Baa Black Sheep | 41 | 0.704 | 0.236 | 0.7220 | children | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
58917 | Youth Code;King Yosef | Claw / Crawl | 20 | 0.509 | 0.953 | 0.0389 | industrial | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
77494 | Vou pro Sereno;Xande De Pilares | Marinheiro Só / Cada Macaco no seu Galho (Chô ... | 46 | 0.405 | 0.748 | 0.7610 | pagode | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
24942 | Drexciya | Red Hills of Lardossa | 7 | 0.695 | 0.837 | 0.2060 | detroit-techno | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
27589 | Bladerunner | I Miss You | 18 | 0.549 | 0.718 | 0.0367 | drum-and-bass | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
17240 | DJ Tray | Stop Playing - Jersey Club | 22 | 0.868 | 0.651 | 0.5450 | club | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
11531 | Amy Winehouse | Me & Mr Jones | 66 | 0.583 | 0.486 | 0.5130 | british | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7826 | Pickin' On Series | Lovefool - Bluegrass Rendition of the Cardigans | 21 | 0.765 | 0.313 | 0.7540 | bluegrass | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
105171 | bladecut | forever more | 42 | 0.672 | 0.385 | 0.0918 | study | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5000 rows × 121 columns
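As an aside, pandas can build essentially the same indicator columns in one call; a hedged alternative sketch (not used below, where we stick with the encoder output):
genre_dummies = pd.get_dummies(df['track_genre'], prefix='track_genre')  # one 0/1 column per genre
genre_dummies.shape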
reg = LinearRegression(fit_intercept=False) # no separate intercept, since the one-hot genre columns already act as per-genre intercepts
reg.fit(df2[cols+new_cols], df2['popularity']) # cols and new_cols are our independent variables, while popularity is our dependent variable
LinearRegression(fit_intercept=False)
pd.Series(reg.coef_, index=reg.feature_names_in_)
energy -0.403090
valence -4.584493
danceability 10.108819
track_genre_acoustic 39.078027
track_genre_afrobeat 21.509349
...
track_genre_techno 34.199210
track_genre_trance 34.507869
track_genre_trip-hop 29.400029
track_genre_turkish 35.722974
track_genre_world-music 37.985017
Length: 117, dtype: float64
pd.Series(reg.coef_, index=reg.feature_names_in_).sort_values(ascending=False, key=abs)
track_genre_k-pop 51.243638
track_genre_pop-film 50.919686
track_genre_pop 48.092857
track_genre_anime 47.498782
track_genre_indian 47.458272
...
track_genre_latin 5.357669
valence -4.584493
track_genre_romance 1.786777
track_genre_iranian 0.850650
energy -0.403090
Length: 117, dtype: float64
From this, it seems that energy and valence are not big indicators of a song's popularity; danceability matters more. K-pop also appears to be the most popular genre (possibly because of the danceability of its songs?).
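The coefficients alone don't tell us how accurately this linear model predicts popularity, so as a rough check we could evaluate it with the same train/test split idea and error metric used below; a sketch (the random_state here is arbitrary):
X_tr, X_te, y_tr, y_te = train_test_split(
    df2[cols + new_cols], df2['popularity'], train_size=0.8, random_state=0
)
lin = LinearRegression(fit_intercept=False)
lin.fit(X_tr, y_tr)
mean_absolute_error(lin.predict(X_te), y_te)  # test MAE on the 0-100 popularity scale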
I am now going to use KNeighborsRegressor and KNeighborsClassifier to see which of them produces better results.
X_train, X_test, y_train, y_test = train_test_split(df2[cols], df2['popularity'], train_size=0.8) # split the data into a training and test set
reg2 = KNeighborsRegressor(n_neighbors=10)
reg2.fit(X_train, y_train)
KNeighborsRegressor(n_neighbors=10)
reg2.predict(X_train)
array([26.9, 40.6, 36.6, ..., 10.4, 38.8, 44.7])
mean_absolute_error(reg2.predict(X_train), y_train)
17.009625
mean_absolute_error(reg2.predict(X_test), y_test)
18.2267
Comparing the mean absolute errors, the test error is only slightly larger than the training error, which suggests we are not badly overfitting the data with n_neighbors=10.
def get_scores(k):
    K_reg = KNeighborsRegressor(n_neighbors=k)
    K_reg.fit(X_train, y_train)
    train_error = mean_absolute_error(K_reg.predict(X_train), y_train)
    test_error = mean_absolute_error(K_reg.predict(X_test), y_test)
    return (train_error, test_error)
We will now check which values of k give the smallest test error.
reg_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})
reg_scores
k | train_error | test_error | |
---|---|---|---|
0 | 1 | NaN | NaN |
1 | 2 | NaN | NaN |
2 | 3 | NaN | NaN |
3 | 4 | NaN | NaN |
4 | 5 | NaN | NaN |
... | ... | ... | ... |
144 | 145 | NaN | NaN |
145 | 146 | NaN | NaN |
146 | 147 | NaN | NaN |
147 | 148 | NaN | NaN |
148 | 149 | NaN | NaN |
149 rows × 3 columns
for i in reg_scores.index:
    reg_scores.loc[i,["train_error","test_error"]] = get_scores(reg_scores.loc[i,"k"])
reg_scores
k | train_error | test_error | |
---|---|---|---|
0 | 1 | 0.057750 | 21.320000 |
1 | 2 | 11.319125 | 19.648000 |
2 | 3 | 13.713833 | 18.821333 |
3 | 4 | 14.815188 | 18.813750 |
4 | 5 | 15.672650 | 18.877800 |
... | ... | ... | ... |
144 | 145 | 18.039840 | 18.142124 |
145 | 146 | 18.039740 | 18.138623 |
146 | 147 | 18.043389 | 18.138891 |
147 | 148 | 18.051356 | 18.140041 |
148 | 149 | 18.053909 | 18.137027 |
149 rows × 3 columns
(reg_scores["test_error"]).min()
18.101453781512603
Our earlier choice of n_neighbors=10 gave a test error of about 18.2, which is close to this minimum, so it was a reasonable choice.
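To see which k actually attains that minimum (rather than just the minimum value itself), we could look it up with idxmin; a small sketch:
best = reg_scores.loc[reg_scores['test_error'].idxmin()]  # row with the smallest test error
best[['k', 'test_error']]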
reg_scores["kinv"] = 1/reg_scores.k
Since higher k values correspond to lower flexibility, we added a column above with the reciprocal of k, so that flexibility increases from left to right in the chart below.
reg_scores
k | train_error | test_error | kinv | |
---|---|---|---|---|
0 | 1 | 0.057750 | 21.320000 | 1.000000 |
1 | 2 | 11.319125 | 19.648000 | 0.500000 |
2 | 3 | 13.713833 | 18.821333 | 0.333333 |
3 | 4 | 14.815188 | 18.813750 | 0.250000 |
4 | 5 | 15.672650 | 18.877800 | 0.200000 |
... | ... | ... | ... | ... |
144 | 145 | 18.039840 | 18.142124 | 0.006897 |
145 | 146 | 18.039740 | 18.138623 | 0.006849 |
146 | 147 | 18.043389 | 18.138891 | 0.006803 |
147 | 148 | 18.051356 | 18.140041 | 0.006757 |
148 | 149 | 18.053909 | 18.137027 | 0.006711 |
149 rows × 4 columns
reg_train = alt.Chart(reg_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
)
reg_test = alt.Chart(reg_scores).mark_line(color="orange").encode(
    x = "kinv",
    y = "test_error"
)
reg_train+reg_test
On the right side of the chart (small k, high flexibility) the training error sits far below the test error, which indicates overfitting; toward the left (large k, low flexibility) the two curves converge and flatten, which indicates underfitting.
We will now compare the results using KNeighborsClassifier.
clf = KNeighborsClassifier(n_neighbors=7)
clf.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=7)
mean_absolute_error(clf.predict(X_train), y_train)
24.1935
mean_absolute_error(clf.predict(X_test), y_test)
27.824
The mean absolute error for the test set is somewhat larger than for the training set, so n_neighbors = 7 overfits a little, but not severely.
def get_clf_scores(k):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    train_error = mean_absolute_error(clf.predict(X_train), y_train)
    test_error = mean_absolute_error(clf.predict(X_test), y_test)
    return (train_error, test_error)
clf_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})
for i in clf_scores.index:
    clf_scores.loc[i,["train_error","test_error"]] = get_clf_scores(clf_scores.loc[i,"k"])
The process is the same as for KNeighborsRegressor.
clf_scores
k | train_error | test_error | |
---|---|---|---|
0 | 1 | 0.05775 | 21.320 |
1 | 2 | 11.63450 | 21.997 |
2 | 3 | 17.05800 | 23.735 |
3 | 4 | 20.35475 | 25.824 |
4 | 5 | 22.30100 | 26.822 |
... | ... | ... | ... |
144 | 145 | 32.94375 | 33.060 |
145 | 146 | 32.94425 | 33.025 |
146 | 147 | 32.93925 | 33.008 |
147 | 148 | 32.93375 | 33.008 |
148 | 149 | 32.93800 | 33.043 |
149 rows × 3 columns
clf_scores["test_error"].min()
21.32
Using n_neighbors=7 wasn't a great choice, since its test error of about 27.8 is well above this minimum of about 21.3.
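Because KNeighborsClassifier treats every popularity value as a separate class, mean absolute error is an unusual way to score it; for comparison, a hedged sketch of the more conventional accuracy score (the exact value depends on the split):
from sklearn.metrics import accuracy_score
accuracy_score(y_test, clf.predict(X_test))  # fraction of test tracks whose exact popularity value is predicted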
clf_scores["kinv"] = 1/clf_scores.k
clftrain = alt.Chart(clf_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
)
clftest = alt.Chart(clf_scores).mark_line(color="orange").encode(
    x = "kinv",
    y = "test_error"
).properties(
    title="Error",
)
clftrain+clftest
There is high flexibility and variance on the right side of the chart (small k), where the model overfits; as k grows (moving left), both errors rise and the model underfits.
Summary#
From comparing KNeighborsRegressor and KNeighborsClassifier, we can see that KNeighborsRegressor is the better choice for our dataset. Its test error is around 18 points on the 0-100 popularity scale, so we can expect predictions of popularity to be off by roughly that much on average. We can conclude that there is some correlation between a song's popularity and its genre, together with features like danceability.
References#
Dataset source: https://www.kaggle.com/code/kelvinzeng/spotify-tracks-analysis
Other references that were helpful:
https://christopherdavisuci.github.io/UCI-Math-10-W22/Proj/StudentProjects/DanaAlbakri.html
https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html
https://christopherdavisuci.github.io/UCI-Math-10-F22/Week7/Week7-Wednesday.html#including-a-categorical-variable-in-our-linear-regression
https://christopherdavisuci.github.io/UCI-Math-10-F22/Week6/Week6-Friday.html#linear-regression-using-a-categorical-variable