Spotify Popular Songs Correlation#
Author: Daniel Lim, djlim2@uci.edu
Course Project, UC Irvine, Math 10, F22
Introduction#
This dataset contains a wide range of Spotify tracks, with columns describing the artist, the track's popularity, its duration, and so on. My goal is to see whether there is a correlation between a song's popularity and its genre, using audio features such as energy, valence, and danceability.
Cleaning Up Data#
We first need to clean up the data by dropping columns that won't be of use to us and removing rows with missing values.
import pandas as pd
import altair as alt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import mean_absolute_error
df = pd.read_csv('dataset.csv')
df
Unnamed: 0 | track_id | artists | album_name | track_name | popularity | duration_ms | explicit | danceability | energy | ... | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | track_genre | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 5SuOikwiRyPMVoIQDJUgSV | Gen Hoshino | Comedy | Comedy | 73 | 230666 | False | 0.676 | 0.4610 | ... | -6.746 | 0 | 0.1430 | 0.0322 | 0.000001 | 0.3580 | 0.7150 | 87.917 | 4 | acoustic |
1 | 1 | 4qPNDBW1i3p13qLCt0Ki3A | Ben Woodward | Ghost (Acoustic) | Ghost - Acoustic | 55 | 149610 | False | 0.420 | 0.1660 | ... | -17.235 | 1 | 0.0763 | 0.9240 | 0.000006 | 0.1010 | 0.2670 | 77.489 | 4 | acoustic |
2 | 2 | 1iJBSr7s7jYXzM8EGcbK5b | Ingrid Michaelson;ZAYN | To Begin Again | To Begin Again | 57 | 210826 | False | 0.438 | 0.3590 | ... | -9.734 | 1 | 0.0557 | 0.2100 | 0.000000 | 0.1170 | 0.1200 | 76.332 | 4 | acoustic |
3 | 3 | 6lfxq3CG4xtTiEg7opyCyx | Kina Grannis | Crazy Rich Asians (Original Motion Picture Sou... | Can't Help Falling In Love | 71 | 201933 | False | 0.266 | 0.0596 | ... | -18.515 | 1 | 0.0363 | 0.9050 | 0.000071 | 0.1320 | 0.1430 | 181.740 | 3 | acoustic |
4 | 4 | 5vjLSffimiIP26QG5WcN2K | Chord Overstreet | Hold On | Hold On | 82 | 198853 | False | 0.618 | 0.4430 | ... | -9.681 | 1 | 0.0526 | 0.4690 | 0.000000 | 0.0829 | 0.1670 | 119.949 | 4 | acoustic |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
113995 | 113995 | 2C3TZjDRiAzdyViavDJ217 | Rainy Lullaby | #mindfulness - Soft Rain for Mindful Meditatio... | Sleep My Little Boy | 21 | 384999 | False | 0.172 | 0.2350 | ... | -16.393 | 1 | 0.0422 | 0.6400 | 0.928000 | 0.0863 | 0.0339 | 125.995 | 5 | world-music |
113996 | 113996 | 1hIz5L4IB9hN3WRYPOCGPw | Rainy Lullaby | #mindfulness - Soft Rain for Mindful Meditatio... | Water Into Light | 22 | 385000 | False | 0.174 | 0.1170 | ... | -18.318 | 0 | 0.0401 | 0.9940 | 0.976000 | 0.1050 | 0.0350 | 85.239 | 4 | world-music |
113997 | 113997 | 6x8ZfSoqDjuNa5SVP5QjvX | Cesária Evora | Best Of | Miss Perfumado | 22 | 271466 | False | 0.629 | 0.3290 | ... | -10.895 | 0 | 0.0420 | 0.8670 | 0.000000 | 0.0839 | 0.7430 | 132.378 | 4 | world-music |
113998 | 113998 | 2e6sXL2bYv4bSz6VTdnfLs | Michael W. Smith | Change Your World | Friends | 41 | 283893 | False | 0.587 | 0.5060 | ... | -10.889 | 1 | 0.0297 | 0.3810 | 0.000000 | 0.2700 | 0.4130 | 135.960 | 4 | world-music |
113999 | 113999 | 2hETkH7cOfqmz3LqZDHZf5 | Cesária Evora | Miss Perfumado | Barbincor | 22 | 241826 | False | 0.526 | 0.4870 | ... | -10.204 | 0 | 0.0725 | 0.6810 | 0.000000 | 0.0893 | 0.7080 | 79.198 | 4 | world-music |
114000 rows × 21 columns
Since most of the columns we want to look at are already numeric, we don't need to convert them.
df.columns, df.dtypes
(Index(['Unnamed: 0', 'track_id', 'artists', 'album_name', 'track_name',
'popularity', 'duration_ms', 'explicit', 'danceability', 'energy',
'key', 'loudness', 'mode', 'speechiness', 'acousticness',
'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
'track_genre'],
dtype='object'),
Unnamed: 0 int64
track_id object
artists object
album_name object
track_name object
popularity int64
duration_ms int64
explicit bool
danceability float64
energy float64
key int64
loudness float64
mode int64
speechiness float64
acousticness float64
instrumentalness float64
liveness float64
valence float64
tempo float64
time_signature int64
track_genre object
dtype: object)
We drop these two columns at the start since they are just row and track identifiers and have nothing to do with the analysis.
df.drop(['Unnamed: 0', 'track_id'], axis=1, inplace=True)
df
artists | album_name | track_name | popularity | duration_ms | explicit | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | track_genre | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Gen Hoshino | Comedy | Comedy | 73 | 230666 | False | 0.676 | 0.4610 | 1 | -6.746 | 0 | 0.1430 | 0.0322 | 0.000001 | 0.3580 | 0.7150 | 87.917 | 4 | acoustic |
1 | Ben Woodward | Ghost (Acoustic) | Ghost - Acoustic | 55 | 149610 | False | 0.420 | 0.1660 | 1 | -17.235 | 1 | 0.0763 | 0.9240 | 0.000006 | 0.1010 | 0.2670 | 77.489 | 4 | acoustic |
2 | Ingrid Michaelson;ZAYN | To Begin Again | To Begin Again | 57 | 210826 | False | 0.438 | 0.3590 | 0 | -9.734 | 1 | 0.0557 | 0.2100 | 0.000000 | 0.1170 | 0.1200 | 76.332 | 4 | acoustic |
3 | Kina Grannis | Crazy Rich Asians (Original Motion Picture Sou... | Can't Help Falling In Love | 71 | 201933 | False | 0.266 | 0.0596 | 0 | -18.515 | 1 | 0.0363 | 0.9050 | 0.000071 | 0.1320 | 0.1430 | 181.740 | 3 | acoustic |
4 | Chord Overstreet | Hold On | Hold On | 82 | 198853 | False | 0.618 | 0.4430 | 2 | -9.681 | 1 | 0.0526 | 0.4690 | 0.000000 | 0.0829 | 0.1670 | 119.949 | 4 | acoustic |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
113995 | Rainy Lullaby | #mindfulness - Soft Rain for Mindful Meditatio... | Sleep My Little Boy | 21 | 384999 | False | 0.172 | 0.2350 | 5 | -16.393 | 1 | 0.0422 | 0.6400 | 0.928000 | 0.0863 | 0.0339 | 125.995 | 5 | world-music |
113996 | Rainy Lullaby | #mindfulness - Soft Rain for Mindful Meditatio... | Water Into Light | 22 | 385000 | False | 0.174 | 0.1170 | 0 | -18.318 | 0 | 0.0401 | 0.9940 | 0.976000 | 0.1050 | 0.0350 | 85.239 | 4 | world-music |
113997 | Cesária Evora | Best Of | Miss Perfumado | 22 | 271466 | False | 0.629 | 0.3290 | 0 | -10.895 | 0 | 0.0420 | 0.8670 | 0.000000 | 0.0839 | 0.7430 | 132.378 | 4 | world-music |
113998 | Michael W. Smith | Change Your World | Friends | 41 | 283893 | False | 0.587 | 0.5060 | 7 | -10.889 | 1 | 0.0297 | 0.3810 | 0.000000 | 0.2700 | 0.4130 | 135.960 | 4 | world-music |
113999 | Cesária Evora | Miss Perfumado | Barbincor | 22 | 241826 | False | 0.526 | 0.4870 | 1 | -10.204 | 0 | 0.0725 | 0.6810 | 0.000000 | 0.0893 | 0.7080 | 79.198 | 4 | world-music |
114000 rows × 19 columns
Checking for missing values in this column specifically, since a track without an artist makes no sense.
df['artists'].isnull().sum() # number of rows with a missing artist
1
df[df['artists'].isnull()]
artists | album_name | track_name | popularity | duration_ms | explicit | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | track_genre | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
65900 | NaN | NaN | NaN | 0 | 0 | False | 0.501 | 0.583 | 7 | -9.46 | 0 | 0.0605 | 0.69 | 0.00396 | 0.0747 | 0.734 | 138.391 | 4 | k-pop |
df.drop([65900], inplace=True) # dropped row with missing value
df
artists | album_name | track_name | popularity | duration_ms | explicit | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | track_genre | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Gen Hoshino | Comedy | Comedy | 73 | 230666 | False | 0.676 | 0.4610 | 1 | -6.746 | 0 | 0.1430 | 0.0322 | 0.000001 | 0.3580 | 0.7150 | 87.917 | 4 | acoustic |
1 | Ben Woodward | Ghost (Acoustic) | Ghost - Acoustic | 55 | 149610 | False | 0.420 | 0.1660 | 1 | -17.235 | 1 | 0.0763 | 0.9240 | 0.000006 | 0.1010 | 0.2670 | 77.489 | 4 | acoustic |
2 | Ingrid Michaelson;ZAYN | To Begin Again | To Begin Again | 57 | 210826 | False | 0.438 | 0.3590 | 0 | -9.734 | 1 | 0.0557 | 0.2100 | 0.000000 | 0.1170 | 0.1200 | 76.332 | 4 | acoustic |
3 | Kina Grannis | Crazy Rich Asians (Original Motion Picture Sou... | Can't Help Falling In Love | 71 | 201933 | False | 0.266 | 0.0596 | 0 | -18.515 | 1 | 0.0363 | 0.9050 | 0.000071 | 0.1320 | 0.1430 | 181.740 | 3 | acoustic |
4 | Chord Overstreet | Hold On | Hold On | 82 | 198853 | False | 0.618 | 0.4430 | 2 | -9.681 | 1 | 0.0526 | 0.4690 | 0.000000 | 0.0829 | 0.1670 | 119.949 | 4 | acoustic |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
113995 | Rainy Lullaby | #mindfulness - Soft Rain for Mindful Meditatio... | Sleep My Little Boy | 21 | 384999 | False | 0.172 | 0.2350 | 5 | -16.393 | 1 | 0.0422 | 0.6400 | 0.928000 | 0.0863 | 0.0339 | 125.995 | 5 | world-music |
113996 | Rainy Lullaby | #mindfulness - Soft Rain for Mindful Meditatio... | Water Into Light | 22 | 385000 | False | 0.174 | 0.1170 | 0 | -18.318 | 0 | 0.0401 | 0.9940 | 0.976000 | 0.1050 | 0.0350 | 85.239 | 4 | world-music |
113997 | Cesária Evora | Best Of | Miss Perfumado | 22 | 271466 | False | 0.629 | 0.3290 | 0 | -10.895 | 0 | 0.0420 | 0.8670 | 0.000000 | 0.0839 | 0.7430 | 132.378 | 4 | world-music |
113998 | Michael W. Smith | Change Your World | Friends | 41 | 283893 | False | 0.587 | 0.5060 | 7 | -10.889 | 1 | 0.0297 | 0.3810 | 0.000000 | 0.2700 | 0.4130 | 135.960 | 4 | world-music |
113999 | Cesária Evora | Miss Perfumado | Barbincor | 22 | 241826 | False | 0.526 | 0.4870 | 1 | -10.204 | 0 | 0.0725 | 0.6810 | 0.000000 | 0.0893 | 0.7080 | 79.198 | 4 | world-music |
113999 rows × 19 columns
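As an aside, instead of dropping the row by its hard-coded label, we could drop every row with a missing artist in one step; a minimal sketch of the alternative (not what was run above):
df = df.dropna(subset=['artists'])  # keep only rows where 'artists' is not missing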
df.columns
Index(['artists', 'album_name', 'track_name', 'popularity', 'duration_ms',
'explicit', 'danceability', 'energy', 'key', 'loudness', 'mode',
'speechiness', 'acousticness', 'instrumentalness', 'liveness',
'valence', 'tempo', 'time_signature', 'track_genre'],
dtype='object')
I'm grouping by the 'track_genre' column and computing the mean 'popularity' within each genre, just to get an idea of what the most popular genres are.
pop_score = df.groupby('track_genre')['popularity'].mean().sort_values(ascending=False)
pop_score
track_genre
pop-film 59.283000
k-pop 56.952953
chill 53.651000
sad 52.379000
grunge 49.594000
...
chicago-house 12.339000
detroit-techno 11.174000
latin 8.297000
romance 3.245000
iranian 2.210000
Name: popularity, Length: 114, dtype: float64
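To get a quick visual of the top of this ranking, we could plot the ten genres with the highest mean popularity; a short illustrative sketch using the pop_score Series above:
top10 = pop_score.head(10).reset_index()  # columns: track_genre, popularity
alt.Chart(top10).mark_bar().encode(
    x=alt.X('track_genre', sort='-y'),   # sort bars by descending mean popularity
    y='popularity',
    tooltip=['track_genre', 'popularity']
)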
df = df[['artists', 'track_name', 'popularity', 'danceability', 'energy', 'valence', 'track_genre']]
df # keep only the columns we want by indexing with a list of column labels
artists | track_name | popularity | danceability | energy | valence | track_genre | |
---|---|---|---|---|---|---|---|
0 | Gen Hoshino | Comedy | 73 | 0.676 | 0.4610 | 0.7150 | acoustic |
1 | Ben Woodward | Ghost - Acoustic | 55 | 0.420 | 0.1660 | 0.2670 | acoustic |
2 | Ingrid Michaelson;ZAYN | To Begin Again | 57 | 0.438 | 0.3590 | 0.1200 | acoustic |
3 | Kina Grannis | Can't Help Falling In Love | 71 | 0.266 | 0.0596 | 0.1430 | acoustic |
4 | Chord Overstreet | Hold On | 82 | 0.618 | 0.4430 | 0.1670 | acoustic |
... | ... | ... | ... | ... | ... | ... | ... |
113995 | Rainy Lullaby | Sleep My Little Boy | 21 | 0.172 | 0.2350 | 0.0339 | world-music |
113996 | Rainy Lullaby | Water Into Light | 22 | 0.174 | 0.1170 | 0.0350 | world-music |
113997 | Cesária Evora | Miss Perfumado | 22 | 0.629 | 0.3290 | 0.7430 | world-music |
113998 | Michael W. Smith | Friends | 41 | 0.587 | 0.5060 | 0.4130 | world-music |
113999 | Cesária Evora | Barbincor | 22 | 0.526 | 0.4870 | 0.7080 | world-music |
113999 rows × 7 columns
Main Portion of Project#
I take a random sample of 5,000 rows (the most Altair will plot by default) because I wanted to see whether the outcome stays the same or similar regardless of which sample is drawn; I re-ran this about 5 times and the results looked similar each time.
df = df.sample(5000, random_state=32454463)
df
artists | track_name | popularity | danceability | energy | valence | track_genre | |
---|---|---|---|---|---|---|---|
106440 | Lars Winnerbäck | Vem som helst blues | 39 | 0.462 | 0.795 | 0.6840 | swedish |
14671 | The Kiboomers | Baa Baa Black Sheep | 41 | 0.704 | 0.236 | 0.7220 | children |
58917 | Youth Code;King Yosef | Claw / Crawl | 20 | 0.509 | 0.953 | 0.0389 | industrial |
77494 | Vou pro Sereno;Xande De Pilares | Marinheiro Só / Cada Macaco no seu Galho (Chô ... | 46 | 0.405 | 0.748 | 0.7610 | pagode |
24942 | Drexciya | Red Hills of Lardossa | 7 | 0.695 | 0.837 | 0.2060 | detroit-techno |
... | ... | ... | ... | ... | ... | ... | ... |
27589 | Bladerunner | I Miss You | 18 | 0.549 | 0.718 | 0.0367 | drum-and-bass |
17240 | DJ Tray | Stop Playing - Jersey Club | 22 | 0.868 | 0.651 | 0.5450 | club |
11531 | Amy Winehouse | Me & Mr Jones | 66 | 0.583 | 0.486 | 0.5130 | british |
7826 | Pickin' On Series | Lovefool - Bluegrass Rendition of the Cardigans | 21 | 0.765 | 0.313 | 0.7540 | bluegrass |
105171 | bladecut | forever more | 42 | 0.672 | 0.385 | 0.0918 | study |
5000 rows × 7 columns
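As a rough check that the sample is representative, one could compare the mean popularity across a few different random states; a sketch, where df_full is assumed to be the cleaned DataFrame before the .sample call and the random_state values are arbitrary:
for rs in [0, 1, 2, 3, 4]:  # arbitrary seeds
    m = df_full.sample(5000, random_state=rs)['popularity'].mean()
    print(rs, round(m, 2))  # mean popularity of each sample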
alt.Chart(df).mark_bar().encode(
    x = 'energy',
    y = 'valence',
    color = alt.Color('popularity', scale=alt.Scale(scheme='turbo')),
    tooltip = ['track_genre']
)
Regardless of how many times it is run, it seems songs with low energy and valence tend to be less popular; those with medium (around 0.5) energy and valence range from not very popular to semi-popular (roughly 20-70); and those with high energy and valence range from about 0 to 40 in popularity.
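Since reading trends off a dense chart is subjective, we could also compute the correlations directly as a sanity check; a minimal sketch on the sampled df:
# correlation of popularity with each audio feature in the 5,000-row sample
df[['popularity', 'energy', 'valence', 'danceability']].corr()['popularity']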
brush = alt.selection_interval(encodings=["x"], init={"x": [0,100]}) # interval selection along the x-axis, initialized to span the full energy range
c1 = alt.Chart(df).mark_circle().encode(
    x="energy",
    y="popularity",
    color=alt.condition(brush, "track_genre", alt.value("orchid"))
).add_selection(brush) # attach the brush to the scatter plot
c2 = alt.Chart(df).mark_bar().encode(
    x="track_genre",
    y=alt.Y("count()", scale=alt.Scale(domain=[0,80])),
    color="track_genre"
).transform_filter(brush) # the bar chart only counts tracks inside the brushed energy range
alt.hconcat(c1,c2) # equivalent to c1|c2
We can see from this chart that, when all the data points are included, pop-film does not have the highest count, which is what we would have expected based on 'pop_score'. This might be because 'energy' is not the right feature to compare popularity against.
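One way to see whether this is just an artifact of the sample is to count how many tracks of each genre actually landed in it; a quick sketch (with 114 genres and 5,000 rows we would expect roughly 44 tracks per genre):
df['track_genre'].value_counts().head(10)  # most common genres in the sample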
cols = ['energy', 'valence', 'danceability']
I am now going to use OneHotEncoder to convert the 'track_genre' column into a NumPy array of 0/1 indicator columns so that genre can be included in LinearRegression.
encoder = OneHotEncoder()
encoder.fit(df[['track_genre']]) # 'track_genre' is a string column so we want to convert it into 1s and 0s
OneHotEncoder()
We convert the feature names into a list so that we can later use them as new column names in df2.
new_cols = list(encoder.get_feature_names_out())
new_cols
['track_genre_acoustic',
'track_genre_afrobeat',
'track_genre_alt-rock',
'track_genre_alternative',
'track_genre_ambient',
'track_genre_anime',
'track_genre_black-metal',
'track_genre_bluegrass',
'track_genre_blues',
'track_genre_brazil',
'track_genre_breakbeat',
'track_genre_british',
'track_genre_cantopop',
'track_genre_chicago-house',
'track_genre_children',
'track_genre_chill',
'track_genre_classical',
'track_genre_club',
'track_genre_comedy',
'track_genre_country',
'track_genre_dance',
'track_genre_dancehall',
'track_genre_death-metal',
'track_genre_deep-house',
'track_genre_detroit-techno',
'track_genre_disco',
'track_genre_disney',
'track_genre_drum-and-bass',
'track_genre_dub',
'track_genre_dubstep',
'track_genre_edm',
'track_genre_electro',
'track_genre_electronic',
'track_genre_emo',
'track_genre_folk',
'track_genre_forro',
'track_genre_french',
'track_genre_funk',
'track_genre_garage',
'track_genre_german',
'track_genre_gospel',
'track_genre_goth',
'track_genre_grindcore',
'track_genre_groove',
'track_genre_grunge',
'track_genre_guitar',
'track_genre_happy',
'track_genre_hard-rock',
'track_genre_hardcore',
'track_genre_hardstyle',
'track_genre_heavy-metal',
'track_genre_hip-hop',
'track_genre_honky-tonk',
'track_genre_house',
'track_genre_idm',
'track_genre_indian',
'track_genre_indie',
'track_genre_indie-pop',
'track_genre_industrial',
'track_genre_iranian',
'track_genre_j-dance',
'track_genre_j-idol',
'track_genre_j-pop',
'track_genre_j-rock',
'track_genre_jazz',
'track_genre_k-pop',
'track_genre_kids',
'track_genre_latin',
'track_genre_latino',
'track_genre_malay',
'track_genre_mandopop',
'track_genre_metal',
'track_genre_metalcore',
'track_genre_minimal-techno',
'track_genre_mpb',
'track_genre_new-age',
'track_genre_opera',
'track_genre_pagode',
'track_genre_party',
'track_genre_piano',
'track_genre_pop',
'track_genre_pop-film',
'track_genre_power-pop',
'track_genre_progressive-house',
'track_genre_psych-rock',
'track_genre_punk',
'track_genre_punk-rock',
'track_genre_r-n-b',
'track_genre_reggae',
'track_genre_reggaeton',
'track_genre_rock',
'track_genre_rock-n-roll',
'track_genre_rockabilly',
'track_genre_romance',
'track_genre_sad',
'track_genre_salsa',
'track_genre_samba',
'track_genre_sertanejo',
'track_genre_show-tunes',
'track_genre_singer-songwriter',
'track_genre_ska',
'track_genre_sleep',
'track_genre_songwriter',
'track_genre_soul',
'track_genre_spanish',
'track_genre_study',
'track_genre_swedish',
'track_genre_synth-pop',
'track_genre_tango',
'track_genre_techno',
'track_genre_trance',
'track_genre_trip-hop',
'track_genre_turkish',
'track_genre_world-music']
df2 = df.copy() # make a copy of df to be safe
df2[new_cols] = encoder.transform(df[["track_genre"]]).toarray() # transform 'track_genre' into a dense 0/1 array and store it in the new columns
df2
artists | track_name | popularity | danceability | energy | valence | track_genre | track_genre_acoustic | track_genre_afrobeat | track_genre_alt-rock | ... | track_genre_spanish | track_genre_study | track_genre_swedish | track_genre_synth-pop | track_genre_tango | track_genre_techno | track_genre_trance | track_genre_trip-hop | track_genre_turkish | track_genre_world-music | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
106440 | Lars Winnerbäck | Vem som helst blues | 39 | 0.462 | 0.795 | 0.6840 | swedish | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
14671 | The Kiboomers | Baa Baa Black Sheep | 41 | 0.704 | 0.236 | 0.7220 | children | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
58917 | Youth Code;King Yosef | Claw / Crawl | 20 | 0.509 | 0.953 | 0.0389 | industrial | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
77494 | Vou pro Sereno;Xande De Pilares | Marinheiro Só / Cada Macaco no seu Galho (Chô ... | 46 | 0.405 | 0.748 | 0.7610 | pagode | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
24942 | Drexciya | Red Hills of Lardossa | 7 | 0.695 | 0.837 | 0.2060 | detroit-techno | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
27589 | Bladerunner | I Miss You | 18 | 0.549 | 0.718 | 0.0367 | drum-and-bass | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
17240 | DJ Tray | Stop Playing - Jersey Club | 22 | 0.868 | 0.651 | 0.5450 | club | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
11531 | Amy Winehouse | Me & Mr Jones | 66 | 0.583 | 0.486 | 0.5130 | british | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7826 | Pickin' On Series | Lovefool - Bluegrass Rendition of the Cardigans | 21 | 0.765 | 0.313 | 0.7540 | bluegrass | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
105171 | bladecut | forever more | 42 | 0.672 | 0.385 | 0.0918 | study | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5000 rows × 121 columns
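As an aside, pandas can build essentially the same indicator columns in one call; a hedged alternative sketch (not used below, where we stick with the encoder output):
genre_dummies = pd.get_dummies(df['track_genre'], prefix='track_genre')  # one 0/1 column per genre
genre_dummies.shape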
reg = LinearRegression(fit_intercept=False) # no separate intercept, since the one-hot genre columns already act as per-genre intercepts
reg.fit(df2[cols+new_cols], df2['popularity']) # cols and new_cols are our independent variables, while popularity is our dependent variable
LinearRegression(fit_intercept=False)
pd.Series(reg.coef_, index=reg.feature_names_in_)
energy -0.403090
valence -4.584493
danceability 10.108819
track_genre_acoustic 39.078027
track_genre_afrobeat 21.509349
...
track_genre_techno 34.199210
track_genre_trance 34.507869
track_genre_trip-hop 29.400029
track_genre_turkish 35.722974
track_genre_world-music 37.985017
Length: 117, dtype: float64
pd.Series(reg.coef_, index=reg.feature_names_in_).sort_values(ascending=False, key=abs)
track_genre_k-pop 51.243638
track_genre_pop-film 50.919686
track_genre_pop 48.092857
track_genre_anime 47.498782
track_genre_indian 47.458272
...
track_genre_latin 5.357669
valence -4.584493
track_genre_romance 1.786777
track_genre_iranian 0.850650
energy -0.403090
Length: 117, dtype: float64
From this, it seems that energy and valence are not big indicators of a song's popularity; danceability matters more. K-pop also appears to be the most popular genre (possibly because of the danceability of its songs?).
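The coefficients alone don't tell us how accurately this linear model predicts popularity, so as a rough check we could evaluate it with the same train/test split idea and error metric used below; a sketch (the random_state here is arbitrary):
X_tr, X_te, y_tr, y_te = train_test_split(
    df2[cols + new_cols], df2['popularity'], train_size=0.8, random_state=0
)
lin = LinearRegression(fit_intercept=False)
lin.fit(X_tr, y_tr)
mean_absolute_error(lin.predict(X_te), y_te)  # test MAE on the 0-100 popularity scale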
I am now going to use KNeighborsRegressor and KNeighborsClassifier to see which of them produces better results.
X_train, X_test, y_train, y_test = train_test_split(df2[cols], df2['popularity'], train_size=0.8) # split the data into a training and test set
reg2 = KNeighborsRegressor(n_neighbors=10)
reg2.fit(X_train, y_train)
KNeighborsRegressor(n_neighbors=10)
reg2.predict(X_train)
array([26.9, 40.6, 36.6, ..., 10.4, 38.8, 44.7])
mean_absolute_error(reg2.predict(X_train), y_train)
17.009625
mean_absolute_error(reg2.predict(X_test), y_test)
18.2267
Comparing the mean absolute errors, the test error is only slightly larger than the training error, which suggests we are not badly overfitting the data with n_neighbors=10.
def get_scores(k):
    K_reg = KNeighborsRegressor(n_neighbors=k)
    K_reg.fit(X_train, y_train)
    train_error = mean_absolute_error(K_reg.predict(X_train), y_train)
    test_error = mean_absolute_error(K_reg.predict(X_test), y_test)
    return (train_error, test_error)
We will now check which values of k give the smallest test error.
reg_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})
reg_scores
k | train_error | test_error | |
---|---|---|---|
0 | 1 | NaN | NaN |
1 | 2 | NaN | NaN |
2 | 3 | NaN | NaN |
3 | 4 | NaN | NaN |
4 | 5 | NaN | NaN |
... | ... | ... | ... |
144 | 145 | NaN | NaN |
145 | 146 | NaN | NaN |
146 | 147 | NaN | NaN |
147 | 148 | NaN | NaN |
148 | 149 | NaN | NaN |
149 rows × 3 columns
for i in reg_scores.index:
    reg_scores.loc[i,["train_error","test_error"]] = get_scores(reg_scores.loc[i,"k"])
reg_scores
k | train_error | test_error | |
---|---|---|---|
0 | 1 | 0.057750 | 21.320000 |
1 | 2 | 11.319125 | 19.648000 |
2 | 3 | 13.713833 | 18.821333 |
3 | 4 | 14.815188 | 18.813750 |
4 | 5 | 15.672650 | 18.877800 |
... | ... | ... | ... |
144 | 145 | 18.039840 | 18.142124 |
145 | 146 | 18.039740 | 18.138623 |
146 | 147 | 18.043389 | 18.138891 |
147 | 148 | 18.051356 | 18.140041 |
148 | 149 | 18.053909 | 18.137027 |
149 rows × 3 columns
(reg_scores["test_error"]).min()
18.101453781512603
Our earlier choice of n_neighbors=10 gave a test error of about 18.2, which is close to this minimum, so it was a reasonable choice.
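To see which k actually attains that minimum (rather than just the minimum value itself), we could look it up with idxmin; a small sketch:
best = reg_scores.loc[reg_scores['test_error'].idxmin()]  # row with the smallest test error
best[['k', 'test_error']]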
reg_scores["kinv"] = 1/reg_scores.k
Since higher k values correspond to lower flexibility, we added a column above with the reciprocal of k, so that flexibility increases from left to right in the chart below.
reg_scores
k | train_error | test_error | kinv | |
---|---|---|---|---|
0 | 1 | 0.057750 | 21.320000 | 1.000000 |
1 | 2 | 11.319125 | 19.648000 | 0.500000 |
2 | 3 | 13.713833 | 18.821333 | 0.333333 |
3 | 4 | 14.815188 | 18.813750 | 0.250000 |
4 | 5 | 15.672650 | 18.877800 | 0.200000 |
... | ... | ... | ... | ... |
144 | 145 | 18.039840 | 18.142124 | 0.006897 |
145 | 146 | 18.039740 | 18.138623 | 0.006849 |
146 | 147 | 18.043389 | 18.138891 | 0.006803 |
147 | 148 | 18.051356 | 18.140041 | 0.006757 |
148 | 149 | 18.053909 | 18.137027 | 0.006711 |
149 rows × 4 columns
reg_train = alt.Chart(reg_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
)
reg_test = alt.Chart(reg_scores).mark_line(color="orange").encode(
    x = "kinv",
    y = "test_error"
)
reg_train+reg_test
On the right side of the chart (small k, high flexibility) the training error sits far below the test error, which indicates overfitting; toward the left (large k, low flexibility) the two curves converge and flatten, which indicates underfitting.
We will now compare the results using KNeighborsClassifier.
clf = KNeighborsClassifier(n_neighbors=7)
clf.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=7)
mean_absolute_error(clf.predict(X_train), y_train)
24.1935
mean_absolute_error(clf.predict(X_test), y_test)
27.824
The mean absolute error for the test set is somewhat larger than for the training set, so n_neighbors = 7 overfits a little, but not severely.
def get_clf_scores(k):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    train_error = mean_absolute_error(clf.predict(X_train), y_train)
    test_error = mean_absolute_error(clf.predict(X_test), y_test)
    return (train_error, test_error)
clf_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})
for i in clf_scores.index:
    clf_scores.loc[i,["train_error","test_error"]] = get_clf_scores(clf_scores.loc[i,"k"])
The process is the same as for KNeighborsRegressor.
clf_scores
k | train_error | test_error | |
---|---|---|---|
0 | 1 | 0.05775 | 21.320 |
1 | 2 | 11.63450 | 21.997 |
2 | 3 | 17.05800 | 23.735 |
3 | 4 | 20.35475 | 25.824 |
4 | 5 | 22.30100 | 26.822 |
... | ... | ... | ... |
144 | 145 | 32.94375 | 33.060 |
145 | 146 | 32.94425 | 33.025 |
146 | 147 | 32.93925 | 33.008 |
147 | 148 | 32.93375 | 33.008 |
148 | 149 | 32.93800 | 33.043 |
149 rows × 3 columns
clf_scores["test_error"].min()
21.32
Using n_neighbors=7 wasn't a great choice, since its test error of about 27.8 is well above this minimum of about 21.3.
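Because KNeighborsClassifier treats every popularity value as a separate class, mean absolute error is an unusual way to score it; for comparison, a hedged sketch of the more conventional accuracy score (the exact value depends on the split):
from sklearn.metrics import accuracy_score
accuracy_score(y_test, clf.predict(X_test))  # fraction of test tracks whose exact popularity value is predicted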
clf_scores["kinv"] = 1/clf_scores.k
clftrain = alt.Chart(clf_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
)
clftest = alt.Chart(clf_scores).mark_line(color="orange").encode(
    x = "kinv",
    y = "test_error"
).properties(
    title="Error",
)
clftrain+clftest
There is high flexibility and variance on the right side of the chart (small k), where the model overfits; as k grows (moving left), both errors rise and the model underfits.
Summary#
From comparing KNeighborsRegressor and KNeighborsClassifier, we can see that KNeighborsRegressor is the better choice for our dataset. Its test error is around 18 points on the 0-100 popularity scale, so we can expect predictions of popularity to be off by roughly that much on average. We can conclude that there is some correlation between a song's popularity and its genre, together with features like danceability.
References#
Dataset source: https://www.kaggle.com/code/kelvinzeng/spotify-tracks-analysis
Other references that were helpful:
https://christopherdavisuci.github.io/UCI-Math-10-W22/Proj/StudentProjects/DanaAlbakri.html
https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html
https://christopherdavisuci.github.io/UCI-Math-10-F22/Week7/Week7-Wednesday.html#including-a-categorical-variable-in-our-linear-regression
https://christopherdavisuci.github.io/UCI-Math-10-F22/Week6/Week6-Friday.html#linear-regression-using-a-categorical-variable