Spotify Popular Songs Correlation#

Author: Daniel Lim, djlim2@uci.edu

Course Project, UC Irvine, Math 10, F22

Introduction#

This dataset contains a wide range of songs from Spotify, with columns describing each track's artist, its popularity, its duration, and so on. My goal is to see whether there is a correlation between a song's popularity and its genre, using audio features such as energy, valence, and danceability.

Cleaning Up Data#

We first need to clean up the data: drop the columns that won't be of use to us and remove rows with missing values.

import pandas as pd
import altair as alt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import mean_absolute_error
df = pd.read_csv('dataset.csv')
df
Unnamed: 0 track_id artists album_name track_name popularity duration_ms explicit danceability energy ... loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature track_genre
0 0 5SuOikwiRyPMVoIQDJUgSV Gen Hoshino Comedy Comedy 73 230666 False 0.676 0.4610 ... -6.746 0 0.1430 0.0322 0.000001 0.3580 0.7150 87.917 4 acoustic
1 1 4qPNDBW1i3p13qLCt0Ki3A Ben Woodward Ghost (Acoustic) Ghost - Acoustic 55 149610 False 0.420 0.1660 ... -17.235 1 0.0763 0.9240 0.000006 0.1010 0.2670 77.489 4 acoustic
2 2 1iJBSr7s7jYXzM8EGcbK5b Ingrid Michaelson;ZAYN To Begin Again To Begin Again 57 210826 False 0.438 0.3590 ... -9.734 1 0.0557 0.2100 0.000000 0.1170 0.1200 76.332 4 acoustic
3 3 6lfxq3CG4xtTiEg7opyCyx Kina Grannis Crazy Rich Asians (Original Motion Picture Sou... Can't Help Falling In Love 71 201933 False 0.266 0.0596 ... -18.515 1 0.0363 0.9050 0.000071 0.1320 0.1430 181.740 3 acoustic
4 4 5vjLSffimiIP26QG5WcN2K Chord Overstreet Hold On Hold On 82 198853 False 0.618 0.4430 ... -9.681 1 0.0526 0.4690 0.000000 0.0829 0.1670 119.949 4 acoustic
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
113995 113995 2C3TZjDRiAzdyViavDJ217 Rainy Lullaby #mindfulness - Soft Rain for Mindful Meditatio... Sleep My Little Boy 21 384999 False 0.172 0.2350 ... -16.393 1 0.0422 0.6400 0.928000 0.0863 0.0339 125.995 5 world-music
113996 113996 1hIz5L4IB9hN3WRYPOCGPw Rainy Lullaby #mindfulness - Soft Rain for Mindful Meditatio... Water Into Light 22 385000 False 0.174 0.1170 ... -18.318 0 0.0401 0.9940 0.976000 0.1050 0.0350 85.239 4 world-music
113997 113997 6x8ZfSoqDjuNa5SVP5QjvX Cesária Evora Best Of Miss Perfumado 22 271466 False 0.629 0.3290 ... -10.895 0 0.0420 0.8670 0.000000 0.0839 0.7430 132.378 4 world-music
113998 113998 2e6sXL2bYv4bSz6VTdnfLs Michael W. Smith Change Your World Friends 41 283893 False 0.587 0.5060 ... -10.889 1 0.0297 0.3810 0.000000 0.2700 0.4130 135.960 4 world-music
113999 113999 2hETkH7cOfqmz3LqZDHZf5 Cesária Evora Miss Perfumado Barbincor 22 241826 False 0.526 0.4870 ... -10.204 0 0.0725 0.6810 0.000000 0.0893 0.7080 79.198 4 world-music

114000 rows × 21 columns

Since the columns we want to look at are already numeric, we don't need to convert them.

df.columns, df.dtypes
(Index(['Unnamed: 0', 'track_id', 'artists', 'album_name', 'track_name',
        'popularity', 'duration_ms', 'explicit', 'danceability', 'energy',
        'key', 'loudness', 'mode', 'speechiness', 'acousticness',
        'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
        'track_genre'],
       dtype='object'),
 Unnamed: 0            int64
 track_id             object
 artists              object
 album_name           object
 track_name           object
 popularity            int64
 duration_ms           int64
 explicit               bool
 danceability        float64
 energy              float64
 key                   int64
 loudness            float64
 mode                  int64
 speechiness         float64
 acousticness        float64
 instrumentalness    float64
 liveness            float64
 valence             float64
 tempo               float64
 time_signature        int64
 track_genre          object
 dtype: object)
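
For reference, a one-line way to list just the numeric columns (a sketch; not part of the original notebook):

df.select_dtypes(include="number").columns # list only the numeric columns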

Dropping these two columns at the start, since they are not relevant to the analysis.

df.drop(['Unnamed: 0', 'track_id'], axis=1, inplace=True)
df
artists album_name track_name popularity duration_ms explicit danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature track_genre
0 Gen Hoshino Comedy Comedy 73 230666 False 0.676 0.4610 1 -6.746 0 0.1430 0.0322 0.000001 0.3580 0.7150 87.917 4 acoustic
1 Ben Woodward Ghost (Acoustic) Ghost - Acoustic 55 149610 False 0.420 0.1660 1 -17.235 1 0.0763 0.9240 0.000006 0.1010 0.2670 77.489 4 acoustic
2 Ingrid Michaelson;ZAYN To Begin Again To Begin Again 57 210826 False 0.438 0.3590 0 -9.734 1 0.0557 0.2100 0.000000 0.1170 0.1200 76.332 4 acoustic
3 Kina Grannis Crazy Rich Asians (Original Motion Picture Sou... Can't Help Falling In Love 71 201933 False 0.266 0.0596 0 -18.515 1 0.0363 0.9050 0.000071 0.1320 0.1430 181.740 3 acoustic
4 Chord Overstreet Hold On Hold On 82 198853 False 0.618 0.4430 2 -9.681 1 0.0526 0.4690 0.000000 0.0829 0.1670 119.949 4 acoustic
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
113995 Rainy Lullaby #mindfulness - Soft Rain for Mindful Meditatio... Sleep My Little Boy 21 384999 False 0.172 0.2350 5 -16.393 1 0.0422 0.6400 0.928000 0.0863 0.0339 125.995 5 world-music
113996 Rainy Lullaby #mindfulness - Soft Rain for Mindful Meditatio... Water Into Light 22 385000 False 0.174 0.1170 0 -18.318 0 0.0401 0.9940 0.976000 0.1050 0.0350 85.239 4 world-music
113997 Cesária Evora Best Of Miss Perfumado 22 271466 False 0.629 0.3290 0 -10.895 0 0.0420 0.8670 0.000000 0.0839 0.7430 132.378 4 world-music
113998 Michael W. Smith Change Your World Friends 41 283893 False 0.587 0.5060 7 -10.889 1 0.0297 0.3810 0.000000 0.2700 0.4130 135.960 4 world-music
113999 Cesária Evora Miss Perfumado Barbincor 22 241826 False 0.526 0.4870 1 -10.204 0 0.0725 0.6810 0.000000 0.0893 0.7080 79.198 4 world-music

114000 rows × 19 columns

Checking for missing values in this column specifically, since a track without an artist makes no sense.

df['artists'].isnull().sum() # count of missing values in the 'artists' column
1
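
As a broader check (a sketch, not run in the original notebook), the missing-value counts for every column could be inspected at once:

df.isnull().sum().sort_values(ascending=False).head() # columns with the most missing values
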
df[df['artists'].isnull()]
artists album_name track_name popularity duration_ms explicit danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature track_genre
65900 NaN NaN NaN 0 0 False 0.501 0.583 7 -9.46 0 0.0605 0.69 0.00396 0.0747 0.734 138.391 4 k-pop
df.drop([65900], inplace=True) # dropped row with missing value 
df
artists album_name track_name popularity duration_ms explicit danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature track_genre
0 Gen Hoshino Comedy Comedy 73 230666 False 0.676 0.4610 1 -6.746 0 0.1430 0.0322 0.000001 0.3580 0.7150 87.917 4 acoustic
1 Ben Woodward Ghost (Acoustic) Ghost - Acoustic 55 149610 False 0.420 0.1660 1 -17.235 1 0.0763 0.9240 0.000006 0.1010 0.2670 77.489 4 acoustic
2 Ingrid Michaelson;ZAYN To Begin Again To Begin Again 57 210826 False 0.438 0.3590 0 -9.734 1 0.0557 0.2100 0.000000 0.1170 0.1200 76.332 4 acoustic
3 Kina Grannis Crazy Rich Asians (Original Motion Picture Sou... Can't Help Falling In Love 71 201933 False 0.266 0.0596 0 -18.515 1 0.0363 0.9050 0.000071 0.1320 0.1430 181.740 3 acoustic
4 Chord Overstreet Hold On Hold On 82 198853 False 0.618 0.4430 2 -9.681 1 0.0526 0.4690 0.000000 0.0829 0.1670 119.949 4 acoustic
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
113995 Rainy Lullaby #mindfulness - Soft Rain for Mindful Meditatio... Sleep My Little Boy 21 384999 False 0.172 0.2350 5 -16.393 1 0.0422 0.6400 0.928000 0.0863 0.0339 125.995 5 world-music
113996 Rainy Lullaby #mindfulness - Soft Rain for Mindful Meditatio... Water Into Light 22 385000 False 0.174 0.1170 0 -18.318 0 0.0401 0.9940 0.976000 0.1050 0.0350 85.239 4 world-music
113997 Cesária Evora Best Of Miss Perfumado 22 271466 False 0.629 0.3290 0 -10.895 0 0.0420 0.8670 0.000000 0.0839 0.7430 132.378 4 world-music
113998 Michael W. Smith Change Your World Friends 41 283893 False 0.587 0.5060 7 -10.889 1 0.0297 0.3810 0.000000 0.2700 0.4130 135.960 4 world-music
113999 Cesária Evora Miss Perfumado Barbincor 22 241826 False 0.526 0.4870 1 -10.204 0 0.0725 0.6810 0.000000 0.0893 0.7080 79.198 4 world-music

113999 rows × 19 columns

df.columns
Index(['artists', 'album_name', 'track_name', 'popularity', 'duration_ms',
       'explicit', 'danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'time_signature', 'track_genre'],
      dtype='object')

I'm grouping by 'track_genre' and computing the mean 'popularity' within each genre, just to get an idea of which genres are the most popular.

pop_score = df.groupby('track_genre')['popularity'].mean().sort_values(ascending=False)
pop_score
track_genre
pop-film          59.283000
k-pop             56.952953
chill             53.651000
sad               52.379000
grunge            49.594000
                    ...    
chicago-house     12.339000
detroit-techno    11.174000
latin              8.297000
romance            3.245000
iranian            2.210000
Name: popularity, Length: 114, dtype: float64
df = df[['artists', 'track_name', 'popularity', 'danceability', 'energy', 'valence', 'track_genre']]
df # selecting only the columns we want to keep
artists track_name popularity danceability energy valence track_genre
0 Gen Hoshino Comedy 73 0.676 0.4610 0.7150 acoustic
1 Ben Woodward Ghost - Acoustic 55 0.420 0.1660 0.2670 acoustic
2 Ingrid Michaelson;ZAYN To Begin Again 57 0.438 0.3590 0.1200 acoustic
3 Kina Grannis Can't Help Falling In Love 71 0.266 0.0596 0.1430 acoustic
4 Chord Overstreet Hold On 82 0.618 0.4430 0.1670 acoustic
... ... ... ... ... ... ... ...
113995 Rainy Lullaby Sleep My Little Boy 21 0.172 0.2350 0.0339 world-music
113996 Rainy Lullaby Water Into Light 22 0.174 0.1170 0.0350 world-music
113997 Cesária Evora Miss Perfumado 22 0.629 0.3290 0.7430 world-music
113998 Michael W. Smith Friends 41 0.587 0.5060 0.4130 world-music
113999 Cesária Evora Barbincor 22 0.526 0.4870 0.7080 world-music

113999 rows × 7 columns

Main Portion of Project#

The reason I take a sample of 5,000 rows (the maximum number of rows Altair will chart by default) is to see whether the outcome stays the same or similar across different samples; I tested it about 5 times and the results were similar.

df = df.sample(5000, random_state=32454463)
df
artists track_name popularity danceability energy valence track_genre
106440 Lars Winnerbäck Vem som helst blues 39 0.462 0.795 0.6840 swedish
14671 The Kiboomers Baa Baa Black Sheep 41 0.704 0.236 0.7220 children
58917 Youth Code;King Yosef Claw / Crawl 20 0.509 0.953 0.0389 industrial
77494 Vou pro Sereno;Xande De Pilares Marinheiro Só / Cada Macaco no seu Galho (Chô ... 46 0.405 0.748 0.7610 pagode
24942 Drexciya Red Hills of Lardossa 7 0.695 0.837 0.2060 detroit-techno
... ... ... ... ... ... ... ...
27589 Bladerunner I Miss You 18 0.549 0.718 0.0367 drum-and-bass
17240 DJ Tray Stop Playing - Jersey Club 22 0.868 0.651 0.5450 club
11531 Amy Winehouse Me & Mr Jones 66 0.583 0.486 0.5130 british
7826 Pickin' On Series Lovefool - Bluegrass Rendition of the Cardigans 21 0.765 0.313 0.7540 bluegrass
105171 bladecut forever more 42 0.672 0.385 0.0918 study

5000 rows × 7 columns
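
As an aside, 5,000 rows is also Altair's default limit on how many rows it will chart. If the full dataset were ever needed in a chart, that cap could be lifted (a sketch; not done in this project):

alt.data_transformers.disable_max_rows() # would remove Altair's 5,000-row cap; not used here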

alt.Chart(df).mark_bar().encode(
    x = 'energy', 
    y = 'valence', 
    color = alt.Color('popularity',scale=alt.Scale(scheme='turbo')),
    tooltip = ['track_genre']
)

Regardless of how many times it is run, songs with low energy and valence tend to be less popular; those with medium (around 0.5) energy and valence range from not very popular to semi-popular (roughly 20-70); and those with high energy and valence range from about 0 to 40 in popularity.
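
A quick numeric check of these visual impressions (a sketch using the sampled df above; the values are not computed in the original) is the pairwise correlation between popularity and the audio features:

df[["popularity", "energy", "valence", "danceability"]].corr() # pairwise correlations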

brush = alt.selection_interval(encodings=["x"], init={"x": [0,100]}) # step 1

c1 = alt.Chart(df).mark_circle().encode(
    x="energy",
    y="popularity",
    color=alt.condition(brush, "track_genre", alt.value("orchid"))
).add_selection(brush) # step 2

c2 = alt.Chart(df).mark_bar().encode(
    x="track_genre",
    y=alt.Y("count()", scale=alt.Scale(domain=[0,80])),
    color="track_genre"
).transform_filter(brush)

alt.hconcat(c1,c2) # c1|c2

We can see from this graph that, when all the data points are included, pop-film does not have the highest count, which is what we would have expected based on 'pop_score'. This might be because 'energy' was the wrong feature to compare popularity against, rather than some other factor.

cols = ['energy', 'valence', 'danceability']

I am now going to use OneHotEncoder to convert the 'track_genre' column into numeric indicator columns (a NumPy array of 0s and 1s) so that it can be included in the linear regression.

encoder = OneHotEncoder()
encoder.fit(df[['track_genre']]) # 'track_genre' is a string column so we want to convert it into 1s and 0s
OneHotEncoder()
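
As a quick sanity check (a sketch, not run in the original), the encoded output should have one column per genre that appears in the sample:

encoder.transform(df[["track_genre"]]).shape # expected: (5000, number of distinct genres in the sample)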

We convert the feature names into a list so that we can later use them as new column names in df2.

new_cols = list(encoder.get_feature_names_out()) 
new_cols
['track_genre_acoustic',
 'track_genre_afrobeat',
 'track_genre_alt-rock',
 'track_genre_alternative',
 'track_genre_ambient',
 'track_genre_anime',
 'track_genre_black-metal',
 'track_genre_bluegrass',
 'track_genre_blues',
 'track_genre_brazil',
 'track_genre_breakbeat',
 'track_genre_british',
 'track_genre_cantopop',
 'track_genre_chicago-house',
 'track_genre_children',
 'track_genre_chill',
 'track_genre_classical',
 'track_genre_club',
 'track_genre_comedy',
 'track_genre_country',
 'track_genre_dance',
 'track_genre_dancehall',
 'track_genre_death-metal',
 'track_genre_deep-house',
 'track_genre_detroit-techno',
 'track_genre_disco',
 'track_genre_disney',
 'track_genre_drum-and-bass',
 'track_genre_dub',
 'track_genre_dubstep',
 'track_genre_edm',
 'track_genre_electro',
 'track_genre_electronic',
 'track_genre_emo',
 'track_genre_folk',
 'track_genre_forro',
 'track_genre_french',
 'track_genre_funk',
 'track_genre_garage',
 'track_genre_german',
 'track_genre_gospel',
 'track_genre_goth',
 'track_genre_grindcore',
 'track_genre_groove',
 'track_genre_grunge',
 'track_genre_guitar',
 'track_genre_happy',
 'track_genre_hard-rock',
 'track_genre_hardcore',
 'track_genre_hardstyle',
 'track_genre_heavy-metal',
 'track_genre_hip-hop',
 'track_genre_honky-tonk',
 'track_genre_house',
 'track_genre_idm',
 'track_genre_indian',
 'track_genre_indie',
 'track_genre_indie-pop',
 'track_genre_industrial',
 'track_genre_iranian',
 'track_genre_j-dance',
 'track_genre_j-idol',
 'track_genre_j-pop',
 'track_genre_j-rock',
 'track_genre_jazz',
 'track_genre_k-pop',
 'track_genre_kids',
 'track_genre_latin',
 'track_genre_latino',
 'track_genre_malay',
 'track_genre_mandopop',
 'track_genre_metal',
 'track_genre_metalcore',
 'track_genre_minimal-techno',
 'track_genre_mpb',
 'track_genre_new-age',
 'track_genre_opera',
 'track_genre_pagode',
 'track_genre_party',
 'track_genre_piano',
 'track_genre_pop',
 'track_genre_pop-film',
 'track_genre_power-pop',
 'track_genre_progressive-house',
 'track_genre_psych-rock',
 'track_genre_punk',
 'track_genre_punk-rock',
 'track_genre_r-n-b',
 'track_genre_reggae',
 'track_genre_reggaeton',
 'track_genre_rock',
 'track_genre_rock-n-roll',
 'track_genre_rockabilly',
 'track_genre_romance',
 'track_genre_sad',
 'track_genre_salsa',
 'track_genre_samba',
 'track_genre_sertanejo',
 'track_genre_show-tunes',
 'track_genre_singer-songwriter',
 'track_genre_ska',
 'track_genre_sleep',
 'track_genre_songwriter',
 'track_genre_soul',
 'track_genre_spanish',
 'track_genre_study',
 'track_genre_swedish',
 'track_genre_synth-pop',
 'track_genre_tango',
 'track_genre_techno',
 'track_genre_trance',
 'track_genre_trip-hop',
 'track_genre_turkish',
 'track_genre_world-music']
df2 = df.copy() # make a copy of df to be safe
df2[new_cols] = encoder.transform(df[["track_genre"]]).toarray() # add the one-hot encoded genre columns to df2
df2
artists track_name popularity danceability energy valence track_genre track_genre_acoustic track_genre_afrobeat track_genre_alt-rock ... track_genre_spanish track_genre_study track_genre_swedish track_genre_synth-pop track_genre_tango track_genre_techno track_genre_trance track_genre_trip-hop track_genre_turkish track_genre_world-music
106440 Lars Winnerbäck Vem som helst blues 39 0.462 0.795 0.6840 swedish 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14671 The Kiboomers Baa Baa Black Sheep 41 0.704 0.236 0.7220 children 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
58917 Youth Code;King Yosef Claw / Crawl 20 0.509 0.953 0.0389 industrial 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
77494 Vou pro Sereno;Xande De Pilares Marinheiro Só / Cada Macaco no seu Galho (Chô ... 46 0.405 0.748 0.7610 pagode 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
24942 Drexciya Red Hills of Lardossa 7 0.695 0.837 0.2060 detroit-techno 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
27589 Bladerunner I Miss You 18 0.549 0.718 0.0367 drum-and-bass 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
17240 DJ Tray Stop Playing - Jersey Club 22 0.868 0.651 0.5450 club 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
11531 Amy Winehouse Me & Mr Jones 66 0.583 0.486 0.5130 british 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7826 Pickin' On Series Lovefool - Bluegrass Rendition of the Cardigans 21 0.765 0.313 0.7540 bluegrass 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
105171 bladecut forever more 42 0.672 0.385 0.0918 study 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5000 rows × 121 columns
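
For reference, pandas can build the same indicator columns directly with get_dummies; this sketch is an alternative to the OneHotEncoder route used above, not what the project actually does:

df2_alt = pd.concat([df, pd.get_dummies(df["track_genre"], prefix="track_genre")], axis=1) # alternative one-hot encoding
df2_alt.shape # should match df2: 5000 rows, 7 original columns plus one per genre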

reg = LinearRegression(fit_intercept=False) # no separate intercept; the genre indicator columns play that role
reg.fit(df2[cols+new_cols], df2['popularity']) # cols and new_cols are our independent variables, while popularity is our dependent variable
LinearRegression(fit_intercept=False)
pd.Series(reg.coef_, index=reg.feature_names_in_)
energy                     -0.403090
valence                    -4.584493
danceability               10.108819
track_genre_acoustic       39.078027
track_genre_afrobeat       21.509349
                             ...    
track_genre_techno         34.199210
track_genre_trance         34.507869
track_genre_trip-hop       29.400029
track_genre_turkish        35.722974
track_genre_world-music    37.985017
Length: 117, dtype: float64
pd.Series(reg.coef_, index=reg.feature_names_in_).sort_values(ascending=False, key=abs)
track_genre_k-pop       51.243638
track_genre_pop-film    50.919686
track_genre_pop         48.092857
track_genre_anime       47.498782
track_genre_indian      47.458272
                          ...    
track_genre_latin        5.357669
valence                 -4.584493
track_genre_romance      1.786777
track_genre_iranian      0.850650
energy                  -0.403090
Length: 117, dtype: float64

From this, it seems that energy and valence are not big indicators of a song's popularity; danceability matters much more. K-pop also appears to be the most popular genre (possibly because of the danceability of its songs?).
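
To gauge how much of the variation in popularity these features explain, one could also look at the model's R² score (a sketch; this value is not reported in the original notebook):

reg.score(df2[cols + new_cols], df2["popularity"]) # R^2 of the linear fit on the data it was trained on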

I am going to use KNeighborsRegressor and KNeighborsClassifier to see which of them produces better results.

X_train, X_test, y_train, y_test = train_test_split(df2[cols], df2['popularity'], train_size=0.8) # split the data into a training and test set
reg2 = KNeighborsRegressor(n_neighbors=10)
reg2.fit(X_train, y_train)
KNeighborsRegressor(n_neighbors=10)
reg2.predict(X_train)
array([26.9, 40.6, 36.6, ..., 10.4, 38.8, 44.7])
mean_absolute_error(reg2.predict(X_train), y_train)
17.009625
mean_absolute_error(reg2.predict(X_test), y_test)
18.2267

Using mean absolute error to compare the training and test sets, the test error (about 18.2) is only slightly larger than the training error (about 17.0), which suggests we are not badly overfitting the data with n_neighbors=10.
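
For context (a sketch, not part of the original analysis), this error could be compared with a trivial baseline that always predicts the training set's mean popularity:

baseline = np.full(len(y_test), y_train.mean()) # baseline prediction: the mean popularity of the training set
mean_absolute_error(baseline, y_test) # baseline MAE to compare against the k-NN errors above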

def get_scores(k):
    K_reg = KNeighborsRegressor(n_neighbors=k)
    K_reg.fit(X_train, y_train)
    train_error = mean_absolute_error(K_reg.predict(X_train), y_train)
    test_error = mean_absolute_error(K_reg.predict(X_test), y_test)
    return (train_error, test_error)

We will now see which value of k gives the smallest test error.

reg_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})
reg_scores
k train_error test_error
0 1 NaN NaN
1 2 NaN NaN
2 3 NaN NaN
3 4 NaN NaN
4 5 NaN NaN
... ... ... ...
144 145 NaN NaN
145 146 NaN NaN
146 147 NaN NaN
147 148 NaN NaN
148 149 NaN NaN

149 rows × 3 columns

for i in reg_scores.index:
    reg_scores.loc[i,["train_error","test_error"]] = get_scores(reg_scores.loc[i,"k"])
reg_scores
k train_error test_error
0 1 0.057750 21.320000
1 2 11.319125 19.648000
2 3 13.713833 18.821333
3 4 14.815188 18.813750
4 5 15.672650 18.877800
... ... ... ...
144 145 18.039840 18.142124
145 146 18.039740 18.138623
146 147 18.043389 18.138891
147 148 18.051356 18.140041
148 149 18.053909 18.137027

149 rows × 3 columns

(reg_scores["test_error"]).min()
18.101453781512603
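
The value of k that attains this minimum can be recovered directly (a sketch; the original output does not record it):

reg_scores.loc[reg_scores["test_error"].idxmin(), "k"] # k with the smallest test error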

This minimum (about 18.1) is very close to the test error of about 18.2 that we got with n_neighbors=10, so that was a reasonable choice.

reg_scores["kinv"] = 1/reg_scores.k

Since higher k values result in lower flexibility, we add a column with the reciprocal of k values.

reg_scores
k train_error test_error kinv
0 1 0.057750 21.320000 1.000000
1 2 11.319125 19.648000 0.500000
2 3 13.713833 18.821333 0.333333
3 4 14.815188 18.813750 0.250000
4 5 15.672650 18.877800 0.200000
... ... ... ... ...
144 145 18.039840 18.142124 0.006897
145 146 18.039740 18.138623 0.006849
146 147 18.043389 18.138891 0.006803
147 148 18.051356 18.140041 0.006757
148 149 18.053909 18.137027 0.006711

149 rows × 4 columns

reg_train = alt.Chart(reg_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
)
reg_test = alt.Chart(reg_scores).mark_line(color="orange").encode(
    x = "kinv",
    y = "test_error"
)
reg_train+reg_test

We can see that on the right side of the chart (small k, i.e. high flexibility) the training error is far below the test error, which is the high-variance regime; as k grows (moving left, lower flexibility) the two curves converge and the model heads toward underfitting.

clf = KNeighborsClassifier(n_neighbors=7)

We will now compare these results with KNeighborsClassifier.

clf.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=7)
mean_absolute_error(clf.predict(X_train), y_train)
24.1935
mean_absolute_error(clf.predict(X_test), y_test)
27.824

The mean absolute error for the test set (about 27.8) is noticeably larger than for the training set (about 24.2), so there is some overfitting with n_neighbors = 7, and both errors are much higher than the corresponding errors from KNeighborsRegressor.

def get_clf_scores(k):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    train_error = mean_absolute_error(clf.predict(X_train), y_train)
    test_error = mean_absolute_error(clf.predict(X_test), y_test)
    return (train_error, test_error)
clf_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})
for i in clf_scores.index:
    clf_scores.loc[i,["train_error","test_error"]] = get_clf_scores(clf_scores.loc[i,"k"])

The process is the same as for KNeighborsRegressor.

clf_scores
k train_error test_error
0 1 0.05775 21.320
1 2 11.63450 21.997
2 3 17.05800 23.735
3 4 20.35475 25.824
4 5 22.30100 26.822
... ... ... ...
144 145 32.94375 33.060
145 146 32.94425 33.025
146 147 32.93925 33.008
147 148 32.93375 33.008
148 149 32.93800 33.043

149 rows × 3 columns

clf_scores["test_error"].min()
21.32

Using n_neighbors=7 wasn't a great choice, since its test error of about 27.8 is somewhat far from the minimum test error of about 21.3.
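
Since KNeighborsClassifier treats each popularity value as a separate class label, mean absolute error is an unusual metric for it; as a sketch (not computed in the original), one could also check its plain classification accuracy:

clf.score(X_test, y_test) # fraction of test songs whose exact popularity label is predicted correctly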

clf_scores["kinv"] = 1/clf_scores.k
clftrain = alt.Chart(clf_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
)
clftest = alt.Chart(clf_scores).mark_line(color="orange").encode(
    x = "kinv",
    y = "test_error"
  ).properties(
      title= "Error",
       
    
)
clftrain+clftest

There is a great deal of flexibility and variance at small k, where the classifier badly overfits (training error near zero, test error around 21), and both errors climb toward about 33 as k increases.

Summary#

From comparing KNeighborsRegressor and KNeighborsClassifier, we can see that KNeighborsRegressor is the better choice for our dataset. Its test error is around 18 popularity points (on a 0-100 scale), so we can expect predictions of popularity to be off by roughly 18 points, or about 18%. We can conclude that there is some sort of correlation between a song's popularity and its genre.

References#

Your code above should include references. Here is some additional space for references.

  • What is the source of your dataset(s)? https://www.kaggle.com/code/kelvinzeng/spotify-tracks-analysis

  • List any other references that you found helpful:
    • https://christopherdavisuci.github.io/UCI-Math-10-W22/Proj/StudentProjects/DanaAlbakri.html
    • https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html
    • https://christopherdavisuci.github.io/UCI-Math-10-F22/Week7/Week7-Wednesday.html#including-a-categorical-variable-in-our-linear-regression
    • https://christopherdavisuci.github.io/UCI-Math-10-F22/Week6/Week6-Friday.html#linear-regression-using-a-categorical-variable

Submission#

Using the Share button at the top right, enable Comment privileges for anyone with a link to the project. Then submit that link on Canvas.
