Spotify dataset

Recording of lecture from 1/24/2022

The CSV file attached to this project was originally taken from a Kaggle dataset.

To start out, we want to plot Energy vs. Loudness using Altair.

To change the default colors, we can select a different color scheme.

import pandas as pd
import altair as alt
df = pd.read_csv("../data/spotify_dataset.csv")
df.head()
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord
0 1 1 8 2021-07-23--2021-07-30 Beggin' 48,633,449 Måneskin 3377762 3Wrjm47oTz2sjIgck11l5e ['indie rock italiano', 'italian pop'] ... 0.714 0.8 -4.808 0.0504 0.127 0.359 134.002 211560 0.589 B
1 2 2 3 2021-07-23--2021-07-30 STAY (with Justin Bieber) 47,248,719 The Kid LAROI 2230022 5HCyWlXZPP0y6Gqq8TgA20 ['australian hip hop'] ... 0.591 0.764 -5.484 0.0483 0.0383 0.103 169.928 141806 0.478 C#/Db
2 3 1 11 2021-06-25--2021-07-02 good 4 u 40,162,559 Olivia Rodrigo 6266514 4ZtFanR9U6ndgddUvNcjcG ['pop'] ... 0.563 0.664 -5.044 0.154 0.335 0.0849 166.928 178147 0.688 A
3 4 3 5 2021-07-02--2021-07-09 Bad Habits 37,799,456 Ed Sheeran 83293380 6PQ88X9TkUIAUIZJHW2upE ['pop', 'uk pop'] ... 0.808 0.897 -3.712 0.0348 0.0469 0.364 126.026 231041 0.591 B
4 5 5 1 2021-07-23--2021-07-30 INDUSTRY BABY (feat. Jack Harlow) 33,948,454 Lil Nas X 5473565 27NovPIUIRrOZoCHxABJwK ['lgbtq+ hip hop', 'pop rap'] ... 0.736 0.704 -7.409 0.0615 0.0203 0.0501 149.995 212000 0.894 D#/Eb

5 rows × 23 columns

If you try to make a chart directly from df with the following code, it does not work.

alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness"
)

A first guess is that df is too long (by default, Altair only works with DataFrames that have 5000 rows or fewer), so we try plotting just the first five rows.

alt.Chart(df.iloc[:5]).mark_circle().encode(
    x = "Energy",
    y = "Loudness"
)
alt.Chart(df[:5]).mark_circle().encode(
    x = "Energy",
    y = "Loudness"
)
type(df.head())
pandas.core.frame.DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1556 entries, 0 to 1555
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Index                      1556 non-null   int64 
 1   Highest Charting Position  1556 non-null   int64 
 2   Number of Times Charted    1556 non-null   int64 
 3   Week of Highest Charting   1556 non-null   object
 4   Song Name                  1556 non-null   object
 5   Streams                    1556 non-null   object
 6   Artist                     1556 non-null   object
 7   Artist Followers           1556 non-null   object
 8   Song ID                    1556 non-null   object
 9   Genre                      1556 non-null   object
 10  Release Date               1556 non-null   object
 11  Weeks Charted              1556 non-null   object
 12  Popularity                 1556 non-null   object
 13  Danceability               1556 non-null   object
 14  Energy                     1556 non-null   object
 15  Loudness                   1556 non-null   object
 16  Speechiness                1556 non-null   object
 17  Acousticness               1556 non-null   object
 18  Liveness                   1556 non-null   object
 19  Tempo                      1556 non-null   object
 20  Duration (ms)              1556 non-null   object
 21  Valence                    1556 non-null   object
 22  Chord                      1556 non-null   object
dtypes: int64(3), object(20)
memory usage: 279.7+ KB
df.Artist.value_counts()
Taylor Swift                     52
Lil Uzi Vert                     32
Justin Bieber                    32
Juice WRLD                       30
Pop Smoke                        29
                                 ..
Chris Brown, Young Thug           1
Rauw Alejandro, J Balvin          1
347aidan                          1
Migrantes, Alico                  1
Dadá Boladão, Tati Zaqui, OIK     1
Name: Artist, Length: 716, dtype: int64
df["Artist"].value_counts()
Taylor Swift                     52
Lil Uzi Vert                     32
Justin Bieber                    32
Juice WRLD                       30
Pop Smoke                        29
                                 ..
Chris Brown, Young Thug           1
Rauw Alejandro, J Balvin          1
347aidan                          1
Migrantes, Alico                  1
Dadá Boladão, Tati Zaqui, OIK     1
Name: Artist, Length: 716, dtype: int64
df[:3]
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord
0 1 1 8 2021-07-23--2021-07-30 Beggin' 48,633,449 Måneskin 3377762 3Wrjm47oTz2sjIgck11l5e ['indie rock italiano', 'italian pop'] ... 0.714 0.8 -4.808 0.0504 0.127 0.359 134.002 211560 0.589 B
1 2 2 3 2021-07-23--2021-07-30 STAY (with Justin Bieber) 47,248,719 The Kid LAROI 2230022 5HCyWlXZPP0y6Gqq8TgA20 ['australian hip hop'] ... 0.591 0.764 -5.484 0.0483 0.0383 0.103 169.928 141806 0.478 C#/Db
2 3 1 11 2021-06-25--2021-07-02 good 4 u 40,162,559 Olivia Rodrigo 6266514 4ZtFanR9U6ndgddUvNcjcG ['pop'] ... 0.563 0.664 -5.044 0.154 0.335 0.0849 166.928 178147 0.688 A

3 rows × 23 columns

# Not too long for Altair (5000 is the cutoff for Altair)
len(df)
1556
df.shape
(1556, 23)
x = df.shape[0]
x
1556
df.loc[10,"Energy"]
'0.825'
df.dtypes
Index                         int64
Highest Charting Position     int64
Number of Times Charted       int64
Week of Highest Charting     object
Song Name                    object
Streams                      object
Artist                       object
Artist Followers             object
Song ID                      object
Genre                        object
Release Date                 object
Weeks Charted                object
Popularity                   object
Danceability                 object
Energy                       object
Loudness                     object
Speechiness                  object
Acousticness                 object
Liveness                     object
Tempo                        object
Duration (ms)                object
Valence                      object
Chord                        object
dtype: object
pd.to_numeric(df["Energy"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/miniconda3/envs/math11/lib/python3.9/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string " "

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5624/287173585.py in <module>
----> 1 pd.to_numeric(df["Energy"])

~/miniconda3/envs/math11/lib/python3.9/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
    181         coerce_numeric = errors not in ("ignore", "raise")
    182         try:
--> 183             values, _ = lib.maybe_convert_numeric(
    184                 values, set(), coerce_numeric=coerce_numeric
    185             )

~/miniconda3/envs/math11/lib/python3.9/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string " " at position 35
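The traceback above says that the string " " cannot be parsed as a number. One way to locate such entries without raising an error is the `errors="coerce"` option of `pd.to_numeric`, which turns unparseable strings into NaN. The following is a minimal sketch on a made-up Series (the name `energy` and its values are stand-ins, not the lecture's data).

```python
import pandas as pd

# Made-up stand-in for df["Energy"]: numeric strings plus one blank " " entry
energy = pd.Series(["0.8", "0.764", " ", "0.897"])

# errors="coerce" replaces anything that cannot be parsed with NaN
# instead of raising a ValueError
energy_num = pd.to_numeric(energy, errors="coerce")

# Index labels of the entries that could not be parsed
bad_positions = energy_num[energy_num.isna()].index.tolist()

print(energy_num.dtype)
print(bad_positions)
```

Note that this approach converts one column at a time, whereas the `na_values` approach used in the lecture handles every column while the file is being read.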
df.isna().any(axis=1)
0       False
1       False
2       False
3       False
4       False
        ...  
1551    False
1552    False
1553    False
1554    False
1555    False
Length: 1556, dtype: bool
df.isna().any(axis=1).sum()
0
"" == " "
False
# Tell pandas what missing values look like
df2 = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df2.dtypes
Index                          int64
Highest Charting Position      int64
Number of Times Charted        int64
Week of Highest Charting      object
Song Name                     object
Streams                       object
Artist                        object
Artist Followers             float64
Song ID                       object
Genre                         object
Release Date                  object
Weeks Charted                 object
Popularity                   float64
Danceability                 float64
Energy                       float64
Loudness                     float64
Speechiness                  float64
Acousticness                 float64
Liveness                     float64
Tempo                        float64
Duration (ms)                float64
Valence                      float64
Chord                         object
dtype: object
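To see the effect of `na_values=" "` in isolation, here is a small sketch that reads the same made-up three-row CSV twice from an in-memory string. The column names mimic the lecture's file, but the contents are invented for illustration.

```python
import io
import pandas as pd

# Made-up CSV; the second row has a single space where Energy should be
csv_text = "Song,Energy,Loudness\nA,0.8,-4.808\nB, ,-5.484\nC,0.664,-5.044\n"

# Default read: the " " entry keeps the whole Energy column as strings
raw = pd.read_csv(io.StringIO(csv_text))

# Telling pandas that " " means "missing" lets Energy be read as floats
clean = pd.read_csv(io.StringIO(csv_text), na_values=" ")

print(raw.dtypes)
print(clean.dtypes)
print(clean["Energy"].isna().sum())
```

In the default read, no value counts as missing, which matches what `df.isna().any(axis=1).sum()` reported for the original DataFrame.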
# Count the bad rows
df2.isna().any(axis=1).sum()
11
df2.isna().any(axis=1)
0       False
1       False
2       False
3       False
4       False
        ...  
1551    False
1552    False
1553    False
1554    False
1555    False
Length: 1556, dtype: bool
df[df2.isna().any(axis=1)]
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord
35 36 36 1 2021-07-23--2021-07-30 NOT SOBER (feat. Polo G & Stunna Gambino) 11,869,336 The Kid LAROI ...
163 164 5 39 2020-10-30--2020-11-06 34+35 5,453,159 Ariana Grande ...
464 465 118 1 2021-03-26--2021-04-02 Richer (feat. Polo G) 6,292,362 Rod Wave ...
530 531 20 5 2021-01-15--2021-01-22 34+35 Remix (feat. Doja Cat, Megan Thee Stalli... 6,162,453 Ariana Grande ...
636 637 22 6 2020-12-18--2020-12-25 Driving Home for Christmas - 2019 Remaster 8,804,531 Chris Rea ...
654 655 73 1 2020-12-18--2020-12-25 Thank God It's Christmas - Non-Album Single 10,509,961 Queen ...
750 751 19 20 2020-07-31--2020-08-07 Agua (with J Balvin) - Music From "Sponge On T... 5,358,940 Tainy ...
784 785 76 14 2020-09-04--2020-09-11 Lean (feat. Towy, Osquel, Beltito & Sammy & Fa... 4,739,241 Super Yei, Jone Quest ...
876 877 164 4 2020-09-18--2020-09-25 +Linda 4,964,708 Dalex ...
1140 1141 131 1 2020-05-29--2020-06-05 In meinem Benz 5,494,500 AK AUSSERKONTROLLE, Bonez MC ...
1538 1539 176 1 2020-01-03--2020-01-10 fuck, i'm lonely (with Anne-Marie) - from “13 ... 4,856,458 Lauv ...

11 rows × 23 columns

# Count the good rows
df2.notna().all(axis=1).sum()
1545
# Keep just the good rows
df3 = df2[df2.notna().all(axis=1)].copy()
df3
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord
0 1 1 8 2021-07-23--2021-07-30 Beggin' 48,633,449 Måneskin 3377762.0 3Wrjm47oTz2sjIgck11l5e ['indie rock italiano', 'italian pop'] ... 0.714 0.800 -4.808 0.0504 0.12700 0.3590 134.002 211560.0 0.589 B
1 2 2 3 2021-07-23--2021-07-30 STAY (with Justin Bieber) 47,248,719 The Kid LAROI 2230022.0 5HCyWlXZPP0y6Gqq8TgA20 ['australian hip hop'] ... 0.591 0.764 -5.484 0.0483 0.03830 0.1030 169.928 141806.0 0.478 C#/Db
2 3 1 11 2021-06-25--2021-07-02 good 4 u 40,162,559 Olivia Rodrigo 6266514.0 4ZtFanR9U6ndgddUvNcjcG ['pop'] ... 0.563 0.664 -5.044 0.1540 0.33500 0.0849 166.928 178147.0 0.688 A
3 4 3 5 2021-07-02--2021-07-09 Bad Habits 37,799,456 Ed Sheeran 83293380.0 6PQ88X9TkUIAUIZJHW2upE ['pop', 'uk pop'] ... 0.808 0.897 -3.712 0.0348 0.04690 0.3640 126.026 231041.0 0.591 B
4 5 5 1 2021-07-23--2021-07-30 INDUSTRY BABY (feat. Jack Harlow) 33,948,454 Lil Nas X 5473565.0 27NovPIUIRrOZoCHxABJwK ['lgbtq+ hip hop', 'pop rap'] ... 0.736 0.704 -7.409 0.0615 0.02030 0.0501 149.995 212000.0 0.894 D#/Eb
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1551 1552 195 1 2019-12-27--2020-01-03 New Rules 4,630,675 Dua Lipa 27167675.0 2ekn2ttSfGqwhhate0LSR0 ['dance pop', 'pop', 'uk pop'] ... 0.762 0.700 -6.021 0.0694 0.00261 0.1530 116.073 209320.0 0.608 A
1552 1553 196 1 2019-12-27--2020-01-03 Cheirosa - Ao Vivo 4,623,030 Jorge & Mateus 15019109.0 2PWjKmjyTZeDpmOUa3a5da ['sertanejo', 'sertanejo universitario'] ... 0.528 0.870 -3.123 0.0851 0.24000 0.3330 152.370 181930.0 0.714 B
1553 1554 197 1 2019-12-27--2020-01-03 Havana (feat. Young Thug) 4,620,876 Camila Cabello 22698747.0 1rfofaqEpACxVEHIZBJe6W ['dance pop', 'electropop', 'pop', 'post-teen ... ... 0.765 0.523 -4.333 0.0300 0.18400 0.1320 104.988 217307.0 0.394 D
1554 1555 198 1 2019-12-27--2020-01-03 Surtada - Remix Brega Funk 4,607,385 Dadá Boladão, Tati Zaqui, OIK 208630.0 5F8ffc8KWKNawllr5WsW0r ['brega funk', 'funk carioca'] ... 0.832 0.550 -7.026 0.0587 0.24900 0.1820 154.064 152784.0 0.881 F
1555 1556 199 1 2019-12-27--2020-01-03 Lover (Remix) [feat. Shawn Mendes] 4,595,450 Taylor Swift 42227614.0 3i9UVldZOE0aD0JnyfAZZ0 ['pop', 'post-teen pop'] ... 0.448 0.603 -7.176 0.0640 0.43300 0.0862 205.272 221307.0 0.422 G

1545 rows × 23 columns

df3.shape
(1545, 23)
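The Boolean-mask step used to build df3 can also be written with the `dropna` method, which by default drops every row containing at least one missing value. The sketch below checks that the two approaches agree on a made-up DataFrame (`toy` is a stand-in for df2, not the real data).

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df2: the middle row has a missing Energy value
toy = pd.DataFrame({
    "Song": ["A", "B", "C"],
    "Energy": [0.8, np.nan, 0.664],
    "Loudness": [-4.808, -5.484, -5.044],
})

# Boolean-mask approach, as in the lecture
kept_mask = toy[toy.notna().all(axis=1)].copy()

# dropna with default arguments removes rows with any missing value
kept_dropna = toy.dropna()

# Both approaches keep the same rows with the same index labels
same = kept_mask.equals(kept_dropna)
print(same)
print(kept_dropna.shape)
```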
alt.Chart(df2).mark_circle().encode(
    x = "Energy",
    y = "Loudness"
)
alt.Chart(df2).mark_circle().encode(
    x = alt.X("Energy", scale = alt.Scale(domain=(0.1,0.8))),
    y = "Loudness"
)
alt.Chart(df2).mark_circle(clip=True, color="Red",size=100).encode(
    x = alt.X("Energy", scale = alt.Scale(domain=(0.1,0.8))),
    y = "Loudness"
)
df2.dtypes
Index                          int64
Highest Charting Position      int64
Number of Times Charted        int64
Week of Highest Charting      object
Song Name                     object
Streams                       object
Artist                        object
Artist Followers             float64
Song ID                       object
Genre                         object
Release Date                  object
Weeks Charted                 object
Popularity                   float64
Danceability                 float64
Energy                       float64
Loudness                     float64
Speechiness                  float64
Acousticness                 float64
Liveness                     float64
Tempo                        float64
Duration (ms)                float64
Valence                      float64
Chord                         object
dtype: object
alt.Chart(df2).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = "Danceability"
)
alt.Chart(df2).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color("Tempo",scale=alt.Scale(scheme="turbo")),
    tooltip = "Artist"
)
# Are these the same?
# ["Artist"] vs list("Artist")
# No: list("Artist") is the list of the individual characters in "Artist"
list("Artist")
['A', 'r', 't', 'i', 's', 't']
sel = alt.selection_multi(fields=["Artist"])
c1 = alt.Chart(df2).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color("Tempo",scale=alt.Scale(scheme="turbo")),
    tooltip = "Artist"
).add_selection(
    sel
)
c2 = alt.Chart(df2).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color("Tempo",scale=alt.Scale(scheme="turbo")),
    tooltip = "Artist"
).transform_filter(
    sel
)
c1|c2

Try clicking on one or more of the points in the following chart. (Hold down shift while clicking to select multiple points.)

c3 = alt.Chart(df2).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color("Tempo",scale=alt.Scale(scheme="turbo")),
    opacity = alt.condition(sel,alt.value(1),alt.value(0.2)),
    size = alt.condition(sel,alt.value(400),alt.value(10)),
    tooltip = "Artist"
).add_selection(
    sel
)

c3