Spotify dataset
Recording of lecture from 1/24/2022
The csv file attached to this project was originally taken from this Kaggle dataset.
To start, we want to plot Energy against Loudness using Altair.
To change the default colors, we can select a different color scheme.
import pandas as pd
import altair as alt
df = pd.read_csv("../data/spotify_dataset.csv")
df.head()
| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48,633,449 | Måneskin | 3377762 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.714 | 0.8 | -4.808 | 0.0504 | 0.127 | 0.359 | 134.002 | 211560 | 0.589 | B |
1 | 2 | 2 | 3 | 2021-07-23--2021-07-30 | STAY (with Justin Bieber) | 47,248,719 | The Kid LAROI | 2230022 | 5HCyWlXZPP0y6Gqq8TgA20 | ['australian hip hop'] | ... | 0.591 | 0.764 | -5.484 | 0.0483 | 0.0383 | 0.103 | 169.928 | 141806 | 0.478 | C#/Db |
2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40,162,559 | Olivia Rodrigo | 6266514 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.563 | 0.664 | -5.044 | 0.154 | 0.335 | 0.0849 | 166.928 | 178147 | 0.688 | A |
3 | 4 | 3 | 5 | 2021-07-02--2021-07-09 | Bad Habits | 37,799,456 | Ed Sheeran | 83293380 | 6PQ88X9TkUIAUIZJHW2upE | ['pop', 'uk pop'] | ... | 0.808 | 0.897 | -3.712 | 0.0348 | 0.0469 | 0.364 | 126.026 | 231041 | 0.591 | B |
4 | 5 | 5 | 1 | 2021-07-23--2021-07-30 | INDUSTRY BABY (feat. Jack Harlow) | 33,948,454 | Lil Nas X | 5473565 | 27NovPIUIRrOZoCHxABJwK | ['lgbtq+ hip hop', 'pop rap'] | ... | 0.736 | 0.704 | -7.409 | 0.0615 | 0.0203 | 0.0501 | 149.995 | 212000 | 0.894 | D#/Eb |
5 rows × 23 columns
If you try to make this into a chart directly with the following code, it does not work.
alt.Chart(df).mark_circle().encode(
x = "Energy",
y = "Loudness"
)
A first guess is that df is too long (by default, Altair only accepts DataFrames with 5000 rows or fewer), so we try charting just the first five rows.
alt.Chart(df.iloc[:5]).mark_circle().encode(
x = "Energy",
y = "Loudness"
)
alt.Chart(df[:5]).mark_circle().encode(
x = "Energy",
y = "Loudness"
)
type(df.head())
pandas.core.frame.DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1556 entries, 0 to 1555
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Index 1556 non-null int64
1 Highest Charting Position 1556 non-null int64
2 Number of Times Charted 1556 non-null int64
3 Week of Highest Charting 1556 non-null object
4 Song Name 1556 non-null object
5 Streams 1556 non-null object
6 Artist 1556 non-null object
7 Artist Followers 1556 non-null object
8 Song ID 1556 non-null object
9 Genre 1556 non-null object
10 Release Date 1556 non-null object
11 Weeks Charted 1556 non-null object
12 Popularity 1556 non-null object
13 Danceability 1556 non-null object
14 Energy 1556 non-null object
15 Loudness 1556 non-null object
16 Speechiness 1556 non-null object
17 Acousticness 1556 non-null object
18 Liveness 1556 non-null object
19 Tempo 1556 non-null object
20 Duration (ms) 1556 non-null object
21 Valence 1556 non-null object
22 Chord 1556 non-null object
dtypes: int64(3), object(20)
memory usage: 279.7+ KB
df.Artist.value_counts()
Taylor Swift 52
Lil Uzi Vert 32
Justin Bieber 32
Juice WRLD 30
Pop Smoke 29
..
Chris Brown, Young Thug 1
Rauw Alejandro, J Balvin 1
347aidan 1
Migrantes, Alico 1
Dadá Boladão, Tati Zaqui, OIK 1
Name: Artist, Length: 716, dtype: int64
df["Artist"].value_counts()
Taylor Swift 52
Lil Uzi Vert 32
Justin Bieber 32
Juice WRLD 30
Pop Smoke 29
..
Chris Brown, Young Thug 1
Rauw Alejandro, J Balvin 1
347aidan 1
Migrantes, Alico 1
Dadá Boladão, Tati Zaqui, OIK 1
Name: Artist, Length: 716, dtype: int64
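The two calls above return the same Series: `df.Artist` is attribute-style shorthand for `df["Artist"]`. The shorthand only works when the column name is a valid Python identifier, so a column such as "Song Name" needs brackets. A minimal sketch on a small made-up DataFrame (the name `toy` and its rows are illustrative, not taken from the csv file):

```python
import pandas as pd

# Tiny stand-in for the Spotify data; the values are made up.
toy = pd.DataFrame({
    "Artist": ["A", "B", "A"],
    "Song Name": ["x", "y", "z"],
})

# Attribute access and bracket access give the same Series.
same = toy.Artist.equals(toy["Artist"])

# value_counts sorts by count, so "A" (2 songs) comes first.
counts = toy["Artist"].value_counts()

# "Song Name" contains a space, so only bracket access works.
names = toy["Song Name"].tolist()
```

Bracket access is the safer habit, since it works for every column name in this dataset.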
df[:3]
| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48,633,449 | Måneskin | 3377762 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.714 | 0.8 | -4.808 | 0.0504 | 0.127 | 0.359 | 134.002 | 211560 | 0.589 | B |
1 | 2 | 2 | 3 | 2021-07-23--2021-07-30 | STAY (with Justin Bieber) | 47,248,719 | The Kid LAROI | 2230022 | 5HCyWlXZPP0y6Gqq8TgA20 | ['australian hip hop'] | ... | 0.591 | 0.764 | -5.484 | 0.0483 | 0.0383 | 0.103 | 169.928 | 141806 | 0.478 | C#/Db |
2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40,162,559 | Olivia Rodrigo | 6266514 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.563 | 0.664 | -5.044 | 0.154 | 0.335 | 0.0849 | 166.928 | 178147 | 0.688 | A |
3 rows × 23 columns
# Not too long for Altair (5000 is the cutoff for Altair)
len(df)
1556
df.shape
(1556, 23)
x = df.shape[0]
x
1556
df.loc[10,"Energy"]
'0.825'
df.dtypes
Index int64
Highest Charting Position int64
Number of Times Charted int64
Week of Highest Charting object
Song Name object
Streams object
Artist object
Artist Followers object
Song ID object
Genre object
Release Date object
Weeks Charted object
Popularity object
Danceability object
Energy object
Loudness object
Speechiness object
Acousticness object
Liveness object
Tempo object
Duration (ms) object
Valence object
Chord object
dtype: object
pd.to_numeric(df["Energy"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/miniconda3/envs/math11/lib/python3.9/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string " "
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5624/287173585.py in <module>
----> 1 pd.to_numeric(df["Energy"])
~/miniconda3/envs/math11/lib/python3.9/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
181 coerce_numeric = errors not in ("ignore", "raise")
182 try:
--> 183 values, _ = lib.maybe_convert_numeric(
184 values, set(), coerce_numeric=coerce_numeric
185 )
~/miniconda3/envs/math11/lib/python3.9/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string " " at position 35
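The error says that the string " " (a single space) at position 35 cannot be parsed as a number. One alternative to fixing the data at read time is to pass `errors="coerce"` to `pd.to_numeric`, which turns unparseable entries into NaN instead of raising. A minimal sketch on a made-up Series (not the real column):

```python
import pandas as pd

# Made-up values with one blank entry, similar to the real "Energy" column.
energy = pd.Series(["0.8", " ", "0.664"])

# errors="coerce" replaces anything unparseable with NaN instead of raising.
converted = pd.to_numeric(energy, errors="coerce")

# Count how many entries could not be converted.
n_missing = int(converted.isna().sum())
```

This would have to be repeated for each affected column, which is one reason to prefer handling the blanks when the file is read.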
df.isna().any(axis=1)
0 False
1 False
2 False
3 False
4 False
...
1551 False
1552 False
1553 False
1554 False
1555 False
Length: 1556, dtype: bool
df.isna().any(axis=1).sum()
0
"" == " "
False
# Tell pandas what missing values look like
df2 = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
df2.dtypes
Index int64
Highest Charting Position int64
Number of Times Charted int64
Week of Highest Charting object
Song Name object
Streams object
Artist object
Artist Followers float64
Song ID object
Genre object
Release Date object
Weeks Charted object
Popularity float64
Danceability float64
Energy float64
Loudness float64
Speechiness float64
Acousticness float64
Liveness float64
Tempo float64
Duration (ms) float64
Valence float64
Chord object
dtype: object
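The reason `isna` found nothing earlier is that pandas treats an empty field as missing by default, but not a field containing a single space. The sketch below shows the difference on a small made-up csv string (`csv_text` is an assumption for illustration, not the real file):

```python
import io
import pandas as pd

# Made-up csv text: row b has a single space, row c has an empty field.
csv_text = "name,Energy\na,0.5\nb, \nc,\n"

default = pd.read_csv(io.StringIO(csv_text))
fixed = pd.read_csv(io.StringIO(csv_text), na_values=" ")

# By default only the empty field counts as missing.
default_missing = default["Energy"].isna().tolist()

# With na_values=" " the single space is treated as missing too,
# and the column can then be parsed as floats.
fixed_missing = fixed["Energy"].isna().tolist()
fixed_dtype = str(fixed["Energy"].dtype)
```

Note that `na_values` adds to the default list of missing-value markers rather than replacing it, so empty fields are still recognized.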
# Count the bad rows
df2.isna().any(axis=1).sum()
11
df2.isna().any(axis=1)
0 False
1 False
2 False
3 False
4 False
...
1551 False
1552 False
1553 False
1554 False
1555 False
Length: 1556, dtype: bool
df[df2.isna().any(axis=1)]
| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
35 | 36 | 36 | 1 | 2021-07-23--2021-07-30 | NOT SOBER (feat. Polo G & Stunna Gambino) | 11,869,336 | The Kid LAROI | ... | |||||||||||||
163 | 164 | 5 | 39 | 2020-10-30--2020-11-06 | 34+35 | 5,453,159 | Ariana Grande | ... | |||||||||||||
464 | 465 | 118 | 1 | 2021-03-26--2021-04-02 | Richer (feat. Polo G) | 6,292,362 | Rod Wave | ... | |||||||||||||
530 | 531 | 20 | 5 | 2021-01-15--2021-01-22 | 34+35 Remix (feat. Doja Cat, Megan Thee Stalli... | 6,162,453 | Ariana Grande | ... | |||||||||||||
636 | 637 | 22 | 6 | 2020-12-18--2020-12-25 | Driving Home for Christmas - 2019 Remaster | 8,804,531 | Chris Rea | ... | |||||||||||||
654 | 655 | 73 | 1 | 2020-12-18--2020-12-25 | Thank God It's Christmas - Non-Album Single | 10,509,961 | Queen | ... | |||||||||||||
750 | 751 | 19 | 20 | 2020-07-31--2020-08-07 | Agua (with J Balvin) - Music From "Sponge On T... | 5,358,940 | Tainy | ... | |||||||||||||
784 | 785 | 76 | 14 | 2020-09-04--2020-09-11 | Lean (feat. Towy, Osquel, Beltito & Sammy & Fa... | 4,739,241 | Super Yei, Jone Quest | ... | |||||||||||||
876 | 877 | 164 | 4 | 2020-09-18--2020-09-25 | +Linda | 4,964,708 | Dalex | ... | |||||||||||||
1140 | 1141 | 131 | 1 | 2020-05-29--2020-06-05 | In meinem Benz | 5,494,500 | AK AUSSERKONTROLLE, Bonez MC | ... | |||||||||||||
1538 | 1539 | 176 | 1 | 2020-01-03--2020-01-10 | fuck, i'm lonely (with Anne-Marie) - from “13 ... | 4,856,458 | Lauv | ... |
11 rows × 23 columns
# Count the good rows
df2.notna().all(axis=1).sum()
1545
# Keep just the good rows
df3 = df2[df2.notna().all(axis=1)].copy()
df3
| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48,633,449 | Måneskin | 3377762.0 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.714 | 0.800 | -4.808 | 0.0504 | 0.12700 | 0.3590 | 134.002 | 211560.0 | 0.589 | B |
1 | 2 | 2 | 3 | 2021-07-23--2021-07-30 | STAY (with Justin Bieber) | 47,248,719 | The Kid LAROI | 2230022.0 | 5HCyWlXZPP0y6Gqq8TgA20 | ['australian hip hop'] | ... | 0.591 | 0.764 | -5.484 | 0.0483 | 0.03830 | 0.1030 | 169.928 | 141806.0 | 0.478 | C#/Db |
2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40,162,559 | Olivia Rodrigo | 6266514.0 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.563 | 0.664 | -5.044 | 0.1540 | 0.33500 | 0.0849 | 166.928 | 178147.0 | 0.688 | A |
3 | 4 | 3 | 5 | 2021-07-02--2021-07-09 | Bad Habits | 37,799,456 | Ed Sheeran | 83293380.0 | 6PQ88X9TkUIAUIZJHW2upE | ['pop', 'uk pop'] | ... | 0.808 | 0.897 | -3.712 | 0.0348 | 0.04690 | 0.3640 | 126.026 | 231041.0 | 0.591 | B |
4 | 5 | 5 | 1 | 2021-07-23--2021-07-30 | INDUSTRY BABY (feat. Jack Harlow) | 33,948,454 | Lil Nas X | 5473565.0 | 27NovPIUIRrOZoCHxABJwK | ['lgbtq+ hip hop', 'pop rap'] | ... | 0.736 | 0.704 | -7.409 | 0.0615 | 0.02030 | 0.0501 | 149.995 | 212000.0 | 0.894 | D#/Eb |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1551 | 1552 | 195 | 1 | 2019-12-27--2020-01-03 | New Rules | 4,630,675 | Dua Lipa | 27167675.0 | 2ekn2ttSfGqwhhate0LSR0 | ['dance pop', 'pop', 'uk pop'] | ... | 0.762 | 0.700 | -6.021 | 0.0694 | 0.00261 | 0.1530 | 116.073 | 209320.0 | 0.608 | A |
1552 | 1553 | 196 | 1 | 2019-12-27--2020-01-03 | Cheirosa - Ao Vivo | 4,623,030 | Jorge & Mateus | 15019109.0 | 2PWjKmjyTZeDpmOUa3a5da | ['sertanejo', 'sertanejo universitario'] | ... | 0.528 | 0.870 | -3.123 | 0.0851 | 0.24000 | 0.3330 | 152.370 | 181930.0 | 0.714 | B |
1553 | 1554 | 197 | 1 | 2019-12-27--2020-01-03 | Havana (feat. Young Thug) | 4,620,876 | Camila Cabello | 22698747.0 | 1rfofaqEpACxVEHIZBJe6W | ['dance pop', 'electropop', 'pop', 'post-teen ... | ... | 0.765 | 0.523 | -4.333 | 0.0300 | 0.18400 | 0.1320 | 104.988 | 217307.0 | 0.394 | D |
1554 | 1555 | 198 | 1 | 2019-12-27--2020-01-03 | Surtada - Remix Brega Funk | 4,607,385 | Dadá Boladão, Tati Zaqui, OIK | 208630.0 | 5F8ffc8KWKNawllr5WsW0r | ['brega funk', 'funk carioca'] | ... | 0.832 | 0.550 | -7.026 | 0.0587 | 0.24900 | 0.1820 | 154.064 | 152784.0 | 0.881 | F |
1555 | 1556 | 199 | 1 | 2019-12-27--2020-01-03 | Lover (Remix) [feat. Shawn Mendes] | 4,595,450 | Taylor Swift | 42227614.0 | 3i9UVldZOE0aD0JnyfAZZ0 | ['pop', 'post-teen pop'] | ... | 0.448 | 0.603 | -7.176 | 0.0640 | 0.43300 | 0.0862 | 205.272 | 221307.0 | 0.422 | G |
1545 rows × 23 columns
df3.shape
(1545, 23)
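Keeping only the rows where every entry is non-missing can also be written with the built-in `dropna` method. The sketch below compares the two approaches on a small made-up DataFrame (`toy` is an assumption for illustration):

```python
import pandas as pd

nan = float("nan")

# Made-up values; only row 0 has no missing entries.
toy = pd.DataFrame({
    "Energy": [0.8, nan, 0.664],
    "Loudness": [-4.808, -5.484, nan],
})

# Boolean-mask approach, as used above.
by_mask = toy[toy.notna().all(axis=1)].copy()

# Built-in shortcut: drop every row that contains a missing value.
by_dropna = toy.dropna()
```

Both versions keep the original index labels, so the surviving rows can still be matched against the original DataFrame.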
alt.Chart(df2).mark_circle().encode(
x = "Energy",
y = "Loudness"
)
alt.Chart(df2).mark_circle().encode(
x = alt.X("Energy", scale = alt.Scale(domain=(0.1,0.8))),
y = "Loudness"
)
alt.Chart(df2).mark_circle(clip=True, color="Red",size=100).encode(
x = alt.X("Energy", scale = alt.Scale(domain=(0.1,0.8))),
y = "Loudness"
)
df2.dtypes
Index int64
Highest Charting Position int64
Number of Times Charted int64
Week of Highest Charting object
Song Name object
Streams object
Artist object
Artist Followers float64
Song ID object
Genre object
Release Date object
Weeks Charted object
Popularity float64
Danceability float64
Energy float64
Loudness float64
Speechiness float64
Acousticness float64
Liveness float64
Tempo float64
Duration (ms) float64
Valence float64
Chord object
dtype: object
alt.Chart(df2).mark_circle().encode(
x = "Energy",
y = "Loudness",
color = "Danceability"
)
alt.Chart(df2).mark_circle().encode(
x = "Energy",
y = "Loudness",
color = alt.Color("Tempo",scale=alt.Scale(scheme="Turbo")),
tooltip = "Artist"
)
# Are ["Artist"] and list("Artist") the same?
# No: list("Artist") is the list of characters in the string "Artist".
list("Artist")
['A', 'r', 't', 'i', 's', 't']
sel = alt.selection_multi(fields=["Artist"])
c1 = alt.Chart(df2).mark_circle().encode(
x = "Energy",
y = "Loudness",
color = alt.Color("Tempo",scale=alt.Scale(scheme="Turbo")),
tooltip = "Artist"
).add_selection(
sel
)
c2 = alt.Chart(df2).mark_circle().encode(
x = "Energy",
y = "Loudness",
color = alt.Color("Tempo",scale=alt.Scale(scheme="Turbo")),
tooltip = "Artist"
).transform_filter(
sel
)
c1|c2
Try clicking on one or more of the points in the following chart. (Hold down shift while clicking to select multiple points.)
c3 = alt.Chart(df2).mark_circle().encode(
x = "Energy",
y = "Loudness",
color = alt.Color("Tempo",scale=alt.Scale(scheme="Turbo")),
opacity = alt.condition(sel,alt.value(1),alt.value(0.2)),
size = alt.condition(sel,alt.value(400),alt.value(10)),
tooltip = "Artist"
).add_selection(
sel
)
c3