Week 4 Videos
The axis keyword argument in pandas
The axis keyword argument: part 2
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "A": [2, np.nan, 3, 4],
    "B": [1, 5, 6, 2]
})
df
| | A | B |
---|---|---|
0 | 2.0 | 1 |
1 | NaN | 5 |
2 | 3.0 | 6 |
3 | 4.0 | 2 |
df.isna()
| | A | B |
---|---|---|
0 | False | False |
1 | True | False |
2 | False | False |
3 | False | False |
df.isna().any(axis=0)
A True
B False
dtype: bool
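As a small sketch of what the `axis` argument does in the call above (reusing the same toy DataFrame; the names `by_col` and `by_row` are only for illustration): with `axis=0` pandas moves down the rows and returns one answer per column, while with `axis=1` it moves across the columns and returns one answer per row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [2, np.nan, 3, 4],
    "B": [1, 5, 6, 2],
})

# axis=0: collapse the rows, one result per column
by_col = df.isna().any(axis=0)

# axis=1: collapse the columns, one result per row
by_row = df.isna().any(axis=1)

print(by_col)
print(by_row)
```

Only column "A" and only the row labeled 1 contain a missing value, so those are the only `True` entries.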
df
| | A | B |
---|---|---|
0 | 2.0 | 1 |
1 | NaN | 5 |
2 | 3.0 | 6 |
3 | 4.0 | 2 |
df.dropna(axis=1)
| | B |
---|---|
0 | 1 |
1 | 5 |
2 | 6 |
3 | 2 |
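For comparison, here is a minimal sketch of `dropna` along both axes on the same toy DataFrame (the variable names are illustrative). With `axis=1` the column containing a missing value is dropped, as above; with `axis=0` the row containing it is dropped instead. Neither call modifies `df`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [2, np.nan, 3, 4],
    "B": [1, 5, 6, 2],
})

# axis=0: drop every row that contains a missing value (here, the row labeled 1)
rows_dropped = df.dropna(axis=0)

# axis=1: drop every column that contains a missing value (here, column "A")
cols_dropped = df.dropna(axis=1)

print(rows_dropped)
print(cols_dropped)
print(df.shape)  # the original DataFrame still has all 4 rows and 2 columns
```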
df
| | A | B |
---|---|---|
0 | 2.0 | 1 |
1 | NaN | 5 |
2 | 3.0 | 6 |
3 | 4.0 | 2 |
df.rename({"B":"C"}, axis=1)
| | A | C |
---|---|---|
0 | 2.0 | 1 |
1 | NaN | 5 |
2 | 3.0 | 6 |
3 | 4.0 | 2 |
df.rename({2:"C"}, axis=0)
| | A | B |
---|---|---|
0 | 2.0 | 1 |
1 | NaN | 5 |
C | 3.0 | 6 |
3 | 4.0 | 2 |
df.rename({2:"C"}, axis="rows")
| | A | B |
---|---|---|
0 | 2.0 | 1 |
1 | NaN | 5 |
C | 3.0 | 6 |
3 | 4.0 | 2 |
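The `rename` calls above pick the axis with the `axis` argument. As a hedged sketch of two equivalent spellings, `rename` also accepts the string names `"index"` and `"columns"` for `axis`, and it has `index=` and `columns=` keyword arguments that avoid `axis` altogether. The variable names below are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [2, np.nan, 3, 4],
    "B": [1, 5, 6, 2],
})

# Rename a column: axis=1 and axis="columns" mean the same thing
renamed_cols = df.rename({"B": "C"}, axis="columns")
renamed_cols_kw = df.rename(columns={"B": "C"})

# Rename a row label: axis=0 and axis="index" mean the same thing
renamed_rows = df.rename({2: "C"}, axis="index")
renamed_rows_kw = df.rename(index={2: "C"})

print(renamed_cols.columns.tolist())
print(renamed_rows.index.tolist())
```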
The pandas DataFrame method apply
import pandas as pd
import altair as alt
df = pd.read_csv("spotify_dataset.csv") # better: na_values = " "
alt.Chart(df).mark_circle().encode(
    x="Energy",
    y="Loudness",
    color=alt.Color("Valence", scale=alt.Scale(scheme="spectral")),
    tooltip=["Artist", "Song Name"]
)
df
| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48,633,449 | Måneskin | 3377762 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.714 | 0.8 | -4.808 | 0.0504 | 0.127 | 0.359 | 134.002 | 211560 | 0.589 | B |
1 | 2 | 2 | 3 | 2021-07-23--2021-07-30 | STAY (with Justin Bieber) | 47,248,719 | The Kid LAROI | 2230022 | 5HCyWlXZPP0y6Gqq8TgA20 | ['australian hip hop'] | ... | 0.591 | 0.764 | -5.484 | 0.0483 | 0.0383 | 0.103 | 169.928 | 141806 | 0.478 | C#/Db |
2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40,162,559 | Olivia Rodrigo | 6266514 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.563 | 0.664 | -5.044 | 0.154 | 0.335 | 0.0849 | 166.928 | 178147 | 0.688 | A |
3 | 4 | 3 | 5 | 2021-07-02--2021-07-09 | Bad Habits | 37,799,456 | Ed Sheeran | 83293380 | 6PQ88X9TkUIAUIZJHW2upE | ['pop', 'uk pop'] | ... | 0.808 | 0.897 | -3.712 | 0.0348 | 0.0469 | 0.364 | 126.026 | 231041 | 0.591 | B |
4 | 5 | 5 | 1 | 2021-07-23--2021-07-30 | INDUSTRY BABY (feat. Jack Harlow) | 33,948,454 | Lil Nas X | 5473565 | 27NovPIUIRrOZoCHxABJwK | ['lgbtq+ hip hop', 'pop rap'] | ... | 0.736 | 0.704 | -7.409 | 0.0615 | 0.0203 | 0.0501 | 149.995 | 212000 | 0.894 | D#/Eb |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1551 | 1552 | 195 | 1 | 2019-12-27--2020-01-03 | New Rules | 4,630,675 | Dua Lipa | 27167675 | 2ekn2ttSfGqwhhate0LSR0 | ['dance pop', 'pop', 'uk pop'] | ... | 0.762 | 0.7 | -6.021 | 0.0694 | 0.00261 | 0.153 | 116.073 | 209320 | 0.608 | A |
1552 | 1553 | 196 | 1 | 2019-12-27--2020-01-03 | Cheirosa - Ao Vivo | 4,623,030 | Jorge & Mateus | 15019109 | 2PWjKmjyTZeDpmOUa3a5da | ['sertanejo', 'sertanejo universitario'] | ... | 0.528 | 0.87 | -3.123 | 0.0851 | 0.24 | 0.333 | 152.37 | 181930 | 0.714 | B |
1553 | 1554 | 197 | 1 | 2019-12-27--2020-01-03 | Havana (feat. Young Thug) | 4,620,876 | Camila Cabello | 22698747 | 1rfofaqEpACxVEHIZBJe6W | ['dance pop', 'electropop', 'pop', 'post-teen ... | ... | 0.765 | 0.523 | -4.333 | 0.03 | 0.184 | 0.132 | 104.988 | 217307 | 0.394 | D |
1554 | 1555 | 198 | 1 | 2019-12-27--2020-01-03 | Surtada - Remix Brega Funk | 4,607,385 | Dadá Boladão, Tati Zaqui, OIK | 208630 | 5F8ffc8KWKNawllr5WsW0r | ['brega funk', 'funk carioca'] | ... | 0.832 | 0.55 | -7.026 | 0.0587 | 0.249 | 0.182 | 154.064 | 152784 | 0.881 | F |
1555 | 1556 | 199 | 1 | 2019-12-27--2020-01-03 | Lover (Remix) [feat. Shawn Mendes] | 4,595,450 | Taylor Swift | 42227614 | 3i9UVldZOE0aD0JnyfAZZ0 | ['pop', 'post-teen pop'] | ... | 0.448 | 0.603 | -7.176 | 0.064 | 0.433 | 0.0862 | 205.272 | 221307 | 0.422 | G |
1556 rows × 23 columns
df.dtypes
Index int64
Highest Charting Position int64
Number of Times Charted int64
Week of Highest Charting object
Song Name object
Streams object
Artist object
Artist Followers object
Song ID object
Genre object
Release Date object
Weeks Charted object
Popularity object
Danceability object
Energy object
Loudness object
Speechiness object
Acousticness object
Liveness object
Tempo object
Duration (ms) object
Valence object
Chord object
dtype: object
pd.to_numeric(df["Energy"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File pandas/_libs/lib.pyx:2062, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string " "
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In [6], line 1
----> 1 pd.to_numeric(df["Energy"])
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/tools/numeric.py:154, in to_numeric(arg, errors, downcast)
152 coerce_numeric = errors not in ("ignore", "raise")
153 try:
--> 154 values = lib.maybe_convert_numeric(
155 values, set(), coerce_numeric=coerce_numeric
156 )
157 except (ValueError, TypeError):
158 if errors == "raise":
File pandas/_libs/lib.pyx:2099, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string " " at position 35
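The traceback above says that `pd.to_numeric` could not parse the string `" "` (a single space). As a minimal sketch on a hypothetical three-entry Series standing in for `df["Energy"]`, the `errors="coerce"` option of `pd.to_numeric` is one way to turn such entries into `NaN` instead of raising. This is an alternative to the cleaning approach used in these notes, not a replacement for it.

```python
import pandas as pd

# Hypothetical stand-in for a column like df["Energy"]; the middle entry is a single space
energy = pd.Series(["0.8", " ", "0.664"], name="Energy")

try:
    pd.to_numeric(energy)  # default is errors="raise"
    raised = False
except ValueError:
    raised = True

# errors="coerce" replaces anything that cannot be parsed with NaN
converted = pd.to_numeric(energy, errors="coerce")

print(raised)
print(converted)
```

One trade-off is that `errors="coerce"` silently converts every unparseable entry, not only `" "`, so it can hide other data problems.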
import numpy as np
# applymap returns a new DataFrame; df itself is not changed by this line
df.applymap(lambda x: x if x != " " else np.nan)
| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48,633,449 | Måneskin | 3377762 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.714 | 0.8 | -4.808 | 0.0504 | 0.127 | 0.359 | 134.002 | 211560 | 0.589 | B |
1 | 2 | 2 | 3 | 2021-07-23--2021-07-30 | STAY (with Justin Bieber) | 47,248,719 | The Kid LAROI | 2230022 | 5HCyWlXZPP0y6Gqq8TgA20 | ['australian hip hop'] | ... | 0.591 | 0.764 | -5.484 | 0.0483 | 0.0383 | 0.103 | 169.928 | 141806 | 0.478 | C#/Db |
2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40,162,559 | Olivia Rodrigo | 6266514 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.563 | 0.664 | -5.044 | 0.154 | 0.335 | 0.0849 | 166.928 | 178147 | 0.688 | A |
3 | 4 | 3 | 5 | 2021-07-02--2021-07-09 | Bad Habits | 37,799,456 | Ed Sheeran | 83293380 | 6PQ88X9TkUIAUIZJHW2upE | ['pop', 'uk pop'] | ... | 0.808 | 0.897 | -3.712 | 0.0348 | 0.0469 | 0.364 | 126.026 | 231041 | 0.591 | B |
4 | 5 | 5 | 1 | 2021-07-23--2021-07-30 | INDUSTRY BABY (feat. Jack Harlow) | 33,948,454 | Lil Nas X | 5473565 | 27NovPIUIRrOZoCHxABJwK | ['lgbtq+ hip hop', 'pop rap'] | ... | 0.736 | 0.704 | -7.409 | 0.0615 | 0.0203 | 0.0501 | 149.995 | 212000 | 0.894 | D#/Eb |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1551 | 1552 | 195 | 1 | 2019-12-27--2020-01-03 | New Rules | 4,630,675 | Dua Lipa | 27167675 | 2ekn2ttSfGqwhhate0LSR0 | ['dance pop', 'pop', 'uk pop'] | ... | 0.762 | 0.7 | -6.021 | 0.0694 | 0.00261 | 0.153 | 116.073 | 209320 | 0.608 | A |
1552 | 1553 | 196 | 1 | 2019-12-27--2020-01-03 | Cheirosa - Ao Vivo | 4,623,030 | Jorge & Mateus | 15019109 | 2PWjKmjyTZeDpmOUa3a5da | ['sertanejo', 'sertanejo universitario'] | ... | 0.528 | 0.87 | -3.123 | 0.0851 | 0.24 | 0.333 | 152.37 | 181930 | 0.714 | B |
1553 | 1554 | 197 | 1 | 2019-12-27--2020-01-03 | Havana (feat. Young Thug) | 4,620,876 | Camila Cabello | 22698747 | 1rfofaqEpACxVEHIZBJe6W | ['dance pop', 'electropop', 'pop', 'post-teen ... | ... | 0.765 | 0.523 | -4.333 | 0.03 | 0.184 | 0.132 | 104.988 | 217307 | 0.394 | D |
1554 | 1555 | 198 | 1 | 2019-12-27--2020-01-03 | Surtada - Remix Brega Funk | 4,607,385 | Dadá Boladão, Tati Zaqui, OIK | 208630 | 5F8ffc8KWKNawllr5WsW0r | ['brega funk', 'funk carioca'] | ... | 0.832 | 0.55 | -7.026 | 0.0587 | 0.249 | 0.182 | 154.064 | 152784 | 0.881 | F |
1555 | 1556 | 199 | 1 | 2019-12-27--2020-01-03 | Lover (Remix) [feat. Shawn Mendes] | 4,595,450 | Taylor Swift | 42227614 | 3i9UVldZOE0aD0JnyfAZZ0 | ['pop', 'post-teen pop'] | ... | 0.448 | 0.603 | -7.176 | 0.064 | 0.433 | 0.0862 | 205.272 | 221307 | 0.422 | G |
1556 rows × 23 columns
pd.to_numeric(df["Energy"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File pandas/_libs/lib.pyx:2062, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string " "
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In [9], line 1
----> 1 pd.to_numeric(df["Energy"])
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/tools/numeric.py:154, in to_numeric(arg, errors, downcast)
152 coerce_numeric = errors not in ("ignore", "raise")
153 try:
--> 154 values = lib.maybe_convert_numeric(
155 values, set(), coerce_numeric=coerce_numeric
156 )
157 except (ValueError, TypeError):
158 if errors == "raise":
File pandas/_libs/lib.pyx:2099, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string " " at position 35
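The same error appears a second time because the element-wise call returned a new DataFrame and left `df` untouched. Here is a small sketch of that behavior on a hypothetical toy DataFrame. It assumes pandas 2.1 or later provides `DataFrame.map` as the newer name for `applymap`, and falls back to `applymap` otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with " " standing in for missing entries
toy = pd.DataFrame({
    "Energy": ["0.8", " ", "0.664"],
    "Chord": ["B", "A", " "],
})

# Use DataFrame.map when it exists (pandas 2.1+), otherwise the older applymap
elementwise = toy.map if hasattr(toy, "map") else toy.applymap

# The result is a new DataFrame; toy still contains the " " strings
cleaned = elementwise(lambda x: x if x != " " else np.nan)

print((toy == " ").sum().sum())    # count of " " entries left in toy
print(cleaned.isna().sum().sum())  # count of NaN entries in the new DataFrame
```

To keep the cleaned values, the result has to be assigned back to a variable.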
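# Assign the result back so the change is kept
# (applymap is deprecated in pandas 2.1+, where DataFrame.map replaces it)
df = df.applymap(lambda x: x if x != " " else np.nan)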
pd.to_numeric(df["Energy"])
0 0.800
1 0.764
2 0.664
3 0.897
4 0.704
...
1551 0.700
1552 0.870
1553 0.523
1554 0.550
1555 0.603
Name: Energy, Length: 1556, dtype: float64
df.dtypes
Index int64
Highest Charting Position int64
Number of Times Charted int64
Week of Highest Charting object
Song Name object
Streams object
Artist object
Artist Followers object
Song ID object
Genre object
Release Date object
Weeks Charted object
Popularity object
Danceability object
Energy object
Loudness object
Speechiness object
Acousticness object
Liveness object
Tempo object
Duration (ms) object
Valence object
Chord object
dtype: object
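# Leftmost and rightmost columns of the block to convert (label slices include both ends)
lcol = "Popularity"
rcol = "Valence"
# axis=0: apply pd.to_numeric to each column in the slice, one column at a time
df.loc[:, lcol:rcol] = df.loc[:, lcol:rcol].apply(pd.to_numeric, axis=0)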
df.dtypes
Index int64
Highest Charting Position int64
Number of Times Charted int64
Week of Highest Charting object
Song Name object
Streams object
Artist object
Artist Followers object
Song ID object
Genre object
Release Date object
Weeks Charted object
Popularity float64
Danceability float64
Energy float64
Loudness float64
Speechiness float64
Acousticness float64
Liveness float64
Tempo float64
Duration (ms) float64
Valence float64
Chord object
dtype: object
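The dtypes above changed to `float64` after assigning through `.loc`. Depending on the pandas version, an assignment through `.loc[:, ...]` may instead write into the existing object-dtype columns and leave the dtype as `object`. The sketch below shows one alternative under that assumption, on a hypothetical toy DataFrame: assigning with `toy[cols] = ...` replaces the columns, so the converted dtypes are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: numeric values stored as strings
toy = pd.DataFrame({
    "Song Name": ["song_a", "song_b"],
    "Energy": ["0.8", "0.764"],
    "Loudness": ["-4.808", np.nan],
    "Chord": ["B", "A"],
})

# Column labels from "Energy" to "Loudness" (label slices include both endpoints)
cols = list(toy.loc[:, "Energy":"Loudness"].columns)

# Replace the selected columns with their numeric versions
toy[cols] = toy[cols].apply(pd.to_numeric, axis=0)

print(toy.dtypes)
```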
alt.Chart(df).mark_circle().encode(
    x="Energy",
    y="Loudness",
    color=alt.Color("Valence", scale=alt.Scale(scheme="spectral")),
    tooltip=["Artist", "Song Name"]
)
alt.Chart(df).mark_circle().encode(
    x="Energy",
    y="Loudness",
    color=alt.Color("Valence", scale=alt.Scale(scheme="spectral", reverse=True)),
    tooltip=["Artist", "Song Name"]
)
df.loc[:, lcol:rcol]
| | Popularity | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence |
---|---|---|---|---|---|---|---|---|---|---|
0 | 100.0 | 0.714 | 0.800 | -4.808 | 0.0504 | 0.12700 | 0.3590 | 134.002 | 211560.0 | 0.589 |
1 | 99.0 | 0.591 | 0.764 | -5.484 | 0.0483 | 0.03830 | 0.1030 | 169.928 | 141806.0 | 0.478 |
2 | 99.0 | 0.563 | 0.664 | -5.044 | 0.1540 | 0.33500 | 0.0849 | 166.928 | 178147.0 | 0.688 |
3 | 98.0 | 0.808 | 0.897 | -3.712 | 0.0348 | 0.04690 | 0.3640 | 126.026 | 231041.0 | 0.591 |
4 | 96.0 | 0.736 | 0.704 | -7.409 | 0.0615 | 0.02030 | 0.0501 | 149.995 | 212000.0 | 0.894 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1551 | 79.0 | 0.762 | 0.700 | -6.021 | 0.0694 | 0.00261 | 0.1530 | 116.073 | 209320.0 | 0.608 |
1552 | 66.0 | 0.528 | 0.870 | -3.123 | 0.0851 | 0.24000 | 0.3330 | 152.370 | 181930.0 | 0.714 |
1553 | 81.0 | 0.765 | 0.523 | -4.333 | 0.0300 | 0.18400 | 0.1320 | 104.988 | 217307.0 | 0.394 |
1554 | 60.0 | 0.832 | 0.550 | -7.026 | 0.0587 | 0.24900 | 0.1820 | 154.064 | 152784.0 | 0.881 |
1555 | 70.0 | 0.448 | 0.603 | -7.176 | 0.0640 | 0.43300 | 0.0862 | 205.272 | 221307.0 | 0.422 |
1556 rows × 10 columns
df.loc[:, lcol:rcol].sum(axis=0)
Popularity 1.082880e+05
Danceability 1.066045e+03
Energy 9.787500e+02
Loudness -9.808392e+03
Speechiness 1.910481e+02
Acousticness 3.842330e+02
Liveness 2.799577e+02
Tempo 1.897430e+05
Duration (ms) 3.058186e+08
Valence 7.952174e+02
dtype: float64
df.loc[:, lcol:rcol].apply(lambda col: col.sum(), axis=0)
Popularity 1.082880e+05
Danceability 1.066045e+03
Energy 9.787500e+02
Loudness -9.808392e+03
Speechiness 1.910481e+02
Acousticness 3.842330e+02
Liveness 2.799577e+02
Tempo 1.897430e+05
Duration (ms) 3.058186e+08
Valence 7.952174e+02
dtype: float64
df.loc[:, lcol:rcol].apply(lambda z: z.sum(), axis=0)
Popularity 1.082880e+05
Danceability 1.066045e+03
Energy 9.787500e+02
Loudness -9.808392e+03
Speechiness 1.910481e+02
Acousticness 3.842330e+02
Liveness 2.799577e+02
Tempo 1.897430e+05
Duration (ms) 3.058186e+08
Valence 7.952174e+02
dtype: float64
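The three calls above give identical results: with `axis=0`, `apply` passes each column to the function as a Series, and the name of the lambda's argument (`col`, `z`, or anything else) makes no difference. As a sketch of the other direction, on a hypothetical two-column DataFrame, `axis=1` passes each row to the function instead.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "A": [1, 4],
    "B": [3, 10],
})

# axis=0: the function receives one column at a time, one result per column
col_sums = toy.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives one row at a time, one result per row
row_sums = toy.apply(lambda row: row.sum(), axis=1)

# Any function that takes a Series works, for example the range of each row
row_ranges = toy.apply(lambda row: row.max() - row.min(), axis=1)

print(col_sums)
print(row_sums)
print(row_ranges)
```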
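# Option 1: load the file, then replace every " " entry with NaN afterwards
df = pd.read_csv("spotify_dataset.csv") # better: na_values = " "
df.replace(" ", np.nan, inplace=True)
# Option 2: treat " " as a missing value while reading the file
df = pd.read_csv("spotify_dataset.csv", na_values=" ")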
df.dtypes
Index int64
Highest Charting Position int64
Number of Times Charted int64
Week of Highest Charting object
Song Name object
Streams object
Artist object
Artist Followers float64
Song ID object
Genre object
Release Date object
Weeks Charted object
Popularity float64
Danceability float64
Energy float64
Loudness float64
Speechiness float64
Acousticness float64
Liveness float64
Tempo float64
Duration (ms) float64
Valence float64
Chord object
dtype: object
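With `na_values=" "`, the numeric columns are already `float64` right after loading. The sketch below reproduces that effect without the Spotify file by reading a small hypothetical CSV from an in-memory string. The column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV text; the Energy value for song_b is a single space
csv_text = (
    "Song Name,Energy\n"
    "song_a,0.8\n"
    "song_b, \n"
    "song_c,0.664\n"
)

# Default settings: " " is kept as a string, so Energy is not a numeric column
raw = pd.read_csv(io.StringIO(csv_text))

# na_values=" ": the space is read as a missing value and Energy becomes float64
clean = pd.read_csv(io.StringIO(csv_text), na_values=" ")

print(raw.dtypes)
print(clean.dtypes)
```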