Week 4 Monday¶
Today’s class is mostly meant as review, but there is some new material I want to introduce, including:
np.count_nonzero
sort_values
Announcements¶
Videos and video quizzes are due Tuesday before discussion (different day than most weeks). No in-class quiz this week.
Midterm is Thursday during discussion. You’re allowed to use a notecard with handwritten notes on both sides; ask Chris or Yasmeen if you need a new notecard.
In discussion section on Tuesday, Yasmeen will go over some of the sample midterm. (It’s on Canvas, on the Week 4 page.)
In general, the material from this week could appear on the midterm. (This week’s material is mostly meant as review.) If you’re curious about whether you need to memorize/study something specific, please ask on Ed Discussion.
Timing operations¶
import numpy as np
import pandas as pd
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
Define s to be the pandas Series corresponding to the “Energy” column in df.
s = df["Energy"]
type(s)
pandas.core.series.Series
Count how many values in s are strictly greater than 0.7. Try each of the following strategies.
Use a pandas Boolean Series and the sum method.
s[:5]
0 0.800
1 0.764
2 0.664
3 0.897
4 0.704
Name: Energy, dtype: float64
s>0.7
0 True
1 True
2 False
3 True
4 True
...
1551 False
1552 True
1553 False
1554 False
1555 False
Name: Energy, Length: 1556, dtype: bool
(s>0.7).sum()
569
Use a pandas Boolean Series and Python’s built-in sum function.
sum(s>0.7)
569
NumPy’s np.count_nonzero function is the most efficient way I know to count the elements satisfying a condition in Python.
Use a pandas Boolean Series and NumPy’s np.count_nonzero function. (This function accepts many different types of inputs. If the input is a Boolean Series, it will count how often True occurs.)
np.count_nonzero(s>0.7)
569
np.count_nonzero(range(5))
4
np.count_nonzero([0,1,0,2,2])
3
Use a list comprehension together with len. (In other words, make a list containing all the values greater than 0.7, then compute the length of that list.)
len([x for x in s if x>0.7])
569
Aside: recall that if by itself should go at the end of the list comprehension, whereas if together with else should go at the beginning.
[x if x > 0.7 else "christopher" for x in s ]
[0.8,
0.764,
'christopher',
0.897,
0.704,
'christopher',
0.701,
0.718,
'christopher',
'christopher',
0.825,
0.819,
0.741,
'christopher',
'christopher',
'christopher',
'christopher',
0.825,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.73,
'christopher',
0.766,
'christopher',
0.862,
'christopher',
'christopher',
0.796,
'christopher',
'christopher',
0.816,
'christopher',
'christopher',
'christopher',
0.71,
'christopher',
'christopher',
'christopher',
0.809,
0.765,
'christopher',
0.784,
0.716,
'christopher',
0.711,
'christopher',
0.849,
'christopher',
'christopher',
0.839,
0.738,
0.941,
0.887,
'christopher',
'christopher',
0.72,
'christopher',
0.784,
'christopher',
0.825,
'christopher',
0.793,
0.782,
0.807,
0.77,
'christopher',
'christopher',
'christopher',
0.706,
0.899,
'christopher',
0.939,
'christopher',
'christopher',
0.701,
'christopher',
'christopher',
'christopher',
0.948,
'christopher',
'christopher',
0.874,
0.782,
0.753,
'christopher',
'christopher',
'christopher',
0.802,
0.762,
0.835,
0.813,
0.72,
0.764,
0.83,
'christopher',
'christopher',
0.825,
'christopher',
0.812,
0.78,
'christopher',
'christopher',
0.719,
0.899,
0.837,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.96,
0.731,
'christopher',
'christopher',
'christopher',
'christopher',
0.812,
'christopher',
'christopher',
'christopher',
'christopher',
0.835,
0.731,
0.912,
'christopher',
'christopher',
0.873,
0.859,
'christopher',
'christopher',
'christopher',
'christopher',
0.706,
0.732,
0.795,
0.788,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.855,
0.815,
'christopher',
0.717,
0.861,
0.705,
'christopher',
'christopher',
0.899,
0.783,
0.792,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.924,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.759,
'christopher',
'christopher',
0.771,
'christopher',
0.76,
'christopher',
'christopher',
'christopher',
0.805,
0.766,
'christopher',
'christopher',
0.878,
'christopher',
'christopher',
0.767,
'christopher',
0.72,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.724,
'christopher',
0.844,
0.751,
'christopher',
'christopher',
'christopher',
'christopher',
0.707,
0.732,
'christopher',
0.716,
0.71,
'christopher',
'christopher',
'christopher',
0.922,
0.909,
'christopher',
0.889,
'christopher',
0.918,
0.712,
0.739,
'christopher',
0.893,
0.717,
'christopher',
0.708,
'christopher',
'christopher',
0.727,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.908,
'christopher',
0.795,
'christopher',
0.736,
'christopher',
'christopher',
0.762,
'christopher',
0.75,
0.845,
'christopher',
'christopher',
0.855,
'christopher',
0.866,
'christopher',
0.712,
0.814,
0.862,
'christopher',
0.707,
'christopher',
0.764,
'christopher',
'christopher',
0.809,
0.756,
'christopher',
0.802,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.893,
0.922,
'christopher',
'christopher',
0.843,
'christopher',
0.729,
'christopher',
0.733,
'christopher',
0.864,
'christopher',
0.73,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.706,
'christopher',
0.934,
0.719,
'christopher',
0.712,
'christopher',
'christopher',
'christopher',
'christopher',
0.828,
'christopher',
'christopher',
0.855,
'christopher',
0.831,
'christopher',
0.836,
0.713,
'christopher',
'christopher',
'christopher',
0.838,
'christopher',
'christopher',
'christopher',
0.73,
0.789,
'christopher',
'christopher',
0.862,
0.725,
0.745,
'christopher',
'christopher',
0.704,
'christopher',
'christopher',
'christopher',
0.709,
'christopher',
0.856,
0.89,
0.793,
0.867,
0.955,
0.737,
0.814,
'christopher',
'christopher',
'christopher',
0.929,
'christopher',
'christopher',
0.749,
'christopher',
'christopher',
0.759,
'christopher',
0.74,
'christopher',
0.703,
'christopher',
'christopher',
0.748,
0.788,
0.856,
'christopher',
0.83,
0.837,
0.8,
'christopher',
0.744,
'christopher',
0.721,
'christopher',
'christopher',
'christopher',
0.819,
'christopher',
'christopher',
'christopher',
0.739,
'christopher',
'christopher',
0.796,
0.789,
0.814,
'christopher',
0.789,
0.734,
0.701,
0.848,
'christopher',
'christopher',
'christopher',
'christopher',
0.817,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.821,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.719,
'christopher',
'christopher',
'christopher',
'christopher',
0.758,
'christopher',
'christopher',
0.703,
0.792,
0.763,
0.932,
'christopher',
'christopher',
'christopher',
'christopher',
0.773,
0.741,
0.899,
'christopher',
0.732,
'christopher',
'christopher',
'christopher',
0.821,
'christopher',
'christopher',
'christopher',
0.909,
'christopher',
0.891,
0.771,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.711,
'christopher',
'christopher',
0.922,
0.823,
'christopher',
0.756,
0.741,
'christopher',
0.723,
0.902,
0.72,
'christopher',
'christopher',
'christopher',
'christopher',
0.715,
0.761,
0.724,
'christopher',
'christopher',
0.823,
0.85,
'christopher',
'christopher',
0.701,
0.794,
'christopher',
0.833,
'christopher',
'christopher',
'christopher',
0.739,
'christopher',
'christopher',
'christopher',
0.713,
'christopher',
0.741,
'christopher',
'christopher',
0.922,
'christopher',
0.76,
'christopher',
'christopher',
0.74,
0.819,
'christopher',
0.903,
0.727,
0.761,
0.912,
0.834,
0.744,
'christopher',
'christopher',
0.737,
'christopher',
0.705,
'christopher',
0.828,
'christopher',
0.797,
'christopher',
0.709,
0.708,
0.723,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.756,
'christopher',
'christopher',
'christopher',
0.775,
0.742,
'christopher',
0.706,
'christopher',
0.788,
'christopher',
0.816,
'christopher',
'christopher',
'christopher',
'christopher',
0.774,
0.893,
'christopher',
'christopher',
'christopher',
0.844,
0.769,
'christopher',
'christopher',
0.814,
'christopher',
0.859,
'christopher',
'christopher',
0.746,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.715,
0.737,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.784,
0.726,
'christopher',
'christopher',
0.707,
0.806,
'christopher',
'christopher',
'christopher',
0.786,
'christopher',
'christopher',
'christopher',
0.869,
'christopher',
0.937,
0.701,
'christopher',
'christopher',
0.79,
'christopher',
'christopher',
0.707,
0.792,
0.865,
0.836,
0.914,
'christopher',
'christopher',
0.88,
0.937,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.812,
0.831,
'christopher',
'christopher',
'christopher',
0.772,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.904,
'christopher',
0.876,
0.776,
'christopher',
'christopher',
0.759,
0.925,
'christopher',
0.953,
'christopher',
0.841,
'christopher',
'christopher',
'christopher',
0.814,
0.843,
'christopher',
'christopher',
'christopher',
0.87,
0.869,
'christopher',
'christopher',
'christopher',
'christopher',
0.712,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.813,
'christopher',
0.771,
'christopher',
'christopher',
'christopher',
'christopher',
0.938,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.939,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.73,
'christopher',
'christopher',
0.704,
0.724,
0.758,
0.727,
'christopher',
'christopher',
'christopher',
'christopher',
0.881,
'christopher',
'christopher',
'christopher',
0.703,
0.885,
'christopher',
0.723,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.829,
0.727,
0.74,
0.798,
0.745,
0.725,
0.861,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.845,
'christopher',
0.908,
'christopher',
0.756,
'christopher',
'christopher',
0.882,
0.705,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.708,
'christopher',
'christopher',
'christopher',
'christopher',
0.782,
'christopher',
'christopher',
'christopher',
0.787,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.824,
'christopher',
'christopher',
0.97,
'christopher',
'christopher',
'christopher',
0.765,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.819,
'christopher',
0.808,
'christopher',
'christopher',
'christopher',
'christopher',
0.716,
0.741,
0.796,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.851,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.887,
'christopher',
'christopher',
0.726,
0.913,
'christopher',
'christopher',
0.771,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.837,
'christopher',
'christopher',
'christopher',
0.729,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.865,
0.939,
'christopher',
0.867,
'christopher',
'christopher',
0.758,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.733,
'christopher',
0.875,
'christopher',
0.737,
'christopher',
0.733,
0.793,
0.706,
'christopher',
'christopher',
'christopher',
0.709,
'christopher',
0.818,
0.703,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.749,
'christopher',
0.858,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.776,
'christopher',
'christopher',
0.884,
'christopher',
0.77,
'christopher',
0.723,
'christopher',
'christopher',
0.832,
0.768,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.844,
0.773,
'christopher',
0.706,
'christopher',
'christopher',
'christopher',
0.727,
'christopher',
'christopher',
'christopher',
0.814,
'christopher',
0.955,
'christopher',
'christopher',
0.743,
0.82,
'christopher',
0.911,
'christopher',
'christopher',
0.799,
0.857,
0.71,
'christopher',
'christopher',
0.716,
'christopher',
'christopher',
0.794,
0.854,
'christopher',
'christopher',
'christopher',
0.796,
0.87,
'christopher',
'christopher',
0.863,
'christopher',
'christopher',
0.817,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.854,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.826,
'christopher',
0.781,
'christopher',
'christopher',
0.766,
'christopher',
'christopher',
'christopher',
0.744,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.716,
'christopher',
0.704,
0.848,
0.741,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.742,
'christopher',
'christopher',
'christopher',
0.857,
0.909,
'christopher',
'christopher',
'christopher',
'christopher',
'christopher',
0.764,
0.723,
'christopher',
0.787,
'christopher',
0.76,
'christopher',
...]
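To make the aside about if placement concrete, here is a minimal sketch using the first five Energy values shown earlier. The replacement string "low" is just a placeholder chosen for this illustration. A trailing if filters, so the result can be shorter than the input; a leading if together with else transforms, so the result always has the same length as the input.

```python
# The first five "Energy" values from the Series shown earlier.
values = [0.8, 0.764, 0.664, 0.897, 0.704]

# Trailing `if`: a filter. Only elements passing the test are kept.
kept = [x for x in values if x > 0.7]

# Leading `if ... else ...`: a transformation. Every element yields one output.
replaced = [x if x > 0.7 else "low" for x in values]

print(kept)      # [0.8, 0.764, 0.897, 0.704]
print(replaced)  # [0.8, 0.764, 'low', 0.897, 0.704]
```

This is why len works for counting with the first form but not the second: only the filtering form drops the values that fail the test.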
Time each of these strategies using %%timeit. Which is fastest?
In this case, they are all within a factor of about 3.5 of each other, with np.count_nonzero being the fastest, followed by the Series sum method; Python’s built-in sum and the list comprehension are the slowest. If the original DataFrame were bigger, say with ten million rows instead of about 1,500 rows, I think the differences would be more pronounced.
%%timeit
(s > 0.7).sum()
62 µs ± 582 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
sum(s > 0.7)
142 µs ± 1.37 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
np.count_nonzero(s > 0.7)
42.2 µs ± 150 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
len([x for x in s if x>0.7])
137 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
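The %%timeit cell magic only works inside Jupyter/IPython. As a rough sketch of how one might repeat the comparison in a plain Python script, and on a larger input, the following uses the standard-library timeit module. The Series big is synthetic random data (an assumption standing in for a larger “Energy” column), so the timings it prints are not comparable to the numbers above and will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for a larger "Energy" column: 200,000 values in [0, 1).
rng = np.random.default_rng(0)
big = pd.Series(rng.random(200_000))

# The same four strategies as above, wrapped so they can be timed.
strategies = {
    "Series sum method": lambda: (big > 0.7).sum(),
    "np.count_nonzero": lambda: np.count_nonzero(big > 0.7),
    "built-in sum": lambda: sum(big > 0.7),
    "list comprehension": lambda: len([x for x in big if x > 0.7]),
}

# Sanity check: record the count produced by each strategy.
counts = {name: int(f()) for name, f in strategies.items()}

# Average over a few calls; the pure-Python strategies are slow on large inputs.
for name, f in strategies.items():
    seconds = timeit.timeit(f, number=3) / 3
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```

Checking that all entries of counts agree before comparing speeds is a useful habit: a fast strategy that returns a different answer is not a fair comparison.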
Sorting pandas Series and DataFrames¶
One brief topic I want to cover before the midterm is using the sort_values method.
What songs have the 10 highest “Valence” levels in the Spotify dataset? Solve this two ways.
By sorting a pandas Series, getting its index, and then getting the sub-DataFrame.
df["Valence"].sort_values()
555 0.0320
1501 0.0360
915 0.0363
865 0.0376
1179 0.0391
...
750 NaN
784 NaN
876 NaN
1140 NaN
1538 NaN
Name: Valence, Length: 1556, dtype: float64
df["Valence"].sort_values(ascending=False)
884 0.979
1408 0.977
677 0.971
1096 0.968
1230 0.966
...
750 NaN
784 NaN
876 NaN
1140 NaN
1538 NaN
Name: Valence, Length: 1556, dtype: float64
df["Valence"].sort_values(ascending=False).index
Int64Index([ 884, 1408, 677, 1096, 1230, 463, 130, 1390, 512, 627,
...
163, 464, 530, 636, 654, 750, 784, 876, 1140, 1538],
dtype='int64', length=1556)
df["Valence"].sort_values(ascending=False).index[:10]
Int64Index([884, 1408, 677, 1096, 1230, 463, 130, 1390, 512, 627], dtype='int64')
df["Valence"].sort_values(ascending=False)[:10].index
Int64Index([884, 1408, 677, 1096, 1230, 463, 130, 1390, 512, 627], dtype='int64')
The following code runs but does not do what we want: it keeps only the first 10 rows before sorting, so none of the other rows are considered during the sorting.
df["Valence"][:10].sort_values(ascending=False).index
Int64Index([9, 4, 5, 6, 2, 3, 0, 1, 8, 7], dtype='int64')
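The effect of the order of operations is easier to see on a tiny example. The sketch below uses made-up values (not from the Spotify data), arranged so that the largest entries are not among the first three rows. It uses .iloc for the positional slices, which is the explicit form of the [:3] slicing used above.

```python
import pandas as pd

# Hypothetical values; the two largest sit at the end of the Series.
toy = pd.Series([0.2, 0.5, 0.1, 0.9, 0.7], name="Valence")

# Sort everything first, then keep the labels of the top three.
top3 = toy.sort_values(ascending=False).iloc[:3].index

# Keep the first three rows first, then sort only those three.
first3_sorted = toy.iloc[:3].sort_values(ascending=False).index

print(list(top3))           # [3, 4, 1]
print(list(first3_sorted))  # [1, 0, 2]
```

Only the first version can ever return labels 3 and 4, because the second version discards those rows before sorting.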
The expression df["Valence"].sort_values(ascending=False)[:10].index evaluates to index labels (not integer positions), so it is natural to pass it to df.loc.
df.loc[df["Valence"].sort_values(ascending=False)[:10].index]
| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
884 | 885 | 148 | 1 | 2020-09-18--2020-09-25 | September | 5,329,256 | Earth, Wind & Fire | 3008916.0 | 2grjqo0Frpf2okIBiifQKs | ['disco', 'funk', 'jazz funk', 'motown', 'quie... | ... | 0.697 | 0.832 | -7.264 | 0.0298 | 0.1680 | 0.2690 | 125.926 | 215093.0 | 0.979 | A |
1408 | 1409 | 85 | 1 | 2020-02-14--2020-02-21 | Running Over (feat. Lil Dicky) | 7,493,188 | Justin Bieber | 48544923.0 | 75nKBP8jQu681pTNCtrEnn | ['canadian pop', 'pop', 'post-teen pop'] | ... | 0.774 | 0.603 | -7.319 | 0.0591 | 0.4380 | 0.0869 | 149.982 | 179627.0 | 0.977 | B |
677 | 678 | 149 | 1 | 2020-12-18--2020-12-25 | Little Saint Nick - 1991 Remix | 7,301,381 | The Beach Boys | 1251372.0 | 63Lk6VuXdj7S58R3wLdv9r | [] | ... | 0.602 | 0.553 | -9.336 | 0.0328 | 0.1080 | 0.0512 | 130.594 | 118840.0 | 0.971 | B |
1096 | 1097 | 129 | 4 | 2020-06-05--2020-06-12 | Na Raba Toma Tapão | 4,396,629 | Niack | 352402.0 | 0AGS6ZRgzobrazmCi6pYMe | ['funk carioca'] | ... | 0.962 | 0.787 | 1.509 | 0.0554 | 0.6660 | 0.1760 | 130.003 | 165231.0 | 0.968 | D#/Eb |
1230 | 1231 | 102 | 1 | 2020-04-17--2020-04-24 | JUMP (feat. YoungBoy Never Broke Again) | 6,033,348 | DaBaby | 7601122.0 | 0oT9ElXYSxvnOOagP9efDq | ['north carolina hip hop', 'rap'] | ... | 0.896 | 0.720 | -6.262 | 0.3550 | 0.1690 | 0.2520 | 140.100 | 212093.0 | 0.966 | C |
463 | 464 | 45 | 18 | 2021-01-01--2021-01-08 | BEBÉ | 4,967,348 | Camilo, El Alfa | 10580764.0 | 7D7EH7MGyNHWSkqrszerI1 | ['colombian pop', 'reggaeton colombiano'] | ... | 0.862 | 0.720 | -4.048 | 0.0379 | 0.4870 | 0.0604 | 129.972 | 198707.0 | 0.965 | E |
130 | 131 | 131 | 3 | 2021-07-23--2021-07-30 | Aquelas Coisas | 6,012,839 | João Gomes | 409173.0 | 0FqVtQxRD3HsPltldG5v5M | [] | ... | 0.682 | 0.873 | -4.163 | 0.0449 | 0.4020 | 0.0946 | 150.006 | 147072.0 | 0.964 | F#/Gb |
1390 | 1391 | 154 | 4 | 2020-02-21--2020-02-28 | Tudo Ok | 5,510,844 | Thiaguinho MT, Mila, JS o Mão de Ouro | 28017.0 | 4HUZBG98TYbxSR9V1V2DWS | ['brega funk', 'funk carioca'] | ... | 0.814 | 0.755 | -6.164 | 0.0942 | 0.2390 | 0.3060 | 79.976 | 178500.0 | 0.963 | B |
512 | 513 | 24 | 25 | 2020-11-06--2020-11-13 | Se Te Nota (with Guaynaa) | 5,168,240 | Lele Pons | 786461.0 | 11EnQRgRMJwMAesfkB5pnu | ['latin pop', 'viral pop'] | ... | 0.905 | 0.686 | -3.152 | 0.0664 | 0.0907 | 0.2660 | 103.013 | 155825.0 | 0.963 | C |
627 | 628 | 14 | 6 | 2020-12-18--2020-12-25 | Feliz Navidad | 11,664,490 | José Feliciano | 239129.0 | 0oPdaY4dXtc3ZsaG17V972 | ['latin pop', 'puerto rican pop'] | ... | 0.513 | 0.831 | -9.004 | 0.0383 | 0.5500 | 0.3360 | 148.837 | 182067.0 | 0.963 | D |
10 rows × 23 columns
An easier way to accomplish the same thing is to sort the whole DataFrame at once. So here we are using the DataFrame sort_values method, rather than the Series sort_values method.
By sorting the whole DataFrame.
df.sort_values("Valence", ascending=False)[:10]
| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
884 | 885 | 148 | 1 | 2020-09-18--2020-09-25 | September | 5,329,256 | Earth, Wind & Fire | 3008916.0 | 2grjqo0Frpf2okIBiifQKs | ['disco', 'funk', 'jazz funk', 'motown', 'quie... | ... | 0.697 | 0.832 | -7.264 | 0.0298 | 0.1680 | 0.2690 | 125.926 | 215093.0 | 0.979 | A |
1408 | 1409 | 85 | 1 | 2020-02-14--2020-02-21 | Running Over (feat. Lil Dicky) | 7,493,188 | Justin Bieber | 48544923.0 | 75nKBP8jQu681pTNCtrEnn | ['canadian pop', 'pop', 'post-teen pop'] | ... | 0.774 | 0.603 | -7.319 | 0.0591 | 0.4380 | 0.0869 | 149.982 | 179627.0 | 0.977 | B |
677 | 678 | 149 | 1 | 2020-12-18--2020-12-25 | Little Saint Nick - 1991 Remix | 7,301,381 | The Beach Boys | 1251372.0 | 63Lk6VuXdj7S58R3wLdv9r | [] | ... | 0.602 | 0.553 | -9.336 | 0.0328 | 0.1080 | 0.0512 | 130.594 | 118840.0 | 0.971 | B |
1096 | 1097 | 129 | 4 | 2020-06-05--2020-06-12 | Na Raba Toma Tapão | 4,396,629 | Niack | 352402.0 | 0AGS6ZRgzobrazmCi6pYMe | ['funk carioca'] | ... | 0.962 | 0.787 | 1.509 | 0.0554 | 0.6660 | 0.1760 | 130.003 | 165231.0 | 0.968 | D#/Eb |
1230 | 1231 | 102 | 1 | 2020-04-17--2020-04-24 | JUMP (feat. YoungBoy Never Broke Again) | 6,033,348 | DaBaby | 7601122.0 | 0oT9ElXYSxvnOOagP9efDq | ['north carolina hip hop', 'rap'] | ... | 0.896 | 0.720 | -6.262 | 0.3550 | 0.1690 | 0.2520 | 140.100 | 212093.0 | 0.966 | C |
463 | 464 | 45 | 18 | 2021-01-01--2021-01-08 | BEBÉ | 4,967,348 | Camilo, El Alfa | 10580764.0 | 7D7EH7MGyNHWSkqrszerI1 | ['colombian pop', 'reggaeton colombiano'] | ... | 0.862 | 0.720 | -4.048 | 0.0379 | 0.4870 | 0.0604 | 129.972 | 198707.0 | 0.965 | E |
130 | 131 | 131 | 3 | 2021-07-23--2021-07-30 | Aquelas Coisas | 6,012,839 | João Gomes | 409173.0 | 0FqVtQxRD3HsPltldG5v5M | [] | ... | 0.682 | 0.873 | -4.163 | 0.0449 | 0.4020 | 0.0946 | 150.006 | 147072.0 | 0.964 | F#/Gb |
1390 | 1391 | 154 | 4 | 2020-02-21--2020-02-28 | Tudo Ok | 5,510,844 | Thiaguinho MT, Mila, JS o Mão de Ouro | 28017.0 | 4HUZBG98TYbxSR9V1V2DWS | ['brega funk', 'funk carioca'] | ... | 0.814 | 0.755 | -6.164 | 0.0942 | 0.2390 | 0.3060 | 79.976 | 178500.0 | 0.963 | B |
512 | 513 | 24 | 25 | 2020-11-06--2020-11-13 | Se Te Nota (with Guaynaa) | 5,168,240 | Lele Pons | 786461.0 | 11EnQRgRMJwMAesfkB5pnu | ['latin pop', 'viral pop'] | ... | 0.905 | 0.686 | -3.152 | 0.0664 | 0.0907 | 0.2660 | 103.013 | 155825.0 | 0.963 | C |
627 | 628 | 14 | 6 | 2020-12-18--2020-12-25 | Feliz Navidad | 11,664,490 | José Feliciano | 239129.0 | 0oPdaY4dXtc3ZsaG17V972 | ['latin pop', 'puerto rican pop'] | ... | 0.513 | 0.831 | -9.004 | 0.0383 | 0.5500 | 0.3360 | 148.837 | 182067.0 | 0.963 | D |
10 rows × 23 columns
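As a quick check that the two approaches select the same rows, here is a sketch on a small made-up DataFrame (the song names and values are hypothetical). It also includes one missing value to show that sort_values places NaN last by default, and it shows the pandas nlargest method as a possible shortcut, which was not part of the approaches above.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset with one missing "Valence" value.
toy = pd.DataFrame({
    "Song Name": ["a", "b", "c", "d", "e"],
    "Valence": [0.2, np.nan, 0.9, 0.5, 0.7],
})

# Approach 1: sort the Series, take the first labels, look them up with loc.
labels = toy["Valence"].sort_values(ascending=False).iloc[:3].index
via_series = toy.loc[labels]

# Approach 2: sort the whole DataFrame by one column, then slice.
via_frame = toy.sort_values("Valence", ascending=False).iloc[:3]

# Possible shortcut: nlargest picks the rows with the largest values directly.
via_nlargest = toy.nlargest(3, "Valence")

print(via_frame)
print(via_series.equals(via_frame))  # True
```

In all three results the row with the missing value is absent, since NaN is sorted to the end and is never among the largest values.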
df.columns
Index(['Index', 'Highest Charting Position', 'Number of Times Charted',
'Week of Highest Charting', 'Song Name', 'Streams', 'Artist',
'Artist Followers', 'Song ID', 'Genre', 'Release Date', 'Weeks Charted',
'Popularity', 'Danceability', 'Energy', 'Loudness', 'Speechiness',
'Acousticness', 'Liveness', 'Tempo', 'Duration (ms)', 'Valence',
'Chord'],
dtype='object')
Including the month¶
Add a column to df named “Month” that contains an integer from 1 to 12 indicating the month of the “Week of Highest Charting”. Use the following strategy.
Drop all rows containing missing values.
Using map and a lambda function, get the first 10 characters in each string from the “Week of Highest Charting” column.
Use .dt.month and pd.to_datetime to find the numeric month.
(There are probably easier ways, since we already have the numeric month in the original string, but this method generalizes nicely if we want something like the month name instead of the month number.)
# Alternate version: df = df.dropna()
df.dropna(inplace=True)
df.shape
(1545, 23)
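The two versions of dropna mentioned in the comment above behave differently in what they return. A minimal sketch on a made-up DataFrame (the column names x and y are placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 1 and 2 each contain a missing value.
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", "b", None]})

# By default, dropna returns a new DataFrame and leaves the original unchanged.
cleaned = toy.dropna()
print(toy.shape)      # (3, 2)
print(cleaned.shape)  # (1, 2)

# With inplace=True, the DataFrame itself is modified and None is returned.
result = toy.dropna(inplace=True)
print(result)     # None
print(toy.shape)  # (1, 2)
```

Either way, the surviving rows keep their original index labels. That is why the Series below has length 1545 but still has labels running up to 1555.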
df["Week of Highest Charting"]
0 2021-07-23--2021-07-30
1 2021-07-23--2021-07-30
2 2021-06-25--2021-07-02
3 2021-07-02--2021-07-09
4 2021-07-23--2021-07-30
...
1551 2019-12-27--2020-01-03
1552 2019-12-27--2020-01-03
1553 2019-12-27--2020-01-03
1554 2019-12-27--2020-01-03
1555 2019-12-27--2020-01-03
Name: Week of Highest Charting, Length: 1545, dtype: object
Often, if we want to do something to each entry in a pandas Series, we can just perform that operation on the whole Series, and it will automatically get applied elementwise. For example, if s is a pandas Series, then s+2 will add 2 to each of the entries in s.
That does not work for slicing though. For example, in the following, we are keeping only the first 10 rows, rather than keeping the first 10 characters in each entry.
df["Week of Highest Charting"][:10]
0 2021-07-23--2021-07-30
1 2021-07-23--2021-07-30
2 2021-06-25--2021-07-02
3 2021-07-02--2021-07-09
4 2021-07-23--2021-07-30
5 2021-05-07--2021-05-14
6 2021-05-14--2021-05-21
7 2021-06-18--2021-06-25
8 2021-06-18--2021-06-25
9 2021-07-02--2021-07-09
Name: Week of Highest Charting, dtype: object
One natural way to do this is to use a lambda function.
df["Week of Highest Charting"].map(lambda x: x[:10])
0 2021-07-23
1 2021-07-23
2 2021-06-25
3 2021-07-02
4 2021-07-23
...
1551 2019-12-27
1552 2019-12-27
1553 2019-12-27
1554 2019-12-27
1555 2019-12-27
Name: Week of Highest Charting, Length: 1545, dtype: object
Reminder on lambda functions. lambda provides a quick way to define a function. For example, here we define the squaring function.
f = lambda x: x**2
f(3)
9
Here is an equivalent way to get the first 10 characters in each entry, but it’s definitely less elegant.
# less elegant
def first_ten(x):
    return x[:10]
df["Week of Highest Charting"].map(first_ten)
0 2021-07-23
1 2021-07-23
2 2021-06-25
3 2021-07-02
4 2021-07-23
...
1551 2019-12-27
1552 2019-12-27
1553 2019-12-27
1554 2019-12-27
1555 2019-12-27
Name: Week of Highest Charting, Length: 1545, dtype: object
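Another option, not used in class, is the pandas .str accessor, which applies string operations (including slicing) to each entry of a Series. The sketch below uses two made-up entries in the same format as the “Week of Highest Charting” column and checks that .str[:10] agrees with the map and lambda approach.

```python
import pandas as pd

# Two hypothetical entries in the same format as the real column.
weeks = pd.Series(["2021-07-23--2021-07-30", "2019-12-27--2020-01-03"])

# Slicing the Series itself selects rows, not characters.
first_row = weeks.iloc[:1]

# The .str accessor applies the slice to each string instead.
via_str = weeks.str[:10]

# The map + lambda version from above, for comparison.
via_map = weeks.map(lambda x: x[:10])

print(via_str.tolist())         # ['2021-07-23', '2019-12-27']
print(via_str.equals(via_map))  # True
```

The map version is more general, since the function passed to it can be anything, while .str is limited to string operations.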
temp_series = df["Week of Highest Charting"].map(lambda x: x[:10])
temp_series
0 2021-07-23
1 2021-07-23
2 2021-06-25
3 2021-07-02
4 2021-07-23
...
1551 2019-12-27
1552 2019-12-27
1553 2019-12-27
1554 2019-12-27
1555 2019-12-27
Name: Week of Highest Charting, Length: 1545, dtype: object
This temp_series is the sort of pandas Series which can be converted into the datetime dtype. Notice how the dtype of the previous Series was object (where you should think “string”), whereas in the following it is datetime64[ns].
pd.to_datetime(temp_series)
0 2021-07-23
1 2021-07-23
2 2021-06-25
3 2021-07-02
4 2021-07-23
...
1551 2019-12-27
1552 2019-12-27
1553 2019-12-27
1554 2019-12-27
1555 2019-12-27
Name: Week of Highest Charting, Length: 1545, dtype: datetime64[ns]
Once the Series is in the correct format, we can apply all sorts of useful methods, here using the dt accessor.
pd.to_datetime(temp_series).dt.month_name()
0 July
1 July
2 June
3 July
4 July
...
1551 December
1552 December
1553 December
1554 December
1555 December
Name: Week of Highest Charting, Length: 1545, dtype: object
Here we add a new column to the DataFrame containing the numerical month.
df["Month"] = pd.to_datetime(temp_series).dt.month
df
| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord | Month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48,633,449 | Måneskin | 3377762.0 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.800 | -4.808 | 0.0504 | 0.12700 | 0.3590 | 134.002 | 211560.0 | 0.589 | B | 7 |
1 | 2 | 2 | 3 | 2021-07-23--2021-07-30 | STAY (with Justin Bieber) | 47,248,719 | The Kid LAROI | 2230022.0 | 5HCyWlXZPP0y6Gqq8TgA20 | ['australian hip hop'] | ... | 0.764 | -5.484 | 0.0483 | 0.03830 | 0.1030 | 169.928 | 141806.0 | 0.478 | C#/Db | 7 |
2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40,162,559 | Olivia Rodrigo | 6266514.0 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.664 | -5.044 | 0.1540 | 0.33500 | 0.0849 | 166.928 | 178147.0 | 0.688 | A | 6 |
3 | 4 | 3 | 5 | 2021-07-02--2021-07-09 | Bad Habits | 37,799,456 | Ed Sheeran | 83293380.0 | 6PQ88X9TkUIAUIZJHW2upE | ['pop', 'uk pop'] | ... | 0.897 | -3.712 | 0.0348 | 0.04690 | 0.3640 | 126.026 | 231041.0 | 0.591 | B | 7 |
4 | 5 | 5 | 1 | 2021-07-23--2021-07-30 | INDUSTRY BABY (feat. Jack Harlow) | 33,948,454 | Lil Nas X | 5473565.0 | 27NovPIUIRrOZoCHxABJwK | ['lgbtq+ hip hop', 'pop rap'] | ... | 0.704 | -7.409 | 0.0615 | 0.02030 | 0.0501 | 149.995 | 212000.0 | 0.894 | D#/Eb | 7 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1551 | 1552 | 195 | 1 | 2019-12-27--2020-01-03 | New Rules | 4,630,675 | Dua Lipa | 27167675.0 | 2ekn2ttSfGqwhhate0LSR0 | ['dance pop', 'pop', 'uk pop'] | ... | 0.700 | -6.021 | 0.0694 | 0.00261 | 0.1530 | 116.073 | 209320.0 | 0.608 | A | 12 |
1552 | 1553 | 196 | 1 | 2019-12-27--2020-01-03 | Cheirosa - Ao Vivo | 4,623,030 | Jorge & Mateus | 15019109.0 | 2PWjKmjyTZeDpmOUa3a5da | ['sertanejo', 'sertanejo universitario'] | ... | 0.870 | -3.123 | 0.0851 | 0.24000 | 0.3330 | 152.370 | 181930.0 | 0.714 | B | 12 |
1553 | 1554 | 197 | 1 | 2019-12-27--2020-01-03 | Havana (feat. Young Thug) | 4,620,876 | Camila Cabello | 22698747.0 | 1rfofaqEpACxVEHIZBJe6W | ['dance pop', 'electropop', 'pop', 'post-teen ... | ... | 0.523 | -4.333 | 0.0300 | 0.18400 | 0.1320 | 104.988 | 217307.0 | 0.394 | D | 12 |
1554 | 1555 | 198 | 1 | 2019-12-27--2020-01-03 | Surtada - Remix Brega Funk | 4,607,385 | Dadá Boladão, Tati Zaqui, OIK | 208630.0 | 5F8ffc8KWKNawllr5WsW0r | ['brega funk', 'funk carioca'] | ... | 0.550 | -7.026 | 0.0587 | 0.24900 | 0.1820 | 154.064 | 152784.0 | 0.881 | F | 12 |
1555 | 1556 | 199 | 1 | 2019-12-27--2020-01-03 | Lover (Remix) [feat. Shawn Mendes] | 4,595,450 | Taylor Swift | 42227614.0 | 3i9UVldZOE0aD0JnyfAZZ0 | ['pop', 'post-teen pop'] | ... | 0.603 | -7.176 | 0.0640 | 0.43300 | 0.0862 | 205.272 | 221307.0 | 0.422 | G | 12 |
1545 rows × 24 columns
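Putting the three steps together, here is a self-contained sketch of the whole strategy on a small made-up DataFrame. The four entries are hypothetical, including one missing value, so the code can be run without the Spotify file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real column, with one missing value.
toy = pd.DataFrame({
    "Week of Highest Charting": [
        "2021-07-23--2021-07-30",
        "2021-06-25--2021-07-02",
        np.nan,
        "2019-12-27--2020-01-03",
    ]
})

# Step 1: drop rows with missing values (NaN is a float, so x[:10] would fail on it).
toy = toy.dropna()

# Step 2: keep the first 10 characters (the start date) of each string.
start = toy["Week of Highest Charting"].map(lambda x: x[:10])

# Step 3: convert to datetime and extract the month number.
toy["Month"] = pd.to_datetime(start).dt.month

print(toy["Month"].tolist())  # [7, 6, 12]
```

Replacing .dt.month with .dt.month_name() in the last step would give the month names instead, which is the generalization mentioned earlier.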