Week 4 Monday

Today’s class is mostly meant as review, but there is some new material I want to introduce, including:

  • np.count_nonzero

  • sort_values

Announcements

  • Videos and video quizzes are due Tuesday before discussion (different day than most weeks). No in-class quiz this week.

  • Midterm is Thursday during discussion. You’re allowed to use a notecard with handwritten notes on both sides; ask Chris or Yasmeen if you need a new notecard.

  • In discussion section on Tuesday, Yasmeen will go over some of the sample midterm. (It’s on Canvas, on the Week 4 page.)

  • In general, the material from this week could appear on the midterm. (This week’s material is mostly meant as review.) If you’re curious about whether you need to memorize/study something specific, please ask on Ed Discussion.

Timing operations

import numpy as np
import pandas as pd
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
  • Define s to be the pandas Series corresponding to the “Energy” column in df.

s = df["Energy"]
type(s)
pandas.core.series.Series

Count how many values in s are strictly greater than 0.7. Try each of the following strategies.

  • Use a pandas Boolean Series and the sum method.

s[:5]
0    0.800
1    0.764
2    0.664
3    0.897
4    0.704
Name: Energy, dtype: float64
s>0.7
0        True
1        True
2       False
3        True
4        True
        ...  
1551    False
1552     True
1553    False
1554    False
1555    False
Name: Energy, Length: 1556, dtype: bool
(s>0.7).sum()
569
  • Use a pandas Boolean Series and Python’s built-in sum function.

sum(s>0.7)
569

NumPy’s np.count_nonzero function is the most efficient way I know to count how many entries satisfy a condition.

  • Use a pandas Boolean Series and NumPy’s np.count_nonzero function. (This function accepts many different types of inputs. If the input is a Boolean Series, it will count how often True occurs, because False is treated as zero.)

np.count_nonzero(s>0.7)
569
np.count_nonzero(range(5))
4
np.count_nonzero([0,1,0,2,2])
3
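
As a small check that the three counting strategies so far agree, here is a minimal sketch on a made-up five-element Series (the name energy and its values are assumptions standing in for the “Energy” column, not the full dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first few "Energy" values.
energy = pd.Series([0.800, 0.764, 0.664, 0.897, 0.704])

mask = energy > 0.7                    # Boolean Series: True where the value exceeds 0.7

count_method = mask.sum()              # pandas Series method
count_builtin = sum(mask)              # Python's built-in sum, adds the entries one at a time
count_numpy = np.count_nonzero(mask)   # counts the True entries

print(count_method, count_builtin, count_numpy)
```

All three treat True as 1 and False as 0, which is why summing and counting nonzero entries give the same number.
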
  • Use a list comprehension together with len. (In other words, make a list containing all the values greater than 0.7, then compute the length of that list.)

len([x for x in s if x>0.7])
569

Aside: recall that in a list comprehension, an if by itself (a filter) goes at the end, after the for clause, whereas if together with else (which chooses between two values) goes at the beginning, before the for clause.

[x if x > 0.7 else "christopher" for x in s ]
[0.8,
 0.764,
 'christopher',
 0.897,
 0.704,
 'christopher',
 0.701,
 0.718,
 'christopher',
 'christopher',
 0.825,
 0.819,
 0.741,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.825,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.73,
 'christopher',
 0.766,
 'christopher',
 0.862,
 'christopher',
 'christopher',
 0.796,
 'christopher',
 'christopher',
 0.816,
 'christopher',
 'christopher',
 'christopher',
 0.71,
 'christopher',
 'christopher',
 'christopher',
 0.809,
 0.765,
 'christopher',
 0.784,
 0.716,
 'christopher',
 0.711,
 'christopher',
 0.849,
 'christopher',
 'christopher',
 0.839,
 0.738,
 0.941,
 0.887,
 'christopher',
 'christopher',
 0.72,
 'christopher',
 0.784,
 'christopher',
 0.825,
 'christopher',
 0.793,
 0.782,
 0.807,
 0.77,
 'christopher',
 'christopher',
 'christopher',
 0.706,
 0.899,
 'christopher',
 0.939,
 'christopher',
 'christopher',
 0.701,
 'christopher',
 'christopher',
 'christopher',
 0.948,
 'christopher',
 'christopher',
 0.874,
 0.782,
 0.753,
 'christopher',
 'christopher',
 'christopher',
 0.802,
 0.762,
 0.835,
 0.813,
 0.72,
 0.764,
 0.83,
 'christopher',
 'christopher',
 0.825,
 'christopher',
 0.812,
 0.78,
 'christopher',
 'christopher',
 0.719,
 0.899,
 0.837,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.96,
 0.731,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.812,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.835,
 0.731,
 0.912,
 'christopher',
 'christopher',
 0.873,
 0.859,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.706,
 0.732,
 0.795,
 0.788,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.855,
 0.815,
 'christopher',
 0.717,
 0.861,
 0.705,
 'christopher',
 'christopher',
 0.899,
 0.783,
 0.792,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.924,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.759,
 'christopher',
 'christopher',
 0.771,
 'christopher',
 0.76,
 'christopher',
 'christopher',
 'christopher',
 0.805,
 0.766,
 'christopher',
 'christopher',
 0.878,
 'christopher',
 'christopher',
 0.767,
 'christopher',
 0.72,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.724,
 'christopher',
 0.844,
 0.751,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.707,
 0.732,
 'christopher',
 0.716,
 0.71,
 'christopher',
 'christopher',
 'christopher',
 0.922,
 0.909,
 'christopher',
 0.889,
 'christopher',
 0.918,
 0.712,
 0.739,
 'christopher',
 0.893,
 0.717,
 'christopher',
 0.708,
 'christopher',
 'christopher',
 0.727,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.908,
 'christopher',
 0.795,
 'christopher',
 0.736,
 'christopher',
 'christopher',
 0.762,
 'christopher',
 0.75,
 0.845,
 'christopher',
 'christopher',
 0.855,
 'christopher',
 0.866,
 'christopher',
 0.712,
 0.814,
 0.862,
 'christopher',
 0.707,
 'christopher',
 0.764,
 'christopher',
 'christopher',
 0.809,
 0.756,
 'christopher',
 0.802,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.893,
 0.922,
 'christopher',
 'christopher',
 0.843,
 'christopher',
 0.729,
 'christopher',
 0.733,
 'christopher',
 0.864,
 'christopher',
 0.73,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.706,
 'christopher',
 0.934,
 0.719,
 'christopher',
 0.712,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.828,
 'christopher',
 'christopher',
 0.855,
 'christopher',
 0.831,
 'christopher',
 0.836,
 0.713,
 'christopher',
 'christopher',
 'christopher',
 0.838,
 'christopher',
 'christopher',
 'christopher',
 0.73,
 0.789,
 'christopher',
 'christopher',
 0.862,
 0.725,
 0.745,
 'christopher',
 'christopher',
 0.704,
 'christopher',
 'christopher',
 'christopher',
 0.709,
 'christopher',
 0.856,
 0.89,
 0.793,
 0.867,
 0.955,
 0.737,
 0.814,
 'christopher',
 'christopher',
 'christopher',
 0.929,
 'christopher',
 'christopher',
 0.749,
 'christopher',
 'christopher',
 0.759,
 'christopher',
 0.74,
 'christopher',
 0.703,
 'christopher',
 'christopher',
 0.748,
 0.788,
 0.856,
 'christopher',
 0.83,
 0.837,
 0.8,
 'christopher',
 0.744,
 'christopher',
 0.721,
 'christopher',
 'christopher',
 'christopher',
 0.819,
 'christopher',
 'christopher',
 'christopher',
 0.739,
 'christopher',
 'christopher',
 0.796,
 0.789,
 0.814,
 'christopher',
 0.789,
 0.734,
 0.701,
 0.848,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.817,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.821,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.719,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.758,
 'christopher',
 'christopher',
 0.703,
 0.792,
 0.763,
 0.932,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.773,
 0.741,
 0.899,
 'christopher',
 0.732,
 'christopher',
 'christopher',
 'christopher',
 0.821,
 'christopher',
 'christopher',
 'christopher',
 0.909,
 'christopher',
 0.891,
 0.771,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.711,
 'christopher',
 'christopher',
 0.922,
 0.823,
 'christopher',
 0.756,
 0.741,
 'christopher',
 0.723,
 0.902,
 0.72,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.715,
 0.761,
 0.724,
 'christopher',
 'christopher',
 0.823,
 0.85,
 'christopher',
 'christopher',
 0.701,
 0.794,
 'christopher',
 0.833,
 'christopher',
 'christopher',
 'christopher',
 0.739,
 'christopher',
 'christopher',
 'christopher',
 0.713,
 'christopher',
 0.741,
 'christopher',
 'christopher',
 0.922,
 'christopher',
 0.76,
 'christopher',
 'christopher',
 0.74,
 0.819,
 'christopher',
 0.903,
 0.727,
 0.761,
 0.912,
 0.834,
 0.744,
 'christopher',
 'christopher',
 0.737,
 'christopher',
 0.705,
 'christopher',
 0.828,
 'christopher',
 0.797,
 'christopher',
 0.709,
 0.708,
 0.723,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.756,
 'christopher',
 'christopher',
 'christopher',
 0.775,
 0.742,
 'christopher',
 0.706,
 'christopher',
 0.788,
 'christopher',
 0.816,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.774,
 0.893,
 'christopher',
 'christopher',
 'christopher',
 0.844,
 0.769,
 'christopher',
 'christopher',
 0.814,
 'christopher',
 0.859,
 'christopher',
 'christopher',
 0.746,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.715,
 0.737,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.784,
 0.726,
 'christopher',
 'christopher',
 0.707,
 0.806,
 'christopher',
 'christopher',
 'christopher',
 0.786,
 'christopher',
 'christopher',
 'christopher',
 0.869,
 'christopher',
 0.937,
 0.701,
 'christopher',
 'christopher',
 0.79,
 'christopher',
 'christopher',
 0.707,
 0.792,
 0.865,
 0.836,
 0.914,
 'christopher',
 'christopher',
 0.88,
 0.937,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.812,
 0.831,
 'christopher',
 'christopher',
 'christopher',
 0.772,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.904,
 'christopher',
 0.876,
 0.776,
 'christopher',
 'christopher',
 0.759,
 0.925,
 'christopher',
 0.953,
 'christopher',
 0.841,
 'christopher',
 'christopher',
 'christopher',
 0.814,
 0.843,
 'christopher',
 'christopher',
 'christopher',
 0.87,
 0.869,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.712,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.813,
 'christopher',
 0.771,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.938,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.939,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.73,
 'christopher',
 'christopher',
 0.704,
 0.724,
 0.758,
 0.727,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.881,
 'christopher',
 'christopher',
 'christopher',
 0.703,
 0.885,
 'christopher',
 0.723,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.829,
 0.727,
 0.74,
 0.798,
 0.745,
 0.725,
 0.861,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.845,
 'christopher',
 0.908,
 'christopher',
 0.756,
 'christopher',
 'christopher',
 0.882,
 0.705,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.708,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.782,
 'christopher',
 'christopher',
 'christopher',
 0.787,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.824,
 'christopher',
 'christopher',
 0.97,
 'christopher',
 'christopher',
 'christopher',
 0.765,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.819,
 'christopher',
 0.808,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.716,
 0.741,
 0.796,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.851,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.887,
 'christopher',
 'christopher',
 0.726,
 0.913,
 'christopher',
 'christopher',
 0.771,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.837,
 'christopher',
 'christopher',
 'christopher',
 0.729,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.865,
 0.939,
 'christopher',
 0.867,
 'christopher',
 'christopher',
 0.758,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.733,
 'christopher',
 0.875,
 'christopher',
 0.737,
 'christopher',
 0.733,
 0.793,
 0.706,
 'christopher',
 'christopher',
 'christopher',
 0.709,
 'christopher',
 0.818,
 0.703,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.749,
 'christopher',
 0.858,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.776,
 'christopher',
 'christopher',
 0.884,
 'christopher',
 0.77,
 'christopher',
 0.723,
 'christopher',
 'christopher',
 0.832,
 0.768,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.844,
 0.773,
 'christopher',
 0.706,
 'christopher',
 'christopher',
 'christopher',
 0.727,
 'christopher',
 'christopher',
 'christopher',
 0.814,
 'christopher',
 0.955,
 'christopher',
 'christopher',
 0.743,
 0.82,
 'christopher',
 0.911,
 'christopher',
 'christopher',
 0.799,
 0.857,
 0.71,
 'christopher',
 'christopher',
 0.716,
 'christopher',
 'christopher',
 0.794,
 0.854,
 'christopher',
 'christopher',
 'christopher',
 0.796,
 0.87,
 'christopher',
 'christopher',
 0.863,
 'christopher',
 'christopher',
 0.817,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.854,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.826,
 'christopher',
 0.781,
 'christopher',
 'christopher',
 0.766,
 'christopher',
 'christopher',
 'christopher',
 0.744,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.716,
 'christopher',
 0.704,
 0.848,
 0.741,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.742,
 'christopher',
 'christopher',
 'christopher',
 0.857,
 0.909,
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 'christopher',
 0.764,
 0.723,
 'christopher',
 0.787,
 'christopher',
 0.76,
 'christopher',
 ...]
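
To see the two placements of if side by side, here is a minimal sketch on a short made-up list (the list values and the replacement string "low" are assumptions chosen for illustration).

```python
values = [0.8, 0.764, 0.664, 0.897, 0.704]

# `if` at the end filters: elements failing the test are dropped,
# so the result can be shorter than the input.
kept = [x for x in values if x > 0.7]

# `if ... else ...` at the beginning chooses a value for every element,
# so the result always has the same length as the input.
replaced = [x if x > 0.7 else "low" for x in values]

print(kept)
print(replaced)
```
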
  • Time each of these strategies using %%timeit. Which is fastest?

In this case, the timings are all within the same order of magnitude (roughly 40 to 140 µs), with np.count_nonzero being the fastest and Python’s built-in sum the slowest. If the original DataFrame were bigger, say with ten million rows instead of about 1,500 rows, I think the differences would be more pronounced.

%%timeit
(s > 0.7).sum()
62 µs ± 582 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
sum(s > 0.7)
142 µs ± 1.37 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
np.count_nonzero(s > 0.7)
42.2 µs ± 150 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
len([x for x in s if x>0.7])
137 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
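
Outside of a notebook, where %%timeit is not available, a similar comparison can be sketched with Python’s standard timeit module. The Series big below is synthetic random data (an assumption, not the Spotify dataset), and the actual timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for a much longer "Energy" column (an assumption, not course data).
rng = np.random.default_rng(0)
big = pd.Series(rng.random(200_000))

strategies = {
    "Series.sum": lambda: (big > 0.7).sum(),
    "built-in sum": lambda: sum(big > 0.7),
    "np.count_nonzero": lambda: np.count_nonzero(big > 0.7),
    "len of list comprehension": lambda: len([x for x in big if x > 0.7]),
}

# All four strategies should agree on the count.
counts = {name: int(func()) for name, func in strategies.items()}

# Average time per call over a few repetitions.
for name, func in strategies.items():
    seconds = timeit.timeit(func, number=5) / 5
    print(f"{name}: {seconds * 1e3:.2f} ms per call, count = {counts[name]}")
```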

Sorting pandas Series and DataFrames

One brief topic I want to cover before the midterm is using the sort_values method.

What songs have the 10 highest “Valence” levels in the Spotify dataset? Solve this two ways.

  • By sorting a pandas Series, getting its index, and then getting the sub-DataFrame.

df["Valence"].sort_values()
555     0.0320
1501    0.0360
915     0.0363
865     0.0376
1179    0.0391
         ...  
750        NaN
784        NaN
876        NaN
1140       NaN
1538       NaN
Name: Valence, Length: 1556, dtype: float64
df["Valence"].sort_values(ascending=False)
884     0.979
1408    0.977
677     0.971
1096    0.968
1230    0.966
        ...  
750       NaN
784       NaN
876       NaN
1140      NaN
1538      NaN
Name: Valence, Length: 1556, dtype: float64
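
Notice in both outputs above that the missing values (NaN) are placed at the end, whether we sort ascending or descending. Here is a minimal sketch of that behavior on a made-up Series. The na_position keyword argument of sort_values was not used in class; it is included only to show that the placement can be changed.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry.
valence = pd.Series([0.5, np.nan, 0.9, 0.1])

asc = valence.sort_values()                            # NaN goes last by default
desc = valence.sort_values(ascending=False)            # NaN still goes last
nan_first = valence.sort_values(na_position="first")   # NaN moved to the front

print(list(asc.index))
print(list(desc.index))
print(list(nan_first.index))
```
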
df["Valence"].sort_values(ascending=False).index
Int64Index([ 884, 1408,  677, 1096, 1230,  463,  130, 1390,  512,  627,
            ...
             163,  464,  530,  636,  654,  750,  784,  876, 1140, 1538],
           dtype='int64', length=1556)
df["Valence"].sort_values(ascending=False).index[:10]
Int64Index([884, 1408, 677, 1096, 1230, 463, 130, 1390, 512, 627], dtype='int64')
df["Valence"].sort_values(ascending=False)[:10].index
Int64Index([884, 1408, 677, 1096, 1230, 463, 130, 1390, 512, 627], dtype='int64')

The following code runs without error but gives the wrong answer, because it keeps only the first 10 rows before sorting, so none of the other rows are considered during the sorting.

df["Valence"][:10].sort_values(ascending=False).index
Int64Index([9, 4, 5, 6, 2, 3, 0, 1, 8, 7], dtype='int64')
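
The difference between sorting first and slicing first is easier to see on a tiny example. This is a minimal sketch using a made-up Series named scores (an assumption for illustration).

```python
import pandas as pd

# Hypothetical values; the two largest are at labels 1 and 4.
scores = pd.Series([0.2, 0.9, 0.4, 0.1, 0.8, 0.7])

# Correct: sort all of the values, then keep the top 2.
top2 = scores.sort_values(ascending=False)[:2].index

# Incorrect: keep the first 2 rows, then sort only those 2.
wrong = scores[:2].sort_values(ascending=False).index

print(list(top2))
print(list(wrong))
```

In the second version, the value at label 4 is never seen by sort_values, so it cannot appear in the result.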

The expression df["Valence"].sort_values(ascending=False)[:10].index evaluates to the row labels of the top 10 songs, so it is natural to pass it to df.loc.

df.loc[df["Valence"].sort_values(ascending=False)[:10].index]
 | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord
884 | 885 | 148 | 1 | 2020-09-18--2020-09-25 | September | 5,329,256 | Earth, Wind & Fire | 3008916.0 | 2grjqo0Frpf2okIBiifQKs | ['disco', 'funk', 'jazz funk', 'motown', 'quie... | ... | 0.697 | 0.832 | -7.264 | 0.0298 | 0.1680 | 0.2690 | 125.926 | 215093.0 | 0.979 | A
1408 | 1409 | 85 | 1 | 2020-02-14--2020-02-21 | Running Over (feat. Lil Dicky) | 7,493,188 | Justin Bieber | 48544923.0 | 75nKBP8jQu681pTNCtrEnn | ['canadian pop', 'pop', 'post-teen pop'] | ... | 0.774 | 0.603 | -7.319 | 0.0591 | 0.4380 | 0.0869 | 149.982 | 179627.0 | 0.977 | B
677 | 678 | 149 | 1 | 2020-12-18--2020-12-25 | Little Saint Nick - 1991 Remix | 7,301,381 | The Beach Boys | 1251372.0 | 63Lk6VuXdj7S58R3wLdv9r | [] | ... | 0.602 | 0.553 | -9.336 | 0.0328 | 0.1080 | 0.0512 | 130.594 | 118840.0 | 0.971 | B
1096 | 1097 | 129 | 4 | 2020-06-05--2020-06-12 | Na Raba Toma Tapão | 4,396,629 | Niack | 352402.0 | 0AGS6ZRgzobrazmCi6pYMe | ['funk carioca'] | ... | 0.962 | 0.787 | 1.509 | 0.0554 | 0.6660 | 0.1760 | 130.003 | 165231.0 | 0.968 | D#/Eb
1230 | 1231 | 102 | 1 | 2020-04-17--2020-04-24 | JUMP (feat. YoungBoy Never Broke Again) | 6,033,348 | DaBaby | 7601122.0 | 0oT9ElXYSxvnOOagP9efDq | ['north carolina hip hop', 'rap'] | ... | 0.896 | 0.720 | -6.262 | 0.3550 | 0.1690 | 0.2520 | 140.100 | 212093.0 | 0.966 | C
463 | 464 | 45 | 18 | 2021-01-01--2021-01-08 | BEBÉ | 4,967,348 | Camilo, El Alfa | 10580764.0 | 7D7EH7MGyNHWSkqrszerI1 | ['colombian pop', 'reggaeton colombiano'] | ... | 0.862 | 0.720 | -4.048 | 0.0379 | 0.4870 | 0.0604 | 129.972 | 198707.0 | 0.965 | E
130 | 131 | 131 | 3 | 2021-07-23--2021-07-30 | Aquelas Coisas | 6,012,839 | João Gomes | 409173.0 | 0FqVtQxRD3HsPltldG5v5M | [] | ... | 0.682 | 0.873 | -4.163 | 0.0449 | 0.4020 | 0.0946 | 150.006 | 147072.0 | 0.964 | F#/Gb
1390 | 1391 | 154 | 4 | 2020-02-21--2020-02-28 | Tudo Ok | 5,510,844 | Thiaguinho MT, Mila, JS o Mão de Ouro | 28017.0 | 4HUZBG98TYbxSR9V1V2DWS | ['brega funk', 'funk carioca'] | ... | 0.814 | 0.755 | -6.164 | 0.0942 | 0.2390 | 0.3060 | 79.976 | 178500.0 | 0.963 | B
512 | 513 | 24 | 25 | 2020-11-06--2020-11-13 | Se Te Nota (with Guaynaa) | 5,168,240 | Lele Pons | 786461.0 | 11EnQRgRMJwMAesfkB5pnu | ['latin pop', 'viral pop'] | ... | 0.905 | 0.686 | -3.152 | 0.0664 | 0.0907 | 0.2660 | 103.013 | 155825.0 | 0.963 | C
627 | 628 | 14 | 6 | 2020-12-18--2020-12-25 | Feliz Navidad | 11,664,490 | José Feliciano | 239129.0 | 0oPdaY4dXtc3ZsaG17V972 | ['latin pop', 'puerto rican pop'] | ... | 0.513 | 0.831 | -9.004 | 0.0383 | 0.5500 | 0.3360 | 148.837 | 182067.0 | 0.963 | D

10 rows × 23 columns

An easier way to accomplish the same thing is to sort the whole DataFrame at once. So here we are using a DataFrame’s sort_values method, rather than a Series sort_values method.

  • By sorting the whole DataFrame.

df.sort_values("Valence", ascending=False)[:10]
 | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord
884 | 885 | 148 | 1 | 2020-09-18--2020-09-25 | September | 5,329,256 | Earth, Wind & Fire | 3008916.0 | 2grjqo0Frpf2okIBiifQKs | ['disco', 'funk', 'jazz funk', 'motown', 'quie... | ... | 0.697 | 0.832 | -7.264 | 0.0298 | 0.1680 | 0.2690 | 125.926 | 215093.0 | 0.979 | A
1408 | 1409 | 85 | 1 | 2020-02-14--2020-02-21 | Running Over (feat. Lil Dicky) | 7,493,188 | Justin Bieber | 48544923.0 | 75nKBP8jQu681pTNCtrEnn | ['canadian pop', 'pop', 'post-teen pop'] | ... | 0.774 | 0.603 | -7.319 | 0.0591 | 0.4380 | 0.0869 | 149.982 | 179627.0 | 0.977 | B
677 | 678 | 149 | 1 | 2020-12-18--2020-12-25 | Little Saint Nick - 1991 Remix | 7,301,381 | The Beach Boys | 1251372.0 | 63Lk6VuXdj7S58R3wLdv9r | [] | ... | 0.602 | 0.553 | -9.336 | 0.0328 | 0.1080 | 0.0512 | 130.594 | 118840.0 | 0.971 | B
1096 | 1097 | 129 | 4 | 2020-06-05--2020-06-12 | Na Raba Toma Tapão | 4,396,629 | Niack | 352402.0 | 0AGS6ZRgzobrazmCi6pYMe | ['funk carioca'] | ... | 0.962 | 0.787 | 1.509 | 0.0554 | 0.6660 | 0.1760 | 130.003 | 165231.0 | 0.968 | D#/Eb
1230 | 1231 | 102 | 1 | 2020-04-17--2020-04-24 | JUMP (feat. YoungBoy Never Broke Again) | 6,033,348 | DaBaby | 7601122.0 | 0oT9ElXYSxvnOOagP9efDq | ['north carolina hip hop', 'rap'] | ... | 0.896 | 0.720 | -6.262 | 0.3550 | 0.1690 | 0.2520 | 140.100 | 212093.0 | 0.966 | C
463 | 464 | 45 | 18 | 2021-01-01--2021-01-08 | BEBÉ | 4,967,348 | Camilo, El Alfa | 10580764.0 | 7D7EH7MGyNHWSkqrszerI1 | ['colombian pop', 'reggaeton colombiano'] | ... | 0.862 | 0.720 | -4.048 | 0.0379 | 0.4870 | 0.0604 | 129.972 | 198707.0 | 0.965 | E
130 | 131 | 131 | 3 | 2021-07-23--2021-07-30 | Aquelas Coisas | 6,012,839 | João Gomes | 409173.0 | 0FqVtQxRD3HsPltldG5v5M | [] | ... | 0.682 | 0.873 | -4.163 | 0.0449 | 0.4020 | 0.0946 | 150.006 | 147072.0 | 0.964 | F#/Gb
1390 | 1391 | 154 | 4 | 2020-02-21--2020-02-28 | Tudo Ok | 5,510,844 | Thiaguinho MT, Mila, JS o Mão de Ouro | 28017.0 | 4HUZBG98TYbxSR9V1V2DWS | ['brega funk', 'funk carioca'] | ... | 0.814 | 0.755 | -6.164 | 0.0942 | 0.2390 | 0.3060 | 79.976 | 178500.0 | 0.963 | B
512 | 513 | 24 | 25 | 2020-11-06--2020-11-13 | Se Te Nota (with Guaynaa) | 5,168,240 | Lele Pons | 786461.0 | 11EnQRgRMJwMAesfkB5pnu | ['latin pop', 'viral pop'] | ... | 0.905 | 0.686 | -3.152 | 0.0664 | 0.0907 | 0.2660 | 103.013 | 155825.0 | 0.963 | C
627 | 628 | 14 | 6 | 2020-12-18--2020-12-25 | Feliz Navidad | 11,664,490 | José Feliciano | 239129.0 | 0oPdaY4dXtc3ZsaG17V972 | ['latin pop', 'puerto rican pop'] | ... | 0.513 | 0.831 | -9.004 | 0.0383 | 0.5500 | 0.3360 | 148.837 | 182067.0 | 0.963 | D

10 rows × 23 columns
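
To check that the two approaches really produce the same sub-DataFrame, here is a minimal sketch on a made-up five-row DataFrame (the name songs and its values are assumptions for illustration). The values are all distinct, so ties do not affect the comparison.

```python
import pandas as pd

# Hypothetical miniature version of the Spotify DataFrame.
songs = pd.DataFrame({
    "Song Name": ["a", "b", "c", "d", "e"],
    "Valence": [0.30, 0.95, 0.10, 0.80, 0.60],
})

# Approach 1: sort the Series, take the top labels, then look them up with loc.
labels = songs["Valence"].sort_values(ascending=False)[:3].index
via_series = songs.loc[labels]

# Approach 2: sort the whole DataFrame by the "Valence" column, then slice.
via_frame = songs.sort_values("Valence", ascending=False)[:3]

print(via_series.equals(via_frame))
print(list(via_frame["Song Name"]))
```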

df.columns
Index(['Index', 'Highest Charting Position', 'Number of Times Charted',
       'Week of Highest Charting', 'Song Name', 'Streams', 'Artist',
       'Artist Followers', 'Song ID', 'Genre', 'Release Date', 'Weeks Charted',
       'Popularity', 'Danceability', 'Energy', 'Loudness', 'Speechiness',
       'Acousticness', 'Liveness', 'Tempo', 'Duration (ms)', 'Valence',
       'Chord'],
      dtype='object')

Including the month

Add a column to df with the name “Month” which contains an integer from 1 to 12 indicating the month in which the “Week of Highest Charting” begins. Use the following strategy.

  • Drop all rows containing missing values.

  • Using map and a lambda function, get the first 10 characters in each string from the “Week of Highest Charting” column.

  • Use .dt.month and pd.to_datetime to find the numeric month.

(There are probably easier ways, since we already have the numeric month in the original string, but this method generalizes nicely if we want something like the month name instead of the month number.)

# Alternate version: df = df.dropna()
df.dropna(inplace=True)
df.shape
(1545, 23)
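
The two versions of dropna mentioned in the comment above can be compared on a small made-up DataFrame. This is a minimal sketch; the name raw and its contents are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame where only the first row has no missing values.
raw = pd.DataFrame({"Energy": [0.8, np.nan, 0.7], "Chord": ["A", "B", None]})

# Version 1: dropna returns a new DataFrame and leaves raw unchanged.
cleaned = raw.dropna()

# Version 2: with inplace=True, the DataFrame itself is modified.
in_place = raw.copy()
in_place.dropna(inplace=True)

print(raw.shape, cleaned.shape, in_place.shape)
```

Either way, the surviving rows keep their original labels, which is why the Series displayed below still ends at label 1555 even though its length is 1545.
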
df["Week of Highest Charting"]
0       2021-07-23--2021-07-30
1       2021-07-23--2021-07-30
2       2021-06-25--2021-07-02
3       2021-07-02--2021-07-09
4       2021-07-23--2021-07-30
                 ...          
1551    2019-12-27--2020-01-03
1552    2019-12-27--2020-01-03
1553    2019-12-27--2020-01-03
1554    2019-12-27--2020-01-03
1555    2019-12-27--2020-01-03
Name: Week of Highest Charting, Length: 1545, dtype: object

Often, if we want to do something to each entry in a pandas Series, we can just perform that operation on the whole Series, and it will automatically be applied elementwise. For example, if s is a pandas Series, then s+2 will add 2 to each of the entries in s.

That does not work for slicing though. For example, in the following, we are keeping only the first 10 rows, rather than keeping the first 10 characters in each entry.

df["Week of Highest Charting"][:10]
0    2021-07-23--2021-07-30
1    2021-07-23--2021-07-30
2    2021-06-25--2021-07-02
3    2021-07-02--2021-07-09
4    2021-07-23--2021-07-30
5    2021-05-07--2021-05-14
6    2021-05-14--2021-05-21
7    2021-06-18--2021-06-25
8    2021-06-18--2021-06-25
9    2021-07-02--2021-07-09
Name: Week of Highest Charting, dtype: object

One natural way to get the first 10 characters of each entry is to use map together with a lambda function.

df["Week of Highest Charting"].map(lambda x: x[:10])
0       2021-07-23
1       2021-07-23
2       2021-06-25
3       2021-07-02
4       2021-07-23
           ...    
1551    2019-12-27
1552    2019-12-27
1553    2019-12-27
1554    2019-12-27
1555    2019-12-27
Name: Week of Highest Charting, Length: 1545, dtype: object
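
Here is a minimal sketch contrasting the two kinds of slicing on a made-up three-entry Series (the name weeks is an assumption). The last line uses the pandas str accessor, which was not used in class; it is shown only as one possible alternative to map.

```python
import pandas as pd

# Hypothetical entries in the same format as "Week of Highest Charting".
weeks = pd.Series([
    "2021-07-23--2021-07-30",
    "2021-06-25--2021-07-02",
    "2019-12-27--2020-01-03",
])

first_two_rows = weeks[:2]                      # slices the Series: keeps 2 rows
first_ten_chars = weeks.map(lambda x: x[:10])   # slices each string: keeps 10 characters
via_str = weeks.str[:10]                        # alternative using the str accessor

print(list(first_ten_chars))
```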

Reminder on lambda functions. lambda provides a quick way to define a function. For example, here we define the squaring function.

f = lambda x: x**2
f(3)
9

Here is an equivalent way to get the first 10 characters in each entry, but it’s definitely less elegant.

# less elegant
def first_ten(x):
    return x[:10]
df["Week of Highest Charting"].map(first_ten)
0       2021-07-23
1       2021-07-23
2       2021-06-25
3       2021-07-02
4       2021-07-23
           ...    
1551    2019-12-27
1552    2019-12-27
1553    2019-12-27
1554    2019-12-27
1555    2019-12-27
Name: Week of Highest Charting, Length: 1545, dtype: object
temp_series = df["Week of Highest Charting"].map(lambda x: x[:10])
temp_series
0       2021-07-23
1       2021-07-23
2       2021-06-25
3       2021-07-02
4       2021-07-23
           ...    
1551    2019-12-27
1552    2019-12-27
1553    2019-12-27
1554    2019-12-27
1555    2019-12-27
Name: Week of Highest Charting, Length: 1545, dtype: object

This temp_series is the sort of pandas Series which can be converted into the datetime dtype. Notice how the dtype of the previous Series was object (where you should think “string”), whereas in the following it is datetime64[ns].

pd.to_datetime(temp_series)
0      2021-07-23
1      2021-07-23
2      2021-06-25
3      2021-07-02
4      2021-07-23
          ...    
1551   2019-12-27
1552   2019-12-27
1553   2019-12-27
1554   2019-12-27
1555   2019-12-27
Name: Week of Highest Charting, Length: 1545, dtype: datetime64[ns]
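
The same conversion can be tried on a short made-up Series of date strings (the names below are assumptions for illustration). The exact dtype names printed may vary between pandas versions, so this sketch checks the dtype with a helper function from pd.api.types instead of comparing strings.

```python
import pandas as pd

# Hypothetical date strings in year-month-day format.
dates_as_text = pd.Series(["2021-07-23", "2021-06-25", "2019-12-27"])

dates = pd.to_datetime(dates_as_text)

print(dates_as_text.dtype)   # a string-like dtype
print(dates.dtype)           # a datetime dtype
print(pd.api.types.is_datetime64_any_dtype(dates))
```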

Once the Series has a datetime dtype, we can apply all sorts of useful methods and attributes through the dt accessor, such as month_name here.

pd.to_datetime(temp_series).dt.month_name()
0           July
1           July
2           June
3           July
4           July
          ...   
1551    December
1552    December
1553    December
1554    December
1555    December
Name: Week of Highest Charting, Length: 1545, dtype: object
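
As a minimal sketch of the dt accessor on made-up dates (the Series below is an assumption for illustration), note that month is an attribute, written without parentheses, while month_name is a method, written with parentheses.

```python
import pandas as pd

# Hypothetical dates, already converted to a datetime dtype.
dates = pd.to_datetime(pd.Series(["2021-07-23", "2021-06-25", "2019-12-27"]))

month_numbers = dates.dt.month        # attribute: integers from 1 to 12
month_names = dates.dt.month_name()   # method: English month names by default

print(list(month_numbers))
print(list(month_names))
```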

Here we add a new column to the DataFrame containing the numerical month.

df["Month"] = pd.to_datetime(temp_series).dt.month
df
 | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord | Month
0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48,633,449 | Måneskin | 3377762.0 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.800 | -4.808 | 0.0504 | 0.12700 | 0.3590 | 134.002 | 211560.0 | 0.589 | B | 7
1 | 2 | 2 | 3 | 2021-07-23--2021-07-30 | STAY (with Justin Bieber) | 47,248,719 | The Kid LAROI | 2230022.0 | 5HCyWlXZPP0y6Gqq8TgA20 | ['australian hip hop'] | ... | 0.764 | -5.484 | 0.0483 | 0.03830 | 0.1030 | 169.928 | 141806.0 | 0.478 | C#/Db | 7
2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40,162,559 | Olivia Rodrigo | 6266514.0 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.664 | -5.044 | 0.1540 | 0.33500 | 0.0849 | 166.928 | 178147.0 | 0.688 | A | 6
3 | 4 | 3 | 5 | 2021-07-02--2021-07-09 | Bad Habits | 37,799,456 | Ed Sheeran | 83293380.0 | 6PQ88X9TkUIAUIZJHW2upE | ['pop', 'uk pop'] | ... | 0.897 | -3.712 | 0.0348 | 0.04690 | 0.3640 | 126.026 | 231041.0 | 0.591 | B | 7
4 | 5 | 5 | 1 | 2021-07-23--2021-07-30 | INDUSTRY BABY (feat. Jack Harlow) | 33,948,454 | Lil Nas X | 5473565.0 | 27NovPIUIRrOZoCHxABJwK | ['lgbtq+ hip hop', 'pop rap'] | ... | 0.704 | -7.409 | 0.0615 | 0.02030 | 0.0501 | 149.995 | 212000.0 | 0.894 | D#/Eb | 7
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
1551 | 1552 | 195 | 1 | 2019-12-27--2020-01-03 | New Rules | 4,630,675 | Dua Lipa | 27167675.0 | 2ekn2ttSfGqwhhate0LSR0 | ['dance pop', 'pop', 'uk pop'] | ... | 0.700 | -6.021 | 0.0694 | 0.00261 | 0.1530 | 116.073 | 209320.0 | 0.608 | A | 12
1552 | 1553 | 196 | 1 | 2019-12-27--2020-01-03 | Cheirosa - Ao Vivo | 4,623,030 | Jorge & Mateus | 15019109.0 | 2PWjKmjyTZeDpmOUa3a5da | ['sertanejo', 'sertanejo universitario'] | ... | 0.870 | -3.123 | 0.0851 | 0.24000 | 0.3330 | 152.370 | 181930.0 | 0.714 | B | 12
1553 | 1554 | 197 | 1 | 2019-12-27--2020-01-03 | Havana (feat. Young Thug) | 4,620,876 | Camila Cabello | 22698747.0 | 1rfofaqEpACxVEHIZBJe6W | ['dance pop', 'electropop', 'pop', 'post-teen ... | ... | 0.523 | -4.333 | 0.0300 | 0.18400 | 0.1320 | 104.988 | 217307.0 | 0.394 | D | 12
1554 | 1555 | 198 | 1 | 2019-12-27--2020-01-03 | Surtada - Remix Brega Funk | 4,607,385 | Dadá Boladão, Tati Zaqui, OIK | 208630.0 | 5F8ffc8KWKNawllr5WsW0r | ['brega funk', 'funk carioca'] | ... | 0.550 | -7.026 | 0.0587 | 0.24900 | 0.1820 | 154.064 | 152784.0 | 0.881 | F | 12
1555 | 1556 | 199 | 1 | 2019-12-27--2020-01-03 | Lover (Remix) [feat. Shawn Mendes] | 4,595,450 | Taylor Swift | 42227614.0 | 3i9UVldZOE0aD0JnyfAZZ0 | ['pop', 'post-teen pop'] | ... | 0.603 | -7.176 | 0.0640 | 0.43300 | 0.0862 | 205.272 | 221307.0 | 0.422 | G | 12

1545 rows × 24 columns
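
Putting the three steps together, the whole strategy can be sketched on a made-up four-row DataFrame (the name toy and its contents are assumptions for illustration). One reason to drop missing values first is that a missing entry is stored as a float NaN, and slicing a float with x[:10] would raise an error.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature DataFrame with one missing week.
toy = pd.DataFrame({
    "Song Name": ["a", "b", "c", "d"],
    "Week of Highest Charting": [
        "2021-07-23--2021-07-30",
        "2021-06-25--2021-07-02",
        np.nan,
        "2019-12-27--2020-01-03",
    ],
})

# Step 1: drop all rows containing missing values.
toy = toy.dropna()

# Step 2: keep the first 10 characters (the start date) of each string.
start_dates = toy["Week of Highest Charting"].map(lambda x: x[:10])

# Step 3: convert to a datetime dtype and extract the numeric month.
toy["Month"] = pd.to_datetime(start_dates).dt.month

print(toy)
```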