Week 4 Tuesday#
import pandas as pd
df = pd.read_csv("spotify_dataset.csv")
df.head()
Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48,633,449 | MĂĄneskin | 3377762 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.714 | 0.8 | -4.808 | 0.0504 | 0.127 | 0.359 | 134.002 | 211560 | 0.589 | B |
1 | 2 | 2 | 3 | 2021-07-23--2021-07-30 | STAY (with Justin Bieber) | 47,248,719 | The Kid LAROI | 2230022 | 5HCyWlXZPP0y6Gqq8TgA20 | ['australian hip hop'] | ... | 0.591 | 0.764 | -5.484 | 0.0483 | 0.0383 | 0.103 | 169.928 | 141806 | 0.478 | C#/Db |
2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40,162,559 | Olivia Rodrigo | 6266514 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.563 | 0.664 | -5.044 | 0.154 | 0.335 | 0.0849 | 166.928 | 178147 | 0.688 | A |
3 | 4 | 3 | 5 | 2021-07-02--2021-07-09 | Bad Habits | 37,799,456 | Ed Sheeran | 83293380 | 6PQ88X9TkUIAUIZJHW2upE | ['pop', 'uk pop'] | ... | 0.808 | 0.897 | -3.712 | 0.0348 | 0.0469 | 0.364 | 126.026 | 231041 | 0.591 | B |
4 | 5 | 5 | 1 | 2021-07-23--2021-07-30 | INDUSTRY BABY (feat. Jack Harlow) | 33,948,454 | Lil Nas X | 5473565 | 27NovPIUIRrOZoCHxABJwK | ['lgbtq+ hip hop', 'pop rap'] | ... | 0.736 | 0.704 | -7.409 | 0.0615 | 0.0203 | 0.0501 | 149.995 | 212000 | 0.894 | D#/Eb |
5 rows Ă— 23 columns
lambda function examples#
In the “Week of Highest Charting” column values, get the first date, for example, if the value is
"2021-07-23--2021-07-30"
, you want to produce"2021-07-23"
. Put this in a new column called “Date”. Usemap
and a lambda function.
s = "2021-07-23--2021-07-30"
s.split("-")
['2021', '07', '23', '', '2021', '07', '30']
s.split("--")
['2021-07-23', '2021-07-30']
s.split("--")[0]
'2021-07-23'
df["Date"] = df["Week of Highest Charting"].map(lambda s: s.split("--")[0])
df["Date"]
0 2021-07-23
1 2021-07-23
2 2021-06-25
3 2021-07-02
4 2021-07-23
...
1551 2019-12-27
1552 2019-12-27
1553 2019-12-27
1554 2019-12-27
1555 2019-12-27
Name: Date, Length: 1556, dtype: object
Convert the “Date” column to a datetime data type using the pandas function
to_datetime
.
df["Date"] = pd.to_datetime(df["Date"])
Which song has the most genres listed?
Notice: the “Genre” entries” are currently strings and not lists. Use the literal_eval
function from the ast
module. Some values don’t work, so only apply this to strings starting with "["
, and use not-a-number from NumPy for the other values.
df.loc[0, "Genre"]
"['indie rock italiano', 'italian pop']"
df.loc[0, "Genre"][0]
'['
type(df.loc[0, "Genre"])
str
from ast import literal_eval # safer than `eval`
literal_eval(df.loc[0, "Genre"])
['indie rock italiano', 'italian pop']
type(literal_eval(df.loc[0, "Genre"]))
list
help(literal_eval)
Help on function literal_eval in module ast:
literal_eval(node_or_string)
Safely evaluate an expression node or a string containing a Python
expression. The string or node provided may only consist of the following
Python literal structures: strings, bytes, numbers, tuples, lists, dicts,
sets, booleans, and None.
df["Genre"].map(literal_eval)
Traceback (most recent call last):
File /shared-libs/python3.9/py-core/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3378 in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
Cell In [16], line 1
df["Genre"].map(literal_eval)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/series.py:3909 in map
new_values = super()._map_values(arg, na_action=na_action)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/base.py:937 in _map_values
new_values = map_f(values, mapper)
File pandas/_libs/lib.pyx:2467 in pandas._libs.lib.map_infer
File /usr/local/lib/python3.9/ast.py:62 in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File /usr/local/lib/python3.9/ast.py:50 in parse
return compile(source, filename, mode, flags,
File <unknown>:1
^
SyntaxError: unexpected EOF while parsing
import numpy as np
df["Genre"].map(lambda x: literal_eval(x) if x.startswith("[") else np.nan)
0 [indie rock italiano, italian pop]
1 [australian hip hop]
2 [pop]
3 [pop, uk pop]
4 [lgbtq+ hip hop, pop rap]
...
1551 [dance pop, pop, uk pop]
1552 [sertanejo, sertanejo universitario]
1553 [dance pop, electropop, pop, post-teen pop]
1554 [brega funk, funk carioca]
1555 [pop, post-teen pop]
Name: Genre, Length: 1556, dtype: object
df["Genre Length"] = df["Genre"].map(lambda x: len(literal_eval(x)) if x.startswith("[") else np.nan)
df["Genre"]
0 ['indie rock italiano', 'italian pop']
1 ['australian hip hop']
2 ['pop']
3 ['pop', 'uk pop']
4 ['lgbtq+ hip hop', 'pop rap']
...
1551 ['dance pop', 'pop', 'uk pop']
1552 ['sertanejo', 'sertanejo universitario']
1553 ['dance pop', 'electropop', 'pop', 'post-teen ...
1554 ['brega funk', 'funk carioca']
1555 ['pop', 'post-teen pop']
Name: Genre, Length: 1556, dtype: object
df["Genre Length"]
0 2.0
1 1.0
2 1.0
3 2.0
4 2.0
...
1551 3.0
1552 2.0
1553 4.0
1554 2.0
1555 2.0
Name: Genre Length, Length: 1556, dtype: float64
df["Genre Length"].argmax()
44
df.iloc[44]
Index 45
Highest Charting Position 9
Number of Times Charted 39
Week of Highest Charting 2021-02-26--2021-03-05
Song Name The Business
Streams 10,739,770
Artist Tiësto
Artist Followers 5785065
Song ID 6f3Slt0GbA2bPZlz0aIFXN
Genre ['big room', 'brostep', 'dance pop', 'dutch ed...
Release Date 2020-09-16
Weeks Charted 2021-07-23--2021-07-30\n2021-07-16--2021-07-23...
Popularity 90
Danceability 0.798
Energy 0.62
Loudness -7.079
Speechiness 0.232
Acousticness 0.414
Liveness 0.112
Tempo 120.031
Duration (ms) 164000
Valence 0.235
Chord G#/Ab
Date 2021-02-26 00:00:00
Genre Lengths 11.0
Genre Length 11.0
Name: 44, dtype: object
df.iloc[44]["Genre"]
"['big room', 'brostep', 'dance pop', 'dutch edm', 'edm', 'electro house', 'house', 'pop dance', 'slap house', 'trance', 'tropical house']"
df.iloc[44]["Genre Lengths"]
11.0
Practice with list comprehension#
Using list comprehension, make a list of all the row labels for which the “Date” value is in April.
# Check: no repeated labels
df.index.is_unique
True
# approach 1
date_list = [ind for ind in df.index if df.loc[ind, "Date"].month == 4]
date_list[:5]
[13, 17, 29, 30, 36]
# approach 2
date_list2 = [ind for ind, mnth in df["Date"].dt.month.items() if mnth == 4]
date_list2[:5]
[13, 17, 29, 30, 36]
# non-list comprehension approach
date_index = df[df["Date"].dt.month == 4].index
date_index[:5]
Int64Index([13, 17, 29, 30, 36], dtype='int64')
What are all the genres occuring in the “Genre” column? Remove duplicates by using set comprehension instead of list comprehension (just replace the square brackets with curly brackets). You will have to remove missing values.
df["Genre Lists"] = df["Genre"].map(lambda x: literal_eval(x) if x.startswith("[") else np.nan)
genre_series = df["Genre Lists"][~df["Genre Lists"].isna()]
{genre for genre_list in genre_series for genre in genre_list}
{'a cappella',
'acoustic pop',
'adult standards',
'afrofuturism',
'afroswing',
'albanian hip hop',
'album rock',
'alt z',
'alternative metal',
'alternative pop rock',
'alternative r&b',
'american folk revival',
'argentine hip hop',
'art pop',
'art rock',
'atl hip hop',
'atl trap',
'aussietronica',
'australian dance',
'australian hip hop',
'australian pop',
'australian psych',
'australian rock',
'banda',
'basshall',
'beatlesque',
'bedroom pop',
'bedroom soul',
'belgian hip hop',
'big room',
'boston hip hop',
'boy band',
'brazilian hip hop',
'brega funk',
'brill building pop',
'british soul',
'britpop',
'brooklyn drill',
'brostep',
'bubblegrunge',
'cali rap',
'canadian contemporary r&b',
'canadian hip hop',
'canadian latin',
'canadian pop',
'canadian trap',
'celtic',
'celtic punk',
'celtic rock',
'champeta',
'chicago drill',
'chicago indie',
'chicago rap',
'chicago soul',
'chill r&b',
'christlicher rap',
'classic rock',
'classic soul',
'classic uk pop',
'colombian pop',
'comic',
'conscious hip hop',
'contemporary country',
'corrido',
'country',
'country road',
'country rock',
'cubaton',
'cumbia pop',
'dance pop',
'dance rock',
'deep euro house',
'deep german hip hop',
'deep house',
'deep underground hip hop',
'dembow',
'detroit hip hop',
'dfw rap',
'disco',
'dmv rap',
'dominican pop',
'dream smp',
'dreamo',
'drill',
'dutch edm',
'dutch pop',
'east coast hip hop',
'easy listening',
'eau claire indie',
'edm',
'electro house',
'electro latino',
'electronic trap',
'electropop',
'emo rap',
'escape room',
'etherpop',
'eurodance',
'europop',
'eurovision',
'florida rap',
'folk punk',
'folk rock',
'folk-pop',
'folktronica',
'forro',
'francoton',
'frauenrap',
'french hip hop',
'funk',
'funk 150 bpm',
'funk bh',
'funk carioca',
'funk ostentacao',
'funk paulista',
'funk pop',
'gangster rap',
'garage rock',
'gauze pop',
'german alternative rap',
'german cloud rap',
'german dance',
'german drill',
'german hip hop',
'german pop',
'german techno',
'german trance',
'german trap',
'german underground rap',
'girl group',
'glam rock',
'grime',
'grunge',
'hamburg hip hop',
'hard rock',
'hardcore hip hop',
'heartland rock',
'hip hop',
'hip pop',
'hollywood',
'house',
'houston rap',
'icelandic pop',
'indie cafe pop',
'indie pop',
'indie pop rap',
'indie poptimism',
'indie r&b',
'indie rock italiano',
'indie rockism',
'indie surf',
'indietronica',
'indonesian pop',
'irish folk',
'irish singer-songwriter',
'italian adult pop',
'italian hip hop',
'italian indie pop',
'italian pop',
'italian pop rock',
'jawaiian',
'jazz funk',
'jazz pop',
'k-pop',
'k-pop boy group',
'k-pop girl group',
'k-rap',
'kentucky hip hop',
'latin',
'latin hip hop',
'latin pop',
'latin viral pop',
'lgbtq+ hip hop',
'london rap',
'lounge',
'madchester',
'mariachi',
'melanesian pop',
'mellow gold',
'melodic metalcore',
'melodic rap',
'meme rap',
'memphis hip hop',
'metalcore',
'metropopolis',
'mexican hip hop',
'miami hip hop',
'minnesota hip hop',
'modern alternative rock',
'modern country rock',
'modern indie pop',
'modern rock',
'moombahton',
'motown',
'musical advocacy',
'neo mellow',
'neo soul',
'neo-psychedelic',
'new french touch',
'new orleans rap',
'new romantic',
'new wave',
'new wave pop',
'norteno',
'north carolina hip hop',
'norwegian pop',
'nouvelle chanson francaise',
'nu metal',
'nuevo regional mexicano',
'nyc pop',
'nyc rap',
'nz pop',
'oakland hip hop',
'ohio hip hop',
'old school rap francais',
'oulu metal',
'pacific islands pop',
'pagode baiano',
'panamanian pop',
'permanent wave',
'perreo',
'philly rap',
'piano rock',
'piseiro',
'pittsburgh rap',
'plugg',
'png pop',
'pop',
'pop argentino',
'pop dance',
'pop edm',
'pop house',
'pop nacional',
'pop r&b',
'pop rap',
'pop reggaeton',
'pop rock',
'pop soul',
'pop urbaine',
'pop venezolano',
'post-grunge',
'post-teen pop',
'progressive electro house',
'progressive house',
'puerto rican pop',
'punk',
'queens hip hop',
'quiet storm',
'r&b',
'r&b brasileiro',
'r&b en espanol',
'ranchera',
'rap',
'rap belge',
'rap cearense',
'rap conciencia',
'rap conscient',
'rap dominicano',
'rap francais',
'rap latina',
'rap marseille',
'rap metal',
'rave funk',
'reggaeton',
'reggaeton colombiano',
'reggaeton flow',
'regional mexican',
'rhode island rap',
'rock',
'rock-and-roll',
'rockabilly',
'sad rap',
'scandipop',
'seattle hip hop',
'sertanejo',
'sertanejo pop',
'sertanejo universitario',
'sheffield indie',
'shiver pop',
'show tunes',
'singer-songwriter',
'slap house',
'social media pop',
'soft rock',
'soul',
'soundtrack',
'south african house',
'southern hip hop',
'southern soul',
'sudanese pop',
'sunnlensk tonlist',
'surf punk',
'swedish pop',
'swing',
'synthpop',
'talent show',
'tekk',
'tennessee hip hop',
'torch song',
'toronto rap',
'trance',
'trancecore',
'trap',
'trap argentino',
'trap boricua',
'trap brasileiro',
'trap chileno',
'trap italiana',
'trap latino',
'trap queen',
'trap soul',
'trap triste',
'tropical house',
'turkish trap',
'uk alternative hip hop',
'uk alternative pop',
'uk dance',
'uk funky',
'uk hip hop',
'uk metalcore',
'uk pop',
'underground hip hop',
'urban contemporary',
'urbano espanol',
'vancouver indie',
'vapor trap',
'vegas indie',
'venezuelan hip hop',
'viral pop',
'viral rap',
'vocal jazz',
'weirdcore',
'west coast rap',
'yacht rock'}
Practice with axis
#
Reminder:
The
axis
keyword argument indicates whichaxis
labels are (potentially) changing. If the row labels could change, then we useaxis=0
. If the column labels could change, then we useaxis=1
.
df1 = pd.DataFrame({"A": [2, 3, 5], "B": [7, 13, -11]})
df2 = pd.DataFrame({"A": [-1, 0, 5], "B": [2, 2, 2]})
df1
A | B | |
---|---|---|
0 | 2 | 7 |
1 | 3 | 13 |
2 | 5 | -11 |
Can you predict what the following will do?
df1.sum(axis=0)
df1.sum(axis=0)
A 10
B 9
dtype: int64
df1.min(axis=1)
df1.min(axis=1)
0 2
1 3
2 -11
dtype: int64
pd.concat([df1, df2], axis=0)
pd.concat([df1, df2], axis=0)
A | B | |
---|---|---|
0 | 2 | 7 |
1 | 3 | 13 |
2 | 5 | -11 |
0 | -1 | 2 |
1 | 0 | 2 |
2 | 5 | 2 |
pd.concat([df1, df2], axis=1)
pd.concat([df1, df2], axis=1)
A | B | A | B | |
---|---|---|---|---|
0 | 2 | 7 | -1 | 2 |
1 | 3 | 13 | 0 | 2 |
2 | 5 | -11 | 5 | 2 |
Define df3 = pd.concat([df1, df2], axis=0)
. What are the following?
df3.apply(lambda ser: ser.idxmin(), axis=0)
df3.apply(lambda ser: ser.argmin(), axis=0)
(I was surprised thatdf3.argmin(axis=0)
didn’t work as an abbreviation for this last one.)
df3 = pd.concat([df1, df2], axis=0)
df3
A | B | |
---|---|---|
0 | 2 | 7 |
1 | 3 | 13 |
2 | 5 | -11 |
0 | -1 | 2 |
1 | 0 | 2 |
2 | 5 | 2 |
df3.apply(lambda ser: ser.idxmin(), axis=0)
A 0
B 2
dtype: int64
df3.apply(lambda ser: ser.argmin(), axis=0)
A 3
B 2
dtype: int64
pd.concat([df1, df2], axis=0).idxmin(axis=0)
A 0
B 2
dtype: int64
pd.concat([df1, df2], axis=1)
A | B | A | B | |
---|---|---|---|---|
0 | 2 | 7 | -1 | 2 |
1 | 3 | 13 | 0 | 2 |
2 | 5 | -11 | 5 | 2 |