Week 4, Tuesday Discussion

Today:

  • Go through practice midterm solutions

Reminders:

  • Homework #3 due tonight

  • Midterm Thursday during discussion

    • I have extra notecards, if you lost yours or did not get one

Question 1

(a) What is an advantage of a Python tuple in comparison to a Python list?

  • A: A tuple can be an element of a set (provided its own elements are hashable), while a list cannot.

  • Also, a list can be changed after we create it, while a tuple cannot be changed after it is made (e.g. list has .append() while tuple does not). This immutability is what makes a tuple hashable.

(b) What is an advantage of np.arange in comparison to range?

  • Entries of np.arange do not have to be integers, while range can only have integers

  • np.arange returns a NumPy array, which has many more methods/functions available (e.g. taking reciprocals), and its operations are “vectorized” (which makes computation much faster)

(c) What will be the result of the following code?

a = range(1,20,4)
b = [2,3,2]
[f"Train{a[i]}{i}" for i in b]
['Train92', 'Train133', 'Train92']
#Notice how this range still only goes up to 17
a2 = range(1,21,4)
list(a2)
[1, 5, 9, 13, 17]
r = range(10)
my_set = {r}
my_list = [r]
type(my_set)
type(my_list[0])
range
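
The answers to parts (a) and (b) can be checked directly. The following is a small sketch that is not part of the original solutions; the variable names (points, arr, recip) are only illustrative.

```python
import numpy as np

# (a) A tuple of hashable elements is itself hashable, so it can be a set element.
points = {(1, 2), (3, 4)}
points.add((1, 2))          # duplicate, so the set is unchanged

# A list is mutable and therefore unhashable: putting it in a set raises TypeError.
try:
    bad = {[1, 2]}
    list_in_set_ok = True
except TypeError:
    list_in_set_ok = False

# (b) np.arange accepts a non-integer step and supports vectorized arithmetic.
arr = np.arange(1, 3, 0.5)  # array([1. , 1.5, 2. , 2.5])
recip = 1 / arr             # elementwise reciprocal, no loop needed

# range only accepts integers, so the same arguments raise TypeError.
try:
    range(1, 3, 0.5)
    float_range_ok = True
except TypeError:
    float_range_ok = False

print(len(points), list_in_set_ok, float_range_ok)
```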

(d) What is an example of a similarity between a pandas Series and a Python dictionary?

  • Elements of both are accessed using square brackets [] and calling via labels/keys (some examples below, after part (e))

(e) How can the following error be corrected?

import pandas as pd
df = pd.read_csv("../data/spotify_dataset.csv")
df = df[~df.Energy.isin([" "])] #drop the rows where Energy is the blank string " "
type(df.loc[0,"Energy"])
str
df.Energy = pd.to_numeric(df.Energy) #with the correction
type(df.loc[0,"Energy"])
numpy.float64
df.Energy.mean()
0.633495145631068
type(df["Energy"])
pandas.core.series.Series
df["Energy"][0]
0.8
test = {"Energy":5}
test["Energy"]
5
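
Because the cells above depend on the course CSV file, here is a self-contained sketch of parts (d) and (e). The data is made up for illustration (it is not the Spotify dataset), and the names ser and dct are hypothetical.

```python
import pandas as pd

# Made-up stand-in for the Spotify data: Energy is stored as strings,
# and one entry is a blank " " that cannot be parsed as a number.
df = pd.DataFrame({"Energy": ["0.8", "0.6", " ", "0.7"]})

# (e) Drop the blank rows, as in the notes, then convert the strings to floats.
df = df[~df["Energy"].isin([" "])].copy()
df["Energy"] = pd.to_numeric(df["Energy"])
energy_mean = df["Energy"].mean()

# (d) A Series and a dict are both indexed with square brackets and a label/key.
ser = pd.Series({"Energy": 5, "Tempo": 120})
dct = {"Energy": 5, "Tempo": 120}
same_value = ser["Energy"] == dct["Energy"]

print(df["Energy"].dtype, energy_mean, same_value)
```

Note that filtering keeps the original row labels, so the remaining rows are labeled 0, 1 and 3 rather than 0, 1 and 2.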

Question 2

Assume df is a pandas DataFrame, and that its “day” column has strings representing dates. Write code to extract the sub-DataFrame which contains only those rows corresponding to Tuesday or Wednesday.

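#Our first goal is to convert the date to a day of the week. I will store this information in a new column.
#(Here the "Release Date" column of the Spotify dataset plays the role of the "day" column from the question.)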
df["weekday"] = pd.to_datetime(df["Release Date"]).dt.day_name()

#The second step is to create a boolean series that returns True if df["weekday"] is Tuesday or Wednesday
bool_ser = df["weekday"].isin(["Tuesday", "Wednesday"])

#Get the sub-DataFrame
df[bool_ser]
| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord | weekday |
| 5 | 6 | 1 | 18 | 2021-05-07--2021-05-14 | MONTERO (Call Me By Your Name) | 30,071,134 | Lil Nas X | 5473565 | 67BtfxlNbhBmCDR2L2l8qd | ['lgbtq+ hip hop', 'pop rap'] | ... | 0.508 | -6.682 | 0.152 | 0.297 | 0.384 | 178.818 | 137876 | 0.758 | G#/Ab | Wednesday |
| 29 | 30 | 3 | 28 | 2021-04-02--2021-04-09 | Astronaut In The Ocean | 14,174,752 | Masked Wolf | 365975 | 3Ofmpyhv5UAQ70mENzB277 | ['australian hip hop'] | ... | 0.695 | -6.865 | 0.0913 | 0.175 | 0.15 | 149.996 | 132780 | 0.472 | E | Wednesday |
| 44 | 45 | 9 | 39 | 2021-02-26--2021-03-05 | The Business | 10,739,770 | Tiësto | 5785065 | 6f3Slt0GbA2bPZlz0aIFXN | ['big room', 'brostep', 'dance pop', 'dutch ed... | ... | 0.620 | -7.079 | 0.232 | 0.414 | 0.112 | 120.031 | 164000 | 0.235 | G#/Ab | Wednesday |
| 48 | 49 | 29 | 6 | 2021-06-25--2021-07-02 | Fiel - Remix | 10,032,746 | Wisin, Jhay Cortez, Anuel AA, Los Legendarios,... | 6929075 | 43qcs9NpJhDxtG91zxFkj7 | ['latin', 'latin hip hop', 'reggaeton', 'trap ... | ... | 0.711 | -4.733 | 0.0473 | 0.398 | 0.118 | 97.99 | 349547 | 0.573 | F#/Gb | Tuesday |
| 50 | 51 | 19 | 4 | 2021-07-02--2021-07-09 | Nicky Jam: Bzrp Music Sessions, Vol. 41 | 9,799,701 | Bizarrap, Nicky Jam | 3126961 | 03LfOYi0icz4souspZVVhq | ['argentine hip hop', 'pop venezolano', 'trap ... | ... | 0.849 | -3.167 | 0.153 | 0.0913 | 0.145 | 89.907 | 158087 | 0.818 | C#/Db | Wednesday |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1457 | 1458 | 136 | 1 | 2020-01-31--2020-02-07 | Mon Ami | 5,774,770 | Samra | 1045091 | 1R4xkZXQUQ8QJtAdwHkSgC | ['german hip hop'] | ... | 0.731 | -3.723 | 0.383 | 0.188 | 0.111 | 105.666 | 138409 | 0.537 | G#/Ab | Wednesday |
| 1464 | 1465 | 106 | 6 | 2020-01-03--2020-01-10 | Liar | 4,896,939 | Camila Cabello | 22698747 | 7LzouaWGFCy4tkXDOOnEyM | ['dance pop', 'electropop', 'pop', 'post-teen ... | ... | 0.498 | -6.684 | 0.0456 | 0.0169 | 0.319 | 98.016 | 207039 | 0.652 | B | Wednesday |
| 1547 | 1548 | 156 | 1 | 2019-12-27--2020-01-03 | Combatchy (feat. MC Rebecca) | 5,149,797 | Anitta, Lexa, Luísa Sonza | 10741972 | 2bPtwnrpFNEe8N7Q85kLHw | ['funk carioca', 'funk pop', 'pagode baiano', ... | ... | 0.730 | -3.032 | 0.0809 | 0.383 | 0.0197 | 150.134 | 157600 | 0.605 | C#/Db | Wednesday |
| 1554 | 1555 | 198 | 1 | 2019-12-27--2020-01-03 | Surtada - Remix Brega Funk | 4,607,385 | Dadá Boladão, Tati Zaqui, OIK | 208630 | 5F8ffc8KWKNawllr5WsW0r | ['brega funk', 'funk carioca'] | ... | 0.550 | -7.026 | 0.0587 | 0.249 | 0.182 | 154.064 | 152784 | 0.881 | F | Wednesday |
| 1555 | 1556 | 199 | 1 | 2019-12-27--2020-01-03 | Lover (Remix) [feat. Shawn Mendes] | 4,595,450 | Taylor Swift | 42227614 | 3i9UVldZOE0aD0JnyfAZZ0 | ['pop', 'post-teen pop'] | ... | 0.603 | -7.176 | 0.064 | 0.433 | 0.0862 | 205.272 | 221307 | 0.422 | G | Wednesday |

121 rows × 24 columns

#Another option, if you insist on list comprehension :)
good_rows = [c for c in df.index if (df.loc[c,"weekday"] == "Tuesday") or (df.loc[c,"weekday"] == "Wednesday")]
df.loc[good_rows]
(Output omitted: it is identical to the table produced by the isin approach above.)

121 rows × 24 columns
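
The solution above uses the Spotify file, so here is a self-contained sketch of the same two steps on a tiny made-up DataFrame whose date column is actually named "day", as in the question. The song names and dates are invented for illustration.

```python
import pandas as pd

# Made-up data: the "day" column holds dates as strings.
df = pd.DataFrame({
    "song": ["a", "b", "c", "d", "e"],
    "day": ["2021-05-03", "2021-05-04", "2021-05-05", "2021-05-06", "2021-05-11"],
})

# Step 1: parse the strings as dates and look up the weekday name.
weekday = pd.to_datetime(df["day"]).dt.day_name()

# Step 2: boolean Series that is True on Tuesdays and Wednesdays.
mask = weekday.isin(["Tuesday", "Wednesday"])

# Step 3: boolean indexing keeps only the rows where the mask is True.
sub_df = df[mask]
print(sub_df)
```

Here the weekday names never need to be stored as a column; the boolean Series alone is enough to select the rows.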

Question 3

import numpy as np 
rng = np.random.default_rng()
A = rng.integers(0,6,size = (10,10))
#Do not worry about random numbers! You will not be tested on this.
df = pd.DataFrame(A)

#df.loc is label-based!!
df2 = df.loc[5:,3:].copy()
df.head(3)
   0  1  2  3  4  5  6  7  8  9
0  4  5  2  4  4  3  3  5  1  3
1  1  0  4  3  2  2  2  4  5  1
2  3  0  4  4  4  2  4  4  2  0

Will there be a difference between df2.iloc[:,4:].shape and df2.loc[:,4:].shape?

df2.iloc[:,4:] is saying take all of the rows of df2, and take the column in integer position 4 and everything after

df2.loc[:,4:] is saying take all of the rows of df2, and take the column labeled 4 and everything after

df2.iloc[:,4:].shape
(5, 3)
df2.loc[:,4:].shape
(5, 6)

Also keep in mind! df.index returns the row labels, not integer positions (for df2, the labels are 5 through 9)
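
Since the matrix above is random, the same comparison can be reproduced with a deterministic stand-in. This is only a sketch: the choice of np.arange(100) as data and the names by_position and by_label are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random matrix: entry (i, j) is 10*i + j.
df = pd.DataFrame(np.arange(100).reshape(10, 10))

# .loc slices by label and includes both endpoints, so this keeps
# row labels 5..9 and column labels 3..9.
df2 = df.loc[5:, 3:].copy()

by_position = df2.iloc[:, 4:]   # columns at positions 4, 5, 6 -> labels 7, 8, 9
by_label = df2.loc[:, 4:]       # columns labeled 4 through 9

print(by_position.shape, list(by_position.columns))
print(by_label.shape, list(by_label.columns))
print(list(df2.index))          # the labels 5..9, not 0..4
```

The copy keeps the labels of the original DataFrame, which is why position 0 of df2 corresponds to label 3 for columns and label 5 for rows.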

Question 4

df = pd.read_csv("../data/spotify_dataset.csv")

#Find top 10 artists
top_art = df["Artist"].value_counts().index[:10]

df2 = df[df["Artist"].isin(top_art)]
df2
| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
| 8 | 9 | 3 | 8 | 2021-06-18--2021-06-25 | Yonaguni | 25,030,128 | Bad Bunny | 36142273 | 2JPLbjOn0wPCngEot2STUS | ['latin', 'reggaeton', 'trap latino'] | ... | 0.644 | 0.648 | -4.601 | 0.118 | 0.276 | 0.135 | 179.951 | 206710 | 0.44 | C#/Db |
| 12 | 13 | 5 | 3 | 2021-07-09--2021-07-16 | Permission to Dance | 22,062,812 | BTS | 37106176 | 0LThjFY2iTtNdd4wviwVV2 | ['k-pop', 'k-pop boy group'] | ... | 0.702 | 0.741 | -5.33 | 0.0427 | 0.00544 | 0.337 | 124.925 | 187585 | 0.646 | A |
| 13 | 14 | 1 | 19 | 2021-04-02--2021-04-09 | Peaches (feat. Daniel Caesar & Giveon) | 20,294,457 | Justin Bieber | 48504126 | 4iJyoBOLtHqaGxP12qzhQI | ['canadian pop', 'pop', 'post-teen pop'] | ... | 0.677 | 0.696 | -6.181 | 0.119 | 0.321 | 0.42 | 90.03 | 198082 | 0.464 | C |
| 14 | 15 | 2 | 10 | 2021-05-21--2021-05-28 | Butter | 19,985,713 | BTS | 37106176 | 2bgTY4UwhfBYhGT4HUYStN | ['k-pop', 'k-pop boy group'] | ... | 0.759 | 0.459 | -5.187 | 0.0948 | 0.00323 | 0.0906 | 109.997 | 164442 | 0.695 | G#/Ab |
| 17 | 18 | 5 | 14 | 2021-04-23--2021-04-30 | Save Your Tears (with Ariana Grande) (Remix) | 18,053,141 | The Weeknd | 35305637 | 37BZB0z9T8Xu7U3e65qxFy | ['canadian contemporary r&b', 'canadian pop', ... | ... | 0.65 | 0.825 | -4.645 | 0.0325 | 0.0215 | 0.0936 | 118.091 | 191014 | 0.593 | C |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1499 | 1500 | 100 | 1 | 2020-01-17--2020-01-24 | Alfred - Interlude | 8,030,151 | Eminem | 46814751 | 4EmunTy7kNBYQivOa8F6b8 | ['detroit hip hop', 'hip hop', 'rap'] | ... | 0.429 | 0.231 | -20.43 | 0.402 | 0.878 | 0.279 | 74.545 | 30133 | 0.914 | F |
| 1500 | 1501 | 102 | 1 | 2020-01-17--2020-01-24 | Little Engine | 7,913,461 | Eminem | 46814751 | 4qNWEOMyexn7b8Icyk29t9 | ['detroit hip hop', 'hip hop', 'rap'] | ... | 0.769 | 0.811 | -4.162 | 0.228 | 0.0234 | 0.0451 | 155.081 | 177293 | 0.76 | A#/Bb |
| 1501 | 1502 | 113 | 1 | 2020-01-17--2020-01-24 | I Will (feat. KXNG Crooked, Royce Da 5'9" & Jo... | 7,115,414 | Eminem | 46814751 | 3CJbxqRQ0JNCqboWDNUUeX | ['detroit hip hop', 'hip hop', 'rap'] | ... | 0.635 | 0.543 | -5.941 | 0.067 | 0.0454 | 0.272 | 98.743 | 303000 | 0.036 | G#/Ab |
| 1549 | 1550 | 187 | 1 | 2019-12-27--2020-01-03 | Let Me Know (I Wonder Why Freestyle) | 4,701,532 | Juice WRLD | 19102888 | 3wwo0bJvDSorOpNfzEkfXx | ['chicago rap', 'melodic rap'] | ... | 0.635 | 0.537 | -7.895 | 0.0832 | 0.172 | 0.418 | 125.028 | 215381 | 0.383 | G |
| 1555 | 1556 | 199 | 1 | 2019-12-27--2020-01-03 | Lover (Remix) [feat. Shawn Mendes] | 4,595,450 | Taylor Swift | 42227614 | 3i9UVldZOE0aD0JnyfAZZ0 | ['pop', 'post-teen pop'] | ... | 0.448 | 0.603 | -7.176 | 0.064 | 0.433 | 0.0862 | 205.272 | 221307 | 0.422 | G |

295 rows × 23 columns
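
The same pattern (count values, take the most frequent labels, filter with isin) can be tried on a small made-up DataFrame. This is a sketch under assumed data: the artist and song names are placeholders, and it keeps the top 2 artists instead of the top 10.

```python
import pandas as pd

# Made-up data: artist "A" appears 3 times, "B" twice, "C" and "D" once each.
df = pd.DataFrame({
    "Artist": ["A", "B", "A", "C", "A", "B", "D"],
    "Song Name": ["s0", "s1", "s2", "s3", "s4", "s5", "s6"],
})

# value_counts sorts from most to least frequent, and its index holds the artist names.
counts = df["Artist"].value_counts()
top_art = counts.index[:2]             # top 2 here; the notes use the top 10

# Keep only the rows whose artist is one of the most frequent ones.
df2 = df[df["Artist"].isin(top_art)]
print(df2)
```

Slicing counts.index works because the index of the value_counts result contains the values themselves, while the counts are stored as the data.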

Question 5

import altair as alt
df2 = pd.read_csv("../data/cars.csv")

brush = alt.selection_interval()

c1 = alt.Chart(df2).mark_circle().encode(
    x = "Miles_per_Gallon",
    y = "Horsepower",
    color = "Origin"
).add_selection( #in Altair 5 and later, add_selection is deprecated in favor of add_params
    brush
)
c2 = alt.Chart(df2).mark_bar().encode(
    x = "Origin",
    y = alt.Y("count()", scale=alt.Scale(domain = [0,500]))
).transform_filter(
    brush
)
c1|c2