Week 4, Tuesday Discussion

Today:

Go through practice midterm solutions

Reminders:

Homework #3 due tonight
Midterm Thursday during discussion
- I have extra notecards, if you lost yours or did not get one

Question 1

(a) What is an advantage of a Python tuple in comparison to a Python list?

A: A tuple can go in a set, while a list cannot, because a tuple is hashable and a list is not.
Also, a list can be changed after we create it, while a tuple cannot be changed after it is made (e.g. a list has .append() while a tuple does not).

(b) What is an advantage of np.arange in comparison to range?

Entries of np.arange do not have to be integers, while range can only hold integers.
np.arange returns a NumPy array, which has many more methods/functions available (e.g. taking reciprocals), and the operations are “vectorized” (which makes computation much faster).

(c) What will be the result of the following code?

a = range(1,20,4)
b = [2,3,2]
[f"Train{a[i]}{i}" for i in b]

['Train92', 'Train133', 'Train92']
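To see where that output comes from, here is the same computation written as an ordinary for loop. This is only a sketch for illustration; the variable name labels is my own and is not part of the practice midterm.

```python
a = range(1, 20, 4)   # holds 1, 5, 9, 13, 17
b = [2, 3, 2]

labels = []
for i in b:
    # a[i] indexes into the range just like indexing into a list,
    # and the f-string glues "Train", a[i] and i together with no separator
    labels.append(f"Train{a[i]}{i}")

print(list(a))
print(labels)
```

For i = 3 we get a[3] = 13, so the string is "Train" + "13" + "3", which is why 'Train133' appears in the middle.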

#Notice how this range still only goes up to 17
a2 = range(1,21,4)
list(a2)

[1, 5, 9, 13, 17]
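Going back to part (b), here is a small sketch of the two advantages of np.arange. The numbers are made up for illustration. Like range, np.arange leaves out the stop value.

```python
import numpy as np

# np.arange accepts a non-integer step (the stop value 3 is excluded)
arr = np.arange(1, 3, 0.5)

# Vectorized operation: the reciprocal of every entry at once, no loop needed
recip = 1 / arr

# range only works with integers, so a float step raises a TypeError
try:
    range(1, 3, 0.5)
    range_worked = True
except TypeError:
    range_worked = False

print(arr)
print(recip)
print(range_worked)
```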

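#A range object is hashable, so it can be an element of a set as well as of a list
r = range(10)
my_set = {r}
my_list = [r]
type(my_set)
type(my_list[0]) #only this last line is displayed as the cell's output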

range
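For part (a), the same kind of experiment shows the difference between a tuple and a list. This is a minimal sketch; the names point_tuple, point_list and the two flags are made up for illustration.

```python
point_tuple = (1, 2)
point_list = [1, 2]

# A tuple is hashable, so it can be an element of a set
s = {point_tuple}

# A list is not hashable, so putting it in a set raises a TypeError
try:
    bad = {point_list}
    list_in_set = True
except TypeError:
    list_in_set = False

# A list can be changed after it is created...
point_list.append(3)

# ...but a tuple has no .append method at all
try:
    point_tuple.append(3)
    tuple_changed = True
except AttributeError:
    tuple_changed = False

print(s, list_in_set, point_list, tuple_changed)
```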

(d) What is an example of a similarity between a pandas Series and a Python dictionary?

Elements of both are accessed using square brackets [] with a label/key inside (some examples are below, after part (e))

(e) How can the following error be corrected?

import pandas as pd
df = pd.read_csv("../data/spotify_dataset.csv")
#Drop the rows whose Energy entry is the string " " (~ negates the boolean Series)
df = df[~df.Energy.isin([" "])]

type(df.loc[0,"Energy"])

str

df.Energy = pd.to_numeric(df.Energy) #with the correction

type(df.loc[0,"Energy"])

numpy.float64

df.Energy.mean()

0.633495145631068
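If you do not have the Spotify file handy, the same steps can be tried on a tiny made-up DataFrame. This is only a sketch: the values below are invented and toy is a hypothetical name, not part of the dataset.

```python
import pandas as pd

# Made-up data: numbers stored as strings, with " " marking a missing entry
toy = pd.DataFrame({"Energy": ["0.8", " ", "0.5", "0.6"]})

# Same filter as above; .copy() so the later column assignment
# acts on an independent DataFrame
toy = toy[~toy["Energy"].isin([" "])].copy()
before = type(toy["Energy"].iloc[0])   # the entries are still strings here

# The correction: convert the column of strings to numbers
toy["Energy"] = pd.to_numeric(toy["Energy"])
after = toy["Energy"].dtype
avg = toy["Energy"].mean()

print(before, after, avg)
```

Once the column is numeric, numerical methods like .mean() behave as expected.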

type(df["Energy"])

pandas.core.series.Series

df["Energy"][0]

0.8

test = {"Energy":5}
test["Energy"]
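Here is the Series/dictionary similarity from part (d) side by side, as a small sketch with made-up values (ser and d are hypothetical names).

```python
import pandas as pd

d = {"Energy": 0.8, "Tempo": 120.0}
ser = pd.Series(d)   # the dictionary keys become the index labels of the Series

# Both are accessed with square brackets and a label/key
from_dict = d["Energy"]
from_series = ser["Energy"]

# The keys of the dictionary match the index of the Series
same_labels = list(d.keys()) == list(ser.index)

print(from_dict, from_series, same_labels)
```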

Question 2

Assume df is a pandas DataFrame, and that its “day” column has strings representing dates. Write code to extract the sub-DataFrame which contains only those rows corresponding to Tuesday or Wednesday.

#Our first goal is to convert the date to a day of the week. I will store this information in a new column
#(Here the "Release Date" column of the Spotify dataset plays the role of the "day" column in the question)
df["weekday"] = pd.to_datetime(df["Release Date"]).dt.day_name()

#The second step is to create a boolean Series that is True where df["weekday"] is Tuesday or Wednesday
bool_ser = df["weekday"].isin(["Tuesday", "Wednesday"])

#Get the sub-DataFrame
df[bool_ser]

	Index	Highest Charting Position	Number of Times Charted	Week of Highest Charting	Song Name	Streams	Artist	Artist Followers	Song ID	Genre	...	Energy	Loudness	Speechiness	Acousticness	Liveness	Tempo	Duration (ms)	Valence	Chord	weekday
5	6	1	18	2021-05-07--2021-05-14	MONTERO (Call Me By Your Name)	30,071,134	Lil Nas X	5473565	67BtfxlNbhBmCDR2L2l8qd	['lgbtq+ hip hop', 'pop rap']	...	0.508	-6.682	0.152	0.297	0.384	178.818	137876	0.758	G#/Ab	Wednesday
29	30	3	28	2021-04-02--2021-04-09	Astronaut In The Ocean	14,174,752	Masked Wolf	365975	3Ofmpyhv5UAQ70mENzB277	['australian hip hop']	...	0.695	-6.865	0.0913	0.175	0.15	149.996	132780	0.472	E	Wednesday
44	45	9	39	2021-02-26--2021-03-05	The Business	10,739,770	Tiësto	5785065	6f3Slt0GbA2bPZlz0aIFXN	['big room', 'brostep', 'dance pop', 'dutch ed...	...	0.620	-7.079	0.232	0.414	0.112	120.031	164000	0.235	G#/Ab	Wednesday
48	49	29	6	2021-06-25--2021-07-02	Fiel - Remix	10,032,746	Wisin, Jhay Cortez, Anuel AA, Los Legendarios,...	6929075	43qcs9NpJhDxtG91zxFkj7	['latin', 'latin hip hop', 'reggaeton', 'trap ...	...	0.711	-4.733	0.0473	0.398	0.118	97.99	349547	0.573	F#/Gb	Tuesday
50	51	19	4	2021-07-02--2021-07-09	Nicky Jam: Bzrp Music Sessions, Vol. 41	9,799,701	Bizarrap, Nicky Jam	3126961	03LfOYi0icz4souspZVVhq	['argentine hip hop', 'pop venezolano', 'trap ...	...	0.849	-3.167	0.153	0.0913	0.145	89.907	158087	0.818	C#/Db	Wednesday
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1457	1458	136	1	2020-01-31--2020-02-07	Mon Ami	5,774,770	Samra	1045091	1R4xkZXQUQ8QJtAdwHkSgC	['german hip hop']	...	0.731	-3.723	0.383	0.188	0.111	105.666	138409	0.537	G#/Ab	Wednesday
1464	1465	106	6	2020-01-03--2020-01-10	Liar	4,896,939	Camila Cabello	22698747	7LzouaWGFCy4tkXDOOnEyM	['dance pop', 'electropop', 'pop', 'post-teen ...	...	0.498	-6.684	0.0456	0.0169	0.319	98.016	207039	0.652	B	Wednesday
1547	1548	156	1	2019-12-27--2020-01-03	Combatchy (feat. MC Rebecca)	5,149,797	Anitta, Lexa, Luísa Sonza	10741972	2bPtwnrpFNEe8N7Q85kLHw	['funk carioca', 'funk pop', 'pagode baiano', ...	...	0.730	-3.032	0.0809	0.383	0.0197	150.134	157600	0.605	C#/Db	Wednesday
1554	1555	198	1	2019-12-27--2020-01-03	Surtada - Remix Brega Funk	4,607,385	Dadá Boladão, Tati Zaqui, OIK	208630	5F8ffc8KWKNawllr5WsW0r	['brega funk', 'funk carioca']	...	0.550	-7.026	0.0587	0.249	0.182	154.064	152784	0.881	F	Wednesday
1555	1556	199	1	2019-12-27--2020-01-03	Lover (Remix) [feat. Shawn Mendes]	4,595,450	Taylor Swift	42227614	3i9UVldZOE0aD0JnyfAZZ0	['pop', 'post-teen pop']	...	0.603	-7.176	0.064	0.433	0.0862	205.272	221307	0.422	G	Wednesday

121 rows × 24 columns
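The same two steps can be checked by hand on a tiny DataFrame that really does have a "day" column, as in the question. The dates and values below are made up for illustration, and toy is a hypothetical name.

```python
import pandas as pd

# Made-up data: 2021-05-03 was a Monday, so the next two dates
# are a Tuesday and a Wednesday, and 2021-05-07 was a Friday
toy = pd.DataFrame({
    "day": ["2021-05-03", "2021-05-04", "2021-05-05", "2021-05-07"],
    "value": [10, 20, 30, 40],
})

# Step 1: convert the strings to datetimes and get the day names
weekday = pd.to_datetime(toy["day"]).dt.day_name()

# Step 2: boolean Series, then boolean indexing
sub = toy[weekday.isin(["Tuesday", "Wednesday"])]

print(weekday.tolist())
print(sub)
```

Notice that we do not need to store the weekday as a new column; a separate Series works for boolean indexing as long as its index matches the DataFrame's index.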

#Another option, if you insist on list comprehension :)
good_rows = [c for c in df.index if df.loc[c,"weekday"] in ["Tuesday", "Wednesday"]]
df.loc[good_rows]

	Index	Highest Charting Position	Number of Times Charted	Week of Highest Charting	Song Name	Streams	Artist	Artist Followers	Song ID	Genre	...	Energy	Loudness	Speechiness	Acousticness	Liveness	Tempo	Duration (ms)	Valence	Chord	weekday
5	6	1	18	2021-05-07--2021-05-14	MONTERO (Call Me By Your Name)	30,071,134	Lil Nas X	5473565	67BtfxlNbhBmCDR2L2l8qd	['lgbtq+ hip hop', 'pop rap']	...	0.508	-6.682	0.152	0.297	0.384	178.818	137876	0.758	G#/Ab	Wednesday
29	30	3	28	2021-04-02--2021-04-09	Astronaut In The Ocean	14,174,752	Masked Wolf	365975	3Ofmpyhv5UAQ70mENzB277	['australian hip hop']	...	0.695	-6.865	0.0913	0.175	0.15	149.996	132780	0.472	E	Wednesday
44	45	9	39	2021-02-26--2021-03-05	The Business	10,739,770	Tiësto	5785065	6f3Slt0GbA2bPZlz0aIFXN	['big room', 'brostep', 'dance pop', 'dutch ed...	...	0.620	-7.079	0.232	0.414	0.112	120.031	164000	0.235	G#/Ab	Wednesday
48	49	29	6	2021-06-25--2021-07-02	Fiel - Remix	10,032,746	Wisin, Jhay Cortez, Anuel AA, Los Legendarios,...	6929075	43qcs9NpJhDxtG91zxFkj7	['latin', 'latin hip hop', 'reggaeton', 'trap ...	...	0.711	-4.733	0.0473	0.398	0.118	97.99	349547	0.573	F#/Gb	Tuesday
50	51	19	4	2021-07-02--2021-07-09	Nicky Jam: Bzrp Music Sessions, Vol. 41	9,799,701	Bizarrap, Nicky Jam	3126961	03LfOYi0icz4souspZVVhq	['argentine hip hop', 'pop venezolano', 'trap ...	...	0.849	-3.167	0.153	0.0913	0.145	89.907	158087	0.818	C#/Db	Wednesday
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1457	1458	136	1	2020-01-31--2020-02-07	Mon Ami	5,774,770	Samra	1045091	1R4xkZXQUQ8QJtAdwHkSgC	['german hip hop']	...	0.731	-3.723	0.383	0.188	0.111	105.666	138409	0.537	G#/Ab	Wednesday
1464	1465	106	6	2020-01-03--2020-01-10	Liar	4,896,939	Camila Cabello	22698747	7LzouaWGFCy4tkXDOOnEyM	['dance pop', 'electropop', 'pop', 'post-teen ...	...	0.498	-6.684	0.0456	0.0169	0.319	98.016	207039	0.652	B	Wednesday
1547	1548	156	1	2019-12-27--2020-01-03	Combatchy (feat. MC Rebecca)	5,149,797	Anitta, Lexa, Luísa Sonza	10741972	2bPtwnrpFNEe8N7Q85kLHw	['funk carioca', 'funk pop', 'pagode baiano', ...	...	0.730	-3.032	0.0809	0.383	0.0197	150.134	157600	0.605	C#/Db	Wednesday
1554	1555	198	1	2019-12-27--2020-01-03	Surtada - Remix Brega Funk	4,607,385	Dadá Boladão, Tati Zaqui, OIK	208630	5F8ffc8KWKNawllr5WsW0r	['brega funk', 'funk carioca']	...	0.550	-7.026	0.0587	0.249	0.182	154.064	152784	0.881	F	Wednesday
1555	1556	199	1	2019-12-27--2020-01-03	Lover (Remix) [feat. Shawn Mendes]	4,595,450	Taylor Swift	42227614	3i9UVldZOE0aD0JnyfAZZ0	['pop', 'post-teen pop']	...	0.603	-7.176	0.064	0.433	0.0862	205.272	221307	0.422	G	Wednesday

121 rows × 24 columns

Question 3

import numpy as np 
rng = np.random.default_rng()
A = rng.integers(0,6,size = (10,10))
#Do not worry about random numbers! You will not be tested on this.
df = pd.DataFrame(A)

#df.loc is label-based!!
df2 = df.loc[5:,3:].copy()
df.head(3)

	0	1	2	3	4	5	6	7	8	9
0	4	5	2	4	4	3	3	5	1	3
1	1	0	4	3	2	2	2	4	5	1
2	3	0	4	4	4	2	4	4	2	0

Will there be a difference between df2.iloc[:,4:].shape and df2.loc[:,4:].shape?

df2.iloc[:,4:] is saying take all of the rows of df2, and take the column in integer position 4 and everything after. Since df2 has the columns labeled 3 through 9, these are the columns labeled 7, 8, 9.

df2.loc[:,4:] is saying take all of the rows of df2, and take the column labeled 4 and everything after, i.e. the columns labeled 4 through 9.

df2.iloc[:,4:].shape

(5, 3)

df2.loc[:,4:].shape

(5, 6)

Also keep in mind! df.index returns the row labels, not the integer positions
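Since the DataFrame above is random, here is a sketch of the same question with fixed entries, so the result can be reproduced exactly. The names grid and sub are made up for illustration; only the labels matter, not the entries.

```python
import numpy as np
import pandas as pd

# 10x10 DataFrame with row labels 0..9 and column labels 0..9
grid = pd.DataFrame(np.arange(100).reshape(10, 10))

# Label-based slicing: rows labeled 5..9, columns labeled 3..9
sub = grid.loc[5:, 3:].copy()

by_position = sub.iloc[:, 4:]   # columns at integer positions 4, 5, 6
by_label = sub.loc[:, 4:]       # columns labeled 4, 5, ..., 9

print(list(sub.index), list(sub.columns))
print(list(by_position.columns), by_position.shape)
print(list(by_label.columns), by_label.shape)
```

The row labels of sub start at 5, not 0, which is what the reminder about df.index is pointing at.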

Question 4

df = pd.read_csv("../data/spotify_dataset.csv")

#Find the top 10 artists, meaning the 10 values that appear most often in the "Artist" column
top_art = df["Artist"].value_counts().index[:10]

df2 = df[df["Artist"].isin(top_art)]
df2

	Index	Highest Charting Position	Number of Times Charted	Week of Highest Charting	Song Name	Streams	Artist	Artist Followers	Song ID	Genre	...	Danceability	Energy	Loudness	Speechiness	Acousticness	Liveness	Tempo	Duration (ms)	Valence	Chord
8	9	3	8	2021-06-18--2021-06-25	Yonaguni	25,030,128	Bad Bunny	36142273	2JPLbjOn0wPCngEot2STUS	['latin', 'reggaeton', 'trap latino']	...	0.644	0.648	-4.601	0.118	0.276	0.135	179.951	206710	0.44	C#/Db
12	13	5	3	2021-07-09--2021-07-16	Permission to Dance	22,062,812	BTS	37106176	0LThjFY2iTtNdd4wviwVV2	['k-pop', 'k-pop boy group']	...	0.702	0.741	-5.33	0.0427	0.00544	0.337	124.925	187585	0.646	A
13	14	1	19	2021-04-02--2021-04-09	Peaches (feat. Daniel Caesar & Giveon)	20,294,457	Justin Bieber	48504126	4iJyoBOLtHqaGxP12qzhQI	['canadian pop', 'pop', 'post-teen pop']	...	0.677	0.696	-6.181	0.119	0.321	0.42	90.03	198082	0.464	C
14	15	2	10	2021-05-21--2021-05-28	Butter	19,985,713	BTS	37106176	2bgTY4UwhfBYhGT4HUYStN	['k-pop', 'k-pop boy group']	...	0.759	0.459	-5.187	0.0948	0.00323	0.0906	109.997	164442	0.695	G#/Ab
17	18	5	14	2021-04-23--2021-04-30	Save Your Tears (with Ariana Grande) (Remix)	18,053,141	The Weeknd	35305637	37BZB0z9T8Xu7U3e65qxFy	['canadian contemporary r&b', 'canadian pop', ...	...	0.65	0.825	-4.645	0.0325	0.0215	0.0936	118.091	191014	0.593	C
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1499	1500	100	1	2020-01-17--2020-01-24	Alfred - Interlude	8,030,151	Eminem	46814751	4EmunTy7kNBYQivOa8F6b8	['detroit hip hop', 'hip hop', 'rap']	...	0.429	0.231	-20.43	0.402	0.878	0.279	74.545	30133	0.914	F
1500	1501	102	1	2020-01-17--2020-01-24	Little Engine	7,913,461	Eminem	46814751	4qNWEOMyexn7b8Icyk29t9	['detroit hip hop', 'hip hop', 'rap']	...	0.769	0.811	-4.162	0.228	0.0234	0.0451	155.081	177293	0.76	A#/Bb
1501	1502	113	1	2020-01-17--2020-01-24	I Will (feat. KXNG Crooked, Royce Da 5'9" & Jo...	7,115,414	Eminem	46814751	3CJbxqRQ0JNCqboWDNUUeX	['detroit hip hop', 'hip hop', 'rap']	...	0.635	0.543	-5.941	0.067	0.0454	0.272	98.743	303000	0.036	G#/Ab
1549	1550	187	1	2019-12-27--2020-01-03	Let Me Know (I Wonder Why Freestyle)	4,701,532	Juice WRLD	19102888	3wwo0bJvDSorOpNfzEkfXx	['chicago rap', 'melodic rap']	...	0.635	0.537	-7.895	0.0832	0.172	0.418	125.028	215381	0.383	G
1555	1556	199	1	2019-12-27--2020-01-03	Lover (Remix) [feat. Shawn Mendes]	4,595,450	Taylor Swift	42227614	3i9UVldZOE0aD0JnyfAZZ0	['pop', 'post-teen pop']	...	0.448	0.603	-7.176	0.064	0.433	0.0862	205.272	221307	0.422	G

295 rows × 23 columns
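Here is the same value_counts and isin pattern on a tiny made-up DataFrame, keeping the top 2 artists instead of the top 10. The artist and song names are invented, and toy, counts and top2 are hypothetical names.

```python
import pandas as pd

# Made-up data: artist "A" appears 3 times, "B" twice, "C" once
toy = pd.DataFrame({
    "Artist": ["A", "B", "A", "C", "B", "A"],
    "Song": ["s1", "s2", "s3", "s4", "s5", "s6"],
})

# value_counts sorts from most frequent to least frequent,
# so the first entries of its index are the most common artists
counts = toy["Artist"].value_counts()
top2 = counts.index[:2]

# Keep only the rows whose artist is one of the top 2
sub = toy[toy["Artist"].isin(top2)]

print(counts)
print(sub)
```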

Question 5

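import altair as alt
df2 = pd.read_csv("../data/cars.csv")

#Interval selection: click and drag on a chart to select a rectangular region
brush = alt.selection_interval()

#Scatter plot that the brush is attached to
c1 = alt.Chart(df2).mark_circle().encode(
    x = "Miles_per_Gallon",
    y = "Horsepower",
    color = "Origin"
).add_selection(  #in Altair 5 and later, add_params replaces add_selection
    brush
)

#Bar chart that counts only the points currently selected by the brush
c2 = alt.Chart(df2).mark_bar().encode(
    x = "Origin",
    y = alt.Y("count()", scale=alt.Scale(domain = [0,500]))
).transform_filter(
    brush
)

#Display the two charts side by side
c1|c2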

UC Irvine Math 10 S22
