Week 4 Monday#
Announcements#
The midterm is a week from today. A sample midterm is posted on the Week 5 page on Canvas.
The midterm is closed book, closed computer, but you are allowed to use one notecard with handwritten notes (both sides). Jinghao will have the notecards for you before the quiz on Tuesday.
The best way to study is to go over the worksheets, quizzes, and the sample midterm. Next priority would be the lecture notes.
I briefly mention lots of topics. If you’re not sure if something could appear on the midterm, ask on Ed Discussion and I’ll answer there.
Plan for today#
Today I would like to briefly cover three useful Python topics (that are not specific to data science):
list comprehension
f-strings
lambda functions
I would also like to discuss the pandas DataFrame methods apply
and applymap
. Both of these methods take a function as their input. The apply
method applies that function to an entire row or an entire column at a time. The applymap
method applies that function to every entry on its own. (We have also seen map
which is very similar, but which is a Series method rather than a DataFrame method.)
Using applymap
and a lambda function#
Load the “taxis” dataset from Seaborn.
import seaborn as sns
df = sns.load_dataset("taxis")
Here is a reminder of what df
looks like.
df.head()
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
Set the value at row
3
and column"tip"
to NumPy’s Not-a-Number value.
import numpy as np
Here is an example of how to set a particular element to a particular value. Because we are referring to the column “tip” by name, we need to use loc
rather than iloc
.
df.loc[3, "tip"] = np.nan
Notice how the corresponding element has now been changed.
df.head()
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | NaN | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
Look at the first few rows of
df.isna()
. Can you recognize the effect of the missing value we inserted?
Notice how in the row 3
and the column "tip"
we have a True
, corresponding to this missing value.
df.isna()
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | True | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6428 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
6429 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
6430 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
6431 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
6432 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
6433 rows Ă— 14 columns
Write a function
is_missing
which takes as inputx
and as output returns"Missing"
ifx
isnp.nan
and returnsx
otherwise.
(I’m not sure the best way to make this work with string inputs, so you can assume the inputs are numeric.)
Surprisingly, the following does not work, as we will see below.
def is_missing(x):
if x == np.nan:
return "Missing"
else:
return x
It seems to work for the input 100
.
is_missing(100)
100
And it works for a string input.
is_missing("chris")
'chris'
But it doesn’t return the correct thing, "Missing"
, for np.nan
as the input.
# Surprise
is_missing(np.nan)
nan
The problem is that np.nan == np.nan
evaluates to False
. There is no way to know this in advance; it is just a convention. The idea behind it is, say we wanted to know if two values were equal (say two people’s salaries), and say they were both missing, would you want to consider those people as having equal salaries? This isn’t a proof by any means, just an explanation of why we might want this convention.
np.nan == np.nan
False
Instead we will use the NumPy function np.isnan
. (This is the part that does not seem to work for string inputs.) This is the NumPy analogue of the df.isna()
method we used above.
def is_missing(x):
if np.isnan(x):
return "Missing"
else:
return x
This still works for 10
as the input.
is_missing(10)
10
And now it is working for np.nan
as input.
is_missing(np.nan)
'Missing'
Apply
is_missing
to every entry in the columns from “passengers” to “total” usingapplymap
.
Here is how we get those columns. Surprising: loc
slices include the right endpoint. (That almost never happens in Python.)
df.loc[:, "passengers":"total"]
passengers | distance | fare | tip | tolls | total | |
---|---|---|---|---|---|---|
0 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 |
1 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 |
2 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 |
3 | 1 | 7.70 | 27.0 | NaN | 0.0 | 36.95 |
4 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 |
... | ... | ... | ... | ... | ... | ... |
6428 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 |
6429 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 |
6430 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 |
6431 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 |
6432 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 |
6433 rows Ă— 6 columns
The applymap
method takes as input a function, and then applies that function to every input. Notice how most values are unchanged, but we now have an additional string "Missing"
showing up in row 3
.
df.loc[:, "passengers":"total"].applymap(is_missing)
passengers | distance | fare | tip | tolls | total | |
---|---|---|---|---|---|---|
0 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 |
1 | 1 | 0.79 | 5.0 | 0.0 | 0.0 | 9.30 |
2 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 |
3 | 1 | 7.70 | 27.0 | Missing | 0.0 | 36.95 |
4 | 3 | 2.16 | 9.0 | 1.1 | 0.0 | 13.40 |
... | ... | ... | ... | ... | ... | ... |
6428 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 |
6429 | 1 | 18.74 | 58.0 | 0.0 | 0.0 | 58.80 |
6430 | 1 | 4.14 | 16.0 | 0.0 | 0.0 | 17.30 |
6431 | 1 | 1.12 | 6.0 | 0.0 | 0.0 | 6.80 |
6432 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 |
6433 rows Ă— 6 columns
Do the same thing using
applymap
and a lambda function.
Two new things are being introduced here: lambda functions and the syntax ??? if ??? else ???
. The term lambda
tells Python that we are about to define a function. The part before the colon (x
in this case) designates the inputs, and the part after the colon designates the output.
The syntax A if B else C
will evaluate to A
if B
is True
and will evaluate to C
if B
is False
. This is a shorter way of writing the function that we defined above using def
.
df.loc[:, "passengers":"total"].applymap(lambda x: "Missing" if np.isnan(x) else x)
passengers | distance | fare | tip | tolls | total | |
---|---|---|---|---|---|---|
0 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 |
1 | 1 | 0.79 | 5.0 | 0.0 | 0.0 | 9.30 |
2 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 |
3 | 1 | 7.70 | 27.0 | Missing | 0.0 | 36.95 |
4 | 3 | 2.16 | 9.0 | 1.1 | 0.0 | 13.40 |
... | ... | ... | ... | ... | ... | ... |
6428 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 |
6429 | 1 | 18.74 | 58.0 | 0.0 | 0.0 | 58.80 |
6430 | 1 | 4.14 | 16.0 | 0.0 | 0.0 | 17.30 |
6431 | 1 | 1.12 | 6.0 | 0.0 | 0.0 | 6.80 |
6432 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 |
6433 rows Ă— 6 columns
Notice how we haven’t changed the original DataFrame df
, other than putting the missing value in row 3
.
Drop the rows which contain missing values, using
df = df.dropna(axis=???).copy()
. We include thecopy()
at the end to prevent warnings later when we change the DataFrame.
df.head()
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | NaN | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
Notice how the row with the missing value, 3
, has disappeared. If you tried to use df.loc[3]
now, you would raise an error. If you tried to use df.iloc[3]
, it would work (and return the row with label 4
).
df = df.dropna(axis=0).copy()
df
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
5 | 2019-03-11 10:37:23 | 2019-03-11 10:47:31 | 1 | 0.49 | 7.5 | 2.16 | 0.0 | 12.96 | yellow | credit card | Times Sq/Theatre District | Midtown East | Manhattan | Manhattan |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | East Harlem North | Central Harlem North | Manhattan | Manhattan |
6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx |
6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn |
6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 | green | credit card | East New York | East Flatbush/Remsen Village | Brooklyn | Brooklyn |
6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn |
6340 rows Ă— 14 columns
Examples with list comprehension and f-strings#
The expression
pd.to_datetime("4-24-2023").day_name()
produces the string"Monday"
. Using this idea, make the following length 7 list, and name the resultday_list
.
["Monday", "Tuesday", ..., "Sunday"]
(On a quiz or exam, explicitly writing out 7 different values will not get credit.)
import pandas as pd
We don’t need to use the dt
accessor here, because we are working with a single value, not with a pandas Series.
pd.to_datetime("4-24-2023").day_name()
'Monday'
Let’s slowly build up to making this list. Here we get the day numbers we will use.
# for-loop way
for i in range(24,31):
print(i)
24
25
26
27
28
29
30
“Of course” the following is not going to work… how would Python know that the i
inside the string was a variable?
# for-loop way
for i in range(24,31):
print("4-i-2023")
4-i-2023
4-i-2023
4-i-2023
4-i-2023
4-i-2023
4-i-2023
4-i-2023
The following is a little clumsy, and even it does not work, because we are trying to combine a string like "4-"
with an integer like 24
.
# for-loop way, old way
for i in range(24,31):
print("4-" + i + "-2023")
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [32], line 4
1 # for-loop way, old way
3 for i in range(24,31):
----> 4 print("4-" + i + "-2023")
TypeError: can only concatenate str (not "int") to str
To fix the above error, we need to convert i
to a string, using str
. This works, but we will see a much more elegant way below.
# for-loop way, old way
for i in range(24,31):
print("4-" + str(i) + "-2023")
4-24-2023
4-25-2023
4-26-2023
4-27-2023
4-28-2023
4-29-2023
4-30-2023
Here is the exact same thing, but using f-strings. (These were added relatively recently, in Python 3.6.) Notice the two changes: (1) We put the letter f
before the opening quotation mark, and (2) we put the variable inside curly brackets.
# for-loop way, f-string way
for i in range(24,31):
print(f"4-{i}-2023")
4-24-2023
4-25-2023
4-26-2023
4-27-2023
4-28-2023
4-29-2023
4-30-2023
Now we can finally use the to_datetime
function and the day_name
method to get the strings for the days of the week.
# for-loop way, f-string way
for i in range(24,31):
print(pd.to_datetime(f"4-{i}-2023").day_name())
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday
Sunday
Here would be the for-loop way to put these names into a list.
day_list = []
for i in range(24,31):
day_list.append(pd.to_datetime(f"4-{i}-2023").day_name())
day_list
['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
The list comprehension approach is much more elegant. The most basic example of list comprehension is [A for B in C]
, where A
is what should go into the list, B
is the variable name, and C
is whatever we are iterating over. The best way to gain comfort with list comprehension is to do the same thing using a for loop, and compare. For example, compare the following to what we just did.
# list comprehension way
day_list = [pd.to_datetime(f"4-{i}-2023").day_name() for i in range(24,31)]
day_list
['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
We didn’t get further than this.
Make a new column named “Day” in the DataFrame, containing the day of the pickup (as a string, like
"Monday"
).
Why didn’t we have to use
pd.to_datetime
?
For each day, what was the median tip for that day? Print out a string with explanation for each day. Use
groupby
. Also specify:.2f
in the string formatting part so that only two decimal places get printed.
Make a list of all the row labels in
df
for which the “pickup_zone” or “dropoff_zone” contains the substring “Airport”. Name the resultind_list
.
Write a function
has_airport
which takes as input a row fromdf
(not a row label but an entire row) and as output returnsTrue
if the “pickup_zone” or “dropoff_zone” entry contains the substring “Airport”.
Make a new column
"Airport"
which containsTrue
if the “pickup_zone” or “dropoff_zone” contains the substring “Airport”, and otherwise containsFalse
. Usedf.apply(???, axis=???)
.
Make the same column, again using
apply
, but this time using a lambda function.
Does the result match the indices we found in
ind_list
?
Last part of class#
Time to work on Worksheets 5-6 (due tonight) and 7 (due next week).
Maya is here to help.