Week 4 Monday#

Announcements#

  • The midterm is a week from today. A sample midterm is posted on the Week 5 page on Canvas.

  • The midterm is closed book, closed computer, but you are allowed to use one notecard with handwritten notes (both sides). Jinghao will have the notecards for you before the quiz on Tuesday.

  • The best way to study is to go over the worksheets, quizzes, and the sample midterm. Next priority would be the lecture notes.

  • I briefly mention lots of topics. If you’re not sure whether something could appear on the midterm, ask on Ed Discussion and I’ll answer there.

Plan for today#

Today I would like to briefly cover three useful Python topics (that are not specific to data science):

  • list comprehension

  • f-strings

  • lambda functions

I would also like to discuss the pandas DataFrame methods apply and applymap. Both of these methods take a function as their input. The apply method applies that function to an entire row or an entire column at a time. The applymap method applies that function to every entry on its own. (We have also seen map, which is very similar, but which is a Series method rather than a DataFrame method.)
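Here is a quick illustration of the difference (a small sketch, not from lecture, using a made-up two-column DataFrame named mini).

import pandas as pd

# A tiny made-up DataFrame, just for illustration (not the taxis data).
mini = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# applymap: the function receives one entry at a time.
mini.applymap(lambda x: x * 10)            # every entry gets multiplied by 10

# apply with axis=0: the function receives one entire column (a Series) at a time.
mini.apply(lambda col: col.sum(), axis=0)  # column sums: a -> 3, b -> 7

# apply with axis=1: the function receives one entire row (a Series) at a time.
mini.apply(lambda row: row.sum(), axis=1)  # row sums: 4 and 6

# map: the Series analogue of applymap, used on a single column.
mini["a"].map(lambda x: x * 10)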

Using applymap and a lambda function#

  • Load the “taxis” dataset from Seaborn.

import seaborn as sns
df = sns.load_dataset("taxis")

Here is a reminder of what df looks like.

df.head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 6.15 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan
  • Set the value at row 3 and column "tip" to NumPy’s Not-a-Number value.

import numpy as np

Here is an example of how to set a particular element to a particular value. Because we are referring to the column “tip” by name, we need to use loc rather than iloc.

df.loc[3, "tip"] = np.nan

Notice how the corresponding element has now been changed.

df.head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 NaN 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan
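As an aside (not from lecture), if we did want to use iloc here, we would need the integer position of the "tip" column instead of its name. One way to get that position (a sketch, not the only option) is Index.get_loc:

df.iloc[3, df.columns.get_loc("tip")] = np.nan  # here "tip" happens to be at integer position 5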
  • Look at the first few rows of df.isna(). Can you recognize the effect of the missing value we inserted?

Notice how, in row 3 and the column "tip", we now have a True, corresponding to the missing value we just inserted.

df.isna()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 False False False False False False False False False False False False False False
1 False False False False False False False False False False False False False False
2 False False False False False False False False False False False False False False
3 False False False False False True False False False False False False False False
4 False False False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6428 False False False False False False False False False False False False False False
6429 False False False False False False False False False False False False False False
6430 False False False False False False False False False False False False False False
6431 False False False False False False False False False False False False False False
6432 False False False False False False False False False False False False False False

6433 rows × 14 columns

  • Write a function is_missing which takes as input x and as output returns "Missing" if x is np.nan and returns x otherwise.

(I’m not sure of the best way to make this work with string inputs, so you can assume the inputs are numeric.)

Surprisingly, the following does not work, as we will see below.

def is_missing(x):
    if x == np.nan:
        return "Missing"
    else:
        return x

It seems to work for the input 100.

is_missing(100)
100

And it works for a string input.

is_missing("chris")
'chris'

But it doesn’t return the correct thing, "Missing", for np.nan as the input.

# Surprise
is_missing(np.nan)
nan

The problem is that np.nan == np.nan evaluates to False. There is no way to know this in advance; it is just a convention. The idea behind it: suppose we wanted to know whether two values were equal (say, two people’s salaries), and both values happened to be missing. Would we want to consider those people as having equal salaries? This isn’t a proof by any means, just an explanation of why we might want this convention.

np.nan == np.nan
False

Instead we will use the NumPy function np.isnan. (This is the part that does not seem to work for string inputs.) This is the NumPy analogue of the df.isna() method we used above.

def is_missing(x):
    if np.isnan(x):
        return "Missing"
    else:
        return x

This still works for 10 as the input.

is_missing(10)
10

And now it is working for np.nan as input.

is_missing(np.nan)
'Missing'
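As an aside (not covered in lecture), here is why we restricted to numeric inputs: calling np.isnan on a string raises a TypeError. The pandas function pd.isna is one alternative that, as far as I know, accepts strings as well as numbers.

import pandas as pd

# np.isnan("chris") would raise a TypeError, so this version of is_missing
# does not accept strings.  pd.isna handles both cases:
pd.isna(np.nan)    # True
pd.isna("chris")   # False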
  • Apply is_missing to every entry in the columns from “passengers” to “total” using applymap.

Here is how we get those columns. Surprisingly, slicing with loc includes the right endpoint. (That almost never happens in Python.)

df.loc[:, "passengers":"total"]
passengers distance fare tip tolls total
0 1 1.60 7.0 2.15 0.0 12.95
1 1 0.79 5.0 0.00 0.0 9.30
2 1 1.37 7.5 2.36 0.0 14.16
3 1 7.70 27.0 NaN 0.0 36.95
4 3 2.16 9.0 1.10 0.0 13.40
... ... ... ... ... ... ...
6428 1 0.75 4.5 1.06 0.0 6.36
6429 1 18.74 58.0 0.00 0.0 58.80
6430 1 4.14 16.0 0.00 0.0 17.30
6431 1 1.12 6.0 0.00 0.0 6.80
6432 1 3.85 15.0 3.36 0.0 20.16

6433 rows × 6 columns

The applymap method takes a function as input and applies that function to every entry of the DataFrame. Notice how most values are unchanged, but the string "Missing" now shows up in row 3.

df.loc[:, "passengers":"total"].applymap(is_missing)
passengers distance fare tip tolls total
0 1 1.60 7.0 2.15 0.0 12.95
1 1 0.79 5.0 0.0 0.0 9.30
2 1 1.37 7.5 2.36 0.0 14.16
3 1 7.70 27.0 Missing 0.0 36.95
4 3 2.16 9.0 1.1 0.0 13.40
... ... ... ... ... ... ...
6428 1 0.75 4.5 1.06 0.0 6.36
6429 1 18.74 58.0 0.0 0.0 58.80
6430 1 4.14 16.0 0.0 0.0 17.30
6431 1 1.12 6.0 0.0 0.0 6.80
6432 1 3.85 15.0 3.36 0.0 20.16

6433 rows × 6 columns

  • Do the same thing using applymap and a lambda function.

Two new things are being introduced here: lambda functions and the conditional expression syntax A if B else C. The term lambda tells Python that we are about to define a function. The part before the colon (x in this case) designates the inputs, and the part after the colon designates the output.

The syntax A if B else C will evaluate to A if B is True and will evaluate to C if B is False. This is a shorter way of writing the function that we defined above using def.
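For example, here is the same is_missing function written as a lambda function and tried on single values (a quick check, not from lecture).

# The same logic as is_missing, written as a lambda function.
is_missing_lambda = lambda x: "Missing" if np.isnan(x) else x
is_missing_lambda(10)       # 10
is_missing_lambda(np.nan)   # 'Missing'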

df.loc[:, "passengers":"total"].applymap(lambda x: "Missing" if np.isnan(x) else x)
passengers distance fare tip tolls total
0 1 1.60 7.0 2.15 0.0 12.95
1 1 0.79 5.0 0.0 0.0 9.30
2 1 1.37 7.5 2.36 0.0 14.16
3 1 7.70 27.0 Missing 0.0 36.95
4 3 2.16 9.0 1.1 0.0 13.40
... ... ... ... ... ... ...
6428 1 0.75 4.5 1.06 0.0 6.36
6429 1 18.74 58.0 0.0 0.0 58.80
6430 1 4.14 16.0 0.0 0.0 17.30
6431 1 1.12 6.0 0.0 0.0 6.80
6432 1 3.85 15.0 3.36 0.0 20.16

6433 rows × 6 columns

Notice that we haven’t changed the original DataFrame df (other than inserting the missing value in row 3 earlier); applymap returns a new DataFrame rather than modifying df in place.

  • Drop the rows which contain missing values, using df = df.dropna(axis=???).copy(). We include the copy() at the end to prevent warnings later when we change the DataFrame.

For reference, here is df before we drop anything; row 3 still contains the NaN tip.

df.head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 NaN 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan

In the output below, notice how the row with the missing value, row label 3, has disappeared. If you tried to use df.loc[3] now, it would raise an error. If you tried to use df.iloc[3], it would work (and would return the row with label 4).

df = df.dropna(axis=0).copy()
df
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan
5 2019-03-11 10:37:23 2019-03-11 10:47:31 1 0.49 7.5 2.16 0.0 12.96 yellow credit card Times Sq/Theatre District Midtown East Manhattan Manhattan
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6428 2019-03-31 09:51:53 2019-03-31 09:55:27 1 0.75 4.5 1.06 0.0 6.36 green credit card East Harlem North Central Harlem North Manhattan Manhattan
6429 2019-03-31 17:38:00 2019-03-31 18:34:23 1 18.74 58.0 0.00 0.0 58.80 green credit card Jamaica East Concourse/Concourse Village Queens Bronx
6430 2019-03-23 22:55:18 2019-03-23 23:14:25 1 4.14 16.0 0.00 0.0 17.30 green cash Crown Heights North Bushwick North Brooklyn Brooklyn
6431 2019-03-04 10:09:25 2019-03-04 10:14:29 1 1.12 6.0 0.00 0.0 6.80 green credit card East New York East Flatbush/Remsen Village Brooklyn Brooklyn
6432 2019-03-13 19:31:22 2019-03-13 19:48:02 1 3.85 15.0 3.36 0.0 20.16 green credit card Boerum Hill Windsor Terrace Brooklyn Brooklyn

6340 rows × 14 columns
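To check the claim made above about the dropped row label (a quick sketch, not run in lecture):

# df.loc[3] would now raise a KeyError, because the label 3 no longer exists.
# df.iloc[3] still works: it returns the row at integer position 3, which has label 4.
df.iloc[3]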

Examples with list comprehension and f-strings#

  • The expression pd.to_datetime("4-24-2023").day_name() produces the string "Monday". Using this idea, make the following length 7 list, and name the result day_list.

["Monday", "Tuesday", ..., "Sunday"]

(On a quiz or exam, explicitly writing out 7 different values will not get credit.)

import pandas as pd

We don’t need to use the dt accessor here, because we are working with a single value, not with a pandas Series.

pd.to_datetime("4-24-2023").day_name()
'Monday'

Let’s slowly build up to making this list. Here we get the day numbers we will use.

# for-loop way

for i in range(24,31):
    print(i)
24
25
26
27
28
29
30

“Of course” the following is not going to work… how would Python know that the i inside the string was a variable?

# for-loop way

for i in range(24,31):
    print("4-i-2023")
4-i-2023
4-i-2023
4-i-2023
4-i-2023
4-i-2023
4-i-2023
4-i-2023

The following is a little clumsy, and it does not even work, because we are trying to concatenate a string like "4-" with an integer like 24.

# for-loop way, old way

for i in range(24,31):
    print("4-" + i + "-2023")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [32], line 4
      1 # for-loop way, old way
      3 for i in range(24,31):
----> 4     print("4-" + i + "-2023")

TypeError: can only concatenate str (not "int") to str

To fix the above error, we need to convert i to a string, using str. This works, but we will see a much more elegant way below.

# for-loop way, old way

for i in range(24,31):
    print("4-" + str(i) + "-2023")
4-24-2023
4-25-2023
4-26-2023
4-27-2023
4-28-2023
4-29-2023
4-30-2023

Here is the exact same thing, but using f-strings. (These were added relatively recently, in Python 3.6.) Notice the two changes: (1) We put the letter f before the opening quotation mark, and (2) we put the variable inside curly brackets.

# for-loop way, f-string way

for i in range(24,31):
    print(f"4-{i}-2023")
4-24-2023
4-25-2023
4-26-2023
4-27-2023
4-28-2023
4-29-2023
4-30-2023

Now we can finally use the to_datetime function and the day_name method to get the strings for the days of the week.

# for-loop way, f-string way

for i in range(24,31):
    print(pd.to_datetime(f"4-{i}-2023").day_name())
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday
Sunday

Here would be the for-loop way to put these names into a list.

day_list = []

for i in range(24,31):
    day_list.append(pd.to_datetime(f"4-{i}-2023").day_name())

day_list
['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

The list comprehension approach is much more elegant. The most basic example of list comprehension is [A for B in C], where A is what should go into the list, B is the variable name, and C is whatever we are iterating over. The best way to gain comfort with list comprehension is to do the same thing using a for loop, and compare. For example, compare the following to what we just did.

# list comprehension way
day_list = [pd.to_datetime(f"4-{i}-2023").day_name() for i in range(24,31)]
day_list
['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
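For practice, here is one more toy comparison (not from lecture): building a list of squares both ways.

# for-loop way
squares = []
for i in range(5):
    squares.append(i**2)

# list comprehension way
squares = [i**2 for i in range(5)]
squares   # [0, 1, 4, 9, 16]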

In class, we didn’t get further than the day_list example above. One possible approach to the remaining bullets is sketched after the list below.

  • Make a new column named “Day” in the DataFrame, containing the day of the pickup (as a string, like "Monday").

  • Why didn’t we have to use pd.to_datetime?

  • For each day, what was the median tip for that day? Print out a string with explanation for each day. Use groupby. Also specify :.2f in the string formatting part so that only two decimal places get printed.

  • Make a list of all the row labels in df for which the “pickup_zone” or “dropoff_zone” contains the substring “Airport”. Name the result ind_list.

  • Write a function has_airport which takes as input a row from df (not a row label but an entire row) and as output returns True if the “pickup_zone” or “dropoff_zone” entry contains the substring “Airport”.

  • Make a new column "Airport" which contains True if the “pickup_zone” or “dropoff_zone” contains the substring “Airport”, and otherwise contains False. Use df.apply(???, axis=???).

  • Make the same column, again using apply, but this time using a lambda function.

  • Does the result match the indices we found in ind_list?
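Since we did not get to these bullet points in lecture, here is one possible approach, written as a sketch rather than an official solution. The main assumption (which also answers the pd.to_datetime question above) is that the "pickup" column is already a datetime column when Seaborn loads the dataset; the column and function names follow the prompts above.

# The "pickup" column is already datetime-valued (an assumption, based on the
# pd.to_datetime question above), so the dt accessor works directly.
df["Day"] = df["pickup"].dt.day_name()

# Median tip for each day, printed with two decimal places via the :.2f format specifier.
for day, med in df.groupby("Day")["tip"].median().items():
    print(f"The median tip on {day} was {med:.2f} dollars.")

# Row labels where "pickup_zone" or "dropoff_zone" contains the substring "Airport",
# using list comprehension. (The rows with missing values were dropped above, so
# every entry in these columns is a string.)
ind_list = [
    i
    for i in df.index
    if ("Airport" in df.loc[i, "pickup_zone"]) or ("Airport" in df.loc[i, "dropoff_zone"])
]

# A function whose input is an entire row of df.
def has_airport(row):
    return ("Airport" in row["pickup_zone"]) or ("Airport" in row["dropoff_zone"])

# apply with axis=1 passes one full row at a time to the function.
df["Airport"] = df.apply(has_airport, axis=1)

# The same column, this time using a lambda function.
df["Airport"] = df.apply(
    lambda row: ("Airport" in row["pickup_zone"]) or ("Airport" in row["dropoff_zone"]),
    axis=1,
)

# Check that the rows marked True match the labels in ind_list.
list(df.index[df["Airport"]]) == ind_list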

Last part of class#

  • Time to work on Worksheets 5-6 (due tonight) and 7 (due next week).

  • Maya is here to help.