Week 4 Monday#

Announcements#

  • The midterm is a week from today. A sample midterm is posted on the Week 5 page on Canvas.

  • The midterm is closed book, closed computer, but you are allowed to use one notecard with handwritten notes (both sides). Jinghao will have the notecards for you before the quiz on Tuesday.

  • The best way to study is to go over the worksheets, quizzes, and the sample midterm. Next priority would be the lecture notes.

  • I briefly mention lots of topics. If you’re not sure whether something could appear on the midterm, ask on Ed Discussion and I’ll answer there.

Plan for today#

Today I would like to briefly cover three useful Python topics (that are not specific to data science):

  • list comprehension

  • f-strings

  • lambda functions

I would also like to discuss the pandas DataFrame methods apply and applymap. Both of these methods take a function as their input. The apply method applies that function to an entire row or an entire column at a time. The applymap method applies that function to every entry on its own. (We have also seen map, which is very similar, but which is a Series method rather than a DataFrame method.)
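Here is a quick illustration of the difference (a small sketch, not from lecture, using a made-up two-column DataFrame named mini).

import pandas as pd

# A tiny made-up DataFrame, just for illustration (not the taxis data).
mini = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# applymap: the function receives one entry at a time.
mini.applymap(lambda x: x * 10)            # every entry gets multiplied by 10

# apply with axis=0: the function receives one entire column (a Series) at a time.
mini.apply(lambda col: col.sum(), axis=0)  # column sums: a -> 3, b -> 7

# apply with axis=1: the function receives one entire row (a Series) at a time.
mini.apply(lambda row: row.sum(), axis=1)  # row sums: 4 and 6

# map: the Series analogue of applymap, used on a single column.
mini["a"].map(lambda x: x * 10)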

Using applymap and a lambda function#

  • Load the “taxis” dataset from Seaborn.

import seaborn as sns
df = sns.load_dataset("taxis")

Here is a reminder of what df looks like.

df.head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 6.15 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan
  • Set the value at row 3 and column "tip" to NumPy’s Not-a-Number value.

import numpy as np

Here is an example of how to set a particular element to a particular value. Because we are referring to the column “tip” by name, we need to use loc rather than iloc.

df.loc[3, "tip"] = np.nan

Notice how the corresponding element has now been changed.

df.head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 NaN 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan
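As an aside (not from lecture), if we did want to use iloc here, we would need the integer position of the "tip" column instead of its name. One way to get that position (a sketch, not the only option) is Index.get_loc:

df.iloc[3, df.columns.get_loc("tip")] = np.nan  # here "tip" happens to be at integer position 5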
  • Look at the first few rows of df.isna(). Can you recognize the effect of the missing value we inserted?

Notice how, in row 3 and the column "tip", we now have a True, corresponding to the missing value we just inserted.

df.isna()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 False False False False False False False False False False False False False False
1 False False False False False False False False False False False False False False
2 False False False False False False False False False False False False False False
3 False False False False False True False False False False False False False False
4 False False False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6428 False False False False False False False False False False False False False False
6429 False False False False False False False False False False False False False False
6430 False False False False False False False False False False False False False False
6431 False False False False False False False False False False False False False False
6432 False False False False False False False False False False False False False False

6433 rows × 14 columns

  • Write a function is_missing which takes as input x and as output returns "Missing" if x is np.nan and returns x otherwise.

(I’m not sure of the best way to make this work with string inputs, so you can assume the inputs are numeric.)

Surprisingly, the following does not work, as we will see below.

def is_missing(x):
    if x == np.nan:
        return "Missing"
    else:
        return x

It seems to work for the input 100.

is_missing(100)
100

And it works for a string input.

is_missing("chris")
'chris'

But it doesn’t return the correct thing, "Missing", for np.nan as the input.

# Surprise
is_missing(np.nan)
nan

The problem is that np.nan == np.nan evaluates to False. There is no way to know this in advance; it is just a convention. The idea behind it: suppose we wanted to know whether two values were equal (say, two people’s salaries), and both values happened to be missing. Would we want to consider those people as having equal salaries? This isn’t a proof by any means, just an explanation of why we might want this convention.

np.nan == np.nan
False

Instead we will use the NumPy function np.isnan. (This is the part that does not seem to work for string inputs.) This is the NumPy analogue of the df.isna() method we used above.

def is_missing(x):
    if np.isnan(x):
        return "Missing"
    else:
        return x

This still works for 10 as the input.

is_missing(10)
10

And now it is working for np.nan as input.

is_missing(np.nan)
'Missing'
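As an aside (not covered in lecture), here is why we restricted to numeric inputs: calling np.isnan on a string raises a TypeError. The pandas function pd.isna is one alternative that, as far as I know, accepts strings as well as numbers.

import pandas as pd

# np.isnan("chris") would raise a TypeError, so this version of is_missing
# does not accept strings.  pd.isna handles both cases:
pd.isna(np.nan)    # True
pd.isna("chris")   # False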
  • Apply is_missing to every entry in the columns from “passengers” to “total” using applymap.

Here is how we get those columns. Surprisingly, slicing with loc includes the right endpoint. (That almost never happens in Python.)

df.loc[:, "passengers":"total"]
passengers distance fare tip tolls total
0 1 1.60 7.0 2.15 0.0 12.95
1 1 0.79 5.0 0.00 0.0 9.30
2 1 1.37 7.5 2.36 0.0 14.16
3 1 7.70 27.0 NaN 0.0 36.95
4 3 2.16 9.0 1.10 0.0 13.40
... ... ... ... ... ... ...
6428 1 0.75 4.5 1.06 0.0 6.36
6429 1 18.74 58.0 0.00 0.0 58.80
6430 1 4.14 16.0 0.00 0.0 17.30
6431 1 1.12 6.0 0.00 0.0 6.80
6432 1 3.85 15.0 3.36 0.0 20.16

6433 rows × 6 columns

The applymap method takes a function as input and applies that function to every entry of the DataFrame. Notice how most values are unchanged, but the string "Missing" now shows up in row 3.

df.loc[:, "passengers":"total"].applymap(is_missing)
passengers distance fare tip tolls total
0 1 1.60 7.0 2.15 0.0 12.95
1 1 0.79 5.0 0.0 0.0 9.30
2 1 1.37 7.5 2.36 0.0 14.16
3 1 7.70 27.0 Missing 0.0 36.95
4 3 2.16 9.0 1.1 0.0 13.40
... ... ... ... ... ... ...
6428 1 0.75 4.5 1.06 0.0 6.36
6429 1 18.74 58.0 0.0 0.0 58.80
6430 1 4.14 16.0 0.0 0.0 17.30
6431 1 1.12 6.0 0.0 0.0 6.80
6432 1 3.85 15.0 3.36 0.0 20.16

6433 rows × 6 columns

  • Do the same thing using applymap and a lambda function.

Two new things are being introduced here: lambda functions and the conditional expression syntax A if B else C. The term lambda tells Python that we are about to define a function. The part before the colon (x in this case) designates the inputs, and the part after the colon designates the output.

The syntax A if B else C will evaluate to A if B is True and will evaluate to C if B is False. This is a shorter way of writing the function that we defined above using def.
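For example, here is the same is_missing function written as a lambda function and tried on single values (a quick check, not from lecture).

# The same logic as is_missing, written as a lambda function.
is_missing_lambda = lambda x: "Missing" if np.isnan(x) else x
is_missing_lambda(10)       # 10
is_missing_lambda(np.nan)   # 'Missing'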

df.loc[:, "passengers":"total"].applymap(lambda x: "Missing" if np.isnan(x) else x)
passengers distance fare tip tolls total
0 1 1.60 7.0 2.15 0.0 12.95
1 1 0.79 5.0 0.0 0.0 9.30
2 1 1.37 7.5 2.36 0.0 14.16
3 1 7.70 27.0 Missing 0.0 36.95
4 3 2.16 9.0 1.1 0.0 13.40
... ... ... ... ... ... ...
6428 1 0.75 4.5 1.06 0.0 6.36
6429 1 18.74 58.0 0.0 0.0 58.80
6430 1 4.14 16.0 0.0 0.0 17.30
6431 1 1.12 6.0 0.0 0.0 6.80
6432 1 3.85 15.0 3.36 0.0 20.16

6433 rows × 6 columns

Notice that we haven’t changed the original DataFrame df (other than inserting the missing value in row 3 earlier); applymap returns a new DataFrame rather than modifying df in place.

  • Drop the rows which contain missing values, using df = df.dropna(axis=???).copy(). We include the copy() at the end to prevent warnings later when we change the DataFrame.

For reference, here is df before we drop anything; row 3 still contains the NaN tip.

df.head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 NaN 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan

In the output below, notice how the row with the missing value, row label 3, has disappeared. If you tried to use df.loc[3] now, it would raise an error. If you tried to use df.iloc[3], it would work (and would return the row with label 4).

df = df.dropna(axis=0).copy()
df
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan
5 2019-03-11 10:37:23 2019-03-11 10:47:31 1 0.49 7.5 2.16 0.0 12.96 yellow credit card Times Sq/Theatre District Midtown East Manhattan Manhattan
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6428 2019-03-31 09:51:53 2019-03-31 09:55:27 1 0.75 4.5 1.06 0.0 6.36 green credit card East Harlem North Central Harlem North Manhattan Manhattan
6429 2019-03-31 17:38:00 2019-03-31 18:34:23 1 18.74 58.0 0.00 0.0 58.80 green credit card Jamaica East Concourse/Concourse Village Queens Bronx
6430 2019-03-23 22:55:18 2019-03-23 23:14:25 1 4.14 16.0 0.00 0.0 17.30 green cash Crown Heights North Bushwick North Brooklyn Brooklyn
6431 2019-03-04 10:09:25 2019-03-04 10:14:29 1 1.12 6.0 0.00 0.0 6.80 green credit card East New York East Flatbush/Remsen Village Brooklyn Brooklyn
6432 2019-03-13 19:31:22 2019-03-13 19:48:02 1 3.85 15.0 3.36 0.0 20.16 green credit card Boerum Hill Windsor Terrace Brooklyn Brooklyn

6340 rows × 14 columns
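To check the claim made above about the dropped row label (a quick sketch, not run in lecture):

# df.loc[3] would now raise a KeyError, because the label 3 no longer exists.
# df.iloc[3] still works: it returns the row at integer position 3, which has label 4.
df.iloc[3]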

Examples with list comprehension and f-strings#

  • The expression pd.to_datetime("4-24-2023").day_name() produces the string "Monday". Using this idea, make the following length 7 list, and name the result day_list.

["Monday", "Tuesday", ..., "Sunday"]

(On a quiz or exam, explicitly writing out 7 different values will not get credit.)

import pandas as pd

We don’t need to use the dt accessor here, because we are working with a single value, not with a pandas Series.

pd.to_datetime("4-24-2023").day_name()
'Monday'

Let’s slowly build up to making this list. Here we get the day numbers we will use.

# for-loop way

for i in range(24,31):
    print(i)
24
25
26
27
28
29
30

“Of course” the following is not going to work… how would Python know that the i inside the string was a variable?

# for-loop way

for i in range(24,31):
    print("4-i-2023")
4-i-2023
4-i-2023
4-i-2023
4-i-2023
4-i-2023
4-i-2023
4-i-2023

The following is a little clumsy, and it does not even work, because we are trying to concatenate a string like "4-" with an integer like 24.

# for-loop way, old way

for i in range(24,31):
    print("4-" + i + "-2023")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [32], line 4
      1 # for-loop way, old way
      3 for i in range(24,31):
----> 4     print("4-" + i + "-2023")

TypeError: can only concatenate str (not "int") to str

To fix the above error, we need to convert i to a string, using str. This works, but we will see a much more elegant way below.

# for-loop way, old way

for i in range(24,31):
    print("4-" + str(i) + "-2023")
4-24-2023
4-25-2023
4-26-2023
4-27-2023
4-28-2023
4-29-2023
4-30-2023

Here is the exact same thing, but using f-strings. (These were added relatively recently, in Python 3.6.) Notice the two changes: (1) We put the letter f before the opening quotation mark, and (2) we put the variable inside curly brackets.

# for-loop way, f-string way

for i in range(24,31):
    print(f"4-{i}-2023")
4-24-2023
4-25-2023
4-26-2023
4-27-2023
4-28-2023
4-29-2023
4-30-2023

Now we can finally use the to_datetime function and the day_name method to get the strings for the days of the week.

# for-loop way, f-string way

for i in range(24,31):
    print(pd.to_datetime(f"4-{i}-2023").day_name())
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday
Sunday

Here would be the for-loop way to put these names into a list.

day_list = []

for i in range(24,31):
    day_list.append(pd.to_datetime(f"4-{i}-2023").day_name())

day_list
['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

The list comprehension approach is much more elegant. The most basic example of list comprehension is [A for B in C], where A is what should go into the list, B is the variable name, and C is whatever we are iterating over. The best way to gain comfort with list comprehension is to do the same thing using a for loop, and compare. For example, compare the following to what we just did.

# list comprehension way
day_list = [pd.to_datetime(f"4-{i}-2023").day_name() for i in range(24,31)]
day_list
['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
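For practice, here is one more toy comparison (not from lecture): building a list of squares both ways.

# for-loop way
squares = []
for i in range(5):
    squares.append(i**2)

# list comprehension way
squares = [i**2 for i in range(5)]
squares   # [0, 1, 4, 9, 16]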

In class, we didn’t get further than the day_list example above. One possible approach to the remaining bullets is sketched after the list below.

  • Make a new column named “Day” in the DataFrame, containing the day of the pickup (as a string, like "Monday").

  • Why didn’t we have to use pd.to_datetime?

  • For each day, what was the median tip for that day? Print out a string with explanation for each day. Use groupby. Also specify :.2f in the string formatting part so that only two decimal places get printed.

  • Make a list of all the row labels in df for which the “pickup_zone” or “dropoff_zone” contains the substring “Airport”. Name the result ind_list.

  • Write a function has_airport which takes as input a row from df (not a row label but an entire row) and as output returns True if the “pickup_zone” or “dropoff_zone” entry contains the substring “Airport”.

  • Make a new column "Airport" which contains True if the “pickup_zone” or “dropoff_zone” contains the substring “Airport”, and otherwise contains False. Use df.apply(???, axis=???).

  • Make the same column, again using apply, but this time using a lambda function.

  • Does the result match the indices we found in ind_list?
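Since we did not get to these bullet points in lecture, here is one possible approach, written as a sketch rather than an official solution. The main assumption (which also answers the pd.to_datetime question above) is that the "pickup" column is already a datetime column when Seaborn loads the dataset; the column and function names follow the prompts above.

# The "pickup" column is already datetime-valued (an assumption, based on the
# pd.to_datetime question above), so the dt accessor works directly.
df["Day"] = df["pickup"].dt.day_name()

# Median tip for each day, printed with two decimal places via the :.2f format specifier.
for day, med in df.groupby("Day")["tip"].median().items():
    print(f"The median tip on {day} was {med:.2f} dollars.")

# Row labels where "pickup_zone" or "dropoff_zone" contains the substring "Airport",
# using list comprehension. (The rows with missing values were dropped above, so
# every entry in these columns is a string.)
ind_list = [
    i
    for i in df.index
    if ("Airport" in df.loc[i, "pickup_zone"]) or ("Airport" in df.loc[i, "dropoff_zone"])
]

# A function whose input is an entire row of df.
def has_airport(row):
    return ("Airport" in row["pickup_zone"]) or ("Airport" in row["dropoff_zone"])

# apply with axis=1 passes one full row at a time to the function.
df["Airport"] = df.apply(has_airport, axis=1)

# The same column, this time using a lambda function.
df["Airport"] = df.apply(
    lambda row: ("Airport" in row["pickup_zone"]) or ("Airport" in row["dropoff_zone"]),
    axis=1,
)

# Check that the rows marked True match the labels in ind_list.
list(df.index[df["Airport"]]) == ind_list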

Last part of class#

  • Time to work on Worksheets 5-6 (due tonight) and 7 (due next week).

  • Maya is here to help.