Week 2 Friday#
Announcements#
I’m in a different notebook; I will upload this one later.
In case you haven’t noticed, I’ve been posting annotated versions of these notebooks in the course notes. Here is the one from Wednesday.
Yufei (one of our three Learning Assistants) is here to help.
(Worse) alternatives to the material from Wednesday#
On Wednesday, we saw how to find all rows in the taxis dataset for which the pickup zone and the dropoff zone both contained the substring "Airport". Let’s see some alternative ways to do that. This will give practice with pandas indexing and general Python concepts (like for loops). Then we’ll recap the “right” way to do this, using the str accessor and Boolean indexing in pandas.
import pandas as pd
Here’s a reminder of how the taxis dataset looks. I’m first going to shuffle the rows, using the sample method. If I specify frac=0.1, we will get 10% of the rows. Because I am using random_state=10, if you also use random_state=10, you should get the same random rows as I get. (This keyword argument random_state is like the seed argument in NumPy.)
df = pd.read_csv("../data/taxis.csv")
df = df.sample(frac = 0.1, random_state=10)
df.head()
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
890 | 2019-03-09 16:32:40 | 2019-03-09 16:49:32 | 2 | 2.70 | 13.0 | 0.00 | 0.0 | 16.30 | yellow | cash | Midtown North | Kips Bay | Manhattan | Manhattan |
5870 | 2019-03-26 11:04:06 | 2019-03-26 11:09:23 | 1 | 0.64 | 5.5 | 1.26 | 0.0 | 7.56 | green | credit card | Morningside Heights | Manhattanville | Manhattan | Manhattan |
686 | 2019-03-27 15:02:00 | 2019-03-27 15:31:19 | 1 | 3.79 | 20.0 | 0.00 | 0.0 | 23.30 | yellow | cash | East Chelsea | Manhattan Valley | Manhattan | Manhattan |
4937 | 2019-03-02 19:50:23 | 2019-03-02 20:04:19 | 2 | 1.60 | 10.0 | 2.75 | 0.0 | 16.55 | yellow | credit card | TriBeCa/Civic Center | East Village | Manhattan | Manhattan |
2659 | 2019-03-18 21:36:06 | 2019-03-18 21:48:31 | 1 | 1.10 | 8.0 | 0.00 | 0.0 | 11.80 | yellow | cash | Midtown North | Midtown North | Manhattan | Manhattan |
The original DataFrame had 6433 rows, so here we check that we got only 10% of the rows (because of frac=0.1).
df.shape
(643, 14)
But I actually want all of the rows, so I’ve just copy-pasted what I have above, but now using frac=1. The only reason we’re doing this is so the row integer locations are different from the row index names. For example, the top three rows below have integer locations 0, 1, 2, but they have index names 890, 5870, and 686.
df = pd.read_csv("../data/taxis.csv")
df = df.sample(frac = 1, random_state=10)
df.head()
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
890 | 2019-03-09 16:32:40 | 2019-03-09 16:49:32 | 2 | 2.70 | 13.0 | 0.00 | 0.0 | 16.30 | yellow | cash | Midtown North | Kips Bay | Manhattan | Manhattan |
5870 | 2019-03-26 11:04:06 | 2019-03-26 11:09:23 | 1 | 0.64 | 5.5 | 1.26 | 0.0 | 7.56 | green | credit card | Morningside Heights | Manhattanville | Manhattan | Manhattan |
686 | 2019-03-27 15:02:00 | 2019-03-27 15:31:19 | 1 | 3.79 | 20.0 | 0.00 | 0.0 | 23.30 | yellow | cash | East Chelsea | Manhattan Valley | Manhattan | Manhattan |
4937 | 2019-03-02 19:50:23 | 2019-03-02 20:04:19 | 2 | 1.60 | 10.0 | 2.75 | 0.0 | 16.55 | yellow | credit card | TriBeCa/Civic Center | East Village | Manhattan | Manhattan |
2659 | 2019-03-18 21:36:06 | 2019-03-18 21:48:31 | 1 | 1.10 | 8.0 | 0.00 | 0.0 | 11.80 | yellow | cash | Midtown North | Midtown North | Manhattan | Manhattan |
Let’s check that we really are getting all 6433 rows.
df.shape
(6433, 14)
Warm-up: What are the integer locations of the “pickup_zone” and “dropoff_zone” columns?#
On Wednesday, I counted to get the integer locations. Here are some better approaches.
Here are all the columns. If you count, you should get that “pickup_zone” occurs in integer position 10, and that “dropoff_zone” occurs in integer position 11.
df.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
'pickup_borough', 'dropoff_borough'],
dtype='object')
Convert to a list and then use the index method.
This was the first way that came to mind. Lists in Python have an index method for exactly this purpose (getting the integer location where an element occurs). The object df.columns is not a list, though; it is a pandas Index.
type(df.columns)
pandas.core.indexes.base.Index
Here we convert it to a list.
mylist = list(df.columns)
mylist
['pickup',
'dropoff',
'passengers',
'distance',
'fare',
'tip',
'tolls',
'total',
'color',
'payment',
'pickup_zone',
'dropoff_zone',
'pickup_borough',
'dropoff_borough']
And we check that it is really a list.
type(mylist)
list
We can now use the index method of a Python list.
mylist.index("pickup_zone")
10
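One caveat worth knowing (a general Python fact, not specific to this dataset): the index method raises a ValueError if the element is absent, so a membership test first can be safer.
# list.index raises ValueError for missing elements, so check membership first
if "dropoff_zone" in mylist:
    print(mylist.index("dropoff_zone"))  # 11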
Get a Boolean array and then use np.nonzero. (Don’t forget to import NumPy before calling np.nonzero.)
Here is another approach, using the NumPy function nonzero. We first make a Boolean array that is equal to True in precisely one position, where the column name is “pickup_zone”.
df.columns == "pickup_zone"
array([False, False, False, False, False, False, False, False, False,
False, True, False, False, False])
Even though that last output is a NumPy array, we have not imported NumPy yet, so if we want to use NumPy functions ourselves, we need to import it.
import numpy as np
The following reports that 10 is the slot where we have a True. It is a little confusing with the parentheses (this is secretly a length-1 tuple). The reason is that the nonzero function works exactly the same with two-dimensional NumPy arrays, and so a tuple is used in case there are extra dimensions. (If the True were in row 10 and column 4, we would see something like (array([10]), array([4])).)
np.nonzero(df.columns == "pickup_zone")
(array([10]),)
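To see the two-dimensional behavior for yourself, here is a small sketch with a made-up 2-by-5 Boolean array (the values here are just for illustration).
# made-up example: a single True in row 1, column 3
arr = np.zeros((2, 5), dtype=bool)
arr[1, 3] = True
np.nonzero(arr)  # (array([1]), array([3]))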
An alternative that has additional functionality is np.where. For our basic example, the output is exactly the same.
np.where(df.columns == "pickup_zone")
(array([10]),)
Here is an example of the additional functionality. We put the string "Chris" everywhere there was a True (just one slot), and we put the string "Davis" everywhere else.
np.where(df.columns == "pickup_zone", "Chris", "Davis")
array(['Davis', 'Davis', 'Davis', 'Davis', 'Davis', 'Davis', 'Davis',
'Davis', 'Davis', 'Davis', 'Chris', 'Davis', 'Davis', 'Davis'],
dtype='<U5')
Use the get_loc method of a pandas Index. (Credit: Bing chat.)
I didn’t know this option before this morning, but it is exactly what we want. (I don’t quite see why it is called get_loc rather than get_iloc, since it seems like we are getting the integer location.)
df.columns.get_loc("pickup_zone")
10
Use the get_indexer method of a pandas Index. (Credit: Bing chat.)
This approach has some extra functionality, because it can get both locations at the same time. I don’t think any of the above approaches can do that without some additional work.
df.columns.get_indexer(["pickup_zone", "dropoff_zone"])
array([10, 11])
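One more pandas fact about get_indexer: labels that are not found are reported as -1 rather than raising an error. Here "not_a_column" is a made-up name.
# get_indexer reports -1 for labels it cannot find
df.columns.get_indexer(["pickup_zone", "not_a_column"])  # array([10, -1])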
Here’s another reminder of what the columns are.
df.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
'pickup_borough', 'dropoff_borough'],
dtype='object')
And a reminder that df.columns is a pandas Index (this data type is not as essential to Math 10 as pandas DataFrames and pandas Series).
type(df.columns)
pandas.core.indexes.base.Index
Here is the list version.
list(df.columns)
['pickup',
'dropoff',
'passengers',
'distance',
'fare',
'tip',
'tolls',
'total',
'color',
'payment',
'pickup_zone',
'dropoff_zone',
'pickup_borough',
'dropoff_borough']
We saved this list above with the name mylist.
mylist
['pickup',
'dropoff',
'passengers',
'distance',
'fare',
'tip',
'tolls',
'total',
'color',
'payment',
'pickup_zone',
'dropoff_zone',
'pickup_borough',
'dropoff_borough']
type(mylist)
list
Lists in Python can contain anything, including different data types in different positions. In this case, all of the entries in the list are strings. Here we check that the entry at index 8 is a string.
type(mylist[8])
str
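To illustrate the mixed-type point, here is a tiny made-up list.
# a Python list can mix types freely
mixed = [3, "three", 3.0, [3]]
[type(x) for x in mixed]  # [int, str, float, list]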
Worse approach 1: Using a for loop and iloc#
On Wednesday, we used str and contains to find the sub-DataFrame containing the rows where both the “pickup_zone” and the “dropoff_zone” involved an airport. Here we’re going to go through the 6433 rows, one at a time, by going through the integers from 0 (inclusive) to 6433 (exclusive). We can do that using for i in range(6433):. (In MATLAB, we would do something like for i = 1:6433.)
len(df)
6433
The logic of the following is pretty good, but in practice it raises an error because some entries are missing.
# problem from missing values

good_inds = []

for i in range(len(df)):
    if ("Airport" in df.iloc[i, 10]) and ("Airport" in df.iloc[i, 11]):
        good_inds.append(i)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[26], line 6
3 good_inds = []
5 for i in range(len(df)):
----> 6 if ("Airport" in df.iloc[i, 10]) and ("Airport" in df.iloc[i, 11]):
7 good_inds.append(i)
TypeError: argument of type 'float' is not iterable
Let me try to convince you that the error is caused by missing values. (It’s not obvious from the description of the error.) One way to represent missing values in Python is to use np.nan (which stands for “not a number”).
np.nan
nan
If we try to check whether "Airport" in np.nan, we indeed get the exact same error.
"Airport" in np.nan
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[28], line 1
----> 1 "Airport" in np.nan
TypeError: argument of type 'float' is not iterable
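The reason the message complains about a float is that np.nan is itself stored as a float, which we can check directly.
# np.nan is a float, which explains the wording of the error
type(np.nan)  # float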
There are many ways we could get around this (such as by dropping rows with missing values in these columns). Here we will use a simple approach: wrapping the whole problematic portion in a try block. The except TypeError portion says that even if a TypeError is raised, do not stop the execution of the code; instead, just continue to the next step in the for loop.
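For reference, here is a hedged sketch of the dropping-rows alternative mentioned above. The dropna method and its subset keyword are standard pandas, but the name df_nonull is made up here and we won’t use it below.
# sketch: drop rows missing either zone before looping (not used below)
df_nonull = df.dropna(subset=["pickup_zone", "dropoff_zone"])
df_nonull.shape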
One side comment: in Python, it is almost never correct to use the code range(len(...)); there is almost always a more elegant way to iterate through the elements directly (see the sketch after the next code cell).
Another side comment: we are using and instead of & in the following, because it is being used with True and False values directly, rather than with arrays or Series of True and False values.
good_inds = []

for i in range(len(df)):
    try:
        if ("Airport" in df.iloc[i, 10]) and ("Airport" in df.iloc[i, 11]):
            good_inds.append(i)
    except TypeError:
        continue
good_inds
[517, 642, 1117, 1239, 1461, 2032, 3576, 3971, 4970, 5417, 5773]
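As a sketch of the “more elegant iteration” comment above, here is one way to avoid range(len(...)) by zipping the two columns directly. The isinstance checks (introduced here) stand in for the try/except, since missing values show up as floats.
# sketch: iterate the two columns directly instead of using range(len(df))
good_inds2 = [
    i
    for i, (pz, dz) in enumerate(zip(df["pickup_zone"], df["dropoff_zone"]))
    if isinstance(pz, str) and isinstance(dz, str)
    and ("Airport" in pz) and ("Airport" in dz)
]
good_inds2 == good_inds  # True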
This good_inds list contains integer locations, so it should be used together with iloc. We can get all of these rows by passing the list to iloc, as in the following. (For future reference, notice how the numbers appearing on the left side are not the same as 517, 642, etc.)
Be sure to scroll to the right and convince yourself that this sub-DataFrame does indeed correspond to the rows where “Airport” appears in both the “pickup_zone” and the “dropoff_zone” columns.
df.iloc[good_inds]
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2387 | 2019-03-28 15:58:52 | 2019-03-28 15:59:25 | 1 | 1.80 | 69.06 | 20.80 | 0.00 | 90.16 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
1080 | 2019-03-04 14:17:05 | 2019-03-04 14:17:13 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
120 | 2019-03-21 17:21:44 | 2019-03-21 17:21:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
5364 | 2019-03-17 16:59:17 | 2019-03-17 18:04:08 | 2 | 36.70 | 150.00 | 0.00 | 24.02 | 174.82 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
1929 | 2019-03-13 22:35:35 | 2019-03-13 22:35:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens |
3571 | 2019-03-22 16:47:41 | 2019-03-22 16:47:50 | 1 | 0.81 | 66.00 | 0.00 | 0.00 | 66.80 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
1416 | 2019-03-09 13:16:32 | 2019-03-09 13:46:11 | 2 | 12.39 | 35.00 | 0.00 | 0.00 | 35.80 | yellow | cash | LaGuardia Airport | JFK Airport | Queens | Queens |
4358 | 2019-03-06 18:24:00 | 2019-03-06 18:24:13 | 2 | 0.01 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
1089 | 2019-03-10 01:43:32 | 2019-03-10 01:45:22 | 1 | 0.37 | 3.50 | 0.00 | 0.00 | 4.80 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
5095 | 2019-03-30 20:14:44 | 2019-03-30 21:01:28 | 3 | 18.91 | 52.00 | 8.78 | 5.76 | 67.34 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
770 | 2019-03-02 03:16:59 | 2019-03-02 03:17:06 | 0 | 9.40 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens |
If we try to use this list without iloc, pandas will try to find columns with these names (not integer positions), and we get an error.
df[good_inds]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[31], line 1
----> 1 df[good_inds]
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/frame.py:3813, in DataFrame.__getitem__(self, key)
3811 if is_iterator(key):
3812 key = list(key)
-> 3813 indexer = self.columns._get_indexer_strict(key, "columns")[1]
3815 # take() does not accept boolean indexers
3816 if getattr(indexer, "dtype", None) == bool:
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexes/base.py:6070, in Index._get_indexer_strict(self, key, axis_name)
6067 else:
6068 keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6070 self._raise_if_missing(keyarr, indexer, axis_name)
6072 keyarr = self.take(indexer)
6073 if isinstance(key, Index):
6074 # GH 42790 - Preserve name from an Index
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexes/base.py:6130, in Index._raise_if_missing(self, key, indexer, axis_name)
6128 if use_interval_msg:
6129 key = list(key)
-> 6130 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
6132 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
6133 raise KeyError(f"{not_found} not in index")
KeyError: "None of [Int64Index([517, 642, 1117, 1239, 1461, 2032, 3576, 3971, 4970, 5417, 5773], dtype='int64')] are in the [columns]"
Worse approach 2: Using a for loop and loc#
Before using this approach, we should make sure there aren’t any repetitions in the index. In most natural examples, the elements in the index will be unique, but that’s not a requirement.
The attribute df.index holds these numbers listed on the left-hand side.
df.index[:5]
Int64Index([890, 5870, 686, 4937, 2659], dtype='int64')
Notice how those same numbers 890, 5870, etc., appear on the left.
df.head()
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
890 | 2019-03-09 16:32:40 | 2019-03-09 16:49:32 | 2 | 2.70 | 13.0 | 0.00 | 0.0 | 16.30 | yellow | cash | Midtown North | Kips Bay | Manhattan | Manhattan |
5870 | 2019-03-26 11:04:06 | 2019-03-26 11:09:23 | 1 | 0.64 | 5.5 | 1.26 | 0.0 | 7.56 | green | credit card | Morningside Heights | Manhattanville | Manhattan | Manhattan |
686 | 2019-03-27 15:02:00 | 2019-03-27 15:31:19 | 1 | 3.79 | 20.0 | 0.00 | 0.0 | 23.30 | yellow | cash | East Chelsea | Manhattan Valley | Manhattan | Manhattan |
4937 | 2019-03-02 19:50:23 | 2019-03-02 20:04:19 | 2 | 1.60 | 10.0 | 2.75 | 0.0 | 16.55 | yellow | credit card | TriBeCa/Civic Center | East Village | Manhattan | Manhattan |
2659 | 2019-03-18 21:36:06 | 2019-03-18 21:48:31 | 1 | 1.10 | 8.0 | 0.00 | 0.0 | 11.80 | yellow | cash | Midtown North | Midtown North | Manhattan | Manhattan |
Here we make sure no two rows have the same name. (I can never remember that this is an attribute rather than a method, and I definitely don’t expect you to remember it.)
# checks if two rows have the same name
df.index.is_unique
True
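In case is_unique is new to you, here is a tiny made-up example where an index is not unique.
# hypothetical Series with a repeated index label
s = pd.Series([1, 2, 3], index=["a", "a", "b"])
s.index.is_unique  # False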
We can now adapt the iloc code above. There are a few changes to make. We change the range(len(df)) to df.index. (This df.index is definitely more elegant.) We also change from iloc to loc, because we are now iterating through the row names. Lastly, because we are using loc, we need to change the integers 10 and 11 to the column names “pickup_zone” and “dropoff_zone”.
good_inds = []

for i in df.index:
    try:
        if ("Airport" in df.loc[i, "pickup_zone"]) and ("Airport" in df.loc[i, "dropoff_zone"]):
            good_inds.append(i)
    except TypeError:
        continue
good_inds
[2387, 1080, 120, 5364, 1929, 3571, 1416, 4358, 1089, 5095, 770]
We can now get the same sub-DataFrame we found above (and that we found on Wednesday) using df.loc and this new list. Notice how the integers in good_inds now do appear on the left side. (That wasn’t the case above with iloc.)
df.loc[good_inds]
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2387 | 2019-03-28 15:58:52 | 2019-03-28 15:59:25 | 1 | 1.80 | 69.06 | 20.80 | 0.00 | 90.16 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
1080 | 2019-03-04 14:17:05 | 2019-03-04 14:17:13 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
120 | 2019-03-21 17:21:44 | 2019-03-21 17:21:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
5364 | 2019-03-17 16:59:17 | 2019-03-17 18:04:08 | 2 | 36.70 | 150.00 | 0.00 | 24.02 | 174.82 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
1929 | 2019-03-13 22:35:35 | 2019-03-13 22:35:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens |
3571 | 2019-03-22 16:47:41 | 2019-03-22 16:47:50 | 1 | 0.81 | 66.00 | 0.00 | 0.00 | 66.80 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
1416 | 2019-03-09 13:16:32 | 2019-03-09 13:46:11 | 2 | 12.39 | 35.00 | 0.00 | 0.00 | 35.80 | yellow | cash | LaGuardia Airport | JFK Airport | Queens | Queens |
4358 | 2019-03-06 18:24:00 | 2019-03-06 18:24:13 | 2 | 0.01 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
1089 | 2019-03-10 01:43:32 | 2019-03-10 01:45:22 | 1 | 0.37 | 3.50 | 0.00 | 0.00 | 4.80 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
5095 | 2019-03-30 20:14:44 | 2019-03-30 21:01:28 | 3 | 18.91 | 52.00 | 8.78 | 5.76 | 67.34 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
770 | 2019-03-02 03:16:59 | 2019-03-02 03:17:06 | 0 | 9.40 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens |
Worse approach 3: Using the pandas Series map method#
We first define a small function to determine whether the string "Airport" appears as a substring of some input s. We will improve this function below.
def has_airport(s):
    if "Airport" in s:
        return True
    else:
        return False
Here is a quick demonstration of how the function works.
has_airport("Christopher")
False
It is checking to see if "Airport" occurs as a substring.
has_airport("abc Airport3")
True
As you get more experienced as a Python coder, you will be able to identify places to shorten definitions like our has_airport function above. Notice how the function returns True if "Airport" in s is True and returns False otherwise, so we can simply return "Airport" in s directly.
def has_airport(s):
    return "Airport" in s
We’re going to apply that function to every entry in the following Series.
ser1 = df["pickup_zone"]
ser1
890 Midtown North
5870 Morningside Heights
686 East Chelsea
4937 TriBeCa/Civic Center
2659 Midtown North
...
1180 Upper East Side North
3441 West Chelsea/Hudson Yards
1344 Gramercy
4623 Lincoln Square West
1289 East Village
Name: pickup_zone, Length: 6433, dtype: object
But unfortunately we get a similar error to before, due to missing values.
ser1.map(has_airport)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[43], line 1
----> 1 ser1.map(has_airport)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/series.py:4539, in Series.map(self, arg, na_action)
4460 def map(
4461 self,
4462 arg: Callable | Mapping | Series,
4463 na_action: Literal["ignore"] | None = None,
4464 ) -> Series:
4465 """
4466 Map values of Series according to an input mapping or function.
4467
(...)
4537 dtype: object
4538 """
-> 4539 new_values = self._map_values(arg, na_action=na_action)
4540 return self._constructor(new_values, index=self.index).__finalize__(
4541 self, method="map"
4542 )
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/base.py:890, in IndexOpsMixin._map_values(self, mapper, na_action)
887 raise ValueError(msg)
889 # mapper is a function
--> 890 new_values = map_f(values, mapper)
892 return new_values
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2924, in pandas._libs.lib.map_infer()
Cell In[41], line 2, in has_airport(s)
1 def has_airport(s):
----> 2 return "Airport" in s
TypeError: argument of type 'float' is not iterable
Let me again try to convince you that the error is due to missing values.
has_airport(np.nan)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[44], line 1
----> 1 has_airport(np.nan)
Cell In[41], line 2, in has_airport(s)
1 def has_airport(s):
----> 2 return "Airport" in s
TypeError: argument of type 'float' is not iterable
An oversight on my part is that I haven’t shown you the Python function help yet. Here is an example of using it to learn more about the pandas Series method map. Notice the na_action keyword argument, which is exactly what we need.
help(ser1.map)
Help on method map in module pandas.core.series:
map(arg: 'Callable | Mapping | Series', na_action: "Literal['ignore'] | None" = None) -> 'Series' method of pandas.core.series.Series instance
Map values of Series according to an input mapping or function.
Used for substituting each value in a Series with another value,
that may be derived from a function, a ``dict`` or
a :class:`Series`.
Parameters
----------
arg : function, collections.abc.Mapping subclass or Series
Mapping correspondence.
na_action : {None, 'ignore'}, default None
If 'ignore', propagate NaN values, without passing them to the
mapping correspondence.
Returns
-------
Series
Same index as caller.
See Also
--------
Series.apply : For applying more complex functions on a Series.
DataFrame.apply : Apply a function row-/column-wise.
DataFrame.applymap : Apply a function elementwise on a whole DataFrame.
Notes
-----
When ``arg`` is a dictionary, values in Series that are not in the
dictionary (as keys) are converted to ``NaN``. However, if the
dictionary is a ``dict`` subclass that defines ``__missing__`` (i.e.
provides a method for default values), then this default is used
rather than ``NaN``.
Examples
--------
>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
>>> s
0 cat
1 dog
2 NaN
3 rabbit
dtype: object
``map`` accepts a ``dict`` or a ``Series``. Values that are not found
in the ``dict`` are converted to ``NaN``, unless the dict has a default
value (e.g. ``defaultdict``):
>>> s.map({'cat': 'kitten', 'dog': 'puppy'})
0 kitten
1 puppy
2 NaN
3 NaN
dtype: object
It also accepts a function:
>>> s.map('I am a {}'.format)
0 I am a cat
1 I am a dog
2 I am a nan
3 I am a rabbit
dtype: object
To avoid applying the function to missing values (and keep them as
``NaN``) ``na_action='ignore'`` can be used:
>>> s.map('I am a {}'.format, na_action='ignore')
0 I am a cat
1 I am a dog
2 NaN
3 I am a rabbit
dtype: object
For now we only see False values, because none of the visible entries contain "Airport" as a substring. (Notice the dtype is object rather than bool: with na_action="ignore", the missing values are propagated as NaN, so the Series mixes Booleans and floats.)
ser1.map(has_airport, na_action="ignore")
890 False
5870 False
686 False
4937 False
2659 False
...
1180 False
3441 False
1344 False
4623 False
1289 False
Name: pickup_zone, Length: 6433, dtype: object
ser2 = df["dropoff_zone"]
The following should be the same sub-DataFrame we obtained two different ways above.
df[ser1.map(has_airport, na_action="ignore") & ser2.map(has_airport, na_action="ignore")]
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2387 | 2019-03-28 15:58:52 | 2019-03-28 15:59:25 | 1 | 1.80 | 69.06 | 20.80 | 0.00 | 90.16 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
1080 | 2019-03-04 14:17:05 | 2019-03-04 14:17:13 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
120 | 2019-03-21 17:21:44 | 2019-03-21 17:21:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
5364 | 2019-03-17 16:59:17 | 2019-03-17 18:04:08 | 2 | 36.70 | 150.00 | 0.00 | 24.02 | 174.82 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
1929 | 2019-03-13 22:35:35 | 2019-03-13 22:35:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens |
3571 | 2019-03-22 16:47:41 | 2019-03-22 16:47:50 | 1 | 0.81 | 66.00 | 0.00 | 0.00 | 66.80 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
1416 | 2019-03-09 13:16:32 | 2019-03-09 13:46:11 | 2 | 12.39 | 35.00 | 0.00 | 0.00 | 35.80 | yellow | cash | LaGuardia Airport | JFK Airport | Queens | Queens |
4358 | 2019-03-06 18:24:00 | 2019-03-06 18:24:13 | 2 | 0.01 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
1089 | 2019-03-10 01:43:32 | 2019-03-10 01:45:22 | 1 | 0.37 | 3.50 | 0.00 | 0.00 | 4.80 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
5095 | 2019-03-30 20:14:44 | 2019-03-30 21:01:28 | 3 | 18.91 | 52.00 | 8.78 | 5.76 | 67.34 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
770 | 2019-03-02 03:16:59 | 2019-03-02 03:17:06 | 0 | 9.40 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens |
Best approach I know: Using the str accessor and Boolean indexing#
This is review of what we did on Wednesday. Here’s a reminder that ser1 is the variable containing the “pickup_zone” column.
ser1
890 Midtown North
5870 Morningside Heights
686 East Chelsea
4937 TriBeCa/Civic Center
2659 Midtown North
...
1180 Upper East Side North
3441 West Chelsea/Hudson Yards
1344 Gramercy
4623 Lincoln Square West
1289 East Village
Name: pickup_zone, Length: 6433, dtype: object
Instead of using map and our custom has_airport function, we use the str accessor, which gives us access to the contains method. The big advantage here is that we don’t have to define any special function ourselves.
ser1.str.contains("Airport")
890 False
5870 False
686 False
4937 False
2659 False
...
1180 False
3441 False
1344 False
4623 False
1289 False
Name: pickup_zone, Length: 6433, dtype: object
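One detail worth knowing: str.contains accepts an na keyword (a real pandas parameter), so missing values can be replaced with False up front instead of being propagated as NaN.
# fill missing entries with False rather than NaN
ser1.str.contains("Airport", na=False)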
Here is the final way (and the best way I know) to get the sub-DataFrame containing the rows for which both the pickup and dropoff zones involved an airport. We are using Boolean indexing to keep precisely those rows in df. (The previous approach, with map, also used Boolean indexing. The earlier approaches, with loc and iloc, used indexing but not Boolean indexing.)
df[ser1.str.contains("Airport") & ser2.str.contains("Airport")]
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2387 | 2019-03-28 15:58:52 | 2019-03-28 15:59:25 | 1 | 1.80 | 69.06 | 20.80 | 0.00 | 90.16 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
1080 | 2019-03-04 14:17:05 | 2019-03-04 14:17:13 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
120 | 2019-03-21 17:21:44 | 2019-03-21 17:21:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
5364 | 2019-03-17 16:59:17 | 2019-03-17 18:04:08 | 2 | 36.70 | 150.00 | 0.00 | 24.02 | 174.82 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
1929 | 2019-03-13 22:35:35 | 2019-03-13 22:35:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens |
3571 | 2019-03-22 16:47:41 | 2019-03-22 16:47:50 | 1 | 0.81 | 66.00 | 0.00 | 0.00 | 66.80 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
1416 | 2019-03-09 13:16:32 | 2019-03-09 13:46:11 | 2 | 12.39 | 35.00 | 0.00 | 0.00 | 35.80 | yellow | cash | LaGuardia Airport | JFK Airport | Queens | Queens |
4358 | 2019-03-06 18:24:00 | 2019-03-06 18:24:13 | 2 | 0.01 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
1089 | 2019-03-10 01:43:32 | 2019-03-10 01:45:22 | 1 | 0.37 | 3.50 | 0.00 | 0.00 | 4.80 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
5095 | 2019-03-30 20:14:44 | 2019-03-30 21:01:28 | 3 | 18.91 | 52.00 | 8.78 | 5.76 | 67.34 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
770 | 2019-03-02 03:16:59 | 2019-03-02 03:17:06 | 0 | 9.40 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens |