Week 2 Friday#
Announcements#
I’m in a different notebook; I will upload this one later.
In case you haven’t noticed, I’ve been posting annotated versions of these notebooks in the course notes. Here is the one from Wednesday.
Yufei (one of our three Learning Assistants) is here to help.
(Worse) alternatives to the material from Wednesday#
On Wednesday, we saw how to find all rows in the taxis dataset for which the pickup zone and the dropoff zone both contained the substring "Airport". Let’s see some alternative ways to do that. This will give practice with pandas indexing and general Python concepts (like for loops). Then we’ll recap the “right” way to do this, using the str accessor and Boolean indexing in pandas.
import pandas as pd
Here’s a reminder of how the taxis dataset looks. I’m first going to shuffle the rows, using the sample method. If I specify frac=0.1, we will get 10% of the rows. Because I am using random_state=10, if you also use random_state=10, you should get the same random rows as I get. (This keyword argument random_state is like the seed argument in NumPy.)
df = pd.read_csv("../data/taxis.csv")
df = df.sample(frac = 0.1, random_state=10)
df.head()
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
890 | 2019-03-09 16:32:40 | 2019-03-09 16:49:32 | 2 | 2.70 | 13.0 | 0.00 | 0.0 | 16.30 | yellow | cash | Midtown North | Kips Bay | Manhattan | Manhattan |
5870 | 2019-03-26 11:04:06 | 2019-03-26 11:09:23 | 1 | 0.64 | 5.5 | 1.26 | 0.0 | 7.56 | green | credit card | Morningside Heights | Manhattanville | Manhattan | Manhattan |
686 | 2019-03-27 15:02:00 | 2019-03-27 15:31:19 | 1 | 3.79 | 20.0 | 0.00 | 0.0 | 23.30 | yellow | cash | East Chelsea | Manhattan Valley | Manhattan | Manhattan |
4937 | 2019-03-02 19:50:23 | 2019-03-02 20:04:19 | 2 | 1.60 | 10.0 | 2.75 | 0.0 | 16.55 | yellow | credit card | TriBeCa/Civic Center | East Village | Manhattan | Manhattan |
2659 | 2019-03-18 21:36:06 | 2019-03-18 21:48:31 | 1 | 1.10 | 8.0 | 0.00 | 0.0 | 11.80 | yellow | cash | Midtown North | Midtown North | Manhattan | Manhattan |
The original DataFrame had 6433 rows, so here we check that we got only 10% of the rows (because of frac=0.1).
df.shape
(643, 14)
But I actually want all of the rows, so I’ve just copy-pasted what I have above, but now using frac=1. The only reason we’re doing this is so the row integer locations are different from the row index names. For example, the top three rows below have integer locations 0, 1, 2, but they have index names 890, 5870, and 686.
df = pd.read_csv("../data/taxis.csv")
df = df.sample(frac = 1, random_state=10)
df.head()
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
890 | 2019-03-09 16:32:40 | 2019-03-09 16:49:32 | 2 | 2.70 | 13.0 | 0.00 | 0.0 | 16.30 | yellow | cash | Midtown North | Kips Bay | Manhattan | Manhattan |
5870 | 2019-03-26 11:04:06 | 2019-03-26 11:09:23 | 1 | 0.64 | 5.5 | 1.26 | 0.0 | 7.56 | green | credit card | Morningside Heights | Manhattanville | Manhattan | Manhattan |
686 | 2019-03-27 15:02:00 | 2019-03-27 15:31:19 | 1 | 3.79 | 20.0 | 0.00 | 0.0 | 23.30 | yellow | cash | East Chelsea | Manhattan Valley | Manhattan | Manhattan |
4937 | 2019-03-02 19:50:23 | 2019-03-02 20:04:19 | 2 | 1.60 | 10.0 | 2.75 | 0.0 | 16.55 | yellow | credit card | TriBeCa/Civic Center | East Village | Manhattan | Manhattan |
2659 | 2019-03-18 21:36:06 | 2019-03-18 21:48:31 | 1 | 1.10 | 8.0 | 0.00 | 0.0 | 11.80 | yellow | cash | Midtown North | Midtown North | Manhattan | Manhattan |
Let’s check that we really are getting all 6433 rows.
df.shape
(6433, 14)
Warm-up: What are the integer locations of the “pickup_zone” and “dropoff_zone” columns?#
On Wednesday, I counted to get the integer locations. Here are some better approaches.
Here are all the columns. If you count, you should get that “pickup_zone” occurs in integer position 10, and that “dropoff_zone” occurs in integer position 11.
df.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
'pickup_borough', 'dropoff_borough'],
dtype='object')
Convert to a list and then use the index method.
This was the first way that came to mind. Lists in Python have an index method for exactly this purpose (getting the integer location where an element occurs). The object df.columns is not a list, though; it is a pandas Index.
type(df.columns)
pandas.core.indexes.base.Index
Here we convert it to a list.
mylist = list(df.columns)
mylist
['pickup',
'dropoff',
'passengers',
'distance',
'fare',
'tip',
'tolls',
'total',
'color',
'payment',
'pickup_zone',
'dropoff_zone',
'pickup_borough',
'dropoff_borough']
And we check that it is really a list.
type(mylist)
list
We can now use the index method of a Python list.
mylist.index("pickup_zone")
10
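One caveat worth knowing (a general Python fact, not specific to this dataset): the index method raises a ValueError if the element is absent, so a membership test first can be safer.
# list.index raises ValueError for missing elements, so check membership first
if "dropoff_zone" in mylist:
    print(mylist.index("dropoff_zone"))  # 11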
Get a Boolean array and then use np.nonzero. (Don’t forget to import NumPy before calling np.nonzero.)
Here is another approach, using the NumPy function nonzero. We first make a Boolean array that is equal to True in precisely one position, where the column name is “pickup_zone”.
df.columns == "pickup_zone"
array([False, False, False, False, False, False, False, False, False,
False, True, False, False, False])
Even though that last output is a NumPy array, we have not imported NumPy yet, so if we want to use NumPy functions ourselves, we need to import it.
import numpy as np
The following reports that 10 is the slot where we have a True. It is a little confusing with the parentheses (this is secretly a length-1 tuple). The reason is that the nonzero function works exactly the same with two-dimensional NumPy arrays, and so a tuple is used in case there are extra dimensions. (If the True were in row 10 and column 4, we would see something like (array([10]), array([4])).)
np.nonzero(df.columns == "pickup_zone")
(array([10]),)
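To see the two-dimensional behavior for yourself, here is a small sketch with a made-up 2-by-5 Boolean array (the values here are just for illustration).
# made-up example: a single True in row 1, column 3
arr = np.zeros((2, 5), dtype=bool)
arr[1, 3] = True
np.nonzero(arr)  # (array([1]), array([3]))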
An alternative that has additional functionality is np.where. For our basic example, the output is exactly the same.
np.where(df.columns == "pickup_zone")
(array([10]),)
Here is an example of the additional functionality. We put the string "Chris" everywhere there was a True (just one slot), and we put the string "Davis" everywhere else.
np.where(df.columns == "pickup_zone", "Chris", "Davis")
array(['Davis', 'Davis', 'Davis', 'Davis', 'Davis', 'Davis', 'Davis',
'Davis', 'Davis', 'Davis', 'Chris', 'Davis', 'Davis', 'Davis'],
dtype='<U5')
Use the get_loc method of a pandas Index. (Credit: Bing chat.)
I didn’t know this option before this morning, but it is exactly what we want. (I don’t quite see why it is called get_loc rather than get_iloc, since it seems like we are getting the integer location.)
df.columns.get_loc("pickup_zone")
10
Use the get_indexer method of a pandas Index. (Credit: Bing chat.)
This approach has some extra functionality, because it can get both locations at the same time. I don’t think any of the above approaches can do that without some additional work.
df.columns.get_indexer(["pickup_zone", "dropoff_zone"])
array([10, 11])
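One more pandas fact about get_indexer: labels that are not found are reported as -1 rather than raising an error. Here "not_a_column" is a made-up name.
# get_indexer reports -1 for labels it cannot find
df.columns.get_indexer(["pickup_zone", "not_a_column"])  # array([10, -1])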
Here’s another reminder of what the columns are.
df.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
'pickup_borough', 'dropoff_borough'],
dtype='object')
And a reminder that df.columns is a pandas Index (this data type is not as essential to Math 10 as pandas DataFrames and pandas Series).
type(df.columns)
pandas.core.indexes.base.Index
Here is the list version.
list(df.columns)
['pickup',
'dropoff',
'passengers',
'distance',
'fare',
'tip',
'tolls',
'total',
'color',
'payment',
'pickup_zone',
'dropoff_zone',
'pickup_borough',
'dropoff_borough']
We saved this list above with the name mylist.
mylist
['pickup',
'dropoff',
'passengers',
'distance',
'fare',
'tip',
'tolls',
'total',
'color',
'payment',
'pickup_zone',
'dropoff_zone',
'pickup_borough',
'dropoff_borough']
type(mylist)
list
Lists in Python can contain anything, including different data types in different positions. In this case, all of the entries in the list are strings. Here we check that the entry at index 8 is a string.
type(mylist[8])
str
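To illustrate the mixed-type point, here is a tiny made-up list.
# a Python list can mix types freely
mixed = [3, "three", 3.0, [3]]
[type(x) for x in mixed]  # [int, str, float, list]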
Worse approach 1: Using a for loop and iloc#
On Wednesday, we used str and contains to find the sub-DataFrame containing the rows where both the “pickup_zone” and the “dropoff_zone” involved an airport. Here we’re going to go through the 6433 rows, one at a time, by going through the integers from 0 (inclusive) to 6433 (exclusive). We can do that using for i in range(6433):. (In MATLAB, we would do something like for i = 1:6433.)
len(df)
6433
The logic of the following is pretty good, but in practice it raises an error because some entries are missing.
# problem from missing values

good_inds = []

for i in range(len(df)):
    if ("Airport" in df.iloc[i, 10]) and ("Airport" in df.iloc[i, 11]):
        good_inds.append(i)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[26], line 6
3 good_inds = []
5 for i in range(len(df)):
----> 6 if ("Airport" in df.iloc[i, 10]) and ("Airport" in df.iloc[i, 11]):
7 good_inds.append(i)
TypeError: argument of type 'float' is not iterable
Let me try to convince you that the error is caused by missing values. (It’s not obvious from the description of the error.) One way to represent missing values in Python is to use np.nan (which stands for “not a number”).
np.nan
nan
If we try to check whether "Airport" in np.nan, we indeed get the exact same error.
"Airport" in np.nan
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[28], line 1
----> 1 "Airport" in np.nan
TypeError: argument of type 'float' is not iterable
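The reason the message complains about a float is that np.nan is itself stored as a float, which we can check directly.
# np.nan is a float, which explains the wording of the error
type(np.nan)  # float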
There are many ways we could get around this (such as by dropping rows with missing values in these columns). Here we will use a simple approach: wrapping the whole problematic portion in a try block. The except TypeError portion says that even if a TypeError is raised, do not stop the execution of the code; instead, just continue to the next step in the for loop.
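For reference, here is a hedged sketch of the dropping-rows alternative mentioned above. The dropna method and its subset keyword are standard pandas, but the name df_nonull is made up here and we won’t use it below.
# sketch: drop rows missing either zone before looping (not used below)
df_nonull = df.dropna(subset=["pickup_zone", "dropoff_zone"])
df_nonull.shape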
One side comment: in Python, it is almost never correct to use the code range(len(...)); there is almost always a more elegant way to iterate through the elements directly (see the sketch after the next code cell).
Another side comment: we are using and instead of & in the following, because it is being used with True and False values directly, rather than with arrays or Series of True and False values.
good_inds = []

for i in range(len(df)):
    try:
        if ("Airport" in df.iloc[i, 10]) and ("Airport" in df.iloc[i, 11]):
            good_inds.append(i)
    except TypeError:
        continue
good_inds
[517, 642, 1117, 1239, 1461, 2032, 3576, 3971, 4970, 5417, 5773]
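As a sketch of the “more elegant iteration” comment above, here is one way to avoid range(len(...)) by zipping the two columns directly. The isinstance checks (introduced here) stand in for the try/except, since missing values show up as floats.
# sketch: iterate the two columns directly instead of using range(len(df))
good_inds2 = [
    i
    for i, (pz, dz) in enumerate(zip(df["pickup_zone"], df["dropoff_zone"]))
    if isinstance(pz, str) and isinstance(dz, str)
    and ("Airport" in pz) and ("Airport" in dz)
]
good_inds2 == good_inds  # True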
This good_inds list contains integer locations, so it should be used together with iloc. We can get all of these rows by passing the list to iloc, as in the following. (For future reference, notice how the numbers appearing on the left side are not the same as 517, 642, etc.)
Be sure to scroll to the right and convince yourself that this sub-DataFrame does indeed correspond to the rows where “Airport” appears in both the “pickup_zone” and the “dropoff_zone” columns.
df.iloc[good_inds]
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2387 | 2019-03-28 15:58:52 | 2019-03-28 15:59:25 | 1 | 1.80 | 69.06 | 20.80 | 0.00 | 90.16 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
1080 | 2019-03-04 14:17:05 | 2019-03-04 14:17:13 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
120 | 2019-03-21 17:21:44 | 2019-03-21 17:21:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
5364 | 2019-03-17 16:59:17 | 2019-03-17 18:04:08 | 2 | 36.70 | 150.00 | 0.00 | 24.02 | 174.82 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
1929 | 2019-03-13 22:35:35 | 2019-03-13 22:35:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens |
3571 | 2019-03-22 16:47:41 | 2019-03-22 16:47:50 | 1 | 0.81 | 66.00 | 0.00 | 0.00 | 66.80 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
1416 | 2019-03-09 13:16:32 | 2019-03-09 13:46:11 | 2 | 12.39 | 35.00 | 0.00 | 0.00 | 35.80 | yellow | cash | LaGuardia Airport | JFK Airport | Queens | Queens |
4358 | 2019-03-06 18:24:00 | 2019-03-06 18:24:13 | 2 | 0.01 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
1089 | 2019-03-10 01:43:32 | 2019-03-10 01:45:22 | 1 | 0.37 | 3.50 | 0.00 | 0.00 | 4.80 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
5095 | 2019-03-30 20:14:44 | 2019-03-30 21:01:28 | 3 | 18.91 | 52.00 | 8.78 | 5.76 | 67.34 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
770 | 2019-03-02 03:16:59 | 2019-03-02 03:17:06 | 0 | 9.40 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens |
If we try to use this list without iloc, pandas will try to find columns with these names (not integer positions), and we get an error.
df[good_inds]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[31], line 1
----> 1 df[good_inds]
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/frame.py:3813, in DataFrame.__getitem__(self, key)
3811 if is_iterator(key):
3812 key = list(key)
-> 3813 indexer = self.columns._get_indexer_strict(key, "columns")[1]
3815 # take() does not accept boolean indexers
3816 if getattr(indexer, "dtype", None) == bool:
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexes/base.py:6070, in Index._get_indexer_strict(self, key, axis_name)
6067 else:
6068 keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6070 self._raise_if_missing(keyarr, indexer, axis_name)
6072 keyarr = self.take(indexer)
6073 if isinstance(key, Index):
6074 # GH 42790 - Preserve name from an Index
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexes/base.py:6130, in Index._raise_if_missing(self, key, indexer, axis_name)
6128 if use_interval_msg:
6129 key = list(key)
-> 6130 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
6132 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
6133 raise KeyError(f"{not_found} not in index")
KeyError: "None of [Int64Index([517, 642, 1117, 1239, 1461, 2032, 3576, 3971, 4970, 5417, 5773], dtype='int64')] are in the [columns]"
Worse approach 2: Using a for loop and loc#
Before using this approach, we should make sure there aren’t any repetitions in the index. In most natural examples, the elements in the index will be unique, but that’s not a requirement.
The attribute df.index holds these numbers listed on the left-hand side.
df.index[:5]
Int64Index([890, 5870, 686, 4937, 2659], dtype='int64')
Notice how those same numbers 890, 5870, etc., appear on the left.
df.head()
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
890 | 2019-03-09 16:32:40 | 2019-03-09 16:49:32 | 2 | 2.70 | 13.0 | 0.00 | 0.0 | 16.30 | yellow | cash | Midtown North | Kips Bay | Manhattan | Manhattan |
5870 | 2019-03-26 11:04:06 | 2019-03-26 11:09:23 | 1 | 0.64 | 5.5 | 1.26 | 0.0 | 7.56 | green | credit card | Morningside Heights | Manhattanville | Manhattan | Manhattan |
686 | 2019-03-27 15:02:00 | 2019-03-27 15:31:19 | 1 | 3.79 | 20.0 | 0.00 | 0.0 | 23.30 | yellow | cash | East Chelsea | Manhattan Valley | Manhattan | Manhattan |
4937 | 2019-03-02 19:50:23 | 2019-03-02 20:04:19 | 2 | 1.60 | 10.0 | 2.75 | 0.0 | 16.55 | yellow | credit card | TriBeCa/Civic Center | East Village | Manhattan | Manhattan |
2659 | 2019-03-18 21:36:06 | 2019-03-18 21:48:31 | 1 | 1.10 | 8.0 | 0.00 | 0.0 | 11.80 | yellow | cash | Midtown North | Midtown North | Manhattan | Manhattan |
Here we make sure no two rows have the same name. (I can never remember that this is an attribute rather than a method, and I definitely don’t expect you to remember it.)
# checks if two rows have the same name
df.index.is_unique
True
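In case is_unique is new to you, here is a tiny made-up example where an index is not unique.
# hypothetical Series with a repeated index label
s = pd.Series([1, 2, 3], index=["a", "a", "b"])
s.index.is_unique  # False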
We can now adapt the iloc code above. There are a few changes to make. We change the range(len(df)) to df.index. (This df.index is definitely more elegant.) We also change from iloc to loc, because we are now iterating through the row names. Lastly, because we are using loc, we need to change the integers 10 and 11 to the column names “pickup_zone” and “dropoff_zone”.
good_inds = []

for i in df.index:
    try:
        if ("Airport" in df.loc[i, "pickup_zone"]) and ("Airport" in df.loc[i, "dropoff_zone"]):
            good_inds.append(i)
    except TypeError:
        continue
good_inds
[2387, 1080, 120, 5364, 1929, 3571, 1416, 4358, 1089, 5095, 770]
We can now get the same sub-DataFrame we found above (and that we found on Wednesday) using df.loc and this new list. Notice how the integers in good_inds now do appear on the left side. (That wasn’t the case above with iloc.)
df.loc[good_inds]
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2387 | 2019-03-28 15:58:52 | 2019-03-28 15:59:25 | 1 | 1.80 | 69.06 | 20.80 | 0.00 | 90.16 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
1080 | 2019-03-04 14:17:05 | 2019-03-04 14:17:13 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
120 | 2019-03-21 17:21:44 | 2019-03-21 17:21:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
5364 | 2019-03-17 16:59:17 | 2019-03-17 18:04:08 | 2 | 36.70 | 150.00 | 0.00 | 24.02 | 174.82 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
1929 | 2019-03-13 22:35:35 | 2019-03-13 22:35:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens |
3571 | 2019-03-22 16:47:41 | 2019-03-22 16:47:50 | 1 | 0.81 | 66.00 | 0.00 | 0.00 | 66.80 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
1416 | 2019-03-09 13:16:32 | 2019-03-09 13:46:11 | 2 | 12.39 | 35.00 | 0.00 | 0.00 | 35.80 | yellow | cash | LaGuardia Airport | JFK Airport | Queens | Queens |
4358 | 2019-03-06 18:24:00 | 2019-03-06 18:24:13 | 2 | 0.01 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
1089 | 2019-03-10 01:43:32 | 2019-03-10 01:45:22 | 1 | 0.37 | 3.50 | 0.00 | 0.00 | 4.80 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
5095 | 2019-03-30 20:14:44 | 2019-03-30 21:01:28 | 3 | 18.91 | 52.00 | 8.78 | 5.76 | 67.34 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
770 | 2019-03-02 03:16:59 | 2019-03-02 03:17:06 | 0 | 9.40 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens |
Worse approach 3: Using the pandas Series map method#
We first define a small function to determine whether the string "Airport" appears as a substring of some input s. We will improve this function below.
def has_airport(s):
    if "Airport" in s:
        return True
    else:
        return False
Here is a quick demonstration of how the function works.
has_airport("Christopher")
False
It is checking to see if "Airport" occurs as a substring.
has_airport("abc Airport3")
True
As you get more experienced as a Python coder, you will be able to identify places to shorten definitions like our has_airport function above. Notice how the function returns True if "Airport" in s is True and returns False otherwise, so we can simply return "Airport" in s directly.
def has_airport(s):
    return "Airport" in s
We’re going to apply that function to every entry in the following Series.
ser1 = df["pickup_zone"]
ser1
890 Midtown North
5870 Morningside Heights
686 East Chelsea
4937 TriBeCa/Civic Center
2659 Midtown North
...
1180 Upper East Side North
3441 West Chelsea/Hudson Yards
1344 Gramercy
4623 Lincoln Square West
1289 East Village
Name: pickup_zone, Length: 6433, dtype: object
But unfortunately we get a similar error to before, due to missing values.
ser1.map(has_airport)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[43], line 1
----> 1 ser1.map(has_airport)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/series.py:4539, in Series.map(self, arg, na_action)
4460 def map(
4461 self,
4462 arg: Callable | Mapping | Series,
4463 na_action: Literal["ignore"] | None = None,
4464 ) -> Series:
4465 """
4466 Map values of Series according to an input mapping or function.
4467
(...)
4537 dtype: object
4538 """
-> 4539 new_values = self._map_values(arg, na_action=na_action)
4540 return self._constructor(new_values, index=self.index).__finalize__(
4541 self, method="map"
4542 )
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/base.py:890, in IndexOpsMixin._map_values(self, mapper, na_action)
887 raise ValueError(msg)
889 # mapper is a function
--> 890 new_values = map_f(values, mapper)
892 return new_values
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2924, in pandas._libs.lib.map_infer()
Cell In[41], line 2, in has_airport(s)
1 def has_airport(s):
----> 2 return "Airport" in s
TypeError: argument of type 'float' is not iterable
Let me again try to convince you that the error is due to missing values.
has_airport(np.nan)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[44], line 1
----> 1 has_airport(np.nan)
Cell In[41], line 2, in has_airport(s)
1 def has_airport(s):
----> 2 return "Airport" in s
TypeError: argument of type 'float' is not iterable
An oversight on my part is that I haven’t shown you the Python function help yet. Here is an example of using it to learn more about the pandas Series method map. Notice the na_action keyword argument, which is exactly what we need.
help(ser1.map)
Help on method map in module pandas.core.series:
map(arg: 'Callable | Mapping | Series', na_action: "Literal['ignore'] | None" = None) -> 'Series' method of pandas.core.series.Series instance
Map values of Series according to an input mapping or function.
Used for substituting each value in a Series with another value,
that may be derived from a function, a ``dict`` or
a :class:`Series`.
Parameters
----------
arg : function, collections.abc.Mapping subclass or Series
Mapping correspondence.
na_action : {None, 'ignore'}, default None
If 'ignore', propagate NaN values, without passing them to the
mapping correspondence.
Returns
-------
Series
Same index as caller.
See Also
--------
Series.apply : For applying more complex functions on a Series.
DataFrame.apply : Apply a function row-/column-wise.
DataFrame.applymap : Apply a function elementwise on a whole DataFrame.
Notes
-----
When ``arg`` is a dictionary, values in Series that are not in the
dictionary (as keys) are converted to ``NaN``. However, if the
dictionary is a ``dict`` subclass that defines ``__missing__`` (i.e.
provides a method for default values), then this default is used
rather than ``NaN``.
Examples
--------
>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
>>> s
0 cat
1 dog
2 NaN
3 rabbit
dtype: object
``map`` accepts a ``dict`` or a ``Series``. Values that are not found
in the ``dict`` are converted to ``NaN``, unless the dict has a default
value (e.g. ``defaultdict``):
>>> s.map({'cat': 'kitten', 'dog': 'puppy'})
0 kitten
1 puppy
2 NaN
3 NaN
dtype: object
It also accepts a function:
>>> s.map('I am a {}'.format)
0 I am a cat
1 I am a dog
2 I am a nan
3 I am a rabbit
dtype: object
To avoid applying the function to missing values (and keep them as
``NaN``) ``na_action='ignore'`` can be used:
>>> s.map('I am a {}'.format, na_action='ignore')
0 I am a cat
1 I am a dog
2 NaN
3 I am a rabbit
dtype: object
For now we only see False values, because none of the visible entries contain "Airport" as a substring. (Notice the dtype is object rather than bool: with na_action="ignore", the missing values are propagated as NaN, so the Series mixes Booleans and floats.)
ser1.map(has_airport, na_action="ignore")
890 False
5870 False
686 False
4937 False
2659 False
...
1180 False
3441 False
1344 False
4623 False
1289 False
Name: pickup_zone, Length: 6433, dtype: object
ser2 = df["dropoff_zone"]
The following should be the same sub-DataFrame we obtained two different ways above.
df[ser1.map(has_airport, na_action="ignore") & ser2.map(has_airport, na_action="ignore")]
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2387 | 2019-03-28 15:58:52 | 2019-03-28 15:59:25 | 1 | 1.80 | 69.06 | 20.80 | 0.00 | 90.16 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
1080 | 2019-03-04 14:17:05 | 2019-03-04 14:17:13 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
120 | 2019-03-21 17:21:44 | 2019-03-21 17:21:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
5364 | 2019-03-17 16:59:17 | 2019-03-17 18:04:08 | 2 | 36.70 | 150.00 | 0.00 | 24.02 | 174.82 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
1929 | 2019-03-13 22:35:35 | 2019-03-13 22:35:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens |
3571 | 2019-03-22 16:47:41 | 2019-03-22 16:47:50 | 1 | 0.81 | 66.00 | 0.00 | 0.00 | 66.80 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
1416 | 2019-03-09 13:16:32 | 2019-03-09 13:46:11 | 2 | 12.39 | 35.00 | 0.00 | 0.00 | 35.80 | yellow | cash | LaGuardia Airport | JFK Airport | Queens | Queens |
4358 | 2019-03-06 18:24:00 | 2019-03-06 18:24:13 | 2 | 0.01 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
1089 | 2019-03-10 01:43:32 | 2019-03-10 01:45:22 | 1 | 0.37 | 3.50 | 0.00 | 0.00 | 4.80 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
5095 | 2019-03-30 20:14:44 | 2019-03-30 21:01:28 | 3 | 18.91 | 52.00 | 8.78 | 5.76 | 67.34 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
770 | 2019-03-02 03:16:59 | 2019-03-02 03:17:06 | 0 | 9.40 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens |
Best approach I know: Using the str accessor and Boolean indexing#
This is review of what we did on Wednesday. Here’s a reminder that ser1 is the variable containing the “pickup_zone” column.
ser1
890 Midtown North
5870 Morningside Heights
686 East Chelsea
4937 TriBeCa/Civic Center
2659 Midtown North
...
1180 Upper East Side North
3441 West Chelsea/Hudson Yards
1344 Gramercy
4623 Lincoln Square West
1289 East Village
Name: pickup_zone, Length: 6433, dtype: object
Instead of using map and our custom has_airport function, we use the str accessor, which gives us access to the contains method. The big advantage here is that we don’t have to define any special function ourselves.
ser1.str.contains("Airport")
890 False
5870 False
686 False
4937 False
2659 False
...
1180 False
3441 False
1344 False
4623 False
1289 False
Name: pickup_zone, Length: 6433, dtype: object
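One detail worth knowing: str.contains accepts an na keyword (a real pandas parameter), so missing values can be replaced with False up front instead of being propagated as NaN.
# fill missing entries with False rather than NaN
ser1.str.contains("Airport", na=False)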
Here is the final way (and the best way I know) to get the sub-DataFrame containing the rows for which both the pickup and dropoff zones involved an airport. We are using Boolean indexing to keep precisely those rows in df. (The previous approach, with map, also used Boolean indexing. The earlier approaches, with loc and iloc, used indexing but not Boolean indexing.)
df[ser1.str.contains("Airport") & ser2.str.contains("Airport")]
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2387 | 2019-03-28 15:58:52 | 2019-03-28 15:59:25 | 1 | 1.80 | 69.06 | 20.80 | 0.00 | 90.16 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
1080 | 2019-03-04 14:17:05 | 2019-03-04 14:17:13 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
120 | 2019-03-21 17:21:44 | 2019-03-21 17:21:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
5364 | 2019-03-17 16:59:17 | 2019-03-17 18:04:08 | 2 | 36.70 | 150.00 | 0.00 | 24.02 | 174.82 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
1929 | 2019-03-13 22:35:35 | 2019-03-13 22:35:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens |
3571 | 2019-03-22 16:47:41 | 2019-03-22 16:47:50 | 1 | 0.81 | 66.00 | 0.00 | 0.00 | 66.80 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
1416 | 2019-03-09 13:16:32 | 2019-03-09 13:46:11 | 2 | 12.39 | 35.00 | 0.00 | 0.00 | 35.80 | yellow | cash | LaGuardia Airport | JFK Airport | Queens | Queens |
4358 | 2019-03-06 18:24:00 | 2019-03-06 18:24:13 | 2 | 0.01 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
1089 | 2019-03-10 01:43:32 | 2019-03-10 01:45:22 | 1 | 0.37 | 3.50 | 0.00 | 0.00 | 4.80 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens |
5095 | 2019-03-30 20:14:44 | 2019-03-30 21:01:28 | 3 | 18.91 | 52.00 | 8.78 | 5.76 | 67.34 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens |
770 | 2019-03-02 03:16:59 | 2019-03-02 03:17:06 | 0 | 9.40 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens |