Week 2 Wednesday#
Announcements#
Trying something new: I’m in Deepnote, but in a “Private” project. Will upload this afterwards. You can find the blank version in the Week 2 folder.
Worksheet 4 handed out today. Both Worksheets 3 and 4 due Monday.
Hanson (one of our three Learning Assistants) is here to help.
Exploring the taxis dataset#
I refer to the first half of Math 10 as the “Exploratory Data Analysis” half (EDA). In the first part today, I want to go through some basic tools in pandas for getting a quick overview of the contents of a pandas DataFrame.
Comment: this taxis dataset is included with the Python library Seaborn. Usually that’s where I get it from. In this case, I’m getting it from a csv file instead, because Seaborn automatically converts some dates for us, but I want us to see how to make that conversion explicitly.
Load the attached file taxis.csv using the pandas function `read_csv`. Store the resulting DataFrame in the variable `df`.
import pandas as pd
df = pd.read_csv("../data/taxis.csv")
Get a sense for the contents of `df` using the following.

- The `shape` attribute.
- The `head` method, to view the first few rows.
- The `sample` method, to view a random collection of rows.
- The `info` method.
- The `describe` method.
The `shape` attribute reports the number of rows (6433 in this case) and the number of columns (14).
df.shape
(6433, 14)
Here we actually look at the contents of `df`; namely, we look at its first 3 rows.
df.head(3)
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
The `sample` method is similar, but it returns a random selection of the rows. Notice that they could also be in a different order.
df.sample(3)
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
794 | 2019-03-06 10:01:53 | 2019-03-06 10:27:29 | 2 | 2.15 | 16.0 | 3.86 | 0.0 | 23.16 | yellow | credit card | Upper East Side South | Clinton East | Manhattan | Manhattan |
6352 | 2019-03-28 12:46:54 | 2019-03-28 12:51:24 | 1 | 0.55 | 5.0 | 0.58 | 0.0 | 6.38 | green | credit card | Morningside Heights | Morningside Heights | Manhattan | Manhattan |
3412 | 2019-03-07 09:37:32 | 2019-03-07 10:03:00 | 1 | 1.61 | 15.5 | 3.76 | 0.0 | 22.56 | yellow | credit card | Upper East Side South | Midtown South | Manhattan | Manhattan |
The `info` method contains lots of information. At the very bottom, it tells us how much space the DataFrame is taking up (which can be very useful if you've loaded a huge DataFrame from some external source). For each column, it tells us the data type (in the usual rough sense; for example, strings are reported as the "object" data type).

A huge issue in data science is missing data (and how to deal with it). The `info` method also reports missing data. (Or more precisely, it reports non-missing data, so for example we can infer below that 6433-6389 = 44 values are missing from the "payment" column.)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pickup 6433 non-null object
1 dropoff 6433 non-null object
2 passengers 6433 non-null int64
3 distance 6433 non-null float64
4 fare 6433 non-null float64
5 tip 6433 non-null float64
6 tolls 6433 non-null float64
7 total 6433 non-null float64
8 color 6433 non-null object
9 payment 6389 non-null object
10 pickup_zone 6407 non-null object
11 dropoff_zone 6388 non-null object
12 pickup_borough 6407 non-null object
13 dropoff_borough 6388 non-null object
dtypes: float64(5), int64(1), object(8)
memory usage: 703.7+ KB
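If you want the missing-value counts directly, rather than inferring them from the non-null counts, one option (a sketch we didn't cover in lecture, using a made-up miniature DataFrame) is to combine the `isna` method with `sum`:

```python
import pandas as pd

# made-up miniature stand-in for the taxis data, with one missing payment
df = pd.DataFrame({
    "fare": [7.0, 5.0, 7.5],
    "payment": ["credit card", None, "cash"],
})

# isna gives a Boolean DataFrame; summing counts the True values per column
missing = df.isna().sum()
print(missing)  # fare 0, payment 1
```

On the real taxis data, the "payment" entry of `df.isna().sum()` should be the 44 we inferred above.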
The `describe` method returns information about the distribution of values in the numeric columns. For example, we can see that the median fare is 9.5 (the 50th percentile), whereas the mean fare is about 13.09.
df.describe()
passengers | distance | fare | tip | tolls | total | |
---|---|---|---|---|---|---|
count | 6433.000000 | 6433.000000 | 6433.000000 | 6433.00000 | 6433.000000 | 6433.000000 |
mean | 1.539251 | 3.024617 | 13.091073 | 1.97922 | 0.325273 | 18.517794 |
std | 1.203768 | 3.827867 | 11.551804 | 2.44856 | 1.415267 | 13.815570 |
min | 0.000000 | 0.000000 | 1.000000 | 0.00000 | 0.000000 | 1.300000 |
25% | 1.000000 | 0.980000 | 6.500000 | 0.00000 | 0.000000 | 10.800000 |
50% | 1.000000 | 1.640000 | 9.500000 | 1.70000 | 0.000000 | 14.160000 |
75% | 2.000000 | 3.210000 | 15.000000 | 2.80000 | 0.000000 | 20.300000 |
max | 6.000000 | 36.700000 | 150.000000 | 33.20000 | 24.020000 | 174.820000 |
How many different values are there in the "pickup_borough" column? First get the pandas Series containing that column, then

- use the `unique` method;
- use the `value_counts` method.
We first define a pandas Series `ser` that will be equal to the "pickup_borough" column.
ser = df["pickup_borough"]
ser
0 Manhattan
1 Manhattan
2 Manhattan
3 Manhattan
4 Manhattan
...
6428 Manhattan
6429 Queens
6430 Brooklyn
6431 Brooklyn
6432 Brooklyn
Name: pickup_borough, Length: 6433, dtype: object
Notice that this is indeed a pandas Series.
type(ser)
pandas.core.series.Series
We now call the `unique` method. This gets us the distinct values which occur in the Series (including the missing value, represented as `nan`, which stands for "not a number").
ser.unique()
array(['Manhattan', 'Queens', nan, 'Bronx', 'Brooklyn'], dtype=object)
The output is a NumPy array. (NumPy is a dependency of pandas, so anyone who has pandas installed will also have NumPy installed.)
type(ser.unique())
numpy.ndarray
If we want a programmatic way of counting how many unique values there are, we can use the Python function `len`. (If we wanted to not count the missing value, we'd have to work harder.)
len(ser.unique())
5
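One way to not count the missing value, without much extra work, is the `nunique` method, which ignores `nan` by default. Here is a sketch on a miniature Series mimicking our column:

```python
import numpy as np
import pandas as pd

# miniature stand-in for the "pickup_borough" column
ser = pd.Series(["Manhattan", "Queens", np.nan, "Bronx", "Brooklyn"])

print(len(ser.unique()))  # 5, because unique includes nan
print(ser.nunique())      # 4, because nunique drops nan by default
```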
Notice that `unique` is a Series method, not a DataFrame method.
# don't know if this works
df.unique()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[13], line 2
1 # don't know if this works
----> 2 df.unique()
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/generic.py:5902, in NDFrame.__getattr__(self, name)
5895 if (
5896 name not in self._internal_names_set
5897 and name not in self._metadata
5898 and name not in self._accessors
5899 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5900 ):
5901 return self[name]
-> 5902 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'unique'
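We listed `value_counts` above but never ran it. Here is a sketch on a made-up miniature Series: it tallies how often each distinct value occurs (dropping missing values by default), and its length also answers the "how many different values" question.

```python
import numpy as np
import pandas as pd

# miniature stand-in for the "pickup_borough" column
ser = pd.Series(["Manhattan", "Queens", "Manhattan", np.nan, "Queens", "Manhattan"])

# value_counts tallies each distinct value, most frequent first;
# the missing value is dropped by default
counts = ser.value_counts()
print(counts)  # Manhattan 3, Queens 2
```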
Aside: there was a question about how we could get this column using `loc` or `iloc`. The first thing you might try, `df.loc["pickup_borough"]`, does not work, because this is looking for a row named "pickup_borough".
df.loc["pickup_borough"]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[14], line 1
----> 1 df.loc["pickup_borough"]
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexing.py:1073, in _LocationIndexer.__getitem__(self, key)
1070 axis = self.axis or 0
1072 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1073 return self._getitem_axis(maybe_callable, axis=axis)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexing.py:1312, in _LocIndexer._getitem_axis(self, key, axis)
1310 # fall thru to straight lookup
1311 self._validate_key(key, axis)
-> 1312 return self._get_label(key, axis=axis)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexing.py:1260, in _LocIndexer._get_label(self, label, axis)
1258 def _get_label(self, label, axis: int):
1259 # GH#5567 this will fail if the label is not present in the axis.
-> 1260 return self.obj.xs(label, axis=axis)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/generic.py:4056, in NDFrame.xs(self, key, axis, level, drop_level)
4054 new_index = index[loc]
4055 else:
-> 4056 loc = index.get_loc(key)
4058 if isinstance(loc, np.ndarray):
4059 if loc.dtype == np.bool_:
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexes/range.py:395, in RangeIndex.get_loc(self, key, method, tolerance)
393 raise KeyError(key) from err
394 self._check_indexing_error(key)
--> 395 raise KeyError(key)
396 return super().get_loc(key, method=method, tolerance=tolerance)
KeyError: 'pickup_borough'
Here is the right way using `loc`. It is definitely more convenient to use `df["pickup_borough"]` rather than the following.
df.loc[:, "pickup_borough"]
0 Manhattan
1 Manhattan
2 Manhattan
3 Manhattan
4 Manhattan
...
6428 Manhattan
6429 Queens
6430 Brooklyn
6431 Brooklyn
6432 Brooklyn
Name: pickup_borough, Length: 6433, dtype: object
If we want to use `iloc`, we need to find the integer position of the "pickup_borough" column. We can use the `columns` attribute to see all the columns.
df.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
'pickup_borough', 'dropoff_borough'],
dtype='object')
Here I just counted (remember to start counting at 0). There are of course more sophisticated ways also; ask if you're interested! The following is the `iloc` way to get the "pickup_borough" column.
df.iloc[:, 12]
0 Manhattan
1 Manhattan
2 Manhattan
3 Manhattan
4 Manhattan
...
6428 Manhattan
6429 Queens
6430 Brooklyn
6431 Brooklyn
6432 Brooklyn
Name: pickup_borough, Length: 6433, dtype: object
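One of those "more sophisticated ways": the `columns` attribute is an Index, and its `get_loc` method returns the integer position of a label, so we don't have to count by hand. A sketch with a made-up miniature DataFrame:

```python
import pandas as pd

# miniature DataFrame: only the column names matter here
df = pd.DataFrame(columns=["pickup", "dropoff", "pickup_borough"])

# get_loc returns the integer position of the given column label
pos = df.columns.get_loc("pickup_borough")
print(pos)  # 2
```

On the real taxis DataFrame, `df.columns.get_loc("pickup_borough")` should return 12.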
Indexing in pandas is confusing, and you should not expect to be comfortable with it until you have practiced for at least a few weeks. If we use square brackets and what's called a "slice" (using a colon `:`), and we don't use `loc` or `iloc`, then this accesses rows.
# First three rows
df[:3]
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
If we use just an integer (not a slice), then it is looking for columns. I found this pretty confusing at first, but eventually got used to it.
# looking for a column named 3
df[3]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
3801 try:
-> 3802 return self._engine.get_loc(casted_key)
3803 except KeyError as err:
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 3
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[20], line 2
1 # looking for a column named 3
----> 2 df[3]
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/frame.py:3807, in DataFrame.__getitem__(self, key)
3805 if self.columns.nlevels > 1:
3806 return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
3808 if is_integer(indexer):
3809 indexer = [indexer]
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexes/base.py:3804, in Index.get_loc(self, key, method, tolerance)
3802 return self._engine.get_loc(casted_key)
3803 except KeyError as err:
-> 3804 raise KeyError(key) from err
3805 except TypeError:
3806 # If we have a listlike key, _check_indexing_error will raise
3807 # InvalidIndexError. Otherwise we fall through and re-raise
3808 # the TypeError.
3809 self._check_indexing_error(key)
KeyError: 3
Working with dates in pandas#
Here is a reminder of how `df` looks.
df.sample(3)
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4997 | 2019-03-16 11:04:35 | 2019-03-16 11:40:25 | 5 | 6.31 | 26.0 | 3.00 | 0.0 | 32.30 | yellow | credit card | TriBeCa/Civic Center | Upper East Side North | Manhattan | Manhattan |
4905 | 2019-03-12 16:23:50 | 2019-03-12 16:27:07 | 5 | 0.56 | 4.5 | 0.00 | 0.0 | 8.80 | yellow | cash | Clinton East | Clinton East | Manhattan | Manhattan |
536 | 2019-03-22 15:59:29 | 2019-03-22 16:10:08 | 1 | 0.52 | 8.0 | 2.26 | 0.0 | 13.56 | yellow | credit card | Clinton East | Times Sq/Theatre District | Manhattan | Manhattan |
Convert the "pickup" and "dropoff" columns to datetime values, using the pandas function `to_datetime`. Save the results as new columns in `df`, named "picktime" and "droptime".
Here is the “pickup” column. We can recognize these as dates, but for now, they are just strings to Python and to pandas. (Notice that it is reported as an “object” data type.)
df["pickup"]
0 2019-03-23 20:21:09
1 2019-03-04 16:11:55
2 2019-03-27 17:53:01
3 2019-03-10 01:23:59
4 2019-03-30 13:27:42
...
6428 2019-03-31 09:51:53
6429 2019-03-31 17:38:00
6430 2019-03-23 22:55:18
6431 2019-03-04 10:09:25
6432 2019-03-13 19:31:22
Name: pickup, Length: 6433, dtype: object
Notice how the data type changes to "datetime64[ns]" in the following output (this has no effect on the original column).
pd.to_datetime(df["pickup"])
0 2019-03-23 20:21:09
1 2019-03-04 16:11:55
2 2019-03-27 17:53:01
3 2019-03-10 01:23:59
4 2019-03-30 13:27:42
...
6428 2019-03-31 09:51:53
6429 2019-03-31 17:38:00
6430 2019-03-23 22:55:18
6431 2019-03-04 10:09:25
6432 2019-03-13 19:31:22
Name: pickup, Length: 6433, dtype: datetime64[ns]
Let's see in a smaller example how the `pd.to_datetime` function works. We start out by defining a string representing today's date.
today = "April 12, 2023"
This is just a string.
type(today)
str
Here we convert it to a Timestamp using `pd.to_datetime`.
ts = pd.to_datetime(today)
ts
Timestamp('2023-04-12 00:00:00')
We now have access to all sorts of additional information related to this date.
I can never remember which of these are methods and which are attributes. Here I tried calling `day_of_year` as a method, but that was incorrect.
ts.day_of_year()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[28], line 1
----> 1 ts.day_of_year()
TypeError: 'int' object is not callable
The attribute usage is correct. This is telling us that April 12th is the 102nd day of 2023. (The counting starts at 1 here: January 1st is day 1.)
ts.day_of_year
102
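A quick check of how the counting starts: January 1st is reported as day 1, so `day_of_year` is 1-based.

```python
import pandas as pd

# January 1st comes back as day 1, so counting starts at 1
print(pd.to_datetime("January 1, 2023").day_of_year)  # 1
print(pd.to_datetime("April 12, 2023").day_of_year)   # 102
```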
We can also get the name of the day. (Notice that this one is indeed a method. To my eyes, it's a little unpredictable which are methods. Whether `day_name` or `day_of_year` is a method is definitely not something you would be tested on.)
ts.day_name()
'Wednesday'
If you want to see all the various attributes and methods, you can call the built-in Python function `dir`. Probably you should ignore the ones which begin and end with two underscores; these are called "dunder methods" (for "double underscore") and are typically not used by us directly.
dir(ts)
['__add__',
'__array_priority__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__pyx_vtable__',
'__radd__',
'__reduce__',
'__reduce_cython__',
'__reduce_ex__',
'__repr__',
'__rsub__',
'__setattr__',
'__setstate__',
'__setstate_cython__',
'__sizeof__',
'__str__',
'__sub__',
'__subclasshook__',
'__weakref__',
'_as_unit',
'_date_repr',
'_freq',
'_freqstr',
'_from_dt64',
'_from_value_and_reso',
'_repr_base',
'_reso',
'_round',
'_set_freq',
'_short_repr',
'_time_repr',
'asm8',
'astimezone',
'ceil',
'combine',
'ctime',
'date',
'day',
'day_name',
'day_of_week',
'day_of_year',
'dayofweek',
'dayofyear',
'days_in_month',
'daysinmonth',
'dst',
'floor',
'fold',
'freq',
'freqstr',
'fromisocalendar',
'fromisoformat',
'fromordinal',
'fromtimestamp',
'hour',
'is_leap_year',
'is_month_end',
'is_month_start',
'is_quarter_end',
'is_quarter_start',
'is_year_end',
'is_year_start',
'isocalendar',
'isoformat',
'isoweekday',
'max',
'microsecond',
'min',
'minute',
'month',
'month_name',
'nanosecond',
'normalize',
'now',
'quarter',
'replace',
'resolution',
'round',
'second',
'strftime',
'strptime',
'time',
'timestamp',
'timetuple',
'timetz',
'to_datetime64',
'to_julian_date',
'to_numpy',
'to_period',
'to_pydatetime',
'today',
'toordinal',
'tz',
'tz_convert',
'tz_localize',
'tzinfo',
'tzname',
'utcfromtimestamp',
'utcnow',
'utcoffset',
'utctimetuple',
'value',
'week',
'weekday',
'weekofyear',
'year']
Here we are adding two new columns to the DataFrame. We name them “picktime” and “droptime”.
# syntax for a new column in df
df["picktime"] = pd.to_datetime(df["pickup"])
df["droptime"] = pd.to_datetime(df["dropoff"])
Notice how these new columns have shown up all the way on the right side.
df.head()
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | picktime | droptime | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 |
Make a new column, "day", containing the day of the week of the pickup. Use the accessor `dt` and then the method `day_name`.
If we try to call `day_name` on the Series directly, even though it contains datetime values, we get an error. We need to use the `dt` accessor to apply these kinds of methods to the entries in the Series.

It might be hard for you as students to know what you should remember for a quiz and what is less important. I would like you to remember that the `dt` accessor (and the `str` accessor below) are required.
df["picktime"].day_name()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[34], line 1
----> 1 df["picktime"].day_name()
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/generic.py:5902, in NDFrame.__getattr__(self, name)
5895 if (
5896 name not in self._internal_names_set
5897 and name not in self._metadata
5898 and name not in self._accessors
5899 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5900 ):
5901 return self[name]
-> 5902 return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'day_name'
Here we successfully use the `dt` accessor.
df["day"] = df["picktime"].dt.day_name()
df["day"]
0 Saturday
1 Monday
2 Wednesday
3 Sunday
4 Saturday
...
6428 Sunday
6429 Sunday
6430 Saturday
6431 Monday
6432 Wednesday
Name: day, Length: 6433, dtype: object
Make a new column “duration” containing the amount of time the ride took. (Now that pandas recognizes these as dates, you can just subtract.)
Let’s remind ourselves what columns are available. The ones we want to use are near the end.
df.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
'pickup_borough', 'dropoff_borough', 'picktime', 'droptime', 'day'],
dtype='object')
This sort of subtraction of times would never work if we hadn't converted them to datetime types using `pd.to_datetime`.
df["duration"] = df["droptime"] - df["picktime"]
df["duration"]
0 0 days 00:06:15
1 0 days 00:07:05
2 0 days 00:07:24
3 0 days 00:25:52
4 0 days 00:09:32
...
6428 0 days 00:03:34
6429 0 days 00:56:23
6430 0 days 00:19:07
6431 0 days 00:05:04
6432 0 days 00:16:40
Name: duration, Length: 6433, dtype: timedelta64[ns]
Here we check out the first three rows and first two columns. Does the top row look like a 6 minute taxi ride? Yes, because it went from 8:21pm to 8:27pm.
df.iloc[:3, :2]
pickup | dropoff | |
---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 |
What are the data types of these four columns we just made?
Notice how the first two columns we made are "datetime64[ns]", whereas the last one is a "timedelta64[ns]". When you see "delta", you should usually think "difference", and that's exactly what this column represents. The middle column, "day", is reported as object, because its entries are just strings like `"Monday"`.
df.dtypes
pickup object
dropoff object
passengers int64
distance float64
fare float64
tip float64
tolls float64
total float64
color object
payment object
pickup_zone object
dropoff_zone object
pickup_borough object
dropoff_borough object
picktime datetime64[ns]
droptime datetime64[ns]
day object
duration timedelta64[ns]
dtype: object
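As an aside (not something we did in lecture): the `dt` accessor also works on a timedelta Series. For example, `total_seconds` converts each duration to a plain number of seconds, which can be convenient for plotting or for `describe`. A sketch with made-up durations:

```python
import pandas as pd

# two made-up ride durations, parsed into timedelta values
dur = pd.Series(pd.to_timedelta(["00:06:15", "00:07:05"]))

# dt.total_seconds converts each timedelta to a float number of seconds
secs = dur.dt.total_seconds()
print(secs)  # 375.0 and 425.0
```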
Working with strings in pandas#
How many rows in `df` have the word "Airport" in either the "pickup_zone" column or the "dropoff_zone" column? Do any rows have the word "Airport" in both columns?

Use the `str` accessor and the `contains` method.
Here we check out the first three entries in the “pickup_zone” column.
df["pickup_zone"][:3]
0 Lenox Hill West
1 Upper West Side South
2 Alphabet City
Name: pickup_zone, dtype: object
Here is an example of how `str.contains` works. Only the middle of these three entries contains the substring `"South"`, so we get `False`, then `True`, then `False`.
df["pickup_zone"][:3].str.contains("South")
0 False
1 True
2 False
Name: pickup_zone, dtype: bool
We now do the same thing for the word “Airport”. We save these Boolean Series.
ps = df["pickup_zone"].str.contains("Airport")
ds = df["dropoff_zone"].str.contains("Airport")
Here are the first three values. None of these first three contains the substring `"Airport"`.
ps[:3]
0 False
1 False
2 False
Name: pickup_zone, dtype: object
Here we use Boolean indexing to find all the rows where either the "pickup_zone" or the "dropoff_zone" column contains the substring "Airport". Notice how the logical or is made using a vertical bar `|`. (Don't spell it out in this context. The spelled-out `or` is used in some contexts in Python, but not with NumPy Boolean arrays or pandas Boolean Series.)

Scroll to the right and check that each of these rows really does involve an "Airport".
df[ps | ds]
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | picktime | droptime | day | duration | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 2019-03-16 10:02:25 | 2019-03-16 10:22:29 | 1 | 3.90 | 17.00 | 0.00 | 0.00 | 17.80 | yellow | cash | LaGuardia Airport | Astoria | Queens | Queens | 2019-03-16 10:02:25 | 2019-03-16 10:22:29 | Saturday | 0 days 00:20:04 |
22 | 2019-03-28 17:20:54 | 2019-03-28 17:55:57 | 1 | 9.82 | 31.50 | 8.31 | 5.76 | 49.87 | yellow | credit card | LaGuardia Airport | Sutton Place/Turtle Bay North | Queens | Manhattan | 2019-03-28 17:20:54 | 2019-03-28 17:55:57 | Thursday | 0 days 00:35:03 |
43 | 2019-03-27 19:39:16 | 2019-03-27 20:00:26 | 1 | 9.50 | 28.00 | 0.00 | 5.76 | 35.06 | yellow | cash | LaGuardia Airport | Washington Heights South | Queens | Manhattan | 2019-03-27 19:39:16 | 2019-03-27 20:00:26 | Wednesday | 0 days 00:21:10 |
53 | 2019-03-20 18:49:24 | 2019-03-20 19:12:40 | 1 | 13.45 | 36.00 | 0.00 | 0.00 | 37.80 | yellow | cash | JFK Airport | Sunnyside | Queens | Queens | 2019-03-20 18:49:24 | 2019-03-20 19:12:40 | Wednesday | 0 days 00:23:16 |
56 | 2019-03-05 17:57:00 | 2019-03-05 18:45:25 | 2 | 19.59 | 52.00 | 0.00 | 5.76 | 65.56 | yellow | credit card | JFK Airport | Upper East Side North | Queens | Manhattan | 2019-03-05 17:57:00 | 2019-03-05 18:45:25 | Tuesday | 0 days 00:48:25 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6243 | 2019-03-12 11:40:49 | 2019-03-12 11:49:23 | 1 | 3.50 | 12.00 | 0.00 | 0.00 | 12.80 | green | cash | Steinway | LaGuardia Airport | Queens | Queens | 2019-03-12 11:40:49 | 2019-03-12 11:49:23 | Tuesday | 0 days 00:08:34 |
6280 | 2019-03-27 12:25:04 | 2019-03-27 12:31:15 | 1 | 2.74 | 10.00 | 0.00 | 0.00 | 10.80 | green | credit card | Steinway | LaGuardia Airport | Queens | Queens | 2019-03-27 12:25:04 | 2019-03-27 12:31:15 | Wednesday | 0 days 00:06:11 |
6322 | 2019-03-01 17:03:11 | 2019-03-01 17:17:11 | 1 | 5.39 | 17.00 | 3.76 | 0.00 | 22.56 | green | credit card | Jamaica | JFK Airport | Queens | Queens | 2019-03-01 17:03:11 | 2019-03-01 17:17:11 | Friday | 0 days 00:14:00 |
6396 | 2019-03-15 17:44:09 | 2019-03-15 18:20:25 | 1 | 11.44 | 41.63 | 0.00 | 0.00 | 42.13 | green | credit card | Bushwick South | LaGuardia Airport | Brooklyn | Queens | 2019-03-15 17:44:09 | 2019-03-15 18:20:25 | Friday | 0 days 00:36:16 |
6398 | 2019-03-17 13:07:29 | 2019-03-17 13:51:56 | 1 | 18.39 | 50.50 | 7.70 | 0.00 | 59.00 | green | credit card | DUMBO/Vinegar Hill | JFK Airport | Brooklyn | Queens | 2019-03-17 13:07:29 | 2019-03-17 13:51:56 | Sunday | 0 days 00:44:27 |
407 rows Ă— 18 columns
Here is something similar, but we use `&` instead of `|`. This corresponds to the rows where both the pickup zone and the dropoff zone involved an airport. Many of these rides were very short (less than a minute).
df[ps & ds]
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | picktime | droptime | day | duration | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
120 | 2019-03-21 17:21:44 | 2019-03-21 17:21:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens | 2019-03-21 17:21:44 | 2019-03-21 17:21:49 | Thursday | 0 days 00:00:05 |
770 | 2019-03-02 03:16:59 | 2019-03-02 03:17:06 | 0 | 9.40 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens | 2019-03-02 03:16:59 | 2019-03-02 03:17:06 | Saturday | 0 days 00:00:07 |
1080 | 2019-03-04 14:17:05 | 2019-03-04 14:17:13 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens | 2019-03-04 14:17:05 | 2019-03-04 14:17:13 | Monday | 0 days 00:00:08 |
1089 | 2019-03-10 01:43:32 | 2019-03-10 01:45:22 | 1 | 0.37 | 3.50 | 0.00 | 0.00 | 4.80 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens | 2019-03-10 01:43:32 | 2019-03-10 01:45:22 | Sunday | 0 days 00:01:50 |
1416 | 2019-03-09 13:16:32 | 2019-03-09 13:46:11 | 2 | 12.39 | 35.00 | 0.00 | 0.00 | 35.80 | yellow | cash | LaGuardia Airport | JFK Airport | Queens | Queens | 2019-03-09 13:16:32 | 2019-03-09 13:46:11 | Saturday | 0 days 00:29:39 |
1929 | 2019-03-13 22:35:35 | 2019-03-13 22:35:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens | 2019-03-13 22:35:35 | 2019-03-13 22:35:49 | Wednesday | 0 days 00:00:14 |
2387 | 2019-03-28 15:58:52 | 2019-03-28 15:59:25 | 1 | 1.80 | 69.06 | 20.80 | 0.00 | 90.16 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens | 2019-03-28 15:58:52 | 2019-03-28 15:59:25 | Thursday | 0 days 00:00:33 |
3571 | 2019-03-22 16:47:41 | 2019-03-22 16:47:50 | 1 | 0.81 | 66.00 | 0.00 | 0.00 | 66.80 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens | 2019-03-22 16:47:41 | 2019-03-22 16:47:50 | Friday | 0 days 00:00:09 |
4358 | 2019-03-06 18:24:00 | 2019-03-06 18:24:13 | 2 | 0.01 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens | 2019-03-06 18:24:00 | 2019-03-06 18:24:13 | Wednesday | 0 days 00:00:13 |
5095 | 2019-03-30 20:14:44 | 2019-03-30 21:01:28 | 3 | 18.91 | 52.00 | 8.78 | 5.76 | 67.34 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens | 2019-03-30 20:14:44 | 2019-03-30 21:01:28 | Saturday | 0 days 00:46:44 |
5364 | 2019-03-17 16:59:17 | 2019-03-17 18:04:08 | 2 | 36.70 | 150.00 | 0.00 | 24.02 | 174.82 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens | 2019-03-17 16:59:17 | 2019-03-17 18:04:08 | Sunday | 0 days 01:04:51 |
We didn’t get any further.
Alternative, less elegant, approach. Find the rows where "Airport" is in the "pickup_zone" column by writing an appropriate function and using the `map` method of a pandas Series. Use the keyword argument `na_action="ignore"` so that missing values don't cause errors.
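We didn't write this in class; here is one possible sketch of that `map` approach, on a made-up miniature version of the column (the function name `has_airport` is my own choice):

```python
import pandas as pd

# miniature stand-in for the "pickup_zone" column, with one missing value
ser = pd.Series(["LaGuardia Airport", "Astoria", None, "JFK Airport"])

def has_airport(zone):
    # zone is guaranteed to be a string here, because
    # na_action="ignore" skips the missing values entirely
    return "Airport" in zone

mask = ser.map(has_airport, na_action="ignore")
print(mask)  # True, False, NaN, True
```

Notice the missing value stays missing in `mask`; Boolean indexing with `mask == True` would then keep only the rows that are literally `True`.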
Missing values in pandas#
Look at `df.info()` again. Which columns have missing values?

Define `df_sub` to be the sub-DataFrame containing all the rows for which the "pickup_zone" entry is missing. Use the `isna` method.

What if you tried to check for equality with `np.nan`?
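We didn't get to these in class, so here is a sketch with a made-up miniature DataFrame. The key point for the last question: `np.nan` is not equal to anything, not even itself, so an equality comparison finds no missing values at all; that is why we need `isna`.

```python
import numpy as np
import pandas as pd

# miniature stand-in for the taxis data, with one missing pickup_zone
df = pd.DataFrame({"pickup_zone": ["Astoria", None, "Midtown East"]})

# isna gives a Boolean Series; Boolean indexing keeps the missing rows
df_sub = df[df["pickup_zone"].isna()]
print(len(df_sub))  # 1

# equality with np.nan never matches, because even nan == nan is False
print(np.nan == np.nan)                     # False
print((df["pickup_zone"] == np.nan).sum())  # 0
```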
Time to work on Worksheets 3-4#
Hanson and I are here to help.
I have another class at 2pm, so if you have questions, ask me now rather than after class!