Week 2 Wednesday#

Announcements#

  • Trying something new: I’m in Deepnote, but in a “Private” project. Will upload this afterwards. You can find the blank version in the Week 2 folder.

  • Worksheet 4 handed out today. Both Worksheets 3 and 4 due Monday.

  • Hanson (one of our three Learning Assistants) is here to help.

Exploring the taxis dataset#

I refer to the first half of Math 10 as the “Exploratory Data Analysis” (EDA) half. In the first part today, I want to go through some basic tools in pandas for getting a quick overview of the contents of a pandas DataFrame.

Comment: this taxis dataset is included with the Python library Seaborn. Usually that’s where I get it from. In this case, I’m getting it from a csv file instead, because Seaborn automatically converts some dates for us, but I want us to see how to make that conversion explicitly.

  • Load the attached file taxis.csv using the pandas function read_csv. Store the resulting DataFrame in the variable df.

import pandas as pd
df = pd.read_csv("../data/taxis.csv")

Get a sense for the contents of df using the following.

  • The shape attribute.

  • The head method, to view the first few rows.

  • The sample method, to view a random collection of rows.

  • The info method.

  • The describe method.

The shape attribute reports the number of rows (6433 in this case) and the number of columns (14).

df.shape
(6433, 14)

Here we actually look at the contents of df, namely, we look at its first 3 rows.

df.head(3)
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan

The sample method is similar, but it returns a random selection of the rows. Notice that they could also be in a different order.

df.sample(3)
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
794 2019-03-06 10:01:53 2019-03-06 10:27:29 2 2.15 16.0 3.86 0.0 23.16 yellow credit card Upper East Side South Clinton East Manhattan Manhattan
6352 2019-03-28 12:46:54 2019-03-28 12:51:24 1 0.55 5.0 0.58 0.0 6.38 green credit card Morningside Heights Morningside Heights Manhattan Manhattan
3412 2019-03-07 09:37:32 2019-03-07 10:03:00 1 1.61 15.5 3.76 0.0 22.56 yellow credit card Upper East Side South Midtown South Manhattan Manhattan

The info method reports lots of information. At the very bottom, it tells us how much space the DataFrame is taking up (which can be very useful if you’ve loaded a huge DataFrame from some external source). For each column, it tells us the data type (in the usual rough sense; for example, strings are reported as the “object” data type).

A huge issue in data science is missing data (and how to deal with it). This info method also reports missing data. (Or more precisely, it reports non-missing data, so for example we can infer below that 6433-6389 = 44 values are missing from the “payment” column.)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   pickup           6433 non-null   object 
 1   dropoff          6433 non-null   object 
 2   passengers       6433 non-null   int64  
 3   distance         6433 non-null   float64
 4   fare             6433 non-null   float64
 5   tip              6433 non-null   float64
 6   tolls            6433 non-null   float64
 7   total            6433 non-null   float64
 8   color            6433 non-null   object 
 9   payment          6389 non-null   object 
 10  pickup_zone      6407 non-null   object 
 11  dropoff_zone     6388 non-null   object 
 12  pickup_borough   6407 non-null   object 
 13  dropoff_borough  6388 non-null   object 
dtypes: float64(5), int64(1), object(8)
memory usage: 703.7+ KB

The describe method returns information about the distribution of values in the numeric columns. For example, we can see that the median fare is 9.5 (the 50th percentile), whereas the mean fare is about 13.09.

df.describe()
passengers distance fare tip tolls total
count 6433.000000 6433.000000 6433.000000 6433.00000 6433.000000 6433.000000
mean 1.539251 3.024617 13.091073 1.97922 0.325273 18.517794
std 1.203768 3.827867 11.551804 2.44856 1.415267 13.815570
min 0.000000 0.000000 1.000000 0.00000 0.000000 1.300000
25% 1.000000 0.980000 6.500000 0.00000 0.000000 10.800000
50% 1.000000 1.640000 9.500000 1.70000 0.000000 14.160000
75% 2.000000 3.210000 15.000000 2.80000 0.000000 20.300000
max 6.000000 36.700000 150.000000 33.20000 24.020000 174.820000

How many different values are there in the “pickup_borough” column? First get the pandas Series containing that column, then

  • use the unique method;

  • use the value_counts method.

We first define a pandas Series ser that will be equal to the “pickup_borough” column.

ser = df["pickup_borough"]
ser
0       Manhattan
1       Manhattan
2       Manhattan
3       Manhattan
4       Manhattan
          ...    
6428    Manhattan
6429       Queens
6430     Brooklyn
6431     Brooklyn
6432     Brooklyn
Name: pickup_borough, Length: 6433, dtype: object

Notice that this is indeed a pandas Series.

type(ser)
pandas.core.series.Series

We now call the unique method. This gets us the distinct values which occur in the Series (including the missing value, represented as nan, which stands for “not a number”).

ser.unique()
array(['Manhattan', 'Queens', nan, 'Bronx', 'Brooklyn'], dtype=object)

The output is a NumPy array. (NumPy is a dependency of pandas, so anyone who has pandas installed will also have NumPy installed.)

type(ser.unique())
numpy.ndarray

If we want a programmatic way of counting how many unique values there are, we can use the Python function len. (If we wanted to not count the missing value, we’d have to work harder.)

len(ser.unique())
5
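As an aside, if we don’t want to count the missing value, pandas has a nunique method that excludes NaN by default, and the value_counts method (mentioned in the bullet above) tallies how many times each value occurs. Here is a minimal sketch on a small made-up Series with the same flavor as the “pickup_borough” column:

```python
import pandas as pd
import numpy as np

# Toy stand-in for df["pickup_borough"]; the values are made up.
ser = pd.Series(["Manhattan", "Queens", np.nan, "Manhattan", "Brooklyn"])

# nunique excludes the missing value by default.
print(ser.nunique())      # 3

# unique includes nan, so len counts it.
print(len(ser.unique()))  # 4

# value_counts tallies occurrences of each value (NaN dropped by default).
print(ser.value_counts())
```

So on the real taxis data, `df["pickup_borough"].nunique()` would be the “work harder” answer in one step.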

Notice that unique is a Series method, not a DataFrame method.

# don't know if this works
df.unique()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[13], line 2
      1 # don't know if this works
----> 2 df.unique()

File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/generic.py:5902, in NDFrame.__getattr__(self, name)
   5895 if (
   5896     name not in self._internal_names_set
   5897     and name not in self._metadata
   5898     and name not in self._accessors
   5899     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5900 ):
   5901     return self[name]
-> 5902 return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'unique'

Aside: there was a question about how we could get this column using loc or iloc. The first thing you might try, df.loc["pickup_borough"] does not work, because this is looking for a row named "pickup_borough".

df.loc["pickup_borough"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[14], line 1
----> 1 df.loc["pickup_borough"]

File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexing.py:1073, in _LocationIndexer.__getitem__(self, key)
   1070 axis = self.axis or 0
   1072 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1073 return self._getitem_axis(maybe_callable, axis=axis)

File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexing.py:1312, in _LocIndexer._getitem_axis(self, key, axis)
   1310 # fall thru to straight lookup
   1311 self._validate_key(key, axis)
-> 1312 return self._get_label(key, axis=axis)

File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexing.py:1260, in _LocIndexer._get_label(self, label, axis)
   1258 def _get_label(self, label, axis: int):
   1259     # GH#5567 this will fail if the label is not present in the axis.
-> 1260     return self.obj.xs(label, axis=axis)

File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/generic.py:4056, in NDFrame.xs(self, key, axis, level, drop_level)
   4054             new_index = index[loc]
   4055 else:
-> 4056     loc = index.get_loc(key)
   4058     if isinstance(loc, np.ndarray):
   4059         if loc.dtype == np.bool_:

File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexes/range.py:395, in RangeIndex.get_loc(self, key, method, tolerance)
    393             raise KeyError(key) from err
    394     self._check_indexing_error(key)
--> 395     raise KeyError(key)
    396 return super().get_loc(key, method=method, tolerance=tolerance)

KeyError: 'pickup_borough'

Here is the right way using loc. It is definitely more convenient to use df["pickup_borough"] rather than the following.

df.loc[:, "pickup_borough"]
0       Manhattan
1       Manhattan
2       Manhattan
3       Manhattan
4       Manhattan
          ...    
6428    Manhattan
6429       Queens
6430     Brooklyn
6431     Brooklyn
6432     Brooklyn
Name: pickup_borough, Length: 6433, dtype: object

If we want to use iloc, we need to find the integer position of the “pickup_borough” column. We can use the columns attribute to see all the columns.

df.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
       'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
       'pickup_borough', 'dropoff_borough'],
      dtype='object')

Here I just counted (remember to start counting at 0). There are of course more sophisticated ways also; ask if you’re interested! The following is the iloc way to get the “pickup_borough” column.

df.iloc[:, 12]
0       Manhattan
1       Manhattan
2       Manhattan
3       Manhattan
4       Manhattan
          ...    
6428    Manhattan
6429       Queens
6430     Brooklyn
6431     Brooklyn
6432     Brooklyn
Name: pickup_borough, Length: 6433, dtype: object
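One of the “more sophisticated ways” alluded to above is to ask the columns Index itself for the position, using its get_loc method, instead of counting by hand. Here is a sketch on a small made-up DataFrame (the column names other than “pickup_borough” are invented):

```python
import pandas as pd

# Toy DataFrame; only "pickup_borough" mirrors the taxis data.
df_toy = pd.DataFrame({"a": [1], "b": [2], "pickup_borough": ["Queens"]})

# get_loc returns the integer position of a column label.
pos = df_toy.columns.get_loc("pickup_borough")
print(pos)                  # 2
print(df_toy.iloc[:, pos])  # same as df_toy["pickup_borough"]
```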

Indexing in pandas is confusing and you should not expect to be comfortable with it until you have practiced for at least a few weeks. If we use square brackets and what’s called a “slice” (using colon :), and we don’t use loc or iloc, then this accesses rows.

# First three rows
df[:3]
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan

If we use just an integer (not a slice), then it is looking for columns. I found this pretty confusing at first, but eventually got used to it.

# looking for a column named 3
df[3]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3801 try:
-> 3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:

File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 3

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[20], line 2
      1 # looking for a column named 3
----> 2 df[3]

File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/frame.py:3807, in DataFrame.__getitem__(self, key)
   3805 if self.columns.nlevels > 1:
   3806     return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
   3808 if is_integer(indexer):
   3809     indexer = [indexer]

File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexes/base.py:3804, in Index.get_loc(self, key, method, tolerance)
   3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:
-> 3804     raise KeyError(key) from err
   3805 except TypeError:
   3806     # If we have a listlike key, _check_indexing_error will raise
   3807     #  InvalidIndexError. Otherwise we fall through and re-raise
   3808     #  the TypeError.
   3809     self._check_indexing_error(key)

KeyError: 3

Working with dates in pandas#

Here is a reminder of how df looks.

df.sample(3)
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
4997 2019-03-16 11:04:35 2019-03-16 11:40:25 5 6.31 26.0 3.00 0.0 32.30 yellow credit card TriBeCa/Civic Center Upper East Side North Manhattan Manhattan
4905 2019-03-12 16:23:50 2019-03-12 16:27:07 5 0.56 4.5 0.00 0.0 8.80 yellow cash Clinton East Clinton East Manhattan Manhattan
536 2019-03-22 15:59:29 2019-03-22 16:10:08 1 0.52 8.0 2.26 0.0 13.56 yellow credit card Clinton East Times Sq/Theatre District Manhattan Manhattan
  • Convert the “pickup” and “dropoff” columns to datetime values, using the pandas function to_datetime. Save the results as new columns in df, named “picktime” and “droptime”.

Here is the “pickup” column. We can recognize these as dates, but for now, they are just strings to Python and to pandas. (Notice that it is reported as an “object” data type.)

df["pickup"]
0       2019-03-23 20:21:09
1       2019-03-04 16:11:55
2       2019-03-27 17:53:01
3       2019-03-10 01:23:59
4       2019-03-30 13:27:42
               ...         
6428    2019-03-31 09:51:53
6429    2019-03-31 17:38:00
6430    2019-03-23 22:55:18
6431    2019-03-04 10:09:25
6432    2019-03-13 19:31:22
Name: pickup, Length: 6433, dtype: object

Notice how the data type changes to “datetime64[ns]” in the following output (it has no effect on the original column).

pd.to_datetime(df["pickup"])
0      2019-03-23 20:21:09
1      2019-03-04 16:11:55
2      2019-03-27 17:53:01
3      2019-03-10 01:23:59
4      2019-03-30 13:27:42
               ...        
6428   2019-03-31 09:51:53
6429   2019-03-31 17:38:00
6430   2019-03-23 22:55:18
6431   2019-03-04 10:09:25
6432   2019-03-13 19:31:22
Name: pickup, Length: 6433, dtype: datetime64[ns]

Let’s see in a smaller example how the pd.to_datetime function works. We start out by defining a string representing today’s date.

today = "April 12, 2023"

This is just a string.

type(today)
str

Here we convert it to a Timestamp using pd.to_datetime.

ts = pd.to_datetime(today)
ts
Timestamp('2023-04-12 00:00:00')

We now have access to all sorts of additional information related to this date.

I can never remember which of these are methods and which are attributes. Here I tried calling day_of_year as a method, but that was incorrect.

ts.day_of_year()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[28], line 1
----> 1 ts.day_of_year()

TypeError: 'int' object is not callable

The attribute usage is correct. This is telling us that April 12th is the 102nd day of 2023. (The count starts at 1: January 1st is day 1, and 31 + 28 + 31 + 12 = 102.)

ts.day_of_year
102

We can also get the name of the day. (Notice that this one is indeed a method. To my eyes, it’s a little unpredictable which are methods. Whether day_name or day_of_year is a method is definitely not something you would be tested on.)

ts.day_name()
'Wednesday'

If you want to see all the various attributes and methods, you can evaluate dir(ts). Probably you should ignore the ones which begin and end with two underscores, which are called “dunder methods” (for “double underscore”), and are typically not used by us directly.

dir(ts)
['__add__',
 '__array_priority__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__radd__',
 '__reduce__',
 '__reduce_cython__',
 '__reduce_ex__',
 '__repr__',
 '__rsub__',
 '__setattr__',
 '__setstate__',
 '__setstate_cython__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__weakref__',
 '_as_unit',
 '_date_repr',
 '_freq',
 '_freqstr',
 '_from_dt64',
 '_from_value_and_reso',
 '_repr_base',
 '_reso',
 '_round',
 '_set_freq',
 '_short_repr',
 '_time_repr',
 'asm8',
 'astimezone',
 'ceil',
 'combine',
 'ctime',
 'date',
 'day',
 'day_name',
 'day_of_week',
 'day_of_year',
 'dayofweek',
 'dayofyear',
 'days_in_month',
 'daysinmonth',
 'dst',
 'floor',
 'fold',
 'freq',
 'freqstr',
 'fromisocalendar',
 'fromisoformat',
 'fromordinal',
 'fromtimestamp',
 'hour',
 'is_leap_year',
 'is_month_end',
 'is_month_start',
 'is_quarter_end',
 'is_quarter_start',
 'is_year_end',
 'is_year_start',
 'isocalendar',
 'isoformat',
 'isoweekday',
 'max',
 'microsecond',
 'min',
 'minute',
 'month',
 'month_name',
 'nanosecond',
 'normalize',
 'now',
 'quarter',
 'replace',
 'resolution',
 'round',
 'second',
 'strftime',
 'strptime',
 'time',
 'timestamp',
 'timetuple',
 'timetz',
 'to_datetime64',
 'to_julian_date',
 'to_numpy',
 'to_period',
 'to_pydatetime',
 'today',
 'toordinal',
 'tz',
 'tz_convert',
 'tz_localize',
 'tzinfo',
 'tzname',
 'utcfromtimestamp',
 'utcnow',
 'utcoffset',
 'utctimetuple',
 'value',
 'week',
 'weekday',
 'weekofyear',
 'year']

Here we are adding two new columns to the DataFrame. We name them “picktime” and “droptime”.

# syntax for a new column in df
df["picktime"] = pd.to_datetime(df["pickup"])
df["droptime"] = pd.to_datetime(df["dropoff"])

Notice how these new columns have shown up all the way on the right side.

df.head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough picktime droptime
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan 2019-03-23 20:21:09 2019-03-23 20:27:24
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan 2019-03-04 16:11:55 2019-03-04 16:19:00
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan 2019-03-27 17:53:01 2019-03-27 18:00:25
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 6.15 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan 2019-03-10 01:23:59 2019-03-10 01:49:51
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan 2019-03-30 13:27:42 2019-03-30 13:37:14
  • Make a new column, “day”, containing the day of the week of the pickup. Use the accessor dt and then the method day_name.

If we try to call day_name on the Series directly, even though it contains datetime values, we get an error. We need to use the dt accessor to apply these kinds of methods to the entries in the Series.

It might be hard for you as students to know what you should remember for a quiz and what is less important. I would like you to remember that the dt accessor (and the str accessor below) are required.

df["picktime"].day_name()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[34], line 1
----> 1 df["picktime"].day_name()

File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/generic.py:5902, in NDFrame.__getattr__(self, name)
   5895 if (
   5896     name not in self._internal_names_set
   5897     and name not in self._metadata
   5898     and name not in self._accessors
   5899     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5900 ):
   5901     return self[name]
-> 5902 return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'day_name'

Here we successfully use the dt accessor.

df["day"] = df["picktime"].dt.day_name()
df["day"]
0        Saturday
1          Monday
2       Wednesday
3          Sunday
4        Saturday
          ...    
6428       Sunday
6429       Sunday
6430     Saturday
6431       Monday
6432    Wednesday
Name: day, Length: 6433, dtype: object
  • Make a new column “duration” containing the amount of time the ride took. (Now that pandas recognizes these as dates, you can just subtract.)

Let’s remind ourselves what columns are available. The ones we want to use are near the end.

df.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
       'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
       'pickup_borough', 'dropoff_borough', 'picktime', 'droptime', 'day'],
      dtype='object')

This sort of subtraction of times would never work if we hadn’t converted them to datetime types using pd.to_datetime.

df["duration"] = df["droptime"] - df["picktime"]
df["duration"]
0      0 days 00:06:15
1      0 days 00:07:05
2      0 days 00:07:24
3      0 days 00:25:52
4      0 days 00:09:32
             ...      
6428   0 days 00:03:34
6429   0 days 00:56:23
6430   0 days 00:19:07
6431   0 days 00:05:04
6432   0 days 00:16:40
Name: duration, Length: 6433, dtype: timedelta64[ns]
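If we later want these durations as plain numbers (say, minutes), the dt accessor also works on a timedelta Series, via its total_seconds method. A sketch on two made-up pickup/dropoff pairs (mirroring the first two rows of the taxis data):

```python
import pandas as pd

# Made-up pickup and dropoff times, converted with pd.to_datetime.
pick = pd.to_datetime(pd.Series(["2019-03-23 20:21:09", "2019-03-04 16:11:55"]))
drop = pd.to_datetime(pd.Series(["2019-03-23 20:27:24", "2019-03-04 16:19:00"]))

duration = drop - pick                      # timedelta64[ns] Series
minutes = duration.dt.total_seconds() / 60  # plain float minutes
print(minutes)  # 6.25 and about 7.08
```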

Here we check out the first three rows and first two columns. Does the top row look like a 6 minute taxi ride? Yes, because it went from 8:21pm to 8:27pm.

df.iloc[:3, :2]
pickup dropoff
0 2019-03-23 20:21:09 2019-03-23 20:27:24
1 2019-03-04 16:11:55 2019-03-04 16:19:00
2 2019-03-27 17:53:01 2019-03-27 18:00:25
  • What are the data types of these four columns we just made?

Notice how the first two we made are called “datetime64[ns]”, whereas the last one is called a “timedelta64[ns]”. When you see “delta”, you should usually think “difference”, and that’s exactly what this column represents. The middle column, “day”, is reported as object, because they are just strings like "Monday".

df.dtypes
pickup                      object
dropoff                     object
passengers                   int64
distance                   float64
fare                       float64
tip                        float64
tolls                      float64
total                      float64
color                       object
payment                     object
pickup_zone                 object
dropoff_zone                object
pickup_borough              object
dropoff_borough             object
picktime            datetime64[ns]
droptime            datetime64[ns]
day                         object
duration           timedelta64[ns]
dtype: object

Working with strings in pandas#

  • How many rows in df have the word “Airport” in either the “pickup_zone” column or the “dropoff_zone” column? Do any rows have the word “Airport” in both columns?

Use the str accessor and the contains method.

Here we check out the first three entries in the “pickup_zone” column.

df["pickup_zone"][:3]
0          Lenox Hill West
1    Upper West Side South
2            Alphabet City
Name: pickup_zone, dtype: object

Here is an example of how str.contains works. Only the middle of these three contains the substring "South", so we get False then True then False.

df["pickup_zone"][:3].str.contains("South")
0    False
1     True
2    False
Name: pickup_zone, dtype: bool

We now do the same thing for the word “Airport”. We save these Boolean Series.

ps = df["pickup_zone"].str.contains("Airport")
ds = df["dropoff_zone"].str.contains("Airport")

Here are the first three values. None of these first three contains the substring "Airport".

ps[:3]
0    False
1    False
2    False
Name: pickup_zone, dtype: object

Here we use Boolean indexing to find all the rows where either the “pickup_zone” or the “dropoff_zone” column contains the substring “Airport”. Notice how the logical or is made using a vertical bar. (Don’t spell it out in this context. Spelled out or is used in some contexts in Python, but not with NumPy Boolean arrays or pandas Boolean Series.)

Scroll to the right and check that each of these rows really does involve an “Airport”.

df[ps | ds]
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough picktime droptime day duration
10 2019-03-16 10:02:25 2019-03-16 10:22:29 1 3.90 17.00 0.00 0.00 17.80 yellow cash LaGuardia Airport Astoria Queens Queens 2019-03-16 10:02:25 2019-03-16 10:22:29 Saturday 0 days 00:20:04
22 2019-03-28 17:20:54 2019-03-28 17:55:57 1 9.82 31.50 8.31 5.76 49.87 yellow credit card LaGuardia Airport Sutton Place/Turtle Bay North Queens Manhattan 2019-03-28 17:20:54 2019-03-28 17:55:57 Thursday 0 days 00:35:03
43 2019-03-27 19:39:16 2019-03-27 20:00:26 1 9.50 28.00 0.00 5.76 35.06 yellow cash LaGuardia Airport Washington Heights South Queens Manhattan 2019-03-27 19:39:16 2019-03-27 20:00:26 Wednesday 0 days 00:21:10
53 2019-03-20 18:49:24 2019-03-20 19:12:40 1 13.45 36.00 0.00 0.00 37.80 yellow cash JFK Airport Sunnyside Queens Queens 2019-03-20 18:49:24 2019-03-20 19:12:40 Wednesday 0 days 00:23:16
56 2019-03-05 17:57:00 2019-03-05 18:45:25 2 19.59 52.00 0.00 5.76 65.56 yellow credit card JFK Airport Upper East Side North Queens Manhattan 2019-03-05 17:57:00 2019-03-05 18:45:25 Tuesday 0 days 00:48:25
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6243 2019-03-12 11:40:49 2019-03-12 11:49:23 1 3.50 12.00 0.00 0.00 12.80 green cash Steinway LaGuardia Airport Queens Queens 2019-03-12 11:40:49 2019-03-12 11:49:23 Tuesday 0 days 00:08:34
6280 2019-03-27 12:25:04 2019-03-27 12:31:15 1 2.74 10.00 0.00 0.00 10.80 green credit card Steinway LaGuardia Airport Queens Queens 2019-03-27 12:25:04 2019-03-27 12:31:15 Wednesday 0 days 00:06:11
6322 2019-03-01 17:03:11 2019-03-01 17:17:11 1 5.39 17.00 3.76 0.00 22.56 green credit card Jamaica JFK Airport Queens Queens 2019-03-01 17:03:11 2019-03-01 17:17:11 Friday 0 days 00:14:00
6396 2019-03-15 17:44:09 2019-03-15 18:20:25 1 11.44 41.63 0.00 0.00 42.13 green credit card Bushwick South LaGuardia Airport Brooklyn Queens 2019-03-15 17:44:09 2019-03-15 18:20:25 Friday 0 days 00:36:16
6398 2019-03-17 13:07:29 2019-03-17 13:51:56 1 18.39 50.50 7.70 0.00 59.00 green credit card DUMBO/Vinegar Hill JFK Airport Brooklyn Queens 2019-03-17 13:07:29 2019-03-17 13:51:56 Sunday 0 days 00:44:27

407 rows Ă— 18 columns

Here is something similar, but using & instead of |. This corresponds to the rows where both the pickup zone and the dropoff zone involved an airport. Many of these rides were very short (less than a minute).

df[ps & ds]
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough picktime droptime day duration
120 2019-03-21 17:21:44 2019-03-21 17:21:49 1 0.00 2.50 0.00 0.00 4.30 yellow cash JFK Airport JFK Airport Queens Queens 2019-03-21 17:21:44 2019-03-21 17:21:49 Thursday 0 days 00:00:05
770 2019-03-02 03:16:59 2019-03-02 03:17:06 0 9.40 2.50 0.00 0.00 3.80 yellow NaN JFK Airport JFK Airport Queens Queens 2019-03-02 03:16:59 2019-03-02 03:17:06 Saturday 0 days 00:00:07
1080 2019-03-04 14:17:05 2019-03-04 14:17:13 1 0.00 2.50 0.00 0.00 3.30 yellow cash JFK Airport JFK Airport Queens Queens 2019-03-04 14:17:05 2019-03-04 14:17:13 Monday 0 days 00:00:08
1089 2019-03-10 01:43:32 2019-03-10 01:45:22 1 0.37 3.50 0.00 0.00 4.80 yellow cash JFK Airport JFK Airport Queens Queens 2019-03-10 01:43:32 2019-03-10 01:45:22 Sunday 0 days 00:01:50
1416 2019-03-09 13:16:32 2019-03-09 13:46:11 2 12.39 35.00 0.00 0.00 35.80 yellow cash LaGuardia Airport JFK Airport Queens Queens 2019-03-09 13:16:32 2019-03-09 13:46:11 Saturday 0 days 00:29:39
1929 2019-03-13 22:35:35 2019-03-13 22:35:49 1 0.00 2.50 0.00 0.00 3.80 yellow NaN JFK Airport JFK Airport Queens Queens 2019-03-13 22:35:35 2019-03-13 22:35:49 Wednesday 0 days 00:00:14
2387 2019-03-28 15:58:52 2019-03-28 15:59:25 1 1.80 69.06 20.80 0.00 90.16 yellow credit card JFK Airport JFK Airport Queens Queens 2019-03-28 15:58:52 2019-03-28 15:59:25 Thursday 0 days 00:00:33
3571 2019-03-22 16:47:41 2019-03-22 16:47:50 1 0.81 66.00 0.00 0.00 66.80 yellow credit card JFK Airport JFK Airport Queens Queens 2019-03-22 16:47:41 2019-03-22 16:47:50 Friday 0 days 00:00:09
4358 2019-03-06 18:24:00 2019-03-06 18:24:13 2 0.01 2.50 0.00 0.00 4.30 yellow cash JFK Airport JFK Airport Queens Queens 2019-03-06 18:24:00 2019-03-06 18:24:13 Wednesday 0 days 00:00:13
5095 2019-03-30 20:14:44 2019-03-30 21:01:28 3 18.91 52.00 8.78 5.76 67.34 yellow credit card JFK Airport JFK Airport Queens Queens 2019-03-30 20:14:44 2019-03-30 21:01:28 Saturday 0 days 00:46:44
5364 2019-03-17 16:59:17 2019-03-17 18:04:08 2 36.70 150.00 0.00 24.02 174.82 yellow cash JFK Airport JFK Airport Queens Queens 2019-03-17 16:59:17 2019-03-17 18:04:08 Sunday 0 days 01:04:51

We didn’t get any further.

  • Alternative, less elegant, approach. Find the rows where “Airport” is in the “pickup_zone” column by writing an appropriate function and using the map method of a pandas Series. Use the keyword argument na_action="ignore" so that missing values don’t cause errors.
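We didn’t get to this in class, but here is one possible sketch of that map approach on a small made-up Series (the helper function name has_airport is mine, not from the lecture):

```python
import pandas as pd
import numpy as np

# Toy stand-in for df["pickup_zone"]; the values are made up.
zones = pd.Series(["LaGuardia Airport", "Astoria", np.nan, "JFK Airport"])

def has_airport(z):
    # z is a string here; na_action="ignore" keeps NaN from ever reaching us
    return "Airport" in z

result = zones.map(has_airport, na_action="ignore")
print(result)  # True, False, NaN, True
```

Without na_action="ignore", the call has_airport(np.nan) would raise a TypeError, since "in" does not work on a float.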

Missing values in pandas#

  • Look at df.info() again. Which columns have missing values?

  • Define df_sub to be the sub-DataFrame containing all the rows for which the “pickup_zone” entry is missing. Use the isna method.

  • What if you tried to check for equality with np.nan?
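We ran out of time for these, but here is a sketch of both ideas on a small made-up DataFrame: isna gives a Boolean Series we can use for Boolean indexing, whereas comparing with np.nan never matches anything, because nan is not equal to itself.

```python
import pandas as pd
import numpy as np

# Toy DataFrame with one missing zone; the values are made up.
df_toy = pd.DataFrame({"pickup_zone": ["Astoria", np.nan, "JFK Airport"]})

# isna produces True exactly where the entry is missing.
df_sub = df_toy[df_toy["pickup_zone"].isna()]
print(df_sub)  # just the middle row

# Equality with np.nan does NOT work: nan != nan by definition.
print(np.nan == np.nan)                         # False
print((df_toy["pickup_zone"] == np.nan).sum())  # 0 rows matched
```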

Time to work on Worksheets 3-4#

  • Hanson and I are here to help.

  • I have another class at 2pm, so if you have questions, ask me now rather than after class!