Week 2 Wednesday#
Announcements#
Trying something new: I’m in Deepnote, but in a “Private” project. Will upload this afterwards. You can find the blank version in the Week 2 folder.
Worksheet 4 handed out today. Both Worksheets 3 and 4 due Monday.
Hanson (one of our three Learning Assistants) is here to help.
Exploring the taxis dataset#
I refer to the first half of Math 10 as the “Exploratory Data Analysis” half (EDA). In the first part today, I want to go through some basic tools in pandas for getting a quick overview of the contents of a pandas DataFrame.
Comment: this taxis dataset is included with the Python library Seaborn. Usually that’s where I get it from. In this case, I’m getting it from a csv file instead, because Seaborn automatically converts some dates for us, but I want us to see how to make that conversion explicitly.
Load the attached file taxis.csv using the pandas function `read_csv`. Store the resulting DataFrame in the variable `df`.
import pandas as pd
df = pd.read_csv("../data/taxis.csv")
Get a sense for the contents of `df` using the following.

- The `shape` attribute.
- The `head` method, to view the first few rows.
- The `sample` method, to view a random collection of rows.
- The `info` method.
- The `describe` method.
The `shape` attribute reports the number of rows (6433 in this case) and the number of columns (14).
df.shape
(6433, 14)
Here we actually look at the contents of `df`; namely, we look at its first 3 rows.
df.head(3)
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
The `sample` method is similar, but it returns a random selection of the rows. Notice that they could also be in a different order.
df.sample(3)
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
794 | 2019-03-06 10:01:53 | 2019-03-06 10:27:29 | 2 | 2.15 | 16.0 | 3.86 | 0.0 | 23.16 | yellow | credit card | Upper East Side South | Clinton East | Manhattan | Manhattan |
6352 | 2019-03-28 12:46:54 | 2019-03-28 12:51:24 | 1 | 0.55 | 5.0 | 0.58 | 0.0 | 6.38 | green | credit card | Morningside Heights | Morningside Heights | Manhattan | Manhattan |
3412 | 2019-03-07 09:37:32 | 2019-03-07 10:03:00 | 1 | 1.61 | 15.5 | 3.76 | 0.0 | 22.56 | yellow | credit card | Upper East Side South | Midtown South | Manhattan | Manhattan |
The `info` method contains lots of information. At the very bottom, it tells us how much space the DataFrame is taking up (which can be very useful if you've loaded a huge DataFrame from some external source). For each column, it tells us the data type (in the usual rough sense; for example, strings are reported as the "object" data type).

A huge issue in data science is missing data (and how to deal with it). The `info` method also reports missing data. (Or more precisely, it reports non-missing data, so for example we can infer below that 6433-6389 = 44 values are missing from the "payment" column.)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pickup 6433 non-null object
1 dropoff 6433 non-null object
2 passengers 6433 non-null int64
3 distance 6433 non-null float64
4 fare 6433 non-null float64
5 tip 6433 non-null float64
6 tolls 6433 non-null float64
7 total 6433 non-null float64
8 color 6433 non-null object
9 payment 6389 non-null object
10 pickup_zone 6407 non-null object
11 dropoff_zone 6388 non-null object
12 pickup_borough 6407 non-null object
13 dropoff_borough 6388 non-null object
dtypes: float64(5), int64(1), object(8)
memory usage: 703.7+ KB
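If you want the missing-value counts directly, rather than inferring them from the non-null counts, one option (a sketch we didn't cover in lecture, using a made-up miniature DataFrame) is to combine the `isna` method with `sum`:

```python
import pandas as pd

# made-up miniature stand-in for the taxis data, with one missing payment
df = pd.DataFrame({
    "fare": [7.0, 5.0, 7.5],
    "payment": ["credit card", None, "cash"],
})

# isna gives a Boolean DataFrame; summing counts the True values per column
missing = df.isna().sum()
print(missing)  # fare 0, payment 1
```

On the real taxis data, the "payment" entry of `df.isna().sum()` should be the 44 we inferred above.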
The `describe` method returns information about the distribution of values in the numeric columns. For example, we can see that the median fare is 9.5 (the 50th percentile), whereas the mean fare is about 13.09.
df.describe()
passengers | distance | fare | tip | tolls | total | |
---|---|---|---|---|---|---|
count | 6433.000000 | 6433.000000 | 6433.000000 | 6433.00000 | 6433.000000 | 6433.000000 |
mean | 1.539251 | 3.024617 | 13.091073 | 1.97922 | 0.325273 | 18.517794 |
std | 1.203768 | 3.827867 | 11.551804 | 2.44856 | 1.415267 | 13.815570 |
min | 0.000000 | 0.000000 | 1.000000 | 0.00000 | 0.000000 | 1.300000 |
25% | 1.000000 | 0.980000 | 6.500000 | 0.00000 | 0.000000 | 10.800000 |
50% | 1.000000 | 1.640000 | 9.500000 | 1.70000 | 0.000000 | 14.160000 |
75% | 2.000000 | 3.210000 | 15.000000 | 2.80000 | 0.000000 | 20.300000 |
max | 6.000000 | 36.700000 | 150.000000 | 33.20000 | 24.020000 | 174.820000 |
How many different values are there in the "pickup_borough" column? First get the pandas Series containing that column, then

- use the `unique` method;
- use the `value_counts` method.
We first define a pandas Series `ser` that will be equal to the "pickup_borough" column.
ser = df["pickup_borough"]
ser
0 Manhattan
1 Manhattan
2 Manhattan
3 Manhattan
4 Manhattan
...
6428 Manhattan
6429 Queens
6430 Brooklyn
6431 Brooklyn
6432 Brooklyn
Name: pickup_borough, Length: 6433, dtype: object
Notice that this is indeed a pandas Series.
type(ser)
pandas.core.series.Series
We now call the `unique` method. This gets us the distinct values which occur in the Series (including the missing value, represented as `nan`, which stands for "not a number").
ser.unique()
array(['Manhattan', 'Queens', nan, 'Bronx', 'Brooklyn'], dtype=object)
The output is a NumPy array. (NumPy is a dependency of pandas, so anyone who has pandas installed will also have NumPy installed.)
type(ser.unique())
numpy.ndarray
If we want a programmatic way of counting how many unique values there are, we can use the Python function `len`. (If we wanted to not count the missing value, we'd have to work harder.)
len(ser.unique())
5
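One way to not count the missing value, without much extra work, is the `nunique` method, which ignores `nan` by default. Here is a sketch on a miniature Series mimicking our column:

```python
import numpy as np
import pandas as pd

# miniature stand-in for the "pickup_borough" column
ser = pd.Series(["Manhattan", "Queens", np.nan, "Bronx", "Brooklyn"])

print(len(ser.unique()))  # 5, because unique includes nan
print(ser.nunique())      # 4, because nunique drops nan by default
```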
Notice that `unique` is a Series method, not a DataFrame method.
# don't know if this works
df.unique()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[13], line 2
1 # don't know if this works
----> 2 df.unique()
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/generic.py:5902, in NDFrame.__getattr__(self, name)
5895 if (
5896 name not in self._internal_names_set
5897 and name not in self._metadata
5898 and name not in self._accessors
5899 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5900 ):
5901 return self[name]
-> 5902 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'unique'
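We listed `value_counts` above but never ran it. Here is a sketch on a made-up miniature Series: it tallies how often each distinct value occurs (dropping missing values by default), and its length also answers the "how many different values" question.

```python
import numpy as np
import pandas as pd

# miniature stand-in for the "pickup_borough" column
ser = pd.Series(["Manhattan", "Queens", "Manhattan", np.nan, "Queens", "Manhattan"])

# value_counts tallies each distinct value, most frequent first;
# the missing value is dropped by default
counts = ser.value_counts()
print(counts)  # Manhattan 3, Queens 2
```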
Aside: there was a question about how we could get this column using `loc` or `iloc`. The first thing you might try, `df.loc["pickup_borough"]`, does not work, because this is looking for a row named "pickup_borough".
df.loc["pickup_borough"]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[14], line 1
----> 1 df.loc["pickup_borough"]
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexing.py:1073, in _LocationIndexer.__getitem__(self, key)
1070 axis = self.axis or 0
1072 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1073 return self._getitem_axis(maybe_callable, axis=axis)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexing.py:1312, in _LocIndexer._getitem_axis(self, key, axis)
1310 # fall thru to straight lookup
1311 self._validate_key(key, axis)
-> 1312 return self._get_label(key, axis=axis)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexing.py:1260, in _LocIndexer._get_label(self, label, axis)
1258 def _get_label(self, label, axis: int):
1259 # GH#5567 this will fail if the label is not present in the axis.
-> 1260 return self.obj.xs(label, axis=axis)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/generic.py:4056, in NDFrame.xs(self, key, axis, level, drop_level)
4054 new_index = index[loc]
4055 else:
-> 4056 loc = index.get_loc(key)
4058 if isinstance(loc, np.ndarray):
4059 if loc.dtype == np.bool_:
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexes/range.py:395, in RangeIndex.get_loc(self, key, method, tolerance)
393 raise KeyError(key) from err
394 self._check_indexing_error(key)
--> 395 raise KeyError(key)
396 return super().get_loc(key, method=method, tolerance=tolerance)
KeyError: 'pickup_borough'
Here is the right way using `loc`. It is definitely more convenient to use `df["pickup_borough"]` rather than the following.
df.loc[:, "pickup_borough"]
0 Manhattan
1 Manhattan
2 Manhattan
3 Manhattan
4 Manhattan
...
6428 Manhattan
6429 Queens
6430 Brooklyn
6431 Brooklyn
6432 Brooklyn
Name: pickup_borough, Length: 6433, dtype: object
If we want to use `iloc`, we need to find the integer position of the "pickup_borough" column. We can use the `columns` attribute to see all the columns.
df.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
'pickup_borough', 'dropoff_borough'],
dtype='object')
Here I just counted (remember to start counting at 0). There are of course more sophisticated ways also; ask if you're interested! The following is the `iloc` way to get the "pickup_borough" column.
df.iloc[:, 12]
0 Manhattan
1 Manhattan
2 Manhattan
3 Manhattan
4 Manhattan
...
6428 Manhattan
6429 Queens
6430 Brooklyn
6431 Brooklyn
6432 Brooklyn
Name: pickup_borough, Length: 6433, dtype: object
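One of those "more sophisticated ways": the `columns` attribute is an Index, and its `get_loc` method returns the integer position of a label, so we don't have to count by hand. A sketch with a made-up miniature DataFrame:

```python
import pandas as pd

# miniature DataFrame: only the column names matter here
df = pd.DataFrame(columns=["pickup", "dropoff", "pickup_borough"])

# get_loc returns the integer position of the given column label
pos = df.columns.get_loc("pickup_borough")
print(pos)  # 2
```

On the real taxis DataFrame, `df.columns.get_loc("pickup_borough")` should return 12.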
Indexing in pandas is confusing, and you should not expect to be comfortable with it until you have practiced for at least a few weeks. If we use square brackets and what's called a "slice" (using a colon `:`), and we don't use `loc` or `iloc`, then this accesses rows.
# First three rows
df[:3]
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
If we use just an integer (not a slice), then it is looking for columns. I found this pretty confusing at first, but eventually got used to it.
# looking for a column named 3
df[3]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
3801 try:
-> 3802 return self._engine.get_loc(casted_key)
3803 except KeyError as err:
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 3
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[20], line 2
1 # looking for a column named 3
----> 2 df[3]
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/frame.py:3807, in DataFrame.__getitem__(self, key)
3805 if self.columns.nlevels > 1:
3806 return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
3808 if is_integer(indexer):
3809 indexer = [indexer]
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/indexes/base.py:3804, in Index.get_loc(self, key, method, tolerance)
3802 return self._engine.get_loc(casted_key)
3803 except KeyError as err:
-> 3804 raise KeyError(key) from err
3805 except TypeError:
3806 # If we have a listlike key, _check_indexing_error will raise
3807 # InvalidIndexError. Otherwise we fall through and re-raise
3808 # the TypeError.
3809 self._check_indexing_error(key)
KeyError: 3
Working with dates in pandas#
Here is a reminder of how `df` looks.
df.sample(3)
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4997 | 2019-03-16 11:04:35 | 2019-03-16 11:40:25 | 5 | 6.31 | 26.0 | 3.00 | 0.0 | 32.30 | yellow | credit card | TriBeCa/Civic Center | Upper East Side North | Manhattan | Manhattan |
4905 | 2019-03-12 16:23:50 | 2019-03-12 16:27:07 | 5 | 0.56 | 4.5 | 0.00 | 0.0 | 8.80 | yellow | cash | Clinton East | Clinton East | Manhattan | Manhattan |
536 | 2019-03-22 15:59:29 | 2019-03-22 16:10:08 | 1 | 0.52 | 8.0 | 2.26 | 0.0 | 13.56 | yellow | credit card | Clinton East | Times Sq/Theatre District | Manhattan | Manhattan |
Convert the "pickup" and "dropoff" columns to datetime values, using the pandas function `to_datetime`. Save the results as new columns in `df`, named "picktime" and "droptime".
Here is the “pickup” column. We can recognize these as dates, but for now, they are just strings to Python and to pandas. (Notice that it is reported as an “object” data type.)
df["pickup"]
0 2019-03-23 20:21:09
1 2019-03-04 16:11:55
2 2019-03-27 17:53:01
3 2019-03-10 01:23:59
4 2019-03-30 13:27:42
...
6428 2019-03-31 09:51:53
6429 2019-03-31 17:38:00
6430 2019-03-23 22:55:18
6431 2019-03-04 10:09:25
6432 2019-03-13 19:31:22
Name: pickup, Length: 6433, dtype: object
Notice how the data type changes to "datetime64[ns]" in the following output (this has no effect on the original column).
pd.to_datetime(df["pickup"])
0 2019-03-23 20:21:09
1 2019-03-04 16:11:55
2 2019-03-27 17:53:01
3 2019-03-10 01:23:59
4 2019-03-30 13:27:42
...
6428 2019-03-31 09:51:53
6429 2019-03-31 17:38:00
6430 2019-03-23 22:55:18
6431 2019-03-04 10:09:25
6432 2019-03-13 19:31:22
Name: pickup, Length: 6433, dtype: datetime64[ns]
Let's see in a smaller example how the `pd.to_datetime` function works. We start out by defining a string representing today's date.
today = "April 12, 2023"
This is just a string.
type(today)
str
Here we convert it to a Timestamp using `pd.to_datetime`.
ts = pd.to_datetime(today)
ts
Timestamp('2023-04-12 00:00:00')
We now have access to all sorts of additional information related to this date.
I can never remember which of these are methods and which are attributes. Here I tried calling `day_of_year` as a method, but that was incorrect.
ts.day_of_year()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[28], line 1
----> 1 ts.day_of_year()
TypeError: 'int' object is not callable
The attribute usage is correct. This is telling us that April 12th is the 102nd day of 2023. (The counting starts at 1 here: January 1st is day 1.)
ts.day_of_year
102
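A quick check of how the counting starts: January 1st is reported as day 1, so `day_of_year` is 1-based.

```python
import pandas as pd

# January 1st comes back as day 1, so counting starts at 1
print(pd.to_datetime("January 1, 2023").day_of_year)  # 1
print(pd.to_datetime("April 12, 2023").day_of_year)   # 102
```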
We can also get the name of the day. (Notice that this one is indeed a method. To my eyes, it's a little unpredictable which are methods. Whether `day_name` or `day_of_year` is a method is definitely not something you would be tested on.)
ts.day_name()
'Wednesday'
If you want to see all the various attributes and methods, you can call the built-in Python function `dir`. Probably you should ignore the ones which begin and end with two underscores; these are called "dunder methods" (for "double underscore") and are typically not used by us directly.
dir(ts)
['__add__',
'__array_priority__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__pyx_vtable__',
'__radd__',
'__reduce__',
'__reduce_cython__',
'__reduce_ex__',
'__repr__',
'__rsub__',
'__setattr__',
'__setstate__',
'__setstate_cython__',
'__sizeof__',
'__str__',
'__sub__',
'__subclasshook__',
'__weakref__',
'_as_unit',
'_date_repr',
'_freq',
'_freqstr',
'_from_dt64',
'_from_value_and_reso',
'_repr_base',
'_reso',
'_round',
'_set_freq',
'_short_repr',
'_time_repr',
'asm8',
'astimezone',
'ceil',
'combine',
'ctime',
'date',
'day',
'day_name',
'day_of_week',
'day_of_year',
'dayofweek',
'dayofyear',
'days_in_month',
'daysinmonth',
'dst',
'floor',
'fold',
'freq',
'freqstr',
'fromisocalendar',
'fromisoformat',
'fromordinal',
'fromtimestamp',
'hour',
'is_leap_year',
'is_month_end',
'is_month_start',
'is_quarter_end',
'is_quarter_start',
'is_year_end',
'is_year_start',
'isocalendar',
'isoformat',
'isoweekday',
'max',
'microsecond',
'min',
'minute',
'month',
'month_name',
'nanosecond',
'normalize',
'now',
'quarter',
'replace',
'resolution',
'round',
'second',
'strftime',
'strptime',
'time',
'timestamp',
'timetuple',
'timetz',
'to_datetime64',
'to_julian_date',
'to_numpy',
'to_period',
'to_pydatetime',
'today',
'toordinal',
'tz',
'tz_convert',
'tz_localize',
'tzinfo',
'tzname',
'utcfromtimestamp',
'utcnow',
'utcoffset',
'utctimetuple',
'value',
'week',
'weekday',
'weekofyear',
'year']
Here we are adding two new columns to the DataFrame. We name them “picktime” and “droptime”.
# syntax for a new column in df
df["picktime"] = pd.to_datetime(df["pickup"])
df["droptime"] = pd.to_datetime(df["dropoff"])
Notice how these new columns have shown up all the way on the right side.
df.head()
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | picktime | droptime | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 |
Make a new column, "day", containing the day of the week of the pickup. Use the accessor `dt` and then the method `day_name`.
If we try to call `day_name` on the Series directly, even though it contains datetime values, we get an error. We need to use the `dt` accessor to apply these kinds of methods to the entries in the Series.

It might be hard for you as students to know what you should remember for a quiz and what is less important. I would like you to remember that the `dt` accessor (and the `str` accessor below) are required.
df["picktime"].day_name()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[34], line 1
----> 1 df["picktime"].day_name()
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/generic.py:5902, in NDFrame.__getattr__(self, name)
5895 if (
5896 name not in self._internal_names_set
5897 and name not in self._metadata
5898 and name not in self._accessors
5899 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5900 ):
5901 return self[name]
-> 5902 return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'day_name'
Here we successfully use the `dt` accessor.
df["day"] = df["picktime"].dt.day_name()
df["day"]
0 Saturday
1 Monday
2 Wednesday
3 Sunday
4 Saturday
...
6428 Sunday
6429 Sunday
6430 Saturday
6431 Monday
6432 Wednesday
Name: day, Length: 6433, dtype: object
Make a new column “duration” containing the amount of time the ride took. (Now that pandas recognizes these as dates, you can just subtract.)
Let’s remind ourselves what columns are available. The ones we want to use are near the end.
df.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
'pickup_borough', 'dropoff_borough', 'picktime', 'droptime', 'day'],
dtype='object')
This sort of subtraction of times would never work if we hadn't converted them to datetime types using `pd.to_datetime`.
df["duration"] = df["droptime"] - df["picktime"]
df["duration"]
0 0 days 00:06:15
1 0 days 00:07:05
2 0 days 00:07:24
3 0 days 00:25:52
4 0 days 00:09:32
...
6428 0 days 00:03:34
6429 0 days 00:56:23
6430 0 days 00:19:07
6431 0 days 00:05:04
6432 0 days 00:16:40
Name: duration, Length: 6433, dtype: timedelta64[ns]
Here we check out the first three rows and first two columns. Does the top row look like a 6 minute taxi ride? Yes, because it went from 8:21pm to 8:27pm.
df.iloc[:3, :2]
pickup | dropoff | |
---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 |
What are the data types of these four columns we just made?
Notice how the first two columns we made are "datetime64[ns]", whereas the last one is a "timedelta64[ns]". When you see "delta", you should usually think "difference", and that's exactly what this column represents. The middle column, "day", is reported as object, because its entries are just strings like `"Monday"`.
df.dtypes
pickup object
dropoff object
passengers int64
distance float64
fare float64
tip float64
tolls float64
total float64
color object
payment object
pickup_zone object
dropoff_zone object
pickup_borough object
dropoff_borough object
picktime datetime64[ns]
droptime datetime64[ns]
day object
duration timedelta64[ns]
dtype: object
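As an aside (not something we did in lecture): the `dt` accessor also works on a timedelta Series. For example, `total_seconds` converts each duration to a plain number of seconds, which can be convenient for plotting or for `describe`. A sketch with made-up durations:

```python
import pandas as pd

# two made-up ride durations, parsed into timedelta values
dur = pd.Series(pd.to_timedelta(["00:06:15", "00:07:05"]))

# dt.total_seconds converts each timedelta to a float number of seconds
secs = dur.dt.total_seconds()
print(secs)  # 375.0 and 425.0
```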
Working with strings in pandas#
How many rows in `df` have the word "Airport" in either the "pickup_zone" column or the "dropoff_zone" column? Do any rows have the word "Airport" in both columns?

Use the `str` accessor and the `contains` method.
Here we check out the first three entries in the “pickup_zone” column.
df["pickup_zone"][:3]
0 Lenox Hill West
1 Upper West Side South
2 Alphabet City
Name: pickup_zone, dtype: object
Here is an example of how `str.contains` works. Only the middle of these three entries contains the substring `"South"`, so we get `False`, then `True`, then `False`.
df["pickup_zone"][:3].str.contains("South")
0 False
1 True
2 False
Name: pickup_zone, dtype: bool
We now do the same thing for the word “Airport”. We save these Boolean Series.
ps = df["pickup_zone"].str.contains("Airport")
ds = df["dropoff_zone"].str.contains("Airport")
Here are the first three values. None of these first three contains the substring `"Airport"`.
ps[:3]
0 False
1 False
2 False
Name: pickup_zone, dtype: object
Here we use Boolean indexing to find all the rows where either the "pickup_zone" or the "dropoff_zone" column contains the substring "Airport". Notice how the logical or is made using a vertical bar `|`. (Don't spell it out in this context. The spelled-out `or` is used in some contexts in Python, but not with NumPy Boolean arrays or pandas Boolean Series.)

Scroll to the right and check that each of these rows really does involve an "Airport".
df[ps | ds]
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | picktime | droptime | day | duration | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 2019-03-16 10:02:25 | 2019-03-16 10:22:29 | 1 | 3.90 | 17.00 | 0.00 | 0.00 | 17.80 | yellow | cash | LaGuardia Airport | Astoria | Queens | Queens | 2019-03-16 10:02:25 | 2019-03-16 10:22:29 | Saturday | 0 days 00:20:04 |
22 | 2019-03-28 17:20:54 | 2019-03-28 17:55:57 | 1 | 9.82 | 31.50 | 8.31 | 5.76 | 49.87 | yellow | credit card | LaGuardia Airport | Sutton Place/Turtle Bay North | Queens | Manhattan | 2019-03-28 17:20:54 | 2019-03-28 17:55:57 | Thursday | 0 days 00:35:03 |
43 | 2019-03-27 19:39:16 | 2019-03-27 20:00:26 | 1 | 9.50 | 28.00 | 0.00 | 5.76 | 35.06 | yellow | cash | LaGuardia Airport | Washington Heights South | Queens | Manhattan | 2019-03-27 19:39:16 | 2019-03-27 20:00:26 | Wednesday | 0 days 00:21:10 |
53 | 2019-03-20 18:49:24 | 2019-03-20 19:12:40 | 1 | 13.45 | 36.00 | 0.00 | 0.00 | 37.80 | yellow | cash | JFK Airport | Sunnyside | Queens | Queens | 2019-03-20 18:49:24 | 2019-03-20 19:12:40 | Wednesday | 0 days 00:23:16 |
56 | 2019-03-05 17:57:00 | 2019-03-05 18:45:25 | 2 | 19.59 | 52.00 | 0.00 | 5.76 | 65.56 | yellow | credit card | JFK Airport | Upper East Side North | Queens | Manhattan | 2019-03-05 17:57:00 | 2019-03-05 18:45:25 | Tuesday | 0 days 00:48:25 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6243 | 2019-03-12 11:40:49 | 2019-03-12 11:49:23 | 1 | 3.50 | 12.00 | 0.00 | 0.00 | 12.80 | green | cash | Steinway | LaGuardia Airport | Queens | Queens | 2019-03-12 11:40:49 | 2019-03-12 11:49:23 | Tuesday | 0 days 00:08:34 |
6280 | 2019-03-27 12:25:04 | 2019-03-27 12:31:15 | 1 | 2.74 | 10.00 | 0.00 | 0.00 | 10.80 | green | credit card | Steinway | LaGuardia Airport | Queens | Queens | 2019-03-27 12:25:04 | 2019-03-27 12:31:15 | Wednesday | 0 days 00:06:11 |
6322 | 2019-03-01 17:03:11 | 2019-03-01 17:17:11 | 1 | 5.39 | 17.00 | 3.76 | 0.00 | 22.56 | green | credit card | Jamaica | JFK Airport | Queens | Queens | 2019-03-01 17:03:11 | 2019-03-01 17:17:11 | Friday | 0 days 00:14:00 |
6396 | 2019-03-15 17:44:09 | 2019-03-15 18:20:25 | 1 | 11.44 | 41.63 | 0.00 | 0.00 | 42.13 | green | credit card | Bushwick South | LaGuardia Airport | Brooklyn | Queens | 2019-03-15 17:44:09 | 2019-03-15 18:20:25 | Friday | 0 days 00:36:16 |
6398 | 2019-03-17 13:07:29 | 2019-03-17 13:51:56 | 1 | 18.39 | 50.50 | 7.70 | 0.00 | 59.00 | green | credit card | DUMBO/Vinegar Hill | JFK Airport | Brooklyn | Queens | 2019-03-17 13:07:29 | 2019-03-17 13:51:56 | Sunday | 0 days 00:44:27 |
407 rows Ă— 18 columns
Here is something similar, but we use `&` instead of `|`. This corresponds to the rows where both the pickup zone and the dropoff zone involved an airport. Many of these rides were very short (less than a minute).
df[ps & ds]
pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | picktime | droptime | day | duration | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
120 | 2019-03-21 17:21:44 | 2019-03-21 17:21:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens | 2019-03-21 17:21:44 | 2019-03-21 17:21:49 | Thursday | 0 days 00:00:05 |
770 | 2019-03-02 03:16:59 | 2019-03-02 03:17:06 | 0 | 9.40 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens | 2019-03-02 03:16:59 | 2019-03-02 03:17:06 | Saturday | 0 days 00:00:07 |
1080 | 2019-03-04 14:17:05 | 2019-03-04 14:17:13 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens | 2019-03-04 14:17:05 | 2019-03-04 14:17:13 | Monday | 0 days 00:00:08 |
1089 | 2019-03-10 01:43:32 | 2019-03-10 01:45:22 | 1 | 0.37 | 3.50 | 0.00 | 0.00 | 4.80 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens | 2019-03-10 01:43:32 | 2019-03-10 01:45:22 | Sunday | 0 days 00:01:50 |
1416 | 2019-03-09 13:16:32 | 2019-03-09 13:46:11 | 2 | 12.39 | 35.00 | 0.00 | 0.00 | 35.80 | yellow | cash | LaGuardia Airport | JFK Airport | Queens | Queens | 2019-03-09 13:16:32 | 2019-03-09 13:46:11 | Saturday | 0 days 00:29:39 |
1929 | 2019-03-13 22:35:35 | 2019-03-13 22:35:49 | 1 | 0.00 | 2.50 | 0.00 | 0.00 | 3.80 | yellow | NaN | JFK Airport | JFK Airport | Queens | Queens | 2019-03-13 22:35:35 | 2019-03-13 22:35:49 | Wednesday | 0 days 00:00:14 |
2387 | 2019-03-28 15:58:52 | 2019-03-28 15:59:25 | 1 | 1.80 | 69.06 | 20.80 | 0.00 | 90.16 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens | 2019-03-28 15:58:52 | 2019-03-28 15:59:25 | Thursday | 0 days 00:00:33 |
3571 | 2019-03-22 16:47:41 | 2019-03-22 16:47:50 | 1 | 0.81 | 66.00 | 0.00 | 0.00 | 66.80 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens | 2019-03-22 16:47:41 | 2019-03-22 16:47:50 | Friday | 0 days 00:00:09 |
4358 | 2019-03-06 18:24:00 | 2019-03-06 18:24:13 | 2 | 0.01 | 2.50 | 0.00 | 0.00 | 4.30 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens | 2019-03-06 18:24:00 | 2019-03-06 18:24:13 | Wednesday | 0 days 00:00:13 |
5095 | 2019-03-30 20:14:44 | 2019-03-30 21:01:28 | 3 | 18.91 | 52.00 | 8.78 | 5.76 | 67.34 | yellow | credit card | JFK Airport | JFK Airport | Queens | Queens | 2019-03-30 20:14:44 | 2019-03-30 21:01:28 | Saturday | 0 days 00:46:44 |
5364 | 2019-03-17 16:59:17 | 2019-03-17 18:04:08 | 2 | 36.70 | 150.00 | 0.00 | 24.02 | 174.82 | yellow | cash | JFK Airport | JFK Airport | Queens | Queens | 2019-03-17 16:59:17 | 2019-03-17 18:04:08 | Sunday | 0 days 01:04:51 |
We didn’t get any further.
Alternative, less elegant, approach. Find the rows where "Airport" is in the "pickup_zone" column by writing an appropriate function and using the `map` method of a pandas Series. Use the keyword argument `na_action="ignore"` so that missing values don't cause errors.
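We didn't write this in class; here is one possible sketch of that `map` approach, on a made-up miniature version of the column (the function name `has_airport` is my own choice):

```python
import pandas as pd

# miniature stand-in for the "pickup_zone" column, with one missing value
ser = pd.Series(["LaGuardia Airport", "Astoria", None, "JFK Airport"])

def has_airport(zone):
    # zone is guaranteed to be a string here, because
    # na_action="ignore" skips the missing values entirely
    return "Airport" in zone

mask = ser.map(has_airport, na_action="ignore")
print(mask)  # True, False, NaN, True
```

Notice the missing value stays missing in `mask`; Boolean indexing with `mask == True` would then keep only the rows that are literally `True`.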
Missing values in pandas#
Look at `df.info()` again. Which columns have missing values?

Define `df_sub` to be the sub-DataFrame containing all the rows for which the "pickup_zone" entry is missing. Use the `isna` method.

What if you tried to check for equality with `np.nan`?
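We didn't get to these in class, so here is a sketch with a made-up miniature DataFrame. The key point for the last question: `np.nan` is not equal to anything, not even itself, so an equality comparison finds no missing values at all; that is why we need `isna`.

```python
import numpy as np
import pandas as pd

# miniature stand-in for the taxis data, with one missing pickup_zone
df = pd.DataFrame({"pickup_zone": ["Astoria", None, "Midtown East"]})

# isna gives a Boolean Series; Boolean indexing keeps the missing rows
df_sub = df[df["pickup_zone"].isna()]
print(len(df_sub))  # 1

# equality with np.nan never matches, because even nan == nan is False
print(np.nan == np.nan)                     # False
print((df["pickup_zone"] == np.nan).sum())  # 0
```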
Time to work on Worksheets 3-4#
Hanson and I are here to help.
I have another class at 2pm, so if you have questions, ask me now rather than after class!