# Week 5 Wednesday
## Announcements
- Solutions to our midterm are posted on the Week 5 page on Canvas. Midterms should be returned next week.
- There is an in-class quiz next week, based on this week’s worksheets.
- Slight change to the routine: no “time to work on worksheets” today, but there will be about 15 minutes at the beginning of class Friday, and Yufei, Hanson, and I will be there to answer questions.
- Reminder: I have office hours Friday before class, 12-1pm, in our usual classroom, ALP 3610.
## Introduction
We’re starting the Machine Learning portion of Math 10. Here is a quote I like (even though it’s somewhat vague) from *Hands-On Machine Learning* by Aurélien Géron:

> Machine Learning is the science (and art) of programming computers so they can learn from data.
Machine Learning is roughly divided into two large categories: supervised and unsupervised machine learning. We will mostly (entirely?) focus on supervised machine learning, where there is typically some correct value that we are trying to predict.
Within supervised machine learning, there are two further sub-categories: regression and classification. In regression problems, we are trying to compute a continuous quantitative value (`":Q"` in the Altair encoding type syntax). In classification problems, we are trying to compute a discrete value (`":N"` or `":O"` in the Altair encoding type syntax).
I don’t think we will cover unsupervised learning this quarter, but we will cover both regression and classification problems. We’ll start with regression. Linear regression in particular is a nice first example of machine learning. Jinghao showed on Tuesday how to generate “artificial” data for linear regression. Here we’ll use a real dataset, and we’ll see how to perform linear regression using scikit-learn.
## Performing linear regression using scikit-learn
```python
import seaborn as sns
import altair as alt
import pandas as pd
```
Load the taxis dataset from Seaborn and drop rows with missing values.
```python
df = sns.load_dataset("taxis")
df = df.dropna(axis=0)
```
Using Altair, make a scatter plot with “fare” on the y-axis and “distance” on the x-axis. Choose 5000 random rows to avoid the `max_rows` error. Here is the error we get if we don’t restrict the DataFrame.
```python
alt.Chart(df).mark_circle().encode(
    x="distance",
    y="fare"
)
```
```
---------------------------------------------------------------------------
MaxRowsError Traceback (most recent call last)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/altair/vegalite/v4/api.py:2020, in Chart.to_dict(self, *args, **kwargs)
2018 copy.data = core.InlineData(values=[{}])
2019 return super(Chart, copy).to_dict(*args, **kwargs)
-> 2020 return super().to_dict(*args, **kwargs)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/altair/vegalite/v4/api.py:374, in TopLevelMixin.to_dict(self, *args, **kwargs)
372 copy = self.copy(deep=False)
373 original_data = getattr(copy, "data", Undefined)
--> 374 copy.data = _prepare_data(original_data, context)
376 if original_data is not Undefined:
377 context["data"] = original_data
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/altair/vegalite/v4/api.py:89, in _prepare_data(data, context)
87 # convert dataframes or objects with __geo_interface__ to dict
88 if isinstance(data, pd.DataFrame) or hasattr(data, "__geo_interface__"):
---> 89 data = _pipe(data, data_transformers.get())
91 # convert string input to a URLData
92 if isinstance(data, str):
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/toolz/functoolz.py:628, in pipe(data, *funcs)
608 """ Pipe a value through a sequence of functions
609
610 I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``
(...)
625 thread_last
626 """
627 for func in funcs:
--> 628 data = func(data)
629 return data
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/toolz/functoolz.py:304, in curry.__call__(self, *args, **kwargs)
302 def __call__(self, *args, **kwargs):
303 try:
--> 304 return self._partial(*args, **kwargs)
305 except TypeError as exc:
306 if self._should_curry(args, kwargs, exc):
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/altair/vegalite/data.py:19, in default_data_transformer(data, max_rows)
17 @curried.curry
18 def default_data_transformer(data, max_rows=5000):
---> 19 return curried.pipe(data, limit_rows(max_rows=max_rows), to_values)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/toolz/functoolz.py:628, in pipe(data, *funcs)
608 """ Pipe a value through a sequence of functions
609
610 I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``
(...)
625 thread_last
626 """
627 for func in funcs:
--> 628 data = func(data)
629 return data
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/toolz/functoolz.py:304, in curry.__call__(self, *args, **kwargs)
302 def __call__(self, *args, **kwargs):
303 try:
--> 304 return self._partial(*args, **kwargs)
305 except TypeError as exc:
306 if self._should_curry(args, kwargs, exc):
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/altair/utils/data.py:80, in limit_rows(data, max_rows)
78 return data
79 if max_rows is not None and len(values) > max_rows:
---> 80 raise MaxRowsError(
81 "The number of rows in your dataset is greater "
82 "than the maximum allowed ({}). "
83 "For information on how to plot larger datasets "
84 "in Altair, see the documentation".format(max_rows)
85 )
86 return data
MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000). For information on how to plot larger datasets in Altair, see the documentation
alt.Chart(...)
```
Here we are using the `sample` method to get 5000 random rows from the DataFrame. There’s no reason to use `random_state` here, because we’re just trying to get a quick overview of how the data looks.
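As an aside (my addition, not something we need in this course), Altair’s data transformers offer another way around the limit. For example, the following call disables the 5000-row cap entirely, at the cost of embedding the full dataset in the chart.

```python
# Alternative to df.sample(5000): lift Altair's 5000-row cap entirely.
# (This can make notebooks large and slow for big DataFrames.)
alt.data_transformers.disable_max_rows()
```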
As a fun aside, notice the plateau around fare 50. By adding a tooltip, we can see that these mostly correspond to airport rides, where there is (or was) a fixed rate.
```python
alt.Chart(df.sample(5000)).mark_circle().encode(
    x="distance",
    y="fare",
    tooltip=["dropoff_zone", "pickup_zone", "fare"]
)
```
But the airport rides part was just for fun. The most important thing to notice is that this data is clearly well-suited to linear regression. It lies roughly on a line, but not perfectly on a line.
What would you estimate is the slope of the “line of best fit” for this data?
The line might go from (2, 7) to (4, 12), for example, so maybe the slope will be (12 − 7)/(4 − 2) = 2.5?
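As a quick cross-check (an aside using NumPy, not the approach we take below), `np.polyfit` with degree 1 fits a line and returns its slope and intercept; it should roughly agree with the estimate above and with the scikit-learn answer we find next.

```python
import numpy as np

# Fit a degree-1 polynomial (a line) to the data; polyfit returns the
# coefficients from highest degree to lowest: (slope, intercept).
slope, intercept = np.polyfit(df["distance"], df["fare"], 1)
slope
```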
Find this slope using the `LinearRegression` class from scikit-learn.
There is a routine in scikit-learn that we will see many times:

1. Import
2. Instantiate (meaning create an object, a.k.a. an instance, of the appropriate class)
3. Fit
4. Predict

We will see this pattern many times, starting right now (all four steps are gathered in the sketch below). The more comfortable you get with this routine, the more comfortable you will be with conventions in Object-Oriented Programming.
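Here is the whole routine in one place, as a preview. This is only a sketch of the steps we carry out one at a time below; the double square brackets in `df[["distance"]]` are explained later in these notes.

```python
from sklearn.linear_model import LinearRegression   # 1. import

reg = LinearRegression()                            # 2. instantiate
reg.fit(df[["distance"]], df["fare"])               # 3. fit (input features must be 2-dimensional)
reg.predict(df[["distance"]])                       # 4. predict
```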
Here we import the `LinearRegression` class.
```python
# import
from sklearn.linear_model import LinearRegression
```
Here we create a `LinearRegression` object, and name it `reg` (for regression).
```python
# instantiate (make an instance)
reg = LinearRegression()
```
Notice that this is a `LinearRegression` object. (That class does not exist in base Python; it is defined by scikit-learn.) This emphasis on objects (as opposed to, for example, functions) is the standard in Object-Oriented Programming.
```python
type(reg)
```
```
sklearn.linear_model._base.LinearRegression
```
Now we try to fit the `LinearRegression` object, but an error is raised. I guarantee you will also encounter this same error.
reg.fit(df["distance"], df["fare"])
```
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[9], line 1
----> 1 reg.fit(df["distance"], df["fare"])
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/sklearn/linear_model/_base.py:648, in LinearRegression.fit(self, X, y, sample_weight)
644 n_jobs_ = self.n_jobs
646 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 648 X, y = self._validate_data(
649 X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
650 )
652 sample_weight = _check_sample_weight(
653 sample_weight, X, dtype=X.dtype, only_non_negative=True
654 )
656 X, y, X_offset, y_offset, X_scale = _preprocess_data(
657 X,
658 y,
(...)
661 sample_weight=sample_weight,
662 )
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/sklearn/base.py:584, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
582 y = check_array(y, input_name="y", **check_y_params)
583 else:
--> 584 X, y = check_X_y(X, y, **check_params)
585 out = X, y
587 if not no_val_X and check_params.get("ensure_2d", True):
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/sklearn/utils/validation.py:1106, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
1101 estimator_name = _check_estimator_name(estimator)
1102 raise ValueError(
1103 f"{estimator_name} requires y to be passed, but the target y is None"
1104 )
-> 1106 X = check_array(
1107 X,
1108 accept_sparse=accept_sparse,
1109 accept_large_sparse=accept_large_sparse,
1110 dtype=dtype,
1111 order=order,
1112 copy=copy,
1113 force_all_finite=force_all_finite,
1114 ensure_2d=ensure_2d,
1115 allow_nd=allow_nd,
1116 ensure_min_samples=ensure_min_samples,
1117 ensure_min_features=ensure_min_features,
1118 estimator=estimator,
1119 input_name="X",
1120 )
1122 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
1124 check_consistent_length(X, y)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/sklearn/utils/validation.py:902, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
900 # If input is 1D raise error
901 if array.ndim == 1:
--> 902 raise ValueError(
903 "Expected 2D array, got 1D array instead:\narray={}.\n"
904 "Reshape your data either using array.reshape(-1, 1) if "
905 "your data has a single feature or array.reshape(1, -1) "
906 "if it contains a single sample.".format(array)
907 )
909 if dtype_numeric and array.dtype.kind in "USV":
910 raise ValueError(
911 "dtype='numeric' is not compatible with arrays of bytes/strings."
912 "Convert your data to numeric values explicitly instead."
913 )
ValueError: Expected 2D array, got 1D array instead:
array=[1.6 0.79 1.37 ... 4.14 1.12 3.85].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
```
The problem is that scikit-learn wants two-dimensional input data, whereas we passed a pandas Series, which should be viewed as one-dimensional. (This two-dimensionality is also why the documentation uses a capital letter `X` for the input features, but a lower-case `y` for the target. The capital letter is a reminder that `X` should be two-dimensional.)
df["distance"]
0 1.60
1 0.79
2 1.37
3 7.70
4 2.16
...
6428 0.75
6429 18.74
6430 4.14
6431 1.12
6432 3.85
Name: distance, Length: 6341, dtype: float64
Notice the subtle difference in how the following is presented. The above was a pandas Series, but the following is a one-column pandas DataFrame.
# ["distance"] is a length-1 list
df[["distance"]]
|      | distance |
|------|----------|
| 0    | 1.60     |
| 1    | 0.79     |
| 2    | 1.37     |
| 3    | 7.70     |
| 4    | 2.16     |
| ...  | ...      |
| 6428 | 0.75     |
| 6429 | 18.74    |
| 6430 | 4.14     |
| 6431 | 1.12     |
| 6432 | 3.85     |

6341 rows × 1 columns
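One quick way to see the distinction (my addition, not from class): ask Python for the type of each.

```python
# Single brackets yield a pandas Series (1-dimensional);
# double brackets yield a pandas DataFrame (2-dimensional).
type(df["distance"]), type(df[["distance"]])
```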
Here is some more evidence that `df[["distance"]]` could be viewed as being two-dimensional: notice how there are two integers reported by the `shape` attribute, one for each dimension. It just so happens that the columns dimension has length one.
df[["distance"]].shape
(6341, 1)
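As the error message above suggested, another option (which we won’t use, but it’s good to recognize) is to reshape the underlying NumPy array so it has one column.

```python
# The error message's suggestion: reshape to (n_samples, 1),
# i.e., one column per feature.
X = df["distance"].to_numpy().reshape(-1, 1)
X.shape  # (6341, 1), the same shape as df[["distance"]]
```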
Using these double square brackets (which you should think of as a list inside the outer square brackets), we are able to call the `fit` method without error.
reg.fit(df[["distance"]], df["fare"])
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
Now `reg` has done all the hard work of finding a linear equation that approximates our taxi data (just the “fare” column as a linear function of the “distance” column). The slope of the resulting line is stored in the `coef_` attribute.
```python
reg.coef_
```
```
array([2.73248966])
```
Find the intercept.
The intercept is stored in the `intercept_` attribute.

```python
reg.intercept_
```
What are the predicted outputs for the first 5 rows? What are the actual outputs?
Here are those first 5 rows.
```python
df[:5]
```
|   | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
| 1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
| 3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
| 4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
When calling the `predict` method, we do not pass the target values, just the input values. The `predict` method then returns the outputs of the linear equation.

Here are the first 5 outputs.
```python
reg.predict(df[:5][["distance"]])
```
```
array([ 9.06871104,  6.85539442,  8.44023842, 25.73689794, 10.59890525])
```
Something fancy is happening when we call `reg.fit`, but nothing fancy is happening when we call `reg.predict`: scikit-learn is simply evaluating the linear function. For example, notice that one of the rows had a distance of `7.70` miles (I assume the unit is miles, but I’m not certain).

Here is the corresponding output. Looking in the above DataFrame, we see that the true output was `27.0`, so we are pretty close with our `25.7` estimate.
```python
reg.coef_*7.7 + reg.intercept_
```
```
array([25.73689794])
```
Notice that `reg.coef_` is a length-one NumPy array. If we just want a number, we can extract the element from it using indexing.
```python
reg.coef_[0]*7.7 + reg.intercept_
```
```
25.736897944692878
```
## Interpreting linear regression coefficients
Add a new column to the DataFrame, called “hour”, which contains the hour at which the pickup occurred.
Notice that we already have date-time data type values in the first two columns, so there is no need to use `pd.to_datetime`.
```python
df.dtypes
```
```
pickup             datetime64[ns]
dropoff            datetime64[ns]
passengers                  int64
distance                  float64
fare                      float64
tip                       float64
tolls                     float64
total                     float64
color                      object
payment                    object
pickup_zone                object
dropoff_zone               object
pickup_borough             object
dropoff_borough            object
dtype: object
```
Here we add an “hour” column.
df["hour"] = df["pickup"].dt.hour
Notice how that new column has shown up on the right side.
```python
df
```
|   | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | 20 |
| 1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | 16 |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | 17 |
| 3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | 1 |
| 4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | 13 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 1 | 0.75 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | East Harlem North | Central Harlem North | Manhattan | Manhattan | 9 |
| 6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx | 17 |
| 6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn | 22 |
| 6431 | 2019-03-04 10:09:25 | 2019-03-04 10:14:29 | 1 | 1.12 | 6.0 | 0.00 | 0.0 | 6.80 | green | credit card | East New York | East Flatbush/Remsen Village | Brooklyn | Brooklyn | 10 |
| 6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn | 19 |

6341 rows × 15 columns
Remove all rows from the DataFrame where the hour is `16` or earlier. (So we are only using late afternoon and evening taxi rides.)
In class, we originally left off the `copy()`, and we got a warning below when we tried to add a new column. That’s why we are using `copy` here.
df2 = df[df["hour"] > 16].copy()
Add a new column to the DataFrame, called “duration”, which contains the amount of time in minutes of the taxi ride.
Hint 1. Because the “dropoff” and “pickup” columns are already date-time values, we can subtract one from the other and pandas will know what to do.
Hint 2. I expected there to be a `minutes` attribute (after using the `dt` accessor), but there wasn’t. Call `dir` to see some options.
df2["duration"] = df2["dropoff"] - df2["pickup"]
I expected the following to work, but it didn’t.
df2["duration"].dt.minutes
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[25], line 1
----> 1 df2["duration"].dt.minutes
AttributeError: 'TimedeltaProperties' object has no attribute 'minutes'
The error message is telling us that `df2["duration"].dt` does not have an attribute `minutes`. We can see all the attributes available to us using the built-in Python function `dir`. In general, you should ignore the attributes and methods surrounded by two underscores (these are usually called “dunder” methods, short for “double underscore”).
dir(df2["duration"].dt)
['__annotations__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__frozen',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_accessors',
'_add_delegate_accessors',
'_constructor',
'_delegate_method',
'_delegate_property_get',
'_delegate_property_set',
'_dir_additions',
'_dir_deletions',
'_freeze',
'_get_values',
'_hidden_attrs',
'_parent',
'_reset_cache',
'ceil',
'components',
'days',
'floor',
'freq',
'microseconds',
'nanoseconds',
'round',
'seconds',
'to_pytimedelta',
 'total_seconds']
```
There actually aren’t too many options. (If you compare this to a NumPy array, for example, you will see many more options for the NumPy array.) It seems like `seconds` is probably the most relevant.

In the following output, we can see the current value of the “duration” column on the right side.
```python
df2.head()
```
|   | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | hour | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | 20 | 0 days 00:06:15 |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | 17 | 0 days 00:07:24 |
| 6 | 2019-03-26 21:07:31 | 2019-03-26 21:17:29 | 1 | 3.65 | 13.0 | 2.00 | 0.0 | 18.80 | yellow | credit card | Battery Park City | Two Bridges/Seward Park | Manhattan | Manhattan | 21 | 0 days 00:09:58 |
| 11 | 2019-03-20 19:39:42 | 2019-03-20 19:45:36 | 1 | 1.53 | 6.5 | 2.16 | 0.0 | 12.96 | yellow | credit card | Upper West Side South | Manhattan Valley | Manhattan | Manhattan | 19 | 0 days 00:05:54 |
| 12 | 2019-03-18 21:27:14 | 2019-03-18 21:34:16 | 1 | 1.05 | 6.5 | 1.00 | 0.0 | 11.30 | yellow | credit card | Murray Hill | Midtown Center | Manhattan | Manhattan | 21 | 0 days 00:07:02 |
Let’s start off by repeating our “duration” code from above (so that if we make a mistake, we don’t have to go searching for that code; we can just re-execute this one cell). We then get the number of seconds and divide by 60 to get the number of minutes. (There’s no advantage to rounding here; we’re okay with decimals.)
df2["duration"] = df2["dropoff"] - df2["pickup"]
df2["duration"] = df2["duration"].dt.seconds/60
Notice how the right-hand side now shows the duration of the taxi ride, in minutes.
```python
df2
```
|   | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | hour | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | 20 | 6.250000 |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | 17 | 7.400000 |
| 6 | 2019-03-26 21:07:31 | 2019-03-26 21:17:29 | 1 | 3.65 | 13.0 | 2.00 | 0.0 | 18.80 | yellow | credit card | Battery Park City | Two Bridges/Seward Park | Manhattan | Manhattan | 21 | 9.966667 |
| 11 | 2019-03-20 19:39:42 | 2019-03-20 19:45:36 | 1 | 1.53 | 6.5 | 2.16 | 0.0 | 12.96 | yellow | credit card | Upper West Side South | Manhattan Valley | Manhattan | Manhattan | 19 | 5.900000 |
| 12 | 2019-03-18 21:27:14 | 2019-03-18 21:34:16 | 1 | 1.05 | 6.5 | 1.00 | 0.0 | 11.30 | yellow | credit card | Murray Hill | Midtown Center | Manhattan | Manhattan | 21 | 7.033333 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6424 | 2019-03-30 20:52:15 | 2019-03-30 20:59:55 | 1 | 1.70 | 8.0 | 0.00 | 0.0 | 9.30 | green | cash | Central Harlem | Central Harlem North | Manhattan | Manhattan | 20 | 7.666667 |
| 6427 | 2019-03-23 18:26:09 | 2019-03-23 18:49:12 | 1 | 7.07 | 20.0 | 0.00 | 0.0 | 20.00 | green | cash | Parkchester | East Harlem South | Bronx | Manhattan | 18 | 23.050000 |
| 6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 1 | 18.74 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Jamaica | East Concourse/Concourse Village | Queens | Bronx | 17 | 56.383333 |
| 6430 | 2019-03-23 22:55:18 | 2019-03-23 23:14:25 | 1 | 4.14 | 16.0 | 0.00 | 0.0 | 17.30 | green | cash | Crown Heights North | Bushwick North | Brooklyn | Brooklyn | 22 | 19.116667 |
| 6432 | 2019-03-13 19:31:22 | 2019-03-13 19:48:02 | 1 | 3.85 | 15.0 | 3.36 | 0.0 | 20.16 | green | credit card | Boerum Hill | Windsor Terrace | Brooklyn | Brooklyn | 19 | 16.666667 |

2524 rows × 16 columns
We were going pretty fast through this last section, but we did get through most of it.
Fit a new `LinearRegression` object, this time using “distance”, “hour”, and “passengers” as the input features, and using “duration” as the target value.
Here we instantiate a new `LinearRegression` object. (It would also be okay to overwrite the old one, just by calling `fit` on it with the new data.)
```python
reg2 = LinearRegression()
```
Now we are using three input variables, rather than just one. We also have a new target variable: “duration”. Our overall question is: how can we predict “duration” using “distance”, “hour”, and “passengers”?
reg2.fit(df2[["distance", "hour", "passengers"]], df2["duration"])
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
Here are the resulting coefficients.
```python
reg2.coef_
```
```
array([ 2.40296826, -0.56721926, -0.01271133])
```
Pay particular attention to their signs. It certainly makes intuitive sense that a greater distance corresponds to a longer taxi ride; this corresponds to the first coefficient, `2.4`, being positive. These coefficients are like partial derivatives: this one says that as we go one more mile (I think the unit is miles), the predicted length of the ride goes up by about 2.4 minutes.
The reason we restricted to times from 5pm and beyond is that I wanted us to be getting further and further from rush hour as the hour gets larger. So you would expect the taxi rides to take less time as the hour increases; that’s why the “hour” coefficient is negative.

Lastly, we have a negative coefficient for “passengers”. But notice also that it is quite a small number, which makes intuitive sense: the number of passengers does not play much of a role in the duration of the taxi ride.
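To see the partial-derivative interpretation in action, here is a quick check with made-up input values (the specific ride below is hypothetical): two rides identical except for the pickup hour should differ in predicted duration by the “hour” coefficient.

```python
import pandas as pd

# Two hypothetical rides: same distance and passengers, pickup one hour apart.
rides = pd.DataFrame(
    {"distance": [3.0, 3.0], "hour": [18, 19], "passengers": [1, 1]}
)
preds = reg2.predict(rides)
preds[1] - preds[0]  # approximately -0.567, matching the "hour" coefficient
```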