Linear regression worksheet
Here are some linear regression questions, since there weren’t any linear regression questions on the sample midterm.
import seaborn as sns
import pandas as pd
import altair as alt
from sklearn.linear_model import LinearRegression
taxis = sns.load_dataset('taxis')
taxis.dropna(inplace=True)
taxis.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
'pickup_borough', 'dropoff_borough'],
dtype='object')
Question 1: Fit a linear regression model to the taxis data using “distance”,”pickup hour”,”tip” as input (predictor) variables, and using “total” as the output (target) variable. (Note: “pickup hour” is not in the original DataFrame, so you will have to create it yourself.)
The first task is to create the “pickup hour” column. We get an error if we try to use .dt.hour directly.
taxis["pickup"].dt.hour
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [3], in <module>
----> 1 taxis["pickup"].dt.hour
File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/core/generic.py:5583, in NDFrame.__getattr__(self, name)
5576 if (
5577 name not in self._internal_names_set
5578 and name not in self._metadata
5579 and name not in self._accessors
5580 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5581 ):
5582 return self[name]
-> 5583 return object.__getattribute__(self, name)
File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/core/accessor.py:182, in CachedAccessor.__get__(self, obj, cls)
179 if obj is None:
180 # we're accessing the attribute of the class, i.e., Dataset.geo
181 return self._accessor
--> 182 accessor_obj = self._accessor(obj)
183 # Replace the property with the accessor object. Inspired by:
184 # https://www.pydanny.com/cached-property.html
185 # We need to use object.__setattr__ because we overwrite __setattr__ on
186 # NDFrame
187 object.__setattr__(obj, self._name, accessor_obj)
File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/core/indexes/accessors.py:509, in CombinedDatetimelikeProperties.__new__(cls, data)
506 elif is_period_dtype(data.dtype):
507 return PeriodProperties(data, orig)
--> 509 raise AttributeError("Can only use .dt accessor with datetimelike values")
AttributeError: Can only use .dt accessor with datetimelike values
The problem is these entries are strings.
taxis["pickup"]
0 2019-03-23 20:21:09
1 2019-03-04 16:11:55
2 2019-03-27 17:53:01
3 2019-03-10 01:23:59
4 2019-03-30 13:27:42
...
6428 2019-03-31 09:51:53
6429 2019-03-31 17:38:00
6430 2019-03-23 22:55:18
6431 2019-03-04 10:09:25
6432 2019-03-13 19:31:22
Name: pickup, Length: 6341, dtype: object
type(taxis.iloc[0]["pickup"])
str
taxis["pickup"].map(type)
0 <class 'str'>
1 <class 'str'>
2 <class 'str'>
3 <class 'str'>
4 <class 'str'>
...
6428 <class 'str'>
6429 <class 'str'>
6430 <class 'str'>
6431 <class 'str'>
6432 <class 'str'>
Name: pickup, Length: 6341, dtype: object
We convert the entries using pd.to_datetime.
pd.to_datetime(taxis["pickup"])
0 2019-03-23 20:21:09
1 2019-03-04 16:11:55
2 2019-03-27 17:53:01
3 2019-03-10 01:23:59
4 2019-03-30 13:27:42
...
6428 2019-03-31 09:51:53
6429 2019-03-31 17:38:00
6430 2019-03-23 22:55:18
6431 2019-03-04 10:09:25
6432 2019-03-13 19:31:22
Name: pickup, Length: 6341, dtype: datetime64[ns]
Now we can use .dt.hour.
pd.to_datetime(taxis["pickup"]).dt.hour
0 20
1 16
2 17
3 1
4 13
..
6428 9
6429 17
6430 22
6431 10
6432 19
Name: pickup, Length: 6341, dtype: int64
We put this new column into the DataFrame with the column name “pickup hour”.
taxis["pickup hour"] = pd.to_datetime(taxis["pickup"]).dt.hour
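As a side note, when every timestamp follows the same pattern, passing an explicit format string to pd.to_datetime skips format inference and can be noticeably faster on large columns. A minimal sketch on a hypothetical two-row stand-in for the “pickup” column (the timestamps below are copied from the output above, but the DataFrame itself is made up):

```python
import pandas as pd

# Hypothetical mini-DataFrame standing in for taxis["pickup"] (string entries).
df = pd.DataFrame({"pickup": ["2019-03-23 20:21:09", "2019-03-04 16:11:55"]})

# An explicit format matching the timestamps avoids per-entry format inference.
df["pickup hour"] = pd.to_datetime(df["pickup"], format="%Y-%m-%d %H:%M:%S").dt.hour

print(df["pickup hour"].tolist())  # [20, 16]
```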
The column now appears on the right side of the DataFrame.
taxis.head()
| | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | pickup hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | 20 |
| 1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | 16 |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | 17 |
| 3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | 1 |
| 4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | 13 |
That was preliminary work (an example of “feature engineering”). Now we start on the linear regression portion.
We first create (instantiate) the linear regression object.
reg = LinearRegression()
Now we fit this linear regression object to the data. Make sure to use double square brackets for the input features: the inner square brackets make a list of column names, so the result is a two-dimensional DataFrame rather than a one-dimensional Series.
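The single-versus-double bracket distinction is easy to check on any DataFrame (the tiny frame below is invented for illustration):

```python
import pandas as pd

# Hypothetical two-row DataFrame.
df = pd.DataFrame({"distance": [1.6, 0.79], "tip": [2.15, 0.0]})

# Single brackets give a 1-D Series; double brackets give a 2-D DataFrame,
# which is the shape scikit-learn expects for the input features.
print(df["distance"].shape)    # (2,)
print(df[["distance"]].shape)  # (2, 1)
```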
reg.fit(taxis[["distance", "pickup hour", "tip"]], taxis["total"])
LinearRegression()
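Once fitted, the model can make predictions with reg.predict. A minimal sketch on hypothetical stand-in data (the numbers below are invented, not drawn from the taxis dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for taxis[["distance", "pickup hour", "tip"]] and taxis["total"].
X = np.array([[1.60, 20, 2.15],
              [0.79, 16, 0.00],
              [7.70,  1, 6.15],
              [2.16, 13, 1.10]])
y = np.array([12.95, 9.30, 36.95, 13.40])

reg = LinearRegression().fit(X, y)

# Predict the total for a hypothetical 5-mile ride picked up at 6pm with a $3 tip.
pred = reg.predict(np.array([[5.0, 18.0, 3.0]]))
print(pred.shape)  # (1,)
```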
Question 2: (a) How much does your model predict is the rate per mile?
(b) Does your model predict that taxi rides get more or less expensive later in the day?
(c) To me, the “tips” coefficient seems incorrect. Do you see why I think that? Do you agree or do you see an explanation of why it makes sense?
reg.coef_
array([2.79303224, 0.04437202, 1.46653112])
reg.intercept_
6.427803918508337
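A prediction is just the dot product of the coefficients with the feature values, plus the intercept. As a sanity check with the coefficients printed above, for a hypothetical 10-mile ride picked up at noon with a $5 tip:

```python
import numpy as np

# Coefficients and intercept reported above.
coef = np.array([2.79303224, 0.04437202, 1.46653112])
intercept = 6.427803918508337

# Hypothetical ride: 10 miles, pickup hour 12, $5 tip.
x = np.array([10.0, 12.0, 5.0])
total = coef @ x + intercept
print(round(total, 2))  # 42.22
```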
(a) The “distance” coefficient suggests a rate of about $2.79 per mile.
(b) The 0.04 coefficient on “pickup hour” suggests a slight increase in price as the taxi ride occurs later in the day (hours run from 0, midnight, as earliest, to 23, 11pm, as latest). We say increase rather than decrease because the coefficient is positive.
(c) Since the true formula for the total cost of the taxi ride includes the tip exactly once (a coefficient of exactly 1, as opposed to 1.466*tip), it is surprising that the fitted tip coefficient is so far from 1.
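One possible explanation is omitted-variable bias: tips tend to scale with the fare, and the fare is not among the input variables, so the tip coefficient also absorbs part of the fare's contribution to the total. A synthetic sketch (all numbers invented) reproduces the effect:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000

distance = rng.uniform(0.5, 20, n)
# The fare depends on distance plus factors the model never sees (traffic, duration).
fare = 2.5 * distance + rng.normal(0, 3, n)
# Tips tend to scale with the fare, plus personal variation.
tip = 0.2 * fare + rng.normal(0, 1, n)
total = fare + tip  # tip enters the true total exactly once

# Regress total on distance and tip only, as in Question 1 (fare is omitted).
reg = LinearRegression().fit(np.column_stack([distance, tip]), total)
tip_coef = reg.coef_[1]
# Because tip is a proxy for the unobserved part of the fare,
# its fitted coefficient comes out well above 1.
print(tip_coef > 1)  # True
```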
Question 3: Let m
be your calculated “distance” coefficient and let b
be your calculated intercept. Use the plotting function defined below to see how accurate the line seems.
def draw_line(m, b):
    alt.data_transformers.disable_max_rows()
    c1 = alt.Chart(taxis).mark_circle().encode(
        x="distance",
        y="total"
    )
    # Two points determine the line total = m*distance + b, from x=0 to x=xmax.
    xmax = 40
    df_line = pd.DataFrame({"distance": [0, xmax], "total": [b, b + m * xmax]})
    c2 = alt.Chart(df_line).mark_line(color="red").encode(
        x="distance",
        y="total"
    )
    return c1 + c2
This result looks okay, but it does not match the data very closely; the line sits noticeably below the points. That makes sense, because this estimate of total as approximately 2.8*distance + 6.4 is missing the tip data (and the hour data, though that is less impactful).
draw_line(2.8, 6.4)
Question 4: Fit a new linear regression model using only “distance” as your input and “total” as your output (no other variables). Does the resulting plot seem more or less accurate?
reg2 = LinearRegression()
reg2.fit(taxis[["distance"]], taxis["total"])
LinearRegression()
reg2.coef_
array([3.23508597])
reg2.intercept_
8.612423554391206
This line looks significantly more accurate (as it should, since it is the “best” line for predicting total from distance alone, in the sense that it minimizes the Mean Squared Error).
draw_line(3.23, 8.6)
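That “best line” claim can be checked numerically: the MSE of the fitted line is never larger than the MSE of any other line, including the (m, b) pair from Question 3. A sketch on hypothetical stand-in data (the data-generating numbers below are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Hypothetical stand-in for the taxis data: total roughly linear in distance.
distance = rng.uniform(0, 40, 200)
total = 3.2 * distance + 8.6 + rng.normal(0, 4, 200)

reg = LinearRegression().fit(distance.reshape(-1, 1), total)
mse_fit = mean_squared_error(total, reg.predict(distance.reshape(-1, 1)))
# Compare against any other line, e.g. the (m, b) used in Question 3.
mse_other = mean_squared_error(total, 2.8 * distance + 6.4)

# The least-squares line minimizes MSE over all lines.
print(mse_fit <= mse_other)  # True
```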