Linear regression worksheet

Here are some linear regression practice questions, since the sample midterm did not include any.

import seaborn as sns
import pandas as pd
import altair as alt
from sklearn.linear_model import LinearRegression
taxis = sns.load_dataset('taxis')
taxis.dropna(inplace=True)
taxis.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
       'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
       'pickup_borough', 'dropoff_borough'],
      dtype='object')

Question 1: Fit a linear regression model to the taxis data using “distance”, “pickup hour”, and “tip” as input (predictor) variables and “total” as the output (target) variable. (Note: “pickup hour” is not in the original DataFrame, so you will have to create it yourself.)

The first task is to create the “pickup hour” column. We get an error if we try to use .dt.hour directly.

taxis["pickup"].dt.hour
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [3], in <module>
----> 1 taxis["pickup"].dt.hour

File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/core/generic.py:5583, in NDFrame.__getattr__(self, name)
   5576 if (
   5577     name not in self._internal_names_set
   5578     and name not in self._metadata
   5579     and name not in self._accessors
   5580     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5581 ):
   5582     return self[name]
-> 5583 return object.__getattribute__(self, name)

File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/core/accessor.py:182, in CachedAccessor.__get__(self, obj, cls)
    179 if obj is None:
    180     # we're accessing the attribute of the class, i.e., Dataset.geo
    181     return self._accessor
--> 182 accessor_obj = self._accessor(obj)
    183 # Replace the property with the accessor object. Inspired by:
    184 # https://www.pydanny.com/cached-property.html
    185 # We need to use object.__setattr__ because we overwrite __setattr__ on
    186 # NDFrame
    187 object.__setattr__(obj, self._name, accessor_obj)

File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/core/indexes/accessors.py:509, in CombinedDatetimelikeProperties.__new__(cls, data)
    506 elif is_period_dtype(data.dtype):
    507     return PeriodProperties(data, orig)
--> 509 raise AttributeError("Can only use .dt accessor with datetimelike values")

AttributeError: Can only use .dt accessor with datetimelike values

The problem is that these entries are strings, not datetime values.

taxis["pickup"]
0       2019-03-23 20:21:09
1       2019-03-04 16:11:55
2       2019-03-27 17:53:01
3       2019-03-10 01:23:59
4       2019-03-30 13:27:42
               ...         
6428    2019-03-31 09:51:53
6429    2019-03-31 17:38:00
6430    2019-03-23 22:55:18
6431    2019-03-04 10:09:25
6432    2019-03-13 19:31:22
Name: pickup, Length: 6341, dtype: object
type(taxis.iloc[0]["pickup"])
str
taxis["pickup"].map(type)
0       <class 'str'>
1       <class 'str'>
2       <class 'str'>
3       <class 'str'>
4       <class 'str'>
            ...      
6428    <class 'str'>
6429    <class 'str'>
6430    <class 'str'>
6431    <class 'str'>
6432    <class 'str'>
Name: pickup, Length: 6341, dtype: object

We convert the entries using pd.to_datetime.

pd.to_datetime(taxis["pickup"])
0      2019-03-23 20:21:09
1      2019-03-04 16:11:55
2      2019-03-27 17:53:01
3      2019-03-10 01:23:59
4      2019-03-30 13:27:42
               ...        
6428   2019-03-31 09:51:53
6429   2019-03-31 17:38:00
6430   2019-03-23 22:55:18
6431   2019-03-04 10:09:25
6432   2019-03-13 19:31:22
Name: pickup, Length: 6341, dtype: datetime64[ns]

Now we can use .dt.hour.

pd.to_datetime(taxis["pickup"]).dt.hour
0       20
1       16
2       17
3        1
4       13
        ..
6428     9
6429    17
6430    22
6431    10
6432    19
Name: pickup, Length: 6341, dtype: int64
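
As a self-contained check of the two steps above, here is a minimal sketch on a made-up three-row Series. The timestamps are copied from the output above; the names s, converted, and hours are only for illustration.

```python
import pandas as pd

# Toy Series of timestamp strings, in the same format as the "pickup" column.
s = pd.Series(["2019-03-23 20:21:09", "2019-03-04 16:11:55", "2019-03-10 01:23:59"])

# Before conversion the entries are plain Python strings, so .dt is unavailable.
print(type(s.iloc[0]))

# pd.to_datetime parses the strings into a datetime Series.
converted = pd.to_datetime(s)
print(converted.dtype)

# The .dt accessor now works; .dt.hour gives the hour as an integer from 0 to 23.
hours = converted.dt.hour
print(hours.tolist())  # [20, 16, 1]
```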

We put this new column into the DataFrame with the column name “pickup hour”.

taxis["pickup hour"] = pd.to_datetime(taxis["pickup"]).dt.hour

The column now appears on the right side of the DataFrame.

taxis.head()
                pickup              dropoff  passengers  distance  fare   tip  tolls  total   color      payment            pickup_zone           dropoff_zone  pickup_borough  dropoff_borough  pickup hour
0  2019-03-23 20:21:09  2019-03-23 20:27:24           1      1.60   7.0  2.15    0.0  12.95  yellow  credit card        Lenox Hill West    UN/Turtle Bay South       Manhattan        Manhattan           20
1  2019-03-04 16:11:55  2019-03-04 16:19:00           1      0.79   5.0  0.00    0.0   9.30  yellow         cash  Upper West Side South  Upper West Side South       Manhattan        Manhattan           16
2  2019-03-27 17:53:01  2019-03-27 18:00:25           1      1.37   7.5  2.36    0.0  14.16  yellow  credit card          Alphabet City           West Village       Manhattan        Manhattan           17
3  2019-03-10 01:23:59  2019-03-10 01:49:51           1      7.70  27.0  6.15    0.0  36.95  yellow  credit card              Hudson Sq         Yorkville West       Manhattan        Manhattan            1
4  2019-03-30 13:27:42  2019-03-30 13:37:14           3      2.16   9.0  1.10    0.0  13.40  yellow  credit card           Midtown East         Yorkville West       Manhattan        Manhattan           13

That was preliminary work (an example of “feature engineering”). Now we start on the linear regression portion.

We first create (instantiate) the linear regression object.

reg = LinearRegression()

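Now we fit this linear regression object to the data. Make sure to use double square brackets: the inner brackets make a list of column names, and indexing with that list returns a DataFrame (a two-dimensional input, which is what fit expects).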

reg.fit(taxis[["distance", "pickup hour", "tip"]], taxis["total"])
LinearRegression()
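
To see what the double square brackets change, here is a small sketch comparing shapes on a made-up DataFrame (the name df and its three rows are only for illustration).

```python
import pandas as pd

# Hypothetical toy DataFrame, only used to compare shapes.
df = pd.DataFrame({"distance": [1.6, 0.79, 1.37], "tip": [2.15, 0.0, 2.36]})

one_d = df["distance"]           # single brackets: a Series (1-dimensional)
two_d = df[["distance"]]         # double brackets: a DataFrame with one column (2-dimensional)
both = df[["distance", "tip"]]   # list of two names: a DataFrame with two columns

print(type(one_d).__name__, one_d.shape)   # Series (3,)
print(type(two_d).__name__, two_d.shape)   # DataFrame (3, 1)
print(type(both).__name__, both.shape)     # DataFrame (3, 2)
```

The fit method expects a two-dimensional input, so the double brackets are needed even when there is only one predictor column.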

Question 2: (a) What rate per mile does your model predict?

(b) Does your model predict that taxi rides get more or less expensive later in the day?

(c) To me, the “tip” coefficient seems incorrect. Do you see why I think that? Do you agree, or do you see an explanation of why it makes sense?

reg.coef_
array([2.79303224, 0.04437202, 1.46653112])
reg.intercept_
6.427803918508337
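
The entries of coef_ come in the same order as the columns passed to fit, so it can help to label them. Here is a minimal sketch on made-up data where the true coefficients are known; the names cols, df, model, and coefs are only for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data where y = 3*a + 0.5*b + 2 exactly.
cols = ["a", "b"]
df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 1, 4, 3, 6]})
df["y"] = 3 * df["a"] + 0.5 * df["b"] + 2

model = LinearRegression()
model.fit(df[cols], df["y"])

# coef_ has one entry per input column, in the order the columns were passed to fit.
coefs = pd.Series(model.coef_, index=cols)
print(coefs.round(3))
print(round(model.intercept_, 3))
```

With the taxis model, the analogous line would be pd.Series(reg.coef_, index=["distance", "pickup hour", "tip"]).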

(a) The “distance” coefficient suggests a rate of about $2.79 per mile.

(b) The “pickup hour” coefficient is about 0.04. Because it is positive, the model predicts a slight increase in price as the ride occurs later in the day (with midnight counting as earliest, 0, and 11pm counting as latest, 23).

(c) Since the true formula for the total cost of the taxi ride includes the tip exactly once (as opposed to 1.466*tip), it seems a little surprising that the coefficient corresponding to tip is quite far from 1.
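
One possible explanation, offered as an assumption and not verified on the taxis data: if tips tend to be proportional to the fare, and the fare is not fully determined by distance, then the tip coefficient can absorb part of the missing fare information. The sketch below uses simulated rides (every number in it is made up) to show that this effect can push the tip coefficient well above 1, and that the coefficient returns to 1 once fare is included. It uses np.linalg.lstsq, which solves the same least-squares problem as LinearRegression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated (not real) rides: every number below is an assumption for illustration.
distance = rng.uniform(0.5, 10, size=n)
fare = 2.5 + 2.5 * distance + rng.normal(0, 3, size=n)  # fare depends on more than distance
tip = 0.2 * fare + rng.normal(0, 0.5, size=n)           # tip roughly proportional to fare
total = fare + tip + 3.3                                 # tip enters the total exactly once

# Least-squares fit of total on distance and tip (fare is deliberately left out).
X = np.column_stack([distance, tip, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, total, rcond=None)
tip_coef = coef[1]

# Same fit with fare included as an input.
X_full = np.column_stack([distance, fare, tip, np.ones(n)])
coef_full, *_ = np.linalg.lstsq(X_full, total, rcond=None)
tip_coef_full = coef_full[2]

print(round(tip_coef, 2), round(tip_coef_full, 2))
```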

Question 3: Let m be your calculated “distance” coefficient and let b be your calculated intercept. Use the plotting function defined below to see how accurate the line seems.

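def draw_line(m,b):
    # Plot total against distance for the taxis data, with the line y = m*x + b in red.
    alt.data_transformers.disable_max_rows()

    c1 = alt.Chart(taxis).mark_circle().encode(
        x = "distance",
        y = "total"
    )

    # Two endpoints are enough to draw the line: (0, b) and (xmax, m*xmax + b).
    xmax = 40
    df_line = pd.DataFrame({"distance":[0,xmax],"total":[b,xmax*m + b]})
    c2 = alt.Chart(df_line).mark_line(color="red").encode(
        x = "distance",
        y = "total"
    )
    return c1+c2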

This line is a reasonable start, but it does not match the data very closely. It looks significantly low. That makes sense, because this estimate of total as approximately 2.8*distance + 6.4 is missing the tip contribution (and the hour contribution, but that is less impactful).

draw_line(2.8, 6.4)
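
One way to account for the missing terms, sketched here as a suggestion and not as part of the original worksheet, is to hold the other inputs at their mean values and fold their contributions into the intercept. The helper line_at_mean and the DataFrame toy are hypothetical; the coefficients are rounded from the reg.coef_ and reg.intercept_ output above.

```python
import pandas as pd

def line_at_mean(coefs, intercept, df, x_col):
    """Return (slope, intercept) of the fitted model as a function of x_col,
    with every other input column held at its mean in df."""
    slope = coefs[x_col]
    others = [c for c in coefs.index if c != x_col]
    shifted = intercept + sum(coefs[c] * df[c].mean() for c in others)
    return slope, shifted

# Coefficients rounded from the fitted three-variable model.
coefs = pd.Series({"distance": 2.793, "pickup hour": 0.044, "tip": 1.467})
intercept = 6.428

# Hypothetical stand-in for taxis; with the real data, pass taxis instead.
toy = pd.DataFrame({
    "distance": [1.0, 2.0, 3.0],
    "pickup hour": [10, 12, 14],
    "tip": [1.0, 2.0, 3.0],
})

m, b = line_at_mean(coefs, intercept, toy, "distance")
print(m, round(b, 3))
```

With the real data, draw_line(m, b) using these values would show the three-variable model's prediction for a ride with an average tip at an average hour.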

Question 4: Fit a new linear regression model using only “distance” as your input and “total” as your output (no other variables). Does the resulting plot seem more or less accurate?

reg2 = LinearRegression()
reg2.fit(taxis[["distance"]], taxis["total"])
LinearRegression()
reg2.coef_
array([3.23508597])
reg2.intercept_
8.612423554391206

This line looks significantly more accurate, as it should: among all lines that predict “total” from “distance” alone, it is the “best” line, in the sense that it minimizes the Mean Squared Error.

draw_line(3.23, 8.6)
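
To make the “minimizes the Mean Squared Error” statement concrete, here is a minimal sketch on made-up points. The function mse and the arrays x and y are hypothetical; np.polyfit with degree 1 is used as a stand-in for LinearRegression with a single input.

```python
import numpy as np

def mse(m, b, x, y):
    """Mean squared error of the line y = m*x + b on the points (x, y)."""
    return np.mean((y - (m * x + b)) ** 2)

# Hypothetical toy points, not taken from the taxis data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([12.0, 14.5, 19.0, 21.0, 25.5])

# np.polyfit with degree 1 returns the least-squares slope and intercept.
m_best, b_best = np.polyfit(x, y, 1)

# Any other line, for example one with a smaller slope and intercept, has a larger MSE.
m_other, b_other = m_best - 0.4, b_best - 2.0

print(mse(m_best, b_best, x, y))
print(mse(m_other, b_other, x, y))
```

The same comparison could be run on the real data by calling mse(2.8, 6.4, taxis["distance"], taxis["total"]) and mse(3.23, 8.6, taxis["distance"], taxis["total"]).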