Linear regression worksheet
Here are some linear regression questions, since there weren’t any linear regression questions on the sample midterm.
import seaborn as sns
import pandas as pd
import altair as alt
from sklearn.linear_model import LinearRegression
taxis = sns.load_dataset('taxis')
taxis.dropna(inplace=True)
taxis.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
'pickup_borough', 'dropoff_borough'],
dtype='object')
Question 1: Fit a linear regression model to the taxis data using “distance”,”pickup hour”,”tip” as input (predictor) variables, and using “total” as the output (target) variable. (Note: “pickup hour” is not in the original DataFrame, so you will have to create it yourself.)
The first task is to create the “pickup hour” column. We get an error if we try to use .dt.hour directly.
taxis["pickup"].dt.hour
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [3], in <module>
----> 1 taxis["pickup"].dt.hour
File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/core/generic.py:5583, in NDFrame.__getattr__(self, name)
5576 if (
5577 name not in self._internal_names_set
5578 and name not in self._metadata
5579 and name not in self._accessors
5580 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5581 ):
5582 return self[name]
-> 5583 return object.__getattribute__(self, name)
File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/core/accessor.py:182, in CachedAccessor.__get__(self, obj, cls)
179 if obj is None:
180 # we're accessing the attribute of the class, i.e., Dataset.geo
181 return self._accessor
--> 182 accessor_obj = self._accessor(obj)
183 # Replace the property with the accessor object. Inspired by:
184 # https://www.pydanny.com/cached-property.html
185 # We need to use object.__setattr__ because we overwrite __setattr__ on
186 # NDFrame
187 object.__setattr__(obj, self._name, accessor_obj)
File ~/miniconda3/envs/torch/lib/python3.8/site-packages/pandas/core/indexes/accessors.py:509, in CombinedDatetimelikeProperties.__new__(cls, data)
506 elif is_period_dtype(data.dtype):
507 return PeriodProperties(data, orig)
--> 509 raise AttributeError("Can only use .dt accessor with datetimelike values")
AttributeError: Can only use .dt accessor with datetimelike values
The problem is these entries are strings.
taxis["pickup"]
0 2019-03-23 20:21:09
1 2019-03-04 16:11:55
2 2019-03-27 17:53:01
3 2019-03-10 01:23:59
4 2019-03-30 13:27:42
...
6428 2019-03-31 09:51:53
6429 2019-03-31 17:38:00
6430 2019-03-23 22:55:18
6431 2019-03-04 10:09:25
6432 2019-03-13 19:31:22
Name: pickup, Length: 6341, dtype: object
type(taxis.iloc[0]["pickup"])
str
taxis["pickup"].map(type)
0 <class 'str'>
1 <class 'str'>
2 <class 'str'>
3 <class 'str'>
4 <class 'str'>
...
6428 <class 'str'>
6429 <class 'str'>
6430 <class 'str'>
6431 <class 'str'>
6432 <class 'str'>
Name: pickup, Length: 6341, dtype: object
We convert the entries using pd.to_datetime.
pd.to_datetime(taxis["pickup"])
0 2019-03-23 20:21:09
1 2019-03-04 16:11:55
2 2019-03-27 17:53:01
3 2019-03-10 01:23:59
4 2019-03-30 13:27:42
...
6428 2019-03-31 09:51:53
6429 2019-03-31 17:38:00
6430 2019-03-23 22:55:18
6431 2019-03-04 10:09:25
6432 2019-03-13 19:31:22
Name: pickup, Length: 6341, dtype: datetime64[ns]
Now we can use .dt.hour.
pd.to_datetime(taxis["pickup"]).dt.hour
0 20
1 16
2 17
3 1
4 13
..
6428 9
6429 17
6430 22
6431 10
6432 19
Name: pickup, Length: 6341, dtype: int64
We put this new column into the DataFrame with the column name “pickup hour”.
taxis["pickup hour"] = pd.to_datetime(taxis["pickup"]).dt.hour
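As a side note, when every timestamp follows the same pattern, passing an explicit format string to pd.to_datetime skips format inference and can be noticeably faster on large columns. A minimal sketch on a hypothetical two-row stand-in for the “pickup” column (the timestamps below are copied from the output above, but the DataFrame itself is made up):

```python
import pandas as pd

# Hypothetical mini-DataFrame standing in for taxis["pickup"] (string entries).
df = pd.DataFrame({"pickup": ["2019-03-23 20:21:09", "2019-03-04 16:11:55"]})

# An explicit format matching the timestamps avoids per-entry format inference.
df["pickup hour"] = pd.to_datetime(df["pickup"], format="%Y-%m-%d %H:%M:%S").dt.hour

print(df["pickup hour"].tolist())  # [20, 16]
```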
The column now appears on the right side of the DataFrame.
taxis.head()
| | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | pickup hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan | 20 |
| 1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan | 16 |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan | 17 |
| 3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan | 1 |
| 4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan | 13 |
That was preliminary work (an example of “feature engineering”). Now we start on the linear regression portion.
We first create (instantiate) the linear regression object.
reg = LinearRegression()
Now we fit this linear regression object to the data. Make sure to use double square brackets for the input features: the inner square brackets make a list of column names, so the result is a two-dimensional DataFrame rather than a one-dimensional Series.
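The single-versus-double bracket distinction is easy to check on any DataFrame (the tiny frame below is invented for illustration):

```python
import pandas as pd

# Hypothetical two-row DataFrame.
df = pd.DataFrame({"distance": [1.6, 0.79], "tip": [2.15, 0.0]})

# Single brackets give a 1-D Series; double brackets give a 2-D DataFrame,
# which is the shape scikit-learn expects for the input features.
print(df["distance"].shape)    # (2,)
print(df[["distance"]].shape)  # (2, 1)
```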
reg.fit(taxis[["distance", "pickup hour", "tip"]], taxis["total"])
LinearRegression()
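Once fitted, the model can make predictions with reg.predict. A minimal sketch on hypothetical stand-in data (the numbers below are invented, not drawn from the taxis dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for taxis[["distance", "pickup hour", "tip"]] and taxis["total"].
X = np.array([[1.60, 20, 2.15],
              [0.79, 16, 0.00],
              [7.70,  1, 6.15],
              [2.16, 13, 1.10]])
y = np.array([12.95, 9.30, 36.95, 13.40])

reg = LinearRegression().fit(X, y)

# Predict the total for a hypothetical 5-mile ride picked up at 6pm with a $3 tip.
pred = reg.predict(np.array([[5.0, 18.0, 3.0]]))
print(pred.shape)  # (1,)
```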
Question 2: (a) How much does your model predict is the rate per mile?
(b) Does your model predict that taxi rides get more or less expensive later in the day?
(c) To me, the “tips” coefficient seems incorrect. Do you see why I think that? Do you agree or do you see an explanation of why it makes sense?
reg.coef_
array([2.79303224, 0.04437202, 1.46653112])
reg.intercept_
6.427803918508337
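A prediction is just the dot product of the coefficients with the feature values, plus the intercept. As a sanity check with the coefficients printed above, for a hypothetical 10-mile ride picked up at noon with a $5 tip:

```python
import numpy as np

# Coefficients and intercept reported above.
coef = np.array([2.79303224, 0.04437202, 1.46653112])
intercept = 6.427803918508337

# Hypothetical ride: 10 miles, pickup hour 12, $5 tip.
x = np.array([10.0, 12.0, 5.0])
total = coef @ x + intercept
print(round(total, 2))  # 42.22
```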
(a) The “distance” coefficient suggests a rate of about $2.79 per mile.
(b) The 0.04 coefficient on “pickup hour” suggests a slight increase in price as the taxi ride occurs later in the day (hours run from 0, midnight, as earliest, to 23, 11pm, as latest). We say increase rather than decrease because the coefficient is positive.
(c) Since the true formula for the total cost of the taxi ride includes the tip exactly once (a coefficient of exactly 1, as opposed to 1.466*tip), it is surprising that the fitted tip coefficient is so far from 1.
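One possible explanation is omitted-variable bias: tips tend to scale with the fare, and the fare is not among the input variables, so the tip coefficient also absorbs part of the fare's contribution to the total. A synthetic sketch (all numbers invented) reproduces the effect:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000

distance = rng.uniform(0.5, 20, n)
# The fare depends on distance plus factors the model never sees (traffic, duration).
fare = 2.5 * distance + rng.normal(0, 3, n)
# Tips tend to scale with the fare, plus personal variation.
tip = 0.2 * fare + rng.normal(0, 1, n)
total = fare + tip  # tip enters the true total exactly once

# Regress total on distance and tip only, as in Question 1 (fare is omitted).
reg = LinearRegression().fit(np.column_stack([distance, tip]), total)
tip_coef = reg.coef_[1]
# Because tip is a proxy for the unobserved part of the fare,
# its fitted coefficient comes out well above 1.
print(tip_coef > 1)  # True
```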
Question 3: Let m
be your calculated “distance” coefficient and let b
be your calculated intercept. Use the plotting function defined below to see how accurate the line seems.
def draw_line(m, b):
    alt.data_transformers.disable_max_rows()
    c1 = alt.Chart(taxis).mark_circle().encode(
        x="distance",
        y="total"
    )
    # Two points determine the line total = m*distance + b, from x=0 to x=xmax.
    xmax = 40
    df_line = pd.DataFrame({"distance": [0, xmax], "total": [b, b + m * xmax]})
    c2 = alt.Chart(df_line).mark_line(color="red").encode(
        x="distance",
        y="total"
    )
    return c1 + c2
This result looks okay, but it does not match the data very closely; the line sits noticeably below the points. That makes sense, because this estimate of total as approximately 2.8*distance + 6.4 is missing the tip data (and the hour data, though that is less impactful).
draw_line(2.8, 6.4)
Question 4: Fit a new linear regression model using only “distance” as your input and “total” as your output (no other variables). Does the resulting plot seem more or less accurate?
reg2 = LinearRegression()
reg2.fit(taxis[["distance"]], taxis["total"])
LinearRegression()
reg2.coef_
array([3.23508597])
reg2.intercept_
8.612423554391206
This line looks significantly more accurate (as it should, since it is the “best” line for predicting total from distance alone, in the sense that it minimizes the Mean Squared Error).
draw_line(3.23, 8.6)
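That “best line” claim can be checked numerically: the MSE of the fitted line is never larger than the MSE of any other line, including the (m, b) pair from Question 3. A sketch on hypothetical stand-in data (the data-generating numbers below are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Hypothetical stand-in for the taxis data: total roughly linear in distance.
distance = rng.uniform(0, 40, 200)
total = 3.2 * distance + 8.6 + rng.normal(0, 4, 200)

reg = LinearRegression().fit(distance.reshape(-1, 1), total)
mse_fit = mean_squared_error(total, reg.predict(distance.reshape(-1, 1)))
# Compare against any other line, e.g. the (m, b) used in Question 3.
mse_other = mean_squared_error(total, 2.8 * distance + 6.4)

# The least-squares line minimizes MSE over all lines.
print(mse_fit <= mse_other)  # True
```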