# Linear and Polynomial Regression with the taxis dataset

In [1]:
import pandas as pd
import altair as alt
alt.data_transformers.enable('default', max_rows=10000)
import seaborn as sns

In [2]:
df = sns.load_dataset("taxis").dropna()

## Linear regression

* Fit a linear regression model to the taxis dataset, using multiple input variables (also called features or predictors) and "total" as the output variable (the target).

In [3]:
df.head()

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.6,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.0,0.0,9.3,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan
3,2019-03-10 01:23:59,2019-03-10 01:49:51,1,7.7,27.0,6.15,0.0,36.95,yellow,credit card,Hudson Sq,Yorkville West,Manhattan,Manhattan
4,2019-03-30 13:27:42,2019-03-30 13:37:14,3,2.16,9.0,1.1,0.0,13.4,yellow,credit card,Midtown East,Yorkville West,Manhattan,Manhattan


What are the rows with the biggest values in the "tolls" column?

In [4]:
df.sort_values("tolls", ascending=False)

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
5364,2019-03-17 16:59:17,2019-03-17 18:04:08,2,36.70,150.00,0.00,24.02,174.82,yellow,cash,JFK Airport,JFK Airport,Queens,Queens
2122,2019-03-08 00:40:32,2019-03-08 01:11:53,1,15.51,44.00,16.27,17.28,81.35,yellow,credit card,TriBeCa/Civic Center,West Brighton,Manhattan,Staten Island
3640,2019-03-22 07:54:09,2019-03-22 09:05:13,1,16.42,52.00,0.00,12.50,67.80,yellow,cash,JFK Airport,Murray Hill,Queens,Manhattan
5911,2019-03-09 12:27:51,2019-03-09 13:11:18,1,11.40,39.00,0.00,11.52,51.32,green,credit card,Windsor Terrace,Clinton East,Brooklyn,Manhattan
5728,2019-03-01 17:07:09,2019-03-01 18:05:41,1,21.27,65.59,0.00,11.52,77.61,green,credit card,Cambria Heights,Morningside Heights,Queens,Manhattan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2203,2019-03-03 00:24:50,2019-03-03 00:56:17,1,4.72,22.00,5.16,0.00,30.96,yellow,credit card,Meatpacking/West Village West,Williamsburg (South Side),Manhattan,Brooklyn
2202,2019-03-14 22:32:33,2019-03-14 22:49:39,1,2.90,13.00,3.36,0.00,20.16,yellow,credit card,East Chelsea,East Village,Manhattan,Manhattan
2201,2019-03-18 21:16:42,2019-03-18 21:27:49,1,3.00,11.50,3.06,0.00,18.36,yellow,credit card,Clinton East,Upper East Side North,Manhattan,Manhattan
2200,2019-03-03 07:21:40,2019-03-03 07:39:12,1,7.20,22.00,7.55,0.00,32.85,yellow,credit card,Midtown Center,World Trade Center,Manhattan,Manhattan


Let's try to use the following columns as the inputs for our linear regression.

In [5]:
cols = ["distance", "tip", "tolls", "pickup_borough"]

In [6]:
from sklearn.linear_model import LinearRegression

In [7]:
reg = LinearRegression()

The call to `fit` below doesn't work, because the values in the "pickup_borough" column are strings, not numbers.

In [8]:
reg.fit(df[cols],df["total"])

ValueError: could not convert string to float: 'Manhattan'
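
One way to spot the offending column before calling `fit` is to ask pandas which of the chosen columns are non-numeric. The sketch below uses a small made-up DataFrame (`toy`, not the real taxis data) that borrows the same column names.

```python
import pandas as pd

# Made-up rows that mimic the taxis columns used above (not real data).
toy = pd.DataFrame({
    "distance": [1.6, 0.79, 18.74],
    "tip": [2.15, 0.0, 0.0],
    "tolls": [0.0, 0.0, 0.0],
    "pickup_borough": ["Manhattan", "Manhattan", "Queens"],
})
cols = ["distance", "tip", "tolls", "pickup_borough"]

# Keep only the columns whose dtype is not numeric.
non_numeric = toy[cols].select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Any column name that shows up in this list has to be converted to numbers (or dropped) before it can be passed to `LinearRegression`.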

Let's make a new column called "Manhattan". It will contain `1` for the rows whose pickup borough is "Manhattan", and `0` for all the other rows.

In [9]:
df["Manhattan"] = 0

In [10]:
df["pickup_borough"] == "Manhattan"

0        True
1        True
2        True
3        True
4        True
        ...  
6428     True
6429    False
6430    False
6431    False
6432    False
Name: pickup_borough, Length: 6341, dtype: bool

(We could probably also store the Boolean values directly in this new "Manhattan" column, but I think it's less confusing to have `0` and `1`.)

In [12]:
# Put a 1 (for True) where the value is Manhattan
df.loc[df["pickup_borough"] == "Manhattan", "Manhattan"] = 1
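
The approach above (fill the column with `0`, then overwrite the matching rows with `1`) can also be collapsed into a single assignment by casting the Boolean Series to integers. A minimal sketch on a made-up DataFrame, not the taxis data:

```python
import pandas as pd

# Made-up pickup boroughs, for illustration only.
toy = pd.DataFrame({"pickup_borough": ["Manhattan", "Queens", "Manhattan", "Brooklyn"]})

# Casting a Boolean Series to int turns True into 1 and False into 0.
toy["Manhattan"] = (toy["pickup_borough"] == "Manhattan").astype(int)
print(toy["Manhattan"].tolist())
```

Both versions should produce the same column; the two-step version just makes each stage visible.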

In [13]:
df

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough,Manhattan
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.60,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan,1
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.00,0.0,9.30,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan,1
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan,1
3,2019-03-10 01:23:59,2019-03-10 01:49:51,1,7.70,27.0,6.15,0.0,36.95,yellow,credit card,Hudson Sq,Yorkville West,Manhattan,Manhattan,1
4,2019-03-30 13:27:42,2019-03-30 13:37:14,3,2.16,9.0,1.10,0.0,13.40,yellow,credit card,Midtown East,Yorkville West,Manhattan,Manhattan,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6428,2019-03-31 09:51:53,2019-03-31 09:55:27,1,0.75,4.5,1.06,0.0,6.36,green,credit card,East Harlem North,Central Harlem North,Manhattan,Manhattan,1
6429,2019-03-31 17:38:00,2019-03-31 18:34:23,1,18.74,58.0,0.00,0.0,58.80,green,credit card,Jamaica,East Concourse/Concourse Village,Queens,Bronx,0
6430,2019-03-23 22:55:18,2019-03-23 23:14:25,1,4.14,16.0,0.00,0.0,17.30,green,cash,Crown Heights North,Bushwick North,Brooklyn,Brooklyn,0
6431,2019-03-04 10:09:25,2019-03-04 10:14:29,1,1.12,6.0,0.00,0.0,6.80,green,credit card,East New York,East Flatbush/Remsen Village,Brooklyn,Brooklyn,0


In our list of input columns, we now replace "pickup_borough" with the newly created "Manhattan" column. (The "pickup_borough" column itself stays in `df`.)

In [17]:
cols = ['distance', 'tip', 'tolls', 'Manhattan']

In [18]:
reg.fit(df[cols],df["total"])

LinearRegression()

The goal of the `fit` method is to find the following coefficients, as well as the intercept.

In [20]:
reg.coef_

array([2.6294669 , 1.3306588 , 1.11487961, 1.55477579])

The first coefficient says that the total cost of the taxi ride is modeled by a formula involving 2.63 times the distance traveled. With the other inputs held fixed, this can be interpreted as $2.63 per mile. Labeling the coefficients with their column names makes them easier to read.

In [22]:
pd.Series(reg.coef_, index=cols)

distance     2.629467
tip          1.330659
tolls        1.114880
Manhattan    1.554776
dtype: float64

In [23]:
reg.intercept_

6.170557177757704

In [24]:
df[:2]

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough,Manhattan
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.6,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan,1
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.0,0.0,9.3,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan,1


In [25]:
reg.predict(df[:2][cols])

array([14.79339642,  9.80261182])

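For example, we can view `14.8` as the predicted output for the 0th row. The `predict` method isn't doing anything mysterious. It's just evaluating this linear function on the given inputs. Here is the by-hand computation for the 0th row. It uses the coefficients and intercept rounded to two decimal places, which is why the result differs slightly from the `predict` output.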

In [28]:
2.63*1.6+1.33*2.15+1.11*0+1.55*1+6.17

14.787500000000001
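
To check the same formula without the rounding, we can plug in the coefficient and intercept values exactly as they were printed above and take a dot product with the input rows. This is only a sketch: the numbers are copied by hand from the outputs above, and the names `coef`, `intercept`, `X` and `pred` are chosen here for illustration.

```python
import numpy as np

# Values copied from the reg.coef_ and reg.intercept_ outputs above.
coef = np.array([2.6294669, 1.3306588, 1.11487961, 1.55477579])
intercept = 6.170557177757704

# Rows 0 and 1 of df, in the order distance, tip, tolls, Manhattan.
X = np.array([
    [1.60, 2.15, 0.0, 1],
    [0.79, 0.00, 0.0, 1],
])

# A linear model's prediction is the dot product of the inputs with the
# coefficients, plus the intercept.
pred = X @ coef + intercept
print(pred)
```

The two values should agree with the `reg.predict` output to several decimal places.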

## Polynomial regression

Last time, we fit a degree 9 polynomial model to this data, using "distance" as the (only) input variable and using "total" as the output variable.  The code from last time is below.

Using 100 training points, adapt the code from last time to fit models of different degrees, for each degree from 1 to 24.  Plot the resulting polynomials for $0 \leq x \leq M$, where $M$ is the maximum "distance" value within the training data.

A lot of this code was copied from last time, and then adjusted to the current goals.

In [29]:
from sklearn.model_selection import train_test_split

We're using 100 training points instead of the 40 from last time, so we should expect slightly less overfitting this time.

In [31]:
df_train, df_test = train_test_split(df, train_size=100)

In [32]:
df_train.shape

(100, 15)
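
Because `train_test_split` shuffles the rows randomly, rerunning the cell above selects a different set of 100 points, so values that depend on the training set (such as the maximum distance) change from run to run. If reproducible output is wanted, one option is the `random_state` parameter. The sketch below uses a made-up DataFrame, and the seed `0` is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data standing in for the taxis DataFrame.
toy = pd.DataFrame({"distance": range(200), "total": range(200)})

# The same random_state gives the same split every time.
train_a, test_a = train_test_split(toy, train_size=100, random_state=0)
train_b, test_b = train_test_split(toy, train_size=100, random_state=0)

print(train_a.shape)
print(train_a.index.equals(train_b.index))
```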

In [33]:
c = alt.Chart(df_train).mark_circle().encode(
    x="distance",
    y="total"
)

In [34]:
c

In [35]:
df_train["distance"].max()

26.92

In [36]:
import numpy as np

In [37]:
df_plot = pd.DataFrame({"distance":np.arange(0,df_train["distance"].max()+0.1,0.1)})

In [38]:
df_plot.head()

Unnamed: 0,distance
0,0.0
1,0.1
2,0.2
3,0.3
4,0.4


In [39]:
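cols = []
# range(1,25) covers the degrees 1 through 24
for deg in range(1,25):
    col = f"d{deg}"
    cols.append(col)
    # Add the power column to both the training data and the plotting grid
    for x in [df_train, df_plot]:
        x[col] = x["distance"]**deg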

In [40]:
df_plot.head()

Unnamed: 0,distance,d1,d2,d3,d4,d5,d6,d7,d8,d9,...,d15,d16,d17,d18,d19,d20,d21,d22,d23,d24
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.1,0.1,0.01,0.001,0.0001,1e-05,1e-06,1e-07,1e-08,1e-09,...,1e-15,1e-16,1e-17,1e-18,9.999999999999999e-20,1e-20,1e-21,1e-22,1.0000000000000001e-23,1e-24
2,0.2,0.2,0.04,0.008,0.0016,0.00032,6.4e-05,1.28e-05,2.56e-06,5.12e-07,...,3.2768e-11,6.5536e-12,1.31072e-12,2.62144e-13,5.24288e-14,1.048576e-14,2.097152e-15,4.194304e-16,8.388608000000001e-17,1.6777220000000003e-17
3,0.3,0.3,0.09,0.027,0.0081,0.00243,0.000729,0.0002187,6.561e-05,1.9683e-05,...,1.434891e-08,4.304672e-09,1.291402e-09,3.874205e-10,1.162261e-10,3.486784e-11,1.046035e-11,3.138106e-12,9.414318e-13,2.824295e-13
4,0.4,0.4,0.16,0.064,0.0256,0.01024,0.004096,0.0016384,0.00065536,0.000262144,...,1.073742e-06,4.294967e-07,1.717987e-07,6.871948e-08,2.748779e-08,1.099512e-08,4.398047e-09,1.759219e-09,7.036874e-10,2.81475e-10
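
The power columns above were built by hand with a loop. scikit-learn also provides `PolynomialFeatures`, which generates the same kind of columns. This is a different tool from the one used in this notebook, shown only as a sketch on a few made-up distance values.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A few made-up distance values as a single-column 2D array.
distance = np.array([[0.0], [0.5], [2.0]])

# include_bias=False drops the constant column, leaving x, x**2, x**3.
poly = PolynomialFeatures(degree=3, include_bias=False)
powers = poly.fit_transform(distance)
print(powers)
```

Each row of `powers` holds the first, second and third power of the corresponding distance, which matches what the columns `d1`, `d2`, `d3` hold in `df_plot`.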


In [41]:
cols

['d1',
 'd2',
 'd3',
 'd4',
 'd5',
 'd6',
 'd7',
 'd8',
 'd9',
 'd10',
 'd11',
 'd12',
 'd13',
 'd14',
 'd15',
 'd16',
 'd17',
 'd18',
 'd19',
 'd20',
 'd21',
 'd22',
 'd23',
 'd24']

In [42]:
cols[:4]

['d1', 'd2', 'd3', 'd4']

In [43]:
chart_list = []

for deg in range(1,25):
    # The first deg entries of cols are the powers d1 through d{deg}
    subcols = cols[:deg]
    reg = LinearRegression()
    reg.fit(df_train[subcols],df_train["total"])
    df_plot[f"Pred{deg}"] = reg.predict(df_plot[subcols])
    # clip=True hides the parts of the curve that fall outside the y-axis domain
    c_temp = alt.Chart(df_plot).mark_line(color="red", clip=True).encode(
        x="distance",
        y=alt.Y(f"Pred{deg}", scale=alt.Scale(domain=(0,200)))
    )
    chart_list.append(c_temp)

In [44]:
both_charts = [c+d for d in chart_list]

`alt.vconcat` expects the charts as separate arguments, not as a single list of charts, so the following call raises an error.

In [45]:
alt.vconcat(both_charts)

ValueError: Only chart objects can be used in VConcatChart.

So we unpack the list with `*`, which passes each chart as its own argument. Notice how the overfitting gets more extreme as the degree of the polynomial gets higher.

In [46]:
alt.vconcat(*both_charts)
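
The `*` unpacking used here is ordinary Python and works with any function that accepts a variable number of positional arguments. In the sketch below, `count_args` is a hypothetical stand-in for such a function (it is not part of Altair), and the strings are placeholders for charts.

```python
def count_args(*charts):
    """Hypothetical stand-in for a function that takes each item as its own argument."""
    return len(charts)

items = ["chart_a", "chart_b", "chart_c"]

# Passing the list itself supplies ONE argument (the list).
print(count_args(items))
# Prefixing with * spreads the list into three separate arguments.
print(count_args(*items))
```

The first call sees a single argument and the second sees three, which is the difference between `alt.vconcat(both_charts)` and `alt.vconcat(*both_charts)`.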