Polynomial Regression 2

I think the most important concept in machine learning is the concept of overfitting. The idea is that if you have too flexible a model (relative to the number of data points), then your model can match the data very closely, but it will do a poor job of predicting future values.

When performing polynomial regression, the higher the degree of the polynomial, the more flexible the model is. So when the degree of the polynomial is higher, there is a greater risk of overfitting.
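
To make this concrete, here is a minimal synthetic sketch (my own illustration with NumPy, not the taxis data): with only 10 data points, a degree 9 polynomial can match the data exactly, noise and all.

import numpy as np

# 10 roughly linear points with a little noise (the seed is arbitrary)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2*x + rng.normal(0, 0.1, size=10)

# A degree 9 polynomial has 10 coefficients, so it can pass through
# all 10 points, noise included.
coeffs = np.polyfit(x, y, deg=9)
np.abs(np.polyval(coeffs, x) - y).max()  # essentially zero: the fit memorized the noise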

The taxis dataset

The taxis dataset contains information about roughly 6,300 taxi rides in New York City (after dropping rows with missing values). Our goal is to model the total cost of a ride using its distance.

import pandas as pd
import altair as alt
alt.data_transformers.enable('default', max_rows=10000)
import seaborn as sns
df = sns.load_dataset("taxis").dropna()
df.head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24 1 1.60 7.0 2.15 0.0 12.95 yellow credit card Lenox Hill West UN/Turtle Bay South Manhattan Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00 1 0.79 5.0 0.00 0.0 9.30 yellow cash Upper West Side South Upper West Side South Manhattan Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25 1 1.37 7.5 2.36 0.0 14.16 yellow credit card Alphabet City West Village Manhattan Manhattan
3 2019-03-10 01:23:59 2019-03-10 01:49:51 1 7.70 27.0 6.15 0.0 36.95 yellow credit card Hudson Sq Yorkville West Manhattan Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14 3 2.16 9.0 1.10 0.0 13.40 yellow credit card Midtown East Yorkville West Manhattan Manhattan

The following plot shows that the data follows a linear model very closely. Nevertheless, we are going to fit a high-degree polynomial (degree 9 in this case), precisely so that we can watch overfitting happen.

alt.Chart(df).mark_circle().encode(
    x="distance",
    y="total"
)
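
As a quick numerical aside (corr is a standard pandas method), we can check how strong the linear relationship is:

# correlation between distance and total cost; a value near 1 indicates
# a strong linear relationship
df["distance"].corr(df["total"])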

If we fit a degree 9 polynomial to all 6,000+ data points, the result will still look quite reasonable, because the model has very little flexibility relative to that much data. Instead we will use only 40 of the data points. In general, the fewer data points you use, the greater the risk of overfitting.

from sklearn.model_selection import train_test_split

The following is not the usual way to call train_test_split. Below we will use the usual way, which involves list unpacking.

Here train_size=40 says we want to choose 40 random rows from the DataFrame df. If we had instead used train_size=0.4, that would say we want to choose 40% of the rows.

# use 40 data points
a = train_test_split(df, train_size=40)
type(a)
list
len(a)
2
type(a[0])
pandas.core.frame.DataFrame
a[0].shape
(40, 14)
a[1].shape
(6301, 14)

Currently a[0] contains 40 rows from the DataFrame, and a[1] contains the remaining 6301 rows.

df.shape
(6341, 14)
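
Here is a quick check of the train_size=0.4 behavior described above (a sketch; the names b_train and b_rest are just for illustration, and random_state is a standard scikit-learn parameter):

b_train, b_rest = train_test_split(df, train_size=0.4, random_state=0)
b_train.shape[0] / df.shape[0]  # approximately 0.4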

The rows are chosen at random. (They aren’t even presented in order.)

a[0].head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
1855 2019-03-29 14:20:18 2019-03-29 14:36:35 6 2.72 12.5 3.16 0.0 18.96 yellow credit card Hudson Sq Murray Hill Manhattan Manhattan
5029 2019-03-21 06:51:40 2019-03-21 07:07:11 1 3.39 14.0 2.75 0.0 20.05 yellow credit card Central Park Times Sq/Theatre District Manhattan Manhattan
5277 2019-03-19 04:54:04 2019-03-19 05:00:05 2 1.65 7.5 2.00 0.0 13.30 yellow credit card East Chelsea Clinton East Manhattan Manhattan
247 2019-03-07 18:49:41 2019-03-07 19:07:05 1 2.75 13.0 3.46 0.0 20.76 yellow credit card Yorkville West Lincoln Square East Manhattan Manhattan
912 2019-03-15 23:13:44 2019-03-15 23:24:47 1 1.70 9.5 2.65 0.0 15.95 yellow credit card Lincoln Square West East Chelsea Manhattan Manhattan
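
Because the split is random, every call to train_test_split produces different rows. If you want the same rows each run (for example, while debugging), you can pass the random_state parameter; a minimal sketch:

s1 = train_test_split(df, train_size=40, random_state=0)
s2 = train_test_split(df, train_size=40, random_state=0)
s1[0].equals(s2[0])  # True: the same 40 rows both times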

The usual way to call train_test_split is to use list unpacking, which is what we do in the following code.

# list unpacking
df_train, df_test = train_test_split(df, train_size=40)

Now df_train will contain a different set of 40 rows from df.

df_train.head()
pickup dropoff passengers distance fare tip tolls total color payment pickup_zone dropoff_zone pickup_borough dropoff_borough
1091 2019-03-05 20:17:31 2019-03-05 20:35:21 1 4.20 16.0 0.0 0.0 19.8 yellow cash Penn Station/Madison Sq West Battery Park City Manhattan Manhattan
5459 2019-03-29 16:25:22 2019-03-29 17:15:57 1 12.76 39.5 0.0 0.0 41.3 green credit card Brighton Beach Richmond Hill Brooklyn Queens
4164 2019-03-20 09:11:06 2019-03-20 09:21:44 2 0.83 8.0 0.0 0.0 11.3 yellow cash Midtown South Midtown East Manhattan Manhattan
819 2019-03-17 19:30:19 2019-03-17 19:41:39 1 2.10 9.5 0.0 0.0 12.8 yellow cash Lincoln Square East Upper West Side North Manhattan Manhattan
6136 2019-03-05 19:01:21 2019-03-05 19:05:56 1 0.87 5.0 0.0 0.0 6.8 green cash Elmhurst Elmhurst Queens Queens
df_train.shape
(40, 14)

Fitting polynomial regression

# add columns d1, ..., d9 holding the 1st through 9th powers of "distance"
cols = []
for deg in range(1,10):
    cols.append(f"d{deg}")
    df_train[f"d{deg}"] = df_train["distance"]**deg

Here is a more DRY approach, where we only have to type f"d{deg}" once.

cols = []
for deg in range(1,10):
    c = f"d{deg}"
    cols.append(c)
    df_train[c] = df_train["distance"]**deg
cols
['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9']

Notice how 9 columns have been added to df_train. They contain the 1st, 2nd, …, 9th powers of the values in the “distance” column.

df_train.head()
pickup dropoff passengers distance fare tip tolls total color payment ... dropoff_borough d1 d2 d3 d4 d5 d6 d7 d8 d9
1091 2019-03-05 20:17:31 2019-03-05 20:35:21 1 4.20 16.0 0.0 0.0 19.8 yellow cash ... Manhattan 4.20 17.6400 74.088000 311.169600 1306.912320 5.489032e+03 2.305393e+04 9.682652e+04 4.066714e+05
5459 2019-03-29 16:25:22 2019-03-29 17:15:57 1 12.76 39.5 0.0 0.0 41.3 green credit card ... Queens 12.76 162.8176 2077.552576 26509.570870 338262.124298 4.316225e+06 5.507503e+07 7.027573e+08 8.967184e+09
4164 2019-03-20 09:11:06 2019-03-20 09:21:44 2 0.83 8.0 0.0 0.0 11.3 yellow cash ... Manhattan 0.83 0.6889 0.571787 0.474583 0.393904 3.269404e-01 2.713605e-01 2.252292e-01 1.869403e-01
819 2019-03-17 19:30:19 2019-03-17 19:41:39 1 2.10 9.5 0.0 0.0 12.8 yellow cash ... Manhattan 2.10 4.4100 9.261000 19.448100 40.841010 8.576612e+01 1.801089e+02 3.782286e+02 7.942800e+02
6136 2019-03-05 19:01:21 2019-03-05 19:05:56 1 0.87 5.0 0.0 0.0 6.8 green cash ... Queens 0.87 0.7569 0.658503 0.572898 0.498421 4.336262e-01 3.772548e-01 3.282117e-01 2.855442e-01

5 rows × 23 columns
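
As an aside, we did not have to build these columns by hand: scikit-learn's PolynomialFeatures transformer produces the same powers. A brief sketch (we will keep using our hand-made columns below):

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# include_bias=False omits the constant (degree 0) column
poly = PolynomialFeatures(degree=9, include_bias=False)
powers = poly.fit_transform(df_train[["distance"]])  # shape (40, 9)
np.allclose(powers, df_train[cols])  # True: matches our d1, ..., d9 columns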

Remember, as we saw last time, polynomial regression can be viewed as a special case of linear regression: the model total ≈ β₀ + β₁·distance + β₂·distance² + ⋯ + β₉·distance⁹ is linear in the coefficients, with each power of distance treated as its own input column. (The reverse is also true and more obvious: linear regression is just degree 1 polynomial regression.)

from sklearn.linear_model import LinearRegression
reg = LinearRegression()

It is important to fit only on df_train: we deliberately restrict ourselves to 40 data points so that we can demonstrate overfitting more effectively.

reg.fit(df_train[cols], df_train["total"])
LinearRegression()
df_train["Pred"] = reg.predict(df_train[cols])

The red line in the following chart shows the fitted polynomial. It doesn’t look like a polynomial, because Altair (like MATLAB and Matplotlib) draws a line by connecting the plotted points with straight segments, and here we only have 40 input values. Below we will make a DataFrame with many more input values, so that the polynomial will look smooth.

c = alt.Chart(df_train).mark_circle().encode(
    x="distance",
    y="total"
)

c9 = alt.Chart(df_train).mark_line(color="red").encode(
    x="distance",
    y="Pred"
)

c+c9
import numpy as np
df_plot = pd.DataFrame({"distance": np.arange(0,40,0.1)})
df_plot
distance
0 0.0
1 0.1
2 0.2
3 0.3
4 0.4
... ...
395 39.5
396 39.6
397 39.7
398 39.8
399 39.9

400 rows × 1 columns

We need the same power columns in df_plot before we can call reg.predict on it, so we repeat the loop, this time filling in both DataFrames.

cols = []
for deg in range(1,10):
    c = f"d{deg}"
    cols.append(c)
    df_train[c] = df_train["distance"]**deg
    df_plot[c] = df_plot["distance"]**deg
df_plot
distance d1 d2 d3 d4 d5 d6 d7 d8 d9
0 0.0 0.0 0.00 0.000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
1 0.1 0.1 0.01 0.001 1.000000e-04 1.000000e-05 1.000000e-06 1.000000e-07 1.000000e-08 1.000000e-09
2 0.2 0.2 0.04 0.008 1.600000e-03 3.200000e-04 6.400000e-05 1.280000e-05 2.560000e-06 5.120000e-07
3 0.3 0.3 0.09 0.027 8.100000e-03 2.430000e-03 7.290000e-04 2.187000e-04 6.561000e-05 1.968300e-05
4 0.4 0.4 0.16 0.064 2.560000e-02 1.024000e-02 4.096000e-03 1.638400e-03 6.553600e-04 2.621440e-04
... ... ... ... ... ... ... ... ... ... ...
395 39.5 39.5 1560.25 61629.875 2.434380e+06 9.615801e+07 3.798241e+09 1.500305e+11 5.926206e+12 2.340851e+14
396 39.6 39.6 1568.16 62099.136 2.459126e+06 9.738138e+07 3.856303e+09 1.527096e+11 6.047300e+12 2.394731e+14
397 39.7 39.7 1576.09 62570.773 2.484060e+06 9.861717e+07 3.915102e+09 1.554295e+11 6.170553e+12 2.449709e+14
398 39.8 39.8 1584.04 63044.792 2.509183e+06 9.986547e+07 3.974646e+09 1.581909e+11 6.295998e+12 2.505807e+14
399 39.9 39.9 1592.01 63521.199 2.534496e+06 1.011264e+08 4.034943e+09 1.609942e+11 6.423669e+12 2.563044e+14

400 rows × 10 columns

Here is an even more DRY version of the same loop: by iterating over the two DataFrames, we only have to type the exponentiation line once.

cols = []
for deg in range(1,10):
    c = f"d{deg}"
    cols.append(c)
    for x in [df_train, df_plot]:
        x[c] = x["distance"]**deg

df_plot.columns
Index(['distance', 'd1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9'], dtype='object')
df_plot["Pred"] = reg.predict(df_plot[cols])

The following chart looks pretty bad: the scale of the fitted polynomial’s values is so much larger than the scale of the data that we can’t see any detail in the curve.

Also notice that the predictions don’t make sense. Our model says that a 40 mile taxi ride should cost approximately negative two billion dollars.

c = alt.Chart(df_train).mark_circle().encode(
    x="distance",
    y="total"
)

c9 = alt.Chart(df_plot).mark_line(color="red").encode(
    x="distance",
    y="Pred"
)

c+c9

We will fix this by doing two things: specifying the domain for the y-axis, and specifying clip=True (that is important) when calling mark_line(). The clip=True option tells Altair to hide the portions of the line that fall outside the axis ranges.

The following is a clear example of overfitting. Our polynomial has too much flexibility (9 fitted coefficients, plus an intercept) relative to the number of data points (40 in this case). For example, this flexibility allows the polynomial to pass almost exactly through many of the data points.

c = alt.Chart(df_train).mark_circle().encode(
    x="distance",
    y="total"
)

c9 = alt.Chart(df_plot).mark_line(color="red", clip=True).encode(
    x="distance",
    y=alt.Y("Pred", scale=alt.Scale(domain=(0,200)))
)

c+c9
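
Finally, if you want numerical evidence of the overfitting rather than just the picture, you could compare the error on the 40 training rows to the error on the held-out rows. A hedged sketch (mean_squared_error is a real scikit-learn function; the exact numbers depend on the random split):

from sklearn.metrics import mean_squared_error

# build the same power columns on the held-out rows
df_test = df_test.copy()
for deg in range(1,10):
    df_test[f"d{deg}"] = df_test["distance"]**deg

train_mse = mean_squared_error(df_train["total"], reg.predict(df_train[cols]))
test_mse = mean_squared_error(df_test["total"], reg.predict(df_test[cols]))
print(train_mse, test_mse)  # expect the test error to be vastly larger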