Polynomial Regression 2
I think the most important concept in machine learning is the concept of overfitting. The idea is that if your model is too flexible (relative to the number of data points), then it can match the training data very closely, but it will do a poor job of predicting future values.
When performing polynomial regression, the higher the degree of the polynomial, the more flexible the model is, and so the greater the risk of overfitting.
The taxis dataset
The taxis dataset contains information about approximately 6000 taxi rides in New York City. Our goal is to model the total cost of a taxi ride using the distance traveled.
import pandas as pd
import altair as alt
# raise Altair's default 5000-row limit so the full dataset can be plotted
alt.data_transformers.enable('default', max_rows=10000)
import seaborn as sns
# load the taxis dataset and drop rows with missing values
df = sns.load_dataset("taxis").dropna()
df.head()
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
The following plot shows that the data follows a linear model very closely. We are instead going to try polynomial regression with a high degree (degree 9 in this case).
alt.Chart(df).mark_circle().encode(
    x="distance",
    y="total"
)
If we fit a degree 9 polynomial to all 6000+ data points, the resulting fit will still look very reasonable, because the model has relatively little flexibility compared to the amount of data. Instead we will use only 40 of the data points. In general, the fewer data points you use, the greater the risk of overfitting.
from sklearn.model_selection import train_test_split
The following is not the usual way to call `train_test_split`. Below we will use the usual way, which involves list unpacking.

Here `train_size=40` says we want to choose 40 random rows from the DataFrame `df`. If we had instead used `train_size=0.4`, that would say we want to choose 40% of the rows.
# use 40 data points
a = train_test_split(df, train_size=40)
type(a)
list
len(a)
2
type(a[0])
pandas.core.frame.DataFrame
a[0].shape
(40, 14)
a[1].shape
(6301, 14)
Currently `a[0]` contains 40 rows from the DataFrame, and `a[1]` contains the remaining 6301 rows.
df.shape
(6341, 14)
The rows are chosen at random. (They aren’t even presented in order.)
a[0].head()
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1855 | 2019-03-29 14:20:18 | 2019-03-29 14:36:35 | 6 | 2.72 | 12.5 | 3.16 | 0.0 | 18.96 | yellow | credit card | Hudson Sq | Murray Hill | Manhattan | Manhattan |
5029 | 2019-03-21 06:51:40 | 2019-03-21 07:07:11 | 1 | 3.39 | 14.0 | 2.75 | 0.0 | 20.05 | yellow | credit card | Central Park | Times Sq/Theatre District | Manhattan | Manhattan |
5277 | 2019-03-19 04:54:04 | 2019-03-19 05:00:05 | 2 | 1.65 | 7.5 | 2.00 | 0.0 | 13.30 | yellow | credit card | East Chelsea | Clinton East | Manhattan | Manhattan |
247 | 2019-03-07 18:49:41 | 2019-03-07 19:07:05 | 1 | 2.75 | 13.0 | 3.46 | 0.0 | 20.76 | yellow | credit card | Yorkville West | Lincoln Square East | Manhattan | Manhattan |
912 | 2019-03-15 23:13:44 | 2019-03-15 23:24:47 | 1 | 1.70 | 9.5 | 2.65 | 0.0 | 15.95 | yellow | credit card | Lincoln Square West | East Chelsea | Manhattan | Manhattan |
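As a side note (a sketch, not part of the original code), here is the fractional form of `train_size` mentioned above, together with the `random_state` parameter, which makes the random choice reproducible. The variable name `b` is just for illustration.

# train_size=0.4 keeps roughly 40% of the rows (about 2536 of the 6341),
# instead of exactly 40 rows; random_state fixes the randomness so the
# same rows are chosen every time this cell is run
b = train_test_split(df, train_size=0.4, random_state=0)
b[0].shape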
The usual way to call `train_test_split` is to use list unpacking, which is what we do in the following code.
# list unpacking
df_train, df_test = train_test_split(df, train_size=40)
Now `df_train` will contain a different set of 40 rows from `df`.
df_train.head()
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1091 | 2019-03-05 20:17:31 | 2019-03-05 20:35:21 | 1 | 4.20 | 16.0 | 0.0 | 0.0 | 19.8 | yellow | cash | Penn Station/Madison Sq West | Battery Park City | Manhattan | Manhattan |
5459 | 2019-03-29 16:25:22 | 2019-03-29 17:15:57 | 1 | 12.76 | 39.5 | 0.0 | 0.0 | 41.3 | green | credit card | Brighton Beach | Richmond Hill | Brooklyn | Queens |
4164 | 2019-03-20 09:11:06 | 2019-03-20 09:21:44 | 2 | 0.83 | 8.0 | 0.0 | 0.0 | 11.3 | yellow | cash | Midtown South | Midtown East | Manhattan | Manhattan |
819 | 2019-03-17 19:30:19 | 2019-03-17 19:41:39 | 1 | 2.10 | 9.5 | 0.0 | 0.0 | 12.8 | yellow | cash | Lincoln Square East | Upper West Side North | Manhattan | Manhattan |
6136 | 2019-03-05 19:01:21 | 2019-03-05 19:05:56 | 1 | 0.87 | 5.0 | 0.0 | 0.0 | 6.8 | green | cash | Elmhurst | Elmhurst | Queens | Queens |
df_train.shape
(40, 14)
Fitting polynomial regression
cols = []
for deg in range(1,10):
    cols.append(f"d{deg}")
    df_train[f"d{deg}"] = df_train["distance"]**deg
Here is a more DRY ("Don't Repeat Yourself") approach, where we only have to type `f"d{deg}"` once.
cols = []
for deg in range(1,10):
    c = f"d{deg}"
    cols.append(c)
    df_train[c] = df_train["distance"]**deg
cols
['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9']
Notice how 9 columns have been added to `df_train`. They contain the 1st, 2nd, …, 9th powers of the values in the “distance” column.
df_train.head()
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | ... | dropoff_borough | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1091 | 2019-03-05 20:17:31 | 2019-03-05 20:35:21 | 1 | 4.20 | 16.0 | 0.0 | 0.0 | 19.8 | yellow | cash | ... | Manhattan | 4.20 | 17.6400 | 74.088000 | 311.169600 | 1306.912320 | 5.489032e+03 | 2.305393e+04 | 9.682652e+04 | 4.066714e+05 |
5459 | 2019-03-29 16:25:22 | 2019-03-29 17:15:57 | 1 | 12.76 | 39.5 | 0.0 | 0.0 | 41.3 | green | credit card | ... | Queens | 12.76 | 162.8176 | 2077.552576 | 26509.570870 | 338262.124298 | 4.316225e+06 | 5.507503e+07 | 7.027573e+08 | 8.967184e+09 |
4164 | 2019-03-20 09:11:06 | 2019-03-20 09:21:44 | 2 | 0.83 | 8.0 | 0.0 | 0.0 | 11.3 | yellow | cash | ... | Manhattan | 0.83 | 0.6889 | 0.571787 | 0.474583 | 0.393904 | 3.269404e-01 | 2.713605e-01 | 2.252292e-01 | 1.869403e-01 |
819 | 2019-03-17 19:30:19 | 2019-03-17 19:41:39 | 1 | 2.10 | 9.5 | 0.0 | 0.0 | 12.8 | yellow | cash | ... | Manhattan | 2.10 | 4.4100 | 9.261000 | 19.448100 | 40.841010 | 8.576612e+01 | 1.801089e+02 | 3.782286e+02 | 7.942800e+02 |
6136 | 2019-03-05 19:01:21 | 2019-03-05 19:05:56 | 1 | 0.87 | 5.0 | 0.0 | 0.0 | 6.8 | green | cash | ... | Queens | 0.87 | 0.7569 | 0.658503 | 0.572898 | 0.498421 | 4.336262e-01 | 3.772548e-01 | 3.282117e-01 | 2.855442e-01 |
5 rows × 23 columns
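As an aside (not done in this notebook), scikit-learn's `PolynomialFeatures` transformer can build these power columns for us. A minimal sketch, producing the same values as the `d1` through `d9` columns above:

from sklearn.preprocessing import PolynomialFeatures

# degree=9 produces distance**1 through distance**9;
# include_bias=False omits the constant column of ones
poly = PolynomialFeatures(degree=9, include_bias=False)
powers = poly.fit_transform(df_train[["distance"]])
powers.shape  # (40, 9)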
Remember, as we saw last time, polynomial regression can be viewed as a special case of linear regression: once each power of the distance is stored in its own column, finding the best-fit polynomial is just ordinary linear regression on those columns. (The reverse, that linear regression is a special case of polynomial regression, is also true and more obvious.)
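Concretely, with the power columns in place, the model we are about to fit has the form

$$\text{total} \approx \beta_0 + \beta_1\,\text{distance} + \beta_2\,\text{distance}^2 + \cdots + \beta_9\,\text{distance}^9,$$

which is a degree 9 polynomial in the distance but is linear in the unknown coefficients $\beta_0, \dots, \beta_9$; that is why `LinearRegression` can find them.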
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
It is important to fit only on `df_train`, because we want to use only 40 data points, so that we can demonstrate overfitting more effectively.
reg.fit(df_train[cols], df_train["total"])
LinearRegression()
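If you are curious what was fit, the coefficients and intercept can be inspected as follows (a side note; the exact values will vary from run to run, since the 40 rows are chosen at random):

# one fitted coefficient per power column d1..d9, plus a separate intercept
print(reg.coef_)
print(reg.intercept_)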
df_train["Pred"] = reg.predict(df_train[cols])
The red line in the following chart shows the polynomial. It doesn't look like a polynomial, because Altair (like MATLAB and Matplotlib) simply connects the plotted points with straight line segments, and here there are only 40 distance values. Below we will make a DataFrame with many more input values, so that the polynomial will look smoother.
c = alt.Chart(df_train).mark_circle().encode(
    x="distance",
    y="total"
)

c9 = alt.Chart(df_train).mark_line(color="red").encode(
    x="distance",
    y="Pred"
)
c+c9
import numpy as np
# 400 evenly spaced distance values from 0 to 39.9 miles
df_plot = pd.DataFrame({"distance": np.arange(0,40,0.1)})
df_plot
 | distance
---|---
0 | 0.0 |
1 | 0.1 |
2 | 0.2 |
3 | 0.3 |
4 | 0.4 |
... | ... |
395 | 39.5 |
396 | 39.6 |
397 | 39.7 |
398 | 39.8 |
399 | 39.9 |
400 rows × 1 columns
cols = []
for deg in range(1,10):
    c = f"d{deg}"
    cols.append(c)
    df_train[c] = df_train["distance"]**deg
    df_plot[c] = df_plot["distance"]**deg
df_plot
 | distance | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9
---|---|---|---|---|---|---|---|---|---|---
0 | 0.0 | 0.0 | 0.00 | 0.000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
1 | 0.1 | 0.1 | 0.01 | 0.001 | 1.000000e-04 | 1.000000e-05 | 1.000000e-06 | 1.000000e-07 | 1.000000e-08 | 1.000000e-09 |
2 | 0.2 | 0.2 | 0.04 | 0.008 | 1.600000e-03 | 3.200000e-04 | 6.400000e-05 | 1.280000e-05 | 2.560000e-06 | 5.120000e-07 |
3 | 0.3 | 0.3 | 0.09 | 0.027 | 8.100000e-03 | 2.430000e-03 | 7.290000e-04 | 2.187000e-04 | 6.561000e-05 | 1.968300e-05 |
4 | 0.4 | 0.4 | 0.16 | 0.064 | 2.560000e-02 | 1.024000e-02 | 4.096000e-03 | 1.638400e-03 | 6.553600e-04 | 2.621440e-04 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
395 | 39.5 | 39.5 | 1560.25 | 61629.875 | 2.434380e+06 | 9.615801e+07 | 3.798241e+09 | 1.500305e+11 | 5.926206e+12 | 2.340851e+14 |
396 | 39.6 | 39.6 | 1568.16 | 62099.136 | 2.459126e+06 | 9.738138e+07 | 3.856303e+09 | 1.527096e+11 | 6.047300e+12 | 2.394731e+14 |
397 | 39.7 | 39.7 | 1576.09 | 62570.773 | 2.484060e+06 | 9.861717e+07 | 3.915102e+09 | 1.554295e+11 | 6.170553e+12 | 2.449709e+14 |
398 | 39.8 | 39.8 | 1584.04 | 63044.792 | 2.509183e+06 | 9.986547e+07 | 3.974646e+09 | 1.581909e+11 | 6.295998e+12 | 2.505807e+14 |
399 | 39.9 | 39.9 | 1592.01 | 63521.199 | 2.534496e+06 | 1.011264e+08 | 4.034943e+09 | 1.609942e+11 | 6.423669e+12 | 2.563044e+14 |
400 rows × 10 columns
Here is an even more DRY version of the same computation, where an inner loop over the two DataFrames means the power computation only has to be typed once.

cols = []
for deg in range(1,10):
    c = f"d{deg}"
    cols.append(c)
    for x in [df_train, df_plot]:
        x[c] = x["distance"]**deg
df_plot
 | distance | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9
---|---|---|---|---|---|---|---|---|---|---
0 | 0.0 | 0.0 | 0.00 | 0.000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
1 | 0.1 | 0.1 | 0.01 | 0.001 | 1.000000e-04 | 1.000000e-05 | 1.000000e-06 | 1.000000e-07 | 1.000000e-08 | 1.000000e-09 |
2 | 0.2 | 0.2 | 0.04 | 0.008 | 1.600000e-03 | 3.200000e-04 | 6.400000e-05 | 1.280000e-05 | 2.560000e-06 | 5.120000e-07 |
3 | 0.3 | 0.3 | 0.09 | 0.027 | 8.100000e-03 | 2.430000e-03 | 7.290000e-04 | 2.187000e-04 | 6.561000e-05 | 1.968300e-05 |
4 | 0.4 | 0.4 | 0.16 | 0.064 | 2.560000e-02 | 1.024000e-02 | 4.096000e-03 | 1.638400e-03 | 6.553600e-04 | 2.621440e-04 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
395 | 39.5 | 39.5 | 1560.25 | 61629.875 | 2.434380e+06 | 9.615801e+07 | 3.798241e+09 | 1.500305e+11 | 5.926206e+12 | 2.340851e+14 |
396 | 39.6 | 39.6 | 1568.16 | 62099.136 | 2.459126e+06 | 9.738138e+07 | 3.856303e+09 | 1.527096e+11 | 6.047300e+12 | 2.394731e+14 |
397 | 39.7 | 39.7 | 1576.09 | 62570.773 | 2.484060e+06 | 9.861717e+07 | 3.915102e+09 | 1.554295e+11 | 6.170553e+12 | 2.449709e+14 |
398 | 39.8 | 39.8 | 1584.04 | 63044.792 | 2.509183e+06 | 9.986547e+07 | 3.974646e+09 | 1.581909e+11 | 6.295998e+12 | 2.505807e+14 |
399 | 39.9 | 39.9 | 1592.01 | 63521.199 | 2.534496e+06 | 1.011264e+08 | 4.034943e+09 | 1.609942e+11 | 6.423669e+12 | 2.563044e+14 |
400 rows × 10 columns
df_plot.columns
Index(['distance', 'd1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9'], dtype='object')
df_plot["Pred"] = reg.predict(df_plot[cols])
The following looks pretty bad, because the range of values of the fitted polynomial is so much larger than the range of the data that we can't see any detail in the curve.

Also notice that the predictions don't make sense. Our model is predicting that, when the taxi ride is 40 miles, the cost should be approximately negative two billion dollars.
c = alt.Chart(df_train).mark_circle().encode(
    x="distance",
    y="total"
)

c9 = alt.Chart(df_plot).mark_line(color="red").encode(
    x="distance",
    y="Pred"
)
c+c9
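To check that extreme prediction directly (a sketch, not in the original; the exact number depends on which 40 rows were sampled), we can evaluate the fitted model at a distance of 40 miles:

# build a one-row DataFrame with the powers 40**1, ..., 40**9
x40 = pd.DataFrame({col: [40.0**deg] for deg, col in enumerate(cols, start=1)})
# typically an enormous (often hugely negative) dollar amount
reg.predict(x40)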
We will fix this by doing two things: specifying the domain for the y-axis, and specifying `clip=True` (that is important) when calling `mark_line()`. The `clip=True` says to get rid of the points that are outside the axes ranges.
The following is a clear example of overfitting. Our polynomial has too much flexibility (its 9 coefficients, plus an intercept) relative to the number of data points (40 in this case). For example, this flexibility allows the polynomial to pass nearly exactly through many of the data points.
c = alt.Chart(df_train).mark_circle().encode(
    x="distance",
    y="total"
)

c9 = alt.Chart(df_plot).mark_line(color="red", clip=True).encode(
    x="distance",
    y=alt.Y("Pred", scale=alt.Scale(domain=(0,200)))
)
c+c9
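One way to see the overfitting numerically (a sketch that goes beyond what is done above, using the held-out `df_test` rows from `train_test_split`): compare the model's error on the 40 training rows to its error on the rows it never saw.

from sklearn.metrics import mean_absolute_error

# add the same power columns to the held-out rows
for deg in range(1, 10):
    df_test[f"d{deg}"] = df_test["distance"]**deg

# the training error is typically small, while the error on the unseen
# rows is typically enormous -- the signature of overfitting
train_error = mean_absolute_error(df_train["total"], reg.predict(df_train[cols]))
test_error = mean_absolute_error(df_test["total"], reg.predict(df_test[cols]))
print(train_error, test_error)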