Polynomial Regression 2
I think the most important concept in machine learning is the concept of overfitting. The idea is that if your model is too flexible (relative to the number of data points), then it can match the training data very closely, but it will do a poor job of predicting future values.
When performing polynomial regression, the higher the degree of the polynomial, the more flexible the model is, and so the greater the risk of overfitting.
The taxis dataset
The taxis dataset contains information about approximately 6000 taxi rides in New York City. Our goal is to model the total cost of a taxi ride using the distance traveled.
import pandas as pd
import altair as alt
# raise Altair's default 5000-row limit so the full dataset can be plotted
alt.data_transformers.enable('default', max_rows=10000)
import seaborn as sns
# load the taxis dataset and drop rows with missing values
df = sns.load_dataset("taxis").dropna()
df.head()
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.60 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.70 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |
The following plot shows that the data follows a linear model very closely. We are instead going to try polynomial regression with a high degree (degree 9 in this case).
alt.Chart(df).mark_circle().encode(
    x="distance",
    y="total"
)
If we fit a degree 9 polynomial to all 6000+ data points, the resulting fit will still look very reasonable, because the model has relatively little flexibility compared to the amount of data. Instead we will use only 40 of the data points. In general, the fewer data points you use, the greater the risk of overfitting.
from sklearn.model_selection import train_test_split
The following is not the usual way to call `train_test_split`. Below we will use the usual way, which involves list unpacking.

Here `train_size=40` says we want to choose 40 random rows from the DataFrame `df`. If we had instead used `train_size=0.4`, that would say we want to choose 40% of the rows.
# use 40 data points
a = train_test_split(df, train_size=40)
type(a)
list
len(a)
2
type(a[0])
pandas.core.frame.DataFrame
a[0].shape
(40, 14)
a[1].shape
(6301, 14)
Currently `a[0]` contains 40 rows from the DataFrame, and `a[1]` contains the remaining 6301 rows.
df.shape
(6341, 14)
The rows are chosen at random. (They aren’t even presented in order.)
a[0].head()
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1855 | 2019-03-29 14:20:18 | 2019-03-29 14:36:35 | 6 | 2.72 | 12.5 | 3.16 | 0.0 | 18.96 | yellow | credit card | Hudson Sq | Murray Hill | Manhattan | Manhattan |
5029 | 2019-03-21 06:51:40 | 2019-03-21 07:07:11 | 1 | 3.39 | 14.0 | 2.75 | 0.0 | 20.05 | yellow | credit card | Central Park | Times Sq/Theatre District | Manhattan | Manhattan |
5277 | 2019-03-19 04:54:04 | 2019-03-19 05:00:05 | 2 | 1.65 | 7.5 | 2.00 | 0.0 | 13.30 | yellow | credit card | East Chelsea | Clinton East | Manhattan | Manhattan |
247 | 2019-03-07 18:49:41 | 2019-03-07 19:07:05 | 1 | 2.75 | 13.0 | 3.46 | 0.0 | 20.76 | yellow | credit card | Yorkville West | Lincoln Square East | Manhattan | Manhattan |
912 | 2019-03-15 23:13:44 | 2019-03-15 23:24:47 | 1 | 1.70 | 9.5 | 2.65 | 0.0 | 15.95 | yellow | credit card | Lincoln Square West | East Chelsea | Manhattan | Manhattan |
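As a side note (a sketch, not part of the original code), here is the fractional form of `train_size` mentioned above, together with the `random_state` parameter, which makes the random choice reproducible. The variable name `b` is just for illustration.

# train_size=0.4 keeps roughly 40% of the rows (about 2536 of the 6341),
# instead of exactly 40 rows; random_state fixes the randomness so the
# same rows are chosen every time this cell is run
b = train_test_split(df, train_size=0.4, random_state=0)
b[0].shape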
The usual way to call `train_test_split` is to use list unpacking, which is what we do in the following code.
# list unpacking
df_train, df_test = train_test_split(df, train_size=40)
Now `df_train` will contain a different set of 40 rows from `df`.
df_train.head()
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1091 | 2019-03-05 20:17:31 | 2019-03-05 20:35:21 | 1 | 4.20 | 16.0 | 0.0 | 0.0 | 19.8 | yellow | cash | Penn Station/Madison Sq West | Battery Park City | Manhattan | Manhattan |
5459 | 2019-03-29 16:25:22 | 2019-03-29 17:15:57 | 1 | 12.76 | 39.5 | 0.0 | 0.0 | 41.3 | green | credit card | Brighton Beach | Richmond Hill | Brooklyn | Queens |
4164 | 2019-03-20 09:11:06 | 2019-03-20 09:21:44 | 2 | 0.83 | 8.0 | 0.0 | 0.0 | 11.3 | yellow | cash | Midtown South | Midtown East | Manhattan | Manhattan |
819 | 2019-03-17 19:30:19 | 2019-03-17 19:41:39 | 1 | 2.10 | 9.5 | 0.0 | 0.0 | 12.8 | yellow | cash | Lincoln Square East | Upper West Side North | Manhattan | Manhattan |
6136 | 2019-03-05 19:01:21 | 2019-03-05 19:05:56 | 1 | 0.87 | 5.0 | 0.0 | 0.0 | 6.8 | green | cash | Elmhurst | Elmhurst | Queens | Queens |
df_train.shape
(40, 14)
Fitting polynomial regression
cols = []
for deg in range(1,10):
    cols.append(f"d{deg}")
    df_train[f"d{deg}"] = df_train["distance"]**deg
Here is a more DRY ("Don't Repeat Yourself") approach, where we only have to type `f"d{deg}"` once.
cols = []
for deg in range(1,10):
    c = f"d{deg}"
    cols.append(c)
    df_train[c] = df_train["distance"]**deg
cols
['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9']
Notice how 9 columns have been added to `df_train`. They contain the 1st, 2nd, …, 9th powers of the values in the “distance” column.
df_train.head()
 | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | ... | dropoff_borough | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1091 | 2019-03-05 20:17:31 | 2019-03-05 20:35:21 | 1 | 4.20 | 16.0 | 0.0 | 0.0 | 19.8 | yellow | cash | ... | Manhattan | 4.20 | 17.6400 | 74.088000 | 311.169600 | 1306.912320 | 5.489032e+03 | 2.305393e+04 | 9.682652e+04 | 4.066714e+05 |
5459 | 2019-03-29 16:25:22 | 2019-03-29 17:15:57 | 1 | 12.76 | 39.5 | 0.0 | 0.0 | 41.3 | green | credit card | ... | Queens | 12.76 | 162.8176 | 2077.552576 | 26509.570870 | 338262.124298 | 4.316225e+06 | 5.507503e+07 | 7.027573e+08 | 8.967184e+09 |
4164 | 2019-03-20 09:11:06 | 2019-03-20 09:21:44 | 2 | 0.83 | 8.0 | 0.0 | 0.0 | 11.3 | yellow | cash | ... | Manhattan | 0.83 | 0.6889 | 0.571787 | 0.474583 | 0.393904 | 3.269404e-01 | 2.713605e-01 | 2.252292e-01 | 1.869403e-01 |
819 | 2019-03-17 19:30:19 | 2019-03-17 19:41:39 | 1 | 2.10 | 9.5 | 0.0 | 0.0 | 12.8 | yellow | cash | ... | Manhattan | 2.10 | 4.4100 | 9.261000 | 19.448100 | 40.841010 | 8.576612e+01 | 1.801089e+02 | 3.782286e+02 | 7.942800e+02 |
6136 | 2019-03-05 19:01:21 | 2019-03-05 19:05:56 | 1 | 0.87 | 5.0 | 0.0 | 0.0 | 6.8 | green | cash | ... | Queens | 0.87 | 0.7569 | 0.658503 | 0.572898 | 0.498421 | 4.336262e-01 | 3.772548e-01 | 3.282117e-01 | 2.855442e-01 |
5 rows × 23 columns
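As an aside (not done in this notebook), scikit-learn's `PolynomialFeatures` transformer can build these power columns for us. A minimal sketch, producing the same values as the `d1` through `d9` columns above:

from sklearn.preprocessing import PolynomialFeatures

# degree=9 produces distance**1 through distance**9;
# include_bias=False omits the constant column of ones
poly = PolynomialFeatures(degree=9, include_bias=False)
powers = poly.fit_transform(df_train[["distance"]])
powers.shape  # (40, 9)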
Remember, as we saw last time, polynomial regression can be viewed as a special case of linear regression: once each power of the distance is stored in its own column, finding the best-fit polynomial is just ordinary linear regression on those columns. (The reverse, that linear regression is a special case of polynomial regression, is also true and more obvious.)
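Concretely, with the power columns in place, the model we are about to fit has the form

$$\text{total} \approx \beta_0 + \beta_1\,\text{distance} + \beta_2\,\text{distance}^2 + \cdots + \beta_9\,\text{distance}^9,$$

which is a degree 9 polynomial in the distance but is linear in the unknown coefficients $\beta_0, \dots, \beta_9$; that is why `LinearRegression` can find them.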
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
It is important to fit only on `df_train`, because we want to use only 40 data points, so that we can demonstrate overfitting more effectively.
reg.fit(df_train[cols], df_train["total"])
LinearRegression()
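If you are curious what was fit, the coefficients and intercept can be inspected as follows (a side note; the exact values will vary from run to run, since the 40 rows are chosen at random):

# one fitted coefficient per power column d1..d9, plus a separate intercept
print(reg.coef_)
print(reg.intercept_)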
df_train["Pred"] = reg.predict(df_train[cols])
The red line in the following chart shows the polynomial. It doesn't look like a polynomial, because Altair (like MATLAB and Matplotlib) simply connects the plotted points with straight line segments, and here there are only 40 distance values. Below we will make a DataFrame with many more input values, so that the polynomial will look smoother.
c = alt.Chart(df_train).mark_circle().encode(
    x="distance",
    y="total"
)

c9 = alt.Chart(df_train).mark_line(color="red").encode(
    x="distance",
    y="Pred"
)
c+c9
import numpy as np
# 400 evenly spaced distance values from 0 to 39.9 miles
df_plot = pd.DataFrame({"distance": np.arange(0,40,0.1)})
df_plot
 | distance
---|---
0 | 0.0 |
1 | 0.1 |
2 | 0.2 |
3 | 0.3 |
4 | 0.4 |
... | ... |
395 | 39.5 |
396 | 39.6 |
397 | 39.7 |
398 | 39.8 |
399 | 39.9 |
400 rows × 1 columns
cols = []
for deg in range(1,10):
    c = f"d{deg}"
    cols.append(c)
    df_train[c] = df_train["distance"]**deg
    df_plot[c] = df_plot["distance"]**deg
df_plot
 | distance | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9
---|---|---|---|---|---|---|---|---|---|---
0 | 0.0 | 0.0 | 0.00 | 0.000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
1 | 0.1 | 0.1 | 0.01 | 0.001 | 1.000000e-04 | 1.000000e-05 | 1.000000e-06 | 1.000000e-07 | 1.000000e-08 | 1.000000e-09 |
2 | 0.2 | 0.2 | 0.04 | 0.008 | 1.600000e-03 | 3.200000e-04 | 6.400000e-05 | 1.280000e-05 | 2.560000e-06 | 5.120000e-07 |
3 | 0.3 | 0.3 | 0.09 | 0.027 | 8.100000e-03 | 2.430000e-03 | 7.290000e-04 | 2.187000e-04 | 6.561000e-05 | 1.968300e-05 |
4 | 0.4 | 0.4 | 0.16 | 0.064 | 2.560000e-02 | 1.024000e-02 | 4.096000e-03 | 1.638400e-03 | 6.553600e-04 | 2.621440e-04 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
395 | 39.5 | 39.5 | 1560.25 | 61629.875 | 2.434380e+06 | 9.615801e+07 | 3.798241e+09 | 1.500305e+11 | 5.926206e+12 | 2.340851e+14 |
396 | 39.6 | 39.6 | 1568.16 | 62099.136 | 2.459126e+06 | 9.738138e+07 | 3.856303e+09 | 1.527096e+11 | 6.047300e+12 | 2.394731e+14 |
397 | 39.7 | 39.7 | 1576.09 | 62570.773 | 2.484060e+06 | 9.861717e+07 | 3.915102e+09 | 1.554295e+11 | 6.170553e+12 | 2.449709e+14 |
398 | 39.8 | 39.8 | 1584.04 | 63044.792 | 2.509183e+06 | 9.986547e+07 | 3.974646e+09 | 1.581909e+11 | 6.295998e+12 | 2.505807e+14 |
399 | 39.9 | 39.9 | 1592.01 | 63521.199 | 2.534496e+06 | 1.011264e+08 | 4.034943e+09 | 1.609942e+11 | 6.423669e+12 | 2.563044e+14 |
400 rows × 10 columns
Here is an even more DRY version of the same computation, where an inner loop over the two DataFrames means the power computation only has to be typed once.

cols = []
for deg in range(1,10):
    c = f"d{deg}"
    cols.append(c)
    for x in [df_train, df_plot]:
        x[c] = x["distance"]**deg
df_plot
 | distance | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9
---|---|---|---|---|---|---|---|---|---|---
0 | 0.0 | 0.0 | 0.00 | 0.000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
1 | 0.1 | 0.1 | 0.01 | 0.001 | 1.000000e-04 | 1.000000e-05 | 1.000000e-06 | 1.000000e-07 | 1.000000e-08 | 1.000000e-09 |
2 | 0.2 | 0.2 | 0.04 | 0.008 | 1.600000e-03 | 3.200000e-04 | 6.400000e-05 | 1.280000e-05 | 2.560000e-06 | 5.120000e-07 |
3 | 0.3 | 0.3 | 0.09 | 0.027 | 8.100000e-03 | 2.430000e-03 | 7.290000e-04 | 2.187000e-04 | 6.561000e-05 | 1.968300e-05 |
4 | 0.4 | 0.4 | 0.16 | 0.064 | 2.560000e-02 | 1.024000e-02 | 4.096000e-03 | 1.638400e-03 | 6.553600e-04 | 2.621440e-04 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
395 | 39.5 | 39.5 | 1560.25 | 61629.875 | 2.434380e+06 | 9.615801e+07 | 3.798241e+09 | 1.500305e+11 | 5.926206e+12 | 2.340851e+14 |
396 | 39.6 | 39.6 | 1568.16 | 62099.136 | 2.459126e+06 | 9.738138e+07 | 3.856303e+09 | 1.527096e+11 | 6.047300e+12 | 2.394731e+14 |
397 | 39.7 | 39.7 | 1576.09 | 62570.773 | 2.484060e+06 | 9.861717e+07 | 3.915102e+09 | 1.554295e+11 | 6.170553e+12 | 2.449709e+14 |
398 | 39.8 | 39.8 | 1584.04 | 63044.792 | 2.509183e+06 | 9.986547e+07 | 3.974646e+09 | 1.581909e+11 | 6.295998e+12 | 2.505807e+14 |
399 | 39.9 | 39.9 | 1592.01 | 63521.199 | 2.534496e+06 | 1.011264e+08 | 4.034943e+09 | 1.609942e+11 | 6.423669e+12 | 2.563044e+14 |
400 rows × 10 columns
df_plot.columns
Index(['distance', 'd1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9'], dtype='object')
df_plot["Pred"] = reg.predict(df_plot[cols])
The following looks pretty bad, because the range of values of the fitted polynomial is so much larger than the range of the data that we can't see any detail in the curve.

Also notice that the predictions don't make sense. Our model is predicting that, when the taxi ride is 40 miles, the cost should be approximately negative two billion dollars.
c = alt.Chart(df_train).mark_circle().encode(
    x="distance",
    y="total"
)

c9 = alt.Chart(df_plot).mark_line(color="red").encode(
    x="distance",
    y="Pred"
)
c+c9
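To check that extreme prediction directly (a sketch, not in the original; the exact number depends on which 40 rows were sampled), we can evaluate the fitted model at a distance of 40 miles:

# build a one-row DataFrame with the powers 40**1, ..., 40**9
x40 = pd.DataFrame({col: [40.0**deg] for deg, col in enumerate(cols, start=1)})
# typically an enormous (often hugely negative) dollar amount
reg.predict(x40)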
We will fix this by doing two things: specifying the domain for the y-axis, and specifying `clip=True` (that is important) when calling `mark_line()`. The `clip=True` says to get rid of the points that are outside the axes ranges.
The following is a clear example of overfitting. Our polynomial has too much flexibility (its 9 coefficients, plus an intercept) relative to the number of data points (40 in this case). For example, this flexibility allows the polynomial to pass nearly exactly through many of the data points.
c = alt.Chart(df_train).mark_circle().encode(
    x="distance",
    y="total"
)

c9 = alt.Chart(df_plot).mark_line(color="red", clip=True).encode(
    x="distance",
    y=alt.Y("Pred", scale=alt.Scale(domain=(0,200)))
)
c+c9
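One way to see the overfitting numerically (a sketch that goes beyond what is done above, using the held-out `df_test` rows from `train_test_split`): compare the model's error on the 40 training rows to its error on the rows it never saw.

from sklearn.metrics import mean_absolute_error

# add the same power columns to the held-out rows
for deg in range(1, 10):
    df_test[f"d{deg}"] = df_test["distance"]**deg

# the training error is typically small, while the error on the unseen
# rows is typically enormous -- the signature of overfitting
train_error = mean_absolute_error(df_train["total"], reg.predict(df_train[cols]))
test_error = mean_absolute_error(df_test["total"], reg.predict(df_test[cols]))
print(train_error, test_error)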