Overfitting¶
Announcements¶
I’ll post a sample midterm soon (Saturday at the latest). The midterm is Thursday of Week 8 (May 19).
Be sure to attend tomorrow’s Discussion Section:
Yasmeen will introduce the dataset for the homework and help with the first few homework questions.
Note cards for the midterm will be handed out.
import pandas as pd
import altair as alt
Simulated data¶
We will work with the same simulated dataset as on Monday. The true underlying function has the form \(f(x) = c_2 x^2 + c_1 x + c_0\). The true outputs are stored in the “y_true” column. We’ve hidden this true data by adding random noise to each output and putting the result in the “y” column. The columns “x1” through “x10” hold the powers \(x^1, \dots, x^{10}\) of the input, which we will use as features for polynomial regression.
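The exact code that produced sim_data.csv isn’t shown in these notes, but here is a minimal sketch of how a dataset with this structure could be generated. The coefficients, noise level, and range of x values below are made up for illustration; only the overall shape (a quadratic plus noise, with power columns x1 through x10) matches our file.

import numpy as np

# Hypothetical data-generating sketch; the true coefficients and noise level
# used for sim_data.csv are not shown in these notes.
rng = np.random.default_rng(0)
x = rng.uniform(-10, 7, size=50)
y_true = 1.5*x**2 + 2*x - 3                    # f(x) = c2*x**2 + c1*x + c0
y = y_true + rng.normal(scale=60, size=50)     # noisy outputs, like the "y" column

df_sim = pd.DataFrame({"x": x, "y_true": y_true, "y": y})
for i in range(1, 11):
    df_sim[f"x{i}"] = x**i                     # power columns x1, ..., x10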
df = pd.read_csv("../data/sim_data.csv")
df.head()
| | x | y_true | y | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -3.329208 | -18.207589 | -117.484900 | -3.329208 | 11.083626 | -36.899694 | 122.846756 | -408.982395 | 1361.587441 | -4533.007730 | 1.509133e+04 | -5.024216e+04 | 1.672666e+05 |
1 | 6.465018 | 74.160562 | 73.954907 | 6.465018 | 41.796463 | 270.214901 | 1746.944309 | 11294.027098 | 73016.092970 | 472050.384357 | 3.051814e+06 | 1.973004e+07 | 1.275550e+08 |
2 | -4.478046 | -7.670062 | -13.810089 | -4.478046 | 20.052899 | -89.797810 | 402.118751 | -1800.706392 | 8063.646628 | -36109.383086 | 1.616995e+05 | -7.240978e+05 | 3.242544e+06 |
3 | 2.043272 | -7.925152 | 19.461182 | 2.043272 | 4.174960 | 8.530580 | 17.430295 | 35.614834 | 72.770792 | 148.690523 | 3.038152e+02 | 6.207771e+02 | 1.268416e+03 |
4 | 4.850593 | 36.485466 | 22.375230 | 4.850593 | 23.528255 | 114.125996 | 553.578791 | 2685.185564 | 13024.743051 | 63177.731115 | 3.064495e+05 | 1.486462e+06 | 7.210222e+06 |
max_deg = 10
cols = [f"x{i}" for i in range(1, max_deg+1)]
# chart of the true (noise-free) outputs
c_true = alt.Chart(df).mark_circle(color="black").encode(
    x="x",
    y="y_true",
    tooltip=["x", "y_true", "y"]
)

# chart of the noisy outputs
c = alt.Chart(df).mark_circle().encode(
    x="x",
    y="y",
    tooltip=["x", "y_true", "y"]
)
c_true
The black points represent the true underlying model, while the blue points represent the randomized version. Real-world data is almost always closer to the randomized version.
c+c_true
Dividing into a training set and a test set¶
The motto: fit on the training set, then predict and evaluate on the test set.
from sklearn.model_selection import train_test_split
cols
['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']
We use random_state to guarantee that we all get the same train/test split. It also guarantees that this particular split exhibits the behavior I want to illustrate.
X_train, X_test, y_train, y_test = train_test_split(df[cols], df["y"], train_size=0.24, random_state=6)
df[cols].shape
(50, 10)
The X_train variable will hold 24% of the rows from df[cols].
X_train.shape
(12, 10)
0.24*50
12.0
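As a quick sanity check (not part of the original lecture), re-running the split with the same arguments and the same random_state reproduces exactly the same rows:

# Splitting again with the same random_state should give back the identical rows.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    df[cols], df["y"], train_size=0.24, random_state=6
)
(X_train2.index == X_train.index).all()   # expected: True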
X_train
| | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|---|---|---|---|---|---|---|---|---|---|---|
13 | -8.982275 | 80.681271 | -724.701394 | 6509.467488 | -58469.829551 | 525192.110405 | -4.717420e+06 | 4.237317e+07 | -3.806075e+08 | 3.418721e+09 |
11 | 6.449735 | 41.599085 | 268.303082 | 1730.483838 | 11161.162581 | 71986.543546 | 4.642941e+05 | 2.994574e+06 | 1.931421e+07 | 1.245716e+08 |
1 | 6.465018 | 41.796463 | 270.214901 | 1746.944309 | 11294.027098 | 73016.092970 | 4.720504e+05 | 3.051814e+06 | 1.973004e+07 | 1.275550e+08 |
25 | 0.371593 | 0.138081 | 0.051310 | 0.019066 | 0.007085 | 0.002633 | 9.782985e-04 | 3.635287e-04 | 1.350847e-04 | 5.019650e-05 |
16 | -3.942982 | 15.547104 | -61.301947 | 241.712453 | -953.067768 | 3757.928725 | -1.481744e+04 | 5.842491e+04 | -2.303683e+05 | 9.083382e+05 |
45 | -4.457650 | 19.870645 | -88.576385 | 394.842539 | -1760.069919 | 7845.776000 | -3.497372e+04 | 1.559006e+05 | -6.949505e+05 | 3.097846e+06 |
15 | 5.481667 | 30.048674 | 164.716823 | 902.922780 | 4949.522041 | 27131.631828 | 1.487266e+05 | 8.152695e+05 | 4.469036e+06 | 2.449777e+07 |
42 | -9.804880 | 96.135668 | -942.598671 | 9242.066675 | -90617.352934 | 888492.254125 | -8.711560e+06 | 8.541580e+07 | -8.374916e+08 | 8.211505e+09 |
20 | -0.690041 | 0.476157 | -0.328568 | 0.226726 | -0.156450 | 0.107957 | -7.449486e-02 | 5.140454e-02 | -3.547126e-02 | 2.447664e-02 |
35 | 4.852230 | 23.544134 | 114.241552 | 554.326269 | 2689.718466 | 13051.132219 | 6.332709e+04 | 3.072776e+05 | 1.490982e+06 | 7.234585e+06 |
9 | 4.556122 | 20.758251 | 94.577131 | 430.904978 | 1963.255801 | 8944.833631 | 4.075376e+04 | 1.856791e+05 | 8.459767e+05 | 3.854373e+06 |
10 | -6.299257 | 39.680640 | -249.958554 | 1574.553197 | -9918.515413 | 62479.278687 | -3.935730e+05 | 2.479218e+06 | -1.561723e+07 | 9.837695e+07 |
The X_test variable will hold the remaining rows.
X_test
| | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|---|---|---|---|---|---|---|---|---|---|---|
49 | 6.068927 | 36.831870 | 223.529915 | 1356.586642 | 8233.024737 | 49965.622715 | 3.032377e+05 | 1.840327e+06 | 1.116881e+07 | 6.778270e+07 |
40 | -0.364001 | 0.132497 | -0.048229 | 0.017555 | -0.006390 | 0.002326 | -8.466826e-04 | 3.081936e-04 | -1.121829e-04 | 4.083471e-05 |
38 | -1.602675 | 2.568568 | -4.116580 | 6.597541 | -10.573716 | 16.946233 | -2.715931e+01 | 4.352755e+01 | -6.976053e+01 | 1.118035e+02 |
23 | 4.441058 | 19.722998 | 87.590980 | 388.996638 | 1727.556699 | 7672.179799 | 3.407260e+04 | 1.513184e+05 | 6.720137e+05 | 2.984452e+06 |
7 | -5.656363 | 31.994443 | -180.972185 | 1023.644381 | -5790.104257 | 32750.931784 | -1.852512e+05 | 1.047848e+06 | -5.927008e+06 | 3.352531e+07 |
0 | -3.329208 | 11.083626 | -36.899694 | 122.846756 | -408.982395 | 1361.587441 | -4.533008e+03 | 1.509133e+04 | -5.024216e+04 | 1.672666e+05 |
6 | -6.557409 | 42.999609 | -281.966007 | 1848.966339 | -12124.427923 | 79504.828902 | -5.213457e+05 | 3.418677e+06 | -2.241766e+07 | 1.470018e+08 |
34 | -1.384337 | 1.916390 | -2.652930 | 3.672550 | -5.084049 | 7.038039 | -9.743020e+00 | 1.348763e+01 | -1.867143e+01 | 2.584755e+01 |
14 | -9.026659 | 81.480574 | -735.497357 | 6639.083877 | -59928.746571 | 540956.362638 | -4.883029e+06 | 4.407743e+07 | -3.978720e+08 | 3.591455e+09 |
31 | 6.893150 | 47.515521 | 327.531630 | 2257.724760 | 15562.836137 | 107276.968808 | 7.394763e+05 | 5.097321e+06 | 3.513660e+07 | 2.422019e+08 |
48 | 3.778329 | 14.275771 | 53.938563 | 203.797646 | 770.014590 | 2909.368585 | 1.099255e+04 | 4.153348e+04 | 1.569272e+05 | 5.929225e+05 |
24 | -6.563454 | 43.078928 | -282.746563 | 1855.794057 | -12180.418898 | 79945.618965 | -5.247194e+05 | 3.443972e+06 | -2.260435e+07 | 1.483626e+08 |
19 | -9.924588 | 98.497442 | -977.546500 | 9701.745999 | -96285.829291 | 955597.159837 | -9.483908e+06 | 9.412388e+07 | -9.341407e+08 | 9.270961e+09 |
3 | 2.043272 | 4.174960 | 8.530580 | 17.430295 | 35.614834 | 72.770792 | 1.486905e+02 | 3.038152e+02 | 6.207771e+02 | 1.268416e+03 |
41 | 6.656568 | 44.309900 | 294.951869 | 1963.367222 | 13069.287756 | 86996.604895 | 5.790988e+05 | 3.854811e+06 | 2.565981e+07 | 1.708063e+08 |
28 | 3.165616 | 10.021126 | 31.723040 | 100.422972 | 317.900593 | 1006.351284 | 3.185722e+03 | 1.008477e+04 | 3.192452e+04 | 1.010608e+05 |
43 | 1.236174 | 1.528127 | 1.889031 | 2.335171 | 2.886679 | 3.568438 | 4.411211e+00 | 5.453025e+00 | 6.740889e+00 | 8.332913e+00 |
30 | -6.470932 | 41.872959 | -270.957061 | 1753.344667 | -11345.773804 | 73417.728768 | -4.750811e+05 | 3.074218e+06 | -1.989305e+07 | 1.287266e+08 |
47 | -3.012487 | 9.075077 | -27.338548 | 82.357015 | -248.099419 | 747.396219 | -2.251521e+03 | 6.782678e+03 | -2.043273e+04 | 6.155332e+04 |
17 | 7.135269 | 50.912060 | 363.271234 | 2592.037885 | 18494.886953 | 131965.989144 | 9.416128e+05 | 6.718660e+06 | 4.793945e+07 | 3.420608e+08 |
21 | 4.510288 | 20.342695 | 91.751404 | 413.825225 | 1866.470799 | 8418.320181 | 3.796905e+04 | 1.712513e+05 | 7.723927e+05 | 3.483713e+06 |
29 | 5.222245 | 27.271843 | 142.420249 | 743.753442 | 3884.062733 | 20283.527393 | 1.059256e+05 | 5.531692e+05 | 2.888785e+06 | 1.508594e+07 |
39 | -8.223553 | 67.626827 | -556.132805 | 4573.387686 | -37609.496782 | 309283.696327 | -2.543411e+06 | 2.091587e+07 | -1.720028e+08 | 1.414474e+09 |
18 | 4.196026 | 17.606638 | 73.877921 | 309.993714 | 1300.741838 | 5457.947214 | 2.290169e+04 | 9.609610e+04 | 4.032218e+05 | 1.691929e+06 |
22 | -9.336060 | 87.162025 | -813.749934 | 7597.218586 | -70928.092042 | 662188.955593 | -6.182236e+06 | 5.771773e+07 | -5.388562e+08 | 5.030794e+09 |
44 | -2.766259 | 7.652191 | -21.167945 | 58.556028 | -161.981161 | 448.081909 | -1.239511e+03 | 3.428808e+03 | -9.484973e+03 | 2.623790e+04 |
37 | -2.738399 | 7.498829 | -20.534785 | 56.232432 | -153.986833 | 421.677380 | -1.154721e+03 | 3.162086e+03 | -8.659054e+03 | 2.371194e+04 |
5 | -0.578506 | 0.334669 | -0.193608 | 0.112004 | -0.064795 | 0.037484 | -2.168480e-02 | 1.254479e-02 | -7.257238e-03 | 4.198356e-03 |
32 | -3.493023 | 12.201207 | -42.619094 | 148.869463 | -520.004411 | 1816.387202 | -6.344682e+03 | 2.216212e+04 | -7.741278e+04 | 2.704046e+05 |
27 | -2.965964 | 8.796943 | -26.091415 | 77.386199 | -229.524684 | 680.761955 | -2.019115e+03 | 5.988624e+03 | -1.776204e+04 | 5.268158e+04 |
46 | 4.825670 | 23.287096 | 112.375850 | 542.288824 | 2616.907175 | 12628.331726 | 6.094017e+04 | 2.940772e+05 | 1.419120e+06 | 6.848203e+06 |
12 | 3.660112 | 13.396416 | 49.032378 | 179.463974 | 656.858162 | 2404.174135 | 8.799545e+03 | 3.220732e+04 | 1.178824e+05 | 4.314626e+05 |
2 | -4.478046 | 20.052899 | -89.797810 | 402.118751 | -1800.706392 | 8063.646628 | -3.610938e+04 | 1.616995e+05 | -7.240978e+05 | 3.242544e+06 |
8 | 0.108538 | 0.011781 | 0.001279 | 0.000139 | 0.000015 | 0.000002 | 1.774525e-07 | 1.926040e-08 | 2.090492e-09 | 2.268986e-10 |
36 | 3.727977 | 13.897811 | 51.810717 | 193.149152 | 720.055560 | 2684.350427 | 1.000720e+04 | 3.730660e+04 | 1.390781e+05 | 5.184800e+05 |
4 | 4.850593 | 23.528255 | 114.125996 | 553.578791 | 2685.185564 | 13024.743051 | 6.317773e+04 | 3.064495e+05 | 1.486462e+06 | 7.210222e+06 |
33 | -8.747075 | 76.511319 | -669.250243 | 5853.982010 | -51205.219163 | 447895.887816 | -3.917779e+06 | 3.426911e+07 | -2.997544e+08 | 2.621974e+09 |
26 | 6.520774 | 42.520497 | 277.266559 | 1807.992633 | 11789.511780 | 76876.744660 | 5.012959e+05 | 3.268837e+06 | 2.131535e+07 | 1.389926e+08 |
Performing polynomial regression for each degree¶
from sklearn.linear_model import LinearRegression
The next cell isn’t fitting anything yet; it just shows how sub_cols changes on each pass through the loop.
for i in range(1, max_deg+1):
    sub_cols = cols[:i]
    reg = LinearRegression()
    print(sub_cols)
['x1']
['x1', 'x2']
['x1', 'x2', 'x3']
['x1', 'x2', 'x3', 'x4']
['x1', 'x2', 'x3', 'x4', 'x5']
['x1', 'x2', 'x3', 'x4', 'x5', 'x6']
['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7']
['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8']
['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9']
['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']
X_train
| | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|---|---|---|---|---|---|---|---|---|---|---|
13 | -8.982275 | 80.681271 | -724.701394 | 6509.467488 | -58469.829551 | 525192.110405 | -4.717420e+06 | 4.237317e+07 | -3.806075e+08 | 3.418721e+09 |
11 | 6.449735 | 41.599085 | 268.303082 | 1730.483838 | 11161.162581 | 71986.543546 | 4.642941e+05 | 2.994574e+06 | 1.931421e+07 | 1.245716e+08 |
1 | 6.465018 | 41.796463 | 270.214901 | 1746.944309 | 11294.027098 | 73016.092970 | 4.720504e+05 | 3.051814e+06 | 1.973004e+07 | 1.275550e+08 |
25 | 0.371593 | 0.138081 | 0.051310 | 0.019066 | 0.007085 | 0.002633 | 9.782985e-04 | 3.635287e-04 | 1.350847e-04 | 5.019650e-05 |
16 | -3.942982 | 15.547104 | -61.301947 | 241.712453 | -953.067768 | 3757.928725 | -1.481744e+04 | 5.842491e+04 | -2.303683e+05 | 9.083382e+05 |
45 | -4.457650 | 19.870645 | -88.576385 | 394.842539 | -1760.069919 | 7845.776000 | -3.497372e+04 | 1.559006e+05 | -6.949505e+05 | 3.097846e+06 |
15 | 5.481667 | 30.048674 | 164.716823 | 902.922780 | 4949.522041 | 27131.631828 | 1.487266e+05 | 8.152695e+05 | 4.469036e+06 | 2.449777e+07 |
42 | -9.804880 | 96.135668 | -942.598671 | 9242.066675 | -90617.352934 | 888492.254125 | -8.711560e+06 | 8.541580e+07 | -8.374916e+08 | 8.211505e+09 |
20 | -0.690041 | 0.476157 | -0.328568 | 0.226726 | -0.156450 | 0.107957 | -7.449486e-02 | 5.140454e-02 | -3.547126e-02 | 2.447664e-02 |
35 | 4.852230 | 23.544134 | 114.241552 | 554.326269 | 2689.718466 | 13051.132219 | 6.332709e+04 | 3.072776e+05 | 1.490982e+06 | 7.234585e+06 |
9 | 4.556122 | 20.758251 | 94.577131 | 430.904978 | 1963.255801 | 8944.833631 | 4.075376e+04 | 1.856791e+05 | 8.459767e+05 | 3.854373e+06 |
10 | -6.299257 | 39.680640 | -249.958554 | 1574.553197 | -9918.515413 | 62479.278687 | -3.935730e+05 | 2.479218e+06 | -1.561723e+07 | 9.837695e+07 |
Notice how y_train holds the exact same rows as X_train (look at the index of each).
y_train
13 2.574974
11 130.940520
1 73.954907
25 1.134863
16 -35.901888
45 -26.760377
15 24.441828
42 131.476240
20 61.591699
35 -13.988943
9 1.089803
10 -7.251101
Name: y, dtype: float64
Evaluating the performance, Part 1¶
from sklearn.metrics import mean_squared_error
mse_train_dict = {}
for i in range(1, max_deg+1):
    sub_cols = cols[:i]
    reg = LinearRegression()
    reg.fit(X_train[sub_cols], y_train)
    # what if we predict also on the training set?
    mse_train_dict[i] = mean_squared_error(reg.predict(X_train[sub_cols]), y_train)
When you look at the following dictionary, the mean squared error decreases as the degree increases, so it looks like the model keeps improving.

But there is a major problem: we both fit and predicted on the training set. Especially for higher-degree polynomials, there is a big risk that we are overfitting the data.
mse_train_dict
{1: 3006.2441403930584,
2: 1845.1659476797647,
3: 1820.054196587029,
4: 590.3862081005426,
5: 589.0984843939519,
6: 523.8762601493494,
7: 373.279496780059,
8: 206.85098887525842,
9: 204.99331332780278,
10: 101.47175254241118}
Evaluating the performance, Part 2¶
To get meaningful values of MSE (Mean Squared Error), we should always evaluate on a test set that was not used during training (fitting).
mse_test_dict = {}
for i in range(1, max_deg+1):
    sub_cols = cols[:i]
    reg = LinearRegression()
    reg.fit(X_train[sub_cols], y_train)
    # store predictions on the full dataset; we will use these columns for plotting below
    df[f"Pred{i}"] = reg.predict(df[sub_cols])
    mse_test_dict[i] = mean_squared_error(reg.predict(X_test[sub_cols]), y_test)
Notice how degree 2 gives the best (lowest) test MSE, unlike in the previous dictionary, where degree 10 looked best. This reflects the fact that the true underlying function has degree 2. (There is some randomness in this procedure: when I tried other values like random_state=4, degree 2 was not always the best; sometimes degree 3 or degree 4 appeared best.)
mse_test_dict
{1: 4960.848889431734,
2: 3724.5263419375206,
3: 3742.5488620098768,
4: 6252.874102524025,
5: 6253.412384993094,
6: 7016.575636378879,
7: 5609.641936177731,
8: 9481.527365671222,
9: 12142.080457403497,
10: 103395.24260458215}
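To see this randomness more systematically, one option (a sketch, not something we ran in class) is to repeat the whole procedure for several different random_state values and record which degree achieves the lowest test MSE each time:

# Sketch: for each seed, redo the split, fit every degree, and record which
# degree gives the lowest test MSE. The specific seeds are arbitrary.
best_degree = {}
for seed in range(10):
    Xtr, Xte, ytr, yte = train_test_split(
        df[cols], df["y"], train_size=0.24, random_state=seed
    )
    errs = {}
    for i in range(1, max_deg + 1):
        sub_cols = cols[:i]
        reg = LinearRegression()
        reg.fit(Xtr[sub_cols], ytr)
        errs[i] = mean_squared_error(yte, reg.predict(Xte[sub_cols]))
    best_degree[seed] = min(errs, key=errs.get)

best_degree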
Plotting the polynomial fits¶
This code is similar to what we did on Friday of last week, but first we’re going to add a column called “In_train” recording whether each point is in the training set or the test set.
df["In_train"] = "test"
X_train.index
Int64Index([13, 11, 1, 25, 16, 45, 15, 42, 20, 35, 9, 10], dtype='int64')
df.loc[X_train.index, "In_train"] = "train"
df.head()
| | x | y_true | y | x1 | x2 | x3 | x4 | x5 | x6 | x7 | ... | Pred2 | Pred3 | Pred4 | Pred5 | Pred6 | Pred7 | Pred8 | Pred9 | Pred10 | In_train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -3.329208 | -18.207589 | -117.484900 | -3.329208 | 11.083626 | -36.899694 | 122.846756 | -408.982395 | 1361.587441 | -4533.007730 | ... | -16.647390 | -19.954037 | 6.715567 | 5.215954 | -7.995404 | -18.576185 | 3.146424 | -5.566199 | -15.676053 | test |
1 | 6.465018 | 74.160562 | 73.954907 | 6.465018 | 41.796463 | 270.214901 | 1746.944309 | 11294.027098 | 73016.092970 | 472050.384357 | ... | 70.916916 | 65.455116 | 101.808679 | 102.350612 | 105.135066 | 100.392589 | 103.474571 | 103.854390 | 97.078223 | train |
2 | -4.478046 | -7.670062 | -13.810089 | -4.478046 | 20.052899 | -89.797810 | 402.118751 | -1800.706392 | 8063.646628 | -36109.383086 | ... | -9.287827 | -14.954228 | -16.790957 | -18.030386 | -17.560873 | -25.152785 | -38.972803 | -35.768769 | -25.492479 | test |
3 | 2.043272 | -7.925152 | 19.461182 | 2.043272 | 4.174960 | 8.530580 | 17.430295 | 35.614834 | 72.770792 | 148.690523 | ... | -1.930187 | 6.789503 | -5.399120 | -3.567399 | 30.239546 | -11.206589 | -73.991439 | -94.222921 | 94.585113 | test |
4 | 4.850593 | 36.485466 | 22.375230 | 4.850593 | 23.528255 | 114.125996 | 553.578791 | 2685.185564 | 13024.743051 | 63177.731115 | ... | 37.963743 | 42.179199 | -1.465158 | -2.100513 | -2.486611 | -5.545758 | 0.475225 | 2.001401 | -21.612846 | test |
5 rows × 24 columns
c = alt.Chart(df).mark_circle().encode(
    x="x",
    y="y",
    color="In_train"
)

c_true = alt.Chart(df).mark_line(color="black").encode(
    x="x",
    y="y_true",
)

chart_list = []

# one chart per degree; clip=True and the fixed y-domain keep the most extreme
# predictions from stretching the axes
for i in range(1, max_deg+1):
    c_temp = alt.Chart(df).mark_line(color="red", clip=True).encode(
        x="x",
        y=alt.Y(f"Pred{i}", scale=alt.Scale(domain=(-100, 300))),
    )
    chart_list.append(c_temp)
all_charts = [c+c_true+d for d in chart_list]
Both the training error and the test error are high when the degree is 1. That corresponds to underfitting; in the pictures below, it corresponds to the straight line not being able to fit the data well.

The training error is low but the test error is high for large degrees (like degree 8). This corresponds to overfitting. Look at the 8th plot below, and notice how the red degree-8 polynomial follows the training data (the orange points) very closely.

Notice how the fitted polynomial stays close to the true secret polynomial (the black quadratic curve) when we use degree 2 or degree 3.
alt.vconcat(*all_charts)
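One way to see underfitting and overfitting in a single picture (a sketch, not from the lecture) is to plot the training MSE and the test MSE against the degree; a log scale keeps the huge degree-10 test error from dominating the chart.

# Combine the two MSE dictionaries into one long-format DataFrame for Altair.
df_mse = pd.DataFrame({
    "degree": list(mse_train_dict.keys()),
    "train": list(mse_train_dict.values()),
    "test": list(mse_test_dict.values()),
}).melt(id_vars="degree", var_name="set", value_name="mse")

alt.Chart(df_mse).mark_line(point=True).encode(
    x="degree:O",
    y=alt.Y("mse", scale=alt.Scale(type="log")),
    color="set",
)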
If we wanted the polynomials to look smoother, we could replace df with something like the df_plot DataFrame that we used on Friday of last week.
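df_plot isn’t defined in this notebook, so here is a minimal sketch of what it could look like, assuming it is just a dense grid of x values with the same power columns (the actual df_plot from last week may have been built differently):

import numpy as np

# Hypothetical reconstruction of df_plot: a fine grid of x values plus the power columns.
df_plot = pd.DataFrame({"x": np.linspace(df["x"].min(), df["x"].max(), 200)})
for i in range(1, max_deg + 1):
    df_plot[f"x{i}"] = df_plot["x"]**i

# Recompute the predictions on the grid so the plotted curves look smooth.
for i in range(1, max_deg + 1):
    sub_cols = cols[:i]
    reg = LinearRegression()
    reg.fit(X_train[sub_cols], y_train)
    df_plot[f"Pred{i}"] = reg.predict(df_plot[sub_cols])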