Over-fitting and polynomial regression

We will use the following quantities.

m = 50 (number of data points)
x: m values drawn uniformly at random from the interval \([-5,5]\)
y_true: the values \(c_2 x^2 + c_1 x + c_0\), where \(c_2 = -1.4\), \(c_1 = 6.5\), and \(c_0 = 3.2\)
y: y_true plus normally distributed noise with mean 0 and standard deviation 30
d: the degree we use for the polynomial regression. Minimum allowed value: 1. Maximum allowed value: 20.
df: a pandas DataFrame holding all of our data.

import numpy as np
import pandas as pd
import altair as alt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng()

Define and plot the initial data

Define m, x, c, y_true, y. Make a pandas DataFrame df with columns for x, y_true, and y.

m = 50
x = 10*rng.random(size=m) - 5  # m values, uniform on [-5, 5)
# c[i] is the coefficient of x**i: c_0 = 3.2, c_1 = 6.5, c_2 = -1.4
c = [3.2,6.5,-1.4]

y_true = c[0] + c[1]*x + c[2]*x**2

y = y_true + rng.normal(loc = 0, scale = 30, size = m)

df = pd.DataFrame({"x":x,"y_true":y_true, "y": y})

df

	x	y_true	y
0	-4.900352	-62.271116	-80.214720
1	4.236751	5.608797	24.964830
2	3.128746	9.832176	27.688988
3	2.915629	10.250340	15.783512
4	-3.751502	-40.888037	-44.236319
5	-1.330997	-7.931659	9.999005
6	2.041902	10.635253	-7.676105
7	-2.322127	-19.443008	-19.361647
8	3.291763	9.426474	-7.688856
9	3.410108	9.085331	-20.883321
10	-0.339651	0.830762	36.057586
11	-0.401302	0.366075	-17.010790
12	2.904492	10.268695	3.732706
13	-0.053634	2.847349	-34.359309
14	3.778631	7.771830	29.311126
15	3.797665	7.693658	20.264761
16	-2.148780	-17.231225	9.019845
17	3.537900	8.672920	3.776920
18	-1.857388	-13.702864	22.194729
19	-3.965154	-44.584928	-26.446755
20	3.054540	9.992209	-0.424804
21	4.022897	6.691648	16.108043
22	3.440408	8.991681	20.950999
23	4.209701	5.752841	36.118528
24	-1.268537	-7.298356	3.504744
25	4.432180	4.507262	18.268370
26	-0.156168	2.150761	-16.623551
27	1.487581	9.771221	-39.149535
28	3.103047	9.889345	42.404592
29	3.067461	9.965452	1.500292
30	-1.504739	-9.750734	-6.590996
31	1.480935	9.755641	-31.225299
32	0.963425	8.162801	19.517357
33	-3.096909	-30.357099	-11.146812
34	-0.239334	1.564135	28.001757
35	-2.434673	-20.924066	-8.356333
36	-4.541341	-55.192009	-69.038385
37	4.852255	1.777528	5.054251
38	-4.991041	-64.116463	-83.757761
39	-4.044227	-45.985550	-22.373779
40	1.977772	10.579303	-14.424426
41	4.350993	4.977858	-6.815410
42	-1.783182	-12.842319	0.088739
43	2.594333	10.640376	55.229200
44	-3.364536	-34.517620	-62.697253
45	-1.568917	-10.444058	48.087066
46	0.794089	7.478772	-14.120384
47	1.397830	9.550395	10.711448
48	-4.347525	-51.520279	-9.869421
49	-3.376635	-34.710450	-19.330468

Plot x vs y_true in one Altair chart, and x vs y in another chart. Save these charts using the names chart_true and chart_data, respectively. For chart_true, use alt.value('black') and plot the data using a black line. For chart_data, plot the data using disks (in the default blue color).

chart_data = alt.Chart(df).mark_circle().encode(
    x = "x",
    y = "y"
)

chart_true = alt.Chart(df).mark_line().encode(
    x = "x",
    y = "y_true",
    color = alt.value("black"),
)

chart_data + chart_true

Include powers of x in `df`

Put more columns in df, corresponding to the powers of the x-column, up to and including degree 20.

for i in range(1,21):
    df["x"+str(i)] = df["x"]**i
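As a cross-check on the loop above, scikit-learn's PolynomialFeatures can build the same powers of x. The sketch below is not part of the original worksheet: the fixed seed and the names powers_loop and powers_sklearn are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the run is repeatable
x = 10*rng.random(size=50) - 5

# Powers built with the same loop as in the text.
df = pd.DataFrame({"x": x})
for i in range(1, 21):
    df["x"+str(i)] = df["x"]**i
powers_loop = df[[f"x{i}" for i in range(1, 21)]].to_numpy()

# Powers built by scikit-learn; include_bias=False drops the constant column x**0.
poly = PolynomialFeatures(degree=20, include_bias=False)
powers_sklearn = poly.fit_transform(x.reshape(-1, 1))

print(powers_sklearn.shape)
print(np.allclose(powers_loop, powers_sklearn))
```

Keeping the powers as named DataFrame columns, as the worksheet does, makes it easy to select the first d of them by name later on.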

Define a function performing the polynomial regression

Define a function poly_reg which takes as input df and a degree d, then performs polynomial regression using that degree, and as output returns the fit LinearRegression object.

def poly_reg(df, d):
    # Fit y against the columns x1, ..., xd, i.e. a degree-d polynomial in x.
    reg = LinearRegression()
    X = df[[f"x{i}" for i in range(1,d+1)]]
    reg.fit(X,df["y"])
    return reg

reg = poly_reg(df,2)

reg.predict(df[["x1","x2"]])

array([-52.92876935,   8.89299986,  12.28654622,  12.59460986,
       -33.92129846,  -4.45073566,  12.69513709, -14.78387552,
        11.97584418,  11.70952355,   3.47883237,   3.0558604 ,
        12.60765109,   5.3193789 ,  10.6605696 ,  10.59735911,
       -12.8030892 ,  11.38343901,  -9.63921282, -37.21153319,
        12.40642995,   9.78205973,  11.63582219,   9.01164325,
        -3.8801097 ,   7.98262563,   4.68259133,  11.78988806,
        12.32959117,  12.38652512,  -6.08824952,  11.77446983,
        10.2419052 , -24.53633336,   4.14718491, -16.10929915,
       -46.64189545,   5.70926163, -54.56681816, -38.45758212,
        12.62889082,   8.3721771 ,  -8.86671555,  12.84885467,
       -28.24654297,  -6.71182115,   9.59805182,  11.57255428,
       -43.37901892, -28.41841962])
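The fitted object also stores the estimated coefficients, which can be compared with the true values \(c_0 = 3.2\), \(c_1 = 6.5\), \(c_2 = -1.4\). The following is a minimal sketch that rebuilds the data with a fixed seed (an assumption) so that it runs on its own. With a noise standard deviation of 30, the estimates should not be expected to match the true values closely.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # fixed seed is an assumption, not from the worksheet
m = 50
x = 10*rng.random(size=m) - 5
c = [3.2, 6.5, -1.4]
y = c[0] + c[1]*x + c[2]*x**2 + rng.normal(loc=0, scale=30, size=m)

df = pd.DataFrame({"x1": x, "x2": x**2, "y": y})

# Degree-2 fit: y is modelled as intercept + coef[0]*x + coef[1]*x**2.
reg = LinearRegression()
reg.fit(df[["x1", "x2"]], df["y"])

print("intercept (estimate of c_0):", reg.intercept_)
print("coefficients (estimates of c_1, c_2):", reg.coef_)

# predict evaluates the fitted polynomial at each row.
manual = reg.intercept_ + reg.coef_[0]*df["x1"] + reg.coef_[1]*df["x2"]
print(np.allclose(manual, reg.predict(df[["x1", "x2"]])))
```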

[f"x{i}" for i in range(1,5)]

['x1', 'x2', 'x3', 'x4']

df[[f"x{i}" for i in range(1,5)]]

	x1	x2	x3	x4
0	-4.900352	24.013449	-117.674352	576.645739
1	4.236751	17.950062	76.049952	322.204742
2	3.128746	9.789053	30.627463	95.825559
3	2.915629	8.500890	24.785437	72.265127
4	-3.751502	14.073767	-52.797766	198.070927
5	-1.330997	1.771554	-2.357934	3.138404
6	2.041902	4.169362	8.513426	17.383577
7	-2.322127	5.392274	-12.521544	29.076614
8	3.291763	10.835703	35.668567	117.412469
9	3.410108	11.628835	39.655581	135.229804
10	-0.339651	0.115363	-0.039183	0.013309
11	-0.401302	0.161043	-0.064627	0.025935
12	2.904492	8.436071	24.502499	71.167302
13	-0.053634	0.002877	-0.000154	0.000008
14	3.778631	14.278049	53.951473	203.862687
15	3.797665	14.422262	54.770923	208.001634
16	-2.148780	4.617255	-9.921464	21.319040
17	3.537900	12.516734	44.282947	156.668620
18	-1.857388	3.449889	-6.407781	11.901733
19	-3.965154	15.722447	-62.341925	247.195341
20	3.054540	9.330217	28.499526	87.052955
21	4.022897	16.183704	65.105381	261.912270
22	3.440408	11.836410	40.722082	140.100592
23	4.209701	17.721582	74.602559	314.054459
24	-1.268537	1.609187	-2.041314	2.589484
25	4.432180	19.644221	87.066727	385.895419
26	-0.156168	0.024389	-0.003809	0.000595
27	1.487581	2.212898	3.291865	4.896916
28	3.103047	9.628899	29.878925	92.715703
29	3.067461	9.409318	28.862717	88.535265
30	-1.504739	2.264238	-3.407086	5.126774
31	1.480935	2.193168	3.247938	4.809984
32	0.963425	0.928189	0.894241	0.861534
33	-3.096909	9.590848	-29.701988	91.984368
34	-0.239334	0.057281	-0.013709	0.003281
35	-2.434673	5.927635	-14.431855	35.136854
36	-4.541341	20.623780	-93.659619	425.340282
37	4.852255	23.544378	114.243325	554.337741
38	-4.991041	24.910495	-124.329315	620.532771
39	-4.044227	16.355769	-66.146437	267.511183
40	1.977772	3.911584	7.736223	15.300488
41	4.350993	18.931140	82.369259	358.388070
42	-1.783182	3.179739	-5.670054	10.110739
43	2.594333	6.730562	17.461317	45.300467
44	-3.364536	11.320099	-38.086877	128.144653
45	-1.568917	2.461500	-3.861888	6.058980
46	0.794089	0.630578	0.500735	0.397628
47	1.397830	1.953929	2.731260	3.817837
48	-4.347525	18.900975	-82.172465	357.246857
49	-3.376635	11.401661	-38.499243	129.997875

Define a function plotting the fit polynomial

Define a function make_chart which takes as input df and a degree d, and as output returns an Altair chart showing the fit polynomial. Use a red line for the polynomial. Use .copy() at the beginning of the function so you are not changing the original DataFrame. Give the chart a title using .properties(title = ...).

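def make_chart(df,d):
    df2 = df.copy()  # work on a copy so the original DataFrame is not changed
    reg = poly_reg(df2,d)
    X = df2[[f"x{i}" for i in range(1,d+1)]]
    df2["y_pred"] = reg.predict(X)
    chart = alt.Chart(df2).mark_line().encode(
        x = "x1",
        y = "y_pred",
        color = alt.value("red"),
    ).properties(
        title = f"Degree {d}"
    )
    return chart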

make_chart(df,20) + chart_true + chart_data
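A high-degree fit can follow the noisy points more closely while drifting away from the true parabola. One way to put numbers on this is to compare, for several degrees, the mean squared error against the noisy y with the mean squared error against the noise-free y_true. The sketch below is an illustration only: the fixed seed, the chosen list of degrees, and the names mse_train and mse_true are assumptions, not part of the worksheet.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatability
m = 50
x = 10*rng.random(size=m) - 5
y_true = 3.2 + 6.5*x - 1.4*x**2
y = y_true + rng.normal(loc=0, scale=30, size=m)

df = pd.DataFrame({"x": x, "y_true": y_true, "y": y})
for i in range(1, 21):
    df["x"+str(i)] = df["x"]**i

def poly_reg(df, d):
    # Same function as in the text: degree-d polynomial regression.
    reg = LinearRegression()
    X = df[[f"x{i}" for i in range(1, d+1)]]
    reg.fit(X, df["y"])
    return reg

mse_train = {}  # error against the noisy data used for fitting
mse_true = {}   # error against the underlying quadratic
for d in [1, 2, 5, 10, 20]:
    X = df[[f"x{i}" for i in range(1, d+1)]]
    pred = poly_reg(df, d).predict(X)
    mse_train[d] = mean_squared_error(df["y"], pred)
    mse_true[d] = mean_squared_error(df["y_true"], pred)
    print(d, mse_train[d], mse_true[d])
```

In exact arithmetic the error against y cannot increase as d grows, because each model contains the previous one. The error against y_true is the one that typically gets worse for large d. Powers up to x**20 are badly scaled, so the largest degrees may also show numerical effects.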

Put the charts together

Make a length 20 list chart_list = [make_chart(df,1) + chart_true + chart_data, make_chart(df,2) + chart_true + chart_data, ..., make_chart(df,20) + chart_true + chart_data]. Use alt.vconcat to display all of these charts.

UC Irvine Math 10