Over-fitting and polynomial regression

We will use the following constants.

  • m = 50 (number of data points)

  • x-values randomly distributed in the interval \([-5,5]\)

  • y_true values given by \(c_2 x^2 + c_1 x + c_0\), where \(c_2 = -1.4\), \(c_1 = 6.5\), and \(c_0 = 3.2\)

  • y values given by y_true plus normally distributed random values with a mean of 0 and a standard deviation of 30.

  • d: the degree we use for the polynomial regression. Minimum allowed value: 1. Maximum allowed value: 20.

  • df: a pandas DataFrame holding all of our data.

import numpy as np
import pandas as pd
import altair as alt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng()

Define and plot the initial data

Define m, x, c, y_true, y. Make a pandas DataFrame df with columns for x, y_true, and y.

m = 50
x = 10*rng.random(size=m) - 5
# c2 = -1.4, c1 = 6.5, c0 = 3.2
c = [3.2, 6.5, -1.4]
y_true = c[0] + c[1]*x + c[2]*x**2
y = y_true + rng.normal(loc = 0, scale = 30, size = m)
df = pd.DataFrame({"x":x,"y_true":y_true, "y": y})
df
x y_true y
0 -4.900352 -62.271116 -80.214720
1 4.236751 5.608797 24.964830
2 3.128746 9.832176 27.688988
3 2.915629 10.250340 15.783512
4 -3.751502 -40.888037 -44.236319
5 -1.330997 -7.931659 9.999005
6 2.041902 10.635253 -7.676105
7 -2.322127 -19.443008 -19.361647
8 3.291763 9.426474 -7.688856
9 3.410108 9.085331 -20.883321
10 -0.339651 0.830762 36.057586
11 -0.401302 0.366075 -17.010790
12 2.904492 10.268695 3.732706
13 -0.053634 2.847349 -34.359309
14 3.778631 7.771830 29.311126
15 3.797665 7.693658 20.264761
16 -2.148780 -17.231225 9.019845
17 3.537900 8.672920 3.776920
18 -1.857388 -13.702864 22.194729
19 -3.965154 -44.584928 -26.446755
20 3.054540 9.992209 -0.424804
21 4.022897 6.691648 16.108043
22 3.440408 8.991681 20.950999
23 4.209701 5.752841 36.118528
24 -1.268537 -7.298356 3.504744
25 4.432180 4.507262 18.268370
26 -0.156168 2.150761 -16.623551
27 1.487581 9.771221 -39.149535
28 3.103047 9.889345 42.404592
29 3.067461 9.965452 1.500292
30 -1.504739 -9.750734 -6.590996
31 1.480935 9.755641 -31.225299
32 0.963425 8.162801 19.517357
33 -3.096909 -30.357099 -11.146812
34 -0.239334 1.564135 28.001757
35 -2.434673 -20.924066 -8.356333
36 -4.541341 -55.192009 -69.038385
37 4.852255 1.777528 5.054251
38 -4.991041 -64.116463 -83.757761
39 -4.044227 -45.985550 -22.373779
40 1.977772 10.579303 -14.424426
41 4.350993 4.977858 -6.815410
42 -1.783182 -12.842319 0.088739
43 2.594333 10.640376 55.229200
44 -3.364536 -34.517620 -62.697253
45 -1.568917 -10.444058 48.087066
46 0.794089 7.478772 -14.120384
47 1.397830 9.550395 10.711448
48 -4.347525 -51.520279 -9.869421
49 -3.376635 -34.710450 -19.330468
Plot x vs y_true in one Altair chart, and x vs y in another chart. Save these charts using the names chart_true and chart_data, respectively. For chart_true, use alt.value("black") to plot the data as a black line. For chart_data, plot the data as disks (in the default blue color).

chart_data = alt.Chart(df).mark_circle().encode(
    x = "x",
    y = "y"
)
chart_true = alt.Chart(df).mark_line().encode(
    x = "x",
    y = "y_true",
    color = alt.value("black"),
)
chart_data + chart_true

Include powers of x in df

Put more columns in df, corresponding to the powers of the x-column, up to and including degree 20.

# add columns x1, ..., x20 holding the corresponding powers of x
for i in range(1, 21):
    df[f"x{i}"] = df["x"]**i
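As an aside (an assumption, not something the notebook asks for): building the columns one at a time in a loop can trigger a fragmentation warning in recent pandas versions. A sketch of an alternative that builds all twenty columns at once with a dict comprehension and pd.concat:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng()
df = pd.DataFrame({"x": 10*rng.random(size=50) - 5})

# build every power column in one shot, then attach them to df
powers = pd.DataFrame({f"x{i}": df["x"]**i for i in range(1, 21)})
df = pd.concat([df, powers], axis=1)
print(df.columns[-1])  # x20
```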

Define a function performing the polynomial regression

Define a function poly_reg which takes as input df and a degree d, then performs polynomial regression using that degree, and as output returns the fit LinearRegression object.

def poly_reg(df, d):
    # fit y as a linear function of x^1, ..., x^d (d between 1 and 20)
    reg = LinearRegression()
    X = df[[f"x{i}" for i in range(1, d+1)]]
    reg.fit(X, df["y"])
    return reg
reg = poly_reg(df,2)
reg.predict(df[["x1","x2"]])
array([-52.92876935,   8.89299986,  12.28654622,  12.59460986,
       -33.92129846,  -4.45073566,  12.69513709, -14.78387552,
        11.97584418,  11.70952355,   3.47883237,   3.0558604 ,
        12.60765109,   5.3193789 ,  10.6605696 ,  10.59735911,
       -12.8030892 ,  11.38343901,  -9.63921282, -37.21153319,
        12.40642995,   9.78205973,  11.63582219,   9.01164325,
        -3.8801097 ,   7.98262563,   4.68259133,  11.78988806,
        12.32959117,  12.38652512,  -6.08824952,  11.77446983,
        10.2419052 , -24.53633336,   4.14718491, -16.10929915,
       -46.64189545,   5.70926163, -54.56681816, -38.45758212,
        12.62889082,   8.3721771 ,  -8.86671555,  12.84885467,
       -28.24654297,  -6.71182115,   9.59805182,  11.57255428,
       -43.37901892, -28.41841962])
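As a quick sanity check (not part of the assignment), a degree-2 fit should roughly recover the true coefficients. The sketch below reuses the same setup but with a fixed seed and a smaller noise scale (5 instead of 30) so the recovery is easy to see; both choices are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# same setup as the notebook, but seeded and with less noise (assumptions
# for this sketch) so the recovered coefficients are easy to compare
rng = np.random.default_rng(0)
m = 50
x = 10*rng.random(size=m) - 5
y = 3.2 + 6.5*x - 1.4*x**2 + rng.normal(loc=0, scale=5, size=m)
df = pd.DataFrame({"x1": x, "x2": x**2, "y": y})

reg = LinearRegression()
reg.fit(df[["x1", "x2"]], df["y"])
# intercept_ should be near c0 = 3.2; coef_ near [c1, c2] = [6.5, -1.4]
print(reg.intercept_, reg.coef_)
```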
[f"x{i}" for i in range(1,5)]
['x1', 'x2', 'x3', 'x4']
df[[f"x{i}" for i in range(1,5)]]
x1 x2 x3 x4
0 -4.900352 24.013449 -117.674352 576.645739
1 4.236751 17.950062 76.049952 322.204742
2 3.128746 9.789053 30.627463 95.825559
3 2.915629 8.500890 24.785437 72.265127
4 -3.751502 14.073767 -52.797766 198.070927
5 -1.330997 1.771554 -2.357934 3.138404
6 2.041902 4.169362 8.513426 17.383577
7 -2.322127 5.392274 -12.521544 29.076614
8 3.291763 10.835703 35.668567 117.412469
9 3.410108 11.628835 39.655581 135.229804
10 -0.339651 0.115363 -0.039183 0.013309
11 -0.401302 0.161043 -0.064627 0.025935
12 2.904492 8.436071 24.502499 71.167302
13 -0.053634 0.002877 -0.000154 0.000008
14 3.778631 14.278049 53.951473 203.862687
15 3.797665 14.422262 54.770923 208.001634
16 -2.148780 4.617255 -9.921464 21.319040
17 3.537900 12.516734 44.282947 156.668620
18 -1.857388 3.449889 -6.407781 11.901733
19 -3.965154 15.722447 -62.341925 247.195341
20 3.054540 9.330217 28.499526 87.052955
21 4.022897 16.183704 65.105381 261.912270
22 3.440408 11.836410 40.722082 140.100592
23 4.209701 17.721582 74.602559 314.054459
24 -1.268537 1.609187 -2.041314 2.589484
25 4.432180 19.644221 87.066727 385.895419
26 -0.156168 0.024389 -0.003809 0.000595
27 1.487581 2.212898 3.291865 4.896916
28 3.103047 9.628899 29.878925 92.715703
29 3.067461 9.409318 28.862717 88.535265
30 -1.504739 2.264238 -3.407086 5.126774
31 1.480935 2.193168 3.247938 4.809984
32 0.963425 0.928189 0.894241 0.861534
33 -3.096909 9.590848 -29.701988 91.984368
34 -0.239334 0.057281 -0.013709 0.003281
35 -2.434673 5.927635 -14.431855 35.136854
36 -4.541341 20.623780 -93.659619 425.340282
37 4.852255 23.544378 114.243325 554.337741
38 -4.991041 24.910495 -124.329315 620.532771
39 -4.044227 16.355769 -66.146437 267.511183
40 1.977772 3.911584 7.736223 15.300488
41 4.350993 18.931140 82.369259 358.388070
42 -1.783182 3.179739 -5.670054 10.110739
43 2.594333 6.730562 17.461317 45.300467
44 -3.364536 11.320099 -38.086877 128.144653
45 -1.568917 2.461500 -3.861888 6.058980
46 0.794089 0.630578 0.500735 0.397628
47 1.397830 1.953929 2.731260 3.817837
48 -4.347525 18.900975 -82.172465 357.246857
49 -3.376635 11.401661 -38.499243 129.997875
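As another aside (an assumption, not part of the assignment), scikit-learn can generate the power columns automatically with PolynomialFeatures instead of building them by hand. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng()
x = 10*rng.random(size=50) - 5

# include_bias=False drops the constant column, leaving x^1, ..., x^4
poly = PolynomialFeatures(degree=4, include_bias=False)
X = poly.fit_transform(x.reshape(-1, 1))
print(X.shape)  # (50, 4)
```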

Define a function plotting the fit polynomial

Define a function make_chart which takes as input df and a degree d, and as output returns an Altair chart showing the fit polynomial. Use a red line for the polynomial. Use .copy() at the beginning of the function so you are not changing the original DataFrame. Give the chart a title using .properties(title = ...).

def make_chart(df, d):
    # copy so the original DataFrame is not modified
    df2 = df.copy()
    reg = poly_reg(df2, d)
    X = df2[[f"x{i}" for i in range(1, d+1)]]
    df2["y_pred"] = reg.predict(X)
    chart = alt.Chart(df2).mark_line(color="red").encode(
        x = "x1",
        y = "y_pred"
    ).properties(title = f"Degree {d}")
    return chart
make_chart(df,20) + chart_true + chart_data

Put the charts together

Make a list chart_list of length 20: chart_list = [make_chart(df,1) + chart_true + chart_data, make_chart(df,2) + chart_true + chart_data, ..., make_chart(df,20) + chart_true + chart_data]. Use alt.vconcat to display all of these charts.