Over-fitting and polynomial regression¶
We will use the following constants.
- m = 50 (number of data points)
- x-values randomly distributed in the interval \([-5,5]\)
- y_true values given by \(c_2 x^2 + c_1 x + c_0\), where \(c_2 = -1.4\), \(c_1 = 6.5\), and \(c_0 = 3.2\)
- y values given by y_true plus normally distributed random values with a mean of 0 and a standard deviation of 30
- d: the degree we use for the polynomial regression. Minimum allowed value: 1. Maximum allowed value: 20.
- df: a pandas DataFrame holding all of our data.
import numpy as np
import pandas as pd
import altair as alt
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng()
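Because the generator above is created without a seed, every run produces different data, so the tables and charts in these notes will not be reproduced exactly. A minimal sketch of how a seed would make the draws repeatable is below; the seed value 0 and the names rng_a, rng_b are arbitrary choices for this illustration, not part of the original notes.

```python
import numpy as np

# The seed value 0 is an arbitrary choice for this sketch.
rng_a = np.random.default_rng(seed=0)
rng_b = np.random.default_rng(seed=0)

# Same recipe as in the notes: 50 x-values spread uniformly over [-5, 5).
x_a = 10 * rng_a.random(size=50) - 5
x_b = 10 * rng_b.random(size=50) - 5

# Two generators created with the same seed produce identical draws.
same_draws = np.array_equal(x_a, x_b)
print(same_draws)
```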
Define and plot the initial data¶
Define m, x, c, y_true, y. Make a pandas DataFrame df with columns for x, y_true, and y.
m = 50
x = 10*rng.random(size=m) - 5
# c[0] = c_0 = 3.2, c[1] = c_1 = 6.5, c[2] = c_2 = -1.4
c = [3.2,6.5,-1.4]
y_true = c[0] + c[1]*x + c[2]*x**2
y = y_true + rng.normal(loc = 0, scale = 30, size = m)
df = pd.DataFrame({"x":x,"y_true":y_true, "y": y})
df
| | x | y_true | y |
---|---|---|---|
0 | -4.900352 | -62.271116 | -80.214720 |
1 | 4.236751 | 5.608797 | 24.964830 |
2 | 3.128746 | 9.832176 | 27.688988 |
3 | 2.915629 | 10.250340 | 15.783512 |
4 | -3.751502 | -40.888037 | -44.236319 |
5 | -1.330997 | -7.931659 | 9.999005 |
6 | 2.041902 | 10.635253 | -7.676105 |
7 | -2.322127 | -19.443008 | -19.361647 |
8 | 3.291763 | 9.426474 | -7.688856 |
9 | 3.410108 | 9.085331 | -20.883321 |
10 | -0.339651 | 0.830762 | 36.057586 |
11 | -0.401302 | 0.366075 | -17.010790 |
12 | 2.904492 | 10.268695 | 3.732706 |
13 | -0.053634 | 2.847349 | -34.359309 |
14 | 3.778631 | 7.771830 | 29.311126 |
15 | 3.797665 | 7.693658 | 20.264761 |
16 | -2.148780 | -17.231225 | 9.019845 |
17 | 3.537900 | 8.672920 | 3.776920 |
18 | -1.857388 | -13.702864 | 22.194729 |
19 | -3.965154 | -44.584928 | -26.446755 |
20 | 3.054540 | 9.992209 | -0.424804 |
21 | 4.022897 | 6.691648 | 16.108043 |
22 | 3.440408 | 8.991681 | 20.950999 |
23 | 4.209701 | 5.752841 | 36.118528 |
24 | -1.268537 | -7.298356 | 3.504744 |
25 | 4.432180 | 4.507262 | 18.268370 |
26 | -0.156168 | 2.150761 | -16.623551 |
27 | 1.487581 | 9.771221 | -39.149535 |
28 | 3.103047 | 9.889345 | 42.404592 |
29 | 3.067461 | 9.965452 | 1.500292 |
30 | -1.504739 | -9.750734 | -6.590996 |
31 | 1.480935 | 9.755641 | -31.225299 |
32 | 0.963425 | 8.162801 | 19.517357 |
33 | -3.096909 | -30.357099 | -11.146812 |
34 | -0.239334 | 1.564135 | 28.001757 |
35 | -2.434673 | -20.924066 | -8.356333 |
36 | -4.541341 | -55.192009 | -69.038385 |
37 | 4.852255 | 1.777528 | 5.054251 |
38 | -4.991041 | -64.116463 | -83.757761 |
39 | -4.044227 | -45.985550 | -22.373779 |
40 | 1.977772 | 10.579303 | -14.424426 |
41 | 4.350993 | 4.977858 | -6.815410 |
42 | -1.783182 | -12.842319 | 0.088739 |
43 | 2.594333 | 10.640376 | 55.229200 |
44 | -3.364536 | -34.517620 | -62.697253 |
45 | -1.568917 | -10.444058 | 48.087066 |
46 | 0.794089 | 7.478772 | -14.120384 |
47 | 1.397830 | 9.550395 | 10.711448 |
48 | -4.347525 | -51.520279 | -9.869421 |
49 | -3.376635 | -34.710450 | -19.330468 |
chart_data = alt.Chart(df).mark_circle().encode(
    x = "x",
    y = "y"
)
chart_true = alt.Chart(df).mark_line().encode(
    x = "x",
    y = "y_true",
    color = alt.value("black"),
)
chart_data + chart_true
Plot x vs y_true in one Altair chart, and x vs y in another chart. Save these charts using the names chart_true and chart_data, respectively. For chart_true, use alt.value('black') and plot the data using a black line. For chart_data, plot the data using disks (in the default blue color).
Include powers of x in df¶
Put more columns in df, corresponding to the powers of the x-column, up to and including degree 20.
for i in range(1,21):
    df["x"+str(i)] = df["x"]**i
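As a cross-check on the loop above, the same powers can be generated in one call with NumPy's np.vander. The sketch below uses a tiny made-up DataFrame (df_demo) and only goes up to degree 4; both are choices for this illustration rather than part of the notes.

```python
import numpy as np
import pandas as pd

# Small illustrative input; the values are made up for this sketch.
df_demo = pd.DataFrame({"x": [-2.0, 0.5, 3.0]})
max_degree = 4

# Loop from the notes, shortened to degree 4.
for i in range(1, max_degree + 1):
    df_demo[f"x{i}"] = df_demo["x"] ** i

# With increasing=True, np.vander returns the columns x**0, x**1, ..., x**max_degree.
# Dropping the first column leaves x**1 through x**max_degree.
x_values = df_demo["x"].to_numpy()
powers = np.vander(x_values, N=max_degree + 1, increasing=True)[:, 1:]

loop_cols = df_demo[[f"x{i}" for i in range(1, max_degree + 1)]].to_numpy()
print(np.allclose(powers, loop_cols))
```

The loop keeps the named columns x1, x2, ... that the later code relies on, so the np.vander version is best seen as a check rather than a replacement.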
Define a function performing the polynomial regression¶
Define a function poly_reg which takes as input df and a degree d, then performs polynomial regression using that degree, and as output returns the fitted LinearRegression object.
def poly_reg(df, d):
    reg = LinearRegression()
    # Use the columns x1, ..., xd as the input features.
    X = df[[f"x{i}" for i in range(1,d+1)]]
    reg.fit(X,df["y"])
    return reg
reg = poly_reg(df,2)
reg.predict(df[["x1","x2"]])
array([-52.92876935, 8.89299986, 12.28654622, 12.59460986,
-33.92129846, -4.45073566, 12.69513709, -14.78387552,
11.97584418, 11.70952355, 3.47883237, 3.0558604 ,
12.60765109, 5.3193789 , 10.6605696 , 10.59735911,
-12.8030892 , 11.38343901, -9.63921282, -37.21153319,
12.40642995, 9.78205973, 11.63582219, 9.01164325,
-3.8801097 , 7.98262563, 4.68259133, 11.78988806,
12.32959117, 12.38652512, -6.08824952, 11.77446983,
10.2419052 , -24.53633336, 4.14718491, -16.10929915,
-46.64189545, 5.70926163, -54.56681816, -38.45758212,
12.62889082, 8.3721771 , -8.86671555, 12.84885467,
-28.24654297, -6.71182115, 9.59805182, 11.57255428,
-43.37901892, -28.41841962])
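With noise of standard deviation 30, the degree 2 fit will generally not reproduce the true coefficients exactly. As a sanity check on the approach, the same kind of fit on noise-free data should recover \(c_0\), \(c_1\), and \(c_2\) up to floating-point error. The sketch below assumes a fixed seed and uses the names df_clean and reg_clean, which are introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable
x = 10 * rng.random(size=50) - 5
c = [3.2, 6.5, -1.4]

# Noise-free targets: exactly the quadratic from the notes.
df_clean = pd.DataFrame({"x1": x, "x2": x**2})
df_clean["y"] = c[0] + c[1] * x + c[2] * x**2

reg_clean = LinearRegression()
reg_clean.fit(df_clean[["x1", "x2"]], df_clean["y"])

# intercept_ plays the role of c_0; coef_ holds the estimates of c_1 and c_2,
# in the same order as the input columns.
print(reg_clean.intercept_, reg_clean.coef_)
```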
[f"x{i}" for i in range(1,5)]
['x1', 'x2', 'x3', 'x4']
df[[f"x{i}" for i in range(1,5)]]
| | x1 | x2 | x3 | x4 |
---|---|---|---|---|
0 | -4.900352 | 24.013449 | -117.674352 | 576.645739 |
1 | 4.236751 | 17.950062 | 76.049952 | 322.204742 |
2 | 3.128746 | 9.789053 | 30.627463 | 95.825559 |
3 | 2.915629 | 8.500890 | 24.785437 | 72.265127 |
4 | -3.751502 | 14.073767 | -52.797766 | 198.070927 |
5 | -1.330997 | 1.771554 | -2.357934 | 3.138404 |
6 | 2.041902 | 4.169362 | 8.513426 | 17.383577 |
7 | -2.322127 | 5.392274 | -12.521544 | 29.076614 |
8 | 3.291763 | 10.835703 | 35.668567 | 117.412469 |
9 | 3.410108 | 11.628835 | 39.655581 | 135.229804 |
10 | -0.339651 | 0.115363 | -0.039183 | 0.013309 |
11 | -0.401302 | 0.161043 | -0.064627 | 0.025935 |
12 | 2.904492 | 8.436071 | 24.502499 | 71.167302 |
13 | -0.053634 | 0.002877 | -0.000154 | 0.000008 |
14 | 3.778631 | 14.278049 | 53.951473 | 203.862687 |
15 | 3.797665 | 14.422262 | 54.770923 | 208.001634 |
16 | -2.148780 | 4.617255 | -9.921464 | 21.319040 |
17 | 3.537900 | 12.516734 | 44.282947 | 156.668620 |
18 | -1.857388 | 3.449889 | -6.407781 | 11.901733 |
19 | -3.965154 | 15.722447 | -62.341925 | 247.195341 |
20 | 3.054540 | 9.330217 | 28.499526 | 87.052955 |
21 | 4.022897 | 16.183704 | 65.105381 | 261.912270 |
22 | 3.440408 | 11.836410 | 40.722082 | 140.100592 |
23 | 4.209701 | 17.721582 | 74.602559 | 314.054459 |
24 | -1.268537 | 1.609187 | -2.041314 | 2.589484 |
25 | 4.432180 | 19.644221 | 87.066727 | 385.895419 |
26 | -0.156168 | 0.024389 | -0.003809 | 0.000595 |
27 | 1.487581 | 2.212898 | 3.291865 | 4.896916 |
28 | 3.103047 | 9.628899 | 29.878925 | 92.715703 |
29 | 3.067461 | 9.409318 | 28.862717 | 88.535265 |
30 | -1.504739 | 2.264238 | -3.407086 | 5.126774 |
31 | 1.480935 | 2.193168 | 3.247938 | 4.809984 |
32 | 0.963425 | 0.928189 | 0.894241 | 0.861534 |
33 | -3.096909 | 9.590848 | -29.701988 | 91.984368 |
34 | -0.239334 | 0.057281 | -0.013709 | 0.003281 |
35 | -2.434673 | 5.927635 | -14.431855 | 35.136854 |
36 | -4.541341 | 20.623780 | -93.659619 | 425.340282 |
37 | 4.852255 | 23.544378 | 114.243325 | 554.337741 |
38 | -4.991041 | 24.910495 | -124.329315 | 620.532771 |
39 | -4.044227 | 16.355769 | -66.146437 | 267.511183 |
40 | 1.977772 | 3.911584 | 7.736223 | 15.300488 |
41 | 4.350993 | 18.931140 | 82.369259 | 358.388070 |
42 | -1.783182 | 3.179739 | -5.670054 | 10.110739 |
43 | 2.594333 | 6.730562 | 17.461317 | 45.300467 |
44 | -3.364536 | 11.320099 | -38.086877 | 128.144653 |
45 | -1.568917 | 2.461500 | -3.861888 | 6.058980 |
46 | 0.794089 | 0.630578 | 0.500735 | 0.397628 |
47 | 1.397830 | 1.953929 | 2.731260 | 3.817837 |
48 | -4.347525 | 18.900975 | -82.172465 | 357.246857 |
49 | -3.376635 | 11.401661 | -38.499243 | 129.997875 |
Define a function plotting the fit polynomial¶
Define a function make_chart which takes as input df and a degree d, and as output returns an Altair chart showing the fit polynomial. Use a red line for the polynomial. Use .copy() at the beginning of the function so you are not changing the original DataFrame. Give the chart a title using .properties(title = ...).
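def make_chart(df,d):
    # Work on a copy so the original DataFrame is left unchanged.
    df2 = df.copy()
    reg = poly_reg(df2,d)
    X = df2[[f"x{i}" for i in range(1,d+1)]]
    df2["y_pred"] = reg.predict(X)
    chart = alt.Chart(df2).mark_line().encode(
        x = "x1",
        y = "y_pred",
        color = alt.value("red")
    ).properties(
        title = f"Degree {d}"  # the title text is one reasonable choice
    )
    return chart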
make_chart(df,20) + chart_true + chart_data
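A high-degree curve can follow the noisy points closely, and one way to put a number on that is to compare the mean squared error on the points used for fitting with the error on fresh points drawn the same way. The sketch below is a rough illustration under several assumptions: it uses a fixed seed, a hypothetical helper named sample, and NumPy's Polynomial.fit in place of LinearRegression (it solves the same least-squares problem after rescaling x, which is numerically gentler at higher degrees).

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only

def sample(m):
    """Draw m noisy points from the quadratic used in the notes."""
    x = 10 * rng.random(size=m) - 5
    y = 3.2 + 6.5 * x - 1.4 * x**2 + rng.normal(loc=0, scale=30, size=m)
    return x, y

x_train, y_train = sample(50)
x_new, y_new = sample(50)  # fresh points, never used for fitting

train_mse = {}
new_mse = {}
for d in (1, 2, 5, 10):
    # Least-squares polynomial of degree d, fitted to the training points only.
    p = Polynomial.fit(x_train, y_train, deg=d)
    train_mse[d] = np.mean((p(x_train) - y_train) ** 2)
    new_mse[d] = np.mean((p(x_new) - y_new) ** 2)
    print(d, round(train_mse[d], 1), round(new_mse[d], 1))
```

The training error cannot increase as the degree grows, because each higher-degree model contains the lower-degree ones. The error on the fresh points carries no such guarantee, and that gap is what over-fitting refers to.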
Put the charts together¶
Make a length 20 list chart_list = [make_chart(df,1) + chart_true + chart_data, make_chart(df,2) + chart_true + chart_data, ..., make_chart(df,20) + chart_true + chart_data]. Use alt.vconcat to display all of these charts.