Over-fitting and polynomial regression¶
We will use the following constants.
- m = 50 (the number of data points)
- x: values randomly distributed in the interval \([-5,5]\)
- y_true: values given by \(c_2 x^2 + c_1 x + c_0\), where \(c_2 = -1.4\), \(c_1 = 6.5\), and \(c_0 = 3.2\)
- y: values given by y_true plus normally distributed random values with a mean of 0 and a standard deviation of 30
- d: the degree we use for the polynomial regression. Minimum allowed value: 1. Maximum allowed value: 20.
- df: a pandas DataFrame holding all of our data.
import numpy as np
import pandas as pd
import altair as alt
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng()
Define and plot the initial data¶
Define m, x, c, y_true, y. Make a pandas DataFrame df with columns for x, y_true, and y.
m = 50
x = 10*rng.random(size=m) - 5
# c2 = -1.4, c1 = 6.5, c0 = 3.2
c = [3.2,6.5,-1.4]
y_true = c[0] + c[1]*x + c[2]*x**2
y = y_true + rng.normal(loc = 0, scale = 30, size = m)
df = pd.DataFrame({"x":x,"y_true":y_true, "y": y})
df
|   | x | y_true | y |
|---|---|---|---|
| 0 | -4.900352 | -62.271116 | -80.214720 |
| 1 | 4.236751 | 5.608797 | 24.964830 |
| 2 | 3.128746 | 9.832176 | 27.688988 |
| 3 | 2.915629 | 10.250340 | 15.783512 |
| 4 | -3.751502 | -40.888037 | -44.236319 |
| 5 | -1.330997 | -7.931659 | 9.999005 |
| 6 | 2.041902 | 10.635253 | -7.676105 |
| 7 | -2.322127 | -19.443008 | -19.361647 |
| 8 | 3.291763 | 9.426474 | -7.688856 |
| 9 | 3.410108 | 9.085331 | -20.883321 |
| 10 | -0.339651 | 0.830762 | 36.057586 |
| 11 | -0.401302 | 0.366075 | -17.010790 |
| 12 | 2.904492 | 10.268695 | 3.732706 |
| 13 | -0.053634 | 2.847349 | -34.359309 |
| 14 | 3.778631 | 7.771830 | 29.311126 |
| 15 | 3.797665 | 7.693658 | 20.264761 |
| 16 | -2.148780 | -17.231225 | 9.019845 |
| 17 | 3.537900 | 8.672920 | 3.776920 |
| 18 | -1.857388 | -13.702864 | 22.194729 |
| 19 | -3.965154 | -44.584928 | -26.446755 |
| 20 | 3.054540 | 9.992209 | -0.424804 |
| 21 | 4.022897 | 6.691648 | 16.108043 |
| 22 | 3.440408 | 8.991681 | 20.950999 |
| 23 | 4.209701 | 5.752841 | 36.118528 |
| 24 | -1.268537 | -7.298356 | 3.504744 |
| 25 | 4.432180 | 4.507262 | 18.268370 |
| 26 | -0.156168 | 2.150761 | -16.623551 |
| 27 | 1.487581 | 9.771221 | -39.149535 |
| 28 | 3.103047 | 9.889345 | 42.404592 |
| 29 | 3.067461 | 9.965452 | 1.500292 |
| 30 | -1.504739 | -9.750734 | -6.590996 |
| 31 | 1.480935 | 9.755641 | -31.225299 |
| 32 | 0.963425 | 8.162801 | 19.517357 |
| 33 | -3.096909 | -30.357099 | -11.146812 |
| 34 | -0.239334 | 1.564135 | 28.001757 |
| 35 | -2.434673 | -20.924066 | -8.356333 |
| 36 | -4.541341 | -55.192009 | -69.038385 |
| 37 | 4.852255 | 1.777528 | 5.054251 |
| 38 | -4.991041 | -64.116463 | -83.757761 |
| 39 | -4.044227 | -45.985550 | -22.373779 |
| 40 | 1.977772 | 10.579303 | -14.424426 |
| 41 | 4.350993 | 4.977858 | -6.815410 |
| 42 | -1.783182 | -12.842319 | 0.088739 |
| 43 | 2.594333 | 10.640376 | 55.229200 |
| 44 | -3.364536 | -34.517620 | -62.697253 |
| 45 | -1.568917 | -10.444058 | 48.087066 |
| 46 | 0.794089 | 7.478772 | -14.120384 |
| 47 | 1.397830 | 9.550395 | 10.711448 |
| 48 | -4.347525 | -51.520279 | -9.869421 |
| 49 | -3.376635 | -34.710450 | -19.330468 |
Plot x vs y_true in one Altair chart, and x vs y in another chart. Save these charts using the names chart_true and chart_data, respectively. For chart_true, use alt.value('black') and plot the data using a black line. For chart_data, plot the data using disks (in the default blue color).
chart_data = alt.Chart(df).mark_circle().encode(
    x = "x",
    y = "y"
)
chart_true = alt.Chart(df).mark_line().encode(
    x = "x",
    y = "y_true",
    color = alt.value("black"),
)
chart_data + chart_true
Include powers of x in df¶
Put more columns in df, corresponding to the powers of the x-column, up to and including degree 20.
for i in range(1, 21):
    df["x" + str(i)] = df["x"]**i
Define a function performing the polynomial regression¶
Define a function poly_reg which takes as input df and a degree d, performs polynomial regression of that degree, and returns the fitted LinearRegression object.
def poly_reg(df, d):
    reg = LinearRegression()
    # Use the power columns x1, x2, ..., xd as the input features
    X = df[[f"x{i}" for i in range(1, d+1)]]
    reg.fit(X, df["y"])
    return reg
reg = poly_reg(df,2)
reg.predict(df[["x1","x2"]])
array([-52.92876935, 8.89299986, 12.28654622, 12.59460986,
-33.92129846, -4.45073566, 12.69513709, -14.78387552,
11.97584418, 11.70952355, 3.47883237, 3.0558604 ,
12.60765109, 5.3193789 , 10.6605696 , 10.59735911,
-12.8030892 , 11.38343901, -9.63921282, -37.21153319,
12.40642995, 9.78205973, 11.63582219, 9.01164325,
-3.8801097 , 7.98262563, 4.68259133, 11.78988806,
12.32959117, 12.38652512, -6.08824952, 11.77446983,
10.2419052 , -24.53633336, 4.14718491, -16.10929915,
-46.64189545, 5.70926163, -54.56681816, -38.45758212,
12.62889082, 8.3721771 , -8.86671555, 12.84885467,
-28.24654297, -6.71182115, 9.59805182, 11.57255428,
-43.37901892, -28.41841962])
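As a quick sanity check (an aside, not part of the exercise), the degree-2 fit's coefficients should land reasonably close to the true values \(c_1 = 6.5\) and \(c_2 = -1.4\), and its intercept close to \(c_0 = 3.2\), up to the noise.
# Fitted intercept and coefficients vs. the true values 3.2, 6.5, -1.4
reg.intercept_, reg.coef_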
[f"x{i}" for i in range(1,5)]
['x1', 'x2', 'x3', 'x4']
df[[f"x{i}" for i in range(1,5)]]
|   | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| 0 | -4.900352 | 24.013449 | -117.674352 | 576.645739 |
| 1 | 4.236751 | 17.950062 | 76.049952 | 322.204742 |
| 2 | 3.128746 | 9.789053 | 30.627463 | 95.825559 |
| 3 | 2.915629 | 8.500890 | 24.785437 | 72.265127 |
| 4 | -3.751502 | 14.073767 | -52.797766 | 198.070927 |
| 5 | -1.330997 | 1.771554 | -2.357934 | 3.138404 |
| 6 | 2.041902 | 4.169362 | 8.513426 | 17.383577 |
| 7 | -2.322127 | 5.392274 | -12.521544 | 29.076614 |
| 8 | 3.291763 | 10.835703 | 35.668567 | 117.412469 |
| 9 | 3.410108 | 11.628835 | 39.655581 | 135.229804 |
| 10 | -0.339651 | 0.115363 | -0.039183 | 0.013309 |
| 11 | -0.401302 | 0.161043 | -0.064627 | 0.025935 |
| 12 | 2.904492 | 8.436071 | 24.502499 | 71.167302 |
| 13 | -0.053634 | 0.002877 | -0.000154 | 0.000008 |
| 14 | 3.778631 | 14.278049 | 53.951473 | 203.862687 |
| 15 | 3.797665 | 14.422262 | 54.770923 | 208.001634 |
| 16 | -2.148780 | 4.617255 | -9.921464 | 21.319040 |
| 17 | 3.537900 | 12.516734 | 44.282947 | 156.668620 |
| 18 | -1.857388 | 3.449889 | -6.407781 | 11.901733 |
| 19 | -3.965154 | 15.722447 | -62.341925 | 247.195341 |
| 20 | 3.054540 | 9.330217 | 28.499526 | 87.052955 |
| 21 | 4.022897 | 16.183704 | 65.105381 | 261.912270 |
| 22 | 3.440408 | 11.836410 | 40.722082 | 140.100592 |
| 23 | 4.209701 | 17.721582 | 74.602559 | 314.054459 |
| 24 | -1.268537 | 1.609187 | -2.041314 | 2.589484 |
| 25 | 4.432180 | 19.644221 | 87.066727 | 385.895419 |
| 26 | -0.156168 | 0.024389 | -0.003809 | 0.000595 |
| 27 | 1.487581 | 2.212898 | 3.291865 | 4.896916 |
| 28 | 3.103047 | 9.628899 | 29.878925 | 92.715703 |
| 29 | 3.067461 | 9.409318 | 28.862717 | 88.535265 |
| 30 | -1.504739 | 2.264238 | -3.407086 | 5.126774 |
| 31 | 1.480935 | 2.193168 | 3.247938 | 4.809984 |
| 32 | 0.963425 | 0.928189 | 0.894241 | 0.861534 |
| 33 | -3.096909 | 9.590848 | -29.701988 | 91.984368 |
| 34 | -0.239334 | 0.057281 | -0.013709 | 0.003281 |
| 35 | -2.434673 | 5.927635 | -14.431855 | 35.136854 |
| 36 | -4.541341 | 20.623780 | -93.659619 | 425.340282 |
| 37 | 4.852255 | 23.544378 | 114.243325 | 554.337741 |
| 38 | -4.991041 | 24.910495 | -124.329315 | 620.532771 |
| 39 | -4.044227 | 16.355769 | -66.146437 | 267.511183 |
| 40 | 1.977772 | 3.911584 | 7.736223 | 15.300488 |
| 41 | 4.350993 | 18.931140 | 82.369259 | 358.388070 |
| 42 | -1.783182 | 3.179739 | -5.670054 | 10.110739 |
| 43 | 2.594333 | 6.730562 | 17.461317 | 45.300467 |
| 44 | -3.364536 | 11.320099 | -38.086877 | 128.144653 |
| 45 | -1.568917 | 2.461500 | -3.861888 | 6.058980 |
| 46 | 0.794089 | 0.630578 | 0.500735 | 0.397628 |
| 47 | 1.397830 | 1.953929 | 2.731260 | 3.817837 |
| 48 | -4.347525 | 18.900975 | -82.172465 | 357.246857 |
| 49 | -3.376635 | 11.401661 | -38.499243 | 129.997875 |
Define a function plotting the fitted polynomial¶
Define a function make_chart which takes as input df and a degree d, and returns an Altair chart showing the fitted polynomial. Use a red line for the polynomial. Use .copy() at the beginning of the function so you are not changing the original DataFrame. Give the chart a title using .properties(title = ...).
def make_chart(df, d):
    # Work on a copy so the original DataFrame is not modified
    df2 = df.copy()
    reg = poly_reg(df2, d)
    X = df2[[f"x{i}" for i in range(1, d+1)]]
    df2["y_pred"] = reg.predict(X)
    chart = alt.Chart(df2).mark_line().encode(
        x = "x1",
        y = "y_pred",
        color = alt.value("red"),
    ).properties(title = f"Degree {d}")  # any descriptive title works
    return chart
make_chart(df,20) + chart_true + chart_data
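As a numeric aside (not part of the original exercise), one way to quantify the over-fitting is to compare each fitted curve to the noise-free y_true values; the degree-20 curve follows the noisy y more closely, but typically ends up farther from y_true than the degree-2 curve.
from sklearn.metrics import mean_squared_error
# Mean squared error between the noise-free values and each fit (illustrative check)
for d in (2, 20):
    reg_d = poly_reg(df, d)
    X_d = df[[f"x{i}" for i in range(1, d+1)]]
    print(d, mean_squared_error(df["y_true"], reg_d.predict(X_d)))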
Put the charts together¶
Make a length 20 list chart_list = [make_chart(df,1) + chart_true + chart_data, make_chart(df,2) + chart_true + chart_data, ..., make_chart(df,20) + chart_true + chart_data]. Use alt.vconcat to display all of these charts.
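Here is a minimal sketch of that step, using a list comprehension and unpacking the list into alt.vconcat.
chart_list = [make_chart(df, d) + chart_true + chart_data for d in range(1, 21)]
alt.vconcat(*chart_list)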