# Over-fitting and polynomial regression
We will use the following quantities throughout.
* `m = 50` (number of data points)
* `x`-values randomly distributed in the interval $[-5,5]$
* `y_true` values given by $c_2 x^2 + c_1 x + c_0$, where $c_2 = -1.4$, $c_1 = 6.5$, and $c_0 = 3.2$
* `y` values given by `y_true` plus normally distributed random values with a mean of 0 and a standard deviation of 30.
* `d`: the degree we use for the polynomial regression.  Minimum allowed value: 1.  Maximum allowed value: 20.
* `df`: a pandas DataFrame holding all of our data.

In [1]:
import numpy as np
import pandas as pd
import altair as alt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng()
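
Because `default_rng()` is called without a seed, the random data (and every number printed below) changes on each run. A minimal sketch of a reproducible variant, assuming an arbitrary seed of 0 that is not part of the original notebook:

```python
import numpy as np

# The seed value 0 is an arbitrary assumption; any fixed integer works.
rng_a = np.random.default_rng(seed=0)
rng_b = np.random.default_rng(seed=0)

# Two generators built from the same seed produce identical draws,
# here rescaled from [0, 1) to the interval [-5, 5).
sample_a = 10 * rng_a.random(size=5) - 5
sample_b = 10 * rng_b.random(size=5) - 5
same_draws = np.array_equal(sample_a, sample_b)
```

Passing a seed is optional; leaving it out, as above, gives fresh data each time the notebook runs.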

## Define and plot the initial data
Define `m`, `x`, `c`, `y_true`, `y`.  Make a pandas DataFrame `df` with columns for `x`, `y_true`, and `y`.

In [3]:
m = 50
x = 10*rng.random(size=m) - 5
# c2 = -1.4, c1 = 6.5, and c0 = 3.2
c = [3.2,6.5,-1.4]

In [6]:
y_true = c[0] + c[1]*x + c[2]*x**2

In [11]:
y = y_true + rng.normal(loc = 0, scale = 30, size = m)

In [12]:
df = pd.DataFrame({"x":x,"y_true":y_true, "y": y})

In [14]:
df

| | x | y_true | y |
|---|---|---|---|
| 0 | -3.62191 | -38.707934 | -28.824435 |
| 1 | -0.297771 | 1.140354 | -13.455134 |
| 2 | -2.778235 | -25.664549 | -60.245202 |
| 3 | 0.588938 | 6.542509 | -8.683601 |
| 4 | -1.722267 | -12.147423 | -49.293762 |
| 5 | -0.004124 | 3.173168 | 24.758263 |
| 6 | 0.733193 | 7.213155 | 59.070922 |
| 7 | 4.232086 | 5.633786 | -0.697422 |
| 8 | -3.533901 | -37.254201 | -53.411365 |
| 9 | -3.707973 | -40.150509 | -8.414555 |


Plot `x` vs `y_true` in one Altair chart, and `x` vs `y` in another chart.  Save these charts using the names `chart_true` and `chart_data`, respectively.  For `chart_true`, use `alt.value('black')` and plot the data using a black line.  For `chart_data`, plot the data using disks (in the default blue color).

In [16]:
chart_data = alt.Chart(df).mark_circle().encode(
    x = "x",
    y = "y"
)

In [17]:
chart_true = alt.Chart(df).mark_line().encode(
    x = "x",
    y = "y_true",
    color = alt.value("black"),
)

In [18]:
chart_data + chart_true

## Include powers of x in `df`
Add more columns to `df`, one for each power of the `x`-column, up to and including degree 20.

In [22]:
for i in range(1,21):
    df["x"+str(i)] = df["x"]**i
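
The same powers can also be computed in one step with NumPy. The sketch below is an alternative to the loop, not the notebook's method; it uses `np.vander` on a small made-up array rather than the notebook's data.

```python
import numpy as np

x = np.array([-2.0, 0.5, 3.0])  # small illustrative sample, not the notebook's data

# Column j of the increasing Vandermonde matrix holds x**j for j = 0, ..., 20.
powers = np.vander(x, N=21, increasing=True)

# Drop the constant column x**0 so that column i-1 matches df["x" + str(i)].
powers = powers[:, 1:]

# Reference values computed the same way as the loop above.
expected = np.column_stack([x**i for i in range(1, 21)])
```

The loop version has the advantage of giving each power a named column, which is what `poly_reg` relies on later.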

## Define a function performing the polynomial regression
Define a function `poly_reg` which takes as input `df` and a degree `d`, then performs polynomial regression using that degree, and as output returns the fit LinearRegression object.

In [35]:
def poly_reg(df, d):
    reg = LinearRegression()
    X = df[[f"x{i}" for i in range(1,d+1)]]
    reg.fit(X,df["y"])
    return reg

In [41]:
reg = poly_reg(df,2)

In [42]:
reg.predict(df[["x1","x2"]])

array([-32.05019435, -13.40758896, -26.17866244, -10.46866949,
       -19.92242374, -12.33945452, -10.07156262,  -7.38366638,
       -31.40147486, -32.69273649, -29.40236897, -32.43753575,
       -39.25618433, -16.04687097,  -7.83605554,  -7.8221931 ,
        -6.88180525, -33.34735702, -22.27685855,  -6.98875249,
        -6.79032091, -32.91358912, -12.39482689,  -7.8749104 ,
        -7.37394207, -23.22585323, -14.63429907, -36.26351765,
        -6.85255297,  -7.49814221,  -7.10973598, -37.73868477,
        -8.59666284, -34.16993409,  -6.79426144,  -6.85042453,
       -23.59932606,  -7.58667003, -19.79568968, -35.90546229,
       -27.57036253,  -9.71313098, -25.61289545, -10.3868084 ,
       -42.54487361, -24.86547186,  -9.01775371, -25.99094437,
       -11.26941155, -14.38025159])
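
One way to sanity-check a degree-2 fit is to compare the fitted coefficients with the true values $c_0$, $c_1$, $c_2$. The sketch below uses `np.linalg.lstsq` as a stand-in for `LinearRegression` (both solve an ordinary least-squares problem) and fits the noise-free `y_true` values so that the expected answer is known. The seed and the name `A` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, assumed for reproducibility
x = 10 * rng.random(size=50) - 5
c = [3.2, 6.5, -1.4]
y_true = c[0] + c[1] * x + c[2] * x**2

# Design matrix with columns 1, x, x**2; the column of ones plays the
# role of the intercept.
A = np.column_stack([np.ones_like(x), x, x**2])

# Least squares on the noise-free values recovers c up to rounding error.
coef, *_ = np.linalg.lstsq(A, y_true, rcond=None)
```

With the object returned by `poly_reg`, the corresponding values are stored in `reg.intercept_` and `reg.coef_`. Fitting the noisy `y` instead of `y_true` will generally give coefficients that differ from `c`, since the noise has a standard deviation of 30.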

In [27]:
[f"x{i}" for i in range(1,5)]

['x1', 'x2', 'x3', 'x4']

In [26]:
df[[f"x{i}" for i in range(1,5)]]

| | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| 0 | -3.62191 | 13.118229 | -47.51304 | 172.0879 |
| 1 | -0.297771 | 0.088668 | -0.02640264 | 0.007861942 |
| 2 | -2.778235 | 7.718588 | -21.44405 | 59.5766 |
| 3 | 0.588938 | 0.346848 | 0.2042717 | 0.1203033 |
| 4 | -1.722267 | 2.966205 | -5.108597 | 8.798369 |
| 5 | -0.004124 | 1.7e-05 | -7.015296e-08 | 2.893308e-10 |
| 6 | 0.733193 | 0.537573 | 0.3941446 | 0.2889843 |
| 7 | 4.232086 | 17.910552 | 75.79899 | 320.7879 |
| 8 | -3.533901 | 12.488459 | -44.13298 | 155.9616 |
| 9 | -3.707973 | 13.749062 | -50.98115 | 189.0367 |


## Define a function plotting the fit polynomial
Define a function `make_chart` which takes as input `df` and a degree `d`, and as output returns an Altair chart showing the fit polynomial.  Use a red line for the polynomial.  Use `.copy()` at the beginning of the function so you are not changing the original DataFrame.  Give the chart a title using `.properties(title = ...)`.

In [47]:
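def make_chart(df,d):
    # Work on a copy so the caller's DataFrame is left unchanged
    df2 = df.copy()
    reg = poly_reg(df2,d)
    X = df2[[f"x{i}" for i in range(1,d+1)]]
    df2["y_pred"] = reg.predict(X)
    # Red line for the fitted polynomial, titled with its degree.
    # "x" holds the same values as "x1" and matches the other charts' x-axis.
    chart = alt.Chart(df2).mark_line().encode(
        x = "x",
        y = "y_pred",
        color = alt.value("red"),
    ).properties(title = f"Degree {d}")
    return chart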

In [50]:
make_chart(df,20) + chart_true + chart_data
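
The degree-20 chart can be complemented with numbers. In exact arithmetic the error on the training data `y` cannot increase with the degree, because every higher-degree model contains the lower-degree ones, so a low training error alone says little about how well the curve matches `y_true`. The sketch below computes both errors for a few degrees. It swaps in `np.polynomial.Polynomial.fit` for `poly_reg` (it rescales `x` internally, which helps numerically at high degree), and the seed and the dictionary names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, assumed for reproducibility
m = 50
x = 10 * rng.random(size=m) - 5
y_true = 3.2 + 6.5 * x - 1.4 * x**2
y = y_true + rng.normal(loc=0, scale=30, size=m)

train_mse = {}  # mean squared error against the noisy data the model was fit to
true_mse = {}   # mean squared error against the underlying quadratic

for d in [1, 2, 5, 10, 20]:
    # Least-squares polynomial fit of degree d to the noisy data.
    poly = np.polynomial.Polynomial.fit(x, y, deg=d)
    y_pred = poly(x)
    train_mse[d] = np.mean((y - y_pred) ** 2)
    true_mse[d] = np.mean((y_true - y_pred) ** 2)
```

Printing `train_mse` next to `true_mse` shows whether the drop in training error at high degree is matched by a closer fit to the true quadratic.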

## Put the charts together
Make a length 20 list `chart_list = [make_chart(df,1) + chart_true + chart_data, make_chart(df,2) + chart_true + chart_data, ..., make_chart(df,20) + chart_true + chart_data]`.  Use `alt.vconcat` to display all of these charts.