Week 6 Wednesday#

Announcements#

  • Midterm scores were high (mean 86%!) so there won’t be any curving.

  • Friday office hours are cancelled this week; I will hold office hours on Monday before lecture instead.

  • Guest lecture by Prof. Peijie Zhou on Friday. Professor Zhou will introduce our first example of a machine learning algorithm for classification: logistic regression. (A confusing point: even though “regression” is in the name, logistic regression is an algorithm for classification, not regression.)

import numpy as np
import pandas as pd
import altair as alt

More on polynomial regression#

Let’s go more slowly through the material from the end of class on Monday.

Generating random data#

Here we make some data that follows a random polynomial. Can we use scikit-learn to estimate the underlying polynomial?

Here are some comments about the code:

  • It’s written so that if you change deg to another integer, the rest should work the same.

  • The “y_true” column values follow a degree 3 polynomial exactly.

  • The “y” column values are obtained by adding random noise to the “y_true” values.

  • We use two different size keyword arguments, one for getting the coefficients, and one for getting a different random value for each row in the DataFrame.

  • It’s better to use normally distributed random values, rather than uniformly distributed values in [0,1], so that the data points are not all within a band of width 1 from the true polynomial.

  • In general in Python, if you find yourself writing range(len(???)), you’re probably not writing your code in a “Pythonic” way. We will see an elegant way to replace range(len(???)) below.

deg = 3
rng = np.random.default_rng(seed=27)
m = rng.integers(low=-5, high=5, size=deg+1)
print(m)
df = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})
df["y_true"] = 0
for i in range(len(m)):
    df["y_true"] += m[i]*df["x"]**i

df["y"] = df["y_true"] + rng.normal(scale=5, size=len(df))
[-5  1 -3 -2]

At the end of that process, here is how df looks.

df
x y_true y
0 -3.000000e+00 19.000 23.824406
1 -2.900000e+00 15.648 10.237108
2 -2.800000e+00 12.584 16.919087
3 -2.700000e+00 9.796 8.955196
4 -2.600000e+00 7.272 6.323695
5 -2.500000e+00 5.000 10.602832
6 -2.400000e+00 2.968 0.784105
7 -2.300000e+00 1.164 -5.234227
8 -2.200000e+00 -0.424 -2.771499
9 -2.100000e+00 -1.808 -7.792136
10 -2.000000e+00 -3.000 -12.199286
11 -1.900000e+00 -4.012 -4.739785
12 -1.800000e+00 -4.856 -2.864605
13 -1.700000e+00 -5.544 -16.354306
14 -1.600000e+00 -6.088 -6.015613
15 -1.500000e+00 -6.500 -5.224009
16 -1.400000e+00 -6.792 -5.926045
17 -1.300000e+00 -6.976 -13.326468
18 -1.200000e+00 -7.064 -8.618807
19 -1.100000e+00 -7.068 -7.999078
20 -1.000000e+00 -7.000 -4.835120
21 -9.000000e-01 -6.872 -13.308757
22 -8.000000e-01 -6.696 -4.778936
23 -7.000000e-01 -6.484 -1.524600
24 -6.000000e-01 -6.248 -17.686227
25 -5.000000e-01 -6.000 -4.880304
26 -4.000000e-01 -5.752 -13.385067
27 -3.000000e-01 -5.516 -6.368335
28 -2.000000e-01 -5.304 -8.704742
29 -1.000000e-01 -5.128 1.240175
30 2.664535e-15 -5.000 -4.714457
31 1.000000e-01 -4.932 -0.955557
32 2.000000e-01 -4.936 -10.169895
33 3.000000e-01 -5.024 -2.721364
34 4.000000e-01 -5.208 1.903864
35 5.000000e-01 -5.500 -9.036285
36 6.000000e-01 -5.912 -2.582631
37 7.000000e-01 -6.456 -4.049151
38 8.000000e-01 -7.144 -7.467370
39 9.000000e-01 -7.988 -14.245467
40 1.000000e+00 -9.000 -7.584490
41 1.100000e+00 -10.192 -11.543440
42 1.200000e+00 -11.576 -8.565901
43 1.300000e+00 -13.164 -9.269588
44 1.400000e+00 -14.968 -7.295300
45 1.500000e+00 -17.000 -21.182783
46 1.600000e+00 -19.272 -14.946791
47 1.700000e+00 -21.796 -18.291899
48 1.800000e+00 -24.584 -24.206884
49 1.900000e+00 -27.648 -24.589037

Aside: If you are using range(len(???)) in Python, there is almost always a more elegant way to accomplish the same thing.

  • Rewrite the code above using enumerate(m) instead of range(len(m)).

Recall that m holds the four randomly chosen coefficients for our true polynomial. Why couldn’t we use just for c in m: above? Because we needed to know both the value in m and its index. For example, we needed to know that -3 corresponded to the x**2 column (m[2] is -3).

This is such a common pattern in Python that a built-in function, enumerate, is provided to accomplish it. When we iterate through enumerate(m), pairs are returned: the index and the value. For example, in our case m = [-5,  1, -3, -2], so the first pair returned is (0, -5), the next is (1, 1), the next is (2, -3), and the last is (3, -2). In the loop below, we assign these pairs to i and c, respectively.

If enumerate is confusing, don’t focus too much on it. It’s a Python topic rather than a Data Science or Machine Learning topic, but I wanted to show it because it leads to more elegant code in this case.
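
Here is a minimal illustration (not a cell from lecture) of the pairs enumerate produces for our array m.

# each iteration yields the index together with the corresponding coefficient
for i, c in enumerate(m):
    print(i, c)
# 0 -5
# 1 1
# 2 -3
# 3 -2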

deg = 3
rng = np.random.default_rng(seed=27)
m = rng.integers(low=-5, high=5, size=deg+1)
print(m)
df = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})
df["y_true"] = 0
for i,c in enumerate(m):
    df["y_true"] += c*df["x"]**i

df["y"] = df["y_true"] + rng.normal(scale=5, size=len(df))
[-5  1 -3 -2]

Here we check that the resulting DataFrame looks the same.

df
x y_true y
0 -3.000000e+00 19.000 23.824406
1 -2.900000e+00 15.648 10.237108
2 -2.800000e+00 12.584 16.919087
3 -2.700000e+00 9.796 8.955196
4 -2.600000e+00 7.272 6.323695
5 -2.500000e+00 5.000 10.602832
6 -2.400000e+00 2.968 0.784105
7 -2.300000e+00 1.164 -5.234227
8 -2.200000e+00 -0.424 -2.771499
9 -2.100000e+00 -1.808 -7.792136
10 -2.000000e+00 -3.000 -12.199286
11 -1.900000e+00 -4.012 -4.739785
12 -1.800000e+00 -4.856 -2.864605
13 -1.700000e+00 -5.544 -16.354306
14 -1.600000e+00 -6.088 -6.015613
15 -1.500000e+00 -6.500 -5.224009
16 -1.400000e+00 -6.792 -5.926045
17 -1.300000e+00 -6.976 -13.326468
18 -1.200000e+00 -7.064 -8.618807
19 -1.100000e+00 -7.068 -7.999078
20 -1.000000e+00 -7.000 -4.835120
21 -9.000000e-01 -6.872 -13.308757
22 -8.000000e-01 -6.696 -4.778936
23 -7.000000e-01 -6.484 -1.524600
24 -6.000000e-01 -6.248 -17.686227
25 -5.000000e-01 -6.000 -4.880304
26 -4.000000e-01 -5.752 -13.385067
27 -3.000000e-01 -5.516 -6.368335
28 -2.000000e-01 -5.304 -8.704742
29 -1.000000e-01 -5.128 1.240175
30 2.664535e-15 -5.000 -4.714457
31 1.000000e-01 -4.932 -0.955557
32 2.000000e-01 -4.936 -10.169895
33 3.000000e-01 -5.024 -2.721364
34 4.000000e-01 -5.208 1.903864
35 5.000000e-01 -5.500 -9.036285
36 6.000000e-01 -5.912 -2.582631
37 7.000000e-01 -6.456 -4.049151
38 8.000000e-01 -7.144 -7.467370
39 9.000000e-01 -7.988 -14.245467
40 1.000000e+00 -9.000 -7.584490
41 1.100000e+00 -10.192 -11.543440
42 1.200000e+00 -11.576 -8.565901
43 1.300000e+00 -13.164 -9.269588
44 1.400000e+00 -14.968 -7.295300
45 1.500000e+00 -17.000 -21.182783
46 1.600000e+00 -19.272 -14.946791
47 1.700000e+00 -21.796 -18.291899
48 1.800000e+00 -24.584 -24.206884
49 1.900000e+00 -27.648 -24.589037
  • Here is what the data looks like.

Based on the values in m above, we know these points are approximately following the curve \(y = -2x^3 - 3x^2 + x - 5\). For example, because the leading coefficient is negative, we know the outputs should be getting more negative as x increases, which seems to match what we see in the plotted data.
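
As a quick sanity check (a sketch, not a cell we ran in class), we can verify that the “y_true” column really does follow this polynomial, using np.polyval, which expects the coefficients ordered from the highest power down to the constant term.

# np.polyval wants [c3, c2, c1, c0], so reverse m before evaluating
print(np.allclose(np.polyval(m[::-1], df["x"]), df["y_true"]))  # True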

c1 = alt.Chart(df).mark_circle().encode(
    x="x",
    y="y"
)

c1

Polynomial regression using PolynomialFeatures#

We saw how to perform polynomial regression “by hand” last week. The process is much easier if we take advantage of some additional functionality in scikit-learn.

  • Using PolynomialFeatures from sklearn.preprocessing, make a new pandas DataFrame df_pow containing df["x"]**i for i = 1, 2, 3.

  • Use the include_bias keyword argument so we do not get a column for \(x^0\). (I forgot to do this on Monday.)

Here we import PolynomialFeatures.

from sklearn.preprocessing import PolynomialFeatures

Here we instantiate it. Think of this step the same way you think about calling LinearRegression() or np.random.default_rng(). We need to pass a degree argument to the constructor so that scikit-learn knows how many powers we want. (It would have been more robust to use degree=deg here; see the sketch below.)
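
For the record, the more robust version would look like the following sketch (the cell we actually ran uses degree=3 directly).

# reuse the deg variable from the data-generating code above
poly = PolynomialFeatures(degree=deg)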

poly = PolynomialFeatures(degree=3)

Now we fit this PolynomialFeatures object to the data. Notice how we do not pass df["y"] to this function. That is because there are no “true” outputs in this step (this is just a preprocessing step).

poly.fit(df[["x"]])
PolynomialFeatures(degree=3)

We now call poly.transform instead of poly.predict, because we are not “predicting” anything. We convert the resulting NumPy array into a pandas DataFrame and round all the entries to one decimal place.

Aside: this rounding could also be done using applymap and a lambda function. This is a natural place to use applymap (as opposed to apply or map), because we are doing the same thing to every element in a pandas DataFrame.

Notice how these values correspond to powers of the “x” column, including the zero-th power (all 1s).

# .round(1) is same as .applymap(lambda z: round(z,1))
pd.DataFrame(poly.transform(df[["x"]])).round(1)
0 1 2 3
0 1.0 -3.0 9.0 -27.0
1 1.0 -2.9 8.4 -24.4
2 1.0 -2.8 7.8 -22.0
3 1.0 -2.7 7.3 -19.7
4 1.0 -2.6 6.8 -17.6
5 1.0 -2.5 6.2 -15.6
6 1.0 -2.4 5.8 -13.8
7 1.0 -2.3 5.3 -12.2
8 1.0 -2.2 4.8 -10.6
9 1.0 -2.1 4.4 -9.3
10 1.0 -2.0 4.0 -8.0
11 1.0 -1.9 3.6 -6.9
12 1.0 -1.8 3.2 -5.8
13 1.0 -1.7 2.9 -4.9
14 1.0 -1.6 2.6 -4.1
15 1.0 -1.5 2.2 -3.4
16 1.0 -1.4 2.0 -2.7
17 1.0 -1.3 1.7 -2.2
18 1.0 -1.2 1.4 -1.7
19 1.0 -1.1 1.2 -1.3
20 1.0 -1.0 1.0 -1.0
21 1.0 -0.9 0.8 -0.7
22 1.0 -0.8 0.6 -0.5
23 1.0 -0.7 0.5 -0.3
24 1.0 -0.6 0.4 -0.2
25 1.0 -0.5 0.2 -0.1
26 1.0 -0.4 0.2 -0.1
27 1.0 -0.3 0.1 -0.0
28 1.0 -0.2 0.0 -0.0
29 1.0 -0.1 0.0 -0.0
30 1.0 0.0 0.0 0.0
31 1.0 0.1 0.0 0.0
32 1.0 0.2 0.0 0.0
33 1.0 0.3 0.1 0.0
34 1.0 0.4 0.2 0.1
35 1.0 0.5 0.3 0.1
36 1.0 0.6 0.4 0.2
37 1.0 0.7 0.5 0.3
38 1.0 0.8 0.6 0.5
39 1.0 0.9 0.8 0.7
40 1.0 1.0 1.0 1.0
41 1.0 1.1 1.2 1.3
42 1.0 1.2 1.4 1.7
43 1.0 1.3 1.7 2.2
44 1.0 1.4 2.0 2.7
45 1.0 1.5 2.3 3.4
46 1.0 1.6 2.6 4.1
47 1.0 1.7 2.9 4.9
48 1.0 1.8 3.2 5.8
49 1.0 1.9 3.6 6.9

Let’s get rid of the column of all 1 values. We do this by setting include_bias=False when we instantiate the PolynomialFeatures object. That’s the only substantive change between the following and the above code, but we’ll combine the steps in a different way here, just to see an alternative approach. (I think this approach is probably the easier one to understand.)

poly = PolynomialFeatures(degree=3, include_bias=False)

We can perform both the fit and transform steps at once using fit_transform. This “do both at once” shortcut is available for transformers like PolynomialFeatures, but not for every type of scikit-learn object. (For example, a LinearRegression object has no fit_predict method.)

arr = poly.fit_transform(df[["x"]])

Here are the first three rows. Notice how there is no column of all values equal to 1.

arr[:3]
array([[ -3.   ,   9.   , -27.   ],
       [ -2.9  ,   8.41 , -24.389],
       [ -2.8  ,   7.84 , -21.952]])
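
If you want to convince yourself that calling fit and then transform gives the same result as fit_transform, here is a quick check (a sketch, not a cell from lecture).

# fit and transform in two separate steps, then compare with arr from above
poly_check = PolynomialFeatures(degree=3, include_bias=False)
poly_check.fit(df[["x"]])
two_step = poly_check.transform(df[["x"]])
print(np.allclose(two_step, arr))  # True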

Here we convert arr into a pandas DataFrame.

df_pow = pd.DataFrame(arr)

It seems to be working as expected. The column names (notice that pandas defaults to integer column labels) are not helpful at the moment.

df_pow.head(3)
0 1 2
0 -3.0 9.00 -27.000
1 -2.9 8.41 -24.389
2 -2.8 7.84 -21.952
  • Name the columns using the get_feature_names_out method of the PolynomialFeatures object.

df_pow.columns = poly.get_feature_names_out()

The fitted PolynomialFeatures object records appropriate column labels; here we retrieved them from its get_feature_names_out method.

df_pow.head(3)
x x^2 x^3
0 -3.0 9.00 -27.000
1 -2.9 8.41 -24.389
2 -2.8 7.84 -21.952
  • Concatenate the “y” and “y_true” columns from df onto the end of df_pow using pd.concat((???, ???), axis=???). Name the result df_both.

I believe this is our first time concatenating pandas DataFrames in Math 10 (meaning, we put two or more DataFrames side-by-side or on top of each other).

Notice how we use axis=1, because the column labels are changing but the row labels are staying the same.

Aside: if we used axis=0, pandas would instead stack the DataFrames on top of each other. Since df_pow and df[["y", "y_true"]] do not share any columns, the result would contain the union of the columns, with many missing values, which is not what we want here.
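
Here is a quick sketch (not something we ran in class) of what that axis=0 version would produce.

# stacking on top of each other: pandas takes the union of the columns
# and fills the missing entries with NaN
stacked = pd.concat((df_pow, df[["y", "y_true"]]), axis=0)
print(stacked.shape)         # (100, 5): 50 rows from each piece, 5 columns total
print(stacked.isna().sum())  # every column is missing in half of the rows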

df_both = pd.concat((df_pow, df[["y", "y_true"]]), axis=1)
df_both
x x^2 x^3 y y_true
0 -3.000000e+00 9.000000e+00 -2.700000e+01 23.824406 19.000
1 -2.900000e+00 8.410000e+00 -2.438900e+01 10.237108 15.648
2 -2.800000e+00 7.840000e+00 -2.195200e+01 16.919087 12.584
3 -2.700000e+00 7.290000e+00 -1.968300e+01 8.955196 9.796
4 -2.600000e+00 6.760000e+00 -1.757600e+01 6.323695 7.272
5 -2.500000e+00 6.250000e+00 -1.562500e+01 10.602832 5.000
6 -2.400000e+00 5.760000e+00 -1.382400e+01 0.784105 2.968
7 -2.300000e+00 5.290000e+00 -1.216700e+01 -5.234227 1.164
8 -2.200000e+00 4.840000e+00 -1.064800e+01 -2.771499 -0.424
9 -2.100000e+00 4.410000e+00 -9.261000e+00 -7.792136 -1.808
10 -2.000000e+00 4.000000e+00 -8.000000e+00 -12.199286 -3.000
11 -1.900000e+00 3.610000e+00 -6.859000e+00 -4.739785 -4.012
12 -1.800000e+00 3.240000e+00 -5.832000e+00 -2.864605 -4.856
13 -1.700000e+00 2.890000e+00 -4.913000e+00 -16.354306 -5.544
14 -1.600000e+00 2.560000e+00 -4.096000e+00 -6.015613 -6.088
15 -1.500000e+00 2.250000e+00 -3.375000e+00 -5.224009 -6.500
16 -1.400000e+00 1.960000e+00 -2.744000e+00 -5.926045 -6.792
17 -1.300000e+00 1.690000e+00 -2.197000e+00 -13.326468 -6.976
18 -1.200000e+00 1.440000e+00 -1.728000e+00 -8.618807 -7.064
19 -1.100000e+00 1.210000e+00 -1.331000e+00 -7.999078 -7.068
20 -1.000000e+00 1.000000e+00 -1.000000e+00 -4.835120 -7.000
21 -9.000000e-01 8.100000e-01 -7.290000e-01 -13.308757 -6.872
22 -8.000000e-01 6.400000e-01 -5.120000e-01 -4.778936 -6.696
23 -7.000000e-01 4.900000e-01 -3.430000e-01 -1.524600 -6.484
24 -6.000000e-01 3.600000e-01 -2.160000e-01 -17.686227 -6.248
25 -5.000000e-01 2.500000e-01 -1.250000e-01 -4.880304 -6.000
26 -4.000000e-01 1.600000e-01 -6.400000e-02 -13.385067 -5.752
27 -3.000000e-01 9.000000e-02 -2.700000e-02 -6.368335 -5.516
28 -2.000000e-01 4.000000e-02 -8.000000e-03 -8.704742 -5.304
29 -1.000000e-01 1.000000e-02 -1.000000e-03 1.240175 -5.128
30 2.664535e-15 7.099748e-30 1.891753e-44 -4.714457 -5.000
31 1.000000e-01 1.000000e-02 1.000000e-03 -0.955557 -4.932
32 2.000000e-01 4.000000e-02 8.000000e-03 -10.169895 -4.936
33 3.000000e-01 9.000000e-02 2.700000e-02 -2.721364 -5.024
34 4.000000e-01 1.600000e-01 6.400000e-02 1.903864 -5.208
35 5.000000e-01 2.500000e-01 1.250000e-01 -9.036285 -5.500
36 6.000000e-01 3.600000e-01 2.160000e-01 -2.582631 -5.912
37 7.000000e-01 4.900000e-01 3.430000e-01 -4.049151 -6.456
38 8.000000e-01 6.400000e-01 5.120000e-01 -7.467370 -7.144
39 9.000000e-01 8.100000e-01 7.290000e-01 -14.245467 -7.988
40 1.000000e+00 1.000000e+00 1.000000e+00 -7.584490 -9.000
41 1.100000e+00 1.210000e+00 1.331000e+00 -11.543440 -10.192
42 1.200000e+00 1.440000e+00 1.728000e+00 -8.565901 -11.576
43 1.300000e+00 1.690000e+00 2.197000e+00 -9.269588 -13.164
44 1.400000e+00 1.960000e+00 2.744000e+00 -7.295300 -14.968
45 1.500000e+00 2.250000e+00 3.375000e+00 -21.182783 -17.000
46 1.600000e+00 2.560000e+00 4.096000e+00 -14.946791 -19.272
47 1.700000e+00 2.890000e+00 4.913000e+00 -18.291899 -21.796
48 1.800000e+00 3.240000e+00 5.832000e+00 -24.206884 -24.584
49 1.900000e+00 3.610000e+00 6.859000e+00 -24.589037 -27.648
  • Find the “best” coefficient values for modeling \(y \approx c_3 x^3 + c_2 x^2 + c_1 x + c_0\).

We’ll go faster through this part, because it is the same process we’ve been using throughout the linear regression part of Math 10.

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(df_both[["x", "x^2", "x^3"]], df_both["y"])
LinearRegression()
reg.coef_
array([ 3.33524206, -3.03442169, -2.32952623])
reg.intercept_
-5.263954801170672
  • How do these values compare to the true coefficient values?

The true values follow the polynomial \(y = -2x^3 - 3x^2 + x - 5\). In our case, we have found approximately \(-2.3 x^3 - 3x^2 + 3.3 x - 5.26\). These two sequences of coefficients are remarkably similar.

Here I just looked at the coefficients and checked which terms they corresponded to. For example, the first coefficient was approximately 3.3, and it corresponded to the “x” column. We could make this more robust (and avoid typing ["x", "x^2", "x^3"] by hand), but we won’t bother, because we will see a more efficient way to do all of these steps below, using Pipeline.
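
That said, here is one quick way to pair each fitted coefficient with its column name (a sketch, not a cell from lecture).

# zip the column names we used for fitting with the corresponding coefficients
cols = ["x", "x^2", "x^3"]
print(dict(zip(cols, reg.coef_)))
print("intercept:", reg.intercept_)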

Using Pipeline to combine multiple steps#

The above process is a little awkward. We can achieve the same thing much more efficiently by using another data type defined by scikit-learn, Pipeline. (The tradeoff is that it is less explicit what is happening.)

  • Import the Pipeline class from sklearn.pipeline.

from sklearn.pipeline import Pipeline
  • Make an instance of this Pipeline class. Pass to the constructor a list of length-2 tuples, where each tuple provides a name for the step (as a string) and the constructor (like PolynomialFeatures(???)).

The following is not correct, but understanding what we’re doing here will help you understand the correct approach. Here we are trying to pass a list of steps for scikit-learn to follow: first use PolynomialFeatures to transform the data, then use LinearRegression. We are passing the Pipeline constructor a length-2 list containing these steps.

# Wrong
pipe = Pipeline(
    [PolynomialFeatures(degree=3, include_bias=False), LinearRegression()]
)

The correct approach is very similar (the spacing below is just to make the code more readable; you can use essentially any spacing you like inside parentheses in Python). We are still passing a length-2 list containing the steps, but each element of that list is itself a length-2 tuple, giving a name for the step along with the step itself.

# Right
pipe = Pipeline(
    [
        ("poly", PolynomialFeatures(degree=3, include_bias=False)), 
        ("reg", LinearRegression())
    ]
)
  • Fit this object to the data.

This is where we really benefit from Pipeline. The following call of pipe.fit first fits and transforms the data using PolynomialFeatures, and then fits that transformed data using LinearRegression.

pipe.fit(df[["x"]], df["y"])
Pipeline(steps=[('poly', PolynomialFeatures(degree=3, include_bias=False)),
                ('reg', LinearRegression())])
  • Do the coefficients match what we found above? Use the named_steps attribute, or just use the name directly.

You might try accessing pipe.coef_, but that raises an error. It’s not the Pipeline object itself that has the fitted coefficients, but the LinearRegression object within it.

pipe.coef_
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [36], line 1
----> 1 pipe.coef_

AttributeError: 'Pipeline' object has no attribute 'coef_'

We can access that LinearRegression object by using indexing with the name "reg" we assigned to it when we instantiated pipe.

pipe["reg"]
LinearRegression()

Notice that this is indeed a LinearRegression object.

type(pipe["reg"])
sklearn.linear_model._base.LinearRegression

The same works for the PolynomialFeatures object; again, we use the name "poly" we assigned above.

type(pipe["poly"])
sklearn.preprocessing._polynomial.PolynomialFeatures

This information is also recorded in a Python dictionary stored in the named_steps attribute of our Pipeline object.

pipe.named_steps
{'poly': PolynomialFeatures(degree=3, include_bias=False),
 'reg': LinearRegression()}

The point of all that is, now that we know how to access the LinearRegression object, we can get its coef_ attribute just like usual when performing linear regression. (Remember that this attribute only exists after we call fit.)

Notice that these are the exact same values as what we found above. It’s worth looking over both procedures and noticing how much shorter this procedure using Pipeline is.

I also want to emphasize that these aren’t just approximately the same as the values above; they are exactly the same (up to possible numerical precision issues).

pipe["reg"].coef_
array([ 3.33524206, -3.03442169, -2.32952623])
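
As a quick check (a sketch, not a cell from lecture), we can confirm programmatically that the Pipeline found the same coefficients and intercept as the step-by-step approach.

# compare the Pipeline's fitted LinearRegression with the earlier reg object
print(np.allclose(pipe["reg"].coef_, reg.coef_))            # True
print(np.allclose(pipe["reg"].intercept_, reg.intercept_))  # True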
  • Call the predict method, and add the resulting values to a new column in df named “pred”.

The following simple code evaluates our “best fit” degree three polynomial \(-2.3 x^3 - 3x^2 + 3.3 x - 5.26\) for every value in the “x” column. Notice how we don’t need to explicitly type "x^2" or anything like that; the polynomial part of this polynomial regression happens “under the hood”.

df["pred"] = pipe.predict(df[["x"]])
  • Plot the resulting predictions using a red line. Name the chart c2.

What’s wrong with the following? We are using the “y” values (the ones plotted in the scatter plot above). Notice how these points do not lie perfectly on a cubic polynomial (the curve goes up and down far too many times); that is because of the random noise we added.

# Wrong

alt.Chart(df).mark_line(color="red").encode(
    x="x",
    y="y"
)

This one does lie perfectly on a cubic polynomial; more specifically, that cubic polynomial is approximately \(-2.3 x^3 - 3x^2 + 3.3 x - 5.26\). This is our cubic polynomial of “best fit” (meaning the Mean Squared Error between the data and this polynomial is minimized). For the given data, measured by Mean Squared Error, this polynomial fits the data “better” than the true underlying polynomial \(-2x^3 - 3x^2 + x - 5\) does; see the check just below.
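
Here is a sketch of how you could verify that claim, using mean_squared_error from sklearn.metrics (we did not run this in class).

from sklearn.metrics import mean_squared_error

# MSE between the noisy data and our fitted cubic polynomial
mse_fit = mean_squared_error(df["y"], df["pred"])
# MSE between the noisy data and the true underlying polynomial
mse_true = mean_squared_error(df["y"], df["y_true"])
# least squares minimizes this quantity over all cubics, so mse_fit <= mse_true
print(mse_fit, mse_true)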

c2 = alt.Chart(df).mark_line(color="red").encode(
    x="x",
    y="pred"
)

c2
  • Plot the true values using a dashed black line, using strokeDash=[10,5] as an argument to mark_line. Name the chart c3.

Don’t focus too much on the strokeDash=[10,5] part; I just wanted to show you an example of an option that exists. Here each dash is drawn as 10 black pixels followed by a gap of 5 pixels.

This curve represents the true underlying polynomial that we used to generate the data (before adding the random noise to it).

c3 = alt.Chart(df).mark_line(color="black", strokeDash=[10,5]).encode(
    x="x",
    y="y_true"
)

c3
  • Layer these plots on top of the above scatter plot c1.

Notice how similar our two polynomial curves are. If we had used more data points or a smaller standard deviation for our random noise, we would expect the curves to be even closer to each other; see the sketch after the chart.

c1+c2+c3
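
Here is a sketch (not run in class) of that experiment: regenerate the same random polynomial data with a smaller noise level (scale=1 instead of scale=5) and refit a fresh Pipeline. The estimated coefficients should land even closer to the true values in m.

# same random polynomial as above, but with less noise added
rng2 = np.random.default_rng(seed=27)
m2 = rng2.integers(low=-5, high=5, size=deg+1)  # same coefficients as m
df2 = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})
df2["y_true"] = 0
for i, c in enumerate(m2):
    df2["y_true"] += c*df2["x"]**i
df2["y"] = df2["y_true"] + rng2.normal(scale=1, size=len(df2))

# fit a fresh Pipeline to the less noisy data
pipe2 = Pipeline(
    [
        ("poly", PolynomialFeatures(degree=3, include_bias=False)),
        ("reg", LinearRegression())
    ]
)
pipe2.fit(df2[["x"]], df2["y"])
print(pipe2["reg"].coef_, pipe2["reg"].intercept_)  # should be close to 1, -3, -2 and -5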

Extra time#

I doubt there will be extra time, but if there is, we will introduce train_test_split from sklearn.model_selection.