Week 6 Monday#

Announcements#

  • Quiz tomorrow is based on Worksheets 9-10.

  • I expect the midterms to be returned Tuesday in discussion. If you see any clear mistakes in the grading, leave the exam with Jinghao and write a note to me on or with the midterm.

  • I’m going to try the same schedule as Friday:

A. 15-20 minutes lecture (performance measures for regression)

B. 10-15 minutes time for worksheets (two worksheets due tomorrow, a new worksheet posted today). Maya and I are here to answer questions.

C. 15-20 minutes lecture (polynomial regression, continuation from Friday)

import numpy as np
import pandas as pd
import altair as alt

Performance measures for regression#

Here is a dataset with only 4 data points.

df = pd.DataFrame({
    "x":np.arange(4),
    "y":[0,2,-10,6]},
)

df
x y
0 0 0
1 1 2
2 2 -10
3 3 6

Here is how the data looks.

alt.Chart(df).mark_circle(size=150).encode(
    x="x",
    y="y"
)

Let’s figure out which of the following linear models better fits the data:

  • Line A: \(f(x) = 2x\)

  • Line B: \(g(x) = 0.6x - 1.4\)

  • Add columns to df corresponding to these lines. Name the new columns “lineA” and “lineB”.

The following works, but is overly complicated. There’s no need to use the map method for these basic examples.

# correct but not needed
df["lineA"] = df["x"].map(lambda x: 2*x)

This is the most natural way to include the Line A points.

# better
df["lineA"] = 2*df["x"]

Similarly for Line B: no map method is needed.

df["lineB"] = 0.6*df["x"] - 1.4
  • Plot the data together with these lines, using the color red for Line A and the color black for Line B. Use a base chart so that you are not repeating the same code three times.

Here is a trick that will save us some typing. The trick is not that important and I don’t expect it to show up on a quiz or exam, but it is nice to know it exists. We define a base chart that contains the components common to all of our charts.

base = alt.Chart(df).encode(
    x="x"
)

If we try to display base on its own, we get an error, because by itself this is not a valid chart (it is missing a mark).

base
---------------------------------------------------------------------------
SchemaValidationError                     Traceback (most recent call last)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:2020, in Chart.to_dict(self, *args, **kwargs)
   2018     copy.data = core.InlineData(values=[{}])
   2019     return super(Chart, copy).to_dict(*args, **kwargs)
-> 2020 return super().to_dict(*args, **kwargs)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:393, in TopLevelMixin.to_dict(self, *args, **kwargs)
    391 if dct is None:
    392     kwargs["validate"] = "deep"
--> 393     dct = super(TopLevelMixin, copy).to_dict(*args, **kwargs)
    395 # TODO: following entries are added after validation. Should they be validated?
    396 if is_top_level:
    397     # since this is top-level we add $schema if it's missing

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:360, in SchemaBase.to_dict(self, validate, ignore, context)
    358         self.validate(result)
    359     except jsonschema.ValidationError as err:
--> 360         raise SchemaValidationError(self, err)
    361 return result

SchemaValidationError: Invalid specification

        altair.vegalite.v4.api.Chart, validating 'required'

        'mark' is a required property
        
alt.Chart(...)

Here we make the scatter plot chart using the “y” column.

c1 = base.mark_circle(size=150).encode(y="y")

Here we make the two line charts using the “lineA” and “lineB” columns.

c2 = base.mark_line(color="red").encode(y="lineA")
c3 = base.mark_line(color="black").encode(y="lineB")

Now we can layer these three charts on top of each other.

c1+c2+c3
  • Which line fits the data better?

There’s no single correct answer to this; as we will see, it depends on which “performance measure” we are using.

  • Using scikit-learn, find the line of best fit for this data. How does it compare to the above lines?

Here we return to the routine we used several times last week: Import, instantiate, fit.

from sklearn.linear_model import LinearRegression

Here we make an instance of the LinearRegression class.

reg = LinearRegression()

Here we fit it to the data in the “y” column. It wouldn’t make much sense to fit it to the “lineA” or “lineB” data, because that data already lies on a line, and we already know the equation. Linear regression is really only interesting when the data does not perfectly lie on a line.

reg.fit(df[["x"]], df["y"])
LinearRegression()

Here is the coefficient (the slope) of the line of “best” fit to this data.

reg.coef_
array([0.6])

Here is the y-intercept.

reg.intercept_
-1.4

Notice that this describes exactly the black line above. So from the point of view of scikit-learn’s linear regression, the black line (Line B) fits the data better than the red line, but also better than every other possible line.
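
As a quick sanity check (not something we did in class), we can verify that the fitted model’s predictions match the “lineB” column:

# reg predicts 0.6*x - 1.4, which is exactly Line B,
# so this should evaluate to True
np.allclose(reg.predict(df[["x"]]), df["lineB"])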

  • Import the mean_squared_error function from sklearn.metrics. Which of our lines fits the data better according to this metric?

from sklearn.metrics import mean_squared_error

Here is an example of computing the mean squared error (MSE) between the true data and Line A.

Important: when computing errors (or loss functions), a smaller number is better.

mean_squared_error(df["y"], df["lineA"])
49.0
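
As a sanity check (not code from class), we can reproduce this number directly: the mean squared error is the average of the squared differences, \(\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\).

# average of the squared differences; this should match the 49.0 above
((df["y"] - df["lineA"])**2).mean()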

Here is the same computation with Line B. Because the Line B value is smaller, from the point of view of Mean Squared Error, Line B fits the data better than Line A.

When scikit-learn performs linear regression, it is seeking the line that minimizes the mean squared error.

mean_squared_error(df["y"], df["lineB"])
34.300000000000004

The best hypothetical error is 0, but there is no line that achieves a mean squared error of 0 for this data (because no line passes through all four data points).

mean_squared_error(df["y"], df["y"])
0.0
  • Import the mean_absolute_error function from sklearn.metrics. Which of our lines fits the data better according to this metric?

from sklearn.metrics import mean_absolute_error

This loss function, mean absolute error (MAE), is similar, but we take the average of the absolute values of the differences between the true values and the predicted values.

The following number is much smaller than the MSE values we saw above, but because we are using two different loss functions, I don’t think that comparison is meaningful.

mean_absolute_error(df["y"], df["lineA"])
3.5
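
As with MSE, we can reproduce this number directly (again a sanity check, not code from class): the mean absolute error is \(\frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\).

# average of the absolute differences; this should match the 3.5 above
(df["y"] - df["lineA"]).abs().mean()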

Notice that from the point of view of mean absolute error, Line A is a better fit than Line B (because the MAE value is smaller for Line A).

There is no single correct answer to which line fits the data better: it depends on which performance measure is used.

mean_absolute_error(df["y"], df["lineB"])
4.9

Summary: Line B is better (in fact, the best possible) with respect to mean squared error; Line A is better with respect to mean absolute error.

Different performance measures (different loss functions) will lead to different lines of “best” fit.

Ten minutes to work on Worksheets#

Maya and I are here to answer questions. If you’re already done with the worksheets due tomorrow, today’s worksheet (Worksheet 11) is posted. Worksheet 11 is about how the line of best fit found using mean squared error can be greatly influenced by outliers in the data.

More on polynomial regression#

Generating random data#

Here we make some data that follows a random polynomial. Can we use scikit-learn to estimate the underlying polynomial?

deg = 3
rng = np.random.default_rng(seed=27)
m = rng.integers(low=-5, high=5, size=deg+1)
print(m)
df = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})
df["y_true"] = 0
for i in range(len(m)):
    df["y_true"] += m[i]*df["x"]**i

df["y"] = df["y_true"] + rng.normal(scale=5, size=len(df))
[-5  1 -3 -2]

Notice how this data approximately follows a cubic curve (but certainly doesn’t lie on it perfectly).

c1 = alt.Chart(df).mark_circle().encode(
    x="x",
    y="y"
)

c1

Aside: If you are using range(len(???)) in Python, there is almost always a more elegant way to accomplish the same thing.

  • Rewrite the code above using enumerate(m) instead of range(len(m)).

(We skipped this step in class and continued below.)
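
For reference, here is one way that rewrite might look. Calling enumerate(m) yields each index together with the corresponding coefficient, so we never have to index into m by hand.

# same computation as above, but without range(len(m))
df["y_true"] = 0
for i, coef in enumerate(m):
    df["y_true"] += coef*df["x"]**i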

PolynomialFeatures#

We saw how to perform polynomial regression “by hand” last week. The process is much easier if we take advantage of some additional functionality in scikit-learn.

  • Demonstrate the PolynomialFeatures class from sklearn.preprocessing by evaluating it on the “x” column in df. Use a degree value of 3.

Note: this class uses transform rather than predict in the final step.

We again have our routine: import, instantiate, fit, and in this case, transform rather than predict. Here is the import step.

from sklearn.preprocessing import PolynomialFeatures

Here is the instantiate step. We specify as an argument to the constructor what degree of polynomial we’re interested in.

poly = PolynomialFeatures(degree=3)

I believe this is only the second scikit-learn class we’ve seen (the first being LinearRegression).

type(poly)
sklearn.preprocessing._polynomial.PolynomialFeatures

Now we fit the object to the data. Notice that this fit method does not need the output values.

poly.fit(df[["x"]])
PolynomialFeatures(degree=3)

Notice that we have to use a method other than predict here.

poly.predict(df[["x"]])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [36], line 1
----> 1 poly.predict(df[["x"]])

AttributeError: 'PolynomialFeatures' object has no attribute 'predict'

It is hard to read the output of the transform method in this case; we will look at it more closely below.

poly.transform(df[["x"]])
array([[ 1.00000000e+00, -3.00000000e+00,  9.00000000e+00,
        -2.70000000e+01],
       [ 1.00000000e+00, -2.90000000e+00,  8.41000000e+00,
        -2.43890000e+01],
       [ 1.00000000e+00, -2.80000000e+00,  7.84000000e+00,
        -2.19520000e+01],
       [ 1.00000000e+00, -2.70000000e+00,  7.29000000e+00,
        -1.96830000e+01],
       [ 1.00000000e+00, -2.60000000e+00,  6.76000000e+00,
        -1.75760000e+01],
       [ 1.00000000e+00, -2.50000000e+00,  6.25000000e+00,
        -1.56250000e+01],
       [ 1.00000000e+00, -2.40000000e+00,  5.76000000e+00,
        -1.38240000e+01],
       [ 1.00000000e+00, -2.30000000e+00,  5.29000000e+00,
        -1.21670000e+01],
       [ 1.00000000e+00, -2.20000000e+00,  4.84000000e+00,
        -1.06480000e+01],
       [ 1.00000000e+00, -2.10000000e+00,  4.41000000e+00,
        -9.26100000e+00],
       [ 1.00000000e+00, -2.00000000e+00,  4.00000000e+00,
        -8.00000000e+00],
       [ 1.00000000e+00, -1.90000000e+00,  3.61000000e+00,
        -6.85900000e+00],
       [ 1.00000000e+00, -1.80000000e+00,  3.24000000e+00,
        -5.83200000e+00],
       [ 1.00000000e+00, -1.70000000e+00,  2.89000000e+00,
        -4.91300000e+00],
       [ 1.00000000e+00, -1.60000000e+00,  2.56000000e+00,
        -4.09600000e+00],
       [ 1.00000000e+00, -1.50000000e+00,  2.25000000e+00,
        -3.37500000e+00],
       [ 1.00000000e+00, -1.40000000e+00,  1.96000000e+00,
        -2.74400000e+00],
       [ 1.00000000e+00, -1.30000000e+00,  1.69000000e+00,
        -2.19700000e+00],
       [ 1.00000000e+00, -1.20000000e+00,  1.44000000e+00,
        -1.72800000e+00],
       [ 1.00000000e+00, -1.10000000e+00,  1.21000000e+00,
        -1.33100000e+00],
       [ 1.00000000e+00, -1.00000000e+00,  1.00000000e+00,
        -1.00000000e+00],
       [ 1.00000000e+00, -9.00000000e-01,  8.10000000e-01,
        -7.29000000e-01],
       [ 1.00000000e+00, -8.00000000e-01,  6.40000000e-01,
        -5.12000000e-01],
       [ 1.00000000e+00, -7.00000000e-01,  4.90000000e-01,
        -3.43000000e-01],
       [ 1.00000000e+00, -6.00000000e-01,  3.60000000e-01,
        -2.16000000e-01],
       [ 1.00000000e+00, -5.00000000e-01,  2.50000000e-01,
        -1.25000000e-01],
       [ 1.00000000e+00, -4.00000000e-01,  1.60000000e-01,
        -6.40000000e-02],
       [ 1.00000000e+00, -3.00000000e-01,  9.00000000e-02,
        -2.70000000e-02],
       [ 1.00000000e+00, -2.00000000e-01,  4.00000000e-02,
        -8.00000000e-03],
       [ 1.00000000e+00, -1.00000000e-01,  1.00000000e-02,
        -1.00000000e-03],
       [ 1.00000000e+00,  2.66453526e-15,  7.09974815e-30,
         1.89175293e-44],
       [ 1.00000000e+00,  1.00000000e-01,  1.00000000e-02,
         1.00000000e-03],
       [ 1.00000000e+00,  2.00000000e-01,  4.00000000e-02,
         8.00000000e-03],
       [ 1.00000000e+00,  3.00000000e-01,  9.00000000e-02,
         2.70000000e-02],
       [ 1.00000000e+00,  4.00000000e-01,  1.60000000e-01,
         6.40000000e-02],
       [ 1.00000000e+00,  5.00000000e-01,  2.50000000e-01,
         1.25000000e-01],
       [ 1.00000000e+00,  6.00000000e-01,  3.60000000e-01,
         2.16000000e-01],
       [ 1.00000000e+00,  7.00000000e-01,  4.90000000e-01,
         3.43000000e-01],
       [ 1.00000000e+00,  8.00000000e-01,  6.40000000e-01,
         5.12000000e-01],
       [ 1.00000000e+00,  9.00000000e-01,  8.10000000e-01,
         7.29000000e-01],
       [ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00,
         1.00000000e+00],
       [ 1.00000000e+00,  1.10000000e+00,  1.21000000e+00,
         1.33100000e+00],
       [ 1.00000000e+00,  1.20000000e+00,  1.44000000e+00,
         1.72800000e+00],
       [ 1.00000000e+00,  1.30000000e+00,  1.69000000e+00,
         2.19700000e+00],
       [ 1.00000000e+00,  1.40000000e+00,  1.96000000e+00,
         2.74400000e+00],
       [ 1.00000000e+00,  1.50000000e+00,  2.25000000e+00,
         3.37500000e+00],
       [ 1.00000000e+00,  1.60000000e+00,  2.56000000e+00,
         4.09600000e+00],
       [ 1.00000000e+00,  1.70000000e+00,  2.89000000e+00,
         4.91300000e+00],
       [ 1.00000000e+00,  1.80000000e+00,  3.24000000e+00,
         5.83200000e+00],
       [ 1.00000000e+00,  1.90000000e+00,  3.61000000e+00,
         6.85900000e+00]])
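
Aside (we did not cover this in class): scikit-learn transformers like this one also have a fit_transform method that performs the fit step and the transform step at once. The following should produce the same array as above.

# equivalent to calling poly.fit(...) followed by poly.transform(...)
poly.fit_transform(df[["x"]])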
  • To make the result easier to read, put the transformed data into a pandas DataFrame named df_poly, and round all of the entries to the nearest 0.1 value. (Rounding every entry is a natural place to use applymap, but in fact, there is also a round DataFrame method which can be used directly.)

Aside: notice that the default column labels are integers, just like the default row labels.

df_poly = pd.DataFrame(poly.transform(df[["x"]]))
df_poly
0 1 2 3
0 1.0 -3.000000e+00 9.000000e+00 -2.700000e+01
1 1.0 -2.900000e+00 8.410000e+00 -2.438900e+01
2 1.0 -2.800000e+00 7.840000e+00 -2.195200e+01
3 1.0 -2.700000e+00 7.290000e+00 -1.968300e+01
4 1.0 -2.600000e+00 6.760000e+00 -1.757600e+01
5 1.0 -2.500000e+00 6.250000e+00 -1.562500e+01
6 1.0 -2.400000e+00 5.760000e+00 -1.382400e+01
7 1.0 -2.300000e+00 5.290000e+00 -1.216700e+01
8 1.0 -2.200000e+00 4.840000e+00 -1.064800e+01
9 1.0 -2.100000e+00 4.410000e+00 -9.261000e+00
10 1.0 -2.000000e+00 4.000000e+00 -8.000000e+00
11 1.0 -1.900000e+00 3.610000e+00 -6.859000e+00
12 1.0 -1.800000e+00 3.240000e+00 -5.832000e+00
13 1.0 -1.700000e+00 2.890000e+00 -4.913000e+00
14 1.0 -1.600000e+00 2.560000e+00 -4.096000e+00
15 1.0 -1.500000e+00 2.250000e+00 -3.375000e+00
16 1.0 -1.400000e+00 1.960000e+00 -2.744000e+00
17 1.0 -1.300000e+00 1.690000e+00 -2.197000e+00
18 1.0 -1.200000e+00 1.440000e+00 -1.728000e+00
19 1.0 -1.100000e+00 1.210000e+00 -1.331000e+00
20 1.0 -1.000000e+00 1.000000e+00 -1.000000e+00
21 1.0 -9.000000e-01 8.100000e-01 -7.290000e-01
22 1.0 -8.000000e-01 6.400000e-01 -5.120000e-01
23 1.0 -7.000000e-01 4.900000e-01 -3.430000e-01
24 1.0 -6.000000e-01 3.600000e-01 -2.160000e-01
25 1.0 -5.000000e-01 2.500000e-01 -1.250000e-01
26 1.0 -4.000000e-01 1.600000e-01 -6.400000e-02
27 1.0 -3.000000e-01 9.000000e-02 -2.700000e-02
28 1.0 -2.000000e-01 4.000000e-02 -8.000000e-03
29 1.0 -1.000000e-01 1.000000e-02 -1.000000e-03
30 1.0 2.664535e-15 7.099748e-30 1.891753e-44
31 1.0 1.000000e-01 1.000000e-02 1.000000e-03
32 1.0 2.000000e-01 4.000000e-02 8.000000e-03
33 1.0 3.000000e-01 9.000000e-02 2.700000e-02
34 1.0 4.000000e-01 1.600000e-01 6.400000e-02
35 1.0 5.000000e-01 2.500000e-01 1.250000e-01
36 1.0 6.000000e-01 3.600000e-01 2.160000e-01
37 1.0 7.000000e-01 4.900000e-01 3.430000e-01
38 1.0 8.000000e-01 6.400000e-01 5.120000e-01
39 1.0 9.000000e-01 8.100000e-01 7.290000e-01
40 1.0 1.000000e+00 1.000000e+00 1.000000e+00
41 1.0 1.100000e+00 1.210000e+00 1.331000e+00
42 1.0 1.200000e+00 1.440000e+00 1.728000e+00
43 1.0 1.300000e+00 1.690000e+00 2.197000e+00
44 1.0 1.400000e+00 1.960000e+00 2.744000e+00
45 1.0 1.500000e+00 2.250000e+00 3.375000e+00
46 1.0 1.600000e+00 2.560000e+00 4.096000e+00
47 1.0 1.700000e+00 2.890000e+00 4.913000e+00
48 1.0 1.800000e+00 3.240000e+00 5.832000e+00
49 1.0 1.900000e+00 3.610000e+00 6.859000e+00
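
The exercise also asked us to round every entry to the nearest 0.1. Here is how that might look using the round DataFrame method mentioned in the prompt.

# round every entry to one decimal place
df_poly = df_poly.round(1)

# an equivalent (but more verbose) approach using applymap:
# df_poly = df_poly.applymap(lambda z: round(z, 1))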

Using Pipeline to combine multiple steps#

  • Import the Pipeline class from sklearn.pipeline.

Here is another class in scikit-learn, the Pipeline class. We also import LinearRegression.

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
  • Make an instance of this Pipeline class. Pass to the constructor a list of length-2 tuples, where each tuple provides a name for the step (as a string) and an instance of the corresponding class (like PolynomialFeatures(???)).

pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),
    ("reg", LinearRegression())
])
  • Fit this object to the data.

pipe.fit(df[["x"]], df["y"])
Pipeline(steps=[('poly', PolynomialFeatures(degree=3)),
                ('reg', LinearRegression())])
  • Call the predict method, and add the resulting values to a new column named “pred”.

We will continue with this in the next class.
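
Here is a minimal sketch of what that step might look like; we will go through it carefully next time.

# the pipeline applies PolynomialFeatures and then LinearRegression
df["pred"] = pipe.predict(df[["x"]])

We can access the regression step of the pipeline using the name “reg” that we gave it; here are the coefficients it found.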

pipe["reg"].coef_
array([ 0.        ,  3.33524206, -3.03442169, -2.32952623])
  • Plot the resulting predictions using a red line. Layer this plot on top of the above scatter plot.

  • Check the coefficients and intercept that were found during this fit process. (You can use the named_steps attribute, or just use the names directly.)

  • How do these numbers compare to the true coefficients (stored in m)?
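
We will work through these last bullet points next class, but here is a minimal sketch of how they might go (using the layering trick from earlier and the “pred” column sketched above).

# layer the red prediction line on top of the earlier scatter plot c1
c_pred = alt.Chart(df).mark_line(color="red").encode(x="x", y="pred")
c1 + c_pred

# the "reg" step can also be reached via the named_steps attribute
reg_step = pipe.named_steps["reg"]
print(reg_step.coef_)       # compare to m[1:], the coefficients of x, x**2, x**3
print(reg_step.intercept_)  # compare to m[0], the constant term

Notice in the coef_ output above that the first entry is 0: PolynomialFeatures produces a constant column, but LinearRegression reports the constant term through intercept_ instead.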