Week 6 Monday#
Announcements#
Quiz tomorrow is based on Worksheets 9-10.
I expect the midterms to be returned Tuesday in discussion. If you see a clear mistake in the grading, leave the exam with Jinghao and include a note to me on or with the midterm.
I’m going to try the same schedule as Friday:
A. 15-20 minutes lecture (performance measures for regression)
B. 10-15 minutes for worksheets (two worksheets due tomorrow, a new worksheet posted today). Maya and I are here to answer questions.
C. 15-20 minutes lecture (polynomial regression, continuation from Friday)
import numpy as np
import pandas as pd
import altair as alt
Performance measures for regression#
Here is a dataset with only 4 data points.
df = pd.DataFrame({
    "x": np.arange(4),
    "y": [0, 2, -10, 6],
})
df
|   | x | y |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | -10 |
| 3 | 3 | 6 |
Here is how the data looks.
alt.Chart(df).mark_circle(size=150).encode(
    x="x",
    y="y"
)
Let’s figure out which of the following linear models better fits the data:
Line A: \(f(x) = 2x\)
Line B: \(g(x) = 0.6x - 1.4\)
Add columns to df corresponding to these lines. Name the new columns “lineA” and “lineB”.
The following works, but is overly complicated. There’s no need to use the map method for these basic examples.
# correct but not needed
df["lineA"] = df["x"].map(lambda x: 2*x)
This is the most natural way to include the Line A points.
# better
df["lineA"] = 2*df["x"]
Similarly for Line B: no map method is needed.
df["lineB"] = 0.6*df["x"] - 1.4
Plot the data together with these lines, using the color red for Line A and the color black for Line B. Use a base chart so that you are not repeating the same code three times.
Here is a trick that will save us some typing. This trick is not that important and I don’t expect it to show up on a quiz or exam, but it is nice to know it exists. We define a base chart that contains the components common to all of our charts.
base = alt.Chart(df).encode(
    x="x"
)
If we try to display base on its own, we get an error, because this by itself is not a valid chart (it’s missing the mark code).
base
---------------------------------------------------------------------------
SchemaValidationError Traceback (most recent call last)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:2020, in Chart.to_dict(self, *args, **kwargs)
2018 copy.data = core.InlineData(values=[{}])
2019 return super(Chart, copy).to_dict(*args, **kwargs)
-> 2020 return super().to_dict(*args, **kwargs)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:393, in TopLevelMixin.to_dict(self, *args, **kwargs)
391 if dct is None:
392 kwargs["validate"] = "deep"
--> 393 dct = super(TopLevelMixin, copy).to_dict(*args, **kwargs)
395 # TODO: following entries are added after validation. Should they be validated?
396 if is_top_level:
397 # since this is top-level we add $schema if it's missing
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:360, in SchemaBase.to_dict(self, validate, ignore, context)
358 self.validate(result)
359 except jsonschema.ValidationError as err:
--> 360 raise SchemaValidationError(self, err)
361 return result
SchemaValidationError: Invalid specification
altair.vegalite.v4.api.Chart, validating 'required'
'mark' is a required property
alt.Chart(...)
Here we make the scatter plot chart using the “y” column.
c1 = base.mark_circle(size=150).encode(y="y")
Here we make the two line charts using the “lineA” and “lineB” columns.
c2 = base.mark_line(color="red").encode(y="lineA")
c3 = base.mark_line(color="black").encode(y="lineB")
Now we can layer these three charts on top of each other.
c1+c2+c3
Which line fits the data better?
There’s no single correct answer to this; as we will see, it depends on which “performance measure” we are using.
Using scikit-learn, find the line of best fit for this data. How does it compare to the above lines?
Here we return to the routine we used several times last week: Import, instantiate, fit.
from sklearn.linear_model import LinearRegression
Here we make an instance of the LinearRegression class.
reg = LinearRegression()
Here we fit it to the data in the “y” column. It wouldn’t make much sense to fit it to the “lineA” or “lineB” data, because that data already lies on a line, and we already know the equation. Linear regression is really only interesting when the data does not perfectly lie on a line.
reg.fit(df[["x"]], df["y"])
LinearRegression()
Here is the coefficient of the line of “best” fit to this data.
reg.coef_
array([0.6])
Here is the y-intercept.
reg.intercept_
-1.4
Notice that this describes exactly the black line above. So from the point of view of scikit-learn’s linear regression, the black line (Line B) fits the data better than the red line, but also better than every other possible line.
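One way to see this at a glance (we did not do this in class) is to format the fitted coefficient and intercept as an equation:

# Should print (up to rounding) y = 0.6x + -1.4, i.e. Line B
print(f"y = {reg.coef_[0]:.1f}x + {reg.intercept_:.1f}")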
Import the mean_squared_error function from sklearn.metrics. Which of our lines fits the data better according to this metric?
from sklearn.metrics import mean_squared_error
Here is an example of computing the mean squared error (MSE) between the true data and Line A.
Important: when computing errors (or loss functions), a smaller number is better.
mean_squared_error(df["y"], df["lineA"])
49.0
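As a sanity check (not done in class), the same number can be computed directly with pandas; the result should match the 49.0 above.

# Mean of the squared differences between the true values and the Line A values
((df["y"] - df["lineA"])**2).mean()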
Here is the same computation with Line B. Because the Line B value is smaller, from the point of view of Mean Squared Error, Line B fits the data better than Line A.
When scikit-learn performs linear regression, it is seeking the line that minimizes the mean squared error.
mean_squared_error(df["y"], df["lineB"])
34.300000000000004
The best hypothetical error is 0, but there is no line that achieves a mean squared error of 0 for this data (because no line passes through all four data points).
mean_squared_error(df["y"], df["y"])
0.0
Import the mean_absolute_error function from sklearn.metrics. Which of our lines fits the data better according to this metric?
from sklearn.metrics import mean_absolute_error
This loss function, mean absolute error (MAE), is similar, but we take the average of the absolute values of the differences between the true values and the predicted values.
The following number is much smaller than the MSE values we saw above, but because we are using two different loss functions, I don’t think that comparison is meaningful.
mean_absolute_error(df["y"], df["lineA"])
3.5
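Again as a sanity check (not done in class), this can be computed directly with pandas; the result should match the 3.5 above.

# Mean of the absolute differences between the true values and the Line A values
(df["y"] - df["lineA"]).abs().mean()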
Notice that from the point of view of mean absolute error, Line A is a better fit than Line B (because the MAE value is smaller for Line A).
There is no correct answer to which line fits the data better: it depends what performance measure is used.
mean_absolute_error(df["y"], df["lineB"])
4.9
Summary:
Line B is better (actually the best possible) with respect to mean squared error.
Line A is better with respect to mean absolute error.
Different performance measures (different loss functions) will lead to different lines of “best” fit.
Ten minutes to work on Worksheets#
Maya and I are here to answer questions. If you’re already done with the worksheets due tomorrow, today’s worksheet (Worksheet 11) is posted. Worksheet 11 is about how finding the line of best fit using Mean Squared Error can be greatly influenced by outliers in the data.
More on polynomial regression#
Generating random data#
Here we make some data that follows a random polynomial. Can we use scikit-learn to estimate the underlying polynomial?
deg = 3
rng = np.random.default_rng(seed=27)
# Random integer coefficients for a degree-3 polynomial
m = rng.integers(low=-5, high=5, size=deg+1)
print(m)
df = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})
# y_true is the polynomial m[0] + m[1]*x + m[2]*x**2 + m[3]*x**3
df["y_true"] = 0
for i in range(len(m)):
    df["y_true"] += m[i]*df["x"]**i
# Add random noise to get the "observed" y values
df["y"] = df["y_true"] + rng.normal(scale=5, size=len(df))
[-5 1 -3 -2]
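In other words, with this seed the coefficients are \(-5, 1, -3, -2\), so the data is generated from the cubic \(y_{\text{true}} = -5 + x - 3x^2 - 2x^3\) plus random noise.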
Notice how this data approximately follows a cubic curve (but certainly doesn’t lie on it perfectly).
c1 = alt.Chart(df).mark_circle().encode(
    x="x",
    y="y"
)
c1
Aside: If you are using range(len(???)) in Python, there is almost always a more elegant way to accomplish the same thing.
Rewrite the code above using enumerate(m) instead of range(len(m)).
(We skipped this step in class and continued below.)
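For completeness, here is one possible rewrite (a sketch; we did not run this in class):

# Same computation as above, using enumerate instead of range(len(m))
df["y_true"] = 0
for i, coef in enumerate(m):
    df["y_true"] += coef*df["x"]**i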
PolynomialFeatures#
We saw how to perform polynomial regression “by hand” last week. The process is much easier if we take advantage of some additional functionality in scikit-learn.
Demonstrate the PolynomialFeatures class from sklearn.preprocessing by evaluating it on the “x” column in df. Use a degree value of 3.
Note: this class uses transform rather than predict in the final step.
We again have our routine: import, instantiate, fit, and in this case, transform rather than predict. Here is the import step.
from sklearn.preprocessing import PolynomialFeatures
Here is the instantiate step. We specify as an argument to the constructor what degree of polynomial we’re interested in.
poly = PolynomialFeatures(degree=3)
I believe this is the second object class we’ve seen in scikit-learn (the first being LinearRegression).
type(poly)
sklearn.preprocessing._polynomial.PolynomialFeatures
Now we fit the data. This fit method does not need the output values.
poly.fit(df[["x"]])
PolynomialFeatures(degree=3)
Notice that we have to use a term other than predict here.
poly.predict(df[["x"]])
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [36], line 1
----> 1 poly.predict(df[["x"]])
AttributeError: 'PolynomialFeatures' object has no attribute 'predict'
It is hard to read the output of the transform method in this case; we will look at it more closely below.
poly.transform(df[["x"]])
array([[ 1.00000000e+00, -3.00000000e+00, 9.00000000e+00,
-2.70000000e+01],
[ 1.00000000e+00, -2.90000000e+00, 8.41000000e+00,
-2.43890000e+01],
[ 1.00000000e+00, -2.80000000e+00, 7.84000000e+00,
-2.19520000e+01],
[ 1.00000000e+00, -2.70000000e+00, 7.29000000e+00,
-1.96830000e+01],
[ 1.00000000e+00, -2.60000000e+00, 6.76000000e+00,
-1.75760000e+01],
[ 1.00000000e+00, -2.50000000e+00, 6.25000000e+00,
-1.56250000e+01],
[ 1.00000000e+00, -2.40000000e+00, 5.76000000e+00,
-1.38240000e+01],
[ 1.00000000e+00, -2.30000000e+00, 5.29000000e+00,
-1.21670000e+01],
[ 1.00000000e+00, -2.20000000e+00, 4.84000000e+00,
-1.06480000e+01],
[ 1.00000000e+00, -2.10000000e+00, 4.41000000e+00,
-9.26100000e+00],
[ 1.00000000e+00, -2.00000000e+00, 4.00000000e+00,
-8.00000000e+00],
[ 1.00000000e+00, -1.90000000e+00, 3.61000000e+00,
-6.85900000e+00],
[ 1.00000000e+00, -1.80000000e+00, 3.24000000e+00,
-5.83200000e+00],
[ 1.00000000e+00, -1.70000000e+00, 2.89000000e+00,
-4.91300000e+00],
[ 1.00000000e+00, -1.60000000e+00, 2.56000000e+00,
-4.09600000e+00],
[ 1.00000000e+00, -1.50000000e+00, 2.25000000e+00,
-3.37500000e+00],
[ 1.00000000e+00, -1.40000000e+00, 1.96000000e+00,
-2.74400000e+00],
[ 1.00000000e+00, -1.30000000e+00, 1.69000000e+00,
-2.19700000e+00],
[ 1.00000000e+00, -1.20000000e+00, 1.44000000e+00,
-1.72800000e+00],
[ 1.00000000e+00, -1.10000000e+00, 1.21000000e+00,
-1.33100000e+00],
[ 1.00000000e+00, -1.00000000e+00, 1.00000000e+00,
-1.00000000e+00],
[ 1.00000000e+00, -9.00000000e-01, 8.10000000e-01,
-7.29000000e-01],
[ 1.00000000e+00, -8.00000000e-01, 6.40000000e-01,
-5.12000000e-01],
[ 1.00000000e+00, -7.00000000e-01, 4.90000000e-01,
-3.43000000e-01],
[ 1.00000000e+00, -6.00000000e-01, 3.60000000e-01,
-2.16000000e-01],
[ 1.00000000e+00, -5.00000000e-01, 2.50000000e-01,
-1.25000000e-01],
[ 1.00000000e+00, -4.00000000e-01, 1.60000000e-01,
-6.40000000e-02],
[ 1.00000000e+00, -3.00000000e-01, 9.00000000e-02,
-2.70000000e-02],
[ 1.00000000e+00, -2.00000000e-01, 4.00000000e-02,
-8.00000000e-03],
[ 1.00000000e+00, -1.00000000e-01, 1.00000000e-02,
-1.00000000e-03],
[ 1.00000000e+00, 2.66453526e-15, 7.09974815e-30,
1.89175293e-44],
[ 1.00000000e+00, 1.00000000e-01, 1.00000000e-02,
1.00000000e-03],
[ 1.00000000e+00, 2.00000000e-01, 4.00000000e-02,
8.00000000e-03],
[ 1.00000000e+00, 3.00000000e-01, 9.00000000e-02,
2.70000000e-02],
[ 1.00000000e+00, 4.00000000e-01, 1.60000000e-01,
6.40000000e-02],
[ 1.00000000e+00, 5.00000000e-01, 2.50000000e-01,
1.25000000e-01],
[ 1.00000000e+00, 6.00000000e-01, 3.60000000e-01,
2.16000000e-01],
[ 1.00000000e+00, 7.00000000e-01, 4.90000000e-01,
3.43000000e-01],
[ 1.00000000e+00, 8.00000000e-01, 6.40000000e-01,
5.12000000e-01],
[ 1.00000000e+00, 9.00000000e-01, 8.10000000e-01,
7.29000000e-01],
[ 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
1.00000000e+00],
[ 1.00000000e+00, 1.10000000e+00, 1.21000000e+00,
1.33100000e+00],
[ 1.00000000e+00, 1.20000000e+00, 1.44000000e+00,
1.72800000e+00],
[ 1.00000000e+00, 1.30000000e+00, 1.69000000e+00,
2.19700000e+00],
[ 1.00000000e+00, 1.40000000e+00, 1.96000000e+00,
2.74400000e+00],
[ 1.00000000e+00, 1.50000000e+00, 2.25000000e+00,
3.37500000e+00],
[ 1.00000000e+00, 1.60000000e+00, 2.56000000e+00,
4.09600000e+00],
[ 1.00000000e+00, 1.70000000e+00, 2.89000000e+00,
4.91300000e+00],
[ 1.00000000e+00, 1.80000000e+00, 3.24000000e+00,
5.83200000e+00],
[ 1.00000000e+00, 1.90000000e+00, 3.61000000e+00,
6.85900000e+00]])
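Each row of this array holds the powers \(x^0, x^1, x^2, x^3\) of the corresponding entry of the “x” column, so these are essentially the columns we built by hand last week.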
To make the result easier to read, put the transformed data into a pandas DataFrame named df_poly, and round all of the entries to the nearest 0.1 value. (Rounding every entry is a natural place to use applymap, but in fact, there is also a round DataFrame method which can be used directly.)
Aside: notice that the default column labels are integers, just like the default row labels.
df_poly = pd.DataFrame(poly.transform(df[["x"]]))
df_poly
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
0 | 1.0 | -3.000000e+00 | 9.000000e+00 | -2.700000e+01 |
1 | 1.0 | -2.900000e+00 | 8.410000e+00 | -2.438900e+01 |
2 | 1.0 | -2.800000e+00 | 7.840000e+00 | -2.195200e+01 |
3 | 1.0 | -2.700000e+00 | 7.290000e+00 | -1.968300e+01 |
4 | 1.0 | -2.600000e+00 | 6.760000e+00 | -1.757600e+01 |
5 | 1.0 | -2.500000e+00 | 6.250000e+00 | -1.562500e+01 |
6 | 1.0 | -2.400000e+00 | 5.760000e+00 | -1.382400e+01 |
7 | 1.0 | -2.300000e+00 | 5.290000e+00 | -1.216700e+01 |
8 | 1.0 | -2.200000e+00 | 4.840000e+00 | -1.064800e+01 |
9 | 1.0 | -2.100000e+00 | 4.410000e+00 | -9.261000e+00 |
10 | 1.0 | -2.000000e+00 | 4.000000e+00 | -8.000000e+00 |
11 | 1.0 | -1.900000e+00 | 3.610000e+00 | -6.859000e+00 |
12 | 1.0 | -1.800000e+00 | 3.240000e+00 | -5.832000e+00 |
13 | 1.0 | -1.700000e+00 | 2.890000e+00 | -4.913000e+00 |
14 | 1.0 | -1.600000e+00 | 2.560000e+00 | -4.096000e+00 |
15 | 1.0 | -1.500000e+00 | 2.250000e+00 | -3.375000e+00 |
16 | 1.0 | -1.400000e+00 | 1.960000e+00 | -2.744000e+00 |
17 | 1.0 | -1.300000e+00 | 1.690000e+00 | -2.197000e+00 |
18 | 1.0 | -1.200000e+00 | 1.440000e+00 | -1.728000e+00 |
19 | 1.0 | -1.100000e+00 | 1.210000e+00 | -1.331000e+00 |
20 | 1.0 | -1.000000e+00 | 1.000000e+00 | -1.000000e+00 |
21 | 1.0 | -9.000000e-01 | 8.100000e-01 | -7.290000e-01 |
22 | 1.0 | -8.000000e-01 | 6.400000e-01 | -5.120000e-01 |
23 | 1.0 | -7.000000e-01 | 4.900000e-01 | -3.430000e-01 |
24 | 1.0 | -6.000000e-01 | 3.600000e-01 | -2.160000e-01 |
25 | 1.0 | -5.000000e-01 | 2.500000e-01 | -1.250000e-01 |
26 | 1.0 | -4.000000e-01 | 1.600000e-01 | -6.400000e-02 |
27 | 1.0 | -3.000000e-01 | 9.000000e-02 | -2.700000e-02 |
28 | 1.0 | -2.000000e-01 | 4.000000e-02 | -8.000000e-03 |
29 | 1.0 | -1.000000e-01 | 1.000000e-02 | -1.000000e-03 |
30 | 1.0 | 2.664535e-15 | 7.099748e-30 | 1.891753e-44 |
31 | 1.0 | 1.000000e-01 | 1.000000e-02 | 1.000000e-03 |
32 | 1.0 | 2.000000e-01 | 4.000000e-02 | 8.000000e-03 |
33 | 1.0 | 3.000000e-01 | 9.000000e-02 | 2.700000e-02 |
34 | 1.0 | 4.000000e-01 | 1.600000e-01 | 6.400000e-02 |
35 | 1.0 | 5.000000e-01 | 2.500000e-01 | 1.250000e-01 |
36 | 1.0 | 6.000000e-01 | 3.600000e-01 | 2.160000e-01 |
37 | 1.0 | 7.000000e-01 | 4.900000e-01 | 3.430000e-01 |
38 | 1.0 | 8.000000e-01 | 6.400000e-01 | 5.120000e-01 |
39 | 1.0 | 9.000000e-01 | 8.100000e-01 | 7.290000e-01 |
40 | 1.0 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 |
41 | 1.0 | 1.100000e+00 | 1.210000e+00 | 1.331000e+00 |
42 | 1.0 | 1.200000e+00 | 1.440000e+00 | 1.728000e+00 |
43 | 1.0 | 1.300000e+00 | 1.690000e+00 | 2.197000e+00 |
44 | 1.0 | 1.400000e+00 | 1.960000e+00 | 2.744000e+00 |
45 | 1.0 | 1.500000e+00 | 2.250000e+00 | 3.375000e+00 |
46 | 1.0 | 1.600000e+00 | 2.560000e+00 | 4.096000e+00 |
47 | 1.0 | 1.700000e+00 | 2.890000e+00 | 4.913000e+00 |
48 | 1.0 | 1.800000e+00 | 3.240000e+00 | 5.832000e+00 |
49 | 1.0 | 1.900000e+00 | 3.610000e+00 | 6.859000e+00 |
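We did not actually do the rounding asked for above; here is a sketch of how it might look using the round DataFrame method (applymap with Python’s round would also work).

# Round every entry to one decimal place (the nearest 0.1)
df_poly = df_poly.round(1)
df_poly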
Using Pipeline to combine multiple steps#
Import the Pipeline class from sklearn.pipeline.
Here is another class in scikit-learn, the Pipeline class. We also import LinearRegression.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
Make an instance of this Pipeline class. Pass to the constructor a list of length-2 tuples, where each tuple provides a name for the step (as a string) and the constructor (like PolynomialFeatures(???)).
pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),
    ("reg", LinearRegression())
])
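When we fit this pipeline, scikit-learn will first apply the PolynomialFeatures transformation to the input, and then fit the LinearRegression object on the transformed columns.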
Fit this object to the data.
pipe.fit(df[["x"]], df["y"])
Pipeline(steps=[('poly', PolynomialFeatures(degree=3)), ('reg', LinearRegression())])
Call the predict method, and add the resulting values to a new column named “pred”.
We will continue with this in the next class.
pipe["reg"].coef_
array([ 0. , 3.33524206, -3.03442169, -2.32952623])
Plot the resulting predictions using a red line. Layer this plot on top of the above scatter plot.
Check the coefficients and intercept that were found during this fit process. (You can use the named_steps attribute, or just use the names directly.)
How do these numbers compare to the true coefficients (stored in m)?