Polynomial Regression

Announcements

The quiz tomorrow will have 3 questions. The topics will be:

  • Performing K-Means clustering using scikit-learn. See the notebooks from Week 5.

  • Performing K-Means clustering by hand (assign to clusters, find centroids, repeat). We discussed this Wednesday of Week 5 (much of the discussion was at the whiteboard). (Here is a Wikipedia description and an animation on YouTube.)

  • Using a pandas DataFrame’s apply method with a lambda function, either with axis=0 (apply to one column at a time) or with axis=1 (apply to one row at a time). (We first introduced apply in this notebook, and it also was used in both worksheets last week. A minimal reminder sketch appears after this list.)
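
Here is that minimal reminder sketch of the apply pattern, using a small made-up DataFrame (the column names "a" and "b" are just for illustration, not from the cars dataset):

import pandas as pd

mini = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: the lambda receives one column (a Series) at a time
mini.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the lambda receives one row (a Series) at a time
mini.apply(lambda row: row["a"] + row["b"], axis=1)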

There isn’t time to go over this material today, but Yasmeen will review it on Tuesday. Yasmeen and I also have Zoom office hours over the next two days:

  • Yasmeen: 3:00-5:00pm Monday

  • Chris: 11:00am-12:30pm Tuesday

See Canvas for the Zoom links to the office hours.

Linear Regression with the cars dataset

Let’s try to express “Horsepower” in terms of “MPG” (miles per gallon). You should have the intuition that these two columns are negatively correlated.

import pandas as pd
import altair as alt
df = pd.read_csv("../data/cars.csv").dropna()
df.rename({"Miles_per_Gallon":"MPG"}, axis=1, inplace=True)

Looking at the following, we see that indeed, “MPG” and “Horsepower” are quite negatively correlated.

df.corr()
MPG Cylinders Displacement Horsepower Weight_in_lbs Acceleration
MPG 1.000000 -0.777618 -0.805127 -0.778427 -0.832244 0.423329
Cylinders -0.777618 1.000000 0.950823 0.842983 0.897527 -0.504683
Displacement -0.805127 0.950823 1.000000 0.897257 0.932994 -0.543800
Horsepower -0.778427 0.842983 0.897257 1.000000 0.864538 -0.689196
Weight_in_lbs -0.832244 0.897527 0.932994 0.864538 1.000000 -0.416839
Acceleration 0.423329 -0.504683 -0.543800 -0.689196 -0.416839 1.000000
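
As an aside, if we only want the single correlation between “MPG” and “Horsepower”, we can select that one entry out of the correlation matrix:

df.corr().loc["MPG", "Horsepower"]   # approximately -0.778, matching the table above
# (in newer versions of pandas you may need df.corr(numeric_only=True))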

This negative correlation is also visible in the following chart: as one coordinate value increases, the other tends to decrease.

alt.Chart(df).mark_circle().encode(
    x="MPG",
    y="Horsepower"
)

Let’s investigate the same thing using scikit-learn. We follow the usual pattern:

  • import

  • create/instantiate

  • fit

  • predict/transform

# import
from sklearn.linear_model import LinearRegression
# create/instantiate
reg = LinearRegression()

The following is probably the most common mistake with scikit-learn. The first input needs to be two-dimensional (in this case, the first input needs to be a DataFrame, not a Series).
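
As a quick check (an aside), we can compare the shapes of the two ways of selecting this column:

df["MPG"].shape    # one-dimensional, something like (n,) -- a Series
df[["MPG"]].shape  # two-dimensional, something like (n, 1) -- a DataFrame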

reg.fit(df["MPG"], df["Horsepower"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
----> 1 reg.fit(df["MPG"], df["Horsepower"])

ValueError: Expected 2D array, got 1D array instead:
array=[18.  15.  18.  16.  17.  15.  14. ... 44.  32.  28.  31. ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The key to fixing this error in this case is to use a list of column names (in this case, it’s just the single column name “MPG”, but we still need it to be in a list).

# fit
reg.fit(df[["MPG"]], df["Horsepower"])
LinearRegression()
reg.predict(df[["MPG"]])
array([125.3756586 , 136.8923227 , 125.3756586 , 133.05343467,
       129.21454664, 136.8923227 , 140.73121073, 140.73121073,
       ...,
        25.56456975,  71.63122615,  86.98677828,  75.47011418])
#add a column to df
df["Pred"] = reg.predict(df[["MPG"]])

Notice how the predicted values show up as a new column on the right side of the DataFrame. At least the first prediction is reasonably accurate: approximately 125, whereas the true value is 130.

df.head()
Name MPG Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin Pred
0 chevrolet chevelle malibu 18.0 8 307.0 130.0 3504 12.0 1970-01-01 USA 125.375659
1 buick skylark 320 15.0 8 350.0 165.0 3693 11.5 1970-01-01 USA 136.892323
2 plymouth satellite 18.0 8 318.0 150.0 3436 11.0 1970-01-01 USA 125.375659
3 amc rebel sst 16.0 8 304.0 150.0 3433 12.0 1970-01-01 USA 133.053435
4 ford torino 17.0 8 302.0 140.0 3449 10.5 1970-01-01 USA 129.214547
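
As a rough overall check (an aside, not something we computed in lecture), we can use pandas to find the average absolute difference between the true and predicted horsepower values:

# average absolute difference between "Horsepower" and "Pred"
(df["Horsepower"] - df["Pred"]).abs().mean()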
c = alt.Chart(df).mark_circle().encode(
    x="MPG",
    y="Horsepower"
)

We eventually want to plot the predicted values as a red line. In the following chart c1, we are using the same “y” value as before, so for now the data is exactly the same.

c1 = alt.Chart(df).mark_line(color="red").encode(
    x="MPG",
    y="Horsepower"
)

We can display two Altair charts in the same position, one “layered” over the other, using +.

c+c1
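
As an aside, the + operator here is shorthand for Altair’s layer function, so the same layered chart could also be written this way:

# the same layered chart, without the + shorthand
alt.layer(c, c1)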

Now let’s try the same thing, but using the “Pred” column for the y-values.

c1 = alt.Chart(df).mark_line(color="red").encode(
    x="MPG",
    y="Pred"
)

The following shows the line of best fit for this data. (We will describe what is meant by “best” in a later class.)

c+c1
type(reg)
sklearn.linear_model._base.LinearRegression

We can find the slope of the line using reg.coef_. The fact that this slope is negative corresponds to the fact that these variables are negatively correlated.

reg.coef_
array([-3.83888803])

The following is the y-intercept. In scikit-learn, these sorts of fitted parameter values are usually given names with a trailing underscore, so for example intercept_ rather than plain intercept.

reg.intercept_
194.47564319018676

Do you see how these values of -3.84 and 194.5 are reflected in the above line?
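
One way to check (a quick aside): plug the first row’s MPG value of 18 into the line by hand and compare with the first predicted value from above.

# slope * 18 + intercept should match the first entry of reg.predict(df[["MPG"]])
reg.coef_[0]*18 + reg.intercept_   # approximately 125.38, the first "Pred" value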

Polynomial Regression with the cars dataset

Motto: Polynomial regression is no more difficult than linear regression.

We discussed at the whiteboard how finding coefficients in a polynomial can be viewed as a special case of linear regression. (The reverse is also true, but here we are using the fact that we know how to find linear regression coefficients, to find polynomial coefficients.)
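
In symbols: if M denotes “MPG” and H denotes “Horsepower”, we are searching for coefficients \(c_0, c_1, c_2, c_3\) in

\[ H \approx c_0 + c_1 M + c_2 M^2 + c_3 M^3, \]

and once we treat \(M\), \(M^2\), and \(M^3\) as three separate input columns, this is an ordinary linear regression problem in those three variables.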

We just need to add some new columns to our DataFrame. Here we use f-strings and a for-loop to keep our code Pythonic and DRY.

for i in range(1,4):
    df[f"MPG{i}"] = df["MPG"]**i

Notice how the true MPG value in the first row is 18, and so at the end we see the values \(18\), \(18^2\), and \(18^3\). This method would easily adapt to higher degrees. (That is the main benefit of not just typing the three columns out one at a time.)

df.head()
Name MPG Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin Pred MPG1 MPG2 MPG3
0 chevrolet chevelle malibu 18.0 8 307.0 130.0 3504 12.0 1970-01-01 USA 125.375659 18.0 324.0 5832.0
1 buick skylark 320 15.0 8 350.0 165.0 3693 11.5 1970-01-01 USA 136.892323 15.0 225.0 3375.0
2 plymouth satellite 18.0 8 318.0 150.0 3436 11.0 1970-01-01 USA 125.375659 18.0 324.0 5832.0
3 amc rebel sst 16.0 8 304.0 150.0 3433 12.0 1970-01-01 USA 133.053435 16.0 256.0 4096.0
4 ford torino 17.0 8 302.0 140.0 3449 10.5 1970-01-01 USA 129.214547 17.0 289.0 4913.0
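
As noted above, this method adapts easily to higher degrees. For example, a small helper function like the following (hypothetical, not something we wrote in lecture) would produce the power columns for any degree:

def add_power_columns(frame, col, degree):
    # add col**1, ..., col**degree as new columns and return their names
    names = [f"{col}{i}" for i in range(1, degree + 1)]
    for i, name in enumerate(names, start=1):
        frame[name] = frame[col]**i
    return names

# for example, the same degree-3 columns for "MPG" as above:
# cols = add_power_columns(df, "MPG", 3)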
reg = LinearRegression()

Here we make a list of the three columns we’re interested in.

# cols = ["MPG1","MPG2","MPG3"]
# df.columns[-3:]
cols = [f"MPG{i}" for i in range(1,4)]
type(cols)
list

Because cols is already a list, it’s important to use df[cols] instead of df[[cols]].
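
As a quick check (an aside), selecting with this list already produces a two-dimensional DataFrame, which is what fit expects:

type(df[cols])   # pandas DataFrame
df[cols].shape   # two-dimensional: (number of rows, 3)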

reg.fit(df[cols],df["Horsepower"])
LinearRegression()
reg.coef_
array([-3.01312437e+01,  8.66793237e-01, -8.50566252e-03])

Here is a nice way to display which value corresponds to which coefficient.

pd.Series(reg.coef_, index=cols)
MPG1   -30.131244
MPG2     0.866793
MPG3    -0.008506
dtype: float64
reg.intercept_
429.5814606301147

The above values can be interpreted as saying that our model estimates the following formula, where H stands for horsepower and where M stands for miles per gallon.

\[ H \approx -0.0085 \cdot M^3 + 0.86 \cdot M^2 - 30.1 \cdot M + 429.6 \]
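
As a quick check (an aside, not from lecture), we can evaluate this formula by hand at the first row’s MPG value of 18 and compare it with the model’s own prediction for that row:

m = 18
by_hand = reg.intercept_ + reg.coef_[0]*m + reg.coef_[1]*m**2 + reg.coef_[2]*m**3
by_hand, reg.predict(df[cols])[0]   # these two numbers should agree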

Let’s add the values predicted by our model into the DataFrame.

df["Pred3"] = reg.predict(df[cols])

Let’s get a new Altair chart. We can save some typing by copying our linear chart c1.

c3 = c1.copy()
c3 = c3.encode(y="Pred3")

Warning: it often happens in Machine Learning that the better the fit appears to be, the worse it will perform on future data (due to overfitting). So don’t be too impressed by the accuracy of this fit.

c+c3

Warning: Don’t misuse polynomial regression

For some reason, unreasonable cubic models often get shared in the media. The cubic polynomial that “fits best” can be interesting to look at, but don’t expect it to provide accurate predictions in the future. (This is a symptom of overfitting, which is a key concept in Machine Learning that we will return to soon.)

I don’t know of any naturally occurring phenomenon that follows a cubic pattern, so don’t be deceived by how closely a cubic model fits.

Example from FiveThirtyEight

Here a cubic model is being used to model the percentage of Yes confirmation votes as a function of time.

cubic fit 1

Example from Trump’s Covid-19 team

Here a cubic model is used to predict deaths from Covid-19. This particular cubic model suggests deaths would reach zero by mid-May 2020.

cubic fit 2