Week 6 Friday#

Announcements#

  • Videos and video quizzes due.

  • Worksheets 9 and 10 due Tuesday.

  • In-class quiz Tuesday is based on K-means clustering and also has one question on StandardScaler (which William covered on Tuesday).

The goal today is to see some aspects of linear regression using the “mpg” dataset from Seaborn. I assume we won’t get through all the material listed below.

Linear Regression with one input variable#

Find the line of best fit using the mpg dataset from Seaborn to model “mpg” as a function of the one input variable “horsepower”. The input variables are often called features or predictors and the output variable is often called the target.

import pandas as pd
import numpy as np

import altair as alt
import seaborn as sns

We will get errors using scikit-learn if there are missing values (at least without some extra arguments), so here we drop all the rows which have missing values.

df = sns.load_dataset("mpg").dropna(axis=0)

Notice how the following chart shows (matching our intuition) that as horsepower increases, mpg decreases.

base = alt.Chart(df).mark_circle().encode(
    x="horsepower",
    y="mpg"
)

base

Let’s see the same thing using scikit-learn’s LinearRegression class. Linear regression is an example of supervised machine learning (as opposed to unsupervised machine learning, like the clustering we were doing before). The fact that it is supervised machine learning means that we need to have answers for at least some of our data. (In this case, we have answers, i.e., we have the true “mpg” value, for all of the data.)

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
type(reg)
sklearn.linear_model._base.LinearRegression

Here is one of the most common errors when using scikit-learn. It wants the input to be two-dimensional, even if it’s just a single column in a DataFrame. (The reason is that, when there are multiple input columns, the input needs to be two-dimensional, so it’s easier for scikit-learn if the inputs are always two-dimensional.)

reg.fit(df["horsepower"], df["mpg"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [7], line 1
----> 1 reg.fit(df["horsepower"], df["mpg"])

File ~/miniconda3/envs/math10f22/lib/python3.9/site-packages/sklearn/linear_model/_base.py:684, in LinearRegression.fit(self, X, y, sample_weight)
    680 n_jobs_ = self.n_jobs
    682 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 684 X, y = self._validate_data(
    685     X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
    686 )
    688 sample_weight = _check_sample_weight(
    689     sample_weight, X, dtype=X.dtype, only_non_negative=True
    690 )
    692 X, y, X_offset, y_offset, X_scale = _preprocess_data(
    693     X,
    694     y,
   (...)
    698     sample_weight=sample_weight,
    699 )

File ~/miniconda3/envs/math10f22/lib/python3.9/site-packages/sklearn/base.py:596, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    594         y = check_array(y, input_name="y", **check_y_params)
    595     else:
--> 596         X, y = check_X_y(X, y, **check_params)
    597     out = X, y
    599 if not no_val_X and check_params.get("ensure_2d", True):

File ~/miniconda3/envs/math10f22/lib/python3.9/site-packages/sklearn/utils/validation.py:1074, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
   1069         estimator_name = _check_estimator_name(estimator)
   1070     raise ValueError(
   1071         f"{estimator_name} requires y to be passed, but the target y is None"
   1072     )
-> 1074 X = check_array(
   1075     X,
   1076     accept_sparse=accept_sparse,
   1077     accept_large_sparse=accept_large_sparse,
   1078     dtype=dtype,
   1079     order=order,
   1080     copy=copy,
   1081     force_all_finite=force_all_finite,
   1082     ensure_2d=ensure_2d,
   1083     allow_nd=allow_nd,
   1084     ensure_min_samples=ensure_min_samples,
   1085     ensure_min_features=ensure_min_features,
   1086     estimator=estimator,
   1087     input_name="X",
   1088 )
   1090 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
   1092 check_consistent_length(X, y)

File ~/miniconda3/envs/math10f22/lib/python3.9/site-packages/sklearn/utils/validation.py:879, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    877     # If input is 1D raise error
    878     if array.ndim == 1:
--> 879         raise ValueError(
    880             "Expected 2D array, got 1D array instead:\narray={}.\n"
    881             "Reshape your data either using array.reshape(-1, 1) if "
    882             "your data has a single feature or array.reshape(1, -1) "
    883             "if it contains a single sample.".format(array)
    884         )
    886 if dtype_numeric and array.dtype.kind in "USV":
    887     raise ValueError(
    888         "dtype='numeric' is not compatible with arrays of bytes/strings."
    889         "Convert your data to numeric values explicitly instead."
    890     )

ValueError: Expected 2D array, got 1D array instead:
array=[130. 165. 150. 150. 140. 198. 220. 215. 225. 190. 170. 160. 150. 225.
  95.  95.  97.  85.  88.  46.  87.  90.  95. 113.  90. 215. 200. 210.
 193.  88.  90.  95. 100. 105. 100.  88. 100. 165. 175. 153. 150. 180.
 170. 175. 110.  72. 100.  88.  86.  90.  70.  76.  65.  69.  60.  70.
  95.  80.  54.  90.  86. 165. 175. 150. 153. 150. 208. 155. 160. 190.
  97. 150. 130. 140. 150. 112.  76.  87.  69.  86.  92.  97.  80.  88.
 175. 150. 145. 137. 150. 198. 150. 158. 150. 215. 225. 175. 105. 100.
 100.  88.  95.  46. 150. 167. 170. 180. 100.  88.  72.  94.  90.  85.
 107.  90. 145. 230.  49.  75.  91. 112. 150. 110. 122. 180.  95. 100.
 100.  67.  80.  65.  75. 100. 110. 105. 140. 150. 150. 140. 150.  83.
  67.  78.  52.  61.  75.  75.  75.  97.  93.  67.  95. 105.  72.  72.
 170. 145. 150. 148. 110. 105. 110.  95. 110. 110. 129.  75.  83. 100.
  78.  96.  71.  97.  97.  70.  90.  95.  88.  98. 115.  53.  86.  81.
  92.  79.  83. 140. 150. 120. 152. 100. 105.  81.  90.  52.  60.  70.
  53. 100.  78. 110.  95.  71.  70.  75.  72. 102. 150.  88. 108. 120.
 180. 145. 130. 150.  68.  80.  58.  96.  70. 145. 110. 145. 130. 110.
 105. 100.  98. 180. 170. 190. 149.  78.  88.  75.  89.  63.  83.  67.
  78.  97. 110. 110.  48.  66.  52.  70.  60. 110. 140. 139. 105.  95.
  85.  88. 100.  90. 105.  85. 110. 120. 145. 165. 139. 140.  68.  95.
  97.  75.  95. 105.  85.  97. 103. 125. 115. 133.  71.  68. 115.  85.
  88.  90. 110. 130. 129. 138. 135. 155. 142. 125. 150.  71.  65.  80.
  80.  77. 125.  71.  90.  70.  70.  65.  69.  90. 115. 115.  90.  76.
  60.  70.  65.  90.  88.  90.  90.  78.  90.  75.  92.  75.  65. 105.
  65.  48.  48.  67.  67.  67.  67.  62. 132. 100.  88.  72.  84.  84.
  92. 110.  84.  58.  64.  60.  67.  65.  62.  68.  63.  65.  65.  74.
  75.  75. 100.  74.  80.  76. 116. 120. 110. 105.  88.  85.  88.  88.
  88.  85.  84.  90.  92.  74.  68.  68.  63.  70.  88.  75.  70.  67.
  67.  67. 110.  85.  92. 112.  96.  84.  90.  86.  52.  84.  79.  82.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Here is the input we tried to use.

df["horsepower"]
0      130.0
1      165.0
2      150.0
3      150.0
4      140.0
       ...  
393     86.0
394     52.0
395     84.0
396     79.0
397     82.0
Name: horsepower, Length: 392, dtype: float64

Here is the input we are going to use. It looks very similar, but because it is a pandas DataFrame (instead of a pandas Series), scikit-learn knows how to work with it as an input.

df[["horsepower"]]
horsepower
0 130.0
1 165.0
2 150.0
3 150.0
4 140.0
... ...
393 86.0
394 52.0
395 84.0
396 79.0
397 82.0

392 rows × 1 columns

type(df[["horsepower"]])
pandas.core.frame.DataFrame
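As an aside, the error message above suggests a different fix: reshaping the underlying NumPy array so that it is two-dimensional. Using the double square brackets is more convenient for us, but here is a quick sketch of the reshape approach for comparison (X_alt is just a name for this sketch).

# Alternative fix suggested by the error message: reshape the 1D array into a single column.
X_alt = df["horsepower"].values.reshape(-1, 1)
X_alt.shape  # (392, 1), so this input is two-dimensional and scikit-learn accepts it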

Notice how the output (also called the target) df["mpg"] remains one-dimensional.

reg.fit(df[["horsepower"]], df["mpg"])
LinearRegression()

We can now make predictions for the miles per gallon using the “horsepower” column.

df["pred1"] = reg.predict(df[["horsepower"]])

Notice how our DataFrame now includes both the “mpg” column (the true values) as well as the “pred1” column all the way on the right (the predicted values).

df.head()
mpg cylinders displacement horsepower weight acceleration model_year origin name pred1
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu 19.416046
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320 13.891480
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite 16.259151
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst 16.259151
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino 17.837598

Let’s see how these predicted values look.

c1 = alt.Chart(df).mark_line().encode(
    x="horsepower",
    y="pred1"
)

The line on the following chart should be considered the “line of best fit” modeling miles-per-gallon as a linear function of horsepower.

base+c1 # alt.layer(base, c1)

If you look at the line, the following value of the y-intercept should be believable.

reg.intercept_
39.93586102117047

Notice how the coefficient is negative. This corresponds to the line having negative slope, and it matches our intuition that, as “horsepower” increases, “mpg” decreases. (The number is shown as a length-1 NumPy array because we will typically be using multiple input columns, and an array stores all of the coefficients together.)

reg.coef_
array([-0.15784473])
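As a quick sanity check, with one input variable the predictions should equal the intercept plus the coefficient times “horsepower”. Here is a sketch verifying that (manual_pred is just a name for this sketch).

# With one input variable, each prediction is intercept + coefficient * horsepower.
manual_pred = reg.intercept_ + reg.coef_[0] * df["horsepower"]
np.allclose(manual_pred, df["pred1"])  # should be True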

Linear Regression with multiple input variables#

Now model “mpg” as a function of the following input variables/predictors/features:

["horsepower", "weight", "model_year", "cylinders"]

The routine is very similar, just using multiple input columns (four in this case).

reg2 = LinearRegression()
cols = ["horsepower", "weight", "model_year", "cylinders"]

Notice how we write df[cols] here, whereas we wrote df[["horsepower"]] above. This might seem inconsistent, but these are analogues of each other, because cols is a list and ["horsepower"] is also a list (a length-one list).

reg2.fit(df[cols], df["mpg"])
LinearRegression()

Here are the coefficients, stored in a NumPy array. There are four of these numbers because we used four columns.

reg2.coef_
array([-0.00361502, -0.00627463,  0.74663191, -0.1276871 ])

We want to know which coefficient corresponds to which column (otherwise the numbers are not very meaningful). We could look back at the cols list, but we can also get the same information from reg2 using its feature_names_in_ attribute.

One of the best features of linear regression is that the values it produces are interpretable. For example, the -0.0036 above should be interpreted as the partial derivative of “mpg” with respect to “horsepower” for our linear model. Notice that most of these coefficients are negative, but the coefficient for “model_year” is positive. It makes sense that cars tend to have higher mpg values as the model year increases.

reg2.feature_names_in_
array(['horsepower', 'weight', 'model_year', 'cylinders'], dtype=object)
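Here is a quick sketch illustrating the partial-derivative interpretation: if we increase “model_year” by 1 while holding the other three features fixed, the prediction changes by exactly the “model_year” coefficient. (The names row and row_shifted are just for this sketch.)

# Take a single row, then increase "model_year" by 1 while keeping the other features fixed.
row = df[cols].iloc[[0]].copy()
row_shifted = row.copy()
row_shifted["model_year"] += 1
reg2.predict(row_shifted) - reg2.predict(row)  # approximately array([0.74663191])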

Again we can make predictions using this data.

df["pred2"] = reg2.predict(df[cols])
c2 = alt.Chart(df).mark_line().encode(
    x="horsepower",
    y="pred2"
)

This chart looks pretty crazy. We will describe what is happening below the chart.

base+c2

The predicted values do not look very linear, but that is because the line chart comes from data points which have four different values associated to them: “horsepower”, “weight”, “model_year”, “cylinders”. Our x-axis only shows “horsepower”, but the points on the line depend on all four values.

In the following cell, we add a tooltip to the base scatterplot. Put your mouse over the lowest point near horsepower 130 and over the highest point near horsepower 130. Even though these two points have roughly the same horsepower (130 and 132), their weights are very different (3870 and 2910, respectively), which is why our line chart includes a lower miles-per-gallon point (for the higher weight) and a higher miles-per-gallon point.

This is a confusing point. You will get a chance to think about something similar to it on one of next week’s homeworks.

base = alt.Chart(df).mark_circle().encode(
    x="horsepower",
    y="mpg",
    tooltip=cols
)

base+c2
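One way to get a single smooth line out of reg2 (a sketch; the names df_grid, pred2_fixed, and c2_fixed are just for this illustration) is to hold the other three features at their mean values and vary only “horsepower”. This isolates the effect of “horsepower” within the four-feature model.

# Vary "horsepower" over its observed range, holding the other features at their means.
df_grid = pd.DataFrame({"horsepower": np.linspace(df["horsepower"].min(), df["horsepower"].max(), 100)})
for c in ["weight", "model_year", "cylinders"]:
    df_grid[c] = df[c].mean()
df_grid["pred2_fixed"] = reg2.predict(df_grid[cols])

c2_fixed = alt.Chart(df_grid).mark_line().encode(
    x="horsepower",
    y="pred2_fixed"
)

base+c2_fixed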

Linear Regression using rescaled features#

Use a StandardScaler object to rescale these four input features, and then perform the same linear regression.

We are going to change some of the values in the DataFrame (by rescaling them), so it seems safest to first make a copy of the DataFrame.

df2 = df.copy()

I believe William introduced this StandardScaler class on Tuesday.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

The syntax is the same as for KMeans and LinearRegression.

scaler.fit(df[cols])
StandardScaler()

One difference is that we use transform instead of predict. That is because we are not predicting anything.

df2[cols] = scaler.transform(df[cols])
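As an aside, scikit-learn also provides a fit_transform method that combines these two steps. Here is a quick sketch checking that it produces the same rescaled values.

# fit_transform performs the fit and the transform in a single step.
scaled = StandardScaler().fit_transform(df[cols])
np.allclose(scaled, df2[cols])  # should be True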

Notice how the four columns “horsepower”, “weight”, “model_year”, “cylinders” have changed dramatically.

df2
mpg cylinders displacement horsepower weight acceleration model_year origin name pred1 pred2
0 18.0 1.483947 307.0 0.664133 0.620540 12.0 -1.625315 usa chevrolet chevelle malibu 19.416046 15.263206
1 15.0 1.483947 350.0 1.574594 0.843334 11.5 -1.625315 usa buick skylark 320 13.891480 13.950775
2 18.0 1.483947 318.0 1.184397 0.540382 11.0 -1.625315 usa plymouth satellite 16.259151 15.617580
3 16.0 1.483947 304.0 1.184397 0.536845 12.0 -1.625315 usa amc rebel sst 16.259151 15.636404
4 17.0 1.483947 302.0 0.924265 0.555706 10.5 -1.625315 usa ford torino 17.837598 15.572160
... ... ... ... ... ... ... ... ... ... ... ...
393 27.0 -0.864014 140.0 -0.480448 -0.221125 15.6 1.636410 usa ford mustang gl 26.361214 29.372683
394 44.0 -0.864014 97.0 -1.364896 -0.999134 24.6 1.636410 europe vw pickup 31.727935 33.636849
395 32.0 -0.864014 135.0 -0.532474 -0.804632 11.6 1.636410 usa dodge rampage 26.676903 32.485855
396 28.0 -0.864014 120.0 -0.662540 -0.415627 18.6 1.636410 usa ford ranger 27.466127 30.433302
397 31.0 -0.864014 119.0 -0.584501 -0.303641 19.4 1.636410 usa chevy s-10 26.992593 29.826367

392 rows × 11 columns

Those four columns now have mean very close to 0. (For K-means clustering, there is no need to shift the mean, because the algorithm only uses distances between points, and any shift by a constant amount cancels when one point is subtracted from another.)
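Here is a tiny numerical check of that parenthetical claim: adding the same constant to two points does not change the distance between them, so a distance-based method like K-means is unaffected by shifting the mean.

# Distances are unchanged when the same constant is added to both points.
a = np.array([1.0, 4.0])
b = np.array([3.0, 8.0])
np.linalg.norm(a - b) == np.linalg.norm((a + 100) - (b + 100))  # True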

(I didn’t see any warnings on Deepnote, but the following raises a FutureWarning about computing the mean of a DataFrame that contains non-numeric columns.)

df2.mean(axis=0)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_29943/3639053033.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  df2.mean(axis=0)
mpg             2.344592e+01
cylinders      -1.087565e-16
displacement    1.944120e+02
horsepower     -1.812609e-16
weight         -1.812609e-17
acceleration    1.554133e+01
model_year     -1.160070e-15
pred1           2.344592e+01
pred2           2.344592e+01
dtype: float64

Notice how the standard deviations of those four columns are close to 1. (They are not exactly 1 because StandardScaler divides by the population standard deviation, which uses ddof=0, while the pandas std method uses ddof=1 by default; the ratio is sqrt(392/391) ≈ 1.001278. All that’s important is that they are close to 1.)

df2.std(axis=0)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_29943/713155030.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  df2.std(axis=0)
mpg               7.805007
cylinders         1.001278
displacement    104.644004
horsepower        1.001278
weight            1.001278
acceleration      2.758864
model_year        1.001278
pred1             6.075627
pred2             7.017807
dtype: float64
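As the warning message suggests, we can avoid the FutureWarning by selecting only the columns we care about before calling the reduction. A quick sketch: using ddof=0 (the population convention, which is what StandardScaler uses) shows that the rescaled standard deviations are exactly 1.

# Selecting only the rescaled columns avoids the nuisance-column warning.
df2[cols].mean(axis=0)

# Using the population standard deviation (ddof=0), to match StandardScaler, should give 1.0 for each column.
df2[cols].std(axis=0, ddof=0)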

We can now perform the same procedure as above.

reg3 = LinearRegression()
reg3.fit(df2[cols], df2["mpg"])
LinearRegression()
reg3.feature_names_in_
array(['horsepower', 'weight', 'model_year', 'cylinders'], dtype=object)

The relative magnitudes of the coefficients in reg2 were not meaningful (as far as I know), because the scales of the input features were different (in fact, they all had different units). By rescaling the data, the following magnitudes become meaningful. For example, because the scaled “weight” coefficient has the biggest absolute value, it should be interpreted as the most important of these four features with respect to mpg. That was not at all obvious from the numbers we saw above.

reg3.coef_
array([-0.1389689 , -5.32288337,  2.74688485, -0.21752852])

Here is an elegant way to group those coefficients and feature names together. As a first step, we make a pandas Series containing just the numbers.

pd.Series(reg3.coef_)
0   -0.138969
1   -5.322883
2    2.746885
3   -0.217529
dtype: float64

Here is how we can assign names to each of the numbers.

pd.Series(reg3.coef_, index=reg3.feature_names_in_)
horsepower   -0.138969
weight       -5.322883
model_year    2.746885
cylinders    -0.217529
dtype: float64

Here we sort the pandas Series so that the largest values (in absolute value) come first. For this sorting, we only care about the magnitude of the numbers; that is why we pass key=abs to the sort_values Series method.

pd.Series(reg3.coef_, index=reg3.feature_names_in_).sort_values(ascending=False, key=abs)
weight       -5.322883
model_year    2.746885
cylinders    -0.217529
horsepower   -0.138969
dtype: float64

Linear Regression using a categorical variable#

  • Again perform linear regression, this time also including “origin” as a predictor. Use a OneHotEncoder object (a sketch appears at the end of this section).

  • Remove the intercept (also called bias) when we instantiate the LinearRegression object.

(Aside. It’s not obvious to me whether we should rescale this new categorical feature. For now we won’t rescale it. It’s also not obvious to me if we should rescale the output variable. Some quick Google searches suggest there are pros and cons to both.)
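Here is a rough sketch of what this could look like (the names encoder, new_cols, df_encoded, df3, and reg4 are just for this sketch, and the exact column names produced by get_feature_names_out may differ).

from sklearn.preprocessing import OneHotEncoder

# One-hot encode the "origin" column: each category becomes its own 0/1 indicator column.
encoder = OneHotEncoder()
encoder.fit(df[["origin"]])
new_cols = list(encoder.get_feature_names_out())  # e.g. ['origin_europe', 'origin_japan', 'origin_usa']

# Attach the indicator columns to a copy of the DataFrame.
df_encoded = pd.DataFrame(encoder.transform(df[["origin"]]).toarray(), columns=new_cols, index=df.index)
df3 = pd.concat([df, df_encoded], axis=1)

# Remove the intercept (bias); the indicator columns already sum to 1 for every row.
reg4 = LinearRegression(fit_intercept=False)
reg4.fit(df3[cols + new_cols], df3["mpg"])
pd.Series(reg4.coef_, index=reg4.feature_names_in_)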