Week 6 Friday#

Announcements#

  • Videos and video quizzes due.

  • Worksheets 9 and 10 due Tuesday.

  • In-class quiz Tuesday is based on K-means clustering and also has one question on StandardScaler (which William covered on Tuesday).

The goal today is to see some aspects of linear regression using the “mpg” dataset from Seaborn. I assume we won’t get through all the material listed below.

Linear Regression with one input variable#

Find the line of best fit using the mpg dataset from Seaborn to model “mpg” as a function of the one input variable “horsepower”. The input variables are often called features or predictors and the output variable is often called the target.

import pandas as pd
import numpy as np

import altair as alt
import seaborn as sns

We will get errors using scikit-learn if there are missing values (at least without some extra arguments), so here we drop all the rows which have missing values.

df = sns.load_dataset("mpg").dropna(axis=0)

Notice how the following chart shows (matching our intuition) that as horsepower increases, mpg decreases.

base = alt.Chart(df).mark_circle().encode(
    x="horsepower",
    y="mpg"
)

base

Let’s see the same thing using scikit-learn’s LinearRegression class. Linear regression is an example of supervised machine learning (as opposed to unsupervised machine learning, like the clustering we were doing before). The fact that it is supervised machine learning means that we need to have answers for at least some of our data. (In this case, we have answers, i.e., we have the true “mpg” value, for all of the data.)

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
type(reg)
sklearn.linear_model._base.LinearRegression

Here is one of the most common errors when using scikit-learn. It wants the input to be two-dimensional, even if it’s just a single column in a DataFrame. (The reason is that, when there are multiple input columns, the input needs to be two-dimensional, so it’s easier for scikit-learn if the inputs are always two-dimensional.)

reg.fit(df["horsepower"], df["mpg"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [7], line 1
----> 1 reg.fit(df["horsepower"], df["mpg"])

File ~/miniconda3/envs/math10f22/lib/python3.9/site-packages/sklearn/linear_model/_base.py:684, in LinearRegression.fit(self, X, y, sample_weight)
    680 n_jobs_ = self.n_jobs
    682 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 684 X, y = self._validate_data(
    685     X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
    686 )
    688 sample_weight = _check_sample_weight(
    689     sample_weight, X, dtype=X.dtype, only_non_negative=True
    690 )
    692 X, y, X_offset, y_offset, X_scale = _preprocess_data(
    693     X,
    694     y,
   (...)
    698     sample_weight=sample_weight,
    699 )

File ~/miniconda3/envs/math10f22/lib/python3.9/site-packages/sklearn/base.py:596, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    594         y = check_array(y, input_name="y", **check_y_params)
    595     else:
--> 596         X, y = check_X_y(X, y, **check_params)
    597     out = X, y
    599 if not no_val_X and check_params.get("ensure_2d", True):

File ~/miniconda3/envs/math10f22/lib/python3.9/site-packages/sklearn/utils/validation.py:1074, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
   1069         estimator_name = _check_estimator_name(estimator)
   1070     raise ValueError(
   1071         f"{estimator_name} requires y to be passed, but the target y is None"
   1072     )
-> 1074 X = check_array(
   1075     X,
   1076     accept_sparse=accept_sparse,
   1077     accept_large_sparse=accept_large_sparse,
   1078     dtype=dtype,
   1079     order=order,
   1080     copy=copy,
   1081     force_all_finite=force_all_finite,
   1082     ensure_2d=ensure_2d,
   1083     allow_nd=allow_nd,
   1084     ensure_min_samples=ensure_min_samples,
   1085     ensure_min_features=ensure_min_features,
   1086     estimator=estimator,
   1087     input_name="X",
   1088 )
   1090 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
   1092 check_consistent_length(X, y)

File ~/miniconda3/envs/math10f22/lib/python3.9/site-packages/sklearn/utils/validation.py:879, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    877     # If input is 1D raise error
    878     if array.ndim == 1:
--> 879         raise ValueError(
    880             "Expected 2D array, got 1D array instead:\narray={}.\n"
    881             "Reshape your data either using array.reshape(-1, 1) if "
    882             "your data has a single feature or array.reshape(1, -1) "
    883             "if it contains a single sample.".format(array)
    884         )
    886 if dtype_numeric and array.dtype.kind in "USV":
    887     raise ValueError(
    888         "dtype='numeric' is not compatible with arrays of bytes/strings."
    889         "Convert your data to numeric values explicitly instead."
    890     )

ValueError: Expected 2D array, got 1D array instead:
array=[130. 165. 150. 150. 140. 198. 220. 215. 225. 190. 170. 160. 150. 225.
  95.  95.  97.  85.  88.  46.  87.  90.  95. 113.  90. 215. 200. 210.
 193.  88.  90.  95. 100. 105. 100.  88. 100. 165. 175. 153. 150. 180.
 170. 175. 110.  72. 100.  88.  86.  90.  70.  76.  65.  69.  60.  70.
  95.  80.  54.  90.  86. 165. 175. 150. 153. 150. 208. 155. 160. 190.
  97. 150. 130. 140. 150. 112.  76.  87.  69.  86.  92.  97.  80.  88.
 175. 150. 145. 137. 150. 198. 150. 158. 150. 215. 225. 175. 105. 100.
 100.  88.  95.  46. 150. 167. 170. 180. 100.  88.  72.  94.  90.  85.
 107.  90. 145. 230.  49.  75.  91. 112. 150. 110. 122. 180.  95. 100.
 100.  67.  80.  65.  75. 100. 110. 105. 140. 150. 150. 140. 150.  83.
  67.  78.  52.  61.  75.  75.  75.  97.  93.  67.  95. 105.  72.  72.
 170. 145. 150. 148. 110. 105. 110.  95. 110. 110. 129.  75.  83. 100.
  78.  96.  71.  97.  97.  70.  90.  95.  88.  98. 115.  53.  86.  81.
  92.  79.  83. 140. 150. 120. 152. 100. 105.  81.  90.  52.  60.  70.
  53. 100.  78. 110.  95.  71.  70.  75.  72. 102. 150.  88. 108. 120.
 180. 145. 130. 150.  68.  80.  58.  96.  70. 145. 110. 145. 130. 110.
 105. 100.  98. 180. 170. 190. 149.  78.  88.  75.  89.  63.  83.  67.
  78.  97. 110. 110.  48.  66.  52.  70.  60. 110. 140. 139. 105.  95.
  85.  88. 100.  90. 105.  85. 110. 120. 145. 165. 139. 140.  68.  95.
  97.  75.  95. 105.  85.  97. 103. 125. 115. 133.  71.  68. 115.  85.
  88.  90. 110. 130. 129. 138. 135. 155. 142. 125. 150.  71.  65.  80.
  80.  77. 125.  71.  90.  70.  70.  65.  69.  90. 115. 115.  90.  76.
  60.  70.  65.  90.  88.  90.  90.  78.  90.  75.  92.  75.  65. 105.
  65.  48.  48.  67.  67.  67.  67.  62. 132. 100.  88.  72.  84.  84.
  92. 110.  84.  58.  64.  60.  67.  65.  62.  68.  63.  65.  65.  74.
  75.  75. 100.  74.  80.  76. 116. 120. 110. 105.  88.  85.  88.  88.
  88.  85.  84.  90.  92.  74.  68.  68.  63.  70.  88.  75.  70.  67.
  67.  67. 110.  85.  92. 112.  96.  84.  90.  86.  52.  84.  79.  82.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Here is the input we tried to use.

df["horsepower"]
0      130.0
1      165.0
2      150.0
3      150.0
4      140.0
       ...  
393     86.0
394     52.0
395     84.0
396     79.0
397     82.0
Name: horsepower, Length: 392, dtype: float64

Here is the input we are going to use. It looks very similar, but because it is a pandas DataFrame (instead of a pandas Series), scikit-learn knows how to work with it as an input.

df[["horsepower"]]
horsepower
0 130.0
1 165.0
2 150.0
3 150.0
4 140.0
... ...
393 86.0
394 52.0
395 84.0
396 79.0
397 82.0

392 rows × 1 columns

type(df[["horsepower"]])
pandas.core.frame.DataFrame
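As an aside, the error message above suggests a different fix: reshaping the underlying NumPy array so that it is two-dimensional. Using the double square brackets is more convenient for us, but here is a quick sketch of the reshape approach for comparison (X_alt is just a name for this sketch).

# Alternative fix suggested by the error message: reshape the 1D array into a single column.
X_alt = df["horsepower"].values.reshape(-1, 1)
X_alt.shape  # (392, 1), so this input is two-dimensional and scikit-learn accepts it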

Notice how the output (also called the target) df["mpg"] remains one-dimensional.

reg.fit(df[["horsepower"]], df["mpg"])
LinearRegression()

We can now make predictions for the miles per gallon using the “horsepower” column.

df["pred1"] = reg.predict(df[["horsepower"]])

Notice how our DataFrame now includes both the “mpg” column (the true values) as well as the “pred1” column all the way on the right (the predicted values).

df.head()
mpg cylinders displacement horsepower weight acceleration model_year origin name pred1
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu 19.416046
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320 13.891480
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite 16.259151
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst 16.259151
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino 17.837598

Let’s see how these predicted values look.

c1 = alt.Chart(df).mark_line().encode(
    x="horsepower",
    y="pred1"
)

The line on the following chart should be considered the “line of best fit” modeling miles-per-gallon as a linear function of horsepower.

base+c1 # alt.layer(base, c1)

If you look at the line, the following value of the y-intercept should be believable.

reg.intercept_
39.93586102117047

Notice how the coefficient is negative. This corresponds to the line having negative slope, and it matches our intuition that, as “horsepower” increases, “mpg” decreases. (The number is shown as a length-1 NumPy array because we will typically be using multiple input columns, and an array stores all of the coefficients together.)

reg.coef_
array([-0.15784473])
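As a quick sanity check, with one input variable the predictions should equal the intercept plus the coefficient times “horsepower”. Here is a sketch verifying that (manual_pred is just a name for this sketch).

# With one input variable, each prediction is intercept + coefficient * horsepower.
manual_pred = reg.intercept_ + reg.coef_[0] * df["horsepower"]
np.allclose(manual_pred, df["pred1"])  # should be True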

Linear Regression with multiple input variables#

Now model “mpg” as a function of the following input variables/predictors/features:

["horsepower", "weight", "model_year", "cylinders"]

The routine is very similar, just using multiple input columns (four in this case).

reg2 = LinearRegression()
cols = ["horsepower", "weight", "model_year", "cylinders"]

Notice how we write df[cols] here, whereas we wrote df[["horsepower"]] above. This might seem inconsistent, but these are analogues of each other, because cols is a list and ["horsepower"] is also a list (a length-one list).

reg2.fit(df[cols], df["mpg"])
LinearRegression()

Here are the coefficients, stored in a NumPy array. There are four of these numbers because we used four columns.

reg2.coef_
array([-0.00361502, -0.00627463,  0.74663191, -0.1276871 ])

We want to know which coefficient corresponds to which column (otherwise the numbers are not very meaningful). We could look back at the cols list, but we can also get the same information from reg2 using its feature_names_in_ attribute.

One of the best features of linear regression is that the values it produces are interpretable. For example, the -0.0036 above should be interpreted as the partial derivative of “mpg” with respect to “horsepower” for our linear model. Notice that most of these coefficients are negative, but the coefficient for “model_year” is positive. It makes sense that cars tend to have higher mpg values as the model year increases.

reg2.feature_names_in_
array(['horsepower', 'weight', 'model_year', 'cylinders'], dtype=object)
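Here is a quick sketch illustrating the partial-derivative interpretation: if we increase “model_year” by 1 while holding the other three features fixed, the prediction changes by exactly the “model_year” coefficient. (The names row and row_shifted are just for this sketch.)

# Take a single row, then increase "model_year" by 1 while keeping the other features fixed.
row = df[cols].iloc[[0]].copy()
row_shifted = row.copy()
row_shifted["model_year"] += 1
reg2.predict(row_shifted) - reg2.predict(row)  # approximately array([0.74663191])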

Again we can make predictions using this data.

df["pred2"] = reg2.predict(df[cols])
c2 = alt.Chart(df).mark_line().encode(
    x="horsepower",
    y="pred2"
)

This chart looks pretty crazy. We will describe what is happening below the chart.

base+c2

The predicted values do not look very linear, but that is because the line chart comes from data points which have four different values associated to them: “horsepower”, “weight”, “model_year”, “cylinders”. Our x-axis only shows “horsepower”, but the points on the line depend on all four values.

In the following cell, we add a tooltip to the base scatterplot. Put your mouse over the lowest point near horsepower 130 and over the highest point near horsepower 130. Even though these two points have roughly the same horsepower (130 and 132), their weights are very different (3870 and 2910, respectively), which is why our line chart includes a lower miles-per-gallon point (for the higher weight) and a higher miles-per-gallon point.

This is a confusing point. You will get a chance to think about something similar to it on one of next week’s homeworks.

base = alt.Chart(df).mark_circle().encode(
    x="horsepower",
    y="mpg",
    tooltip=cols
)

base+c2
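One way to get a single smooth line out of reg2 (a sketch; the names df_grid, pred2_fixed, and c2_fixed are just for this illustration) is to hold the other three features at their mean values and vary only “horsepower”. This isolates the effect of “horsepower” within the four-feature model.

# Vary "horsepower" over its observed range, holding the other features at their means.
df_grid = pd.DataFrame({"horsepower": np.linspace(df["horsepower"].min(), df["horsepower"].max(), 100)})
for c in ["weight", "model_year", "cylinders"]:
    df_grid[c] = df[c].mean()
df_grid["pred2_fixed"] = reg2.predict(df_grid[cols])

c2_fixed = alt.Chart(df_grid).mark_line().encode(
    x="horsepower",
    y="pred2_fixed"
)

base+c2_fixed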

Linear Regression using rescaled features#

Use a StandardScaler object to rescale these four input features, and then perform the same linear regression.

We are going to change some of the values in the DataFrame (by rescaling them), so it seems safest to first make a copy of the DataFrame.

df2 = df.copy()

I believe William introduced this StandardScaler class on Tuesday.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

The syntax is the same as for KMeans and LinearRegression.

scaler.fit(df[cols])
StandardScaler()

One difference is that we use transform instead of predict. That is because we are not predicting anything.

df2[cols] = scaler.transform(df[cols])
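As an aside, scikit-learn also provides a fit_transform method that combines these two steps. Here is a quick sketch checking that it produces the same rescaled values.

# fit_transform performs the fit and the transform in a single step.
scaled = StandardScaler().fit_transform(df[cols])
np.allclose(scaled, df2[cols])  # should be True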

Notice how the four columns “horsepower”, “weight”, “model_year”, “cylinders” have changed dramatically.

df2
mpg cylinders displacement horsepower weight acceleration model_year origin name pred1 pred2
0 18.0 1.483947 307.0 0.664133 0.620540 12.0 -1.625315 usa chevrolet chevelle malibu 19.416046 15.263206
1 15.0 1.483947 350.0 1.574594 0.843334 11.5 -1.625315 usa buick skylark 320 13.891480 13.950775
2 18.0 1.483947 318.0 1.184397 0.540382 11.0 -1.625315 usa plymouth satellite 16.259151 15.617580
3 16.0 1.483947 304.0 1.184397 0.536845 12.0 -1.625315 usa amc rebel sst 16.259151 15.636404
4 17.0 1.483947 302.0 0.924265 0.555706 10.5 -1.625315 usa ford torino 17.837598 15.572160
... ... ... ... ... ... ... ... ... ... ... ...
393 27.0 -0.864014 140.0 -0.480448 -0.221125 15.6 1.636410 usa ford mustang gl 26.361214 29.372683
394 44.0 -0.864014 97.0 -1.364896 -0.999134 24.6 1.636410 europe vw pickup 31.727935 33.636849
395 32.0 -0.864014 135.0 -0.532474 -0.804632 11.6 1.636410 usa dodge rampage 26.676903 32.485855
396 28.0 -0.864014 120.0 -0.662540 -0.415627 18.6 1.636410 usa ford ranger 27.466127 30.433302
397 31.0 -0.864014 119.0 -0.584501 -0.303641 19.4 1.636410 usa chevy s-10 26.992593 29.826367

392 rows × 11 columns

Those four columns now have mean very close to 0. (For K-means clustering, there is no need to shift the mean, because the algorithm only uses distances between points, and any shift by a constant amount cancels when one point is subtracted from another.)
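Here is a tiny numerical check of that parenthetical claim: adding the same constant to two points does not change the distance between them, so a distance-based method like K-means is unaffected by shifting the mean.

# Distances are unchanged when the same constant is added to both points.
a = np.array([1.0, 4.0])
b = np.array([3.0, 8.0])
np.linalg.norm(a - b) == np.linalg.norm((a + 100) - (b + 100))  # True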

(I didn’t see any warnings on Deepnote, but the following raises a FutureWarning about computing the mean of a DataFrame that contains non-numeric columns.)

df2.mean(axis=0)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_29943/3639053033.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  df2.mean(axis=0)
mpg             2.344592e+01
cylinders      -1.087565e-16
displacement    1.944120e+02
horsepower     -1.812609e-16
weight         -1.812609e-17
acceleration    1.554133e+01
model_year     -1.160070e-15
pred1           2.344592e+01
pred2           2.344592e+01
dtype: float64

Notice how the standard deviations of those four columns are close to 1. (They are not exactly 1 because StandardScaler divides by the population standard deviation, which uses ddof=0, while the pandas std method uses ddof=1 by default; the ratio is sqrt(392/391) ≈ 1.001278. All that’s important is that they are close to 1.)

df2.std(axis=0)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_29943/713155030.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  df2.std(axis=0)
mpg               7.805007
cylinders         1.001278
displacement    104.644004
horsepower        1.001278
weight            1.001278
acceleration      2.758864
model_year        1.001278
pred1             6.075627
pred2             7.017807
dtype: float64
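As the warning message suggests, we can avoid the FutureWarning by selecting only the columns we care about before calling the reduction. A quick sketch: using ddof=0 (the population convention, which is what StandardScaler uses) shows that the rescaled standard deviations are exactly 1.

# Selecting only the rescaled columns avoids the nuisance-column warning.
df2[cols].mean(axis=0)

# Using the population standard deviation (ddof=0), to match StandardScaler, should give 1.0 for each column.
df2[cols].std(axis=0, ddof=0)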

We can now perform the same procedure as above.

reg3 = LinearRegression()
reg3.fit(df2[cols], df2["mpg"])
LinearRegression()
reg3.feature_names_in_
array(['horsepower', 'weight', 'model_year', 'cylinders'], dtype=object)

The relative magnitudes of the coefficients in reg2 were not meaningful (as far as I know), because the scales of the input features were different (in fact, they all had different units). By rescaling the data, the following magnitudes become meaningful. For example, because the scaled “weight” coefficient has the biggest absolute value, it should be interpreted as the most important of these four features with respect to mpg. That was not at all obvious from the numbers we saw above.

reg3.coef_
array([-0.1389689 , -5.32288337,  2.74688485, -0.21752852])

Here is an elegant way to group those coefficients and feature names together. As a first step, we make a pandas Series containing just the numbers.

pd.Series(reg3.coef_)
0   -0.138969
1   -5.322883
2    2.746885
3   -0.217529
dtype: float64

Here is how we can assign names to each of the numbers.

pd.Series(reg3.coef_, index=reg3.feature_names_in_)
horsepower   -0.138969
weight       -5.322883
model_year    2.746885
cylinders    -0.217529
dtype: float64

Here we sort the pandas Series so that the largest values (in absolute value) come first. For this sorting, we only care about the magnitude of the numbers; that is why we pass key=abs to the sort_values Series method.

pd.Series(reg3.coef_, index=reg3.feature_names_in_).sort_values(ascending=False, key=abs)
weight       -5.322883
model_year    2.746885
cylinders    -0.217529
horsepower   -0.138969
dtype: float64

Linear Regression using a categorical variable#

  • Again perform linear regression, this time also including “origin” as a predictor. Use a OneHotEncoder object (a sketch appears at the end of this section).

  • Remove the intercept (also called bias) when we instantiate the LinearRegression object.

(Aside. It’s not obvious to me whether we should rescale this new categorical feature. For now we won’t rescale it. It’s also not obvious to me if we should rescale the output variable. Some quick Google searches suggest there are pros and cons to both.)
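Here is a rough sketch of what this could look like (the names encoder, new_cols, df_encoded, df3, and reg4 are just for this sketch, and the exact column names produced by get_feature_names_out may differ).

from sklearn.preprocessing import OneHotEncoder

# One-hot encode the "origin" column: each category becomes its own 0/1 indicator column.
encoder = OneHotEncoder()
encoder.fit(df[["origin"]])
new_cols = list(encoder.get_feature_names_out())  # e.g. ['origin_europe', 'origin_japan', 'origin_usa']

# Attach the indicator columns to a copy of the DataFrame.
df_encoded = pd.DataFrame(encoder.transform(df[["origin"]]).toarray(), columns=new_cols, index=df.index)
df3 = pd.concat([df, df_encoded], axis=1)

# Remove the intercept (bias); the indicator columns already sum to 1 for every row.
reg4 = LinearRegression(fit_intercept=False)
reg4.fit(df3[cols + new_cols], df3["mpg"])
pd.Series(reg4.coef_, index=reg4.feature_names_in_)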