Week 6 Friday#
Announcements#
Videos and video quizzes due.
Worksheets 9 and 10 due Tuesday.
In-class quiz Tuesday is based on K-means clustering and also has one question on StandardScaler (which William covered on Tuesday).
The goal today is to see some aspects of linear regression using the “mpg” dataset from Seaborn. I assume we won’t get through all the material listed below.
Linear Regression with one input variable#
Find the line of best fit using the mpg dataset from Seaborn to model “mpg” as a function of the one input variable “horsepower”. The input variables are often called features or predictors and the output variable is often called the target.
import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns
We will get errors using scikit-learn if there are missing values (at least without some extra arguments), so here we drop all the rows which have missing values.
df = sns.load_dataset("mpg").dropna(axis=0)
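As a quick aside (this check is not needed for anything below), you can count the missing values in each column before dropping them. In this dataset, only the “horsepower” column has missing values (six rows), which is why the 398 original rows become 392.
# Count the missing values in each column of the original dataset.
sns.load_dataset("mpg").isna().sum()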
Notice how the following chart shows (matching our intuition) that as horsepower increases, mpg decreases.
base = alt.Chart(df).mark_circle().encode(
x="horsepower",
y="mpg"
)
base
Let’s see the same thing using scikit-learn’s LinearRegression class. Linear regression is an example of supervised machine learning (as opposed to unsupervised machine learning, like the clustering we were doing before). The fact that it is supervised machine learning means that we need to have answers for at least some of our data. (In this case we have answers, i.e., we have the true “mpg” value, for all of the data.)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
type(reg)
sklearn.linear_model._base.LinearRegression
Here is one of the most common errors when using scikit-learn. It wants the input to be two-dimensional, even if it’s just a single column in a DataFrame. (The reason is that, when there are multiple input columns, the input needs to be two-dimensional, so it’s easier for scikit-learn if the inputs are always two-dimensional.)
reg.fit(df["horsepower"], df["mpg"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [7], line 1
----> 1 reg.fit(df["horsepower"], df["mpg"])
File ~/miniconda3/envs/math10f22/lib/python3.9/site-packages/sklearn/linear_model/_base.py:684, in LinearRegression.fit(self, X, y, sample_weight)
680 n_jobs_ = self.n_jobs
682 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 684 X, y = self._validate_data(
685 X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
686 )
688 sample_weight = _check_sample_weight(
689 sample_weight, X, dtype=X.dtype, only_non_negative=True
690 )
692 X, y, X_offset, y_offset, X_scale = _preprocess_data(
693 X,
694 y,
(...)
698 sample_weight=sample_weight,
699 )
File ~/miniconda3/envs/math10f22/lib/python3.9/site-packages/sklearn/base.py:596, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
594 y = check_array(y, input_name="y", **check_y_params)
595 else:
--> 596 X, y = check_X_y(X, y, **check_params)
597 out = X, y
599 if not no_val_X and check_params.get("ensure_2d", True):
File ~/miniconda3/envs/math10f22/lib/python3.9/site-packages/sklearn/utils/validation.py:1074, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
1069 estimator_name = _check_estimator_name(estimator)
1070 raise ValueError(
1071 f"{estimator_name} requires y to be passed, but the target y is None"
1072 )
-> 1074 X = check_array(
1075 X,
1076 accept_sparse=accept_sparse,
1077 accept_large_sparse=accept_large_sparse,
1078 dtype=dtype,
1079 order=order,
1080 copy=copy,
1081 force_all_finite=force_all_finite,
1082 ensure_2d=ensure_2d,
1083 allow_nd=allow_nd,
1084 ensure_min_samples=ensure_min_samples,
1085 ensure_min_features=ensure_min_features,
1086 estimator=estimator,
1087 input_name="X",
1088 )
1090 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
1092 check_consistent_length(X, y)
File ~/miniconda3/envs/math10f22/lib/python3.9/site-packages/sklearn/utils/validation.py:879, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
877 # If input is 1D raise error
878 if array.ndim == 1:
--> 879 raise ValueError(
880 "Expected 2D array, got 1D array instead:\narray={}.\n"
881 "Reshape your data either using array.reshape(-1, 1) if "
882 "your data has a single feature or array.reshape(1, -1) "
883 "if it contains a single sample.".format(array)
884 )
886 if dtype_numeric and array.dtype.kind in "USV":
887 raise ValueError(
888 "dtype='numeric' is not compatible with arrays of bytes/strings."
889 "Convert your data to numeric values explicitly instead."
890 )
ValueError: Expected 2D array, got 1D array instead:
array=[130. 165. 150. 150. 140. 198. 220. 215. 225. 190. 170. 160. 150. 225.
95. 95. 97. 85. 88. 46. 87. 90. 95. 113. 90. 215. 200. 210.
193. 88. 90. 95. 100. 105. 100. 88. 100. 165. 175. 153. 150. 180.
170. 175. 110. 72. 100. 88. 86. 90. 70. 76. 65. 69. 60. 70.
95. 80. 54. 90. 86. 165. 175. 150. 153. 150. 208. 155. 160. 190.
97. 150. 130. 140. 150. 112. 76. 87. 69. 86. 92. 97. 80. 88.
175. 150. 145. 137. 150. 198. 150. 158. 150. 215. 225. 175. 105. 100.
100. 88. 95. 46. 150. 167. 170. 180. 100. 88. 72. 94. 90. 85.
107. 90. 145. 230. 49. 75. 91. 112. 150. 110. 122. 180. 95. 100.
100. 67. 80. 65. 75. 100. 110. 105. 140. 150. 150. 140. 150. 83.
67. 78. 52. 61. 75. 75. 75. 97. 93. 67. 95. 105. 72. 72.
170. 145. 150. 148. 110. 105. 110. 95. 110. 110. 129. 75. 83. 100.
78. 96. 71. 97. 97. 70. 90. 95. 88. 98. 115. 53. 86. 81.
92. 79. 83. 140. 150. 120. 152. 100. 105. 81. 90. 52. 60. 70.
53. 100. 78. 110. 95. 71. 70. 75. 72. 102. 150. 88. 108. 120.
180. 145. 130. 150. 68. 80. 58. 96. 70. 145. 110. 145. 130. 110.
105. 100. 98. 180. 170. 190. 149. 78. 88. 75. 89. 63. 83. 67.
78. 97. 110. 110. 48. 66. 52. 70. 60. 110. 140. 139. 105. 95.
85. 88. 100. 90. 105. 85. 110. 120. 145. 165. 139. 140. 68. 95.
97. 75. 95. 105. 85. 97. 103. 125. 115. 133. 71. 68. 115. 85.
88. 90. 110. 130. 129. 138. 135. 155. 142. 125. 150. 71. 65. 80.
80. 77. 125. 71. 90. 70. 70. 65. 69. 90. 115. 115. 90. 76.
60. 70. 65. 90. 88. 90. 90. 78. 90. 75. 92. 75. 65. 105.
65. 48. 48. 67. 67. 67. 67. 62. 132. 100. 88. 72. 84. 84.
92. 110. 84. 58. 64. 60. 67. 65. 62. 68. 63. 65. 65. 74.
75. 75. 100. 74. 80. 76. 116. 120. 110. 105. 88. 85. 88. 88.
88. 85. 84. 90. 92. 74. 68. 68. 63. 70. 88. 75. 70. 67.
67. 67. 110. 85. 92. 112. 96. 84. 90. 86. 52. 84. 79. 82.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Here is the input we tried to use.
df["horsepower"]
0 130.0
1 165.0
2 150.0
3 150.0
4 140.0
...
393 86.0
394 52.0
395 84.0
396 79.0
397 82.0
Name: horsepower, Length: 392, dtype: float64
Here is the input we are going to use. It looks very similar, but because it is a pandas DataFrame (instead of a pandas Series), scikit-learn knows how to work with it as an input.
df[["horsepower"]]
|   | horsepower |
|---|---|
| 0 | 130.0 |
| 1 | 165.0 |
| 2 | 150.0 |
| 3 | 150.0 |
| 4 | 140.0 |
| ... | ... |
| 393 | 86.0 |
| 394 | 52.0 |
| 395 | 84.0 |
| 396 | 79.0 |
| 397 | 82.0 |
392 rows × 1 columns
type(df[["horsepower"]])
pandas.core.frame.DataFrame
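As an aside (we did not need this in class), there are other ways to produce a two-dimensional input from a single column; for our purposes, the following two expressions should be interchangeable with df[["horsepower"]].
# Two other ways to get a two-dimensional input from one column.
df["horsepower"].to_frame()                 # a one-column pandas DataFrame
df["horsepower"].to_numpy().reshape(-1, 1)  # a (392, 1) NumPy array, as suggested by the error message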
Notice how the output (also called the target) df["mpg"] remains one-dimensional.
reg.fit(df[["horsepower"]], df["mpg"])
LinearRegression()
We can now make predictions for the miles per gallon using the “horsepower” column.
df["pred1"] = reg.predict(df[["horsepower"]])
Notice how our DataFrame now includes both the “mpg” column (the true values) as well as the “pred1” column all the way on the right (the predicted values).
df.head()
|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name | pred1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu | 19.416046 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 | 13.891480 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite | 16.259151 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst | 16.259151 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino | 17.837598 |
Let’s see how these predicted values look.
c1 = alt.Chart(df).mark_line().encode(
x="horsepower",
y="pred1"
)
The line on the following chart should be considered the “line of best fit” modeling miles-per-gallon as a linear function of horsepower.
base+c1 # alt.layer(base, c1)
If you look at the line, the following value of the y-intercept should be believable.
reg.intercept_
39.93586102117047
Notice how the coefficient is negative. This corresponds to the line having negative slope, and it matches our intuition that, as “horsepower” increases, “mpg” decreases. (The number is shown as a length-1 NumPy array because, typically, we will be using multiple input columns, and an array is used to store all of the coefficients together.)
reg.coef_
array([-0.15784473])
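As a quick sanity check (this computation was not part of the lecture), the first predicted value should equal the intercept plus the coefficient times the first car’s horsepower.
# Reproduce the first predicted value by hand.
# This should match df["pred1"].iloc[0], approximately 19.416.
reg.intercept_ + reg.coef_[0] * df["horsepower"].iloc[0]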
Linear Regression with multiple input variables#
Now model “mpg” as a function of the following input variables/predictors/features:
["horsepower", "weight", "model_year", "cylinders"]
The routine is very similar, just using multiple input columns (four in this case).
reg2 = LinearRegression()
cols = ["horsepower", "weight", "model_year", "cylinders"]
Notice how we write df[cols] here, whereas we wrote df[["horsepower"]] above. This might seem contradictory, but these are analogues of each other, because cols is a list and ["horsepower"] is also a list (a length-one list).
reg2.fit(df[cols], df["mpg"])
LinearRegression()
Here are the coefficients, stored in a NumPy array. There are four of these numbers because we used four columns.
reg2.coef_
array([-0.00361502, -0.00627463, 0.74663191, -0.1276871 ])
We want to know which coefficient corresponds to which column (otherwise the numbers are not very meaningful). We could look back at the cols list, but we can also get the same information from reg2 using its feature_names_in_ attribute.
One of the best features of linear regression is that it is relatively easy to interpret the values it produces. For example, the -0.0036 above should be interpreted as the partial derivative of “mpg” with respect to “horsepower” for our linear model. Notice that most of these coefficients are negative, but the coefficient for “model_year” is positive. It makes sense that cars tend to have higher mpg values as the model year increases.
reg2.feature_names_in_
array(['horsepower', 'weight', 'model_year', 'cylinders'], dtype=object)
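To make the partial-derivative interpretation concrete, here is a small illustration (not something we did in class; the choice of row is arbitrary). Increasing “horsepower” by 1 while holding the other three features fixed changes the prediction by the “horsepower” coefficient.
# Bump "horsepower" by 1 for a single row, keeping the other features fixed.
# The prediction changes by approximately -0.0036, the "horsepower" coefficient.
row = df[cols].iloc[[0]]
bumped = row.copy()
bumped["horsepower"] += 1
reg2.predict(bumped)[0] - reg2.predict(row)[0]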
Again we can make predictions using this data.
df["pred2"] = reg2.predict(df[cols])
c2 = alt.Chart(df).mark_line().encode(
x="horsepower",
y="pred2"
)
This chart looks pretty crazy. We will explain what is happening below.
base+c2
The predicted values do not look very linear, but that is because the line chart comes from data points which have four different values associated with them: “horsepower”, “weight”, “model_year”, “cylinders”. Our x-axis only shows “horsepower”, but the points on the line depend on all four values.
In the following cell, we add a tooltip to the base scatterplot. Put your mouse over the low point near horsepower 130 and over the highest point near horsepower 130. Even though these two points have roughly the same horsepower (130 and 132), the weights are very different (3870 and 2910, respectively), so our line chart includes a lower miles-per-gallon point (for the higher weight) and a higher miles-per-gallon point (for the lower weight).
This is a confusing point. You will get a chance to think about something similar to it on one of next week’s homeworks.
base = alt.Chart(df).mark_circle().encode(
x="horsepower",
y="mpg",
tooltip=cols
)
base+c2
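To see the same point in code rather than with the tooltip (an extra check, not something we did in class), we can list the cars whose horsepower is near 130 and compare their predicted values.
# Cars with nearly the same horsepower can get very different predictions,
# because their other three features differ.
df[df["horsepower"].between(128, 132)][cols + ["pred2"]]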
Linear Regression using rescaled features#
Use a StandardScaler object to rescale these four input features, and then perform the same linear regression.
We are going to change some of the values in the DataFrame (by rescaling them), so it seems safest to first make a copy of the DataFrame.
df2 = df.copy()
I believe William introduced this StandardScaler class on Tuesday.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
The syntax is the same as for KMeans and LinearRegression.
scaler.fit(df[cols])
StandardScaler()
One difference is that we use transform instead of predict. That is because we are not predicting anything.
df2[cols] = scaler.transform(df[cols])
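As an aside (we did not use it in class), most scikit-learn preprocessing objects also provide a fit_transform method that combines the two steps; the following one-liner should produce the same result as the fit and transform calls above.
# Equivalent shortcut: fit and transform in a single step.
df2[cols] = StandardScaler().fit_transform(df[cols])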
Notice how the four columns “horsepower”, “weight”, “model_year”, “cylinders” have changed dramatically.
df2
|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name | pred1 | pred2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 1.483947 | 307.0 | 0.664133 | 0.620540 | 12.0 | -1.625315 | usa | chevrolet chevelle malibu | 19.416046 | 15.263206 |
| 1 | 15.0 | 1.483947 | 350.0 | 1.574594 | 0.843334 | 11.5 | -1.625315 | usa | buick skylark 320 | 13.891480 | 13.950775 |
| 2 | 18.0 | 1.483947 | 318.0 | 1.184397 | 0.540382 | 11.0 | -1.625315 | usa | plymouth satellite | 16.259151 | 15.617580 |
| 3 | 16.0 | 1.483947 | 304.0 | 1.184397 | 0.536845 | 12.0 | -1.625315 | usa | amc rebel sst | 16.259151 | 15.636404 |
| 4 | 17.0 | 1.483947 | 302.0 | 0.924265 | 0.555706 | 10.5 | -1.625315 | usa | ford torino | 17.837598 | 15.572160 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 27.0 | -0.864014 | 140.0 | -0.480448 | -0.221125 | 15.6 | 1.636410 | usa | ford mustang gl | 26.361214 | 29.372683 |
| 394 | 44.0 | -0.864014 | 97.0 | -1.364896 | -0.999134 | 24.6 | 1.636410 | europe | vw pickup | 31.727935 | 33.636849 |
| 395 | 32.0 | -0.864014 | 135.0 | -0.532474 | -0.804632 | 11.6 | 1.636410 | usa | dodge rampage | 26.676903 | 32.485855 |
| 396 | 28.0 | -0.864014 | 120.0 | -0.662540 | -0.415627 | 18.6 | 1.636410 | usa | ford ranger | 27.466127 | 30.433302 |
| 397 | 31.0 | -0.864014 | 119.0 | -0.584501 | -0.303641 | 19.4 | 1.636410 | usa | chevy s-10 | 26.992593 | 29.826367 |
392 rows × 11 columns
Those four columns now have mean very close to 0. (For K-means clustering, there is no need to change the mean, because the algorithm works with differences between points, so shifting a column by a constant amount has no effect.)
(I didn’t see any warnings on Deepnote, but the following raises a FutureWarning about computing the mean of the non-numeric columns.)
df2.mean(axis=0)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_29943/3639053033.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
df2.mean(axis=0)
mpg 2.344592e+01
cylinders -1.087565e-16
displacement 1.944120e+02
horsepower -1.812609e-16
weight -1.812609e-17
acceleration 1.554133e+01
model_year -1.160070e-15
pred1 2.344592e+01
pred2 2.344592e+01
dtype: float64
Notice how the standard deviations of those four columns are close to 1. (They are not exactly 1 because pandas computes the standard deviation with ddof=1, while StandardScaler rescales using the ddof=0 standard deviation; see the check below.)
df2.std(axis=0)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_29943/713155030.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
df2.std(axis=0)
mpg 7.805007
cylinders 1.001278
displacement 104.644004
horsepower 1.001278
weight 1.001278
acceleration 2.758864
model_year 1.001278
pred1 6.075627
pred2 7.017807
dtype: float64
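Where do those particular numbers come from? As a quick check (not something we did in class), the fitted scaler stores the column means and the standard deviations it divided by. StandardScaler uses the population standard deviation (ddof=0), so the scaled columns have population standard deviation exactly 1, while pandas’ std defaults to ddof=1 and therefore reports sqrt(392/391), approximately 1.001278.
# The fitted scaler stores the statistics it used for the rescaling.
(scaler.mean_, scaler.scale_)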
We can now perform the same procedure as above.
reg3 = LinearRegression()
reg3.fit(df2[cols], df2["mpg"])
LinearRegression()
reg3.feature_names_in_
array(['horsepower', 'weight', 'model_year', 'cylinders'], dtype=object)
The relative magnitudes of the coefficients in reg2 were not meaningful (as far as I know), because the scales of the input features were different (in fact, they all had different units). By rescaling the data, the following magnitudes become meaningful. For example, because the scaled “weight” coefficient has the biggest absolute value, it should be interpreted as the most important of these four features with respect to mpg. That was not at all obvious from the numbers we saw above.
reg3.coef_
array([-0.1389689 , -5.32288337, 2.74688485, -0.21752852])
Here is an elegant way to group those numbers and feature names together into a pandas Series. As a first step, we make the Series without specifying an index.
pd.Series(reg3.coef_)
0 -0.138969
1 -5.322883
2 2.746885
3 -0.217529
dtype: float64
Here is how we can assign names to each of the numbers.
pd.Series(reg3.coef_, index=reg3.feature_names_in_)
horsepower -0.138969
weight -5.322883
model_year 2.746885
cylinders -0.217529
dtype: float64
Here we sort the pandas Series, with the biggest values at the beginning. For this sorting, we only care about the absolute value of the numbers; that is why we use key=abs in the following call to the sort_values Series method.
pd.Series(reg3.coef_, index=reg3.feature_names_in_).sort_values(ascending=False, key=abs)
weight -5.322883
model_year 2.746885
cylinders -0.217529
horsepower -0.138969
dtype: float64
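As an optional aside (we did not do this in class), scikit-learn’s Pipeline can bundle the rescaling and the regression into one estimator, so the scaling happens automatically inside fit and predict. Here is a minimal sketch using the same columns.
from sklearn.pipeline import make_pipeline

# Chain the scaler and the regression into a single estimator.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(df[cols], df["mpg"])

# The coefficients (with respect to the scaled features) live in the last step.
pipe[-1].coef_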
Linear Regression using a categorical variable#
Again perform linear regression, this time also including “origin” as a predictor. Use a OneHotEncoder object, and remove the intercept (also called the bias) when we instantiate the LinearRegression object.
(Aside. It’s not obvious to me whether we should rescale this new categorical feature. For now we won’t rescale it. It’s also not obvious to me if we should rescale the output variable. Some quick Google searches suggest there are pros and cons to both.)
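Here is one possible rough sketch of those two steps (I’m not claiming this is the only or best way), assuming a recent enough scikit-learn version that OneHotEncoder has a get_feature_names_out method.
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the "origin" column and add the new columns to df2.
encoder = OneHotEncoder()
encoder.fit(df2[["origin"]])
new_cols = list(encoder.get_feature_names_out())  # e.g. "origin_europe", "origin_japan", "origin_usa"
df2[new_cols] = encoder.transform(df2[["origin"]]).toarray()

# Remove the intercept, since the one-hot columns already sum to 1 in every row.
reg4 = LinearRegression(fit_intercept=False)
reg4.fit(df2[cols + new_cols], df2["mpg"])
pd.Series(reg4.coef_, index=reg4.feature_names_in_)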