Week 10 Wednesday#
Announcements#
The Course Project is due a week from today.
Make sure you’re completing the project in the Project Template file in the Course Project folder on Deepnote. (Don’t complete your project on the Worksheet 17 template.)
My goal today is to show how I could use Bing chat to help me learn a new topic (how to use OneHotEncoder as part of a scikit-learn pipeline).
Cautions#
Bing chat and ChatGPT are amazing tools, but you should use them with caution.
Bing chat always sounds confident, but is often wrong.
Even when Bing chat is correct, it is often using outdated methods. (Several times I have learned how to code something from Bing chat, and later found a much simpler way elsewhere, such as on StackOverflow.)
Last I checked, Bing chat is nearly useless for solving mathematical problems (unless it is a classic problem where the solution can be found verbatim on the internet).
Setup#
I want to perform linear regression on the mpg dataset of cars from Seaborn, using “mpg” as the target. I want to include information from the “origin” column among the input features. Because the “origin” column contains strings, we need to convert it into (multiple) numeric columns. We did this with city names using pd.get_dummies. Today I want to use scikit-learn’s OneHotEncoder.
import seaborn as sns
import pandas as pd
df = sns.load_dataset("mpg")
df.head(4)
|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|-----|-----------|--------------|------------|--------|--------------|------------|--------|------|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
df["origin"].value_counts()
usa 249
japan 79
europe 70
Name: origin, dtype: int64
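As a point of comparison (this cell is an aside, not from the lecture), here is roughly what the pd.get_dummies approach from earlier in the course would produce for this column, reusing the df loaded above.

# pd.get_dummies makes one 0/1 column per distinct value of "origin".
pd.get_dummies(df["origin"]).head(3)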
With help from Bing chat#
In the following screenshots, you can see my questions on the right side and Bing’s responses on the left side. (I made a mistake and asked about “mpg” when I meant “origin”. It doesn’t really make sense to use one-hot encoding with the “mpg” column, because that column contains so many different numbers. One-hot encoding makes the most sense when a column takes only a limited number of distinct values.)
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
This OneHotEncoder works pretty similarly to PolynomialFeatures, which we used earlier in this class. One difference is that we didn’t need to pass keyword arguments when instantiating the object, unlike, for example, PolynomialFeatures(degree=3, include_bias=False).

Maybe a more intimidating difference is that the output isn’t a traditional NumPy array. The reason for this “sparse matrix” output is that most of the entries in the result will be 0, especially if there are many columns, so storage space can be saved by not storing the entire matrix explicitly.
enc.fit_transform(df[["origin"]])
<398x3 sparse matrix of type '<class 'numpy.float64'>'
with 398 stored elements in Compressed Sparse Row format>
But anyway, we can get a traditional NumPy array by calling the toarray method.
arr = enc.fit_transform(df[["origin"]]).toarray()
arr
array([[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
...,
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.]])
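As an aside (not from the lecture), we can also ask the encoder for a dense array directly when we instantiate it. The keyword argument is sparse_output in recent scikit-learn versions; older versions call it sparse, so check your version before copying this.

# Sketch: request a regular (dense) NumPy array directly.
# In scikit-learn 1.2+ the keyword is sparse_output; older versions use sparse.
enc_dense = OneHotEncoder(sparse_output=False)
enc_dense.fit_transform(df[["origin"]])[:3]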
Let’s get a sense of what these numbers mean. Here are 5 random origin values.
df.sample(5, random_state=2)["origin"]
94 usa
32 usa
279 japan
178 europe
354 europe
Name: origin, dtype: object
Here are the same 5 random rows from the output of our enc.fit_transform call. (We got these rows by using the index attribute of the sampled DataFrame above. This only works because the index labels are the same as the row numbers.)

Notice how we have two “usa” values, then one “japan” value, then two “europe” values. We can tell from the following output that the right-most column corresponds to “usa”, the middle column corresponds to “japan”, and the left-most column corresponds to “europe”.
arr[df.sample(5, random_state=2).index]
array([[0., 0., 1.],
[0., 0., 1.],
[0., 1., 0.],
[1., 0., 0.],
[1., 0., 0.]])
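Another way to check the column order (an aside, not from the lecture) is the categories_ attribute of the fitted encoder, which lists the learned categories in the same left-to-right order as the output columns above.

# One array of categories per input column, in output-column order.
enc.categories_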
The answers Bing chat gives are often somewhat out of date. Here is an example. On Deepnote, I got a FutureWarning; here on my personal computer, I get an actual error. We are supposed to use a different method: the FutureWarning on Deepnote instructed me to use get_feature_names_out. (The error below does not give instructions on what to do.)
enc.get_feature_names(["origin"])
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[12], line 1
----> 1 enc.get_feature_names(["origin"])
AttributeError: 'OneHotEncoder' object has no attribute 'get_feature_names'
Notice how these values are in the same order as we saw above.
enc.get_feature_names_out(["origin"])
array(['origin_europe', 'origin_japan', 'origin_usa'], dtype=object)
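If you need code that runs on both older and newer scikit-learn versions, a hedged sketch like the following should work, since older versions provide get_feature_names and newer ones provide get_feature_names_out.

# Use whichever feature-name method this scikit-learn version provides.
if hasattr(enc, "get_feature_names_out"):
    names = enc.get_feature_names_out(["origin"])
else:
    names = enc.get_feature_names(["origin"])
names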
The main reason I didn’t introduce OneHotEncoder earlier (and instead used pd.get_dummies) is that getting OneHotEncoder into a Pipeline is not as straightforward as getting PolynomialFeatures into a Pipeline.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
pipe = Pipeline([
    ('onehot', OneHotEncoder()),
    ('regressor', LinearRegression())
])
X = df[['origin', 'horsepower', 'weight']]
y = df['mpg']
pipe.fit(X, y)
Pipeline(steps=[('onehot', OneHotEncoder()), ('regressor', LinearRegression())])
Here we can see that the first suggestion above does not do what we want. Notice how many “horsepower” and “weight” columns have been added. We only wanted to apply one-hot encoding to the “origin” column, not to the numerical columns.
pipe['onehot'].get_feature_names_out()
array(['origin_europe', 'origin_japan', 'origin_usa', 'horsepower_46.0',
'horsepower_48.0', 'horsepower_49.0', 'horsepower_52.0',
'horsepower_53.0', 'horsepower_54.0', 'horsepower_58.0',
'horsepower_60.0', 'horsepower_61.0', 'horsepower_62.0',
'horsepower_63.0', 'horsepower_64.0', 'horsepower_65.0',
'horsepower_66.0', 'horsepower_67.0', 'horsepower_68.0',
'horsepower_69.0', 'horsepower_70.0', 'horsepower_71.0',
'horsepower_72.0', 'horsepower_74.0', 'horsepower_75.0',
'horsepower_76.0', 'horsepower_77.0', 'horsepower_78.0',
'horsepower_79.0', 'horsepower_80.0', 'horsepower_81.0',
'horsepower_82.0', 'horsepower_83.0', 'horsepower_84.0',
'horsepower_85.0', 'horsepower_86.0', 'horsepower_87.0',
'horsepower_88.0', 'horsepower_89.0', 'horsepower_90.0',
'horsepower_91.0', 'horsepower_92.0', 'horsepower_93.0',
'horsepower_94.0', 'horsepower_95.0', 'horsepower_96.0',
'horsepower_97.0', 'horsepower_98.0', 'horsepower_100.0',
'horsepower_102.0', 'horsepower_103.0', 'horsepower_105.0',
'horsepower_107.0', 'horsepower_108.0', 'horsepower_110.0',
'horsepower_112.0', 'horsepower_113.0', 'horsepower_115.0',
'horsepower_116.0', 'horsepower_120.0', 'horsepower_122.0',
'horsepower_125.0', 'horsepower_129.0', 'horsepower_130.0',
'horsepower_132.0', 'horsepower_133.0', 'horsepower_135.0',
'horsepower_137.0', 'horsepower_138.0', 'horsepower_139.0',
'horsepower_140.0', 'horsepower_142.0', 'horsepower_145.0',
'horsepower_148.0', 'horsepower_149.0', 'horsepower_150.0',
'horsepower_152.0', 'horsepower_153.0', 'horsepower_155.0',
'horsepower_158.0', 'horsepower_160.0', 'horsepower_165.0',
'horsepower_167.0', 'horsepower_170.0', 'horsepower_175.0',
'horsepower_180.0', 'horsepower_190.0', 'horsepower_193.0',
'horsepower_198.0', 'horsepower_200.0', 'horsepower_208.0',
'horsepower_210.0', 'horsepower_215.0', 'horsepower_220.0',
'horsepower_225.0', 'horsepower_230.0', 'horsepower_nan',
'weight_1613', 'weight_1649', 'weight_1755', 'weight_1760',
'weight_1773', 'weight_1795', 'weight_1800', 'weight_1825',
'weight_1834', 'weight_1835', 'weight_1836', 'weight_1845',
'weight_1850', 'weight_1867', 'weight_1875', 'weight_1915',
'weight_1925', 'weight_1937', 'weight_1940', 'weight_1945',
'weight_1950', 'weight_1955', 'weight_1963', 'weight_1965',
'weight_1968', 'weight_1970', 'weight_1975', 'weight_1980',
'weight_1985', 'weight_1990', 'weight_1995', 'weight_2000',
'weight_2003', 'weight_2019', 'weight_2020', 'weight_2025',
'weight_2035', 'weight_2045', 'weight_2046', 'weight_2050',
'weight_2051', 'weight_2065', 'weight_2070', 'weight_2074',
'weight_2075', 'weight_2085', 'weight_2100', 'weight_2108',
'weight_2110', 'weight_2120', 'weight_2123', 'weight_2124',
'weight_2125', 'weight_2126', 'weight_2130', 'weight_2135',
'weight_2144', 'weight_2145', 'weight_2150', 'weight_2155',
'weight_2158', 'weight_2160', 'weight_2164', 'weight_2171',
'weight_2188', 'weight_2189', 'weight_2190', 'weight_2200',
'weight_2202', 'weight_2205', 'weight_2210', 'weight_2215',
'weight_2219', 'weight_2220', 'weight_2223', 'weight_2226',
'weight_2228', 'weight_2230', 'weight_2234', 'weight_2245',
'weight_2246', 'weight_2254', 'weight_2255', 'weight_2264',
'weight_2265', 'weight_2278', 'weight_2279', 'weight_2288',
'weight_2290', 'weight_2295', 'weight_2300', 'weight_2310',
'weight_2320', 'weight_2330', 'weight_2335', 'weight_2350',
'weight_2370', 'weight_2372', 'weight_2375', 'weight_2379',
'weight_2380', 'weight_2385', 'weight_2391', 'weight_2395',
'weight_2401', 'weight_2405', 'weight_2408', 'weight_2420',
'weight_2430', 'weight_2434', 'weight_2451', 'weight_2464',
'weight_2472', 'weight_2489', 'weight_2490', 'weight_2500',
'weight_2506', 'weight_2511', 'weight_2515', 'weight_2525',
'weight_2542', 'weight_2545', 'weight_2556', 'weight_2560',
'weight_2565', 'weight_2572', 'weight_2575', 'weight_2582',
'weight_2585', 'weight_2587', 'weight_2592', 'weight_2595',
'weight_2600', 'weight_2605', 'weight_2615', 'weight_2620',
'weight_2625', 'weight_2634', 'weight_2635', 'weight_2639',
'weight_2640', 'weight_2648', 'weight_2660', 'weight_2665',
'weight_2670', 'weight_2671', 'weight_2672', 'weight_2678',
'weight_2694', 'weight_2700', 'weight_2702', 'weight_2711',
'weight_2720', 'weight_2725', 'weight_2735', 'weight_2740',
'weight_2745', 'weight_2755', 'weight_2774', 'weight_2789',
'weight_2790', 'weight_2795', 'weight_2800', 'weight_2807',
'weight_2815', 'weight_2830', 'weight_2833', 'weight_2835',
'weight_2855', 'weight_2865', 'weight_2868', 'weight_2870',
'weight_2875', 'weight_2890', 'weight_2900', 'weight_2901',
'weight_2904', 'weight_2905', 'weight_2910', 'weight_2914',
'weight_2930', 'weight_2933', 'weight_2945', 'weight_2950',
'weight_2957', 'weight_2962', 'weight_2965', 'weight_2979',
'weight_2984', 'weight_2990', 'weight_3003', 'weight_3012',
'weight_3015', 'weight_3021', 'weight_3035', 'weight_3039',
'weight_3060', 'weight_3070', 'weight_3085', 'weight_3086',
'weight_3102', 'weight_3121', 'weight_3139', 'weight_3140',
'weight_3150', 'weight_3155', 'weight_3158', 'weight_3160',
'weight_3169', 'weight_3190', 'weight_3193', 'weight_3205',
'weight_3210', 'weight_3211', 'weight_3221', 'weight_3230',
'weight_3233', 'weight_3245', 'weight_3250', 'weight_3264',
'weight_3265', 'weight_3270', 'weight_3278', 'weight_3282',
'weight_3288', 'weight_3302', 'weight_3329', 'weight_3336',
'weight_3353', 'weight_3360', 'weight_3365', 'weight_3380',
'weight_3381', 'weight_3399', 'weight_3410', 'weight_3415',
'weight_3420', 'weight_3425', 'weight_3430', 'weight_3432',
'weight_3433', 'weight_3436', 'weight_3439', 'weight_3445',
'weight_3449', 'weight_3459', 'weight_3465', 'weight_3504',
'weight_3520', 'weight_3525', 'weight_3530', 'weight_3535',
'weight_3563', 'weight_3570', 'weight_3574', 'weight_3605',
'weight_3609', 'weight_3613', 'weight_3620', 'weight_3630',
'weight_3632', 'weight_3645', 'weight_3651', 'weight_3664',
'weight_3672', 'weight_3693', 'weight_3725', 'weight_3730',
'weight_3735', 'weight_3755', 'weight_3761', 'weight_3777',
'weight_3781', 'weight_3785', 'weight_3820', 'weight_3821',
'weight_3830', 'weight_3840', 'weight_3850', 'weight_3870',
'weight_3880', 'weight_3892', 'weight_3897', 'weight_3900',
'weight_3907', 'weight_3940', 'weight_3955', 'weight_3962',
'weight_3988', 'weight_4042', 'weight_4054', 'weight_4055',
'weight_4060', 'weight_4077', 'weight_4080', 'weight_4082',
'weight_4096', 'weight_4098', 'weight_4100', 'weight_4129',
'weight_4135', 'weight_4140', 'weight_4141', 'weight_4154',
'weight_4165', 'weight_4190', 'weight_4209', 'weight_4215',
'weight_4220', 'weight_4237', 'weight_4257', 'weight_4274',
'weight_4278', 'weight_4294', 'weight_4295', 'weight_4312',
'weight_4325', 'weight_4335', 'weight_4341', 'weight_4354',
'weight_4360', 'weight_4363', 'weight_4376', 'weight_4380',
'weight_4382', 'weight_4385', 'weight_4422', 'weight_4425',
'weight_4440', 'weight_4456', 'weight_4457', 'weight_4464',
'weight_4498', 'weight_4499', 'weight_4502', 'weight_4615',
'weight_4633', 'weight_4638', 'weight_4654', 'weight_4657',
'weight_4668', 'weight_4699', 'weight_4732', 'weight_4735',
'weight_4746', 'weight_4906', 'weight_4951', 'weight_4952',
'weight_4955', 'weight_4997', 'weight_5140'], dtype=object)
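To quantify how many input features the linear regression ended up with (a quick check, not from the lecture), we can count the feature names.

# Hundreds of columns: one per distinct "horsepower" and "weight" value, plus three for "origin".
len(pipe["onehot"].get_feature_names_out())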
The following error is not Bing’s fault; it occurs because our data contains missing values.
from sklearn.compose import ColumnTransformer
pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('onehot', OneHotEncoder(), ['origin']),
        ('passthrough', 'passthrough', ['horsepower', 'weight'])
    ])),
    ('reg', LinearRegression())
])
X = df[['origin', 'horsepower', 'weight']]
y = df['mpg']
pipe.fit(X, y)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[17], line 14
11 X = df[['origin', 'horsepower', 'weight']]
12 y = df['mpg']
---> 14 pipe.fit(X, y)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/sklearn/pipeline.py:405, in Pipeline.fit(self, X, y, **fit_params)
403 if self._final_estimator != "passthrough":
404 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 405 self._final_estimator.fit(Xt, y, **fit_params_last_step)
407 return self
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/sklearn/linear_model/_base.py:648, in LinearRegression.fit(self, X, y, sample_weight)
644 n_jobs_ = self.n_jobs
646 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 648 X, y = self._validate_data(
649 X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
650 )
652 sample_weight = _check_sample_weight(
653 sample_weight, X, dtype=X.dtype, only_non_negative=True
654 )
656 X, y, X_offset, y_offset, X_scale = _preprocess_data(
657 X,
658 y,
(...)
661 sample_weight=sample_weight,
662 )
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/sklearn/base.py:584, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
582 y = check_array(y, input_name="y", **check_y_params)
583 else:
--> 584 X, y = check_X_y(X, y, **check_params)
585 out = X, y
587 if not no_val_X and check_params.get("ensure_2d", True):
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/sklearn/utils/validation.py:1106, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
1101 estimator_name = _check_estimator_name(estimator)
1102 raise ValueError(
1103 f"{estimator_name} requires y to be passed, but the target y is None"
1104 )
-> 1106 X = check_array(
1107 X,
1108 accept_sparse=accept_sparse,
1109 accept_large_sparse=accept_large_sparse,
1110 dtype=dtype,
1111 order=order,
1112 copy=copy,
1113 force_all_finite=force_all_finite,
1114 ensure_2d=ensure_2d,
1115 allow_nd=allow_nd,
1116 ensure_min_samples=ensure_min_samples,
1117 ensure_min_features=ensure_min_features,
1118 estimator=estimator,
1119 input_name="X",
1120 )
1122 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
1124 check_consistent_length(X, y)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/sklearn/utils/validation.py:921, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
915 raise ValueError(
916 "Found array with dim %d. %s expected <= 2."
917 % (array.ndim, estimator_name)
918 )
920 if force_all_finite:
--> 921 _assert_all_finite(
922 array,
923 input_name=input_name,
924 estimator_name=estimator_name,
925 allow_nan=force_all_finite == "allow-nan",
926 )
928 if ensure_min_samples > 0:
929 n_samples = _num_samples(array)
File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/sklearn/utils/validation.py:161, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
144 if estimator_name and input_name == "X" and has_nan_error:
145 # Improve the error message on how to handle missing values in
146 # scikit-learn.
147 msg_err += (
148 f"\n{estimator_name} does not accept missing values"
149 " encoded as NaN natively. For supervised learning, you might want"
(...)
159 "#estimators-that-handle-nan-values"
160 )
--> 161 raise ValueError(msg_err)
ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
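As a quick check (not from the lecture), we can count the missing values in each column; the “horsepower_nan” feature name in the earlier output already hints that “horsepower” is the column containing NaN values.

# Count missing values per column.
df.isna().sum()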
Here we drop the rows with missing values. I think this is safer than filling in the missing values.
df = df.dropna(axis=0)
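The error message above also mentions using an imputer transformer instead of dropping rows. Here is a hedged sketch of what that alternative could look like with sklearn.impute.SimpleImputer; we do not use it in class, and the details (such as strategy='mean') are my own choice.

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Alternative sketch: fill missing numeric values with the column mean
# instead of dropping rows, then one-hot encode "origin" as before.
pipe_impute = Pipeline([
    ('prep', ColumnTransformer([
        ('onehot', OneHotEncoder(), ['origin']),
        ('impute', SimpleImputer(strategy='mean'), ['horsepower', 'weight'])
    ])),
    ('reg', LinearRegression())
])
# pipe_impute.fit(df[['origin', 'horsepower', 'weight']], df['mpg'])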
Now we use the second suggestion shown in the screenshot above. I have not completely understood the meaning of these “passthrough” entries; at this point, I am following Bing’s instructions. (The string 'passthrough' in a ColumnTransformer means the listed columns are passed along unchanged, with no transformer applied to them.)
from sklearn.compose import ColumnTransformer
pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('onehot', OneHotEncoder(), ['origin']),
        ('passthrough', 'passthrough', ['horsepower', 'weight'])
    ])),
    ('reg', LinearRegression())
])
X = df[['origin', 'horsepower', 'weight']]
y = df['mpg']
pipe.fit(X, y)
Pipeline(steps=[('prep', ColumnTransformer(transformers=[('onehot', OneHotEncoder(), ['origin']), ('passthrough', 'passthrough', ['horsepower', 'weight'])])), ('reg', LinearRegression())])
Remember that we can’t get the linear regression coefficients directly from the Pipeline object.
pipe.coef_
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[20], line 1
----> 1 pipe.coef_
AttributeError: 'Pipeline' object has no attribute 'coef_'
Instead, we have to access the LinearRegression step, which we do here using pipe["reg"].
pipe["reg"].coef_
array([-0.27333471, 1.50778661, -1.2344519 , -0.05354417, -0.00484275])
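As a small aside (not from the lecture), scikit-learn also exposes the steps of a pipeline through its named_steps attribute, which is equivalent to the square-bracket access we used.

# Same fitted LinearRegression object, accessed via named_steps.
pipe.named_steps["reg"].coef_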
I’m not quite following the zip and print suggestion from Bing. But the main thing I wanted to learn was how to get the feature_names variable that Bing illustrates above, by calling the get_feature_names_out method on the preprocessor step.
pd.Series(
    data=pipe["reg"].coef_,
    index=pipe["prep"].get_feature_names_out()
)
onehot__origin_europe -0.273335
onehot__origin_japan 1.507787
onehot__origin_usa -1.234452
passthrough__horsepower -0.053544
passthrough__weight -0.004843
dtype: float64
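For what it’s worth, the zip-and-print suggestion presumably looked something like the following sketch (my reconstruction, not Bing’s exact code); it prints the same name/coefficient pairs that appear in the Series above.

# Pair each feature name with its coefficient and print them one per line.
feature_names = pipe["prep"].get_feature_names_out()
for name, coef in zip(feature_names, pipe["reg"].coef_):
    print(name, coef)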
Bing didn’t quite understand what I meant, and it gave me a description of a scatterplot instead.

Here we try plotting. It looks terrible; let’s try getting rid of the “horsepower” portion, since it’s not one of our axes here.
import plotly.express as px
y_pred = pipe.predict(X)
fig = px.line(df, x='weight', y=y_pred, color='origin')
fig.show()
I wasn’t sure during class what the error below was. (Looking at the traceback, the cause is that the 'passthrough' entry has only two elements; each entry in a ColumnTransformer needs three things: a name, a transformer, and a list of columns.)
from sklearn.compose import ColumnTransformer
pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('onehot', OneHotEncoder(), ['origin']),
        ('passthrough', ['weight'])
    ])),
    ('reg', LinearRegression())
])
X = df[['origin', 'weight']]
y = df['mpg']
pipe.fit(X, y)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [29], line 14
11 X = df[['origin', 'weight']]
12 y = df['mpg']
---> 14 pipe.fit(X, y)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/pipeline.py:378, in Pipeline.fit(self, X, y, **fit_params)
352 """Fit the model.
353
354 Fit all the transformers one after the other and transform the
(...)
375 Pipeline with fitted steps.
376 """
377 fit_params_steps = self._check_fit_params(**fit_params)
--> 378 Xt = self._fit(X, y, **fit_params_steps)
379 with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
380 if self._final_estimator != "passthrough":
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/pipeline.py:336, in Pipeline._fit(self, X, y, **fit_params_steps)
334 cloned_transformer = clone(transformer)
335 # Fit or load from cache the current transformer
--> 336 X, fitted_transformer = fit_transform_one_cached(
337 cloned_transformer,
338 X,
339 y,
340 None,
341 message_clsname="Pipeline",
342 message=self._log_message(step_idx),
343 **fit_params_steps[name],
344 )
345 # Replace the transformer of the step with the fitted
346 # transformer. This is necessary when loading the transformer
347 # from the cache.
348 self.steps[step_idx] = (name, fitted_transformer)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/joblib/memory.py:349, in NotMemorizedFunc.__call__(self, *args, **kwargs)
348 def __call__(self, *args, **kwargs):
--> 349 return self.func(*args, **kwargs)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/pipeline.py:870, in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
868 with _print_elapsed_time(message_clsname, message):
869 if hasattr(transformer, "fit_transform"):
--> 870 res = transformer.fit_transform(X, y, **fit_params)
871 else:
872 res = transformer.fit(X, y, **fit_params).transform(X)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py:686, in ColumnTransformer.fit_transform(self, X, y)
684 # set n_features_in_ attribute
685 self._check_n_features(X, reset=True)
--> 686 self._validate_transformers()
687 self._validate_column_callables(X)
688 self._validate_remainder(X)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py:346, in ColumnTransformer._validate_transformers(self)
343 if not self.transformers:
344 return
--> 346 names, transformers, _ = zip(*self.transformers)
348 # validate names
349 self._validate_names(names)
ValueError: not enough values to unpack (expected 3, got 2)
Honestly, it still looks worse than I expected. The only difference between the origins is that they have different y-intercepts (or biases). In my head, I had expected them to have different slopes, but that isn’t possible with this model, because all three origins share the single “weight” coefficient.
pipe = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('onehot', OneHotEncoder(), ['origin']),
        ('passthrough', 'passthrough', ['weight'])
    ])),
    ('regressor', LinearRegression())
])
X = df[['origin', 'weight']]
y = df['mpg']
pipe.fit(X, y)
y_pred = pipe.predict(X)
fig = px.line(df, x='weight', y=y_pred, color='origin')
fig.show()
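To make the “different intercepts, same slope” point concrete, here is a small sketch (not from the lecture) that computes the effective intercept for each origin by adding that origin’s one-hot coefficient to the overall intercept; the “weight” slope is shared by all three lines.

# Effective intercept for each origin = overall intercept + that origin's one-hot coefficient.
coefs = pd.Series(
    data=pipe["regressor"].coef_,
    index=pipe["preprocessor"].get_feature_names_out()
)
coefs[:3] + pipe["regressor"].intercept_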
I don’t really think the with_mean part is correct, but I didn’t have time to investigate it during class.
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('onehot', OneHotEncoder(), ['origin']),
        ('scaler', StandardScaler(with_mean=False), ['weight'])
    ])),
    ('regressor', LinearRegression(fit_intercept=False))
])
X = df[['origin', 'weight']]
y = df['mpg']
pipe.fit(X, y)
y_pred = pipe.predict(X)
fig = px.line(df, x='weight', y=y_pred, color='origin')
fig.show()
Here are the corresponding coefficients.
pipe["regressor"].coef_
array([44.70314174, 46.05938605, 43.73223615, -5.9612044 ])
Notice how we don’t have an intercept_ any more (or, more accurately, it is 0). That is because we instantiated the LinearRegression object using LinearRegression(fit_intercept=False).
pipe["regressor"].intercept_
0.0
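To connect these coefficients to the feature names, the same labeling trick from earlier works here too (a small sketch reusing the fitted pipe above). Because we used fit_intercept=False, the three one-hot coefficients act as per-origin intercepts, and the last coefficient multiplies the rescaled “weight” column.

# Label each coefficient with its feature name, as we did earlier.
pd.Series(
    data=pipe["regressor"].coef_,
    index=pipe["preprocessor"].get_feature_names_out()
)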