Week 7 Monday#
Announcements#
- The next videos and video quizzes are posted. They are due Monday of Week 8 (because Friday is a holiday this week).
- Worksheets 9 and 10 are due tomorrow.
- The in-class quiz tomorrow is based on K-means clustering and also has one question on `StandardScaler` (which William discussed on Tuesday and we discussed on Friday).
- I have office hours after class at 11am, next door in ALP 3610.
- I posted sample solutions to Midterm 1. They're linked from the Course Updates section on our Canvas homepage.
import pandas as pd
The pandas DataFrame `merge` method#
The `merge` method could have been covered in the first half of Math 10; we are covering it now because it appears on today's worksheet. There is also a related method `join`, but I am in the habit of using `merge`.
- Combine the data from the following two DataFrames using `df_menu.merge` together with `how="inner"`.
- Practice question (won't go over today): can you make the same result using the `map` method, like what we did with the stocks dataset? (One possible approach is sketched after the `how="inner"` example below.)
df_menu = pd.DataFrame({
"item": ["BLT", "Egg sandwich", "Ramen", "Pumpkin Pie",
"Caesar salad", "House salad", "Falafel wrap"],
"food_type": ["Sandwich", "Sandwich", "Soup", "Dessert",
"Salad", "Salad", "Sandwich"],
"Vegan": [False, False, False, False, False, True, True]
})
df_price = pd.DataFrame({
"food_type": ["Drink", "Salad", "Sandwich", "Soup"],
"price": [5, 16, 12, 10]
})
df_menu
|   | item | food_type | Vegan |
|---|---|---|---|
| 0 | BLT | Sandwich | False |
| 1 | Egg sandwich | Sandwich | False |
| 2 | Ramen | Soup | False |
| 3 | Pumpkin Pie | Dessert | False |
| 4 | Caesar salad | Salad | False |
| 5 | House salad | Salad | True |
| 6 | Falafel wrap | Sandwich | True |
df_price
|   | food_type | price |
|---|---|---|
| 0 | Drink | 5 |
| 1 | Salad | 16 |
| 2 | Sandwich | 12 |
| 3 | Soup | 10 |
The following doesn't raise an error, but it also doesn't do what we want: `concat` with `axis=1` pastes the DataFrames side by side, aligned by row position rather than by `"food_type"`, so the prices are not matched with the correct categories.
# doesn't work
pd.concat((df_menu, df_price), axis=1)
|   | item | food_type | Vegan | food_type | price |
|---|---|---|---|---|---|
| 0 | BLT | Sandwich | False | Drink | 5.0 |
| 1 | Egg sandwich | Sandwich | False | Salad | 16.0 |
| 2 | Ramen | Soup | False | Sandwich | 12.0 |
| 3 | Pumpkin Pie | Dessert | False | Soup | 10.0 |
| 4 | Caesar salad | Salad | False | NaN | NaN |
| 5 | House salad | Salad | True | NaN | NaN |
| 6 | Falafel wrap | Sandwich | True | NaN | NaN |
The following `how="left"` syntax and its variants come from database-style joins. In this case, because `df_menu` is the DataFrame on the left side of the code, we are telling pandas to keep all of the rows from that DataFrame. We match them to rows in the other DataFrame using the `"food_type"` column.
df_menu.merge(df_price, on="food_type", how="left")
|   | item | food_type | Vegan | price |
|---|---|---|---|---|
| 0 | BLT | Sandwich | False | 12.0 |
| 1 | Egg sandwich | Sandwich | False | 12.0 |
| 2 | Ramen | Soup | False | 10.0 |
| 3 | Pumpkin Pie | Dessert | False | NaN |
| 4 | Caesar salad | Salad | False | 16.0 |
| 5 | House salad | Salad | True | 16.0 |
| 6 | Falafel wrap | Sandwich | True | 12.0 |
The following is similar, but we keep all of the rows from the `df_price` DataFrame. Notice how the "Salad" food type occurs twice, because it occurs twice in the `df_menu` DataFrame. Notice how the "Pumpkin Pie" row does not show up, because "Dessert" does not occur in `df_price`, the right DataFrame. (The row order here follows the key order of the right DataFrame; see the `how="right"` entry in the documentation below.)
df_menu.merge(df_price, on="food_type", how="right")
|   | item | food_type | Vegan | price |
|---|---|---|---|---|
| 0 | NaN | Drink | NaN | 5 |
| 1 | Caesar salad | Salad | False | 16 |
| 2 | House salad | Salad | True | 16 |
| 3 | BLT | Sandwich | False | 12 |
| 4 | Egg sandwich | Sandwich | False | 12 |
| 5 | Falafel wrap | Sandwich | True | 12 |
| 6 | Ramen | Soup | False | 10 |
The keyword argument `how="inner"` is like an intersection. We only keep rows whose "food_type" value occurs in both DataFrames. Notice how there are no missing values in the following DataFrame, unlike the two previous results, which did have `NaN` values.
df_menu.merge(df_price, on="food_type", how="inner")
|   | item | food_type | Vegan | price |
|---|---|---|---|---|
| 0 | BLT | Sandwich | False | 12 |
| 1 | Egg sandwich | Sandwich | False | 12 |
| 2 | Falafel wrap | Sandwich | True | 12 |
| 3 | Ramen | Soup | False | 10 |
| 4 | Caesar salad | Salad | False | 16 |
| 5 | House salad | Salad | True | 16 |
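Here is one possible approach to the practice question from the top of this section (a sketch, not necessarily the intended solution): build a food-type-to-price mapping from `df_price` and `map` it onto `df_menu`.

```python
# A sketch of the map-based approach to the practice question.
# Build a Series mapping each food_type to its price.
price_map = df_price.set_index("food_type")["price"]

# Map prices onto the menu, then drop unmatched rows to mimic how="inner".
df_mapped = df_menu.copy()
df_mapped["price"] = df_mapped["food_type"].map(price_map)
df_mapped.dropna(subset=["price"])
```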
The following is like a union: we keep the rows for every "food_type" value that occurs in either DataFrame.
df_menu.merge(df_price, on="food_type", how="outer")
|   | item | food_type | Vegan | price |
|---|---|---|---|---|
| 0 | BLT | Sandwich | False | 12.0 |
| 1 | Egg sandwich | Sandwich | False | 12.0 |
| 2 | Falafel wrap | Sandwich | True | 12.0 |
| 3 | Ramen | Soup | False | 10.0 |
| 4 | Pumpkin Pie | Dessert | False | NaN |
| 5 | Caesar salad | Salad | False | 16.0 |
| 6 | House salad | Salad | True | 16.0 |
| 7 | NaN | Drink | NaN | 5.0 |
What if we don't specify the `how` keyword argument? We can check the documentation, which shows us that `"inner"` is the default value of this keyword argument. So if we don't specify `how`, it will be the same as specifying `how="inner"`.
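As a quick sanity check (a sketch), calling `merge` with no `how` argument should reproduce the `how="inner"` result above:

```python
# merge defaults to an inner join when `how` is not specified.
df_menu.merge(df_price, on="food_type").equals(
    df_menu.merge(df_price, on="food_type", how="inner")
)  # evaluates to True
```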
help(df_menu.merge)
Help on method merge in module pandas.core.frame:
merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None) -> 'DataFrame' method of pandas.core.frame.DataFrame instance
Merge DataFrame or named Series objects with a database-style join.
The join is done on columns or indexes. If joining columns on
columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes
on indexes or indexes on a column or columns, the index will be passed on.
When performing a cross merge, no column specifications to merge on are
allowed.
Parameters
----------
right : DataFrame or named Series
Object to merge with.
how : {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'
Type of merge to be performed.
* left: use only keys from left frame, similar to a SQL left outer join;
preserve key order.
* right: use only keys from right frame, similar to a SQL right outer join;
preserve key order.
* outer: use union of keys from both frames, similar to a SQL full outer
join; sort keys lexicographically.
* inner: use intersection of keys from both frames, similar to a SQL inner
join; preserve the order of the left keys.
* cross: creates the cartesian product from both frames, preserves the order
of the left keys.
.. versionadded:: 1.2.0
on : label or list
Column or index level names to join on. These must be found in both
DataFrames. If `on` is None and not merging on indexes then this defaults
to the intersection of the columns in both DataFrames.
left_on : label or list, or array-like
Column or index level names to join on in the left DataFrame. Can also
be an array or list of arrays of the length of the left DataFrame.
These arrays are treated as if they are columns.
right_on : label or list, or array-like
Column or index level names to join on in the right DataFrame. Can also
be an array or list of arrays of the length of the right DataFrame.
These arrays are treated as if they are columns.
left_index : bool, default False
Use the index from the left DataFrame as the join key(s). If it is a
MultiIndex, the number of keys in the other DataFrame (either the index
or a number of columns) must match the number of levels.
right_index : bool, default False
Use the index from the right DataFrame as the join key. Same caveats as
left_index.
sort : bool, default False
Sort the join keys lexicographically in the result DataFrame. If False,
the order of the join keys depends on the join type (how keyword).
suffixes : list-like, default is ("_x", "_y")
A length-2 sequence where each element is optionally a string
indicating the suffix to add to overlapping column names in
`left` and `right` respectively. Pass a value of `None` instead
of a string to indicate that the column name from `left` or
`right` should be left as-is, with no suffix. At least one of the
values must not be None.
copy : bool, default True
If False, avoid copy if possible.
indicator : bool or str, default False
If True, adds a column to the output DataFrame called "_merge" with
information on the source of each row. The column can be given a different
name by providing a string argument. The column will have a Categorical
type with the value of "left_only" for observations whose merge key only
appears in the left DataFrame, "right_only" for observations
whose merge key only appears in the right DataFrame, and "both"
if the observation's merge key is found in both DataFrames.
validate : str, optional
If specified, checks if merge is of specified type.
* "one_to_one" or "1:1": check if merge keys are unique in both
left and right datasets.
* "one_to_many" or "1:m": check if merge keys are unique in left
dataset.
* "many_to_one" or "m:1": check if merge keys are unique in right
dataset.
* "many_to_many" or "m:m": allowed, but does not result in checks.
Returns
-------
DataFrame
A DataFrame of the two merged objects.
See Also
--------
merge_ordered : Merge with optional filling/interpolation.
merge_asof : Merge on nearest keys.
DataFrame.join : Similar method using indices.
Notes
-----
Support for specifying index levels as the `on`, `left_on`, and
`right_on` parameters was added in version 0.23.0
Support for merging named Series objects was added in version 0.24.0
Examples
--------
>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
... 'value': [1, 2, 3, 5]})
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
... 'value': [5, 6, 7, 8]})
>>> df1
lkey value
0 foo 1
1 bar 2
2 baz 3
3 foo 5
>>> df2
rkey value
0 foo 5
1 bar 6
2 baz 7
3 foo 8
Merge df1 and df2 on the lkey and rkey columns. The value columns have
the default suffixes, _x and _y, appended.
>>> df1.merge(df2, left_on='lkey', right_on='rkey')
lkey value_x rkey value_y
0 foo 1 foo 5
1 foo 1 foo 8
2 foo 5 foo 5
3 foo 5 foo 8
4 bar 2 bar 6
5 baz 3 baz 7
Merge DataFrames df1 and df2 with specified left and right suffixes
appended to any overlapping columns.
>>> df1.merge(df2, left_on='lkey', right_on='rkey',
... suffixes=('_left', '_right'))
lkey value_left rkey value_right
0 foo 1 foo 5
1 foo 1 foo 8
2 foo 5 foo 5
3 foo 5 foo 8
4 bar 2 bar 6
5 baz 3 baz 7
Merge DataFrames df1 and df2, but raise an exception if the DataFrames have
any overlapping columns.
>>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))
Traceback (most recent call last):
...
ValueError: columns overlap but no suffix specified:
Index(['value'], dtype='object')
>>> df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
>>> df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
>>> df1
a b
0 foo 1
1 bar 2
>>> df2
a c
0 foo 3
1 baz 4
>>> df1.merge(df2, how='inner', on='a')
a b c
0 foo 1 3
>>> df1.merge(df2, how='left', on='a')
a b c
0 foo 1 3.0
1 bar 2 NaN
>>> df1 = pd.DataFrame({'left': ['foo', 'bar']})
>>> df2 = pd.DataFrame({'right': [7, 8]})
>>> df1
left
0 foo
1 bar
>>> df2
right
0 7
1 8
>>> df1.merge(df2, how='cross')
left right
0 foo 7
1 foo 8
2 bar 7
3 bar 8
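One keyword argument from this documentation that is handy for debugging merges is `indicator`. Here is a small sketch of it on our menu data:

```python
# indicator=True adds a "_merge" column saying whether each row's key was found
# in the left DataFrame only, the right DataFrame only, or both.
df_menu.merge(df_price, on="food_type", how="outer", indicator=True)
```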
Some code from the last class#
Here is the base code we were using for linear regression on Friday.
import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns
from sklearn.linear_model import LinearRegression
df = sns.load_dataset("mpg").dropna(axis=0)
cols = ["horsepower", "weight", "model_year", "cylinders"]
reg = LinearRegression()
reg.fit(df[cols], df["mpg"])
LinearRegression()
pd.Series(reg.coef_, index=reg.feature_names_in_)
horsepower -0.003615
weight -0.006275
model_year 0.746632
cylinders -0.127687
dtype: float64
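As a quick sanity check of what these coefficients mean (a sketch), we can reconstruct the model's prediction for the first row by hand using `reg.intercept_`:

```python
# The prediction is the intercept plus the dot product of the coefficients
# with the feature values.
first_row = df[cols].iloc[0]
manual_pred = reg.intercept_ + np.dot(reg.coef_, first_row)
sklearn_pred = reg.predict(df[cols].iloc[[0]])[0]
manual_pred, sklearn_pred  # these two numbers should match
```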
df.head(3)
|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
Linear Regression using a categorical variable#
- Again perform linear regression, this time also including "origin" as an additional predictor. Use a `OneHotEncoder` object.
- These three new origin columns act like a separate intercept for each origin (one intercept for "europe", one for "japan", one for "usa"). Thus it makes sense to remove the intercept (also called bias) when we instantiate the `LinearRegression` object, using `fit_intercept=False`.
df["origin"].unique()
array(['usa', 'japan', 'europe'], dtype=object)
It would not make sense to perform linear regression using the "origin" column directly, because it contains strings. Even if we convert those three strings to, for example, `0`, `1`, `2`, that is still not a good idea, because it forces an order on the strings, as well as a spacing between them. Instead we will add a column to the DataFrame for each possible value in this column.
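(As an aside: pandas has a built-in `pd.get_dummies` function that produces similar indicator columns in one step. We will use scikit-learn's `OneHotEncoder` instead, because it follows the same fit/transform pattern as our other preprocessing steps.)

```python
# Sketch: pandas' own one-hot encoding of the "origin" column.
pd.get_dummies(df["origin"]).head(3)
```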
from sklearn.preprocessing import OneHotEncoder
This syntax is similar to the `StandardScaler` syntax.
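As a reminder, here is roughly what the `StandardScaler` pattern from Friday looked like (a sketch):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df[cols])                 # learn each column's mean and standard deviation
scaled = scaler.transform(df[cols])  # standardize: subtract the mean, divide by the std
```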
encoder = OneHotEncoder()
Here is our common error: passing a one-dimensional object (a Series) where a two-dimensional object (a DataFrame) is expected.
encoder.fit(df["origin"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [20], line 1
----> 1 encoder.fit(df["origin"])
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py:818, in OneHotEncoder.fit(self, X, y)
800 """
801 Fit OneHotEncoder to X.
802
(...)
815 Fitted encoder.
816 """
817 self._validate_keywords()
--> 818 fit_results = self._fit(
819 X,
820 handle_unknown=self.handle_unknown,
821 force_all_finite="allow-nan",
822 return_counts=self._infrequent_enabled,
823 )
824 if self._infrequent_enabled:
825 self._fit_infrequent_category_mapping(
826 fit_results["n_samples"], fit_results["category_counts"]
827 )
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py:80, in _BaseEncoder._fit(self, X, handle_unknown, force_all_finite, return_counts)
78 self._check_n_features(X, reset=True)
79 self._check_feature_names(X, reset=True)
---> 80 X_list, n_samples, n_features = self._check_X(
81 X, force_all_finite=force_all_finite
82 )
83 self.n_features_in_ = n_features
85 if self.categories != "auto":
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py:45, in _BaseEncoder._check_X(self, X, force_all_finite)
32 """
33 Perform custom check_array:
34 - convert list of strings to object dtype
(...)
41
42 """
43 if not (hasattr(X, "iloc") and getattr(X, "ndim", 0) == 2):
44 # if not a dataframe, do normal check_array validation
---> 45 X_temp = check_array(X, dtype=None, force_all_finite=force_all_finite)
46 if not hasattr(X, "dtype") and np.issubdtype(X_temp.dtype, np.str_):
47 X = check_array(X, dtype=object, force_all_finite=force_all_finite)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:879, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
877 # If input is 1D raise error
878 if array.ndim == 1:
--> 879 raise ValueError(
880 "Expected 2D array, got 1D array instead:\narray={}.\n"
881 "Reshape your data either using array.reshape(-1, 1) if "
882 "your data has a single feature or array.reshape(1, -1) "
883 "if it contains a single sample.".format(array)
884 )
886 if dtype_numeric and array.dtype.kind in "USV":
887 raise ValueError(
888 "dtype='numeric' is not compatible with arrays of bytes/strings."
889 "Convert your data to numeric values explicitly instead."
890 )
ValueError: Expected 2D array, got 1D array instead:
array=['usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa'
'usa' 'usa' 'japan' 'usa' 'usa' 'usa' 'japan' 'europe' 'europe' 'europe'
'europe' 'europe' 'usa' 'usa' 'usa' 'usa' 'usa' 'japan' 'usa' 'japan'
'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa'
'usa' 'usa' 'usa' 'usa' 'usa' 'europe' 'europe' 'europe' 'japan' 'japan'
'europe' 'usa' 'japan' 'usa' 'europe' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa'
'usa' 'usa' 'usa' 'usa' 'usa' 'japan' 'usa' 'usa' 'usa' 'usa' 'europe'
'europe' 'europe' 'europe' 'usa' 'japan' 'japan' 'usa' 'japan' 'usa'
'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa'
'usa' 'usa' 'usa' 'usa' 'europe' 'usa' 'usa' 'usa' 'usa' 'usa' 'japan'
'usa' 'japan' 'japan' 'usa' 'usa' 'europe' 'usa' 'usa' 'europe' 'europe'
'europe' 'europe' 'usa' 'europe' 'japan' 'usa' 'usa' 'usa' 'usa' 'japan'
'usa' 'japan' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa'
'europe' 'europe' 'europe' 'japan' 'japan' 'usa' 'europe' 'europe'
'japan' 'japan' 'europe' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa'
'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'japan' 'usa' 'usa' 'usa'
'japan' 'europe' 'japan' 'usa' 'europe' 'usa' 'europe' 'europe' 'europe'
'europe' 'japan' 'europe' 'europe' 'usa' 'usa' 'europe' 'usa' 'usa' 'usa'
'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'europe' 'japan' 'usa' 'usa'
'usa' 'usa' 'europe' 'japan' 'japan' 'usa' 'europe' 'usa' 'europe'
'japan' 'europe' 'usa' 'usa' 'usa' 'usa' 'japan' 'usa' 'europe' 'usa'
'japan' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa'
'usa' 'europe' 'usa' 'japan' 'usa' 'usa' 'usa' 'japan' 'europe' 'japan'
'europe' 'japan' 'europe' 'usa' 'japan' 'japan' 'japan' 'usa' 'usa' 'usa'
'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa'
'usa' 'usa' 'usa' 'japan' 'japan' 'usa' 'japan' 'usa' 'usa' 'japan'
'europe' 'europe' 'europe' 'europe' 'europe' 'japan' 'usa' 'usa' 'usa'
'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'europe'
'japan' 'usa' 'usa' 'europe' 'usa' 'europe' 'usa' 'usa' 'usa' 'japan'
'europe' 'usa' 'usa' 'usa' 'usa' 'europe' 'japan' 'usa' 'japan' 'usa'
'usa' 'usa' 'usa' 'europe' 'japan' 'japan' 'japan' 'japan' 'japan' 'usa'
'japan' 'europe' 'europe' 'europe' 'europe' 'japan' 'japan' 'europe'
'japan' 'japan' 'europe' 'japan' 'usa' 'usa' 'usa' 'usa' 'usa' 'japan'
'usa' 'japan' 'japan' 'japan' 'japan' 'japan' 'usa' 'usa' 'usa' 'europe'
'japan' 'japan' 'japan' 'japan' 'europe' 'europe' 'japan' 'japan' 'usa'
'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'europe'
'japan' 'japan' 'usa' 'usa' 'japan' 'japan' 'japan' 'japan' 'japan'
'japan' 'usa' 'usa' 'usa' 'usa' 'japan' 'usa' 'usa' 'usa' 'europe' 'usa'
'usa' 'usa'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Here is the correct way to pass this single column.
encoder.fit(df[["origin"]])
OneHotEncoder()
We now transform the data. (Notice how we use the method `transform` instead of `predict`. We are not in the Machine Learning portion at the moment, and we are not predicting anything; instead, we are in the preprocessing or data cleaning stage.)
encoder.transform(df[["origin"]])
<392x3 sparse matrix of type '<class 'numpy.float64'>'
with 392 stored elements in Compressed Sparse Row format>
The produced object is a little strange, but we can convert it to a NumPy array. (The array will contain mostly zeros, so it can be more efficient to store only the nonzero entries. That is why, by default, scikit-learn uses this sparse matrix object instead of a usual NumPy array.)
encoder.transform(df[["origin"]]).toarray()
array([[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
...,
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.]])
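If you would rather avoid the sparse matrix entirely, `OneHotEncoder` can return a dense array directly. (The keyword is `sparse=False` in the scikit-learn version used here; newer versions, 1.2 and later, spell it `sparse_output=False`.)

```python
# Sketch: request a dense NumPy array up front instead of a sparse matrix.
dense_encoder = OneHotEncoder(sparse=False)  # sparse_output=False on scikit-learn >= 1.2
dense_encoder.fit_transform(df[["origin"]])
```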
There are three columns, corresponding to three values in the “origin” column. Which column corresponds to which value? We can use the following method to check.
encoder.get_feature_names_out()
array(['origin_europe', 'origin_japan', 'origin_usa'], dtype=object)
Let’s add these three new columns to our DataFrame. To be safe, we will make the changes in a new DataFrame.
df2 = df.copy()
At some point (I don't remember exactly where), we will get an error if we keep the new column names in a NumPy array instead of a list, so here we convert them to a list.
new_cols = list(encoder.get_feature_names_out())
new_cols
['origin_europe', 'origin_japan', 'origin_usa']
Here we put the above NumPy array into these three new columns.
df2[new_cols] = encoder.transform(df[["origin"]]).toarray()
Think of these last three columns as being like Boolean Series. For example, everywhere we see a `1` in the "origin_europe" column, the origin value is "europe". These columns don't contain any new information (it was all already present in the "origin" column), but they are numeric, so they can be used for linear regression (or clustering, or many other Machine Learning techniques that require numerical values).
df2.sample(5, random_state=12)
|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name | origin_europe | origin_japan | origin_usa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 146 | 28.0 | 4 | 90.0 | 75.0 | 2125 | 14.5 | 74 | usa | dodge colt | 0.0 | 0.0 | 1.0 |
| 240 | 30.5 | 4 | 97.0 | 78.0 | 2190 | 14.1 | 77 | europe | volkswagen dasher | 1.0 | 0.0 | 0.0 |
| 82 | 23.0 | 4 | 120.0 | 97.0 | 2506 | 14.5 | 72 | japan | toyouta corona mark ii (sw) | 0.0 | 1.0 | 0.0 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 | 0.0 | 0.0 | 1.0 |
| 345 | 35.1 | 4 | 81.0 | 60.0 | 1760 | 16.1 | 81 | japan | honda civic 1300 | 0.0 | 1.0 | 0.0 |
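As a quick check of the Boolean interpretation above (a sketch):

```python
# The "origin_usa" indicator column carries the same information as a direct
# comparison with the original "origin" column.
(df2["origin_usa"] == 1).equals(df2["origin"] == "usa")  # evaluates to True
```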
I think that on Wednesday we'll perform linear regression to predict the miles-per-gallon of a car using these new columns as well as our `cols` list of columns from before.
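Here is a sketch of what that regression might look like, with `fit_intercept=False` as discussed above. (This anticipates Wednesday's class, so the details may differ.)

```python
# Sketch of the planned regression: the three origin indicator columns act as
# per-origin intercepts, so we drop the global intercept.
reg2 = LinearRegression(fit_intercept=False)
reg2.fit(df2[cols + new_cols], df2["mpg"])
pd.Series(reg2.coef_, index=reg2.feature_names_in_)
```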