Week 7 Monday#

Announcements#

  • The next videos and video quizzes are posted. Due Monday of Week 8 (because Friday is a holiday this week).

  • Worksheets 9 and 10 due tomorrow.

  • In-class quiz tomorrow based on K-means clustering; it also has one question on StandardScaler (which William discussed on Tuesday and we discussed on Friday).

  • I have office hours after class at 11am, next door in ALP 3610.

  • I posted sample solutions to Midterm 1. They’re linked from the Course Updates section on our Canvas homepage.

import pandas as pd

The pandas DataFrame merge method#

The merge method could have been covered in the first half of Math 10. We are covering it now because it appears on today’s worksheet. There is also a related method, join, but I am in the habit of using merge.

  • Combine the data from the following two DataFrames using df_menu.merge together with how="inner".

  • Practice question (we won’t go over it today): can you produce the same result using the map method, like what we did with the stocks dataset? (A sketch of one possible approach appears after the how="inner" example below.)

df_menu = pd.DataFrame({
    "item": ["BLT", "Egg sandwich", "Ramen", "Pumpkin Pie",
             "Caesar salad", "House salad", "Falafel wrap"],
    "food_type": ["Sandwich", "Sandwich", "Soup", "Dessert",
                  "Salad", "Salad", "Sandwich"],
    "Vegan": [False, False, False, False, False, True, True]
})

df_price = pd.DataFrame({
    "food_type": ["Drink", "Salad", "Sandwich", "Soup"],
    "price": [5, 16, 12, 10]
})
df_menu
           item  food_type  Vegan
0           BLT   Sandwich  False
1  Egg sandwich   Sandwich  False
2         Ramen       Soup  False
3   Pumpkin Pie    Dessert  False
4  Caesar salad      Salad  False
5   House salad      Salad   True
6  Falafel wrap   Sandwich   True
df_price
   food_type  price
0      Drink      5
1      Salad     16
2   Sandwich     12
3       Soup     10
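
As an aside, here is roughly how the related join method mentioned above could be used for the same kind of combination. This is just a sketch (we did not use join in class): join matches on the other DataFrame’s index, so we first move "food_type" into the index of df_price.

# Sketch only: join matches on the other DataFrame's index,
# so we set "food_type" as the index of df_price first.
df_menu.join(df_price.set_index("food_type"), on="food_type", how="inner")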

The following doesn’t raise an error, but it also doesn’t do what we want: pd.concat with axis=1 aligns rows by index label rather than by "food_type", so the prices are not matched with the correct categories.

# doesn't work
pd.concat((df_menu, df_price), axis=1)
           item  food_type  Vegan  food_type  price
0           BLT   Sandwich  False      Drink    5.0
1  Egg sandwich   Sandwich  False      Salad   16.0
2         Ramen       Soup  False   Sandwich   12.0
3   Pumpkin Pie    Dessert  False       Soup   10.0
4  Caesar salad      Salad  False        NaN    NaN
5   House salad      Salad   True        NaN    NaN
6  Falafel wrap   Sandwich   True        NaN    NaN

The following how="left" syntax and its variants come from database-style joins (like a SQL LEFT JOIN). In this case, because the df_menu DataFrame occurs on the left side of the code, we are telling pandas to keep all of the rows from that DataFrame. We match them to rows in the other DataFrame using the "food_type" column.

df_menu.merge(df_price, on="food_type", how="left")
           item  food_type  Vegan  price
0           BLT   Sandwich  False   12.0
1  Egg sandwich   Sandwich  False   12.0
2         Ramen       Soup  False   10.0
3   Pumpkin Pie    Dessert  False    NaN
4  Caesar salad      Salad  False   16.0
5   House salad      Salad   True   16.0
6  Falafel wrap   Sandwich   True   12.0

The following is similar, but we use all of the rows from the df_price DataFrame. Notice how the “Salad” row occurs twice, because “Salad” occurs twice in the df_menu DataFrame. Notice how the “Pumpkin Pie” row does not show up, because “Dessert” does not occur in the right DataFrame. (The row order here follows the key order of df_price: according to the documentation below, how="right" preserves the key order of the right frame.)

df_menu.merge(df_price, on="food_type", how="right")
           item  food_type  Vegan  price
0           NaN      Drink    NaN      5
1  Caesar salad      Salad  False     16
2   House salad      Salad   True     16
3           BLT   Sandwich  False     12
4  Egg sandwich   Sandwich  False     12
5  Falafel wrap   Sandwich   True     12
6         Ramen       Soup  False     10

The keyword argument how="inner" is like an intersection: we only keep rows whose “food_type” value occurs in both DataFrames. Notice how there are no missing values in the following DataFrame, unlike the two previous results, which did have NaN values.

df_menu.merge(df_price, on="food_type", how="inner")
           item  food_type  Vegan  price
0           BLT   Sandwich  False     12
1  Egg sandwich   Sandwich  False     12
2  Falafel wrap   Sandwich   True     12
3         Ramen       Soup  False     10
4  Caesar salad      Salad  False     16
5   House salad      Salad   True     16
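
Here is a sketch of one possible approach to the practice question from earlier, using the map method instead of merge. (The names price_map and df_mapped are made up for this example.)

# One possible approach to the practice question (a sketch).
# price_map is a Series indexed by food type, e.g. price_map["Soup"] == 10.
price_map = df_price.set_index("food_type")["price"]
df_mapped = df_menu.copy()
df_mapped["price"] = df_mapped["food_type"].map(price_map)
# To mimic how="inner", drop the rows whose food type had no price.
df_mapped = df_mapped.dropna(subset=["price"])
df_mapped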

The following is like a union: we keep a row for every “food_type” value that occurs in either DataFrame, filling in NaN where the information is missing.

df_menu.merge(df_price, on="food_type", how="outer")
           item  food_type  Vegan  price
0           BLT   Sandwich  False   12.0
1  Egg sandwich   Sandwich  False   12.0
2  Falafel wrap   Sandwich   True   12.0
3         Ramen       Soup  False   10.0
4   Pumpkin Pie    Dessert  False    NaN
5  Caesar salad      Salad  False   16.0
6   House salad      Salad   True   16.0
7           NaN      Drink    NaN    5.0
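
As an aside, the indicator parameter (described in the documentation below) can show where each row came from. A quick sketch:

# indicator=True adds a "_merge" column with values "left_only",
# "right_only", or "both", showing which DataFrame each row came from.
df_menu.merge(df_price, on="food_type", how="outer", indicator=True)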

What if we don’t specify the how keyword argument? We can check the documentation, which shows us that "inner" is the default value of this keyword argument. So if we don’t specify how, it will be the same as specifying how="inner".
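
For example, the following comparison should evaluate to True (a quick sanity check, not something we ran in class):

# "inner" is the default value of how, so these two results should match.
df_menu.merge(df_price, on="food_type").equals(
    df_menu.merge(df_price, on="food_type", how="inner")
)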

help(df_menu.merge)
Help on method merge in module pandas.core.frame:

merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None) -> 'DataFrame' method of pandas.core.frame.DataFrame instance
    Merge DataFrame or named Series objects with a database-style join.
    
    The join is done on columns or indexes. If joining columns on
    columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes
    on indexes or indexes on a column or columns, the index will be passed on.
    When performing a cross merge, no column specifications to merge on are
    allowed.
    
    Parameters
    ----------
    right : DataFrame or named Series
        Object to merge with.
    how : {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'
        Type of merge to be performed.
    
        * left: use only keys from left frame, similar to a SQL left outer join;
          preserve key order.
        * right: use only keys from right frame, similar to a SQL right outer join;
          preserve key order.
        * outer: use union of keys from both frames, similar to a SQL full outer
          join; sort keys lexicographically.
        * inner: use intersection of keys from both frames, similar to a SQL inner
          join; preserve the order of the left keys.
        * cross: creates the cartesian product from both frames, preserves the order
          of the left keys.
    
          .. versionadded:: 1.2.0
    
    on : label or list
        Column or index level names to join on. These must be found in both
        DataFrames. If `on` is None and not merging on indexes then this defaults
        to the intersection of the columns in both DataFrames.
    left_on : label or list, or array-like
        Column or index level names to join on in the left DataFrame. Can also
        be an array or list of arrays of the length of the left DataFrame.
        These arrays are treated as if they are columns.
    right_on : label or list, or array-like
        Column or index level names to join on in the right DataFrame. Can also
        be an array or list of arrays of the length of the right DataFrame.
        These arrays are treated as if they are columns.
    left_index : bool, default False
        Use the index from the left DataFrame as the join key(s). If it is a
        MultiIndex, the number of keys in the other DataFrame (either the index
        or a number of columns) must match the number of levels.
    right_index : bool, default False
        Use the index from the right DataFrame as the join key. Same caveats as
        left_index.
    sort : bool, default False
        Sort the join keys lexicographically in the result DataFrame. If False,
        the order of the join keys depends on the join type (how keyword).
    suffixes : list-like, default is ("_x", "_y")
        A length-2 sequence where each element is optionally a string
        indicating the suffix to add to overlapping column names in
        `left` and `right` respectively. Pass a value of `None` instead
        of a string to indicate that the column name from `left` or
        `right` should be left as-is, with no suffix. At least one of the
        values must not be None.
    copy : bool, default True
        If False, avoid copy if possible.
    indicator : bool or str, default False
        If True, adds a column to the output DataFrame called "_merge" with
        information on the source of each row. The column can be given a different
        name by providing a string argument. The column will have a Categorical
        type with the value of "left_only" for observations whose merge key only
        appears in the left DataFrame, "right_only" for observations
        whose merge key only appears in the right DataFrame, and "both"
        if the observation's merge key is found in both DataFrames.
    
    validate : str, optional
        If specified, checks if merge is of specified type.
    
        * "one_to_one" or "1:1": check if merge keys are unique in both
          left and right datasets.
        * "one_to_many" or "1:m": check if merge keys are unique in left
          dataset.
        * "many_to_one" or "m:1": check if merge keys are unique in right
          dataset.
        * "many_to_many" or "m:m": allowed, but does not result in checks.
    
    Returns
    -------
    DataFrame
        A DataFrame of the two merged objects.
    
    See Also
    --------
    merge_ordered : Merge with optional filling/interpolation.
    merge_asof : Merge on nearest keys.
    DataFrame.join : Similar method using indices.
    
    Notes
    -----
    Support for specifying index levels as the `on`, `left_on`, and
    `right_on` parameters was added in version 0.23.0
    Support for merging named Series objects was added in version 0.24.0
    
    Examples
    --------
    >>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
    ...                     'value': [1, 2, 3, 5]})
    >>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
    ...                     'value': [5, 6, 7, 8]})
    >>> df1
        lkey value
    0   foo      1
    1   bar      2
    2   baz      3
    3   foo      5
    >>> df2
        rkey value
    0   foo      5
    1   bar      6
    2   baz      7
    3   foo      8
    
    Merge df1 and df2 on the lkey and rkey columns. The value columns have
    the default suffixes, _x and _y, appended.
    
    >>> df1.merge(df2, left_on='lkey', right_on='rkey')
      lkey  value_x rkey  value_y
    0  foo        1  foo        5
    1  foo        1  foo        8
    2  foo        5  foo        5
    3  foo        5  foo        8
    4  bar        2  bar        6
    5  baz        3  baz        7
    
    Merge DataFrames df1 and df2 with specified left and right suffixes
    appended to any overlapping columns.
    
    >>> df1.merge(df2, left_on='lkey', right_on='rkey',
    ...           suffixes=('_left', '_right'))
      lkey  value_left rkey  value_right
    0  foo           1  foo            5
    1  foo           1  foo            8
    2  foo           5  foo            5
    3  foo           5  foo            8
    4  bar           2  bar            6
    5  baz           3  baz            7
    
    Merge DataFrames df1 and df2, but raise an exception if the DataFrames have
    any overlapping columns.
    
    >>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))
    Traceback (most recent call last):
    ...
    ValueError: columns overlap but no suffix specified:
        Index(['value'], dtype='object')
    
    >>> df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
    >>> df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
    >>> df1
          a  b
    0   foo  1
    1   bar  2
    >>> df2
          a  c
    0   foo  3
    1   baz  4
    
    >>> df1.merge(df2, how='inner', on='a')
          a  b  c
    0   foo  1  3
    
    >>> df1.merge(df2, how='left', on='a')
          a  b  c
    0   foo  1  3.0
    1   bar  2  NaN
    
    >>> df1 = pd.DataFrame({'left': ['foo', 'bar']})
    >>> df2 = pd.DataFrame({'right': [7, 8]})
    >>> df1
        left
    0   foo
    1   bar
    >>> df2
        right
    0   7
    1   8
    
    >>> df1.merge(df2, how='cross')
       left  right
    0   foo      7
    1   foo      8
    2   bar      7
    3   bar      8

Some code from the last class#

Here is the base code we were using for linear regression on Friday.

import pandas as pd
import numpy as np

import altair as alt
import seaborn as sns

from sklearn.linear_model import LinearRegression
df = sns.load_dataset("mpg").dropna(axis=0)

cols = ["horsepower", "weight", "model_year", "cylinders"]
reg = LinearRegression()
reg.fit(df[cols], df["mpg"])
LinearRegression()
pd.Series(reg.coef_, index=reg.feature_names_in_)
horsepower   -0.003615
weight       -0.006275
model_year    0.746632
cylinders    -0.127687
dtype: float64
df.head(3)
    mpg  cylinders  displacement  horsepower  weight  acceleration  model_year  origin                       name
0  18.0          8         307.0       130.0    3504          12.0          70     usa  chevrolet chevelle malibu
1  15.0          8         350.0       165.0    3693          11.5          70     usa          buick skylark 320
2  18.0          8         318.0       150.0    3436          11.0          70     usa         plymouth satellite

Linear Regression using a categorical variable#

  • Again perform linear regression, this time also including “origin” as an additional predictor. Use a OneHotEncoder object.

  • These three new origin columns act like a separate intercept for each origin (one intercept for “europe”, one intercept for “japan”, one intercept for “usa”). Thus it makes sense to remove the intercept (also called bias) when we instantiate the LinearRegression object. (A sketch of this regression appears at the end of these notes.)

df["origin"].unique()
array(['usa', 'japan', 'europe'], dtype=object)

It would not make sense to perform linear regression using the “origin” column directly, because it contains strings. Even if we convert those three strings to, for example, 0, 1, and 2, that is still not a good idea, because it forces an order on the categories, as well as a spacing between them (with 0, 1, 2, the first and third categories are treated as twice as far apart as the first and second, for instance). Instead we will add a column to the DataFrame for each possible value in this column.

from sklearn.preprocessing import OneHotEncoder

This syntax is similar to the StandardScaler syntax.

encoder = OneHotEncoder()

Here is our common error: passing a one-dimensional object (a Series) instead of a two-dimensional object (a DataFrame).

encoder.fit(df["origin"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [20], line 1
----> 1 encoder.fit(df["origin"])

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py:818, in OneHotEncoder.fit(self, X, y)
    800 """
    801 Fit OneHotEncoder to X.
    802 
   (...)
    815     Fitted encoder.
    816 """
    817 self._validate_keywords()
--> 818 fit_results = self._fit(
    819     X,
    820     handle_unknown=self.handle_unknown,
    821     force_all_finite="allow-nan",
    822     return_counts=self._infrequent_enabled,
    823 )
    824 if self._infrequent_enabled:
    825     self._fit_infrequent_category_mapping(
    826         fit_results["n_samples"], fit_results["category_counts"]
    827     )

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py:80, in _BaseEncoder._fit(self, X, handle_unknown, force_all_finite, return_counts)
     78 self._check_n_features(X, reset=True)
     79 self._check_feature_names(X, reset=True)
---> 80 X_list, n_samples, n_features = self._check_X(
     81     X, force_all_finite=force_all_finite
     82 )
     83 self.n_features_in_ = n_features
     85 if self.categories != "auto":

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py:45, in _BaseEncoder._check_X(self, X, force_all_finite)
     32 """
     33 Perform custom check_array:
     34 - convert list of strings to object dtype
   (...)
     41 
     42 """
     43 if not (hasattr(X, "iloc") and getattr(X, "ndim", 0) == 2):
     44     # if not a dataframe, do normal check_array validation
---> 45     X_temp = check_array(X, dtype=None, force_all_finite=force_all_finite)
     46     if not hasattr(X, "dtype") and np.issubdtype(X_temp.dtype, np.str_):
     47         X = check_array(X, dtype=object, force_all_finite=force_all_finite)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:879, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    877     # If input is 1D raise error
    878     if array.ndim == 1:
--> 879         raise ValueError(
    880             "Expected 2D array, got 1D array instead:\narray={}.\n"
    881             "Reshape your data either using array.reshape(-1, 1) if "
    882             "your data has a single feature or array.reshape(1, -1) "
    883             "if it contains a single sample.".format(array)
    884         )
    886 if dtype_numeric and array.dtype.kind in "USV":
    887     raise ValueError(
    888         "dtype='numeric' is not compatible with arrays of bytes/strings."
    889         "Convert your data to numeric values explicitly instead."
    890     )

ValueError: Expected 2D array, got 1D array instead:
array=['usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa'
 'usa' 'usa' 'japan' 'usa' 'usa' 'usa' 'japan' 'europe' 'europe' 'europe'
 'europe' 'europe' 'usa' 'usa' 'usa' 'usa' 'usa' 'japan' 'usa' 'japan'
 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa'
 'usa' 'usa' 'usa' 'usa' 'usa' 'europe' 'europe' 'europe' 'japan' 'japan'
 'europe' 'usa' 'japan' 'usa' 'europe' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa'
 'usa' 'usa' 'usa' 'usa' 'usa' 'japan' 'usa' 'usa' 'usa' 'usa' 'europe'
 'europe' 'europe' 'europe' 'usa' 'japan' 'japan' 'usa' 'japan' 'usa'
 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa'
 'usa' 'usa' 'usa' 'usa' 'europe' 'usa' 'usa' 'usa' 'usa' 'usa' 'japan'
 'usa' 'japan' 'japan' 'usa' 'usa' 'europe' 'usa' 'usa' 'europe' 'europe'
 'europe' 'europe' 'usa' 'europe' 'japan' 'usa' 'usa' 'usa' 'usa' 'japan'
 'usa' 'japan' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa'
 'europe' 'europe' 'europe' 'japan' 'japan' 'usa' 'europe' 'europe'
 'japan' 'japan' 'europe' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa'
 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'japan' 'usa' 'usa' 'usa'
 'japan' 'europe' 'japan' 'usa' 'europe' 'usa' 'europe' 'europe' 'europe'
 'europe' 'japan' 'europe' 'europe' 'usa' 'usa' 'europe' 'usa' 'usa' 'usa'
 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'europe' 'japan' 'usa' 'usa'
 'usa' 'usa' 'europe' 'japan' 'japan' 'usa' 'europe' 'usa' 'europe'
 'japan' 'europe' 'usa' 'usa' 'usa' 'usa' 'japan' 'usa' 'europe' 'usa'
 'japan' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa'
 'usa' 'europe' 'usa' 'japan' 'usa' 'usa' 'usa' 'japan' 'europe' 'japan'
 'europe' 'japan' 'europe' 'usa' 'japan' 'japan' 'japan' 'usa' 'usa' 'usa'
 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa'
 'usa' 'usa' 'usa' 'japan' 'japan' 'usa' 'japan' 'usa' 'usa' 'japan'
 'europe' 'europe' 'europe' 'europe' 'europe' 'japan' 'usa' 'usa' 'usa'
 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'europe'
 'japan' 'usa' 'usa' 'europe' 'usa' 'europe' 'usa' 'usa' 'usa' 'japan'
 'europe' 'usa' 'usa' 'usa' 'usa' 'europe' 'japan' 'usa' 'japan' 'usa'
 'usa' 'usa' 'usa' 'europe' 'japan' 'japan' 'japan' 'japan' 'japan' 'usa'
 'japan' 'europe' 'europe' 'europe' 'europe' 'japan' 'japan' 'europe'
 'japan' 'japan' 'europe' 'japan' 'usa' 'usa' 'usa' 'usa' 'usa' 'japan'
 'usa' 'japan' 'japan' 'japan' 'japan' 'japan' 'usa' 'usa' 'usa' 'europe'
 'japan' 'japan' 'japan' 'japan' 'europe' 'europe' 'japan' 'japan' 'usa'
 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'usa' 'europe'
 'japan' 'japan' 'usa' 'usa' 'japan' 'japan' 'japan' 'japan' 'japan'
 'japan' 'usa' 'usa' 'usa' 'usa' 'japan' 'usa' 'usa' 'usa' 'europe' 'usa'
 'usa' 'usa'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Here is the correct way to pass this single column.

encoder.fit(df[["origin"]])
OneHotEncoder()

We now transform the data. (Notice how we use the method transform instead of predict. We are not in the machine learning portion at the moment; we are not predicting anything. Instead, we are in the preprocessing, or data cleaning, stage.)

encoder.transform(df[["origin"]])
<392x3 sparse matrix of type '<class 'numpy.float64'>'
	with 392 stored elements in Compressed Sparse Row format>

The produced object is a little strange, but we can convert it to a NumPy array. (The array will contain mostly zeros, so it can be more efficient not to store all of those zeros explicitly. That is why, by default, scikit-learn returns this sparse matrix object instead of a usual NumPy array.)

encoder.transform(df[["origin"]]).toarray()
array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])
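
As an aside, you can ask the encoder to produce a dense NumPy array directly. (The keyword name depends on the scikit-learn version; I believe it is sparse in the version used here, while versions 1.2 and later use sparse_output.)

# Assumption: the keyword is named sparse in this scikit-learn version;
# in newer versions (1.2+), use OneHotEncoder(sparse_output=False) instead.
dense_encoder = OneHotEncoder(sparse=False)
dense_encoder.fit_transform(df[["origin"]])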

There are three columns, corresponding to three values in the “origin” column. Which column corresponds to which value? We can use the following method to check.

encoder.get_feature_names_out()
array(['origin_europe', 'origin_japan', 'origin_usa'], dtype=object)

Let’s add these three new columns to our DataFrame. To be safe, we will make the changes in a new DataFrame.

df2 = df.copy()

At some point (I don’t remember exactly where), we will get an error if the new column names are in a NumPy array instead of a list, so here we convert them to a list.

new_cols = list(encoder.get_feature_names_out())
new_cols
['origin_europe', 'origin_japan', 'origin_usa']

Here we put the above NumPy array into these three new columns.

df2[new_cols] = encoder.transform(df[["origin"]]).toarray()

Think of these last three columns as being like Boolean Series. For example, everywhere we see a 1 in the “origin_europe” column, the origin value for that row is “europe”. These columns don’t contain any new information (it was all already present in the “origin” column), but they are numeric, so they can be used for linear regression (or clustering, or many other machine learning techniques that require numerical values).
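
We can check that claim directly by comparing the indicator column to a Boolean Series (a quick sketch):

# The 1s in "origin_europe" should line up exactly with the rows
# where the "origin" value is "europe".
(df2["origin_europe"] == 1).equals(df2["origin"] == "europe")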

df2.sample(5, random_state=12)
      mpg  cylinders  displacement  horsepower  weight  acceleration  model_year  origin                         name  origin_europe  origin_japan  origin_usa
146  28.0          4          90.0        75.0    2125          14.5          74     usa                   dodge colt            0.0           0.0         1.0
240  30.5          4          97.0        78.0    2190          14.1          77  europe            volkswagen dasher            1.0           0.0         0.0
 82  23.0          4         120.0        97.0    2506          14.5          72   japan  toyouta corona mark ii (sw)            0.0           1.0         0.0
  1  15.0          8         350.0       165.0    3693          11.5          70     usa            buick skylark 320            0.0           0.0         1.0
345  35.1          4          81.0        60.0    1760          16.1          81   japan             honda civic 1300            0.0           1.0         0.0
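
As an aside, pandas has a built-in shortcut for this kind of encoding, pd.get_dummies. We did not use it in class, but here is a sketch:

# pd.get_dummies produces one indicator column per category,
# similar to the OneHotEncoder output above.
pd.get_dummies(df["origin"], prefix="origin")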

I think that on Wednesday we’ll perform linear regression to predict the miles-per-gallon of a car using these new columns as well as our cols list of columns from before.
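
Here is a sketch of what that might look like, based on the plan above (not code we ran today). As discussed earlier, we remove the intercept because the three origin columns play that role.

# Sketch of the planned regression (assumption: fit_intercept=False,
# since the one-hot origin columns act as separate intercepts).
reg2 = LinearRegression(fit_intercept=False)
reg2.fit(df2[cols + new_cols], df2["mpg"])
pd.Series(reg2.coef_, index=reg2.feature_names_in_)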