House prices in King County, Washington

We’ll start the class with PowerPoint slides covering a definition of what it means for an algorithm to “learn from data”. (The slides are available here.)

Using a restricted decision tree regressor

import pandas as pd
import numpy as np
import altair as alt
df = pd.read_csv("../data/kc_house_data.csv")
df.columns
Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

We need our features for decision trees to be numerical. All of the columns in this dataset are already numerical except for the “date” column. We will just ignore that one (and the “id” column).

df.dtypes
id                 int64
date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object
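
As an alternative to scanning the dtypes by eye, we could let pandas pick out the numerical columns for us and then drop “id” and “price”. Here is a minimal sketch; it should leave exactly the feature columns we collect below.

# Keep only the numerical columns, then drop the ones we don't want as features
df.select_dtypes(include="number").columns.drop(["id", "price"])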

Here is an easy way to get all the column names, starting with the column at index 3 (in other words, skipping “id”, “date”, and “price”).

cols = df.columns[3:]

To see if we can “learn from data” in this case, we will measure our performance on a test set.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df[cols],
    df['price'],
    random_state=0,
    train_size=0.8
)

This is a large dataset; our training set has 17290 instances.

X_train.shape
(17290, 18)

We will model the price of a house using a decision tree.

from sklearn.tree import DecisionTreeRegressor

Creating the DecisionTreeRegressor object is similar to creating a DecisionTreeClassifier object, but here we specify Mean Absolute Error as our metric; that is what the criterion="absolute_error" keyword argument is doing. You can read about the different options (the default is “squared_error”) in the scikit-learn documentation.

One motivation for using Mean Absolute Error is that if we are off by (for example) 2 million dollars on a house price prediction, we don’t want that error getting squared, which would (in my estimation) over-emphasize these expensive houses.
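
Here is a toy illustration with made-up numbers (the toy_errors array is not from our dataset): one prediction is off by 2 million dollars and three others are off by 10 thousand dollars. Squaring makes the single large miss overwhelm everything else, while the absolute errors treat it linearly.

# Made-up errors, in dollars: one huge miss and three small ones
toy_errors = np.array([2_000_000, 10_000, 10_000, 10_000])
np.abs(toy_errors).mean()   # mean absolute error: the large miss counts linearly
(toy_errors**2).mean()      # mean squared error: the large miss dominates completely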

reg = DecisionTreeRegressor(max_depth=15, max_leaf_nodes=20, criterion="absolute_error")

Mean absolute error is closely connected to the median, so that is what we will use for our baseline prediction. (Exercise: show that if you have a list of numbers, the estimate which minimizes the Mean Absolute Error is the median, and the estimate which minimizes the Mean Squared Error is the mean.)

df["price"].median()
450000.0
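
Here is a quick numerical check related to the above exercise (not a proof): if we compare the Mean Absolute Error from always predicting the median to the one from always predicting the mean, the median should never do worse.

# Sanity check (not a proof) that the median minimizes Mean Absolute Error
prices = df["price"].to_numpy()
np.abs(prices - np.median(prices)).mean()   # MAE when always predicting the median
np.abs(prices - prices.mean()).mean()       # MAE when always predicting the mean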

Let’s start out measuring the performance of our baseline model, which always predicts the median value. There are many ways to make a list (or array, etc.) filled with copies of the same number. We will use a method that is reminiscent of the ones function from Matlab.

df.price.median()*np.ones(4)
array([450000., 450000., 450000., 450000.])
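
For comparison, here are two other ways (among many) to produce the same repeated value: np.full fills an array of a given length with one value, and list multiplication works for a plain Python list.

# Two alternatives that produce the same repeated value
np.full(4, df.price.median())   # NumPy array of length 4
[df.price.median()] * 4         # plain Python list of length 4
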
from sklearn.metrics import mean_absolute_error

Here is the performance of our baseline algorithm.

# baseline
mean_absolute_error(df.price.median()*np.ones(len(y_test)), y_test)
215282.15660421003
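
As an aside, scikit-learn has a built-in way to make this kind of baseline: DummyRegressor with strategy="median". Here is a sketch; note that it uses the median of the training prices rather than the median of the full “price” column, so the resulting error could differ slightly from the number above.

# Sketch of the same baseline using scikit-learn's DummyRegressor
from sklearn.dummy import DummyRegressor
baseline = DummyRegressor(strategy="median")
baseline.fit(X_train, y_train)   # "fitting" just records the median of y_train
mean_absolute_error(baseline.predict(X_test), y_test)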

If we get a smaller (better) mean absolute error when using our decision tree regressor, then we can say that our algorithm “learned from data”.

reg.fit(X_train, y_train)
DecisionTreeRegressor(criterion='absolute_error', max_depth=15,
                      max_leaf_nodes=20)

Because the following value is lower than the baseline (significantly lower), we have “learned from data” in this case.

# decision tree
mean_absolute_error(reg.predict(X_test), y_test)
109940.3937080731
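
We could also check the error on the training set, as in the sketch below. If the training error were much lower than the test error above, that would be a sign of overfitting; with only 20 leaf nodes, this tree is quite restricted, so we would not expect a huge gap.

# Error on the training set, for comparison with the test error above
mean_absolute_error(reg.predict(X_train), y_train)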

As with the decision tree classifiers, we can visualize the decision tree. (I wouldn’t try making these visualizations with a much larger tree.) One difference is that in this case, the Mean Absolute Error is reported. (If we had used a different metric, that metric would be reported instead.) Another difference is that the outputs are numbers, not classes (and the shading colors represent the size of those numbers). Every input that lands in a given leaf produces the same output: the median of the prices for the corresponding region.

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
fig = plt.figure(figsize=(200,100))
_ = plot_tree(
    reg,
    feature_names=reg.feature_names_in_,
    filled=True
)
(Figure: visualization of the restricted decision tree produced by plot_tree.)
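
If the drawn tree is too large to read comfortably, a plain-text summary is another option. Here is a sketch using export_text from sklearn.tree.

# Text version of the fitted tree (only sensible for a reasonably small tree)
from sklearn.tree import export_text
print(export_text(reg, feature_names=list(reg.feature_names_in_)))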

An unrestricted decision tree

Here we briefly try out a decision tree with no restrictions (it can have as large a depth and as many leaf nodes as it wants). We should expect there will be overfitting in this case, although that’s not our focus here.

reg2 = DecisionTreeRegressor(criterion="absolute_error")
reg2.fit(X_train, y_train)
DecisionTreeRegressor(criterion='absolute_error')

What depth does this unrestricted tree have, and how many leaves?

reg2.get_depth()
44

You should interpret the next result as saying that this decision tree is subdividing the feature space into 16438 regions.

reg2.get_n_leaves()
16438

If we just evaluate feature_importances_, it is difficult to interpret; which numbers belong to which columns?

reg2.feature_importances_
array([0.00934508, 0.01640741, 0.17232836, 0.03754872, 0.00519084,
       0.01303593, 0.01244028, 0.00887619, 0.19561318, 0.03874942,
       0.01306063, 0.04515848, 0.00383778, 0.02297416, 0.24546671,
       0.07791932, 0.0507127 , 0.0313348 ])
reg2.feature_names_in_
array(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat',
       'long', 'sqft_living15', 'sqft_lot15'], dtype=object)

Alternatively, we could evaluate cols, but the feature_names_in_ attribute is more reliable.

cols
Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

If all we care about is the correspondence between numbers and feature names, it would be more natural to put this information into a pandas Series.

s = pd.Series(reg2.feature_importances_, index=reg2.feature_names_in_)
s
bedrooms         0.009345
bathrooms        0.016407
sqft_living      0.172328
sqft_lot         0.037549
floors           0.005191
waterfront       0.013036
view             0.012440
condition        0.008876
grade            0.195613
sqft_above       0.038749
sqft_basement    0.013061
yr_built         0.045158
yr_renovated     0.003838
zipcode          0.022974
lat              0.245467
long             0.077919
sqft_living15    0.050713
sqft_lot15       0.031335
dtype: float64
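
One convenience of having this in a pandas Series is that we can sort it to see which features the tree relied on most; from the values above, the largest should be lat, grade, and sqft_living.

# Sort the feature importances from largest to smallest
s.sort_values(ascending=False)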

But eventually we want to plot this information using Altair, so putting this data into two distinct columns in a pandas DataFrame is more useful.

s.to_frame().reset_index()
            index         0
0        bedrooms  0.009345
1       bathrooms  0.016407
2     sqft_living  0.172328
3        sqft_lot  0.037549
4          floors  0.005191
5      waterfront  0.013036
6            view  0.012440
7       condition  0.008876
8           grade  0.195613
9      sqft_above  0.038749
10  sqft_basement  0.013061
11       yr_built  0.045158
12   yr_renovated  0.003838
13        zipcode  0.022974
14            lat  0.245467
15           long  0.077919
16  sqft_living15  0.050713
17     sqft_lot15  0.031335

Another way is to make the DataFrame directly (without first making the pandas Series).

df_feat = pd.DataFrame({"importance": reg2.feature_importances_, "feature": reg2.feature_names_in_})
df_feat
    importance        feature
0     0.009345       bedrooms
1     0.016407      bathrooms
2     0.172328    sqft_living
3     0.037549       sqft_lot
4     0.005191         floors
5     0.013036     waterfront
6     0.012440           view
7     0.008876      condition
8     0.195613          grade
9     0.038749     sqft_above
10    0.013061  sqft_basement
11    0.045158       yr_built
12    0.003838   yr_renovated
13    0.022974        zipcode
14    0.245467            lat
15    0.077919           long
16    0.050713  sqft_living15
17    0.031335     sqft_lot15

Using a bar chart to visualize this information is very natural. We will use horizontal bars just for something different.

alt.Chart(df_feat).mark_bar().encode(
    x="importance",
    y="feature"
)
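
One possible refinement (a sketch, not something we did in lecture) is to sort the bars by importance so that the most important features appear at the top.

# Same chart, but with the bars sorted by importance
alt.Chart(df_feat).mark_bar().encode(
    x="importance",
    y=alt.Y("feature", sort="-x")
)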

The reason I wanted to make this chart using an unrestricted tree is so that more (in fact all) of the feature importances would be non-zero. I didn’t say this during lecture, but because there is (almost surely) overfitting, these feature importances might not be as meaningful as they would be for a somewhat less flexible decision tree.
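
For comparison, here is a sketch of how we could inspect the feature importances of the restricted tree reg from earlier; with only 20 leaf nodes (and so only 19 splits), we would expect the importance to be concentrated in just a few features.

# Feature importances for the restricted tree, sorted from largest to smallest
pd.Series(reg.feature_importances_, index=reg.feature_names_in_).sort_values(ascending=False)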