House prices in King County Washington
We’ll start the class with PowerPoint slides covering a definition of what it means for an algorithm to “learn from data”. (The slides are available here.)
Can we “learn from data” using decision tree regression on the King County dataset? (Here are the column definitions for this dataset.)
Using a restricted decision tree regressor
import pandas as pd
import numpy as np
import altair as alt
df = pd.read_csv("../data/kc_house_data.csv")
df.columns
Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
'lat', 'long', 'sqft_living15', 'sqft_lot15'],
dtype='object')
We need our features for decision trees to be numerical. All of the columns in this dataset are already numerical except for the “date” column. We will just ignore that one (and the “id” column).
df.dtypes
id int64
date object
price float64
bedrooms int64
bathrooms float64
sqft_living int64
sqft_lot int64
floors float64
waterfront int64
view int64
condition int64
grade int64
sqft_above int64
sqft_basement int64
yr_built int64
yr_renovated int64
zipcode int64
lat float64
long float64
sqft_living15 int64
sqft_lot15 int64
dtype: object
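As an aside (not part of the lecture), pandas can also find the non-numeric columns for us; select_dtypes is a standard pandas method, and here it should return just the “date” column.

# Columns whose dtype is not numeric (here this should be only "date")
df.select_dtypes(exclude="number").columns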
Here is an easy way to get all the feature columns: every column starting at index 3 (i.e., skipping the “id”, “date”, and “price” columns).
cols = df.columns[3:]
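(Not what we did in lecture, but an equivalent, slightly more explicit option is to drop the unwanted columns by name; cols_alt is just a name introduced here for illustration.)

# Same columns as above, obtained by dropping the ones we don't want
cols_alt = df.columns.drop(["id", "date", "price"])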
To see if we can “learn from data” in this case, we will measure our performance on a test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df[cols],
    df['price'],
    random_state=0,
    train_size=0.8
)
This is a large dataset; our training set has 17290 instances.
X_train.shape
(17290, 18)
We will model the price of a house using a decision tree.
from sklearn.tree import DecisionTreeRegressor
Creating the DecisionTreeRegressor object is similar to creating a DecisionTreeClassifier object, but here we specify Mean Absolute Error as our metric; that is what the criterion="absolute_error" keyword argument does. You can read about the different options, as well as the default (“squared_error”), in the scikit-learn documentation.
One motivation for using Mean Absolute Error is that if we are off by (for example) 2 million dollars on a house price prediction, we don’t want that error getting squared, which would (in my estimation) over-emphasize these expensive houses.
reg = DecisionTreeRegressor(max_depth=15, max_leaf_nodes=20, criterion="absolute_error")
Mean absolute error is closely connected to median, so that is what we will use for our baseline prediction. (Exercise: show that if you have a list of numbers, the estimate which minimizes Mean Absolute Error is the median, and the estimate which minimizes the Mean Squared Error is the mean.)
df["price"].median()
450000.0
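Here is a quick numerical check related to that exercise, using the house prices themselves (a sketch, not a proof; the exact numbers printed will depend on the data).

# Compare the two constant predictions on the full dataset.
# The median should give the smaller MAE, and the mean the smaller MSE.
for name, c in [("median", df["price"].median()), ("mean", df["price"].mean())]:
    mae = np.mean(np.abs(df["price"] - c))
    mse = np.mean((df["price"] - c) ** 2)
    print(f"{name}: MAE={mae:,.0f}   MSE={mse:,.0f}")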
Let’s start out measuring the performance of our baseline model, which always predicts the median value. There are many ways to make a list (or array, etc.) of the same number. We will use a method that is reminiscent of the ones function from Matlab.
df.price.median()*np.ones(4)
array([450000., 450000., 450000., 450000.])
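(A few equivalent options, in case the Matlab-style version is unfamiliar; any of these would work just as well.)

# Other ways to build a constant sequence of the median price
baseline = df["price"].median()
np.full(4, baseline)                   # NumPy array of four copies
[baseline] * 4                         # plain Python list
pd.Series(baseline, index=range(4))    # pandas Series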
from sklearn.metrics import mean_absolute_error
Here is the performance of our baseline algorithm.
# baseline
mean_absolute_error(df.price.median()*np.ones(len(y_test)), y_test)
215282.15660421003
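As an aside (not used in lecture), scikit-learn has a built-in estimator for exactly this kind of baseline. A sketch: DummyRegressor with strategy="median" always predicts the median of the training prices, so its error may differ slightly from the number above (which used the median of the full dataset).

from sklearn.dummy import DummyRegressor

# This "model" ignores the features and always predicts the training median
dummy = DummyRegressor(strategy="median")
dummy.fit(X_train, y_train)
mean_absolute_error(dummy.predict(X_test), y_test)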
If we get a smaller (better) mean absolute error when using our decision tree regressor, then we can say that our algorithm “learned from data”.
reg.fit(X_train, y_train)
DecisionTreeRegressor(criterion='absolute_error', max_depth=15,
max_leaf_nodes=20)
Because the following value is lower than the baseline (significantly lower), we have “learned from data” in this case.
# decision tree
mean_absolute_error(reg.predict(X_test), y_test)
109940.3937080731
As with the decision tree classifiers, we can visualize the decision tree. (I wouldn’t try making these visualizations with a much larger tree.) One difference is that in this case, the Mean Absolute Error is reported. (If we had used a different metric, that other metric would be reported.) Another difference is that the outputs are numbers, not classes (and the shading colors represent the size of those numbers). A single number is output at each leaf (the median of the prices in the corresponding region).
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
fig = plt.figure(figsize=(200,100))
_ = plot_tree(
    reg,
    feature_names=reg.feature_names_in_,
    filled=True
)
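If the graphical tree is too large to read comfortably, a plain-text summary is another option (a sketch, not something we did in lecture; export_text is also part of sklearn.tree).

from sklearn.tree import export_text

# Text rendering of the same restricted tree, truncated to the first few levels
print(export_text(reg, feature_names=list(reg.feature_names_in_), max_depth=3))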
An unrestricted decision tree
Here we briefly try out a decision tree with no restrictions (it can grow to whatever depth, and use however many leaf nodes, it needs). We should expect some overfitting in this case, although that’s not our focus here.
reg2 = DecisionTreeRegressor(criterion="absolute_error")
reg2.fit(X_train, y_train)
DecisionTreeRegressor(criterion='absolute_error')
What depth does this unrestricted tree have, and how many leaves?
reg2.get_depth()
44
You should interpret the next result as saying that this decision tree subdivides the feature space into 16438 regions.
reg2.get_n_leaves()
16438
If we just evaluate feature_importances_ on its own, the result is difficult to interpret; which numbers belong to which columns?
reg2.feature_importances_
array([0.00934508, 0.01640741, 0.17232836, 0.03754872, 0.00519084,
0.01303593, 0.01244028, 0.00887619, 0.19561318, 0.03874942,
0.01306063, 0.04515848, 0.00383778, 0.02297416, 0.24546671,
0.07791932, 0.0507127 , 0.0313348 ])
reg2.feature_names_in_
array(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
'waterfront', 'view', 'condition', 'grade', 'sqft_above',
'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat',
'long', 'sqft_living15', 'sqft_lot15'], dtype=object)
Alternatively, we could evaluate cols, but using the feature_names_in_ attribute is more reliable.
cols
Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
'waterfront', 'view', 'condition', 'grade', 'sqft_above',
'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
'sqft_living15', 'sqft_lot15'],
dtype='object')
If all we care about is the correspondence between numbers and feature names, it would be more natural to put this information into a pandas Series.
s = pd.Series(reg2.feature_importances_, index=reg2.feature_names_in_)
s
bedrooms 0.009345
bathrooms 0.016407
sqft_living 0.172328
sqft_lot 0.037549
floors 0.005191
waterfront 0.013036
view 0.012440
condition 0.008876
grade 0.195613
sqft_above 0.038749
sqft_basement 0.013061
yr_built 0.045158
yr_renovated 0.003838
zipcode 0.022974
lat 0.245467
long 0.077919
sqft_living15 0.050713
sqft_lot15 0.031335
dtype: float64
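One convenience of the Series form (a small aside, not from lecture): we can sort it to see the most important features at the top.

# Sort the importances from largest to smallest
s.sort_values(ascending=False)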
But eventually we want to plot this information using Altair, so putting this data into two distinct columns in a pandas DataFrame is more useful.
s.to_frame().reset_index()
 | index | 0 |
---|---|---|
0 | bedrooms | 0.009345 |
1 | bathrooms | 0.016407 |
2 | sqft_living | 0.172328 |
3 | sqft_lot | 0.037549 |
4 | floors | 0.005191 |
5 | waterfront | 0.013036 |
6 | view | 0.012440 |
7 | condition | 0.008876 |
8 | grade | 0.195613 |
9 | sqft_above | 0.038749 |
10 | sqft_basement | 0.013061 |
11 | yr_built | 0.045158 |
12 | yr_renovated | 0.003838 |
13 | zipcode | 0.022974 |
14 | lat | 0.245467 |
15 | long | 0.077919 |
16 | sqft_living15 | 0.050713 |
17 | sqft_lot15 | 0.031335 |
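The column names “index” and 0 produced this way are not very descriptive. One way to get nicer names directly from the Series (a sketch, not what we did in lecture):

# Name the index before resetting it, and name the values column at the same time
s.rename_axis("feature").reset_index(name="importance")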
Another way is to make the DataFrame directly (without first making the pandas Series).
df_feat = pd.DataFrame({"importance": reg2.feature_importances_, "feature": reg2.feature_names_in_})
df_feat
 | importance | feature |
---|---|---|
0 | 0.009345 | bedrooms |
1 | 0.016407 | bathrooms |
2 | 0.172328 | sqft_living |
3 | 0.037549 | sqft_lot |
4 | 0.005191 | floors |
5 | 0.013036 | waterfront |
6 | 0.012440 | view |
7 | 0.008876 | condition |
8 | 0.195613 | grade |
9 | 0.038749 | sqft_above |
10 | 0.013061 | sqft_basement |
11 | 0.045158 | yr_built |
12 | 0.003838 | yr_renovated |
13 | 0.022974 | zipcode |
14 | 0.245467 | lat |
15 | 0.077919 | long |
16 | 0.050713 | sqft_living15 |
17 | 0.031335 | sqft_lot15 |
Using a bar chart to visualize this information is very natural. We will use horizontal bars just for something different.
alt.Chart(df_feat).mark_bar().encode(
    x="importance",
    y="feature"
)
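A small variant (not shown in lecture): Altair can sort the bars by their length, which makes the most important features easier to spot.

# Same chart, but with the bars sorted by importance (longest at the top)
alt.Chart(df_feat).mark_bar().encode(
    x="importance",
    y=alt.Y("feature", sort="-x")
)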
The reason I wanted to make this chart using an unrestricted tree is so that more (in fact all) of the feature importances would be non-zero. I didn’t say this during lecture, but because there is (almost surely) overfitting, these feature importances might not be as meaningful as they would be for a somewhat less flexible decision tree.
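We didn’t run this in lecture, but a quick way to check the suspected overfitting would be to compare the unrestricted tree’s error on the training set with its error on the test set (a sketch; a training error far below the test error is a sign of overfitting).

# Train vs test Mean Absolute Error for the unrestricted tree
print("train MAE:", mean_absolute_error(reg2.predict(X_train), y_train))
print("test MAE:", mean_absolute_error(reg2.predict(X_test), y_test))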