House prices in King County Washington
We’ll start the class with PowerPoint slides covering a definition of what it means for an algorithm to “learn from data”. (The slides are available here.)
Can we “learn from data” using decision tree regression on the King County dataset? (Here are the column definitions for this dataset.)
Using a restricted decision tree regressor
import pandas as pd
import numpy as np
import altair as alt
df = pd.read_csv("../data/kc_house_data.csv")
df.columns
Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
'lat', 'long', 'sqft_living15', 'sqft_lot15'],
dtype='object')
We need our features for decision trees to be numerical. All of the columns in this dataset are already numerical except for the “date” column. We will just ignore that one (and the “id” column).
df.dtypes
id int64
date object
price float64
bedrooms int64
bathrooms float64
sqft_living int64
sqft_lot int64
floors float64
waterfront int64
view int64
condition int64
grade int64
sqft_above int64
sqft_basement int64
yr_built int64
yr_renovated int64
zipcode int64
lat float64
long float64
sqft_living15 int64
sqft_lot15 int64
dtype: object
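As an aside (not part of the lecture), pandas can also find the non-numeric columns for us; select_dtypes is a standard pandas method, and here it should return just the “date” column.

# Columns whose dtype is not numeric (here this should be only "date")
df.select_dtypes(exclude="number").columns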
Here is an easy way to get all the feature columns: every column starting at index 3 (i.e., skipping the “id”, “date”, and “price” columns).
cols = df.columns[3:]
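(Not what we did in lecture, but an equivalent, slightly more explicit option is to drop the unwanted columns by name; cols_alt is just a name introduced here for illustration.)

# Same columns as above, obtained by dropping the ones we don't want
cols_alt = df.columns.drop(["id", "date", "price"])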
To see if we can “learn from data” in this case, we will measure our performance on a test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df[cols],
    df['price'],
    random_state=0,
    train_size=0.8
)
This is a large dataset; our training set has 17290 instances.
X_train.shape
(17290, 18)
We will model the price of a house using a decision tree.
from sklearn.tree import DecisionTreeRegressor
Creating the DecisionTreeRegressor object is similar to creating a DecisionTreeClassifier object, but here we specify Mean Absolute Error as our metric; that is what the criterion="absolute_error" keyword argument does. You can read about the different options, as well as the default (“squared_error”), in the scikit-learn documentation.
One motivation for using Mean Absolute Error is that if we are off by (for example) 2 million dollars on a house price prediction, we don’t want that error getting squared, which would (in my estimation) over-emphasize these expensive houses.
reg = DecisionTreeRegressor(max_depth=15, max_leaf_nodes=20, criterion="absolute_error")
Mean absolute error is closely connected to median, so that is what we will use for our baseline prediction. (Exercise: show that if you have a list of numbers, the estimate which minimizes Mean Absolute Error is the median, and the estimate which minimizes the Mean Squared Error is the mean.)
df["price"].median()
450000.0
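Here is a quick numerical check related to that exercise, using the house prices themselves (a sketch, not a proof; the exact numbers printed will depend on the data).

# Compare the two constant predictions on the full dataset.
# The median should give the smaller MAE, and the mean the smaller MSE.
for name, c in [("median", df["price"].median()), ("mean", df["price"].mean())]:
    mae = np.mean(np.abs(df["price"] - c))
    mse = np.mean((df["price"] - c) ** 2)
    print(f"{name}: MAE={mae:,.0f}   MSE={mse:,.0f}")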
Let’s start out measuring the performance of our baseline model, which always predicts the median value. There are many ways to make a list (or array, etc.) of the same number. We will use a method that is reminiscent of the ones function from Matlab.
df.price.median()*np.ones(4)
array([450000., 450000., 450000., 450000.])
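(A few equivalent options, in case the Matlab-style version is unfamiliar; any of these would work just as well.)

# Other ways to build a constant sequence of the median price
baseline = df["price"].median()
np.full(4, baseline)                   # NumPy array of four copies
[baseline] * 4                         # plain Python list
pd.Series(baseline, index=range(4))    # pandas Series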
from sklearn.metrics import mean_absolute_error
Here is the performance of our baseline algorithm.
# baseline
mean_absolute_error(df.price.median()*np.ones(len(y_test)), y_test)
215282.15660421003
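As an aside (not used in lecture), scikit-learn has a built-in estimator for exactly this kind of baseline. A sketch: DummyRegressor with strategy="median" always predicts the median of the training prices, so its error may differ slightly from the number above (which used the median of the full dataset).

from sklearn.dummy import DummyRegressor

# This "model" ignores the features and always predicts the training median
dummy = DummyRegressor(strategy="median")
dummy.fit(X_train, y_train)
mean_absolute_error(dummy.predict(X_test), y_test)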
If we get a smaller (better) mean absolute error when using our decision tree regressor, then we can say that our algorithm “learned from data”.
reg.fit(X_train, y_train)
DecisionTreeRegressor(criterion='absolute_error', max_depth=15,
max_leaf_nodes=20)
Because the following value is lower than the baseline (significantly lower), we have “learned from data” in this case.
# decision tree
mean_absolute_error(reg.predict(X_test), y_test)
109940.3937080731
As with the decision tree classifiers, we can visualize the decision tree. (I wouldn’t try making these visualizations with a much larger tree.) One difference is that in this case, the Mean Absolute Error is reported. (If we had used a different metric, that other metric would be reported.) Another difference is that the outputs are numbers, not classes (and the shading colors represent the size of those numbers). A single number is output at each leaf (the median of the prices in the corresponding region).
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
fig = plt.figure(figsize=(200,100))
_ = plot_tree(
    reg,
    feature_names=reg.feature_names_in_,
    filled=True
)
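If the graphical tree is too large to read comfortably, a plain-text summary is another option (a sketch, not something we did in lecture; export_text is also part of sklearn.tree).

from sklearn.tree import export_text

# Text rendering of the same restricted tree, truncated to the first few levels
print(export_text(reg, feature_names=list(reg.feature_names_in_), max_depth=3))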
An unrestricted decision tree
Here we briefly try out a decision tree with no restrictions (it can grow to whatever depth, and use however many leaf nodes, it needs). We should expect some overfitting in this case, although that’s not our focus here.
reg2 = DecisionTreeRegressor(criterion="absolute_error")
reg2.fit(X_train, y_train)
DecisionTreeRegressor(criterion='absolute_error')
What depth does this unrestricted tree have, and how many leaves?
reg2.get_depth()
44
You should interpret the next result as saying that this decision tree subdivides the feature space into 16438 regions.
reg2.get_n_leaves()
16438
If we just evaluate feature_importances_ on its own, the result is difficult to interpret; which numbers belong to which columns?
reg2.feature_importances_
array([0.00934508, 0.01640741, 0.17232836, 0.03754872, 0.00519084,
0.01303593, 0.01244028, 0.00887619, 0.19561318, 0.03874942,
0.01306063, 0.04515848, 0.00383778, 0.02297416, 0.24546671,
0.07791932, 0.0507127 , 0.0313348 ])
reg2.feature_names_in_
array(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
'waterfront', 'view', 'condition', 'grade', 'sqft_above',
'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat',
'long', 'sqft_living15', 'sqft_lot15'], dtype=object)
Alternatively, we could evaluate cols, but using the feature_names_in_ attribute is more reliable.
cols
Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
'waterfront', 'view', 'condition', 'grade', 'sqft_above',
'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
'sqft_living15', 'sqft_lot15'],
dtype='object')
If all we care about is the correspondence between numbers and feature names, it would be more natural to put this information into a pandas Series.
s = pd.Series(reg2.feature_importances_, index=reg2.feature_names_in_)
s
bedrooms 0.009345
bathrooms 0.016407
sqft_living 0.172328
sqft_lot 0.037549
floors 0.005191
waterfront 0.013036
view 0.012440
condition 0.008876
grade 0.195613
sqft_above 0.038749
sqft_basement 0.013061
yr_built 0.045158
yr_renovated 0.003838
zipcode 0.022974
lat 0.245467
long 0.077919
sqft_living15 0.050713
sqft_lot15 0.031335
dtype: float64
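One convenience of the Series form (a small aside, not from lecture): we can sort it to see the most important features at the top.

# Sort the importances from largest to smallest
s.sort_values(ascending=False)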
But eventually we want to plot this information using Altair, so putting this data into two distinct columns in a pandas DataFrame is more useful.
s.to_frame().reset_index()
 | index | 0 |
---|---|---|
0 | bedrooms | 0.009345 |
1 | bathrooms | 0.016407 |
2 | sqft_living | 0.172328 |
3 | sqft_lot | 0.037549 |
4 | floors | 0.005191 |
5 | waterfront | 0.013036 |
6 | view | 0.012440 |
7 | condition | 0.008876 |
8 | grade | 0.195613 |
9 | sqft_above | 0.038749 |
10 | sqft_basement | 0.013061 |
11 | yr_built | 0.045158 |
12 | yr_renovated | 0.003838 |
13 | zipcode | 0.022974 |
14 | lat | 0.245467 |
15 | long | 0.077919 |
16 | sqft_living15 | 0.050713 |
17 | sqft_lot15 | 0.031335 |
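The column names “index” and 0 produced this way are not very descriptive. One way to get nicer names directly from the Series (a sketch, not what we did in lecture):

# Name the index before resetting it, and name the values column at the same time
s.rename_axis("feature").reset_index(name="importance")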
Another way is to make the DataFrame directly (without first making the pandas Series).
df_feat = pd.DataFrame({"importance": reg2.feature_importances_, "feature": reg2.feature_names_in_})
df_feat
 | importance | feature |
---|---|---|
0 | 0.009345 | bedrooms |
1 | 0.016407 | bathrooms |
2 | 0.172328 | sqft_living |
3 | 0.037549 | sqft_lot |
4 | 0.005191 | floors |
5 | 0.013036 | waterfront |
6 | 0.012440 | view |
7 | 0.008876 | condition |
8 | 0.195613 | grade |
9 | 0.038749 | sqft_above |
10 | 0.013061 | sqft_basement |
11 | 0.045158 | yr_built |
12 | 0.003838 | yr_renovated |
13 | 0.022974 | zipcode |
14 | 0.245467 | lat |
15 | 0.077919 | long |
16 | 0.050713 | sqft_living15 |
17 | 0.031335 | sqft_lot15 |
Using a bar chart to visualize this information is very natural. We will use horizontal bars just for something different.
alt.Chart(df_feat).mark_bar().encode(
    x="importance",
    y="feature"
)
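A small variant (not shown in lecture): Altair can sort the bars by their length, which makes the most important features easier to spot.

# Same chart, but with the bars sorted by importance (longest at the top)
alt.Chart(df_feat).mark_bar().encode(
    x="importance",
    y=alt.Y("feature", sort="-x")
)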
The reason I wanted to make this chart using an unrestricted tree is so that more (in fact all) of the feature importances would be non-zero. I didn’t say this during lecture, but because there is (almost surely) overfitting, these feature importances might not be as meaningful as they would be for a somewhat less flexible decision tree.
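We didn’t run this in lecture, but a quick way to check the suspected overfitting would be to compare the unrestricted tree’s error on the training set with its error on the test set (a sketch; a training error far below the test error is a sign of overfitting).

# Train vs test Mean Absolute Error for the unrestricted tree
print("train MAE:", mean_absolute_error(reg2.predict(X_train), y_train))
print("test MAE:", mean_absolute_error(reg2.predict(X_test), y_test))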