Worksheet 14#

Authors (3 maximum; use your full names): BLANK

In this worksheet, we will investigate the King County dataset which contains sale prices for houses sold in King County in Washington State. (Here are the column definitions for this dataset.)

This house price data contains some houses that are much more expensive than typical. These outliers can have a big impact on our Machine Learning models.

Importing the data#

  • Import the attached file kc_house_data.csv as df.

  • Define the sub-DataFrame dfX as consisting of every row in df and of the columns from “bedrooms” to the end. Use dfX = df.loc[???, ???] with two slices.

  • Define sery as the “price” column from df. (We call it sery instead of dfy because this is a pandas Series.) One possible approach to these three steps is sketched after this list.
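
For reference, here is a sketch of one possible approach to these three steps (it assumes the CSV file is in the working directory; your version may differ):

import pandas as pd

df = pd.read_csv("kc_house_data.csv")
dfX = df.loc[:, "bedrooms":]   # every row, columns from "bedrooms" to the end
sery = df["price"]             # the target is a pandas Series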

This data is not normally distributed#

(I don’t know much about statistics, so please let me know if anything here is wrong!)

Many quantities approximately follow a normal distribution, but house prices definitely do not. Let’s briefly see that in the context of this dataset.

  • How many total houses are there in this dataset?

  • What is the mean price value?

  • What is the standard deviation of the price values?

  • What is the maximum house price in this dataset?

  • How many houses in this dataset have a price over three million dollars? (First make a Boolean Series containing True if the house price is greater than three million and False otherwise, and then call the sum method on this Boolean Series.)

  • In a normal distribution with this mean and this standard deviation, what percentage of houses would have a price under three million dollars? First import the norm object from scipy.stats, and then use the following code. (The :.15% tells Python to display the number as a percentage, not as a probability, and to include 15 decimal places.)

f"{norm.cdf(3000000, loc=???, scale=???):.15%}"

Price predictions using a Decision Tree with Mean Absolute Error#

Create an instance of either DecisionTreeRegressor or DecisionTreeClassifier from sklearn.tree. (Which makes sense for this supervised learning task?) Name the object reg or clf, as appropriate. Specify the following keyword arguments when you instantiate this object. (One possible instantiation is sketched after the tree-diagram code below.)

  • Set criterion="???". Look up the options, and choose the option corresponding to Mean Absolute Error.

  • Set max_depth=3.

  • Set max_leaf_nodes=5.

  • Fit the object using dfX as the input features and using sery as the target. (This took about two minutes when I tried. It’s not obvious why, but the fitting will be much faster below when we use Mean Squared Error instead.)

  • Make a diagram illustrating the fit tree, using the following code.

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Draw the fitted tree; filled=True shades each node by its predicted value.
fig = plt.figure(figsize=(10, 10))
_ = plot_tree(
    reg,
    feature_names=reg.feature_names_in_,
    filled=True
)
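
For reference, here is a sketch of one possible way to create and fit the tree described above. In recent versions of scikit-learn the Mean Absolute Error criterion is spelled "absolute_error" (older versions used "mae"), so check which spelling your version expects.

from sklearn.tree import DecisionTreeRegressor

# Price is a continuous quantity, so a regressor (not a classifier) makes sense here.
reg = DecisionTreeRegressor(
    criterion="absolute_error",   # "mae" in older scikit-learn versions
    max_depth=3,
    max_leaf_nodes=5
)
reg.fit(dfX, sery)                # expect this fit to take a few minutes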

Refer to the above tree diagram when answering the following questions in a markdown cell (not in a Python comment).

  • Which features (columns) were used by this tree?

  • Do you see how max_depth=3 is reflected in this diagram? What about max_leaf_nodes=5?

Another tree diagram, using Mean Squared Error#

  • Make a new instance, this time using Mean Squared Error as the criterion, but keeping the other parameters the same. (One possible version is sketched after this list.)

  • Again fit the object. (The computation should be much faster than in the example using Mean Absolute Error. I don’t know why that is.)

  • Again make a tree diagram.
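
Here is a sketch of one possible version (reg2 is just an illustrative name; the criterion is spelled "squared_error" in recent scikit-learn versions and "mse" in older ones):

from sklearn.tree import DecisionTreeRegressor, plot_tree
import matplotlib.pyplot as plt

reg2 = DecisionTreeRegressor(
    criterion="squared_error",    # "mse" in older scikit-learn versions
    max_depth=3,
    max_leaf_nodes=5
)
reg2.fit(dfX, sery)               # noticeably faster than the absolute-error fit

fig = plt.figure(figsize=(10, 10))
_ = plot_tree(
    reg2,
    feature_names=reg2.feature_names_in_,
    filled=True
)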

Answer the following in a markdown cell (not as a Python comment).

  • Notice how one of the leaf nodes contains very few samples. Was the same true in the previous diagram?

  • Which tends to be more influenced by outliers, Mean Squared Error or Mean Absolute Error? How does that relate to the previous question?

Dividing the data#

We want to illustrate how decision trees are prone to overfitting. This will be more obvious if we use fewer than the full 20,000 rows in the dataset, and if we don’t use the most relevant columns. (For example, instead of using the size of the house, we will use the size of the basement.)

  • Call the sample method of df to select 2000 random rows. Specify random_state=34. (This value was chosen to make sure some of the most expensive houses are included in the sample.) Name this 2000-row DataFrame df2.

  • Define sery2 to be the “price” column from df2, and define dfX2 to be the following columns from df2: “sqft_basement”, “zipcode”, “yr_built”, “condition”, “view”, “waterfront”. (We are choosing these columns because they are somewhat random and thus do a good job of illustrating overfitting. One possible approach is sketched below, after the train/test split code.)

  • Divide these into a training set and a test set using the following:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    dfX2,
    sery2,
    train_size=0.8,
    random_state=4
)

(We are setting random_state=4 because that value will make a nice U-shaped test error curve below. Other values I tried also make a U-shaped test error curve, but 4 is the value I found where the U-shape is most clear.)
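
For reference, here is a sketch of one way to produce df2, dfX2, and sery2 for the train/test split above (cols2 is just an illustrative helper name):

df2 = df.sample(2000, random_state=34)   # 2000 random rows, reproducible sample

cols2 = ["sqft_basement", "zipcode", "yr_built", "condition", "view", "waterfront"]
dfX2 = df2[cols2]
sery2 = df2["price"]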

Training errors and Test errors#

  • Define two empty dictionaries, train_dict and test_dict. (To define an empty dictionary, you can use {}.)

For each integer value of n from 2 to 300, do the following. (In other words, each step will be repeated 299 times. Use a for loop, and put all of the following inside the body of the for loop. The entire for loop should only take about 10 seconds to run. This is quite a contrast to the situation above, when we were using Mean Absolute Error.) A sketch of one possible loop appears after this list.

  • Instantiate a new DecisionTreeRegressor using Mean Squared Error for the criterion, using max_depth=50, and using max_leaf_nodes as n.

  • Fit the regressor using X_train and y_train.

  • Using mean_squared_error from sklearn.metrics, compute the Mean Squared Error between y_train and the values predicted for X_train. Put this error as a value into the train_dict dictionary with the key n.

  • Using mean_squared_error from sklearn.metrics, compute the Mean Squared Error between y_test and the values predicted for X_test. (Be sure you do not fit the regressor again! We should never fit a model using the test set.) Put this error as a value into the test_dict dictionary with the key n.
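
Here is a sketch of one possible version of the loop (the variable names are illustrative):

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

train_dict = {}
test_dict = {}

for n in range(2, 301):   # n = 2, 3, ..., 300
    reg = DecisionTreeRegressor(criterion="squared_error", max_depth=50, max_leaf_nodes=n)
    reg.fit(X_train, y_train)   # fit on the training set only
    train_dict[n] = mean_squared_error(y_train, reg.predict(X_train))
    test_dict[n] = mean_squared_error(y_test, reg.predict(X_test))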

U-shaped test error curve#

Use the following code to show the corresponding training loss and test loss. We want the training loss curve to be colored differently from the test loss curve. The result should reflect the U-shaped test error curve that often occurs in situations of overfitting.

import pandas as pd

# Combine the two loss dictionaries into one long-format DataFrame for plotting.
train_ser = pd.Series(train_dict)
test_ser = pd.Series(test_dict)
train_ser.name = "train"
test_ser.name = "test"
df_loss = pd.concat((train_ser, test_ser), axis=1)
df_loss.reset_index(inplace=True)
df_loss.rename({"index": "max_leaf_nodes"}, axis=1, inplace=True)
df_melted = df_loss.melt(id_vars="max_leaf_nodes", var_name="Type", value_name="Loss")

followed by

alt.Chart(df_melted).mark_line().encode(
    x=???, # bigger values = more flexible
    y=???, # bigger values = worse performance
    color=??? # The train curve in one color, the test curve in another.
)

Hint. To help you make the chart, try evaluating df_melted to see what column names it has.
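
For reference, here is a sketch of one possible encoding, assuming df_melted has the columns "max_leaf_nodes", "Type", and "Loss" produced above:

import altair as alt

alt.Chart(df_melted).mark_line().encode(
    x="max_leaf_nodes",   # more leaf nodes = a more flexible model
    y="Loss",             # bigger values = worse performance
    color="Type"          # train curve in one color, test curve in another
)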

Interpretation questions#

Answer the following in a markdown cell (not as a Python comment).

  • Where does the model seem to be underfitting the data?

  • Where does the model seem to be overfitting the data?

  • If you had to use one of these 299 models on new data, which would you choose? Why?

  • What direction in the chart corresponds to models with “more flexibility”?

Reminder#

Every group member needs to submit this on Canvas (even if you all submit the same link).

Be sure you’ve included the (full) names of you and your group members at the top after “Authors”. You might consider restarting this notebook (the Deepnote refresh icon, not your browser refresh icon) and running it again to make sure it’s working.

Submission#

Using the Share button at the top right, enable Comment access level for anyone with the link. Then submit that link on Canvas.
