Homework 8
Author: BLANK
Collaborators: BLANK
In this homework, we will use Decision Trees and Random Forests to investigate the penguins dataset.
import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns
alt.data_transformers.enable(max_rows=10000)
df = sns.load_dataset("penguins").dropna()
Part 1: Decision boundaries for decision trees
Here is a visualization of data from the penguins dataset. We use only the two columns specified in cols as our predictors.
cols = ["bill_length_mm", "bill_depth_mm"]
alt.Chart(df).mark_circle(size=100, opacity=1).encode(
    x=alt.X(cols[0], scale=alt.Scale(zero=False)),
    y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
    color="species"
).configure_axis(
    grid=False
).properties(
    title="True data"
)
Question 1A

Write a function make_clf which takes as input a number n and as output returns a DecisionTreeClassifier with max_depth=7, with max_leaf_nodes=n, and which is fit using the data in cols as predictors and the data in the “species” column as target.
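A minimal sketch of one way make_clf could look, assuming the standard scikit-learn import (your version may differ in details):

from sklearn.tree import DecisionTreeClassifier

def make_clf(n):
    # Decision tree limited to depth 7 and at most n leaf nodes
    clf = DecisionTreeClassifier(max_depth=7, max_leaf_nodes=n)
    # Fit using the columns in cols as predictors and species as target
    clf.fit(df[cols], df["species"])
    return clf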
Question 1B

- As in the Tuesday, Week 9, Discussion Section, use tree.plot_tree to make a diagram corresponding to make_clf(2).
- Do the same thing with make_clf(5).
- Regarding the make_clf(5) classifier: if you try to predict the species of a penguin with a bill length of 44mm and a bill depth of 20mm, according to the decision tree diagram, what will the model predict is the probability of that penguin being Adelie, being Chinstrap, and being Gentoo?
- Again regarding the make_clf(5) classifier, check your answer to the previous part using predict_proba.
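A hedged sketch of how these steps might look; the figure size and the filled=True styling are arbitrary choices, not requirements:

import matplotlib.pyplot as plt
from sklearn import tree

clf2 = make_clf(2)
fig = plt.figure(figsize=(20, 10))
tree.plot_tree(clf2, feature_names=cols, class_names=clf2.classes_, filled=True)

clf5 = make_clf(5)
fig = plt.figure(figsize=(20, 10))
tree.plot_tree(clf5, feature_names=cols, class_names=clf5.classes_, filled=True)

# Check the diagram-based answer for bill length 44mm and bill depth 20mm.
# predict_proba expects two-dimensional input; a one-row DataFrame with the
# same column names as the training data avoids a feature-name warning.
clf5.predict_proba(pd.DataFrame([[44, 20]], columns=cols))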
Question 1C

- Write a function draw_dec_bdry which takes as input a number n, and as output returns an Altair chart using 10000 sample points in the region with “bill_length_mm” between 32 and 60 with step size 0.1, and with “bill_depth_mm” between 13 and 22 with step size 0.1 (use NumPy’s arange).
- For this Altair chart, copy the “True data” alt.Chart code from above, except make the following changes: (i) use the simulated data points and the predicted species; (ii) get rid of the size=100 part; (iii) change the title to “Decision boundary with n leaf nodes”, where n gets replaced by its numerical value (use f-strings). For example, when you call draw_dec_bdry(5), the title will be “Decision boundary with 5 leaf nodes”.
Follow the approach from the Week 9 videos, especially this video. The notebook is posted on Deepnote and in the course notes.
Your function should call the make_clf function from above (definitely do not copy any of the make_clf code).
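The exact construction of the sample points is shown in the video; here is one hedged reading, in which the 10000 points are drawn randomly from the arange grids (the column name species_pred is an illustrative choice):

def draw_dec_bdry(n):
    clf = make_clf(n)
    rng = np.random.default_rng()
    # 10000 random points whose coordinates come from the stated arange values
    df_art = pd.DataFrame({
        cols[0]: rng.choice(np.arange(32, 60, 0.1), size=10000),
        cols[1]: rng.choice(np.arange(13, 22, 0.1), size=10000),
    })
    df_art["species_pred"] = clf.predict(df_art)
    return alt.Chart(df_art).mark_circle(opacity=1).encode(
        x=alt.X(cols[0], scale=alt.Scale(zero=False)),
        y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
        color="species_pred"
    ).configure_axis(
        grid=False
    ).properties(
        title=f"Decision boundary with {n} leaf nodes"
    )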
Question 1D

- Show an example of a draw_dec_bdry chart which seems to be underfitting the true data. (You can see the true data at the top of this notebook.)
- Show an example of a draw_dec_bdry chart which seems to be overfitting the true data.
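As a hedged starting point, the particular values of n below are guesses to experiment with, not prescribed answers:

draw_dec_bdry(2)    # very few leaf nodes: likely underfits
draw_dec_bdry(100)  # many leaf nodes: likely overfits
# (run each call in its own cell so both charts display)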
Part 2: Train error and test error curves for decision trees
In this part of the homework, we divide the penguins dataset into a training set and a test set, and then measure the training error and test error (using log loss, also called cross entropy). This is similar to what we did above, except that we use all the numerical columns as predictors, not just bill length and bill depth.
Question 2A

Divide the penguins data into X_train, X_test, y_train, y_test, where the X data comes from df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']], where we use train_size=0.8, and where we use random_state=0.
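A sketch of the usual train_test_split call, assuming (as in the parts above) that the “species” column is the target:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']],
    df["species"],
    train_size=0.8,
    random_state=0,
)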
Question 2B

Adapt the method from the later Week 8 Videos to display the training error and the test error for decision tree classifiers as follows.

- Create empty dictionaries: train_error_dict for training error and test_error_dict for test error.
- Use a for loop with n from 2 (inclusive) to 21 (exclusive).
- Instantiate a DecisionTreeClassifier using a max_depth of 10 and a max_leaf_nodes of n.
- Fit the classifier on the training data.
- Evaluate the log loss on the training data, and include it in the training error dictionary, using n as the key.
- Do the same thing for the test data, including it in the test error dictionary.
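A sketch of the loop under these specifications (log_loss is scikit-learn's cross-entropy implementation):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import log_loss

train_error_dict = {}
test_error_dict = {}

for n in range(2, 21):
    clf = DecisionTreeClassifier(max_depth=10, max_leaf_nodes=n)
    clf.fit(X_train, y_train)
    # log loss compares the true labels with the predicted class probabilities
    train_error_dict[n] = log_loss(y_train, clf.predict_proba(X_train))
    test_error_dict[n] = log_loss(y_test, clf.predict_proba(X_test))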
Question 2C
Draw the test error curve and the training error curve, as in the last Week 8 Video. (Make changes to the code where necessary. For example, do not use a domain of (0,2000) for the y-axis.)
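The video's exact plotting code may differ; one hedged way to draw both curves is to reshape the two dictionaries into a long-format DataFrame for Altair:

df_err = pd.DataFrame({
    "n": list(train_error_dict.keys()),
    "train": list(train_error_dict.values()),
    "test": list(test_error_dict.values()),
}).melt(id_vars="n", var_name="dataset", value_name="log_loss")

alt.Chart(df_err).mark_line().encode(
    x="n",
    y="log_loss",
    color="dataset"
)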
Based on the image, for which value(s) of n do you think we are underfitting the data?
Based on the image, for which value(s) of n do you think the decision tree classifier will be the most accurate (on unseen data)?
Part 3: Random Forest
This part of the homework is shorter. We compare the performance of a random forest to the performance of individual decision trees from above.
Question 3A

- Instantiate a RandomForestClassifier, using n_estimators=1000, and specifying a maximum depth of 7 and a maximum number of leaf nodes of 20.
- Fit this classifier on the same X_train, y_train as were created above.
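A sketch under those specifications (no random_state is given in the instructions, so results will vary slightly from run to run; the variable name rfc is an arbitrary choice):

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=1000, max_depth=7, max_leaf_nodes=20)
rfc.fit(X_train, y_train)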
Question 3B

- Evaluate the log loss of this classifier on the test data.
- How does this log loss compare to min(test_error_dict.values())?
- What does the expression “wisdom of crowds” mean, and how is it related to random forests?
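A sketch of the evaluation, assuming the fitted random forest from Question 3A is named rfc:

from sklearn.metrics import log_loss

# Log loss of the random forest on the held-out test data,
# shown next to the best single-tree test error from Part 2
rf_log_loss = log_loss(y_test, rfc.predict_proba(X_test))
rf_log_loss, min(test_error_dict.values())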
Submission
Using the Share & publish link at the top right, enable public sharing in the “Share project” section, and enable Comment privileges. Then submit that link on Canvas.