Homework 8

Author: BLANK

Collaborators: BLANK

In this homework, we will use Decision Trees and Random Forests to investigate the penguins dataset.

import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns

alt.data_transformers.enable(max_rows=10000)
df = sns.load_dataset("penguins").dropna()

Part 1: Decision boundaries for decision trees

Here is a visualization of data from the penguins dataset. We use only the two columns specified in cols as our predictors.

cols = ["bill_length_mm", "bill_depth_mm"]
alt.Chart(df).mark_circle(size=100, opacity=1).encode(
    x=alt.X(cols[0], scale=alt.Scale(zero=False)),
    y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
    color="species"
).configure_axis(
    grid=False
).properties(
    title="True data"
)

Question 1A

  • Write a function make_clf which takes a number n as input and returns a DecisionTreeClassifier with max_depth=7 and max_leaf_nodes=n, fit using the data in cols as predictors and the data in the “species” column as target.
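
A minimal sketch of one way to do this, assuming the df and cols defined above (the structure shown here is one option, not a required solution):

from sklearn.tree import DecisionTreeClassifier

def make_clf(n):
    # max_depth=7 caps the depth; max_leaf_nodes=n caps the number of leaves
    clf = DecisionTreeClassifier(max_depth=7, max_leaf_nodes=n)
    # fit on the two predictor columns, with species as the target
    clf.fit(df[cols], df["species"])
    return clf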

Question 1B

  • As in the Tuesday, Week 9, Discussion Section, use tree.plot_tree to make a diagram corresponding to make_clf(2). (A sketch of the mechanics appears after this list.)

  • Do the same thing with make_clf(5).

  • Regarding the make_clf(5) classifier: if you predict the species of a penguin with a bill length of 44mm and a bill depth of 20mm, then according to the decision tree diagram, what probability will the model assign to that penguin being Adelie, being Chinstrap, and being Gentoo?

  • Again regarding the make_clf(5) classifier, check your answer to the previous part using predict_proba.
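
A hedged sketch of the mechanics for both parts (clf5, the figure size, and the feature_names/class_names arguments are illustrative choices, not requirements):

import matplotlib.pyplot as plt
from sklearn import tree

clf5 = make_clf(5)
fig = plt.figure(figsize=(12, 6))
# feature_names and class_names make the diagram easier to read
tree.plot_tree(clf5, feature_names=cols, class_names=clf5.classes_, filled=True)

# predict_proba expects two-dimensional input; a one-row DataFrame with
# the same column names as the training data avoids a feature-name warning
clf5.predict_proba(pd.DataFrame([[44, 20]], columns=cols))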

Question 1C

  • Write a function draw_dec_bdry which takes a number n as input and returns an Altair chart using 10000 sample points in the region with “bill_length_mm” between 32 and 60 and “bill_depth_mm” between 13 and 22, each with step size 0.1 (use NumPy’s arange).

  • For this Altair chart, copy the “True data” alt.Chart code from above, except make the following changes: (i) Use the simulated data points and the predicted species; (ii) Get rid of the size=100 part; (iii) Change the title to “Decision boundary with n leaf nodes”, where n gets replaced by its numerical value (use f-strings), so for example, when you call draw_dec_bdry(5), the title will be “Decision boundary with 5 leaf nodes”.

Follow the approach from the Week 9 videos, especially this video. The notebook is posted on Deepnote and in the course notes.

Your function should call the make_clf function from above (definitely do not copy any of the make_clf code).
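
A sketch of one possible implementation. The full arange grid has 280 × 90 = 25,200 points, so this sketch assumes the intended reading is to sample 10,000 of them (which also respects the max_rows=10000 limit set at the top); the names df_art and "pred", and random_state=0, are illustrative:

def draw_dec_bdry(n):
    clf = make_clf(n)
    # build the grid of artificial points with arange
    xx, yy = np.meshgrid(np.arange(32, 60, 0.1), np.arange(13, 22, 0.1))
    df_art = pd.DataFrame({cols[0]: xx.flatten(), cols[1]: yy.flatten()})
    # sample 10,000 of the grid points (an assumption; see the note above)
    df_art = df_art.sample(10000, random_state=0)
    df_art["pred"] = clf.predict(df_art[cols])
    return alt.Chart(df_art).mark_circle(opacity=1).encode(
        x=alt.X(cols[0], scale=alt.Scale(zero=False)),
        y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
        color="pred"
    ).configure_axis(
        grid=False
    ).properties(
        title=f"Decision boundary with {n} leaf nodes"
    )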

Question 1D

  • Show an example of a draw_dec_bdry chart which seems to be underfitting the true data. (You can see the true data at the top of this notebook.)

  • Show an example of a draw_dec_bdry chart which seems to be overfitting the true data.

Part 2: Training error and test error curves for decision trees

In this part of the homework, we divide the penguins dataset into a training set and a test set, and then measure the training error and test error (using log loss, also called cross entropy). This is similar to what we did above, except that we use all the numerical columns as predictors, not just bill length and bill depth.

Question 2A

  • Divide the penguins data into X_train, X_test, y_train, y_test using train_test_split, where the X data comes from df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']], where the y data comes from the “species” column, and where we use train_size=0.8 and random_state=0.
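
A minimal sketch, assuming the target is the “species” column (as in Part 1):

from sklearn.model_selection import train_test_split

X = df[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
y = df["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)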

Question 2B

Adapt the method from the later Week 8 videos to display the training error and the test error for decision tree classifiers, as follows.

  • Create empty dictionaries, train_error_dict for training error and test_error_dict for test error.

  • Use a for loop with n from 2 (inclusive) to 21 (exclusive).

  • Instantiate a DecisionTreeClassifier using a max_depth of 10 and a max_leaf_nodes of n.

  • Fit the classifier on the training data.

  • Evaluate the log loss on the training data, and include it in the training error dictionary, using n as the key.

  • Do the same thing for the test data, including it in the test error dictionary.
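
Putting those steps together, a hedged sketch (passing labels=clf.classes_ to log_loss is a precaution added here, not something the problem statement requires):

from sklearn.metrics import log_loss
from sklearn.tree import DecisionTreeClassifier

train_error_dict = {}
test_error_dict = {}

for n in range(2, 21):
    clf = DecisionTreeClassifier(max_depth=10, max_leaf_nodes=n)
    clf.fit(X_train, y_train)
    # log loss compares the true labels to the predicted probabilities
    train_error_dict[n] = log_loss(y_train, clf.predict_proba(X_train), labels=clf.classes_)
    test_error_dict[n] = log_loss(y_test, clf.predict_proba(X_test), labels=clf.classes_)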

Question 2C

  • Draw the test error curve and the training error curve, as in the last Week 8 video. (Make changes to the code where necessary; for example, do not use a domain of (0, 2000) for the y-axis. One possible approach is sketched after this list.)

  • Based on the image, for which value(s) of n do you think we are underfitting the data?

  • Based on the image, for which value(s) of n do you think the decision tree classifier will be the most accurate (on unseen data)?
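
One way to turn the two dictionaries into a chart; the long-format DataFrame and the column names "n", "error", and "set" are assumptions of this sketch, not the only layout that works:

df_err = pd.concat([
    pd.DataFrame({"n": list(train_error_dict.keys()),
                  "error": list(train_error_dict.values()),
                  "set": "train"}),
    pd.DataFrame({"n": list(test_error_dict.keys()),
                  "error": list(test_error_dict.values()),
                  "set": "test"}),
])

alt.Chart(df_err).mark_line().encode(
    x="n:O",
    y="error",
    color="set"
)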

Part 3: Random Forest

This part of the homework is shorter. We compare the performance of a random forest to the performance of individual decision trees from above.

Question 3A

  • Instantiate a RandomForestClassifier, using n_estimators=1000, and specifying a maximum depth of 7 and a maximum number of leaf nodes of 20.

  • Fit this classifier on the same X_train, y_train as were created above.
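
A minimal sketch (the name rfc is illustrative; a random forest is randomized, so the log loss in 3B may vary from run to run unless you set a random_state, which the problem does not require):

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=1000, max_depth=7, max_leaf_nodes=20)
rfc.fit(X_train, y_train)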

Question 3B

  • Evaluate the log loss of this classifier on the test data. (A sketch of the computation follows this list.)

  • How does this log loss compare to min(test_error_dict.values())?

  • What does the expression “wisdom of crowds” mean, and how is it related to random forests?
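
A sketch of the computation from the first two bullets, assuming the rfc from 3A and the test_error_dict from 2B (rf_log_loss is an illustrative name):

from sklearn.metrics import log_loss

rf_log_loss = log_loss(y_test, rfc.predict_proba(X_test), labels=rfc.classes_)
# compare the forest's test error to the best single-tree test error
rf_log_loss, min(test_error_dict.values())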

Submission

Using the Share & publish link at the top right, enable public sharing in the “Share project” section, and enable Comment privileges. Then submit that link on Canvas.