  • Visualizing and interpreting decision trees

Part 1: Data cleaning

The dataset included in this project contains grades of Math 2B students. Check to see if there are any missing values. Warning: At first there may not appear to be any missing values.

import pandas as pd

df = pd.read_csv("../data/Math2B_grades.csv")
   Quiz 1  Quiz 2  Midterm 1  Quiz 3  Quiz 4  Midterm 2  Quiz 5  Final exam  Webwork  Total
0       7       3         29       5       7         22       6          26     29.4      F
1       7      10         43      10      10         41       9          68     73.0      B
2       8       7         32       9       9         42       8          63     67.0      C
3      10      10         47      10      10         47       9          90     75.0      A
4       8       8         42      10       9         46       7          73     74.0      B

Check the dtypes of df. Does anything appear off?

Quiz 1          int64
Quiz 2         object
Midterm 1      object
Quiz 3          int64
Quiz 4         object
Midterm 2       int64
Quiz 5          int64
Final exam      int64
Webwork       float64
Total          object
dtype: object

Try running the following code; this is how we can begin to see that there are certain “bad” values in our data.

pd.to_numeric(df["Midterm 1"])
Discuss: Why haven’t we had to use the na_values keyword argument recently?
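
One way to locate the bad values without reading an error message is to coerce instead of raise. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical stand-in for the grades file, and the "ex" entry is assumed to be the placeholder used there). It shows that `isna()` sees nothing wrong, while `pd.to_numeric` with `errors="coerce"` exposes the offending strings.

```python
import pandas as pd

# Toy stand-in for the grades data; "ex" plays the role of a hidden missing value
toy = pd.DataFrame({"Quiz 1": [7, 8, 10], "Midterm 1": ["29", "ex", "47"]})

# isna() reports no missing values, because "ex" is an ordinary string
n_missing = int(toy.isna().sum().sum())

# errors="coerce" converts anything non-numeric to NaN instead of raising
coerced = pd.to_numeric(toy["Midterm 1"], errors="coerce")

# The rows that became NaN tell us which original strings were "bad"
bad_values = toy.loc[coerced.isna(), "Midterm 1"].unique().tolist()
print(n_missing, bad_values)
```

The same idea could be applied column by column to the real data to confirm which placeholder strings need to go into na_values.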

Using na_values and dropna(), remove the bad values from our data. Check the dtypes again and think about why it’s not a problem that “Total” is not numeric.

df = pd.read_csv("../data/Math2B_grades.csv", na_values="ex").dropna()
Quiz 1          int64
Quiz 2        float64
Midterm 1     float64
Quiz 3          int64
Quiz 4        float64
Midterm 2       int64
Quiz 5          int64
Final exam      int64
Webwork       float64
Total          object
dtype: object
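
To see why the cleaned columns come out as float64 rather than int64, here is a minimal sketch on an in-memory CSV (the text in `csv_text` is made up for illustration). The placeholder is read as NaN, NaN forces the column to a float dtype, and dropna() removes the rows afterwards without converting back to integers.

```python
import io
import pandas as pd

# A tiny made-up CSV with two "ex" placeholders
csv_text = "Quiz 2,Midterm 1,Total\n3,29,F\nex,43,B\n7,ex,C\n10,47,A\n"

# Without na_values, the affected columns are read as strings
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, "ex" becomes NaN and dropna() removes those rows
clean = pd.read_csv(io.StringIO(csv_text), na_values="ex").dropna()

print(raw.dtypes)
print(clean.dtypes)
print(len(raw), len(clean))
```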

Part 2: Decision tree fitting

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()

Our target in this case will be “Total”. Create a list called cols that contains all the column names of df except “Total”. Fit the model with df[cols] as the input and df["Total"] as the target.

cols = [c for c in df.columns if c != "Total"]
clf.fit(df[cols], df["Total"])

Check the accuracy of the model using clf.score(df[cols], df["Total"]). Why should we be concerned about this result?
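
One common way to investigate a suspiciously high score is to hold out some rows and score on those instead. The sketch below is not part of the worksheet: it uses synthetic scores and labels (`X`, `y`, and the split sizes are all assumptions) to show an unrestricted tree memorizing its training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))  # synthetic "scores", not the real grades
# Noisy labels, so no tree can be perfect on unseen rows for the right reasons
y = np.where(X[:, 2] + rng.normal(0, 15, size=200) > 50, "pass", "fail")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# No depth or leaf limit: the tree keeps splitting until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = full_tree.score(X_train, y_train)
test_acc = full_tree.score(X_test, y_test)
print(train_acc, test_acc)
```

A perfect training score paired with a noticeably lower held-out score is the usual sign of overfitting.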


Here are two ways that we can restrict the tree:

  • Maximum depth

  • Maximum number of leaf nodes

Instantiate a new classifier clf2 that sets max_depth = 5 and max_leaf_nodes = 10, then fit it to our input and target. Finally, check the accuracy.

clf2 = DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10)
clf2.fit(df[cols], df["Total"])
clf2.score(df[cols], df["Total"])
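
To confirm that the two restrictions really took effect, a fitted tree can report its own size. This is a sketch on synthetic data (the arrays `X` and `y` are made up); `get_depth()` and `get_n_leaves()` are methods of the fitted scikit-learn classifier.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(300, 4))  # synthetic scores
y = np.where(X[:, 0] + rng.normal(0, 20, size=300) > 50, "pass", "fail")

full = DecisionTreeClassifier(random_state=0).fit(X, y)
small = DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10, random_state=0).fit(X, y)

# Compare the size of the unrestricted tree with the restricted one
print(full.get_depth(), full.get_n_leaves())
print(small.get_depth(), small.get_n_leaves())
```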

Part 3: Visualizing and interpreting a decision tree

from sklearn import tree
import matplotlib.pyplot as plt

Try running the code below. It corresponds to the overfit tree.

fig = plt.figure(figsize=(200,100))
_ = tree.plot_tree(clf, 

Now try running the code for the restricted tree.

fig = plt.figure(figsize=(200,100))
_ = tree.plot_tree(clf2, 
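
For reference, here is a self-contained sketch of a complete plot_tree call on a small synthetic tree. The keyword arguments chosen here (feature_names, class_names, filled) are one reasonable selection and are not necessarily the ones used in the cells above; the data and the name `small` are made up.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
feature_names = ["Midterm 1", "Final exam"]  # column names borrowed from the worksheet
X = rng.uniform(0, 100, size=(150, 2))       # synthetic scores
y = np.where(X.sum(axis=1) > 100, "pass", "fail")

small = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

fig, ax = plt.subplots(figsize=(12, 6))
annotations = tree.plot_tree(
    small,
    feature_names=feature_names,                   # label splits with column names
    class_names=[str(c) for c in small.classes_],  # label nodes with the majority class
    filled=True,                                   # color nodes by majority class
    ax=ax,
)
```

With filled=True, darker boxes correspond to purer nodes, which makes the confident leaves easy to spot.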
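# A student who scored 40 on Midterm 1 and 80 on the Final exam, with 0 everywhere else.
# The restricted tree gives this student a probability of about 77% of a B.
student = pd.Series(0, index=cols)
student[["Midterm 1", "Final exam"]] = [40, 80]
student
clf2.predict_proba(student.to_frame().T)
clf2.predict(student.to_frame().T)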
Quiz 1         0
Quiz 2         0
Midterm 1     40
Quiz 3         0
Quiz 4         0
Midterm 2      0
Quiz 5         0
Final exam    80
Webwork        0
dtype: int64
array([[0.16129032, 0.77419355, 0.06451613, 0.        , 0.        ]])
array(['B'], dtype=object)
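
The probability array is easier to read once each entry is paired with its class label. The columns of predict_proba follow the order of the classifier's classes_ attribute, so the two can be zipped together. The sketch below uses synthetic data and a hypothetical classifier named `model`, not clf2.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
cols = ["Midterm 1", "Final exam"]
X = pd.DataFrame(rng.uniform(0, 100, size=(200, 2)), columns=cols)
total = (X["Midterm 1"] + X["Final exam"]).to_numpy() + rng.normal(0, 10, size=200)
y = np.select([total > 120, total > 80], ["A", "B"], default="C")  # synthetic letters

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# A one-row DataFrame keeps the column names the model was fitted with
student = pd.DataFrame([[40, 80]], columns=cols)
proba = model.predict_proba(student)[0]

# Pair each probability with its class label
by_class = dict(zip(model.classes_, proba))
predicted = model.predict(student)[0]
print(by_class, predicted)
```

The predicted class is simply the label with the largest probability in the leaf the student lands in.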

Don’t worry much about what “gini” means, but everything else should make sense. The main goal for this class is to be able to read this kind of tree.
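
For the curious, Gini impurity is 1 minus the sum of the squared class proportions in a node, so 0 means the node is pure. The sketch below is optional. It assumes a leaf with 31 students split 5/24/2 across the first three grades, which is the simplest set of counts matching the probabilities printed above (5/31 ≈ 0.161, 24/31 ≈ 0.774, 2/31 ≈ 0.065); the actual leaf counts are not shown in this worksheet.

```python
def gini(counts):
    """Gini impurity 1 - sum(p_i ** 2) for a list of per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Assumed counts for the leaf reached by the example student
leaf_counts = [5, 24, 2, 0, 0]
leaf_gini = gini(leaf_counts)
print(round(leaf_gini, 3))

# Reference points: a pure node and an evenly split two-class node
print(gini([10, 0]), gini([5, 5]))
```

Under that assumption the leaf has a Gini value of about 0.37, between a pure node (0) and an even two-way split (0.5).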

Some questions:

  • This diagram shows quickly what features are the most important. Which two would you say are the most important? Why do you think Quiz 1 shows up only on the leftmost side, Webwork shows up in the middle, but neither on the right side?

  • If a student scored 80 on the Final and 40 on the Midterm, what will the classifier clf2 predict? What would be its predicted probabilities? (Can you replicate this using clf2.predict_proba?)

  • Say a tree has depth 3. How many classes at maximum can it predict?

  • How would you identify a “lucky” or an “unlucky” student from this picture? (A student who got a higher or lower grade than most of their classmates with similar scores.)

  • How many of the features got used by the classifier?

  • What if we train a new classifier without the Final exam among the features? How do things change? Does it get less accurate?

  • How do things change if we use DecisionTreeRegressor instead (say for predicting final exam scores)?
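
For the question about how many features got used, the picture can be cross-checked in code: a feature that never appears in a split has importance exactly 0 in feature_importances_. The sketch below uses synthetic data in which the letter grade depends on one column only; the names `model` and `used` are hypothetical and the column list is shortened for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
cols = ["Quiz 1", "Midterm 1", "Final exam", "Webwork"]
X = pd.DataFrame(rng.uniform(0, 100, size=(300, 4)), columns=cols)

# Synthetic rule: the letter depends on the final exam and nothing else
final = X["Final exam"].to_numpy()
y = np.select([final > 80, final > 60], ["A", "B"], default="C")

model = DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10, random_state=0).fit(X, y)

# Features that were never split on have importance exactly 0
importances = pd.Series(model.feature_importances_, index=cols)
used = importances[importances > 0].index.tolist()
print(importances)
print(used)
```

Running the same two lines on clf2 and cols would list the features the restricted tree actually splits on.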