Week 9, Tuesday Discussion

Reminders:

  • Homework #7 due tonight at 11:59pm

  • Course evaluations are now open :)

Today:

  • Pass back old quizzes

  • Visualizing and interpreting decision trees

Part 1: Data cleaning

The dataset included in this project contains grades of Math 2B students. Check to see if there are any missing values. Warning: At first there may not appear to be any missing values.

import pandas as pd

df = pd.read_csv("../data/Math2B_grades.csv")
df.head()
Quiz 1 Quiz 2 Midterm 1 Quiz 3 Quiz 4 Midterm 2 Quiz 5 Final exam Webwork Total
0 7 3 29 5 7 22 6 26 29.4 F
1 7 10 43 10 10 41 9 68 73.0 B
2 8 7 32 9 9 42 8 63 67.0 C
3 10 10 47 10 10 47 9 90 75.0 A
4 8 8 42 10 9 46 7 73 74.0 B
df.isna().any().any()
False

Check the dtypes of df. Does anything appear off?

df.dtypes
Quiz 1          int64
Quiz 2         object
Midterm 1      object
Quiz 3          int64
Quiz 4         object
Midterm 2       int64
Quiz 5          int64
Final exam      int64
Webwork       float64
Total          object
dtype: object
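
If you want to see why a column like "Quiz 2" was read in with dtype object, one quick side check (not part of the original worksheet) is to look at its distinct raw values:

# Hypothetical check: list the distinct raw values of an object column;
# any non-numeric strings hiding in the column will show up here.
df["Quiz 2"].unique()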

Try running the following code; this is how we can begin to see that there are certain “bad” values in our data.

pd.to_numeric(df["Midterm 1])

pd.to_numeric(df["Midterm 1"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "ex"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_34199/418289826.py in <module>
----> 1 pd.to_numeric(df["Midterm 1"])

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
    182         try:
    183             values, _ = lib.maybe_convert_numeric(
--> 184                 values, set(), coerce_numeric=coerce_numeric
    185             )
    186         except (ValueError, TypeError):

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "ex" at position 167
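
As an aside (not run in the original notebook), pd.to_numeric also accepts errors="coerce", which turns unparseable strings into NaN instead of raising. That makes it easy to count how many bad entries each grade column contains; a minimal sketch, skipping the letter-grade "Total" column:

# Count the entries in each numeric-looking column that fail to parse.
# ("Total" holds letter grades, so we skip it.)
for c in df.columns.drop("Total"):
    print(c, pd.to_numeric(df[c], errors="coerce").isna().sum())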

Discuss: Why haven’t we had to use the na_values keyword argument recently?

Using na_values and dropna(), remove the bad values from our data. Check the dtypes again and think about why it’s not a problem that “Total” is not numeric.

df = pd.read_csv("../data/Math2B_grades.csv",na_values = "ex").dropna()
df.dtypes
Quiz 1          int64
Quiz 2        float64
Midterm 1     float64
Quiz 3          int64
Quiz 4        float64
Midterm 2       int64
Quiz 5          int64
Final exam      int64
Webwork       float64
Total          object
dtype: object
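
As a quick sanity check (not in the original notebook), you could compare the number of rows before and after dropping the "ex" values:

# Re-read the raw file and compare row counts; the difference is the
# number of students removed by dropna().
raw = pd.read_csv("../data/Math2B_grades.csv")
(len(raw), len(df))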

Part 2: Decision tree fitting

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()

Our target in this case will be “Total”. Create a list called cols that contains all the column names of df except “Total”. Fit the model with df[cols] as the input and df["Total"] as the target.

cols = [c for c in df.columns if c != "Total"]
clf.fit(df[cols],df["Total"])
DecisionTreeClassifier()

Check the accuracy of the model using clf.score(df[cols], df["Total"]). Why should we be concerned about this result?

clf.score(df[cols],df["Total"])
1.0
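
A perfect score on the training data is a classic warning sign of overfitting. One way to check (a sketch, not part of the original worksheet; the split size, random_state, and the name clf_check are arbitrary choices) is to hold out a test set and compare the two scores:

from sklearn.model_selection import train_test_split

# Hold out 20% of the data; a large gap between the training score and
# the test score indicates overfitting.
X_train, X_test, y_train, y_test = train_test_split(
    df[cols], df["Total"], test_size=0.2, random_state=0
)
clf_check = DecisionTreeClassifier().fit(X_train, y_train)
(clf_check.score(X_train, y_train), clf_check.score(X_test, y_test))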

Here are two ways that we can restrict the tree:

  • Maximum depth

  • Maximum number of leaf nodes

Instantiate a new classifier clf2 that sets max_depth = 5 and max_leaf_nodes = 10, then fit it to our input and target. Finally, check the accuracy.

clf2 = DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10)
clf2.fit(df[cols],df["Total"])
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10)
clf2.score(df[cols], df["Total"])
0.8235294117647058
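
If you want to confirm that the restrictions took effect (a side check; these methods exist in any recent version of scikit-learn), you can ask the fitted tree for its actual depth and leaf count:

# These should be at most 5 and 10, respectively.
(clf2.get_depth(), clf2.get_n_leaves())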

Part 3: Visualizing and interpreting a decision tree

from sklearn import tree
import matplotlib.pyplot as plt

Try running the code below. It corresponds to the overfit tree.

fig = plt.figure(figsize=(200,100))
_ = tree.plot_tree(clf, 
                   feature_names=clf.feature_names_in_,  
                   class_names=clf.classes_,
                   filled=True)
[Figure: plot of the full, overfit decision tree clf]

Now try running the code for the restricted tree.

fig = plt.figure(figsize=(200,100))
_ = tree.plot_tree(clf2, 
                   feature_names=clf2.feature_names_in_,  
                   class_names=clf2.classes_,
                   filled=True)
[Figure: plot of the restricted decision tree clf2]
# This is the situation of a student with a Midterm 1 score of 40 and a Final exam score of 80.
# In the corresponding leaf, 48 of 62 students earned a B, giving a probability of about 77%.
48/62
0.7741935483870968
student = pd.Series(0,index=cols)
student[["Midterm 1","Final exam"]] = [40,80]
student
Quiz 1         0
Quiz 2         0
Midterm 1     40
Quiz 3         0
Quiz 4         0
Midterm 2      0
Quiz 5         0
Final exam    80
Webwork        0
dtype: int64
clf2.predict_proba([student.to_numpy()])
/Users/christopherdavis/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but DecisionTreeClassifier was fitted with feature names
  "X does not have valid feature names, but"
array([[0.16129032, 0.77419355, 0.06451613, 0.        , 0.        ]])
clf2.predict([student.to_numpy()])
/Users/christopherdavis/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but DecisionTreeClassifier was fitted with feature names
  "X does not have valid feature names, but"
array(['B'], dtype=object)
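
The UserWarning above appears because the model was fit on a DataFrame (which has column names) but was then given a plain NumPy array. A sketch of one way to avoid it: wrap the student Series in a one-row DataFrame, which carries the column names along.

# A one-row DataFrame built from the Series keeps the feature names,
# so this call should not raise the feature-names warning.
clf2.predict_proba(pd.DataFrame([student]))

Relatedly, the 48/62 computation above comes from the counts displayed in the leaf this student lands in: clf2.apply reports which leaf that is, and clf2.tree_.value stores the per-class counts (or proportions, depending on the scikit-learn version) at each node.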

Don’t worry too much about what “gini” means, but everything else should make sense. A top goal for the class is to be able to read this kind of tree.

Some questions:

  • This diagram quickly shows which features are most important. Which two would you say are the most important? Why do you think Quiz 1 shows up only on the leftmost side, Webwork shows up in the middle, and neither appears on the right side?

  • If a student scored 80 on the Final and 40 on the Midterm, what will the classifier clf2 predict? What are its predicted probabilities? (Can you replicate this using clf2.predict_proba?)

  • Say a tree has depth 3. At most how many classes can it predict?

  • How would you identify a “lucky” or an “unlucky” student from this picture? (A student who got a higher or lower grade than most of their classmates with similar scores.)

  • How many of the features got used by the classifier?

  • What if we train a new classifier without the Final exam among the features? How do things change? Does it become less accurate?

  • How do things change if we use DecisionTreeRegressor instead (say, for predicting final exam scores)? One possible starting point is sketched after this list.
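
Here is one possible starting point for the last two questions (a sketch, not part of the original worksheet; the hyperparameters mirror clf2 and are arbitrary choices): drop "Final exam" from the feature list and fit a DecisionTreeRegressor that predicts it.

from sklearn.tree import DecisionTreeRegressor

# Features: everything except "Total" (the earlier target) and the
# final exam score we now want to predict.
reg_cols = [c for c in cols if c != "Final exam"]

reg = DecisionTreeRegressor(max_depth=5, max_leaf_nodes=10)
reg.fit(df[reg_cols], df["Final exam"])
reg.score(df[reg_cols], df["Final exam"])  # R^2 on the training data

Note that for a regressor, score reports R^2 rather than accuracy, so the result is not directly comparable to clf2's score.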