Week 9 Monday

I want to go over two short topics (where to find datasets, log loss) before we get to the main topic of the week (decision trees).

Announcements

  • Midterms probably returned Thursday

  • Homework 7 due Tuesday night. Video quizzes due Thursday before discussion.

  • I might be a little late to my 11am-12:30pm Tuesday office hours (I’ll send an email if I know).

Places to find datasets

Reminder: for the course project, you need to find a new dataset (one we haven’t studied in Math 10).

  • Many of our datasets have come from Seaborn. You can see other options by evaluating sns.get_dataset_names().

  • The library vega_datasets has even more datasets, but you need to install it first: !pip install vega_datasets. Then you can use from vega_datasets import data followed by data.list_datasets() to see the options. Once you choose an option, like “iris”, you can load in that data using df = data.iris().

  • Our Spotify dataset and the stock index dataset both originally came from Kaggle (free account required). You can search for datasets in the Datasets section of Kaggle. (Warning: these datasets will generally be less clean than what we usually use, and some might be too big to upload to Deepnote; 100 MB is the maximum upload file size.) Once you find a dataset you like on Kaggle, there are often accompanying Code notebooks you can browse for ideas; please list any you use in the References section of the project.

  • The UCI Machine Learning Repository is probably the most famous collection of datasets in the world.

  • If you have an Excel file (with an extension .xlsx or .xls) instead of a csv file, you can try using pd.read_excel instead of pd.read_csv. That doesn’t always work on its own, but if I first run !pip install openpyxl and then call pd.read_excel, it usually works. It might be easier to just open the file in Excel, save it as a comma-separated csv file there, and then upload that csv file.

  • If it’s time-consuming to load the file (because of, for example, a !pip install command), consider saving the pandas DataFrame as a csv file using df.to_csv; then you will have faster access to that data in later sessions. (A short sketch of this workflow appears after this list.)
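Here is a short sketch (not from lecture) illustrating a few of the tips above. It assumes vega_datasets has already been installed, and the file name cleaned_data.csv is just a placeholder.

import seaborn as sns
sns.get_dataset_names()  # names of the datasets built into Seaborn

# More options from vega_datasets (assumes you already ran !pip install vega_datasets)
from vega_datasets import data
data.list_datasets()     # names of the available vega_datasets
df = data.iris()         # load one of them as a pandas DataFrame

# Save a local copy so later runs can skip the slow loading step
df.to_csv("cleaned_data.csv", index=False)
# ...and in a later session:
# df = pd.read_csv("cleaned_data.csv")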

If you have data-cleaning cells that you want to show in your final project but not execute, you can put them inside triple backticks in a markdown cell, like this.

df.to_csv("cleaned_data.csv", index=False)
more code

A loss function for classification

So far we have only evaluated classification performance using accuracy. Accuracy is not a loss function (for a loss function, smaller values are better, while for the accuracy score, higher values are better). More importantly, accuracy is too coarse a measurement. For example, if our model reports the probabilities [1, 0, 0] or [0.34, 0.33, 0.33], both predictions pick the same class, so they can’t be distinguished using accuracy. One commonly used loss function for classification is log loss, which is also called cross entropy.

Here is the definition of log loss. Assume we have n input-output pairs \((X_1, y_1), \ldots, (X_n, y_n)\), and that our model assigns probability \(\pi_{y_i}(X_i)\) to the true output \(y_i\) for the input \(X_i\). The corresponding log loss is:

\[ \text{log loss} = \frac{1}{n} \sum_{i=1}^n -\log(\pi_{y_i}(X_i)) \]

It takes some time to get comfortable with this formula, but it has some nice properties.

  • If we predict each value perfectly, so the probabilities are 1, then the log loss is 0. (Just like with mean squared error or mean absolute error, if the predictions are perfect, then the error is 0.)

  • If we make the worst possible prediction, assigning probability 0 to the true output, then the log loss is undefined. You can think of it as being infinite. (In the scikit-learn implementation, it will be some relatively big positive number.)
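A quick numerical check of these two properties (a minimal sketch, not from lecture):

import numpy as np
print(-np.log(1.0))                   # perfect prediction: -log(1) is zero (displays as -0.0)
with np.errstate(divide="ignore"):    # silence NumPy's divide-by-zero warning
    print(-np.log(0.0))               # worst prediction: inf (scikit-learn reports a big finite number instead)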

Here is some example data. Say we have three possible outputs, the three classes of penguin, Adelie, Chinstrap, and Gentoo. Say we have three data points:

  • (X_1, Adelie), with predicted probabilities [prob of Adelie 0.8, prob of Chinstrap 0.1, prob of Gentoo 0.1]

  • (X_2, Gentoo), with predicted probabilities [0.1, 0.5, 0.4]

  • (X_3, Adelie), with predicted probabilities [0.6, 0.3, 0.1]

Let’s compute the log loss in this case.

pred_probs = [[0.8, 0.1, 0.1], [0.1, 0.5, 0.4], [0.6, 0.3, 0.1]]
n = 3
import numpy as np
# Take -log of the probability assigned to the true class of each data point
# (Adelie: 0.8, Gentoo: 0.4, Adelie: 0.6), then average.
(1/n)*(-np.log(0.8) - np.log(0.4) - np.log(0.6))
0.5500866356514518
# Alternative: import the log function directly
from numpy import log 
log(2.78)
1.0224509277025455
# Alternative: import log under a different name (just to show the syntax)
from numpy import log as christopher
christopher(10)
2.302585092994046

In practice, we will not usually compute log loss by hand. (The above example was just to get a sense for what the formula means.) Instead, we will usually compute log loss using a function from scikit-learn.

y_true = ["Adelie", "Gentoo", "Adelie"]
pred_probs = [[0.8, 0.1, 0.1], [0.1, 0.5, 0.4], [0.6, 0.3, 0.1]]
from sklearn.metrics import log_loss

The following code will often work, but it doesn’t in this case. How would the log_loss function be able to interpret something like [0.8, 0.1, 0.1]? These are probabilities, but probabilities of which classes? In y_true, only two class names show up.

The error message is helpful in this case.

log_loss(y_true, pred_probs)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
----> 1 log_loss(y_true, pred_probs)

ValueError: y_true and y_pred contain different number of classes 2, 3. Please provide the true labels explicitly through the labels argument. Classes found in y_true: ['Adelie' 'Gentoo']
log_loss(y_true, pred_probs, labels=["Adelie", "Chinstrap", "Gentoo"])
0.5500866356514518

Something I find confusing is that the order of the provided labels keyword argument does not matter. In fact, scikit-learn will always alphabetize them, and the probabilities provided need to be in terms of the alphabetized labels. Luckily, if we are getting the predicted probabilities from something like clf.predict_proba, then the probabilities will automatically be listed in the correct (alphabetized) order.

log_loss(y_true, pred_probs, labels=["Adelie", "Gentoo", "Chinstrap"])
0.5500866356514518
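As a sanity check (this wasn’t in lecture), a fitted scikit-learn classifier stores its class order in the classes_ attribute, and the columns of predict_proba follow that same order. Here is a minimal sketch using a small made-up dataset and LogisticRegression:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
X_small = [[0], [1], [2], [3]]                         # made-up one-feature inputs
y_small = ["Gentoo", "Adelie", "Chinstrap", "Adelie"]  # made-up labels
clf_demo = LogisticRegression().fit(X_small, y_small)
print(clf_demo.classes_)   # sorted alphabetically: ['Adelie' 'Chinstrap' 'Gentoo']
# The columns of clf_demo.predict_proba are in this same order, so they can be
# passed directly to log_loss, together with labels=clf_demo.classes_.
log_loss(y_small, clf_demo.predict_proba(X_small), labels=clf_demo.classes_)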

Decision trees

There are decision trees for both classification and regression. The basic idea in both cases is the same. Divide the input space into different regions, and then make the same prediction in each region.

On the board, we showed an example of deciding whether to take a class based on whether it has a good professor, whether it’s required, and what time it is. If I were doing this example again, maybe I would include a variable for “number of quarters before graduation”, which would influence the “required” property.

We also talked about depth and number of leaf nodes in this context.

For the example in this notebook, we will use a new dataset, about Titanic passengers, taken from a famous Kaggle competition. Most Kaggle competitions include a training set (with labels) and a test set (without labels). Our csv file is the Titanic training set.

import pandas as pd
df = pd.read_csv("../data/titanic_train.csv")

It would be a mistake to run dropna() right away, because notice how the “Cabin” column has many missing values. It would be a waste to delete the 687 rows missing a “Cabin” value (over three quarters of the dataset) just because of a column that we aren’t going to use anyway (since it contains strings).

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
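As an aside (not shown in lecture), another way to count the missing values in each column is to combine isna with sum.

df.isna().sum()   # number of missing values in each column; "Cabin" has by far the most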
from pandas.api.types import is_numeric_dtype
num_cols = [c for c in df.columns if is_numeric_dtype(c)]
num_cols
[]

The mistake above is that c is a string like “PassengerId”, and is_numeric_dtype("PassengerId") will always return False. Instead we want to evaluate is_numeric_dtype(df["PassengerId"]).

num_cols = [c for c in df.columns if is_numeric_dtype(df[c])]
num_cols
['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
df2 = df[num_cols].dropna()
df2.shape
(714, 7)

Here is the list of columns we will use for prediction.

cols = [c for c in num_cols if c != 'Survived']
cols
['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
from sklearn.model_selection import train_test_split

In class I didn’t use the random_state keyword argument, but I use it here so that I can write about the results without worrying that the numbers will change. The outcomes are usually similar either way.

X_train, X_test, y_train, y_test = train_test_split(df2[cols], df2["Survived"], test_size=0.2, random_state=0)
clf.fit(X_train, y_train)
DecisionTreeClassifier()
clf.score(X_train, y_train)
1.0
len(X_train)
571

All 571 training values were predicted correctly! Is that a good sign? No, it’s very strong evidence of overfitting. Let’s check on the test data.

clf.score(X_test, y_test)
0.6503496503496503

Only about 65% accuracy on the test data! That’s not much better than we might have gotten just by predicting that nobody survived.

We can get a finer-grained assessment of the model’s performance by evaluating log_loss. The log loss on the training set is essentially perfect (0).

log_loss(y_train, clf.predict_proba(X_train))
9.992007221626415e-16

But it is nowhere near perfect on the test set. (The overfit tree predicts probabilities of exactly 0 or 1, so each misclassified test point contributes -log of a probability of essentially 0, which shows up as a large number; that is why the log loss below is so big.)

log_loss(y_test, clf.predict_proba(X_test))
12.076495242975763

One of the most dangerous properties of decision trees is that they are very prone to overfitting. (One of the best properties of decision trees is that they provide models whose outputs are very interpretable.)

We can combat overfitting by restricting the depth of the tree or the maximum number of leaf nodes of the tree. Here we restrict both.

clf2 = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=8)
clf2.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=4, max_leaf_nodes=8)

The performance on the training set is worse, but the performance on the test set is better. It is the performance on the test set that is important (because that relates to how the performance will be on new unseen data).

clf2.score(X_train, y_train)
0.7460595446584939

We could probably get the following test accuracy higher by further adjusting these parameters; there might still be some overfitting in this case.

clf2.score(X_test, y_test)
0.6993006993006993
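For comparison (I didn’t compute this in lecture), we could also check the log loss of this restricted tree on the test set. I haven’t recorded the value here, but because the leaves now typically contain a mixture of survivors and non-survivors, the predicted probabilities are less extreme, which tends to make the log loss much smaller than the value we saw for the overfit tree.

log_loss(y_test, clf2.predict_proba(X_test))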

Here is a fun example of displaying the decision tree as text. Yasmeen will go over a much more impressive visualization of a decision tree in Discussion Section on Tuesday.

from sklearn.tree import export_text
r = export_text(clf2, feature_names=cols)
print(r)
|--- Pclass <= 2.50
|   |--- Fare <= 13.65
|   |   |--- class: 0
|   |--- Fare >  13.65
|   |   |--- PassengerId <= 179.50
|   |   |   |--- class: 0
|   |   |--- PassengerId >  179.50
|   |   |   |--- Age <= 42.50
|   |   |   |   |--- class: 1
|   |   |   |--- Age >  42.50
|   |   |   |   |--- class: 1
|--- Pclass >  2.50
|   |--- Age <= 32.50
|   |   |--- SibSp <= 1.50
|   |   |   |--- Age <= 9.50
|   |   |   |   |--- class: 1
|   |   |   |--- Age >  9.50
|   |   |   |   |--- class: 0
|   |   |--- SibSp >  1.50
|   |   |   |--- class: 0
|   |--- Age >  32.50
|   |   |--- class: 0
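If you want a graphical picture of the same tree (this is not necessarily the visualization Yasmeen will show), scikit-learn also provides plot_tree, which draws the tree with Matplotlib. Here is a minimal sketch; the class names "died" and "survived" are just my own labels for the outputs 0 and 1.

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
# Draw the restricted tree, coloring nodes by the majority class
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(clf2, feature_names=cols, class_names=["died", "survived"], filled=True, ax=ax);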