Worksheet 13
The goal of this worksheet is to use a Decision Tree classifier to predict whether or not a passenger of the Titanic survived.
Many of the ideas in this worksheet come from this notebook on Kaggle by ZlatanKremonic. The dataset we use comes from a Kaggle competition.
Feature Engineering
1. Load the attached Titanic dataset.

2. Using Boolean indexing, remove the rows where the “Embarked” column value is missing. (This should only remove two rows.)

3. Drop the “PassengerId” column using the drop method, with drop("PassengerId", axis=???). You should probably use the copy method to prevent warnings in the next step.

4. Add a column “AgeNull” which contains True if the value in the “Age” column is missing and contains False otherwise. The code to do this is shorter than you might expect: df["AgeNull"] = df["Age"].isna().

5. Fill in the missing values in the “Age” column with the median value from that column. Use the pandas Series method fillna. (Replace the “Age” column with this new column that does not have any missing values.)

6. Add a column “IsFemale” which contains True if the value in the “Sex” column is "female". (As with the “AgeNull” column above, you shouldn’t need to use map or a for loop or anything like that.)

7. Add a column “CabinLetter” which contains the first character of the string in the “Cabin” column. Use the pandas Series method map, together with a lambda function and the na_action keyword argument so that missing values don’t raise an error. (Look up the documentation for map to see what the possible values are for na_action.)

8. Check your answer. The current DataFrame should have 889 rows and 14 columns. (One possible approach to these steps is sketched after this list.)
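For reference, here is a minimal sketch of these steps, assuming the dataset has been loaded into a DataFrame named df (the file name titanic.csv is an assumption; use whatever file is attached to the worksheet):

import pandas as pd

# Hypothetical file name; use the attached dataset.
df = pd.read_csv("titanic.csv")

# Keep only the rows where "Embarked" is not missing.
df = df[df["Embarked"].notna()]

# Drop "PassengerId"; copy() helps avoid SettingWithCopyWarning below.
df = df.drop("PassengerId", axis=1).copy()

# True exactly where "Age" was originally missing.
df["AgeNull"] = df["Age"].isna()

# Replace missing ages with the median age.
df["Age"] = df["Age"].fillna(df["Age"].median())

# True exactly where "Sex" equals "female".
df["IsFemale"] = (df["Sex"] == "female")

# First character of "Cabin"; na_action="ignore" propagates missing values.
df["CabinLetter"] = df["Cabin"].map(lambda s: s[0], na_action="ignore")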
One-hot encoding
1. Use scikit-learn’s OneHotEncoder class to perform one-hot encoding on the columns “Embarked” and “CabinLetter”. (Use both columns at once, for example, encoder.fit(df[["Embarked", "CabinLetter"]]).)

2. Include the transformed columns in the DataFrame. Use encoder.get_feature_names_out() to get the names of these columns. (See the sketch after this list.)

3. Check your answer. The current DataFrame should still have 889 rows, but now it should have 26 columns.
Splitting the data with train_test_split
1. Make a list features containing the names of all the numeric columns in the DataFrame except for the “Survived” column. (Use a list comprehension together with is_numeric_dtype from pandas.api.types. You can do this all at once, or first make a list and then get rid of the “Survived” entry using the Python list method remove. Notice that remove changes the list in place.)

2. Divide the data into a training set and a test set using train_test_split. For the input features, use df[features]. For the target, use the “Survived” column. For the size, use train_size to specify that we should use 60% of the rows for the training set. Name the resulting objects X_train, X_test, y_train, y_test. (See the sketch after this list.)
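A minimal sketch of these two steps, assuming df from the previous sections:

from pandas.api.types import is_numeric_dtype
from sklearn.model_selection import train_test_split

# All numeric columns except the target column "Survived".
features = [c for c in df.columns if is_numeric_dtype(df[c]) and c != "Survived"]

# Use 60% of the rows for the training set.
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Survived"], train_size=0.6
)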
Predicting survival using a decision tree
1. Instantiate a DecisionTreeClassifier object clf. Include restrictions on the complexity of the tree using the keyword arguments max_leaf_nodes and/or max_depth when you instantiate the classifier. (There is more information on what we are looking for a few steps below.)

2. Fit the classifier using X_train and y_train.

3. Experiment with different values of max_leaf_nodes and/or max_depth until you have a tree which seems to be performing well (say, over 80% accuracy on the test set, as calculated using clf.score) and which does not seem to be drastically overfitting the data (say, the accuracy on the training set should be within 5% of the accuracy on the test set). Be sure you never call the fit method with the test set; you should only use the predict method or the score method with the test set. (See the first sketch at the end of this section.)
Comment. We will later see a more refined way to detect overfitting, using log_loss instead of score.
4. When you have values of max_leaf_nodes and/or max_depth that seem to be working well, try running the code again, beginning with the train_test_split step. This will give new training and testing data. The new tree should continue to work well.

5. Create a pandas Series using the classifier’s feature_importances_ attribute as the Series values, and using the feature_names_in_ attribute as the Series index. Sort the values of this pandas Series from largest to smallest using the sort_values method and the ascending keyword argument. (See the second sketch at the end of this section.)

6. Which features seem to be the most relevant to predicting survival?
Comment. Honestly the one-hot encoded columns do not seem very useful in the examples I have tried, so don’t be surprised if most or all of those show up with an importance of 0.
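First sketch: one possible instantiation, followed by the accuracy comparison described above. The particular values max_leaf_nodes=10 and max_depth=5 are placeholders to experiment from, not recommended answers.

from sklearn.tree import DecisionTreeClassifier

# Placeholder complexity restrictions; adjust until the accuracies look good.
clf = DecisionTreeClassifier(max_leaf_nodes=10, max_depth=5)
clf.fit(X_train, y_train)

# A large gap between these two accuracies suggests overfitting.
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))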
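Second sketch: the feature importances Series, assuming clf has been fit as above.

import pandas as pd

# Importances indexed by feature name, sorted from largest to smallest.
importances = pd.Series(clf.feature_importances_, index=clf.feature_names_in_)
importances.sort_values(ascending=False)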
Saving the results
1. When you have a classifier that is working well (i.e., is getting good results on the test set and not severely overfitting the training data), save the resulting DecisionTreeClassifier object clf as a pickle file named wkst13-ans.pickle, as in the Worksheet 1 instructions, and upload that pickle file to Canvas. (A sketch of the saving step appears after the checking code below.)

2. If you want to check that your object saved correctly, you can use the following code. Then make sure the resulting clf_saved object seems correct (for example, it should perform well on the test data).
import pickle

with open("wkst13-ans.pickle", "rb") as f:
    clf_saved = pickle.load(f)
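And a sketch of the corresponding saving step (the Worksheet 1 instructions are the authoritative version):

import pickle

# Write the fitted classifier to disk.
with open("wkst13-ans.pickle", "wb") as f:
    pickle.dump(clf, f)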
Reminder
Every group member needs to submit this on Canvas (even if you all submit the same file).
Submission
Submit the pickle file on Canvas, as described in the instructions above.