# Worksheet 13
The goal of this worksheet is to use a Decision Tree classifier to predict whether or not a passenger of the Titanic survived.
Many of the ideas in this worksheet come from this notebook on Kaggle by ZlatanKremonic. The dataset we use comes from a Kaggle competition.
## Feature Engineering
- Load the attached Titanic dataset.
- Using Boolean indexing, remove the rows where the “Embarked” column value is missing. (This should only remove two rows.)
- Drop the “PassengerId” column using the `drop` method, with `drop("PassengerId", axis=???)`. You should probably use the `copy` method to prevent warnings in the next step.
- Add a column “AgeNull” which contains `True` if the value in the “Age” column is missing and contains `False` otherwise. The code to do this is shorter than you might expect: `df["AgeNull"] = df["Age"].isna()`.
- Fill in the missing values in the “Age” column with the median value from that column. Use the pandas Series method `fillna`. (Replace the “Age” column with this new column that does not have any missing values.)
- Add a column “IsFemale” which contains `True` if the value in the “Sex” column is `"female"`. (As with the “AgeNull” column above, you shouldn’t need to use `map` or a for loop or anything like that.)
- Add a column “CabinLetter” which contains the first character in the string from the “Cabin” column. Use the pandas Series method `map`, together with a lambda function and the `na_action` keyword argument so that missing values don’t raise an error. (Look up the documentation for `map` to see what the possible values are for `na_action`.)
- Check your answer. The current DataFrame should have 889 rows and 14 columns. (A sketch of these steps appears after this list.)
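Here is one possible sketch of these steps. The filename `titanic.csv` is an assumption (use the attached file’s actual name); the hints above already use the name `df` for the DataFrame.

```python
# A possible sketch of the feature engineering steps above.
# The filename "titanic.csv" is an assumption; use the attached file's real name.
import pandas as pd

df = pd.read_csv("titanic.csv")

# Boolean indexing: keep only rows where "Embarked" is not missing (removes 2 rows)
df = df[df["Embarked"].notna()]

# Drop "PassengerId"; copy() helps prevent SettingWithCopyWarning in later steps
df = df.drop("PassengerId", axis=1).copy()

# Record which "Age" values were missing, then fill them with the median age
df["AgeNull"] = df["Age"].isna()
df["Age"] = df["Age"].fillna(df["Age"].median())

# True exactly when the "Sex" entry is "female"
df["IsFemale"] = df["Sex"] == "female"

# First character of the "Cabin" string; na_action="ignore" leaves missing values alone
df["CabinLetter"] = df["Cabin"].map(lambda s: s[0], na_action="ignore")

df.shape  # should be (889, 14)
```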
## One-hot encoding
- Use scikit-learn’s `OneHotEncoder` class to perform one-hot encoding on the columns “Embarked” and “CabinLetter”. (Use both columns at once, for example, `encoder.fit(df[["Embarked", "CabinLetter"]])`.)
- Include the transformed columns in the DataFrame. Use `encoder.get_feature_names_out()` to get the names of these columns. (One way to do this is sketched after this list.)
- Check your answer. The current DataFrame should still have 889 rows, but now it should have 26 columns.
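Here is a possible sketch, assuming a scikit-learn version recent enough that `OneHotEncoder` treats missing values (the rows with no cabin letter) as their own category:

```python
# A possible sketch; assumes a scikit-learn version recent enough that
# OneHotEncoder treats missing values (rows with no cabin letter) as a category.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
encoder.fit(df[["Embarked", "CabinLetter"]])
arr = encoder.transform(df[["Embarked", "CabinLetter"]]).toarray()

# Wrap the encoded columns in a DataFrame with a matching index, then attach them
encoded = pd.DataFrame(arr, columns=encoder.get_feature_names_out(), index=df.index)
df = pd.concat([df, encoded], axis=1)

df.shape  # should be (889, 26)
```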
## Splitting the data with `train_test_split`
- Make a list `features` containing the names of all the numeric columns in the DataFrame except for the “Survived” column. (Use a list comprehension together with `is_numeric_dtype` from `pandas.api.types`. You can do this all at once, or first make a list and then get rid of the “Survived” entry using the Python list method `remove`. Notice that `remove` changes the list in place.)
- Divide the data into a training set and a test set using `train_test_split`. For the input features, use `df[features]`. For the target, use the “Survived” column. For the size, use `train_size` to specify that we should use 60% of the rows for the training set. Name the resulting objects `X_train`, `X_test`, `y_train`, `y_test`. (A sketch of both steps follows this list.)
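A sketch of both steps, assuming `df` from the previous sections:

```python
from pandas.api.types import is_numeric_dtype
from sklearn.model_selection import train_test_split

# All numeric columns (Boolean columns count as numeric) except the target
features = [c for c in df.columns if is_numeric_dtype(df[c]) and c != "Survived"]

# 60% of the rows go to the training set
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Survived"], train_size=0.6
)
```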
## Predicting survival using a decision tree
- Instantiate a `DecisionTreeClassifier` object `clf`. Include restrictions on the complexity of the tree using the keyword arguments `max_leaf_nodes` and/or `max_depth` when you instantiate the classifier. (There is more information on what we are looking for a few steps below.)
- Fit the classifier using `X_train` and `y_train`.
- Experiment with different values of `max_leaf_nodes` and/or `max_depth` until you have a tree which seems to be performing well (say, over 80% accuracy on the test set, as calculated using `clf.score`) and which does not seem to be drastically overfitting the data (say, the accuracy on the training set should be within 5% of the accuracy on the test set). Be sure you never call the `fit` method with the test set; you should only use the `predict` method or the `score` method with the test set. (A sketch of this workflow appears after this list.)
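A minimal sketch of this workflow; the particular values of `max_leaf_nodes` and `max_depth` below are placeholder starting points to experiment from, not recommended answers.

```python
from sklearn.tree import DecisionTreeClassifier

# Placeholder complexity restrictions; experiment with these values
clf = DecisionTreeClassifier(max_leaf_nodes=10, max_depth=5)
clf.fit(X_train, y_train)

# Compare training and test accuracy; fit is never called on the test set
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))
```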
**Comment.** We will later see a more refined way to detect overfitting, using `log_loss` instead of `score`.
- When you have values of `max_leaf_nodes` and/or `max_depth` that seem to be working well, try running the code again, beginning with the `train_test_split` step. This will give new training and testing data. The new tree should continue to work well.
- Create a pandas Series using the classifier’s `feature_importances_` attribute as the Series values, and using the `feature_names_in_` attribute as the Series index. Sort the values of this pandas Series from largest to smallest using the `sort_values` method and the `ascending` keyword argument. (A sketch follows this list.)
- Which features seem to be the most relevant to predicting survival?
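A sketch of the feature-importances step, assuming `clf` has already been fit:

```python
import pandas as pd

# Importance of each input feature, labeled by feature name
importances = pd.Series(clf.feature_importances_, index=clf.feature_names_in_)
importances.sort_values(ascending=False)  # largest importance first
```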
**Comment.** Honestly the one-hot encoded columns do not seem very useful in the examples I have tried, so don’t be surprised if most or all of those show up with an importance of 0.
## Saving the results
- When you have a classifier that is working well (i.e., is getting good results on the test set and not severely overfitting the training data), save the resulting `DecisionTreeClassifier` object `clf` as a pickle file named `wkst13-ans.pickle`, as in the Worksheet 1 instructions, and upload that pickle file to Canvas.
- If you want to check that your object saved correctly, you can use the following code. Then make sure the resulting `clf_saved` object seems correct (for example, it should perform well on the test data).
```python
import pickle

with open("wkst13-ans.pickle", "rb") as f:
    clf_saved = pickle.load(f)
```
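For reference, the saving step itself might look like the following sketch, assuming the Worksheet 1 instructions use the standard `pickle.dump` pattern (check those instructions for the exact convention):

```python
# A possible sketch of the saving step; see the Worksheet 1 instructions
# for the exact convention expected.
import pickle

with open("wkst13-ans.pickle", "wb") as f:
    pickle.dump(clf, f)
```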
## Reminder
Every group member needs to submit this on Canvas (even if you all submit the same file).
## Submission
Submit the pickle file on Canvas, as described in the instructions above.