Worksheet 13#

The goal of this worksheet is to use a Decision Tree classifier to predict whether or not a passenger of the Titanic survived.

Many of the ideas in this worksheet come from this notebook on Kaggle by ZlatanKremonic. The dataset we use comes from a Kaggle competition.

Feature Engineering#

  • Load the attached Titanic dataset.

  • Using Boolean indexing, remove the rows where the “Embarked” column value is missing. (This should only remove two rows.)
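
If you're not sure where to start, here is a minimal sketch of this step, assuming your DataFrame is named df (as in the code given for the “AgeNull” step below):

# Keep only the rows where the "Embarked" value is not missing
df = df[df["Embarked"].notna()]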

  • Drop the “PassengerId” column using the drop method, with drop("PassengerId", axis=???). You should probably also call the copy method on the result to prevent warnings in the next step.

  • Add a column “AgeNull” which contains True if the value in the “Age” column is missing and contains False otherwise. The code to do this is shorter than you might expect: df["AgeNull"] = df["Age"].isna().

  • Fill in the missing values in the “Age” column with the median value from that column. Use the pandas Series method fillna. (Replace the “Age” column with this new column that does not have any missing values.)
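
For example, one way to write this step, again assuming the DataFrame is named df:

# Replace the missing ages with the overall median age
df["Age"] = df["Age"].fillna(df["Age"].median())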

  • Add a column “IsFemale” which contains True if the value in the “Sex” column is "female". (As with the “AgeNull” column above, you shouldn’t need to use map or a for loop or anything like that.)
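
As with “AgeNull”, a single Boolean comparison on the whole column is enough; for example:

# True exactly when the "Sex" entry is "female"
df["IsFemale"] = df["Sex"] == "female"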

  • Add a column “CabinLetter” which contains the first character in the string from the “Cabin” column. Use the pandas Series method map, together with a lambda function and the na_action keyword argument so that missing values don’t raise an error. (Look up the documentation for map to see what the possible values are for na_action.)
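
Here is one possible sketch (the lambda function shown is just one way to grab the first character):

# na_action="ignore" leaves missing "Cabin" values as missing instead of raising an error
df["CabinLetter"] = df["Cabin"].map(lambda s: s[0], na_action="ignore")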

  • Check your answer. The current DataFrame should have 889 rows and 14 columns.
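
One quick way to check:

df.shape  # expected: (889, 14)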

One-hot encoding#

  • Use scikit-learn’s OneHotEncoder class to perform one-hot encoding on the columns “Embarked” and “CabinLetter”. (Use both columns at once, for example, encoder.fit(df[["Embarked", "CabinLetter"]]).)

  • Include the transformed columns in the DataFrame. Use encoder.get_feature_names_out() to get the names of these columns.
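
Here is one possible sketch of these two encoding steps; treat it as a starting point rather than the only approach. It converts the transformed data (which is sparse by default) to a dense array and then attaches the new columns with pandas:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
encoder.fit(df[["Embarked", "CabinLetter"]])

# transform returns a sparse matrix by default, so convert it to a dense array
encoded = pd.DataFrame(
    encoder.transform(df[["Embarked", "CabinLetter"]]).toarray(),
    columns=encoder.get_feature_names_out(),
    index=df.index,
)
df = pd.concat([df, encoded], axis=1)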

  • Check your answer. The current DataFrame should still have 889 rows, but now it should have 26 columns.

Splitting the data with train_test_split#

  • Make a list features containing the names of all the numeric columns in the DataFrame except for the “Survived” column. (Use a list comprehension together with is_numeric_dtype from pandas.api.types. You can do this all at once, or you can first make the full list and then remove the “Survived” entry using the Python list method remove. Notice that remove changes the list in place.)
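
For example, a sketch using the two-step approach:

from pandas.api.types import is_numeric_dtype

# All numeric columns (Boolean columns count as numeric here)...
features = [c for c in df.columns if is_numeric_dtype(df[c])]
# ...except for the target column
features.remove("Survived")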

  • Divide the data into a training set and a test set using train_test_split. For the input features, use df[features]. For the target, use the “Survived” column. For the size, use train_size to specify that we should use 60% of the rows for the training set. Name the resulting objects X_train, X_test, y_train, y_test.
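
A sketch of this step:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Survived"], train_size=0.6
)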

Predicting survival using a decision tree#

  • Instantiate a DecisionTreeClassifier object clf. Include restrictions on the complexity of the tree using the keyword arguments max_leaf_nodes and/or max_depth when you instantiate the classifier. (There is more information a few steps below on what we are looking for.)

  • Fit the classifier using X_train and y_train.

  • Experiment with different values of max_leaf_nodes and/or max_depth until you have a tree which seems to perform well (say, over 80% accuracy on the test set, as calculated using clf.score) and which does not seem to be drastically overfitting the data (say, the accuracy on the training set should be within 5% of the accuracy on the test set). Be sure you never call the fit method with the test set; use only the predict method or the score method with the test set. (A sketch of these steps appears below.)
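
Putting the last three steps together, one possible sketch is below; the particular values shown are placeholders to experiment with, not recommended answers:

from sklearn.tree import DecisionTreeClassifier

# Placeholder complexity restrictions; try your own values
clf = DecisionTreeClassifier(max_leaf_nodes=10, max_depth=5)
clf.fit(X_train, y_train)

# These two accuracies should be close if the tree is not badly overfitting
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))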

Comment. We will later see a more refined way to detect overfitting, using log_loss instead of score.

  • When you have values of max_leaf_nodes and/or max_depth that seem to be working well, try running the code again, beginning with the train_test_split step. This will give new training and testing data. The new tree should continue to work well.

  • Create a pandas Series using the classifier’s feature_importances_ attribute as the Series values, and using the feature_names_in_ attribute as the Series index. Sort the values of this pandas Series from largest to smallest using the sort_values method and the ascending keyword argument.
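
For example:

import pandas as pd

# Pair each feature name with its importance, sorted largest first
pd.Series(
    clf.feature_importances_, index=clf.feature_names_in_
).sort_values(ascending=False)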

  • Which features seem to be the most relevant to predicting survival?

Comment. Honestly, the one-hot encoded columns do not seem very useful in the examples I have tried, so don’t be surprised if most or all of them show up with an importance of 0.

Saving the results#

  • When you have a classifier that is working well (i.e., is getting good results on the test set and not severely overfitting the training data), save the resulting DecisionTreeClassifier object clf as a pickle file named wkst13-ans.pickle, as in the Worksheet 1 instructions, and upload that pickle file to Canvas.
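
In case you no longer have the Worksheet 1 instructions handy, saving the object follows the standard pickle pattern; this sketch assumes the same file name as above:

import pickle

with open("wkst13-ans.pickle", "wb") as f:
    pickle.dump(clf, f)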

  • If you want to check that your object saved correctly, you can use the following code. Then make sure the resulting clf_saved object seems correct (for example, it should perform well on the test data).

import pickle

with open("wkst13-ans.pickle", "rb") as f:
    clf_saved = pickle.load(f)

Reminder#

Every group member needs to submit this on Canvas (even if you all submit the same file).

Submission#

Submit the pickle file on Canvas, as described in the instructions above.
