Worksheet 16#

Due Tuesday night (instead of the usual Monday due date), because of the Memorial Day holiday.

You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)

Put the full names of everyone in your group (even if you’re working alone) here. (This makes grading easier.)

  • Names:

Introduction#

In this worksheet, we will use a random forest classifier to classify handwritten digits. In discussion section on Thursday of Week 7, we got a 93% accuracy on this dataset using logistic regression. This is already very good. (We didn’t use a test set at that time, so it is possible there was some overfitting.)

Question 0 - Setting up the workspace#

Make sure you are working in a (free) Pro Deepnote workspace; see the Worksheet 1 instructions. The computations are more memory-intensive than usual in this worksheet. It should say “Education” in the lower-left corner.

Question 1 - Loading the data#

  • Load the attached “mnist” dataset and assign it to the variable name df_pre.

Reminder: This dataset contains about half the rows from the “usual” MNIST dataset.

  • Assign the input features portion of the DataFrame to the variable X and the target portion of the DataFrame to the variable y. (See the Thursday Week 7 notebook on Deepnote.)

  • The target variable y contains numbers, but it would be a mistake to treat this as a regression problem. This is a classification problem. Why do you think this is a classification problem? Answer in a markdown cell.

Possible hint. The target values should not even be viewed as ordered; or at least, the ordering is not relevant to our predictions. The target values should have a Nominal data type, in Altair's terminology.

  • How many bytes is X? Use the getsizeof function from the sys module. Make the number easier to read by using an f-string and including commas: print(???"The pandas DataFrame X is {???:,} bytes."). This is significantly bigger than our usual datasets (but it still probably does not qualify as “big data”).

  • Does that approximately match what is reported by calling the info method?

  • Divide X and y into a training set and a test set, X_train, X_test, y_train, y_test, by using train_test_split with a train_size of 0.8 and a random_state of 0.

  • Check your answer: y_test should be a length 8400 pandas Series. The first three values in y_test should be 3, 6, 9.
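The steps above can be sketched as follows. This is a sketch only: the tiny `df_pre` and its column names are stand-ins for the attached "mnist" csv (which you should load instead), and the size printed here will be much smaller than for the real data.

```python
import sys

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the worksheet's df_pre; in the worksheet,
# load the attached "mnist" dataset here instead.
df_pre = pd.DataFrame({"pixel0": [0, 1, 2, 3, 4],
                       "pixel1": [5, 6, 7, 8, 9],
                       "label":  [0, 1, 2, 3, 4]})

X = df_pre.drop("label", axis=1)   # input features
y = df_pre["label"]                # target

# f-string with a comma as the thousands separator
print(f"The pandas DataFrame X is {sys.getsizeof(X):,} bytes.")

# 80/20 split with a fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)
```

With 5 rows and `train_size=0.8`, the split puts 4 rows in the training set and 1 in the test set; the real dataset splits proportionally the same way.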

Question 2 - Displaying a handwritten digit#

  • We just saw that y_test.iloc[0] corresponds to the digit 3. Display the corresponding handwritten digit from X_test using ax.imshow where ax is a Matplotlib Axes object. Again, see the Week 7 Thursday notebook for a reminder of how to do this.
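As a reminder of the pattern, here is a sketch. The `row` below is fake random pixel data standing in for `X_test.iloc[0]`; the assumed layout (one row = 784 pixel values = a 28-by-28 image) matches the MNIST convention used in the Week 7 notebook.

```python
import matplotlib.pyplot as plt
import numpy as np

# Fake stand-in for one row of X_test (in the worksheet, use X_test.iloc[0]).
row = np.random.default_rng(0).integers(0, 256, size=784)

fig, ax = plt.subplots()
# Reshape the flat row into a 28x28 grid and display it in grayscale.
ax.imshow(row.reshape(28, 28), cmap="binary")
```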

Question 3 - Fitting a random forest classifier#

Your goal in this question is to get a test accuracy of at least 92% using a random forest classifier.

  • Import the RandomForestClassifier class from scikit-learn’s ensemble module.

Create an instance of this class and name it rfc (for random forest classifier). Experiment with different values of n_estimators, max_depth, and/or max_leaf_nodes. Also pass random_state to get reproducible results. Fit the classifier to the training data. Try to find values that yield a test score (rfc.score(X_test, y_test)) of at least 0.92.

Warning. Be sure you are calling fit using the training data, not the full data and not the test data.

Warning. Start with small values and work your way up. If you start with even medium-sized values, the computer may run out of memory and you will have to restart the notebook. The fit step for my values took about 20 seconds.

Comment. This might seem to be performing worse than our logistic regression model, but that is partially because we did not use a test set with our logistic regression model.
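The fit-and-score pattern looks like the sketch below. To keep it self-contained, it uses scikit-learn's small built-in `load_digits` dataset (8-by-8 images) as a stand-in for the attached mnist csv, and the parameter values shown are illustrative starting points, not the answer to the question.

```python
from sklearn.datasets import load_digits  # small stand-in for the mnist csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

# Illustrative hyperparameters; tune n_estimators, max_depth, and/or
# max_leaf_nodes yourself on the real dataset.
rfc = RandomForestClassifier(n_estimators=50, max_leaf_nodes=50, random_state=0)
rfc.fit(X_train, y_train)          # fit on the training data only
print(rfc.score(X_test, y_test))   # accuracy on the held-out test data
```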

Question 4 - The individual trees in the random forest#

  • What type of object is rfc.estimators_? (This and the following questions will only work if you have fit rfc in the previous step.)

  • What is the length of rfc.estimators_?

  • How does the length of rfc.estimators_ relate to the parameters you used above? Answer in a markdown cell.

  • What is the type of the zero-th element in rfc.estimators_?

  • Using a list comprehension, make a list score_list containing clf.score(X_test.values, y_test.values) for each classifier clf in rfc.estimators_. (The individual trees in the random forest are trained without the feature names, presumably to save memory, which is why we use, for example, X_test.values instead of X_test.)

  • What is the maximum value in score_list?

  • How does this result relate to the expression, “greater than the sum of its parts”, or to the phrase, “the wisdom of crowds”? Answer in a markdown cell.
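The list-comprehension step can be sketched like this. Again `load_digits` stands in for the real dataset; since `X_test` here is already a NumPy array (no feature names), the `.values` conversion from the worksheet is not needed in this sketch.

```python
from sklearn.datasets import load_digits  # stand-in for the mnist csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)
rfc = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

# rfc.estimators_ is a plain Python list holding one fitted
# DecisionTreeClassifier per n_estimators.
score_list = [clf.score(X_test, y_test) for clf in rfc.estimators_]
print(max(score_list))  # compare this to rfc.score(X_test, y_test)
```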

Question 5 - A DataFrame containing the results#

We now return to rfc.

  • Make a pandas DataFrame df containing two columns. The first column should be called “digit” and should contain the values from y_test. The second column should be called “pred” and should contain the values predicted by rfc for the input X_test. (Reality check: df should have 8400 rows and 2 columns.)
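One way to build such a DataFrame is sketched below (with `load_digits` standing in for the real data, so the row count differs from 8400).

```python
import pandas as pd
from sklearn.datasets import load_digits  # stand-in for the mnist csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)
rfc = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_train, y_train)

# "digit" holds the true values, "pred" holds the model's predictions
df = pd.DataFrame({"digit": y_test, "pred": rfc.predict(X_test)})
print(df.shape)  # (number of test rows, 2)
```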

Question 6 - Confusion matrix#

  • Begin making a confusion matrix for this DataFrame using the following code. This code consists of two Altair charts, a rectangle chart and a text chart.

Comment: For now, the chart will look a little strange.

import altair as alt
alt.data_transformers.enable('default', max_rows=15000)

c = alt.Chart(df).mark_rect().encode(
    x="digit:N",
    y="pred:N",
)

c_text = alt.Chart(df).mark_text(color="white").encode(
    x="digit:N",
    y="pred:N",
    text="pred"
)

(c+c_text).properties(
    height=400,
    width=400
)
  • Specify that the color on the rectangle chart should correspond to "count()", and use one of the color schemes that seems appropriate, as in color=alt.Color("count()", scale=alt.Scale(scheme="???")). (Don’t use a categorical scheme and don’t use a cyclical scheme. Try to find a scheme where the differences among the smaller numbers are visible.)

  • You can also add reverse=True inside alt.Scale if you want the colors to go in the opposite order. Feel free to change the text color from white if it makes it easier to see.

  • Change the text on the text chart from "pred" to "count()".

Question 7 - Interpreting the confusion matrix#

Use the above confusion matrix to answer the following questions in markdown cells (not Python comments).

  • What is an example of a (true digit, predicted digit) pair that never occurs in the test data? (Hint/warning. Pay attention to the order. The “digit” column corresponds to the true digit.)

  • What are the two most common mistakes made by the classifier when attempting to classify a true 9 digit?

  • Does that mistake seem reasonable? Why or why not?

  • Try evaluating the following code. Do you see how the pandas Series it displays relates to the confusion matrix?

df.loc[df["digit"] == 9, "pred"].value_counts()
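On a toy version of `df` (a stand-in, not the worksheet's data), the same expression looks like this; each count matches one cell in the "digit = 9" column of the confusion matrix.

```python
import pandas as pd

# toy stand-in: among true 9s, which predictions occur and how often?
df = pd.DataFrame({"digit": [9, 9, 9, 9, 3], "pred": [9, 4, 9, 7, 3]})
counts = df.loc[df["digit"] == 9, "pred"].value_counts()
print(counts)
```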

Question 8 - Feature importances#

In general, random forests are more difficult to interpret than decision trees. (There is no corresponding diagram, for example.) But random forests can still be used to identify feature importances.

  • Call the reshape method on the feature_importances_ attribute of rfc so that it becomes a 28-by-28 NumPy array. Store the result with the variable name arr.

  • Visualize arr using imshow, like what we used above to display the handwritten digit.

  • Why do you think the pixels around the perimeter are all the same color?
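The reshape step can be sketched as follows. The random data here is a stand-in for the real pixels (so this `arr` will not show the perimeter effect the question asks about); the shape bookkeeping is the point.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# stand-in: random 28x28 "images" flattened to 784 features
rng = np.random.default_rng(0)
X = rng.random((100, 784))
y = rng.integers(0, 10, size=100)

rfc = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# one importance per pixel; reshape the flat array into the image grid
arr = rfc.feature_importances_.reshape(28, 28)
print(arr.shape)  # (28, 28)
```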

Question 9 - An incorrect digit#

  • Find an example of a digit in the test set that was mis-classified by rfc. (Hint. Start by using df["digit"] != df["pred"]. You could then use Boolean indexing with X_test.index, or you could use NumPy’s nonzero function to find the integer locations where the digits are mis-classified.)

  • Display the mis-classified handwritten digit again using imshow.

  • Display the true value (from y_test or from df["digit"]).

  • Display the predicted value (using rfc.predict or using df["pred"]).

  • Does the mistake by our random forest classifier seem reasonable? (Some will, some won’t. Find an example where the mistake seems reasonable.)
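The nonzero approach from the hint can be sketched on a toy `df` (a stand-in for your real one):

```python
import numpy as np
import pandas as pd

# toy stand-in for the worksheet's df of true vs predicted digits
df = pd.DataFrame({"digit": [3, 6, 9, 9], "pred": [3, 6, 9, 4]})

# Boolean Series marking the mistakes
wrong = df["digit"] != df["pred"]

# integer locations of the mis-classified rows, via NumPy's nonzero
(bad_positions,) = np.nonzero(wrong.values)
i = bad_positions[0]
print(df["digit"].iloc[i], df["pred"].iloc[i])  # true digit, predicted digit
```

In the worksheet, each such integer location can then be used with `X_test.iloc` to display the corresponding image.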

Submission#

  • Reminder: everyone needs to make a submission on Canvas.

  • Reminder: include everyone’s full name at the top, after Names.

  • Using the Share button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.