Worksheet 16
Due Tuesday night (instead of the usual Monday due date), because of the Memorial Day holiday.
You are encouraged to work in groups of up to 3 total students, but each student should make their own submission on Canvas. (It’s fine for everyone in the group to have the same upload.)
Put the full names of everyone in your group (even if you’re working alone) here. (This makes grading easier.)
Names:
Introduction
In this worksheet, we will use a random forest classifier to classify handwritten digits. In discussion section on Thursday of Week 7, we got 93% accuracy on this dataset using logistic regression. This is already very good. (We didn’t use a test set at that time, so it is possible there was some overfitting.)
Question 0 - Setting up the workspace
Make sure you are working in a (free) Pro Deepnote workspace; see the Worksheet 1 instructions. It should say “Education” in the lower-left corner. The computations in this worksheet are more memory-intensive than usual.
Question 1 - Loading the data
Load the attached “mnist” dataset and assign it to the variable name df_pre.
Reminder: This dataset contains about half the rows from the “usual” MNIST dataset.
Assign the input features portion of the DataFrame to the variable X and the target portion of the DataFrame to the variable y. (See the Thursday Week 7 notebook on Deepnote.)
The target variable y contains numbers, but it would be a mistake to treat this as a regression problem. This is a classification problem. Why do you think this is a classification problem? Answer in a markdown cell.
Possible hint. The target values should not even be viewed as being ordered, or at least, the ordering is not relevant to our predictions. The target values should have a Nominal data type, in the Altair terminology.
How many bytes is X? Use the getsizeof function from the sys module. Make the number easier to read by using an f-string and including commas: print(???"The pandas DataFrame X is {???:,} bytes."). This is significantly bigger than our usual datasets (but it still probably does not qualify as “big data”).
Does that approximately match what is reported by calling the info method?
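The size check can be sketched on a small synthetic DataFrame. The name X_demo and its shape are assumptions made for illustration; this is not the worksheet's data, so the number printed will differ from the one for X.

```python
import sys

import numpy as np
import pandas as pd

# Hypothetical stand-in for X: 1,000 rows of 784 zero-valued int64 "pixels".
X_demo = pd.DataFrame(np.zeros((1000, 784), dtype=np.int64))

# sys.getsizeof reports the size of the object in bytes.
n_bytes = sys.getsizeof(X_demo)

# The ":," format specifier inserts commas as thousands separators.
message = f"The synthetic DataFrame X_demo is {n_bytes:,} bytes."
print(message)

# For comparison, info prints a rounded "memory usage" line at the end.
X_demo.info()
```

The two numbers are reported in different units (bytes versus a rounded value such as MB), so expect approximate rather than exact agreement.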
Divide X and y into a training set and a test set, X_train, X_test, y_train, y_test, by using train_test_split with a train_size of 0.8 and a random_state of 0.
Check your answer: y_test should be a length 8400 pandas Series. The first three values in y_test should be 3, 6, 9.
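Here is a minimal sketch of the same split on made-up data. The names X_demo and y_demo and the random values are assumptions for illustration only, so the lengths and first values will not match the check above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for X and y: 100 rows, 4 features, labels 0-9.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.integers(0, 256, size=(100, 4)))
y_demo = pd.Series(rng.integers(0, 10, size=100))

# train_size=0.8 keeps 80% of the rows for training;
# random_state=0 makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0
)

print(len(y_train), len(y_test))
print(type(y_test))
```

Because the inputs are pandas objects, the outputs are too, and each row keeps its original index label after shuffling. That is why y_test.iloc[0] (position) and y_test.loc[0] (label) generally refer to different rows.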
Question 2 - Displaying a handwritten digit
We just saw that y_test.iloc[0] corresponds to the digit 3. Display the corresponding handwritten digit from X_test using ax.imshow where ax is a Matplotlib Axes object. Again, see the Week 7 Thursday notebook for a reminder of how to do this.
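The general pattern, sketched here with random pixel values rather than a real digit, is to reshape a length-784 row into a 28-by-28 array and pass it to imshow. The variable names and the "binary" colormap are assumptions for illustration; with a pandas row you would first convert it to a NumPy array, for example with to_numpy.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in for one row of X_test: 784 pixel values.
rng = np.random.default_rng(0)
row = rng.integers(0, 256, size=784)

# Reshape the flat row into a 28-by-28 grid of pixels.
img = row.reshape(28, 28)

# Draw the grid as an image on a Matplotlib Axes object.
fig, ax = plt.subplots()
im = ax.imshow(img, cmap="binary")
```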
Question 3 - Fitting a random forest classifier
Your goal in this question is to get a test accuracy of at least 92% using a random forest classifier.
Import the RandomForestClassifier class from scikit-learn’s ensemble module.
Create an instance of this class and name it rfc (for random forest classifier). Experiment with different values of n_estimators, max_depth and/or max_leaf_nodes. Also use random_state to get reproducible results. Fit the classifier to the training data. Try to find values which yield a test score (rfc.score(X_test, y_test)) of at least 0.92.
Warning. Be sure you are calling fit using the training data, not the full data and not the test data.
Warning. Start with small values and work your way up. If you start with even medium-sized values, the computer may run out of memory and you will have to restart the notebook. The fit step for my values took about 20 seconds.
Comment. This might seem to be performing worse than our logistic regression model, but that is partially because we did not use a test set with our logistic regression model.
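The fit-then-score workflow can be sketched as follows. As an assumption for illustration, this uses scikit-learn's small bundled digits dataset (8-by-8 images) instead of the worksheet's MNIST data, and the hyperparameter values are arbitrary small choices, not recommended answers.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: the bundled 8-by-8 digits dataset stands in for MNIST.
X_demo, y_demo = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0
)

# Deliberately small, illustrative hyperparameter values.
rfc_demo = RandomForestClassifier(
    n_estimators=50, max_leaf_nodes=50, random_state=0
)
rfc_demo.fit(X_train, y_train)  # fit on the training data only

# Compare accuracy on data the model has seen and data it has not.
train_score = rfc_demo.score(X_train, y_train)
test_score = rfc_demo.score(X_test, y_test)
print(f"train accuracy: {train_score:.3f}, test accuracy: {test_score:.3f}")
```

Printing both scores makes it easy to see the gap between training and test accuracy, which is the reason a score computed without a test set can look overly optimistic.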
Question 4 - The individual trees in the random forest
What type of object is rfc.estimators_? (This and the following questions will only work if you have fit rfc in the previous step.)
What is the length of rfc.estimators_?
How does the length of rfc.estimators_ relate to the parameters you used above? Answer in a markdown cell.
What is the type of the zero-th element in rfc.estimators_?
Using a list comprehension, make a list score_list containing clf.score(X_test.values, y_test.values) for each classifier clf in rfc.estimators_. (The individual trees in the random forest are trained without the feature names, presumably to save memory, so that’s why we use, for example, X_test.values instead of X_test.)
What is the maximum value in score_list?
How does this result relate to the expression, “greater than the sum of its parts”, or to the phrase, “the wisdom of crowds”? Answer in a markdown cell.
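A sketch of the per-tree scoring pattern, again on scikit-learn's bundled 8-by-8 digits dataset as an assumed stand-in for MNIST, with arbitrary hyperparameter values:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: the bundled 8-by-8 digits dataset stands in for MNIST.
X_demo, y_demo = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0
)
rfc_demo = RandomForestClassifier(
    n_estimators=25, max_leaf_nodes=50, random_state=0
)
rfc_demo.fit(X_train, y_train)

# Inspect the container of fitted trees and one of its elements.
print(type(rfc_demo.estimators_), len(rfc_demo.estimators_))
print(type(rfc_demo.estimators_[0]))

# Score each individual tree on the test set.
# X_test is already a NumPy array here, so no .values is needed.
score_list = [clf.score(X_test, y_test) for clf in rfc_demo.estimators_]

print(f"best single tree: {max(score_list):.3f}")
print(f"whole forest:     {rfc_demo.score(X_test, y_test):.3f}")
```

Scoring the individual trees directly relies on the labels being the integers 0 through 9, which is the case both here and in the worksheet.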
Question 5 - A DataFrame containing the results
We now return to the full random forest rfc.
Make a pandas DataFrame df containing two columns. The first column should be called “digit” and should contain the values from y_test. The second column should be called “pred” and should contain the values predicted by rfc for the input X_test. (Reality check: df should have 8400 rows and 2 columns.)
Question 6 - Confusion matrix
Begin making a confusion matrix for this DataFrame using the following code. This code consists of two Altair charts, a rectangle chart and a text chart.
Comment: For now, the chart will look a little strange.
import altair as alt
alt.data_transformers.enable('default', max_rows=15000)
c = alt.Chart(df).mark_rect().encode(
    x="digit:N",
    y="pred:N",
)
c_text = alt.Chart(df).mark_text(color="white").encode(
    x="digit:N",
    y="pred:N",
    text="pred"
)
(c+c_text).properties(
    height=400,
    width=400
)
Specify that the color on the rectangle chart should correspond to "count()", and use one of the color schemes that seems appropriate, as in color=alt.Color("count()", scale=alt.Scale(scheme="???")). (Don’t use a categorical scheme and don’t use a cyclical scheme. Try to find a scheme where the differences among the smaller numbers are visible.)
You can also add reverse=True inside alt.Scale if you want the colors to go in the opposite order. Feel free to change the text color from white if it makes it easier to see.
Change the text on the text chart from "pred" to "count()".
Question 7 - Interpreting the confusion matrix
Use the above confusion matrix to answer the following questions in markdown cells (not Python comments).
What is an example of a (true digit, predicted digit) pair that never occurs in the test data? (Hint/warning. Pay attention to the order. The “digit” column corresponds to the true digit.)
What are the two most common mistakes made by the classifier when attempting to classify a true 9 digit?
Do those mistakes seem reasonable? Why or why not?
Try evaluating the following code. Do you see why the pandas Series it displays relates to the confusion matrix?
df.loc[df["digit"] == 9, "pred"].value_counts()
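To see the connection on a scale small enough to check by hand, here is a sketch with a made-up six-row DataFrame. The name df_demo and its values are hypothetical; pd.crosstab is used only as a plain-table counterpart of the Altair chart.

```python
import pandas as pd

# Hypothetical miniature version of df, with made-up values.
df_demo = pd.DataFrame({
    "digit": [9, 9, 9, 9, 4, 4],
    "pred":  [9, 9, 4, 7, 4, 9],
})

# Predictions for the rows whose true digit is 9:
# this is the "digit = 9" column of the confusion matrix.
counts = df_demo.loc[df_demo["digit"] == 9, "pred"].value_counts()
print(counts)

# The full table of counts: rows are predictions, columns are true digits,
# matching the y and x encodings of the Altair chart.
table = pd.crosstab(df_demo["pred"], df_demo["digit"])
print(table)
```

Each value in counts appears again in the column of table labeled 9, and pairs that never occur show up as 0 in the table.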
Question 8 - Feature importances
In general, random forests are more difficult to interpret than decision trees. (There is no corresponding diagram, for example.) But random forests can still be used to identify feature importances.
Call the reshape method on the feature_importances_ attribute of rfc so that it becomes a 28-by-28 NumPy array. Store the result with the variable name arr.
Visualize arr using imshow, like what we used above to display the handwritten digit.
Why do you think the pixels around the perimeter are all the same color?
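The reshape step can be sketched on scikit-learn's bundled digits dataset, used here as an assumed stand-in for MNIST. Its images are 8-by-8, so the importances reshape to 8-by-8 rather than 28-by-28, and the hyperparameter values are arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: the bundled 8-by-8 digits dataset stands in for MNIST.
X_demo, y_demo = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0
)
rfc_demo = RandomForestClassifier(n_estimators=25, random_state=0)
rfc_demo.fit(X_train, y_train)

# One importance per input pixel: 64 values for 8-by-8 images.
importances = rfc_demo.feature_importances_

# Arrange the importances in the same grid layout as the images.
arr_demo = importances.reshape(8, 8)

print(arr_demo.shape)
print(arr_demo.round(3))
```

The resulting array can be passed to imshow exactly like an image of a digit, since it has one entry per pixel position.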
Question 9 - An incorrect digit
Find an example of a digit in the test set that was mis-classified by rfc. (Hint. Start by using df["digit"] != df["pred"]. You could then use Boolean indexing with X_test.index, or you could use NumPy’s nonzero function to find the integer locations where the digits are mis-classified.)
Display the mis-classified handwritten digit, again using imshow.
Display the true value (from y_test or from df["digit"]).
Display the predicted value (using rfc.predict or using df["pred"]).
Does the mistake by our random forest classifier seem reasonable? (Some will, some won’t. Find an example where the mistake seems reasonable.)
Submission
Reminder: everyone needs to make a submission on Canvas.
Reminder: include everyone’s full name at the top, after Names.
Using the Share button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.