Homework 6

Author: BLANK

Collaborators: BLANK

Logistic regression on the MNIST dataset

In this homework, we will use logistic regression to classify handwritten digits.

We first load the MNIST data using scikit-learn's fetch_openml function by evaluating the cell below. It will probably take about one minute to execute. (Warning. When I tried loading the data twice, I ran out of memory, so try to evaluate this cell only once per session.)

from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)

Question 1

What is the type of mnist? (Hint. It’s not a type we have worked with before.)
You can find the different attributes of mnist by evaluating dir(mnist). For example, mnist.data should be thought of as the input data; it is a pandas DataFrame with 70000 rows and 784 columns.
Which attribute do you think holds the output data? Check your answer. It should be a pandas Series of size 70000.
Use X = mnist.data to save the input data with the variable name X, and similarly save the output data with the variable name y. (The convention is to use a capital letter for the input and a lower-case letter for the output, I think as a reminder that the input is a two-dimensional pandas DataFrame whereas the output is a one-dimensional pandas Series.)

Question 2

Define A to be the second row of X. (Start counting at 0. Use iloc, not loc, even though they will both give the same answer in this case.)
Convert A to a NumPy array using the to_numpy() method.
Check your answer: A should now be a NumPy array of shape (784,) (that is how Python displays a tuple of length 1).
Notice that \(784 = 28^2\). Reshape A to have shape (28,28) by using the reshape method. (You can call help(A.reshape) to find out what the input should look like.)
Import matplotlib.pyplot as plt and then use the following code to visualize A.

fig, ax = plt.subplots()
ax.imshow(A)

Use the keyword argument cmap with the value of "binary" when calling imshow so that the image is displayed in black-and-white. You can also use "binary_r" if you want the colors reversed.
What is the corresponding value in y? Does it match what you expected?
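
The same row-to-image steps can be tried on a small made-up example before running them on MNIST. This is only a sketch: X_toy, row, arr and img are hypothetical names, and the 16 "pixels" are random numbers rather than a real digit.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for X: 3 "images" with 16 pixels each.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.integers(0, 256, size=(3, 16)))

row = X_toy.iloc[1]        # select a row by integer position; returns a Series
arr = row.to_numpy()       # one-dimensional NumPy array
print(arr.shape)           # a tuple of length 1

img = arr.reshape(4, 4)    # 16 = 4**2, so the row becomes a 4x4 grid
print(img.shape)

fig, ax = plt.subplots()
ax.imshow(img, cmap="binary")  # "binary_r" would reverse the colors
```

With the real data, the only changes should be the row you select and the target shape.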

Question 3

Divide X into X_train and X_test and y into y_train and y_test using train_test_split. Specify that the test_size should be 20% of the total data.
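
Here is a minimal sketch of how train_test_split behaves on a toy DataFrame and Series. The names X_toy and y_toy and the random_state value are assumptions made for this illustration; they are not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows of inputs and 100 matching outputs.
X_toy = pd.DataFrame(np.arange(200).reshape(100, 2), columns=["a", "b"])
y_toy = pd.Series(np.arange(100) % 2)

# test_size=0.2 holds out 20% of the rows; random_state only makes the
# shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print(len(X_train), len(X_test))
# The rows are shuffled, but inputs and outputs stay matched by index.
print((X_test.index == y_test.index).all())
```

Notice that the test pieces keep their original index labels, which is why loc and iloc can give different answers on them.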

Question 4

This is a fairly big dataset, so we will change some default values when instantiating the LogisticRegression object. You will need to look up the keywords to use in the scikit-learn documentation for LogisticRegression.

Name the LogisticRegression object clf (for “classifier”).
Use a tolerance of 0.1. (This is larger than the default value, so it lets the fitting process stop earlier.)
For the solver, use one of the two solvers that the documentation mentions as a good choice for large datasets.
If necessary, you can increase the maximum number of iterations, but I didn’t need to.
Fit a logistic regression model using the training data.
How many total parameters are held in the clf.coef_? First convert clf.coef_ to a one-dimensional NumPy array using reshape(-1), then check the size.
What proportion of those parameters are at least 0.0001? At least 0.01? First make a Boolean array, then call the sum method to count the True entries, and divide by the total number of parameters. (I think it is almost always better to use arr.sum() and not to use sum(arr).) When I tried this, my answers were approximately 0.3 and 0, respectively.
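
The steps above can be sketched on a small synthetic dataset, which fits in well under a second. Everything here is an assumption for illustration: the data comes from make_classification rather than MNIST, the names clf_toy, flat and big are hypothetical, and "saga" is used as one possible solver choice, so the numbers will not match the MNIST answers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: 300 samples, 20 features, 3 classes.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# A larger tolerance lets the solver stop earlier.
clf_toy = LogisticRegression(solver="saga", tol=0.1, max_iter=100)
clf_toy.fit(X_toy, y_toy)

# coef_ has one row per class and one column per feature.
flat = clf_toy.coef_.reshape(-1)
print(flat.size)  # 3 classes * 20 features

# Boolean array, then count the True entries and divide by the total.
big = flat >= 0.01
prop = big.sum() / flat.size
print(prop)
```

Dividing big.sum() by flat.size gives the same number as big.mean().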

Question 5

Evaluate clf.score (i.e., the prediction accuracy) on the training set.
Compute this same number a second way, by calling mean() on a suitable Boolean Series.
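
To see why the two computations agree, here is a sketch on synthetic data (not MNIST). The names X_toy, y_toy, clf_toy and matches are hypothetical and used only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data, wrapped in pandas objects like X and y.
X_arr, y_arr = make_classification(n_samples=200, n_features=10, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"x{i}" for i in range(10)])
y_toy = pd.Series(y_arr)

clf_toy = LogisticRegression().fit(X_toy, y_toy)

# First way: the built-in accuracy.
acc_score = clf_toy.score(X_toy, y_toy)

# Second way: True where the prediction equals the true label.
matches = clf_toy.predict(X_toy) == y_toy  # Boolean Series
acc_mean = matches.mean()                  # True counts as 1, False as 0

print(acc_score, acc_mean)
```

The mean of a Boolean Series is the fraction of True entries, which is exactly the proportion of correct predictions.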

Question 6

Evaluate clf.score on the test set.
Does the result suggest that overfitting is a concern in this case? (This is a difficult question to make precise. In general, if the two accuracies are similar, say within 5% of each other, then I would say that overfitting is probably not a serious concern. And if the test accuracy is higher than the training accuracy, then overfitting is definitely not a concern.)

Question 7

Make a pandas DataFrame df containing two columns. The first column should be called “Digit” and should contain the values from y_test. The second column should be called “Pred” and contain the predicted values corresponding to X_test. (Reality check: df should have 14000 rows and 2 columns.)
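
One way to build such a DataFrame is from a dictionary, sketched here with four made-up labels. The names y_test_toy, pred_toy and df_toy are hypothetical; the shuffled index imitates what train_test_split produces.

```python
import numpy as np
import pandas as pd

# Hypothetical true labels with a shuffled index, as after a train/test split.
y_test_toy = pd.Series(["3", "9", "4", "9"], index=[17, 2, 40, 8])

# Predictions come back from predict as a plain NumPy array (no index).
pred_toy = np.array(["3", "4", "4", "9"])

# The array is matched to the Series by position, and the resulting
# DataFrame inherits the index of the Series.
df_toy = pd.DataFrame({"Digit": y_test_toy, "Pred": pred_toy})
print(df_toy)
print(df_toy.shape)
```

Because the DataFrame keeps the original index, a row label in df_toy still identifies the corresponding row of the test inputs.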

Question 8

Begin making a confusion matrix for this data using the following code.

import altair as alt
alt.data_transformers.enable('default', max_rows=15000)

c = alt.Chart(df).mark_rect().encode(
    x="Digit",
    y="Pred",
)

c_text = alt.Chart(df).mark_text(color="white").encode(
    x="Digit",
    y="Pred",
    text="Pred"
)

(c+c_text).properties(
    height=400,
    width=400
)

Specify that the color on the rectangle chart should correspond to "count()", using the color scheme “turbo”.
Change the text on the text chart from “Pred” to "count()".
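
If you want to double-check the numbers in the finished chart, the same counts can be computed as a table. This sketch uses pandas crosstab instead of Altair, on six made-up rows; df_toy and counts are hypothetical names.

```python
import pandas as pd

# Hypothetical (true digit, predicted digit) pairs.
df_toy = pd.DataFrame({
    "Digit": ["9", "9", "9", "4", "4", "7"],
    "Pred":  ["9", "4", "9", "4", "9", "7"],
})

# Rows are predictions and columns are true digits, matching the chart's
# y and x channels. Each cell counts how often that pair occurs.
counts = pd.crosstab(df_toy["Pred"], df_toy["Digit"])
print(counts)

# Number of true 9s that were predicted as 4.
print(counts.loc["4", "9"])
```

Each cell of this table corresponds to one rectangle in the confusion matrix, and a zero corresponds to a pair that never occurs.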

Question 9

Use the above confusion matrix to answer the following questions.

What is an example of a (true digit, predicted digit) pair that never occurs in the test data?
What are the two most common mistakes made by the classifier when attempting to classify the digit 9?
Do those mistakes seem reasonable?
Try evaluating the following code. Do you see why the pandas Series it displays relates to the confusion matrix?

# The labels are stored as strings, so compare with "9" rather than 9.
for a, b in df.groupby("Digit"):
    if a == "9":
        break

display(b["Pred"].value_counts())
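
The loop above stops at the group whose true digit is 9. A sketch of two equivalent ways to reach the same sub-DataFrame, on made-up data (df_toy, nines, vc and vc_filter are hypothetical names):

```python
import pandas as pd

# Hypothetical (true digit, predicted digit) pairs, stored as strings.
df_toy = pd.DataFrame({
    "Digit": ["9", "9", "9", "9", "4", "4"],
    "Pred":  ["9", "4", "9", "7", "4", "9"],
})

# get_group returns the rows belonging to a single group key.
nines = df_toy.groupby("Digit").get_group("9")
vc = nines["Pred"].value_counts()
print(vc)

# Boolean indexing selects the same rows without groupby.
vc_filter = df_toy.loc[df_toy["Digit"] == "9", "Pred"].value_counts()
```

The counts describe how the true 9s were classified, so they are the entries of the confusion matrix in the position where Digit is 9.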

Question 10

Find an example of an incorrectly classified digit. Display that digit using imshow, as we did at the top of this homework. (Hint. The labels are stored as strings, so code like df["Digit"]==9 will not match anything; compare with "9" instead.)
Looking at the digit, does the mistake seem reasonable?
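
One possible approach, sketched on toy data: filter for rows where the two columns differ, then use the row label to look up the pixels. All of the names here (X_test_toy, df_toy, wrong, label) and the 16-pixel "images" are assumptions for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical test inputs: 4 "images" of 16 pixels, with a shuffled index.
X_test_toy = pd.DataFrame(np.arange(64).reshape(4, 16), index=[17, 2, 40, 8])

# Hypothetical true and predicted labels sharing that index.
df_toy = pd.DataFrame(
    {"Digit": ["3", "9", "4", "9"], "Pred": ["3", "4", "4", "9"]},
    index=X_test_toy.index,
)

# Comparing string labels with the integer 9 matches nothing.
print((df_toy["Digit"] == 9).sum(), (df_toy["Digit"] == "9").sum())

# Rows where the prediction differs from the true digit.
wrong = df_toy[df_toy["Digit"] != df_toy["Pred"]]
label = wrong.index[0]  # an index label, not an integer position

# Use loc (not iloc) because label comes from the index.
img = X_test_toy.loc[label].to_numpy().reshape(4, 4)
print(label, img.shape)
```

The resulting array can then be passed to imshow exactly as in Question 2.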

Submission

Using the Share & publish link at the top right, enable public sharing in the “Share project” section, and enable Comment privileges. Then submit that link on Canvas.

UC Irvine Math 10 S22
