Homework 6

Author: BLANK

Collaborators: BLANK

Logistic regression on the MNIST dataset

In this homework, we will use logistic regression to classify handwritten digits.

We first load the MNIST data by evaluating the cell below. It will probably take about one minute to execute. (Warning: I tried loading this twice and ran out of memory, so try to evaluate this cell only once per session.)

from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)

Question 1

  • What is the type of mnist? (Hint. It’s not a type we have worked with before.)

  • You can find the different attributes of mnist by evaluating dir(mnist). For example, mnist.data should be thought of as the input data; it is a pandas DataFrame with 70000 rows and 784 columns.

  • Which attribute do you think holds the output data? Check your answer. It should be a pandas Series of size 70000.

  • Use X = mnist.data to save the input data with the variable name X, and similarly save the output data with the variable name y. (The convention is to use a capital letter for the input and a lower-case letter for the output, I think as a reminder that the input is a two-dimensional pandas DataFrame whereas the output is a one-dimensional pandas Series.)
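As a reference for the steps above, here is a minimal sketch using a tiny stand-in object. The real `fetch_openml` call returns a `sklearn.utils.Bunch`, so the stand-in below is built as one directly; the made-up data here is only two rows, whereas the real `mnist.data` has 70000 rows and 784 columns.

```python
import pandas as pd
from sklearn.utils import Bunch

# Tiny stand-in mimicking the structure of fetch_openml's return value;
# the real mnist object is also a sklearn.utils.Bunch.
mnist = Bunch(
    data=pd.DataFrame([[0, 1], [2, 3]]),   # real shape: (70000, 784)
    target=pd.Series(["5", "0"]),          # real size: 70000
)

print(type(mnist))   # the type returned by fetch_openml
print(dir(mnist))    # the attributes, including 'data' and 'target'

# The naming convention: capital X for the 2-D input, lower-case y for
# the 1-D output.
X = mnist.data
y = mnist.target
```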

Question 2

  • Define A to be the second row of X. (Start counting at 0. Use iloc, not loc, even though they will both give the same answer in this case.)

  • Convert A to a NumPy array using the to_numpy() method.

  • Check your answer: A should now be a NumPy array of shape (784,) (that is how Python displays a tuple of length 1).

  • Notice that \(784 = 28^2\). Reshape A to have shape (28,28) by using the reshape method. (You can call help(A.reshape) to find out what the input should look like.)

  • Import matplotlib.pyplot as plt and then use the following code to visualize A.

fig, ax = plt.subplots()
ax.imshow(A)

  • Use the keyword argument cmap with the value of "binary" when calling imshow so that the image is displayed in black-and-white. You can also use "binary_r" if you want the colors reversed.

  • What is the corresponding value in y? Does it match what you expected?
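Putting the steps above together, here is a sketch using a synthetic 784-entry array in place of a real row of X (with the real data, you would start from `X.iloc[2].to_numpy()` instead of the random array):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for one row of X; with the real data:
# A = X.iloc[2].to_numpy()
A = np.random.default_rng(0).random(784)
print(A.shape)                    # (784,) -- a tuple of length 1

A = A.reshape((28, 28))           # works because 784 == 28**2

fig, ax = plt.subplots()
ax.imshow(A, cmap="binary")       # "binary_r" reverses black and white
```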

Question 3

  • Divide X into X_train and X_test and y into y_train and y_test using train_test_split. Specify that the test_size should be 20% of the total data.
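A minimal sketch of the split, using small synthetic stand-ins for X and y (with the real data, pass X and y directly):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic stand-ins: 20 rows, 5 columns.
X = pd.DataFrame(np.arange(100).reshape(20, 5))
y = pd.Series(range(20))

# test_size=0.2 reserves 20% of the rows for the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, X_test.shape)  # (16, 5) (4, 5)
```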

Question 4

This is a fairly big dataset, so we will change some default values when instantiating the LogisticRegression object. You will need to look up the keywords to use in the scikit-learn documentation for LogisticRegression.

  • Name the LogisticRegression object clf (for “classifier”).

  • Use a tolerance of 0.1. (This is larger than the default value, so it lets the fitting process stop earlier.)

  • For the solver, use one of the two solvers that the documentation mentions as a good choice for large datasets.

  • If necessary, you can increase the maximum number of iterations, but I didn’t need to.

  • Fit a logistic regression model using the training data.

  • How many total parameters are held in clf.coef_? First convert clf.coef_ to a one-dimensional NumPy array using reshape(-1), then check the size.

  • What proportion of those parameters are at least 0.0001? At least 0.01? First make a Boolean array, and then call the sum method. (I think it is almost always better to use arr.sum() and not to use sum(arr).) When I tried this, my answers were approximately 0.3 and 0, respectively.
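The pattern above can be sketched on scikit-learn's small built-in digits dataset (8x8 images, so only 64 features), which runs quickly; the same code applies to the full MNIST data. Here "saga" is used as one of the solvers the documentation recommends for large datasets, and the loose tolerance of 0.1 lets fitting stop early:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# Small stand-in dataset so the sketch runs fast.
digits = load_digits()

clf = LogisticRegression(tol=0.1, solver="saga")
clf.fit(digits.data, digits.target)

coefs = clf.coef_.reshape(-1)     # flatten to a one-dimensional array
print(coefs.size)                 # 10 classes * 64 pixels = 640

# Proportion of parameters that are at least 0.0001: a Boolean array,
# summed with the .sum() method, divided by the total count.
print((coefs >= 0.0001).sum() / coefs.size)
```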

Question 5

  • Evaluate clf.score (i.e., the prediction accuracy) on the training set.

  • Compute this same number a second way, by calling mean() on a suitable Boolean Series.
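A sketch of the two computations on the small built-in digits dataset standing in for MNIST; the key point is that clf.score is the mean of a Boolean array of correct predictions:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

digits = load_digits()
clf = LogisticRegression(tol=0.1, solver="saga").fit(digits.data, digits.target)

acc1 = clf.score(digits.data, digits.target)
# Same number a second way: compare predictions to the true labels,
# then average the resulting Boolean array.
acc2 = (clf.predict(digits.data) == digits.target).mean()
print(acc1, acc2)  # the two values agree
```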

Question 6

  • Evaluate clf.score on the test set.

  • Does the result suggest that overfitting is a concern in this case? (This is a difficult question to make precise. In general, if the two accuracies are similar, maybe off by 5%, then I would say that overfitting is probably not a serious concern. And if the test accuracy is higher than the training accuracy, then overfitting is definitely not a concern.)

Question 7

  • Make a pandas DataFrame df containing two columns. The first column should be called “Digit” and should contain the values from y_test. The second column should be called “Pred” and contain the predicted values corresponding to X_test. (Reality check: df should have 14000 rows and 2 columns.)
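A sketch with tiny hypothetical labels and predictions standing in for y_test and clf.predict(X_test); note that converting y_test with to_numpy() sidesteps index alignment with the shuffled index that train_test_split produces:

```python
import pandas as pd

# Hypothetical stand-ins: y_test keeps a shuffled index after the split.
y_test = pd.Series(["5", "0", "4"], index=[10, 3, 7])
preds = ["5", "0", "9"]          # stand-in for clf.predict(X_test)

df = pd.DataFrame({"Digit": y_test.to_numpy(), "Pred": preds})
print(df.shape)  # (3, 2) here; with the real data, (14000, 2)
```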

Question 8

  • Begin making a confusion matrix for this data using the following code.

import altair as alt
alt.data_transformers.enable('default', max_rows=15000)

c = alt.Chart(df).mark_rect().encode(
    x="Digit",
    y="Pred",
)

c_text = alt.Chart(df).mark_text(color="white").encode(
    x="Digit",
    y="Pred",
    text="Pred"
)

(c+c_text).properties(
    height=400,
    width=400
)

  • Specify that the color on the rectangle chart should correspond to "count()", using the color scheme “turbo”.

  • Change the text on the text chart from “Pred” to "count()".

Question 9

Use the above confusion matrix to answer the following questions.

  • What is an example of a (true digit, predicted digit) pair that never occurs in the test data?

  • What are the two most common mistakes made by the classifier when attempting to classify the digit 9?

  • Do those mistakes seem reasonable?

  • Try evaluating the following code. Do you see why the pandas Series it displays relates to the confusion matrix?

for a, b in df.groupby("Digit"):
    if a == "9":
        break

display(b["Pred"].value_counts())
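A tiny hypothetical df illustrates the relationship: restricting to one true digit and counting the predicted values gives exactly the counts displayed in that digit's column of the confusion matrix.

```python
import pandas as pd

# Tiny hypothetical version of df with the same column names.
df = pd.DataFrame({
    "Digit": ["9", "9", "9", "4"],
    "Pred":  ["9", "4", "9", "4"],
})

# The rows whose true digit is "9", and how often each prediction occurred;
# these counts are the Digit == "9" column of the confusion matrix.
counts = df[df["Digit"] == "9"]["Pred"].value_counts()
print(counts)
```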

Question 10

  • Find an example of an incorrectly classified digit. Display that digit using imshow, as we did at the top of this homework. (Hint. If you’re using code like df["Digit"]==9, it won’t work; use "9" instead.)

  • Looking at the digit, does the mistake seem reasonable?
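One possible way to locate a mistake, sketched on a tiny hypothetical df; the comparison uses string labels per the hint above, and the final (commented) lines assume df kept the index of y_test so that it can be used to look up the row of X_test:

```python
import pandas as pd

# Hypothetical df of true and predicted string labels.
df = pd.DataFrame({"Digit": ["9", "4", "9"], "Pred": ["9", "9", "4"]})

# Boolean mask of the mistakes; the labels are strings, so comparisons
# must use "9"-style strings, never the integer 9.
wrong = df[df["Digit"] != df["Pred"]]
i = wrong.index[0]   # index of the first misclassified digit
print(i)             # 1 for this toy data

# With the real data (assuming df kept y_test's index):
# A = X_test.loc[i].to_numpy().reshape((28, 28))
# fig, ax = plt.subplots(); ax.imshow(A, cmap="binary")
```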

Submission

Using the Share & publish link at the top right, enable public sharing in the “Share project” section, and enable Comment privileges. Then submit that link on Canvas.