Homework 6¶
Author: BLANK
Collaborators: BLANK
Logistic regression on the MNIST dataset¶
In this homework, we will use logistic regression to classify handwritten digits.
We first load the MNIST data and many useful scikit-learn functions by evaluating the cell below. It will probably take about one minute to execute. (Warning. I tried loading this twice, and I ran out of memory. So try to only evaluate this cell once per session.)
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
Question 1¶
What is the type of mnist? (Hint. It’s not a type we have worked with before.)
You can find the different attributes of mnist by evaluating dir(mnist). For example, mnist.data should be thought of as the input data; it is a pandas DataFrame with 70000 rows and 784 columns.
Which attribute do you think holds the output data? Check your answer. It should be a pandas Series of size 70000.
Use X = mnist.data to save the input data with the variable name X, and similarly save the output data with the variable name y. (The convention is to use a capital letter for the input and a lower-case letter for the output, I think as a reminder that the input is a two-dimensional pandas DataFrame whereas the output is a one-dimensional pandas Series.)
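As a rough illustration of exploring an object returned by a scikit-learn loader, here is a minimal sketch that uses the small built-in digits dataset (load_digits, 8x8 images) as a stand-in, so nothing needs to be downloaded. The names digits, public_attrs, and X_small are chosen for this example only and are not part of the homework.

```python
from sklearn.datasets import load_digits

# Small built-in stand-in for MNIST (assumption: not the homework's dataset).
digits = load_digits(as_frame=True)

# List the public attributes of the returned object.
public_attrs = [name for name in dir(digits) if not name.startswith("_")]
print(public_attrs)

# The input data is a DataFrame: one row per image, one column per pixel.
X_small = digits.data
print(type(X_small).__name__, X_small.shape)
```

The same dir-then-inspect pattern should carry over to mnist, although the attribute list and shapes will differ.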
Question 2¶
Define A to be the second row of X. (Start counting at 0. Use iloc, not loc, even though they will both give the same answer in this case.)
Convert A to a NumPy array using the to_numpy() method.
Check your answer: A should now be a NumPy array of shape (784,) (that is how Python displays a tuple of length 1).
Notice that \(784 = 28^2\). Reshape A to have shape (28,28) by using the reshape method. (You can call help(A.reshape) to find out what the input should look like.)
Import matplotlib.pyplot as plt and then use the following code to visualize A.
fig, ax = plt.subplots()
ax.imshow(A)
Use the keyword argument cmap with the value of "binary" when calling imshow so that the image is displayed in black-and-white. You can also use "binary_r" if you want the colors reversed.
What is the corresponding value in y? Does it match what you expected?
Question 3¶
Divide X into X_train and X_test and y into y_train and y_test using train_test_split. Specify that the test_size should be 20% of the total data.
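For reference, here is a minimal sketch of how train_test_split behaves on a tiny made-up dataset. The data, the variable names, the test_size of 0.3, and the random_state are all arbitrary choices for this illustration, not values from the homework.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 input columns, 1 output value per row.
X_toy = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) ** 2})
y_toy = pd.Series(np.arange(10) % 2)

# test_size=0.3 is an arbitrary choice for this toy example.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

print(X_tr.shape, X_te.shape)
# The row labels stay aligned between inputs and outputs.
print((X_te.index == y_te.index).all())
```

The function returns the pieces in the order train inputs, test inputs, train outputs, test outputs, which is easy to mix up when unpacking.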
Question 4¶
This is a fairly big dataset, so we will change some default values when instantiating the LogisticRegression object. You will need to look up the keywords to use in the scikit-learn documentation for LogisticRegression.
Name the LogisticRegression object clf (for “classifier”).
Use a tolerance of 0.1. (This is larger than the default value, so it lets the fitting process stop earlier.)
For the solver, use one of the two solvers that the documentation mentions as a good choice for large datasets.
If necessary, you can increase the maximum number of iterations, but I didn’t need to.
Fit a logistic regression model using the training data.
How many total parameters are held in clf.coef_? First convert clf.coef_ to a one-dimensional NumPy array using reshape(-1), then check the size.
What proportion of those parameters are at least 0.0001? At least 0.01? First make a Boolean array, and then call the sum method. (I think it is almost always better to use arr.sum() and not to use sum(arr).) When I tried this, my answers were approximately 0.3 and 0, respectively.
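The flatten-then-count pattern described above can be sketched on a small made-up array. The array coefs and the threshold used here are hypothetical and only stand in for a fitted model's coefficients.

```python
import numpy as np

# Hypothetical 2x3 array standing in for a coefficient matrix.
coefs = np.array([[0.5, -0.2, 0.0], [0.003, 0.0, 1.5]])

flat = coefs.reshape(-1)      # one-dimensional array of all 6 entries
print(flat.size)

big = flat >= 0.01            # Boolean array: True where the entry is at least 0.01
print(big.sum())              # each True counts as 1
print(big.sum() / big.size)   # proportion of True values
```

One reason to prefer the method: on a two-dimensional array, arr.sum() adds up every entry, whereas the built-in sum(arr) adds the rows together and returns an array.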
Question 5¶
Evaluate clf.score (i.e., the prediction accuracy) on the training set.
Compute this same number a second way, by calling mean() on a suitable Boolean Series.
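To see why the two computations agree, here is a minimal sketch on the small built-in iris dataset, used purely as a stand-in. The names model, correct, acc_score, and acc_mean, and the choice of max_iter=1000, are assumptions made for this example and are not the settings the homework asks for.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Built-in iris data as a small stand-in (assumption: not the homework's dataset).
iris = load_iris(as_frame=True)
X_iris, y_iris = iris.data, iris.target

model = LogisticRegression(max_iter=1000)
model.fit(X_iris, y_iris)

acc_score = model.score(X_iris, y_iris)

# Comparing the true labels with the predictions gives a Boolean Series;
# its mean is the proportion of True values.
correct = y_iris == model.predict(X_iris)
acc_mean = correct.mean()

print(acc_score, acc_mean)
```

Because True counts as 1 and False as 0, the mean of the Boolean Series is the fraction of correct predictions, which is the accuracy that score reports for a classifier.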
Question 6¶
Evaluate clf.score on the test set.
Does the result suggest that overfitting is a concern in this case? (This is a difficult question to make precise. In general, if the two accuracies are similar, say within about 5% of each other, then I would say that overfitting is probably not a serious concern. And if the test accuracy is higher than the training accuracy, then overfitting is definitely not a concern.)
Question 7¶
Make a pandas DataFrame df containing two columns. The first column should be called “Digit” and should contain the values from y_test. The second column should be called “Pred” and contain the predicted values corresponding to X_test. (Reality check: df should have 14000 rows and 2 columns.)
Question 8¶
Begin making a confusion matrix for this data using the following code.
import altair as alt
alt.data_transformers.enable('default', max_rows=15000)
c = alt.Chart(df).mark_rect().encode(
    x="Digit",
    y="Pred",
)
c_text = alt.Chart(df).mark_text(color="white").encode(
    x="Digit",
    y="Pred",
    text="Pred"
)
(c+c_text).properties(
    height=400,
    width=400
)
Specify that the color on the rectangle chart should correspond to "count()", using the color scheme “turbo”.
Change the text on the text chart from “Pred” to "count()".
Question 9¶
Use the above confusion matrix to answer the following questions.
What is an example of a (true digit, predicted digit) pair that never occurs in the test data?
What are the two most common mistakes made by the classifier when attempting to classify a 9 digit?
Do those mistakes seem reasonable?
Try evaluating the following code. Do you see why the pandas Series it displays relates to the confusion matrix?
for a, b in df.groupby("Digit"):
    # The labels are strings, so compare against "9" rather than 9.
    if a == "9":
        break
display(b["Pred"].value_counts())
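The connection between grouped value counts and a confusion matrix can be sketched on a tiny made-up table. The DataFrame toy and its contents are hypothetical, and pd.crosstab is used here as a plain-pandas substitute for the Altair chart, not as the homework's method.

```python
import pandas as pd

# Hypothetical toy results with string labels, as in the homework's df.
toy = pd.DataFrame({
    "Digit": ["4", "9", "9", "9", "4", "9"],
    "Pred":  ["4", "9", "4", "9", "9", "7"],
})

# Counts of every (true digit, predicted digit) pair:
# rows are predictions, columns are true digits.
table = pd.crosstab(toy["Pred"], toy["Digit"])
print(table)

# Predictions for the rows whose true digit is "9".
nine_counts = toy.loc[toy["Digit"] == "9", "Pred"].value_counts()
print(nine_counts)
```

In this toy example, the counts in nine_counts are the nonzero entries of the "9" column of table, which is the same information one column of the confusion matrix displays.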
Question 10¶
Find an example of an incorrectly classified digit. Display that digit using imshow, as we did at the top of this homework. (Hint. If you’re using code like df["Digit"]==9, it won’t work; use "9" instead.)
Looking at the digit, does the mistake seem reasonable?
Submission¶
Using the Share & publish link at the top right, enable public sharing in the “Share project” section, and enable Comment privileges. Then submit that link on Canvas.