Homework 6¶
Author: BLANK
Collaborators: BLANK
Logistic regression on the MNIST dataset¶
In this homework, we will use logistic regression to classify handwritten digits.
We first load the MNIST data and many useful scikit-learn functions by evaluating the cell below. It will probably take about one minute to execute. (Warning: when I loaded the data twice in one session, I ran out of memory, so try to evaluate this cell only once per session.)
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
Question 1¶
What is the type of mnist? (Hint. It’s not a type we have worked with before.)
You can find the different attributes of mnist by evaluating dir(mnist). For example, mnist.data should be thought of as the input data; it is a pandas DataFrame with 70000 rows and 784 columns.
Which attribute do you think holds the output data? Check your answer. It should be a pandas Series of size 70000.
Use X = mnist.data to save the input data with the variable name X, and similarly save the output data with the variable name y. (The convention is to use a capital letter for the input and a lower-case letter for the output, I think as a reminder that the input is a two-dimensional pandas DataFrame whereas the output is a one-dimensional pandas Series.)
Question 2¶
Define A to be the second row of X. (Start counting at 0. Use iloc, not loc, even though they will both give the same answer in this case.)
Convert A to a NumPy array using the to_numpy() method.
Check your answer: A should now be a NumPy array of shape (784,) (that is how Python displays a tuple of length 1).
Notice that \(784 = 28^2\). Reshape A to have shape (28,28) by using the reshape method. (You can call help(A.reshape) to find out what the input should look like.)
Import matplotlib.pyplot as plt and then use the following code to visualize A.
fig, ax = plt.subplots()
ax.imshow(A)
Use the keyword argument cmap with the value of "binary" when calling imshow so that the image is displayed in black-and-white. You can also use "binary_r" if you want the colors reversed.
What is the corresponding value in y? Does it match what you expected?
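The row-to-image steps in this question can be tried on a small made-up DataFrame first. The sketch below is only an illustration: the 3-row, 16-column frame toy and the 4-by-4 image size are assumptions standing in for the MNIST data, and the names row, arr and img are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X: 3 "images", each flattened to 16 pixel columns.
toy = pd.DataFrame(np.arange(48).reshape(3, 16))

row = toy.iloc[1]        # select a row by integer position
arr = row.to_numpy()     # pandas Series -> one-dimensional NumPy array
img = arr.reshape(4, 4)  # 16 = 4**2, so the row folds into a 4x4 grid

print(arr.shape)  # (16,)
print(img.shape)  # (4, 4)
```

A two-dimensional array like img can be passed to ax.imshow in the same way as the reshaped A.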
Question 3¶
Divide X into X_train and X_test and y into y_train and y_test using train_test_split. Specify that the test_size should be 20% of the total data.
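As a rough illustration of how train_test_split behaves, here is a sketch on made-up data. The names features and labels, the 100-row size, and the use of random_state are assumptions for this example only; the assignment does not ask for random_state.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with 3 features and a string label each.
features = pd.DataFrame(np.arange(300).reshape(100, 3), columns=["a", "b", "c"])
labels = pd.Series(["0", "1"] * 50)

# test_size=0.2 holds out 20% of the rows. random_state is an optional
# extra, used here only so that the split is reproducible.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

print(f_train.shape, f_test.shape)  # (80, 3) (20, 3)
```

The inputs and outputs are shuffled together, so each row of f_test keeps the same index label as its entry in l_test.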
Question 4¶
This is a fairly big dataset, so we will change some default values when instantiating the LogisticRegression object. You will need to look up the keywords to use in the scikit-learn documentation for LogisticRegression.
Name the LogisticRegression object clf (for “classifier”).
Use a tolerance of 0.1. (This is larger than the default value, so it lets the fitting process stop earlier.)
For the solver, use one of the two solvers that the documentation mentions as a good choice for large datasets.
If necessary, you can increase the maximum number of iterations, but I didn’t need to.
Fit a logistic regression model using the training data.
How many total parameters are held in clf.coef_? First convert clf.coef_ to a one-dimensional NumPy array using reshape(-1), then check the size.
What proportion of those parameters are at least 0.0001? At least 0.01? First make a Boolean array, and then call the sum method. (I think it is almost always better to use arr.sum() and not to use sum(arr).) When I tried this, my answers were approximately 0.3 and 0, respectively.
Question 5¶
Evaluate clf.score (i.e., the prediction accuracy) on the training set.
Compute this same number a second way, by calling mean() on a suitable Boolean Series.
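The agreement between score and the mean of a Boolean Series can be checked on synthetic data. This is a minimal sketch under stated assumptions: the data comes from make_classification, the model uses default settings apart from max_iter, and the names data, target and model are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical data, generated only for this illustration.
data, target = make_classification(n_samples=200, n_features=5, random_state=0)
target = pd.Series(target)

model = LogisticRegression(max_iter=1000).fit(data, target)

acc_score = model.score(data, target)              # built-in accuracy
acc_mean = (model.predict(data) == target).mean()  # fraction of True values

print(acc_score, acc_mean)
```

Comparing the predictions with the true labels gives a Boolean Series, and its mean is the fraction of correct predictions, which is what score reports for a classifier.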
Question 6¶
Evaluate clf.score on the test set.
Does the result suggest that overfitting is a concern in this case? (This is a difficult question to make precise. In general, if the two accuracies are similar, say within about 5% of each other, then I would say that overfitting is probably not a serious concern. And if the test accuracy is higher than the training accuracy, then overfitting is definitely not a concern.)
Question 7¶
Make a pandas DataFrame df containing two columns. The first column should be called “Digit” and should contain the values from y_test. The second column should be called “Pred” and contain the predicted values corresponding to X_test. (Reality check: df should have 14000 rows and 2 columns.)
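One way to build a two-column DataFrame from a Series and an array is sketched below on made-up values. The names true_labels, predicted and toy_df are hypothetical, and the four labels are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical true labels. The index is deliberately not 0, 1, 2, ...
# because rows selected by train_test_split keep their original index.
true_labels = pd.Series(["3", "9", "4", "9"], index=[17, 5, 42, 8])

# Hypothetical predictions as a plain NumPy array, like the output of predict.
predicted = np.array(["3", "4", "4", "9"])

# The array is matched to the Series by position and takes over its index.
toy_df = pd.DataFrame({"Digit": true_labels, "Pred": predicted})

print(toy_df.shape)  # (4, 2)
```

Passing a dictionary keeps the column names explicit, and the resulting index still identifies which original row each prediction belongs to.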
Question 8¶
Begin making a confusion matrix for this data using the following code.
import altair as alt
alt.data_transformers.enable('default', max_rows=15000)

c = alt.Chart(df).mark_rect().encode(
    x="Digit",
    y="Pred",
)

c_text = alt.Chart(df).mark_text(color="white").encode(
    x="Digit",
    y="Pred",
    text="Pred"
)

(c+c_text).properties(
    height=400,
    width=400
)
Specify that the color on the rectangle chart should correspond to "count()", using the color scheme “turbo”.
Change the text on the text chart from “Pred” to "count()".
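The counts that such a chart displays can also be tabulated directly in pandas. The sketch below uses pd.crosstab, which is a different technique from the Altair chart in this question, on a made-up six-row frame; toy_df and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical true digits and predictions.
toy_df = pd.DataFrame({
    "Digit": ["9", "9", "9", "4", "4", "7"],
    "Pred":  ["9", "4", "9", "4", "9", "7"],
})

# Rows are predicted values and columns are true digits, matching a chart
# with "Pred" on the y-axis and "Digit" on the x-axis.
counts = pd.crosstab(toy_df["Pred"], toy_df["Digit"])
print(counts)
```

Each cell counts how many rows have that (true digit, predicted digit) pair, so the diagonal holds the correct predictions and a pair that never occurs shows up as 0.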
Question 9¶
Use the above confusion matrix to answer the following questions.
What is an example of a (true digit, predicted digit) pair that never occurs in the test data?
What are the two most common mistakes made by the classifier when attempting to classify a 9 digit?
Do those mistakes seem reasonable?
Try evaluating the following code. Do you see why the pandas Series it displays relates to the confusion matrix?
for a, b in df.groupby("Digit"):
    # The digit labels are strings, so compare with "9" rather than 9
    if a == "9":
        break
display(b["Pred"].value_counts())
Question 10¶
Find an example of an incorrectly classified digit. Display that digit using imshow, as we did at the top of this homework. (Hint. If you’re using code like df["Digit"]==9, it won’t work; use "9" instead.)
Looking at the digit, does the mistake seem reasonable?
Submission¶
Using the Share & publish link at the top right, enable public sharing in the “Share project” section, and enable Comment privileges. Then submit that link on Canvas.