Week 7 Monday#
Announcements#
In-class quiz tomorrow, based on the worksheets due tonight.
Worksheet 13 is posted, on Logistic Regression (the topic of Prof. Zhou’s lecture on Friday and the lectures this week).
Not expected to know from Friday: gradient descent for finding the coefficients \(\beta\), and Matplotlib.
Next week’s quiz will be based on Worksheets 13 and 14 and this week’s lectures. (We will review the most relevant parts of Prof. Zhou’s lecture.)
Plan for today: Approximately 15 minutes of lecture, then 10-15 minutes to work (Maya and Hanson are here to answer questions), then more lecture.
import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns
Predicting if a penguin is in the Chinstrap species#
Today we will use Logistic Regression with the flipper length and bill length columns in the penguins dataset to predict if a penguin is in the Chinstrap species.
df = sns.load_dataset("penguins").dropna(axis=0)
Here are the columns we’re going to use. I didn’t want to keep typing these repeatedly, so I stored them in a length-two list.
cols = ["flipper_length_mm", "bill_length_mm"]
Here is what the true data looks like. I chose two input variables because we can visualize this easily. With three or more input variables, I wouldn’t be able to draw a chart like the following showing all the information.
When you look at the following, you should be thinking that we will be inputting the flipper length and the bill length, and outputting a prediction for whether a penguin belongs to the Chinstrap species.
alt.Chart(df).mark_circle().encode(
    x=alt.X(cols[0], scale=alt.Scale(zero=False)),
    y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
    color="species"
)
The procedure is basically the same as for Linear Regression.
First we import.
from sklearn.linear_model import LogisticRegression
Then we instantiate. The convention is to name this object clf, for “classifier”, since we will be performing classification, not regression.
# clf instead of reg because it's for classification
clf = LogisticRegression()
Here are the values we will be predicting. Logistic regression is easiest to understand (especially the formula is easiest to understand) when we are predicting a binary variable (meaning a variable with only two options). We will see later an example of predicting the “species” variable itself, which in this case has three options.
df["isChinstrap"] = (df["species"] == "Chinstrap")
Notice how there is a new Boolean column named “isChinstrap” which indicates whether or not each penguin is a Chinstrap penguin. I’m using sample here rather than head because the penguins at the beginning of the dataset are all the same species.
df.sample(10, random_state=1)
|  | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap |
|---|---|---|---|---|---|---|---|---|
| 65 | Adelie | Biscoe | 41.6 | 18.0 | 192.0 | 3950.0 | Male | False |
| 276 | Gentoo | Biscoe | 43.8 | 13.9 | 208.0 | 4300.0 | Female | False |
| 186 | Chinstrap | Dream | 49.7 | 18.6 | 195.0 | 3600.0 | Male | True |
| 198 | Chinstrap | Dream | 50.1 | 17.9 | 190.0 | 3400.0 | Female | True |
| 293 | Gentoo | Biscoe | 46.5 | 14.8 | 217.0 | 5200.0 | Female | False |
| 183 | Chinstrap | Dream | 54.2 | 20.8 | 201.0 | 4300.0 | Male | True |
| 98 | Adelie | Dream | 33.1 | 16.1 | 178.0 | 2900.0 | Female | False |
| 193 | Chinstrap | Dream | 46.2 | 17.5 | 187.0 | 3650.0 | Female | True |
| 95 | Adelie | Dream | 40.8 | 18.9 | 208.0 | 4300.0 | Male | False |
| 195 | Chinstrap | Dream | 45.5 | 17.0 | 196.0 | 3500.0 | Female | True |
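As an optional aside, if we want to know how many penguins fall into each of the two categories, we could call value_counts on this new Boolean column (an extra cell, not needed for what follows).

# count how many penguins are / are not Chinstrap (optional aside)
df["isChinstrap"].value_counts()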
We now do the fit step. Notice how we are using two input features, and we are using our new Boolean “isChinstrap” column for the target.
clf.fit(df[cols], df["isChinstrap"])
LogisticRegression()
Here is a sense of what the output looks like. It seems pretty accurate, because in the real dataset the Chinstrap penguins are also grouped together in one part of the dataset, and that is where the block of True predictions appears.
clf.predict(df[cols])
array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, True, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, True, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, True, False, False,
False, False, False, False, False, False, False, False, True,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, True, True, True, True, True, False, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, False, True, False, True,
True, True, True, True, True, True, False, True, True,
False, True, True, True, False, True, True, True, True,
True, True, False, True, True, True, True, True, True,
True, True, True, False, True, True, True, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, True, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, True, False, False, False, False,
False, False, False, True, False, False, False, False, False,
False, False, False, False, False, False, False, False, False])
We are putting exactly those values above into a new column named “pred”.
df["pred"] = clf.predict(df[cols])
Here we can compare the actual values of “isChinstrap” to the values of “pred”.
df.sample(10, random_state=1)
|  | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap | pred |
|---|---|---|---|---|---|---|---|---|---|
| 65 | Adelie | Biscoe | 41.6 | 18.0 | 192.0 | 3950.0 | Male | False | False |
| 276 | Gentoo | Biscoe | 43.8 | 13.9 | 208.0 | 4300.0 | Female | False | False |
| 186 | Chinstrap | Dream | 49.7 | 18.6 | 195.0 | 3600.0 | Male | True | True |
| 198 | Chinstrap | Dream | 50.1 | 17.9 | 190.0 | 3400.0 | Female | True | True |
| 293 | Gentoo | Biscoe | 46.5 | 14.8 | 217.0 | 5200.0 | Female | False | False |
| 183 | Chinstrap | Dream | 54.2 | 20.8 | 201.0 | 4300.0 | Male | True | True |
| 98 | Adelie | Dream | 33.1 | 16.1 | 178.0 | 2900.0 | Female | False | False |
| 193 | Chinstrap | Dream | 46.2 | 17.5 | 187.0 | 3650.0 | Female | True | True |
| 95 | Adelie | Dream | 40.8 | 18.9 | 208.0 | 4300.0 | Male | False | False |
| 195 | Chinstrap | Dream | 45.5 | 17.0 | 196.0 | 3500.0 | Female | True | False |
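As an optional aside, one way to summarize these comparisons for the whole DataFrame at once is pd.crosstab, which counts how often each combination of actual value and predicted value occurs; the off-diagonal entries are the mistakes.

# cross-tabulate actual vs predicted values (optional aside)
pd.crosstab(df["isChinstrap"], df["pred"])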
Let’s see overall how accurate our predictions are. We first make a Boolean Series indicating whether the prediction was correct.
df["pred"] == df["isChinstrap"]
0 True
1 True
2 True
4 True
5 True
...
338 True
340 True
341 True
342 True
343 True
Length: 333, dtype: bool
A simple way to find the proportion of correct predictions is to call the mean method of the above Boolean Series. (Remember that True is treated as 1 and False is treated as 0.)
(df["pred"] == df["isChinstrap"]).mean()
0.954954954954955
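As a tiny illustration of that convention (an optional extra cell), the mean of a small Boolean Series is just the fraction of True values.

# True counts as 1 and False counts as 0, so the mean is the fraction of True values
pd.Series([True, True, False, False]).mean()
0.5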
There is also a way to use clf directly to find the proportion of correct predictions, by calling its score method. We pass as inputs to score the input features along with the desired outputs. Notice that we get the same number as above. (Reminder: the reason both regression and classification are considered to be supervised machine learning is that we imagine some “supervisor” has given us the desired outputs for at least some of the data.)
clf.score(df[cols], df["isChinstrap"])
0.954954954954955
Now let’s see a systematic way to find all the rows of df where the prediction is incorrect. We start with a Boolean Series which has True for the rows where the prediction is incorrect. (Another way to create this same pandas Series would be to use ~(df["pred"] == df["isChinstrap"]), where ~ negates the Boolean Series.)
df["pred"] != df["isChinstrap"]
0 False
1 False
2 False
4 False
5 False
...
338 False
340 False
341 False
342 False
343 False
Length: 333, dtype: bool
Now we can use Boolean indexing like usual. The following is the sub-DataFrame containing all the rows where the prediction is wrong.
df[df["pred"] != df["isChinstrap"]]
|  | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap | pred |
|---|---|---|---|---|---|---|---|---|---|
| 19 | Adelie | Torgersen | 46.0 | 21.5 | 194.0 | 4200.0 | Male | False | True |
| 37 | Adelie | Dream | 42.2 | 18.5 | 180.0 | 3550.0 | Female | False | True |
| 111 | Adelie | Biscoe | 45.6 | 20.3 | 191.0 | 4600.0 | Male | False | True |
| 122 | Adelie | Torgersen | 40.2 | 17.0 | 176.0 | 3450.0 | Female | False | True |
| 157 | Chinstrap | Dream | 45.2 | 17.8 | 198.0 | 3950.0 | Female | True | False |
| 182 | Chinstrap | Dream | 40.9 | 16.6 | 187.0 | 3200.0 | Female | True | False |
| 184 | Chinstrap | Dream | 42.5 | 16.7 | 187.0 | 3350.0 | Female | True | False |
| 192 | Chinstrap | Dream | 49.0 | 19.5 | 210.0 | 3950.0 | Male | True | False |
| 195 | Chinstrap | Dream | 45.5 | 17.0 | 196.0 | 3500.0 | Female | True | False |
| 199 | Chinstrap | Dream | 49.0 | 19.6 | 212.0 | 4300.0 | Male | True | False |
| 206 | Chinstrap | Dream | 42.5 | 17.3 | 187.0 | 3350.0 | Female | True | False |
| 216 | Chinstrap | Dream | 43.5 | 18.1 | 202.0 | 3400.0 | Female | True | False |
| 253 | Gentoo | Biscoe | 59.6 | 17.0 | 230.0 | 6050.0 | Male | False | True |
| 318 | Gentoo | Biscoe | 48.4 | 14.4 | 203.0 | 4625.0 | Female | False | True |
| 327 | Gentoo | Biscoe | 53.4 | 15.8 | 219.0 | 5500.0 | Male | False | True |
There are 15 rows above. If we want to know what proportion of predictions are incorrect, then we can divide by the length of the DataFrame. The following should be \(1-x\) where \(x\) is the score value we computed above.
# proportion wrong
15/len(df)
0.04504504504504504
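As a quick check (an optional extra cell), we can get the same proportion directly from the score method used above; this should again be approximately 0.045.

# proportion wrong, computed from the accuracy score (optional check)
1 - clf.score(df[cols], df["isChinstrap"])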
If all we care about are the row labels where the prediction is wrong, we can use the same Boolean indexing on df.index.
df.index[df["pred"] != df["isChinstrap"]]
Int64Index([19, 37, 111, 122, 157, 182, 184, 192, 195, 199, 206, 216, 253, 318,
327],
dtype='int64')
Let’s look at just three inputs, two of which have the correct output, but the input from the row with label 19 does not.
df.loc[[18, 19, 20], cols]
|  | flipper_length_mm | bill_length_mm |
|---|---|---|
| 18 | 184.0 | 34.4 |
| 19 | 194.0 | 46.0 |
| 20 | 174.0 | 37.8 |
Here are the predicted outputs. Remember that the middle one is wrong (all three should be False, meaning none of these three penguins is a Chinstrap).
clf.predict(df.loc[[18, 19, 20], cols])
array([False, True, False])
A great feature of logistic regression is that it indicates not just a predicted output, but also a confidence. We can get these confidences by using the predict_proba method of the fitted LogisticRegression object.
Here are the corresponding probabilities for the three inputs above.
You should interpret the following as saying our logistic regression model is 99.977% confident in its False prediction for the row with label 18, but is only 67% confident in its prediction for the row with label 19. This is good, because it was correct for the row with label 18 but was incorrect for the row with label 19.
We will explain two cells below how we know which number is for False and which number is for True.
# predicted probabilities
clf.predict_proba(df.loc[[18, 19, 20], cols])
array([[9.99771697e-01, 2.28303114e-04],
[3.29559670e-01, 6.70440330e-01],
[7.71818762e-01, 2.28181238e-01]])
The column at index 0 corresponds to False predictions and the column at index 1 corresponds to True predictions. I don’t think there is any way to know that in advance, but we can check it using the classes_ attribute.
clf.classes_
array([False, True])
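Since predict chooses whichever class has the larger estimated probability, we could reproduce the predictions for our three rows from the probabilities ourselves (an optional sketch, reusing the predict_proba values from above); this returns the same array([False, True, False]) we saw from clf.predict.

# predict corresponds to taking the class with the larger estimated probability
probs = clf.predict_proba(df.loc[[18, 19, 20], cols])
clf.classes_[probs.argmax(axis=1)]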
The whole point of logistic regression is to find parameters for estimating probabilities. That is also the difficult part (it happens when we call fit).
In our specific case, we have two input variables, so we need to find two coefficients and one intercept (or bias), for three parameters total.
Here are the coefficients. They are in the same order as the variables in cols.
clf.coef_
array([[-0.34802208, 1.08405225]])
Here is the intercept.
clf.intercept_
array([18.36005773])
Here is a reminder of which input feature came first.
cols[0]
'flipper_length_mm'
Let \(x_1\) denote the flipper length and let \(x_2\) denote the bill length. We are plugging approximately \(18.36 - 0.35 \cdot x_1 + 1.08 \cdot x_2\) into the logistic function \(1/(1 + e^{-x})\). The result will be a probability estimate for the penguin being a Chinstrap penguin.
Let’s see this formula in action for the row with label 19, where we know our classifier makes a mistake.
df.loc[19, cols]
flipper_length_mm 194.0
bill_length_mm 46.0
Name: 19, dtype: object
When we plug in the values above, we get the following. Here we use np.exp(x) to compute \(e^x\). I was surprised that this is not 0.67.
# I expected 0.67, will fix this on Wednesday
1/(1+np.exp(-(18.36+-0.35*194+1.08*46)))
0.5349429451582182
Added after class: My mistake was that I rounded too much. (I found it very difficult to catch this mistake.) Let’s try with one more decimal place and see that we really do get 0.67, as when we called clf.predict_proba.
1/(1+np.exp(-(18.36+-0.348*194+1.084*46)))
0.670842935992734
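If we wanted to avoid rounding entirely, we could plug in the stored parameters directly (an optional sketch using the coef_ and intercept_ attributes from above); this should reproduce the predict_proba value of about 0.6704.

# same computation, using the fitted parameters at full precision
x1, x2 = df.loc[19, cols]
1/(1 + np.exp(-(clf.intercept_[0] + clf.coef_[0, 0]*x1 + clf.coef_[0, 1]*x2)))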