Week 7 Monday#

Announcements#

  • In-class quiz tomorrow, based on the worksheets due tonight.

  • Worksheet 13 is posted, on Logistic Regression (the topic of Prof. Zhou’s lecture on Friday and the lectures this week).

  • Not expected to know from Friday: gradient descent for finding the coefficients \(\beta\), and Matplotlib.

  • Next week’s quiz will be based on Worksheets 13 and 14 and this week’s lectures. (We will review the most relevant parts of Prof. Zhou’s lecture.)

  • Plan for today: Approximately 15 minutes of lecture, then 10-15 minutes to work (Maya and Hanson are here to answer questions), then more lecture.

import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns

Predicting if a penguin is in the Chinstrap species#

Today we will use Logistic Regression with the flipper length and bill length columns in the penguins dataset to predict if a penguin is in the Chinstrap species.

df = sns.load_dataset("penguins").dropna(axis=0)

Here are the columns we’re going to use. I didn’t want to keep typing these repeatedly, so I stored them in a length-two list.

cols = ["flipper_length_mm", "bill_length_mm"]

Here is what the true data looks like. I chose two input variables because we can visualize this easily. With three or more input variables, I wouldn’t be able to draw a chart like the following showing all the information.

When you look at the following, you should be thinking that we will be inputting the flipper length and the bill length, and outputting a prediction for whether a penguin belongs to the Chinstrap species.

alt.Chart(df).mark_circle().encode(
    x=alt.X(cols[0], scale=alt.Scale(zero=False)),
    y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
    color="species"
)

The procedure is basically the same as for Linear Regression.

First we import.

from sklearn.linear_model import LogisticRegression

Then we instantiate. The convention is to name this object clf, for “classifier”, since we will be performing classification, not regression.

# clf instead of reg because it's for classification
clf = LogisticRegression()

Here are the values we will be predicting. Logistic regression is easiest to understand (especially its formula) when we are predicting a binary variable (a variable with only two options). Later we will see an example of predicting the “species” variable itself, which in this case has three options.

df["isChinstrap"] = (df["species"] == "Chinstrap")

Notice the new Boolean column named “isChinstrap”, which indicates whether or not each penguin is a Chinstrap. I’m using sample here rather than head because the penguins at the beginning of the dataset are all the same species.

df.sample(10, random_state=1)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex isChinstrap
65 Adelie Biscoe 41.6 18.0 192.0 3950.0 Male False
276 Gentoo Biscoe 43.8 13.9 208.0 4300.0 Female False
186 Chinstrap Dream 49.7 18.6 195.0 3600.0 Male True
198 Chinstrap Dream 50.1 17.9 190.0 3400.0 Female True
293 Gentoo Biscoe 46.5 14.8 217.0 5200.0 Female False
183 Chinstrap Dream 54.2 20.8 201.0 4300.0 Male True
98 Adelie Dream 33.1 16.1 178.0 2900.0 Female False
193 Chinstrap Dream 46.2 17.5 187.0 3650.0 Female True
95 Adelie Dream 40.8 18.9 208.0 4300.0 Male False
195 Chinstrap Dream 45.5 17.0 196.0 3500.0 Female True
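As a quick illustration of the pattern (using a tiny made-up DataFrame, not the real penguins data), comparing a column to a value with == produces a Boolean column:

```python
import pandas as pd

# A tiny made-up DataFrame (not the real penguins data), just to
# illustrate the pattern: comparing a column to a value with ==
# produces a Boolean column.
mini = pd.DataFrame({"species": ["Adelie", "Chinstrap", "Gentoo", "Chinstrap"]})
mini["isChinstrap"] = (mini["species"] == "Chinstrap")
print(mini["isChinstrap"].tolist())  # [False, True, False, True]
```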

We now do the fit step. Notice how we are using two input features and we are using our new Boolean “isChinstrap” column for the target.

clf.fit(df[cols], df["isChinstrap"])
LogisticRegression()

Here is a sense of what the output looks like. The predictions seem reasonably accurate: just as the Chinstrap penguins are all grouped together in one part of the real dataset, the True values below are mostly grouped together in the same region.

clf.predict(df[cols])
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True, False,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
       False,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False,  True, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False])

We are putting exactly those values above into a new column named “pred”.

df["pred"] = clf.predict(df[cols])

Here we can compare the actual values of “isChinstrap” to the values of “pred”.

df.sample(10, random_state=1)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex isChinstrap pred
65 Adelie Biscoe 41.6 18.0 192.0 3950.0 Male False False
276 Gentoo Biscoe 43.8 13.9 208.0 4300.0 Female False False
186 Chinstrap Dream 49.7 18.6 195.0 3600.0 Male True True
198 Chinstrap Dream 50.1 17.9 190.0 3400.0 Female True True
293 Gentoo Biscoe 46.5 14.8 217.0 5200.0 Female False False
183 Chinstrap Dream 54.2 20.8 201.0 4300.0 Male True True
98 Adelie Dream 33.1 16.1 178.0 2900.0 Female False False
193 Chinstrap Dream 46.2 17.5 187.0 3650.0 Female True True
95 Adelie Dream 40.8 18.9 208.0 4300.0 Male False False
195 Chinstrap Dream 45.5 17.0 196.0 3500.0 Female True False

Let’s see overall how accurate our predictions are. We first make a Boolean Series indicating whether the prediction was correct.

df["pred"] == df["isChinstrap"]
0      True
1      True
2      True
4      True
5      True
       ... 
338    True
340    True
341    True
342    True
343    True
Length: 333, dtype: bool

A simple way to find the proportion of correct predictions is to call the mean method of the above Boolean Series. (Remember that True is treated as 1 and False is treated as 0.)

(df["pred"] == df["isChinstrap"]).mean()
0.954954954954955
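As a minimal illustration of why this works, here is the mean of a small hand-made Boolean Series:

```python
import pandas as pd

# True counts as 1 and False as 0, so the mean of a Boolean Series
# is the proportion of True values: 3 out of 4 here.
s = pd.Series([True, False, True, True])
print(s.mean())  # 0.75
```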

There is also a way to use clf directly to find the proportion of correct predictions, by calling its score method. We pass as inputs to score the input features along with the desired outputs. Notice that we get the same number as above. (Reminder: the reason both regression and classification are considered to be supervised machine learning is because we imagine some “supervisor” has given us the desired outputs for at least some of the data.)

clf.score(df[cols], df["isChinstrap"])
0.954954954954955

Now let’s see a systematic way to find all the rows of df where the prediction is incorrect. We start with a Boolean Series which has True for the rows where the prediction is incorrect. (Another way to create this same pandas Series would be to use ~(df["pred"] == df["isChinstrap"]), where ~ negates a Boolean Series; note the parentheses.)

df["pred"] != df["isChinstrap"]
0      False
1      False
2      False
4      False
5      False
       ...  
338    False
340    False
341    False
342    False
343    False
Length: 333, dtype: bool
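To see that the two approaches agree, here is a small hand-made example (not the penguins data) comparing != with the negation of ==:

```python
import pandas as pd

pred = pd.Series([True, False, True])
actual = pd.Series([True, True, False])

# != and the negation of == produce the same Boolean Series
wrong1 = pred != actual
wrong2 = ~(pred == actual)
print(wrong1.tolist())        # [False, True, True]
print(wrong1.equals(wrong2))  # True
```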

Now we can use Boolean indexing like usual. The following is the sub-DataFrame containing all the rows where the prediction is wrong.

df[df["pred"] != df["isChinstrap"]]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex isChinstrap pred
19 Adelie Torgersen 46.0 21.5 194.0 4200.0 Male False True
37 Adelie Dream 42.2 18.5 180.0 3550.0 Female False True
111 Adelie Biscoe 45.6 20.3 191.0 4600.0 Male False True
122 Adelie Torgersen 40.2 17.0 176.0 3450.0 Female False True
157 Chinstrap Dream 45.2 17.8 198.0 3950.0 Female True False
182 Chinstrap Dream 40.9 16.6 187.0 3200.0 Female True False
184 Chinstrap Dream 42.5 16.7 187.0 3350.0 Female True False
192 Chinstrap Dream 49.0 19.5 210.0 3950.0 Male True False
195 Chinstrap Dream 45.5 17.0 196.0 3500.0 Female True False
199 Chinstrap Dream 49.0 19.6 212.0 4300.0 Male True False
206 Chinstrap Dream 42.5 17.3 187.0 3350.0 Female True False
216 Chinstrap Dream 43.5 18.1 202.0 3400.0 Female True False
253 Gentoo Biscoe 59.6 17.0 230.0 6050.0 Male False True
318 Gentoo Biscoe 48.4 14.4 203.0 4625.0 Female False True
327 Gentoo Biscoe 53.4 15.8 219.0 5500.0 Male False True

There are 15 rows above. If we want to know what proportion of predictions are incorrect, then we can divide by the length of the DataFrame. The following should be \(1-x\) where \(x\) is the score value we computed above.

# proportion wrong
15/len(df)
0.04504504504504504
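To tie these numbers together, here is the arithmetic with the counts from above (333 rows after dropna, 15 incorrect):

```python
n_rows = 333   # length of df after dropping missing values
n_wrong = 15   # rows where "pred" and "isChinstrap" disagree

print(n_wrong / n_rows)       # proportion wrong
print(1 - n_wrong / n_rows)   # matches the score value above
```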

If all we care about are the row labels where the prediction is wrong, we can use Boolean indexing directly on df.index.

df.index[df["pred"] != df["isChinstrap"]]
Int64Index([19, 37, 111, 122, 157, 182, 184, 192, 195, 199, 206, 216, 253, 318,
            327],
           dtype='int64')
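Here is the same idea on a tiny made-up DataFrame (hypothetical values): Boolean indexing works on df.index just like on df itself:

```python
import pandas as pd

mini = pd.DataFrame(
    {"pred": [True, False, True], "actual": [True, True, True]},
    index=[10, 11, 12],
)
# Keep only the row labels where the prediction is wrong
print(mini.index[mini["pred"] != mini["actual"]].tolist())  # [11]
```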

Let’s look at just three inputs. For two of them the classifier produces the correct output, but for the row with label 19 it does not.

df.loc[[18, 19, 20], cols]
flipper_length_mm bill_length_mm
18 184.0 34.4
19 194.0 46.0
20 174.0 37.8

Here are the predicted outputs. Remember that the middle prediction is wrong: all three should be False, since none of these three penguins is a Chinstrap.

clf.predict(df.loc[[18, 19, 20], cols])
array([False,  True, False])

A great feature of logistic regression is that it indicates not just a predicted output, but also a confidence. We can get these confidences by using the predict_proba method of the fit LogisticRegression object.

Here are the corresponding probabilities for the three inputs above.

You should interpret the following as saying our logistic regression model is 99.977% confident in its False prediction for the row with label 18, but is only 67% confident in its prediction for the row with label 19. This is reassuring: the model was much less confident on the row where its prediction turned out to be wrong.

We will explain two cells below how we know which number is for False and which number is for True.

# predicted probabilities
clf.predict_proba(df.loc[[18, 19, 20], cols])
array([[9.99771697e-01, 2.28303114e-04],
       [3.29559670e-01, 6.70440330e-01],
       [7.71818762e-01, 2.28181238e-01]])
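As a sanity check (the array here is copied by hand from the output above), each row of these probabilities sums to 1, and the column with the larger probability in each row matches the predictions we saw earlier:

```python
import numpy as np

# Probabilities copied from the predict_proba output above
proba = np.array([[9.99771697e-01, 2.28303114e-04],
                  [3.29559670e-01, 6.70440330e-01],
                  [7.71818762e-01, 2.28181238e-01]])

# Each row is a probability distribution over the two classes
print(proba.sum(axis=1))     # each entry is (approximately) 1
# The larger column in each row: [0, 1, 0], matching [False, True, False]
print(proba.argmax(axis=1))
```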

The column at index 0 corresponds to False predictions and the column at index 1 corresponds to True predictions. The classes are stored in sorted order, and we can check that order using the classes_ attribute.

clf.classes_
array([False,  True])

The whole point of logistic regression is to find parameters for estimating probabilities. That is also the difficult part (it happens when we call fit).

In our specific case, we have two input variables, so we need to find two coefficients and one intercept (or bias), so three parameters total.

Here are the coefficients. They are in the same order as the variables in cols.

clf.coef_
array([[-0.34802208,  1.08405225]])

Here is the intercept.

clf.intercept_
array([18.36005773])

Here is a reminder of which input feature came first.

cols[0]
'flipper_length_mm'

Let \(x_1\) denote the flipper length and let \(x_2\) denote the bill length. We are plugging approximately \(18.36 - 0.35 \cdot x_1 + 1.08 \cdot x_2\) into the logistic function \(1/(1 + e^{-x})\). The result will be a probability estimate for the penguin being a Chinstrap penguin.
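Here is a minimal sketch of the logistic function itself (the helper name sigmoid is my own, not from the lecture). It maps any real number into the interval \((0, 1)\), which is why its output can be read as a probability:

```python
import numpy as np

def sigmoid(x):
    """The logistic function 1/(1 + e^(-x))."""
    return 1 / (1 + np.exp(-x))

print(sigmoid(0))    # 0.5: the decision boundary
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```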

Let’s see this formula in action for the row with label 19 where we know our classifier makes a mistake.

df.loc[19, cols]
flipper_length_mm    194.0
bill_length_mm        46.0
Name: 19, dtype: object

When we plug in the values above, we get the following. Here we use np.exp(x) to compute \(e^x\).

I was surprised that this is not 0.67.

# I expected 0.67, will fix this on Wednesday
1/(1+np.exp(-(18.36 - 0.35*194 + 1.08*46)))
0.5349429451582182

Added after class: My mistake was that I rounded too much. (I found it very difficult to catch this mistake.) Let’s try with one more decimal place and see that we really do get 0.67, as when we called clf.predict_proba.

1/(1+np.exp(-(18.36 - 0.348*194 + 1.084*46)))
0.670842935992734
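The same computation can be written a bit more compactly with a dot product (a sketch using the rounded coefficient and intercept values from above; sigmoid is a hypothetical helper name, not part of the lecture code):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

coef = np.array([-0.348, 1.084])   # flipper_length_mm, bill_length_mm
intercept = 18.36
x = np.array([194.0, 46.0])        # the row with label 19

# Roughly 0.67, matching clf.predict_proba for this row
print(sigmoid(intercept + coef @ x))
```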